Enriching Queries with Synonyms:
Query enriching takes the original query, generates similar queries from multiple perspectives, and applies them as an OR search in ES.
Suppose we have the query "Who audits Elastic".
Running it through a prompt like the one below would produce a result like this:
ELASTIC_SEARCH_QUERY_GENERATOR_PROMPT = '''
You are an AI assistant specialized in generating Elasticsearch query strings. Your task is to create the most effective query string for the given user question. This query string will be used to search for relevant documents in an Elasticsearch index.
Guidelines:
1. Analyze the user's question carefully.
2. Generate ONLY a query string suitable for Elasticsearch's match query.
3. Focus on key terms and concepts from the question.
4. Include synonyms or related terms that might be in relevant documents.
5. Use simple Elasticsearch query string syntax if helpful (e.g., OR, AND).
6. Do not use advanced Elasticsearch features or syntax.
7. Do not include any explanations, comments, or additional text.
8. Provide only the query string, nothing else.
For the question "What is Clickthrough Data?", we would expect a response like:
clickthrough data OR click-through data OR click through rate OR CTR OR user clicks OR ad clicks OR search engine results OR web analytics
AND operator is not allowed. Use only OR.
User Question:
[The user's question will be inserted here]
Generate the Elasticsearch query string:
'''
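For illustration, a minimal sketch of how this prompt might be wired up (the OpenAI client and model name are assumptions; the post's own gpt4o.generate_query helper used later presumably does something similar):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_query(question: str) -> str:
    # Substitute the user's question into the prompt template
    prompt = ELASTIC_SEARCH_QUERY_GENERATOR_PROMPT.replace(
        "[The user's question will be inserted here]", question
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # The prompt instructs the model to return only the query string
    return response.choices[0].message.content.strip()

match_query = generate_query("Who audits Elastic?")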
Generated query result:
'audits elastic OR
elasticsearch audits OR
elastic auditor OR
elasticsearch auditor OR
elastic audit firm OR
elastic audit company OR
elastic audit organization OR
elastic audit service'
Parse these terms and apply them as an ES query. When building the query, use 'multi_match' so the search runs across multiple fields.
def parse_or_query(self, query_text: str) -> List[str]:
    # Split the query by 'OR' and strip whitespace from each term
    # This converts a string like "term1 OR term2 OR term3" into a list ["term1", "term2", "term3"]
    return [term.strip() for term in query_text.split(' OR ')]
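For example, applied to a (shortened) generated string it yields a clean term list; the terms are then folded into a multi_match clause like the one below:

# Assuming es_query_maker is an instance of the class holding these helpers
terms = es_query_maker.parse_or_query('audits elastic OR elasticsearch audits OR elastic auditor')
# -> ['audits elastic', 'elasticsearch audits', 'elastic auditor']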
'query': {
    'bool': {
        'must': [
            {
                'multi_match': {
                    'query': 'audits Elastic Elastic auditing Elastic audit process Elastic compliance Elastic security audit Elasticsearch auditing Elasticsearch compliance Elasticsearch security audit',
                    'fields': [
                        'original_text',
                        'keyphrases',
                        'potential_questions',
                        'entities'
                    ],
                    'type': 'best_fields',
                    'operator': 'or'
                }
            }
        ]
    }
}
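A minimal sketch of sending such a body with the official Python client (the connection URL and the index name 'elastic_blog_index' are assumptions for illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

terms = ["audits elastic", "elasticsearch audits", "elastic auditor"]  # from parse_or_query
search_body = {
    "query": {
        "bool": {
            "must": [{
                "multi_match": {
                    "query": " ".join(terms),
                    "fields": ["original_text", "keyphrases", "potential_questions", "entities"],
                    "type": "best_fields",
                    "operator": "or"
                }
            }]
        }
    }
}
response = es.search(index="elastic_blog_index", body=search_body, size=10)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["original_text"][:100])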
HyDE (Hypothetical Document Embedding)
This method uses an LLM to generate an answer to the original query, then retrieves documents similar to that generated answer.
Generate the hypothetical answer with a prompt like the one below, embed it, and search with that embedding.
HYDE_DOCUMENT_GENERATOR_PROMPT = '''
You are an AI assistant specialized in generating hypothetical documents based on user queries. Your task is to create a detailed, factual document that would likely contain the answer to the user's question. This hypothetical document will be used to enhance the retrieval process in a Retrieval-Augmented Generation (RAG) system.
Guidelines:
1. Carefully analyze the user's query to understand the topic and the type of information being sought.
2. Generate a hypothetical document that:
a. Is directly relevant to the query
b. Contains factual information that would answer the query
c. Includes additional context and related information
d. Uses a formal, informative tone similar to an encyclopedia or textbook entry
3. Structure the document with clear paragraphs, covering different aspects of the topic.
4. Include specific details, examples, or data points that would be relevant to the query.
5. Aim for a document length of 200-300 words.
6. Do not use citations or references, as this is a hypothetical document.
7. Avoid using phrases like "In this document" or "This text discusses" - write as if it's a real, standalone document.
8. Do not mention or refer to the original query in the generated document.
9. Ensure the content is factual and objective, avoiding opinions or speculative information.
10. Output only the generated document, without any additional explanations or meta-text.
User Question:
[The user's question will be inserted here]
Generate a hypothetical document that would likely contain the answer to this query:
'''
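As a rough sketch of the HyDE step (again assuming an OpenAI-style client; the helper name and embedding model are illustrative, not the post's actual code):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def generate_hyde_document(question: str) -> str:
    # Substitute the question into the HyDE prompt template
    prompt = HYDE_DOCUMENT_GENERATOR_PROMPT.replace(
        "[The user's question will be inserted here]", question
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Embed the hypothetical answer rather than the raw query, then use
# this vector for the kNN side of the hybrid search below.
hyde_doc = generate_hyde_document("Who audits Elastic?")
query_vector = client.embeddings.create(
    model="text-embedding-3-small", input=hyde_doc
).data[0].embedding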
Hybrid Search
A search method that combines Lexical Search + Vector Search. Each side has its own strengths and weaknesses, so combining them makes the two approaches complement each other.
A search example follows:
def hybrid_vector_search(self, index_name: str, query_text: str, query_vector: List[float],
                         text_fields: List[str], vector_field: str,
                         num_candidates: int = 100, num_results: int = 10) -> Tuple[Dict, Dict]:
    """
    Perform a hybrid search combining text-based and vector-based similarity.

    Args:
        index_name (str): The name of the Elasticsearch index to search.
        query_text (str): The text query string, which may contain 'OR' separated terms.
        query_vector (List[float]): The query vector for semantic similarity search.
        text_fields (List[str]): List of text fields to search in the index.
        vector_field (str): The name of the field containing document vectors.
        num_candidates (int): Number of candidates to consider in the initial KNN search.
        num_results (int): Number of final results to return.

    Returns:
        Tuple[Dict, Dict]: The Elasticsearch response and the search body used.
    """
    try:
        # Parse the query_text into a list of individual search terms
        # This splits terms separated by 'OR' and removes any leading/trailing whitespace
        query_terms = self.parse_or_query(query_text)

        # Construct the search body for Elasticsearch
        search_body = {
            # KNN search component for vector similarity
            "knn": {
                "field": vector_field,            # The field containing document vectors
                "query_vector": query_vector,     # The query vector to compare against
                "k": num_candidates,              # Number of nearest neighbors to retrieve
                "num_candidates": num_candidates  # Number of candidates to consider in the KNN search
            },
            "query": {
                "bool": {
                    # The 'must' clause ensures that matching documents must satisfy this condition
                    # Documents that don't match this clause are excluded from the results
                    "must": [
                        {
                            # Multi-match query to search across multiple text fields
                            "multi_match": {
                                "query": " ".join(query_terms),  # Join all query terms into a single space-separated string
                                "fields": text_fields,           # List of fields to search in
                                "type": "best_fields",           # Use the best matching field for scoring
                                "operator": "or"                 # Match any of the terms (equivalent to the original OR query)
                            }
                        }
                    ],
                    # The 'should' clause boosts relevance but doesn't exclude documents
                    # It's used here to combine vector similarity with text relevance
                    "should": [
                        {
                            # Custom scoring using a script to combine vector and text scores
                            "script_score": {
                                "query": {"match_all": {}},  # Apply this scoring to all documents that matched the 'must' clause
                                "script": {
                                    # Painless script combining vector similarity and text relevance
                                    # (note: Painless comments use '//', not '#')
                                    "source": """
                                        // Calculate vector similarity (cosine similarity + 1)
                                        // Adding 1 ensures the score is always positive
                                        double vector_score = cosineSimilarity(params.query_vector, params.vector_field) + 1.0;
                                        // Get the text-based relevance score from the multi_match query
                                        double text_score = _score;
                                        // Combine scores: 70% vector similarity, 30% text relevance
                                        // This weighting can be adjusted based on the importance of semantic vs keyword matching
                                        return 0.7 * vector_score + 0.3 * text_score;
                                    """,
                                    # Parameters passed to the script
                                    "params": {
                                        "query_vector": query_vector,  # Query vector for similarity calculation
                                        "vector_field": vector_field   # Field containing document vectors
                                    }
                                }
                            }
                        }
                    ]
                }
            }
        }

        # Execute the search request against the Elasticsearch index
        response = self.conn.search(index=index_name, body=search_body, size=num_results)

        # Log the successful execution of the search for monitoring and debugging
        logger.info(f"Hybrid search executed on index: {index_name} with text query: {query_text}")

        # Return both the response and the search body (useful for debugging and result analysis)
        return response, search_body
    except Exception as e:
        # Log any errors that occur during the search process
        logger.error(f"Error executing hybrid search on index: {index_name}. Error: {e}")
        # Re-raise the exception for further handling in the calling code
        raise e
After retrieving the results, apply the Reverse Packing mechanism: reverse the ranking so the highest-ranked document ends up last in the context, closest to the question, before handing it to the LLM.
The code that actually retrieves from ES and builds the LLM context looks like this:
def get_context(index_name,
                match_query,
                text_query,
                fields,
                num_candidates=100,
                num_results=20,
                text_fields=["original_text", "keyphrases", "potential_questions", "entities"],
                embedding_field="primary_embedding"):
    embedding = embedder.get_embeddings_from_text(text_query)

    results, search_body = es_query_maker.hybrid_vector_search(
        index_name=index_name,
        query_text=match_query,
        query_vector=embedding[0][0],
        text_fields=text_fields,
        vector_field=embedding_field,
        num_candidates=num_candidates,
        num_results=num_results
    )

    # Concatenate the text in each 'field' key of the search result objects into a single block of text
    context_docs = ['\n\n'.join([field + ":\n\n" + j['_source'][field] for field in fields]) for j in results['hits']['hits']]

    # Reverse Packing: put the highest-ranking document last, closest to the user's question
    context_docs.reverse()
    return context_docs, search_body
def retrieval_augmented_generation(query_text):
    match_query = gpt4o.generate_query(query_text)
    fields = ['original_text']
    hyde_document = gpt4o.generate_HyDE(query_text)
    context, search_body = get_context(index_name, match_query, hyde_document, fields)
    answer = gpt4o.basic_qa(query=query_text, context=context)
    return answer, match_query, hyde_document, context, search_body
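Putting it all together, one call runs query enrichment, HyDE, hybrid search, and answer generation, for example:

answer, match_query, hyde_document, context, search_body = retrieval_augmented_generation(
    "Who audits Elastic?"
)
print(answer)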