Enriching Queries with Synonyms:
Query enriching takes the original query, generates similar queries from multiple perspectives, and applies them as an OR search in ES.
Suppose we have the query "Who audits Elastic".
Running it through a prompt like the one below would produce a result like this:
ELASTIC_SEARCH_QUERY_GENERATOR_PROMPT = '''
You are an AI assistant specialized in generating Elasticsearch query strings. Your task is to create the most effective query string for the given user question. This query string will be used to search for relevant documents in an Elasticsearch index.
Guidelines:
1. Analyze the user's question carefully.
2. Generate ONLY a query string suitable for Elasticsearch's match query.
3. Focus on key terms and concepts from the question.
4. Include synonyms or related terms that might be in relevant documents.
5. Use simple Elasticsearch query string syntax if helpful (e.g., OR, AND).
6. Do not use advanced Elasticsearch features or syntax.
7. Do not include any explanations, comments, or additional text.
8. Provide only the query string, nothing else.
For the question "What is Clickthrough Data?", we would expect a response like:
clickthrough data OR click-through data OR click through rate OR CTR OR user clicks OR ad clicks OR search engine results OR web analytics
AND operator is not allowed. Use only OR.
User Question:
[The user's question will be inserted here]
Generate the Elasticsearch query string:
'''
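For illustration, a minimal sketch of how this prompt might be wired up (the OpenAI client and model name are assumptions; the post's own gpt4o.generate_query helper used later presumably does something similar):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_query(question: str) -> str:
    # Substitute the user's question into the prompt template
    prompt = ELASTIC_SEARCH_QUERY_GENERATOR_PROMPT.replace(
        "[The user's question will be inserted here]", question
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # The prompt instructs the model to return only the query string
    return response.choices[0].message.content.strip()

match_query = generate_query("Who audits Elastic?")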
Generated query result:
'audits elastic OR
elasticsearch audits OR
elastic auditor OR
elasticsearch auditor OR
elastic audit firm OR
elastic audit company OR
elastic audit organization OR
elastic audit service'
Parse these terms and apply them as an ES query. When building the query, use 'multi_match' so the search runs across multiple fields.
def parse_or_query(self, query_text: str) -> List[str]:
    # Split the query by 'OR' and strip whitespace from each term
    # This converts a string like "term1 OR term2 OR term3" into a list ["term1", "term2", "term3"]
    return [term.strip() for term in query_text.split(' OR ')]
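For example, applied to a (shortened) generated string it yields a clean term list; the terms are then folded into a multi_match clause like the one below:

# Assuming es_query_maker is an instance of the class holding these helpers
terms = es_query_maker.parse_or_query('audits elastic OR elasticsearch audits OR elastic auditor')
# -> ['audits elastic', 'elasticsearch audits', 'elastic auditor']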
'query': {
    'bool': {
        'must': [
            {
                'multi_match': {
                    'query': 'audits Elastic Elastic auditing Elastic audit process Elastic compliance Elastic security audit Elasticsearch auditing Elasticsearch compliance Elasticsearch security audit',
                    'fields': [
                        'original_text',
                        'keyphrases',
                        'potential_questions',
                        'entities'
                    ],
                    'type': 'best_fields',
                    'operator': 'or'
                }
            }
        ]
    }
}
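A minimal sketch of sending such a body with the official Python client (the connection URL and the index name 'elastic_blog_index' are assumptions for illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

terms = ["audits elastic", "elasticsearch audits", "elastic auditor"]  # from parse_or_query
search_body = {
    "query": {
        "bool": {
            "must": [{
                "multi_match": {
                    "query": " ".join(terms),
                    "fields": ["original_text", "keyphrases", "potential_questions", "entities"],
                    "type": "best_fields",
                    "operator": "or"
                }
            }]
        }
    }
}
response = es.search(index="elastic_blog_index", body=search_body, size=10)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["original_text"][:100])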
HyDE (Hypothetical Document Embedding)
This method uses an LLM to generate an answer to the original query, then retrieves documents similar to that generated answer.
Generate the hypothetical answer with a prompt like the one below, embed it, and search with that embedding.
HYDE_DOCUMENT_GENERATOR_PROMPT = '''
You are an AI assistant specialized in generating hypothetical documents based on user queries. Your task is to create a detailed, factual document that would likely contain the answer to the user's question. This hypothetical document will be used to enhance the retrieval process in a Retrieval-Augmented Generation (RAG) system.
Guidelines:
1. Carefully analyze the user's query to understand the topic and the type of information being sought.
2. Generate a hypothetical document that:
a. Is directly relevant to the query
b. Contains factual information that would answer the query
c. Includes additional context and related information
d. Uses a formal, informative tone similar to an encyclopedia or textbook entry
3. Structure the document with clear paragraphs, covering different aspects of the topic.
4. Include specific details, examples, or data points that would be relevant to the query.
5. Aim for a document length of 200-300 words.
6. Do not use citations or references, as this is a hypothetical document.
7. Avoid using phrases like "In this document" or "This text discusses" - write as if it's a real, standalone document.
8. Do not mention or refer to the original query in the generated document.
9. Ensure the content is factual and objective, avoiding opinions or speculative information.
10. Output only the generated document, without any additional explanations or meta-text.
User Question:
[The user's question will be inserted here]
Generate a hypothetical document that would likely contain the answer to this query:
'''
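As a rough sketch of the HyDE step (again assuming an OpenAI-style client; the helper name and embedding model are illustrative, not the post's actual code):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def generate_hyde_document(question: str) -> str:
    # Substitute the question into the HyDE prompt template
    prompt = HYDE_DOCUMENT_GENERATOR_PROMPT.replace(
        "[The user's question will be inserted here]", question
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Embed the hypothetical answer rather than the raw query, then use
# this vector for the kNN side of the hybrid search below.
hyde_doc = generate_hyde_document("Who audits Elastic?")
query_vector = client.embeddings.create(
    model="text-embedding-3-small", input=hyde_doc
).data[0].embedding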
Hybrid Search
A search method that combines Lexical Search + Vector Search. Each side has its own strengths and weaknesses, so combining them makes the two approaches complement each other.
A search example follows:
def hybrid_vector_search(self, index_name: str, query_text: str, query_vector: List[float],
                         text_fields: List[str], vector_field: str,
                         num_candidates: int = 100, num_results: int = 10) -> Tuple[Dict, Dict]:
    """
    Perform a hybrid search combining text-based and vector-based similarity.

    Args:
        index_name (str): The name of the Elasticsearch index to search.
        query_text (str): The text query string, which may contain 'OR' separated terms.
        query_vector (List[float]): The query vector for semantic similarity search.
        text_fields (List[str]): List of text fields to search in the index.
        vector_field (str): The name of the field containing document vectors.
        num_candidates (int): Number of candidates to consider in the initial KNN search.
        num_results (int): Number of final results to return.

    Returns:
        Tuple[Dict, Dict]: The Elasticsearch response and the search body used.
    """
    try:
        # Parse the query_text into a list of individual search terms
        # This splits terms separated by 'OR' and removes any leading/trailing whitespace
        query_terms = self.parse_or_query(query_text)

        # Construct the search body for Elasticsearch
        search_body = {
            # KNN search component for vector similarity
            "knn": {
                "field": vector_field,            # The field containing document vectors
                "query_vector": query_vector,     # The query vector to compare against
                "k": num_candidates,              # Number of nearest neighbors to retrieve
                "num_candidates": num_candidates  # Number of candidates to consider in the KNN search
            },
            "query": {
                "bool": {
                    # The 'must' clause ensures that matching documents must satisfy this condition
                    # Documents that don't match this clause are excluded from the results
                    "must": [
                        {
                            # Multi-match query to search across multiple text fields
                            "multi_match": {
                                "query": " ".join(query_terms),  # Join all query terms into a single space-separated string
                                "fields": text_fields,           # List of fields to search in
                                "type": "best_fields",           # Use the best matching field for scoring
                                "operator": "or"                 # Match any of the terms (equivalent to the original OR query)
                            }
                        }
                    ],
                    # The 'should' clause boosts relevance but doesn't exclude documents
                    # It's used here to combine vector similarity with text relevance
                    "should": [
                        {
                            # Custom scoring using a script to combine vector and text scores
                            "script_score": {
                                "query": {"match_all": {}},  # Apply this scoring to all documents that matched the 'must' clause
                                "script": {
                                    # Painless script combining vector similarity and text relevance
                                    # (note: Painless comments use '//', not '#')
                                    "source": """
                                        // Calculate vector similarity (cosine similarity + 1)
                                        // Adding 1 ensures the score is always positive
                                        double vector_score = cosineSimilarity(params.query_vector, params.vector_field) + 1.0;
                                        // Get the text-based relevance score from the multi_match query
                                        double text_score = _score;
                                        // Combine scores: 70% vector similarity, 30% text relevance
                                        // This weighting can be adjusted based on the importance of semantic vs keyword matching
                                        return 0.7 * vector_score + 0.3 * text_score;
                                    """,
                                    # Parameters passed to the script
                                    "params": {
                                        "query_vector": query_vector,  # Query vector for similarity calculation
                                        "vector_field": vector_field   # Field containing document vectors
                                    }
                                }
                            }
                        }
                    ]
                }
            }
        }

        # Execute the search request against the Elasticsearch index
        response = self.conn.search(index=index_name, body=search_body, size=num_results)

        # Log the successful execution of the search for monitoring and debugging
        logger.info(f"Hybrid search executed on index: {index_name} with text query: {query_text}")

        # Return both the response and the search body (useful for debugging and result analysis)
        return response, search_body
    except Exception as e:
        # Log any errors that occur during the search process
        logger.error(f"Error executing hybrid search on index: {index_name}. Error: {e}")
        # Re-raise the exception for further handling in the calling code
        raise e
After retrieving the results, apply the Reverse Packing mechanism: reverse the ranking so the highest-ranked document ends up last in the context, closest to the question, before handing it to the LLM.
The code that actually retrieves from ES and builds the LLM context looks like this:
def get_context(index_name,
                match_query,
                text_query,
                fields,
                num_candidates=100,
                num_results=20,
                text_fields=["original_text", "keyphrases", "potential_questions", "entities"],
                embedding_field="primary_embedding"):
    embedding = embedder.get_embeddings_from_text(text_query)

    results, search_body = es_query_maker.hybrid_vector_search(
        index_name=index_name,
        query_text=match_query,
        query_vector=embedding[0][0],
        text_fields=text_fields,
        vector_field=embedding_field,
        num_candidates=num_candidates,
        num_results=num_results
    )

    # Concatenate the text in each 'field' key of the search result objects into a single block of text
    context_docs = ['\n\n'.join([field + ":\n\n" + j['_source'][field] for field in fields]) for j in results['hits']['hits']]

    # Reverse Packing: put the highest-ranking document last, closest to the user's question
    context_docs.reverse()
    return context_docs, search_body
def retrieval_augmented_generation(query_text):
    match_query = gpt4o.generate_query(query_text)
    fields = ['original_text']
    hyde_document = gpt4o.generate_HyDE(query_text)
    context, search_body = get_context(index_name, match_query, hyde_document, fields)
    answer = gpt4o.basic_qa(query=query_text, context=context)
    return answer, match_query, hyde_document, context, search_body
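Putting it all together, one call runs query enrichment, HyDE, hybrid search, and answer generation, for example:

answer, match_query, hyde_document, context, search_body = retrieval_augmented_generation(
    "Who audits Elastic?"
)
print(answer)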