Two main approaches to improving RAG quality:
- Enhancing the Quality and Clarity of the Knowledge Base:
- Data Organization and Formatting: Ensuring that data is well-organized and clearly formatted is crucial. Poorly structured data can hinder the RAG system's ability to efficiently retrieve relevant information. Proper formatting allows for effective chunking, which is the process of breaking down data into manageable pieces for retrieval and generation.
- Contextual Metadata: Incorporating contextual metadata helps guide the chunking process. Metadata provides essential cues that ensure chunks are neither too broad nor too narrow, which can otherwise introduce noise and reduce retrieval accuracy.
- Data Quality Assurance: Regular updates and quality checks of the knowledge base are necessary to maintain accuracy and relevance. Outdated or incorrect data can lead to misleading responses. Involving subject matter experts to review and validate content can help maintain a reliable knowledge base.
- Granularity of Data Segmentation: Proper segmentation of data into semantically meaningful chunks is important. This helps in accurately matching user queries with relevant content, improving the overall retrieval process.
- Improving the Coverage and Specificity of Search Queries:
- Query Specificity: More specific queries tend to perform better because they provide clearer signals to the search engine. Specificity helps in reducing ambiguity and improving the relevance of retrieved results.
- Query Transformation: This involves modifying queries to better match the structure and expectations of the knowledge base. Techniques such as query rewriting, decomposition into sub-queries, and generating pseudo-documents can enhance retrieval performance by providing more context and alternative formulations.
- Query Enrichment: Expanding the original query with related terms or synonyms can increase the range of potential matches in the document corpus. This helps in retrieving relevant documents that may not exactly match the original query terms. A minimal sketch of this idea follows the list.
- Re-Ranking: After the initial retrieval, re-ranking techniques can be used to refine the ranking of results. This ensures that the most relevant information is prioritized in the generation process, improving the quality of the responses.
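To make query enrichment concrete, here is a minimal sketch. It is hypothetical and not from the source article; expand_query and max_synonyms_per_term are illustrative names, and it simply appends WordNet synonyms to the raw query before it is sent to the retriever.

# query_enrichment.py -- hypothetical sketch, not from the source article
# Requires the WordNet corpus: nltk.download('wordnet')
from nltk.corpus import wordnet

def expand_query(query, max_synonyms_per_term=2):
    expanded_terms = []
    for term in query.lower().split():
        expanded_terms.append(term)
        # Collect alternative wordings for the term from WordNet
        synonyms = {
            lemma.name().replace('_', ' ')
            for synset in wordnet.synsets(term)
            for lemma in synset.lemmas()
            if lemma.name().lower() != term
        }
        expanded_terms.extend(sorted(synonyms)[:max_synonyms_per_term])
    # The expanded query mixes the original terms with their synonyms
    return ' '.join(expanded_terms)

print(expand_query("annual revenue growth"))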
Simple Data Ingestion with LlamaIndex:
# llamaindex_processor.py
from llama_index.core import SimpleDirectoryReader

class LlamaIndexProcessor:
    def __init__(self):
        pass

    def load_documents(self, directory_path):
        '''
        Load all documents in directory
        '''
        reader = SimpleDirectoryReader(input_dir=directory_path)
        return reader.load_data()

# main.ipynb
llamaindex_processor = LlamaIndexProcessor()
documents = llamaindex_processor.load_documents('./documents/')
documents = [dict(doc_obj) for doc_obj in documents]
The output will look like this:
{
'id_': '5f76f0b3-22d8-49a8-9942-c2bbab14f63f',
'metadata': {'page_label': '5',
'file_name': 'Elastic_NV_Annual-Report-Fiscal-Year-2023.pdf',
'file_path': '/Users/han/Desktop/Projects/truckasaurus/documents/Elastic_NV_Annual-Report-Fiscal-Year-2023.pdf',
'file_type': 'application/pdf',
'file_size': 3724426,
'creation_date': '2024-07-27',
'last_modified_date': '2024-07-27'},
'text': 'Table of Contents\nPage\nPART I\nItem 1. Business 3\n15 Item 1A. Risk Factors\nItem 1B. Unresolved Staff Comments 48\nItem 2. Properties 48\nItem 3. Legal Proceedings 48\nItem 4. Mine Safety Disclosures 48\nPART II\nItem 5. Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of \nEquity Securities49\nItem 6. [Reserved] 49\nItem 7. Management’s Discussion and Analysis of Financial Condition and Results of Operations 50\nItem 7A. Quantitative and Qualitative Disclosures About Market Risk 64\nItem 8. Financial Statements and Supplementary Data 66\nItem 9. Changes in and Disagreements With Accountants on Accounting and Financial Disclosure 100\n100\n101Item 9A. Controls and Procedures\nItem 9B. Other Information\nItem 9C. Disclosure Regarding Foreign Jurisdictions That Prevent Inspections 101\nPART III\n102\n102\n102\n102Item 10. Directors, Executive Officers and Corporate Governance\nItem 11. Executive Compensation\nItem 12. Security Ownership of Certain Beneficial Owners and Management, and Related Stockholder Matters \nItem 13. Certain Relationships and Related Transactions, and Director Independence\nItem 14. Principal Accountant Fees and Services 102\nPART IV\n103\n105Item 15. Exhibits and Financial Statement Schedules \nItem 16. Form 10-K Summary\nSignatures 106\ni',
...
}
Applying Sentence-Level, Token-Wise Chunking:
- What matters in chunking: each chunk should carry the complete meaning of a single topic.
- Doing that properly would require an LLM, but that is slow and expensive.
- Typical chunking just cuts the raw text at fixed lengths. That is simple, but it makes the core goal of chunking hard to achieve.
- Here we use sentence-level chunking, which is not the best possible method, together with token-based sizing. Cutting by token count damages meaning far less than cutting by raw character count, since tokens already split text into meaningful units.
- Sentence-level chunking is a way to keep chunk boundaries from falling in the middle of a sentence.
- LangChain's RecursiveCharacterTextSplitter splits reasonably well along paragraph and sentence boundaries, but it cannot guarantee that a sentence is never cut in the middle (see the comparison sketch right below).
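For comparison, the LangChain splitter mentioned above is used roughly like this; the chunk_size and chunk_overlap values are illustrative and this snippet is not part of the article's pipeline.

# For comparison: LangChain's character-based recursive splitter.
# It prefers paragraph and sentence separators, but a boundary can still
# fall mid-sentence when a piece exceeds chunk_size.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=20)
langchain_chunks = splitter.split_text(documents[0]['text'])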
# chunker.py
import uuid
import re

class Chunker:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def split_into_sentences(self, text):
        """Split text into sentences."""
        return re.split(r'(?<=[.!?])\s+', text)

    def sentence_wise_tokenized_chunk_documents(self, documents, chunk_size=512, overlap=20, min_chunk_size=50):
        '''
        1. Split text into sentences.
        2. Tokenize using the provided tokenizer method.
        3. Build chunks up to the chunk_size limit.
        4. Create an overlap based on tokens - to preserve context.
        5. Only keep chunks that meet the minimum token size requirement.
        '''
        chunked_documents = []

        for doc in documents:
            sentences = self.split_into_sentences(doc['text'])
            tokens = []
            sentence_boundaries = [0]

            # Tokenize all sentences and keep track of sentence boundaries
            for sentence in sentences:
                sentence_tokens = self.tokenizer.encode(sentence, add_special_tokens=True)
                tokens.extend(sentence_tokens)
                sentence_boundaries.append(len(tokens))

            # Create chunks
            chunk_start = 0
            while chunk_start < len(tokens):
                chunk_end = chunk_start + chunk_size

                # Find the last complete sentence that fits in the chunk
                sentence_end = next((i for i in sentence_boundaries if i > chunk_end), len(tokens))
                chunk_end = min(chunk_end, sentence_end)

                # Create the chunk
                chunk_tokens = tokens[chunk_start:chunk_end]

                # Check if the chunk meets the minimum size requirement
                if len(chunk_tokens) >= min_chunk_size:
                    # Create a new document object for this chunk
                    chunk_doc = {
                        'id_': str(uuid.uuid4()),
                        'chunk': chunk_tokens,
                        'original_text': self.tokenizer.decode(chunk_tokens),
                        'chunk_index': len(chunked_documents),
                        'parent_id': doc['id_'],
                        'chunk_token_count': len(chunk_tokens)
                    }

                    # Copy all other fields from the original document
                    for key, value in doc.items():
                        if key != 'text' and key not in chunk_doc:
                            chunk_doc[key] = value

                    chunked_documents.append(chunk_doc)

                # Move to the next chunk start, considering overlap
                chunk_start = max(chunk_start + chunk_size - overlap, chunk_end - overlap)

        return chunked_documents
# main.ipynb
import os

# EmbeddingModel (embedding_model.py) and Chunker (chunker.py) are imported from the local modules shown in these notes
# Initialize Embedding Model
HUGGINGFACE_EMBEDDING_MODEL = os.environ.get('HUGGINGFACE_EMBEDDING_MODEL')
embedder = EmbeddingModel(model_name=HUGGINGFACE_EMBEDDING_MODEL)

# Initialize Chunker
chunker = Chunker(embedder.tokenizer)
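The notes do not show the actual chunking call, but presumably something like the following produces the chunked_documents printed below; the parameter values are a hedged guess that matches the method's defaults.

# Chunk the ingested documents (assumed call; produces the chunked_documents used below)
chunked_documents = chunker.sentence_wise_tokenized_chunk_documents(documents, chunk_size=512, overlap=20, min_chunk_size=50)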
The chunking result will look like this:
print(chunked_documents[4]['original_text'])
[CLS] the aggregate market value of the ordinary shares held by non - affiliates of the registrant,
based on the closing price of the shares of ordinary shares on the new york stock exchange on
october 31, 2022 ( the last business day of the registrant ’ s second fiscal quarter ), was
approximately $ 6. 1 billion. [SEP] [CLS] as of may 31, 2023, the registrant had 97, 390, 886
ordinary shares, par value €0. 01 per share, outstanding. [SEP] [CLS] documents incorporated by
reference portions of the registrant ’ s definitive proxy statement relating to the registrant ’ s 2
023 annual general meeting of shareholders are incorporated by reference into part iii of this annual
...
...
Metadata Inclusion and Generation:
- Here we generate meaningful metadata so that retrieval works better.
- The metadata generated for search: key phrases extracted from the text (using TextRank), questions related to the text (using GPT), and the main entities (using spaCy).
- Each generated metadata field is embedded, and the embeddings are combined with weights into a single embedding. This should retrieve better than simply chunking and embedding the raw text alone.
Related code:
# documentenricher.py
from tqdm import tqdm

class DocumentEnricher:
    def __init__(self):
        pass

    def enrich_document(self, documents, processors, text_col='text'):
        for doc in tqdm(documents, desc="Enriching documents using processors: " + str(processors)):
            for (processor, field) in processors:
                metadata = processor(doc[text_col])
                if isinstance(metadata, list):
                    metadata = '\n'.join(metadata)
                doc.update({field: metadata})
# main.ipynb
# Initialize processor classes
nltkprocessor = NLTKProcessor()        # defined in nltk_processor.py
entity_extractor = EntityExtractor()   # defined in entity_extractor.py
gpt4o = LLMProcessor(model='gpt-4o')   # defined in llm.py

# Initialize DocumentEnricher
documentenricher = DocumentEnricher()

# Create new fields in the documents - these are the outputs of the processor functions.
processors = [
    (nltkprocessor.textrank_phrases, "keyphrases"),
    (gpt4o.generate_questions, "potential_questions"),
    (entity_extractor.extract_entities, "entities")
]

# .enrich_document() will modify chunked_documents in place.
# To view the results, we'll print chunked_documents in the next few cells!
documentenricher.enrich_document(chunked_documents, text_col='original_text', processors=processors)
Keyphrases extracted by TextRank:
print(chunked_documents[25]['keyphrases'])
'elastic agent stop', 'agent stop malware',
'stop malware ransomware', 'malware ransomware environment',
'ransomware environment wide', 'environment wide visibility',
'wide visibility threat', 'visibility threat detection',
'sep cl key', 'cl key feature'
Potential questions generated by GPT-4o:
print(chunked_documents[25]['potential_questions'])
1. What are the primary functions that Elastic Agent provides in terms of cybersecurity?
2. Describe how Logstash contributes to data management within an IT environment.
3. List and explain any key features of Logstash mentioned in the document.
4. How does Elastic Agent enhance environment-wide visibility in threat detection?
5. What capabilities does Logstash offer for handling data beyond simple collection?
6. In what ways does the document suggest that Elastic Agent stops malware and ransomware?
7. Can you identify any relationships between the functionalities of Elastic Agent and Logstash in an integrated environment?
8. What implications might the advanced threat detection capabilities of Elastic Agent have for organizational security policies?
9. Compare and contrast the roles of Elastic Agent and Logstash based on their described functions.
10. How might the centralized collection ability of Logstash support the threat detection capabilities of Elastic Agent?
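The LLMProcessor behind these questions (llm.py) is not reproduced in these notes. A minimal sketch against the OpenAI chat completions API might look like this; the class shape comes from the notes, but the prompt, n_questions parameter, and internals are assumptions.

# llm.py -- hypothetical sketch, not the article's actual implementation
from openai import OpenAI

class LLMProcessor:
    def __init__(self, model='gpt-4o'):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def generate_questions(self, text, n_questions=10):
        # Ask the model for questions this passage could answer;
        # the result becomes the 'potential_questions' field embedded later.
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{
                'role': 'user',
                'content': f"Generate {n_questions} questions that the following text could answer:\n\n{text}"
            }]
        )
        return response.choices[0].message.content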
Entities extracted by spaCy:
print(chunked_documents[29]['entities'])
'appdynamics', 'apm data', 'azure sentinel',
'microsoft', 'mcafee', 'broadcom', 'cisco',
'dynatrace', 'coveo', 'lucidworks'
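Likewise, the EntityExtractor (entity_extractor.py) is not shown in these notes. A rough sketch built on spaCy's named-entity recognizer could look like this; the class and method names come from the notes, while the model choice and implementation are assumptions.

# entity_extractor.py -- hypothetical sketch, not the article's actual implementation
import spacy

class EntityExtractor:
    def __init__(self, model_name='en_core_web_sm'):
        # requires: python -m spacy download en_core_web_sm
        self.nlp = spacy.load(model_name)

    def extract_entities(self, text):
        # Return the named-entity strings found in the text,
        # deduplicated while preserving first-seen order
        doc = self.nlp(text)
        return list(dict.fromkeys(ent.text.lower() for ent in doc.ents))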
Composite Multi-Field Embeddings (combining the embeddings):
- Generate an embedding for each metadata field and for the text itself, then combine them with a weighted sum.
# EmbeddingModel defined in embedding_model.py
embedder = EmbeddingModel(model_name=HUGGINGFACE_EMBEDDING_MODEL)

cols_to_embed = ['keyphrases', 'potential_questions', 'entities']

embedding_cols = []
for col in cols_to_embed:
    # Works on text input
    embedding_col = embedder.embed_documents_text_wise(chunked_documents, text_field=col)
    embedding_cols.append(embedding_col)

# Works on token input
embedding_col = embedder.embed_documents_token_wise(chunked_documents, token_field="chunk")
embedding_cols.append(embedding_col)

embedding_cols = [
    'keyphrases_embedding',
    'potential_questions_embedding',
    'entities_embedding',
    'chunk_embedding'
]

combination_weights = [
    0.1,
    0.15,
    0.05,
    0.7
]
from tqdm import tqdm
import numpy as np

def combine_embeddings(objects, embedding_cols, combination_weights, primary_embedding='primary_embedding'):
    # Ensure the number of weights matches the number of embedding columns
    assert len(embedding_cols) == len(combination_weights), "Number of embedding columns must match number of weights"

    # Normalize weights to sum to 1
    weights = np.array(combination_weights) / np.sum(combination_weights)

    for obj in tqdm(objects, desc="Combining embeddings"):
        # Initialize the combined embedding
        combined = np.zeros_like(obj[embedding_cols[0]])

        # Compute the weighted sum
        for col, weight in zip(embedding_cols, weights):
            combined += weight * np.array(obj[col])

        # Add the new combined embedding to the object
        obj.update({primary_embedding: combined.tolist()})

        # Remove the original embedding columns
        for col in embedding_cols:
            obj.pop(col, None)

combine_embeddings(chunked_documents, embedding_cols, combination_weights)
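The EmbeddingModel referenced throughout (embedding_model.py) is also not reproduced in these notes. A minimal sketch of the interface assumed above, using a Hugging Face model with mean pooling, might look like this; the '<field>_embedding' naming matches the column names listed earlier, but the pooling strategy and other internals are assumptions.

# embedding_model.py -- hypothetical sketch, not the article's actual implementation
import torch
from transformers import AutoTokenizer, AutoModel

class EmbeddingModel:
    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.eval()

    def _mean_pool(self, token_embeddings, attention_mask):
        # Average the token vectors, ignoring padding positions
        mask = attention_mask.unsqueeze(-1).float()
        return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    def embed_documents_text_wise(self, documents, text_field):
        # Embed a text field in place as '<text_field>_embedding' and return the new field name
        embedding_field = f"{text_field}_embedding"
        for doc in documents:
            inputs = self.tokenizer(doc[text_field], return_tensors='pt', truncation=True, max_length=512)
            with torch.no_grad():
                output = self.model(**inputs)
            doc[embedding_field] = self._mean_pool(output.last_hidden_state, inputs['attention_mask'])[0].tolist()
        return embedding_field

    def embed_documents_token_wise(self, documents, token_field):
        # Embed a field that already holds token ids (the 'chunk' built by the Chunker)
        embedding_field = f"{token_field}_embedding"
        for doc in documents:
            input_ids = torch.tensor([doc[token_field]])
            attention_mask = torch.ones_like(input_ids)
            with torch.no_grad():
                output = self.model(input_ids=input_ids, attention_mask=attention_mask)
            doc[embedding_field] = self._mean_pool(output.last_hidden_state, attention_mask)[0].tolist()
        return embedding_field

Note the weights above deliberately favor the chunk text itself (0.7), with the metadata embeddings acting as smaller corrective signals.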
The full result after chunking and enrichment:
{
  'id_': '7fe71686-5cd0-4831-9e79-998c6dbeae0c',
  'chunk': [2312, 14613, ...],
  'original_text': 'if an emerging growth company, indicate by check mark if the registrant has elected not to use the extended ...',
  'chunk_index': 3,
  'chunk_token_count': 399,
  'metadata': {'page_label': '3',
               'file_name': 'Elastic_NV_Annual-Report-Fiscal-Year-2023.pdf',
               ...},
  'keyphrases': 'sep cl unk\ncheck mark registrant\ncl unk indicate\nunk indicate check\nindicate check mark\nprincipal executive office\naccelerate filer unk\ncompany unk emerge\nunk emerge growth\nemerge growth company',
  'potential_questions': '1. What are the different types of registrant statuses mentioned in the document?\n2. Under what section of the Sarbanes-Oxley Act must registrants file a report on the effectiveness of their internal ...',
  'entities': 'the effe ctiveness of\nsection 13\nSEP\nUNK\nsection 21e\n1934\n1933\nu. s. c.\nsection 404\nsection 12\nal',
  'primary_embedding': [-0.3946287803351879, -0.17586839850991964, ...]
}