Prompt Compression and Query Optimization

youngerjesus 2024. 8. 10. 08:42

1. Introduction

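This post covers techniques for keeping what is retrieved from a vector database relevant while making it as small as possible.

Doing so improves answer quality, because irrelevant text is removed, and lowers cost, because fewer tokens are sent to the model.
This technique is called Prompt Compression.

It also covers the pre-filtering and post-filtering that can be applied during retrieval:

Pre-filtering filters on data stored in the database, much like a WHERE clause. For example, a rental app filtering listings by number of beds.
Post-filtering selects, from the retrieved results, the ones that meet a condition. Projection is a typical example. Selecting retrieved documents with diversity in mind is another.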
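
To make the difference concrete, here is a minimal sketch using the rental example. The listings, their field names, and the toy 2-d embeddings are all made up for illustration; a real vector database would run the pre-filter inside the search itself.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical rental listings with toy 2-d embeddings
LISTINGS = [
    {"name": "Loft", "beds": 1, "price": 90, "embedding": [0.9, 0.1]},
    {"name": "Family house", "beds": 3, "price": 200, "embedding": [0.8, 0.2]},
    {"name": "Studio", "beds": 1, "price": 60, "embedding": [0.1, 0.9]},
    {"name": "Townhouse", "beds": 3, "price": 150, "embedding": [0.2, 0.8]},
]

def search(query_embedding, min_beds, k=2):
    # Pre-filter: restrict candidates on a stored field before the similarity search
    candidates = [d for d in LISTINGS if d["beds"] >= min_beds]
    ranked = sorted(
        candidates,
        key=lambda d: cosine(query_embedding, d["embedding"]),
        reverse=True,
    )
    # Post-filter: projection keeps only the fields the prompt actually needs
    return [{"name": d["name"], "price": d["price"]} for d in ranked[:k]]

results = search([1.0, 0.0], min_beds=3)
print(results)
```

The pre-filter shrinks the candidate set before ranking, while the projection drops the embedding and other unused fields so they never reach the prompt.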

Reranking is another good strategy to use in retrieval.

Reranking can be based on other fields of the document.

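Pre-filtering considerations:

Keyword matching: select only documents that contain key terms related to the query.
Metadata filtering: filter documents by metadata such as date, author, or category.
Embedding similarity threshold: select only documents whose cosine similarity to the query embedding is above a given threshold.
Language detection: select only documents written in the same language as the query.
Document length filtering: exclude documents that are too short or too long.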
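
Several of these checks can be combined into one pass over the candidates. The sketch below chains the length, metadata, and keyword filters; the documents, field names, and default word limits are assumptions made for illustration.

```python
from datetime import date

# Hypothetical documents; the field names are illustrative
DOCS = [
    {"id": 1, "text": "Prompt compression reduces token cost in RAG pipelines.",
     "category": "rag", "published": date(2024, 8, 1)},
    {"id": 2, "text": "RAG", "category": "rag", "published": date(2024, 8, 2)},
    {"id": 3, "text": "Prompt compression was discussed at a cooking class.",
     "category": "food", "published": date(2024, 8, 3)},
    {"id": 4, "text": "Older notes on prompt compression for RAG systems.",
     "category": "rag", "published": date(2022, 1, 1)},
]

def pre_filter(docs, keywords, category, published_after, min_words=3, max_words=200):
    kept = []
    for doc in docs:
        text = doc["text"].lower()
        if not (min_words <= len(text.split()) <= max_words):
            continue  # length filter: too short or too long
        if doc["category"] != category or doc["published"] < published_after:
            continue  # metadata filter: wrong category or too old
        if not any(k.lower() in text for k in keywords):
            continue  # keyword filter: no key term present
        kept.append(doc)
    return kept

kept_ids = [d["id"] for d in pre_filter(DOCS, ["compression"], "rag", date(2024, 1, 1))]
print(kept_ids)
```

Cheap checks such as length and metadata run first so the more specific keyword test only sees documents that already passed them.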

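Post-filtering considerations:

Semantic similarity score: re-order the retrieved documents using a more precise semantic similarity model.
Entity matching: filter based on how well entities (e.g. people, places, organizations) extracted from the query and the documents match.
Temporal relevance: when the query is time-sensitive, prioritize the most recent information.
Diversity filtering: remove documents that are too similar to each other so the results cover a wider range of information.
Fact checking: verify the factuality of retrieved information and prioritize information from trustworthy sources.
Context fit check: evaluate how well each retrieved document fits the overall context of the query.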
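
As one way to express the temporal relevance item, the sketch below multiplies each similarity score by an exponential decay on document age. The decay form, the 30-day half-life, and the field names are assumptions chosen for illustration, not something prescribed by the notes above.

```python
from datetime import date

def recency_weight(published, today, half_life_days=30.0):
    """Weight that halves every half_life_days; 1.0 for a document published today."""
    age_days = max((today - published).days, 0)
    return 0.5 ** (age_days / half_life_days)

def rescore(hits, today, half_life_days=30.0):
    """Combine similarity with recency and sort best-first."""
    scored = [
        {**h, "score": h["similarity"] * recency_weight(h["published"], today, half_life_days)}
        for h in hits
    ]
    return sorted(scored, key=lambda h: h["score"], reverse=True)

# Hypothetical search hits
hits = [
    {"id": "old", "similarity": 0.90, "published": date(2024, 5, 12)},
    {"id": "new", "similarity": 0.80, "published": date(2024, 8, 10)},
]
ranked = rescore(hits, today=date(2024, 8, 10))
print([h["id"] for h in ranked])
```

A shorter half-life favors fresh documents more aggressively, so it should only be applied to queries that are actually time-sensitive.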

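Reranking considerations:

Cross-encoder models: use a cross-encoder built on a pre-trained language model such as BERT or RoBERTa. The query and each retrieved document are fed in as a pair, which gives a more accurate relevance score.
Context-aware reranking: take the overall context of the query into account. For example, use the user's previous queries or session information to better understand the intent of the current query and adjust the ranking accordingly.
Diversity-aware reranking: consider the diversity of the results, not just relevance. Algorithms such as Maximal Marginal Relevance (MMR) promote diversity among similar documents.
Temporal relevance: take the time sensitivity of the query into account. For queries that need up-to-date information, give recent documents a higher weight.
Boosting: if some fields matter more than others, boost them so matching documents move toward the top.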
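
The MMR idea mentioned above can be sketched in a few lines. Each step picks the document that maximizes lambda * relevance minus (1 - lambda) * similarity to what is already selected. The vectors and the `lambda_` values are toy assumptions; in practice the embeddings would come from the same model used for retrieval.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k, lambda_=0.7):
    """Return the indices of k documents chosen by Maximal Marginal Relevance."""
    remaining = list(range(len(doc_vecs)))
    selected = []
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to any already-selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Documents 0 and 1 are near-duplicates; document 2 covers a different direction
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
query = [1.0, 0.1]
diverse_order = mmr(query, docs, k=2, lambda_=0.5)
relevance_only = mmr(query, docs, k=2, lambda_=1.0)
print(diverse_order, relevance_only)
```

With `lambda_=1.0` the selection reduces to plain relevance ranking and returns both near-duplicates, while `lambda_=0.5` swaps the second near-duplicate for the less similar document.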

Prompt compression

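Prompt Compression definition:

A technique that compresses the retrieved context so that it makes the best use of the language model's input token limit.

Considerations when applying Prompt Compression:

Extractive Summarization:
- Use algorithms such as TextRank or LexRank to extract the most important sentences from each document.
- Score sentence importance by relevance to the query and select the top N sentences.
Abstractive Summarization:
- Use a pre-trained summarization model such as T5 or BART to rewrite the key content of each document as new sentences.
- The query can be used as a condition to produce a query-focused summary.
Sentence compression:
- Use syntactic parsing to remove secondary information from each sentence and keep only the core.
- For example, less important adverbial or adjectival phrases can be removed.
Information density optimization:
- Use TF-IDF or BM25 scores to compute the importance of each word or phrase.
- Prioritize the parts with high information density when building the compressed prompt.
Query-based filtering:
- Select only the sentences or paragraphs most relevant to the query.
- Relevance can be evaluated with embedding similarity or keyword matching.
Duplicate removal:
- Identify and remove information repeated across multiple documents.
- Redundancy can be evaluated with cosine similarity or Jaccard similarity.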
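
Two of the items above, query-based filtering and duplicate removal, are easy to sketch without any model. The example below uses word-set Jaccard similarity for both steps, as a simple stand-in for embedding similarity. The function names, the 0.6 duplicate threshold, and the sample sentences are assumptions made for illustration.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def compress(query, sentences, max_sentences=2, dup_threshold=0.6):
    q = tokens(query)
    # Query-based filtering: rank sentences by word overlap with the query
    ranked = sorted(sentences, key=lambda s: jaccard(q, tokens(s)), reverse=True)
    kept = []
    for s in ranked:
        if len(kept) == max_sentences:
            break
        # Duplicate removal: skip sentences that nearly repeat one already kept
        if any(jaccard(tokens(s), tokens(k)) >= dup_threshold for k in kept):
            continue
        kept.append(s)
    # Restore the original order so the compressed text still reads naturally
    return [s for s in sentences if s in kept]

sentences = [
    "Data quality is a major challenge in AI development.",
    "The office cafeteria opens at nine.",
    "A major challenge in AI development is data quality.",
    "Interpretability is another challenge in AI development.",
]
compressed = compress("What are the main challenges in AI development?", sentences)
print(compressed)
```

Word overlap is crude (it treats "challenge" and "challenges" as different words), so this only shows the shape of the pipeline; swapping `jaccard` for an embedding-based similarity keeps the same structure.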

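Prompt compression example code:

Uses the microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank model:
- It belongs to the BERT family, which is a bidirectional encoder (encoder-only, with no decoder), so each token is scored using context from both sides. That should help it judge which tokens can be dropped.
Uses rank_method="longllmlingua" as the ranking method:
- This ranking method scores the importance of each part of the context and helps select the most relevant information first.
Uses dynamic_context_compression_ratio=0.4 for dynamic context compression:
- Rather than compressing every context equally, the compression ratio is adjusted per context according to its rank, and 0.4 controls how strong that adjustment is.
Context budget and reordering:
- context_budget="+100" allows 100 tokens on top of the target token count when filtering contexts.
- reorder_context="sort" reorders the contexts by their ranking score.

Note that rank_method, context_budget, dynamic_context_compression_ratio, and reorder_context are LongLLMLingua-style settings. With use_llmlingua2=True the library may not use all of them, so it is worth checking the behavior against the installed version.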

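import json
import pprint

import pandas as pd
from llmlingua import PromptCompressor

# custom_utils (vector_search_with_filter) and SearchResultItem (presumably a pydantic model)
# are assumed to be defined elsewhere in the project; they are not shown in this post.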

llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    model_config={"revision": "main"},
    use_llmlingua2=True,
    device_map="cpu",
)

# Function definition
def compress_query_prompt(query):

    demonstration_str = query['demonstration_str']
    instruction = query['instruction']
    question = query['question']

    # 6x Compression
    compressed_prompt = llm_lingua.compress_prompt(
        demonstration_str.split("\n"), 
        instruction=instruction,
        question=question,
        target_token=500,
        rank_method="longllmlingua", 
        context_budget="+100",
        dynamic_context_compression_ratio=0.4,
        reorder_context="sort",
    )

    return json.dumps(compressed_prompt, indent=4)

def handle_user_query_with_compression(query, db, collection, stages=None, vector_index="vector_index_text"):
    # Avoid a shared mutable default argument
    stages = stages if stages is not None else []

    # Perform vector search to get knowledge from the database
    get_knowledge = custom_utils.vector_search_with_filter(query, db, collection, stages, vector_index)

    # Check if there are any results
    if not get_knowledge:
        return None, "No results found."

    # Convert search results into a list of SearchResultItem models
    search_results_models = [SearchResultItem(**result) for result in get_knowledge]

    # Convert search results into a DataFrame for better rendering
    search_results_df = pd.DataFrame([item.dict() for item in search_results_models])

    # Prepare information for compression
    query_info = {
        'demonstration_str': search_results_df.to_string(),  # Results from information retrieval process
        'instruction': "Write a high-quality answer for the given question using only the provided search results.",
        'question': query  # User query
    }

    # Compress the query prompt using predefined function
    compressed_prompt = compress_query_prompt(query_info)

    # Optional: Print compressed prompts for debugging
    print("Compressed Prompt:\n")
    pprint.pprint(compressed_prompt)
    print("\n" + "=" * 80 + "\n")

    return search_results_df, compressed_prompt

A code example, implemented with the help of an LLM, that extends Prompt Compression with information density, query-based filtering, and diversity of the selected text:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel
from summarizer import Summarizer
import torch

class AdvancedPromptCompressor:
    def __init__(self, model_name="bert-base-uncased", max_length=512):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.summarizer = Summarizer(model=model_name)
        self.max_length = max_length
        self.tfidf = TfidfVectorizer()

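    def get_embeddings(self, texts):
        inputs = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=self.max_length)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Mean-pool over real tokens only, so padding does not dilute the embedding
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        pooled = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        # Return a NumPy array so scikit-learn's cosine_similarity can consume it directly
        return pooled.numpy()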

    def information_density(self, texts):
        tfidf_matrix = self.tfidf.fit_transform(texts)
        return tfidf_matrix.sum(axis=1).A1

    def query_relevance(self, query, texts):
        query_emb = self.get_embeddings([query])
        text_emb = self.get_embeddings(texts)
        return cosine_similarity(query_emb, text_emb)[0]

    def diversity_score(self, selected, candidate, diversity_weight=0.5):
        if not selected:
            return 1
        selected_emb = self.get_embeddings(selected)
        candidate_emb = self.get_embeddings([candidate])
        similarities = cosine_similarity(candidate_emb, selected_emb)[0]
        return 1 - (np.max(similarities) * diversity_weight)

    def compress_prompt(self, query, context, target_length):
        sentences = [sent.strip() for sent in context.split('.') if sent.strip()]

        # Calculate scores
        density_scores = self.information_density(sentences)
        relevance_scores = self.query_relevance(query, sentences)

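        # Normalize scores to [0, 1]; guard against a zero range (e.g. a single sentence)
        density_range = np.max(density_scores) - np.min(density_scores)
        relevance_range = np.max(relevance_scores) - np.min(relevance_scores)
        density_scores = (density_scores - np.min(density_scores)) / density_range if density_range > 0 else np.ones_like(density_scores)
        relevance_scores = (relevance_scores - np.min(relevance_scores)) / relevance_range if relevance_range > 0 else np.ones_like(relevance_scores)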

        # Combine scores
        combined_scores = (density_scores + relevance_scores) / 2

        selected_sentences = []
        current_length = 0

        while current_length < target_length and sentences:
            candidate_scores = []
            for i, sentence in enumerate(sentences):
                diversity_score = self.diversity_score(selected_sentences, sentence)
                score = combined_scores[i] * diversity_score
                candidate_scores.append((i, score, sentence))

            best_candidate = max(candidate_scores, key=lambda x: x[1])
            selected_sentences.append(best_candidate[2])
            current_length += len(self.tokenizer.encode(best_candidate[2]))
            sentences.pop(best_candidate[0])
            combined_scores = np.delete(combined_scores, best_candidate[0])

        compressed_text = '. '.join(selected_sentences)

        # Apply extractive summarization if still too long
        if current_length > target_length:
            # Keep at least one sentence; 20 tokens per sentence is only a rough estimate
            compressed_text = self.summarizer(compressed_text, num_sentences=max(1, int(target_length / 20)))

        return compressed_text

    def dynamic_token_allocation(self, contexts, relevance_scores, total_tokens):
        normalized_scores = relevance_scores / np.sum(relevance_scores)
        token_allocations = (normalized_scores * total_tokens).astype(int)
        return token_allocations

    def compress_multiple_contexts(self, query, contexts, total_tokens):
        relevance_scores = self.query_relevance(query, contexts)
        token_allocations = self.dynamic_token_allocation(contexts, relevance_scores, total_tokens)

        compressed_contexts = []
        for context, allocation in zip(contexts, token_allocations):
            compressed = self.compress_prompt(query, context, allocation)
            compressed_contexts.append(compressed)

        return ' '.join(compressed_contexts)

# Usage example
compressor = AdvancedPromptCompressor()
query = "What are the main challenges in AI development?"
contexts = [
    "AI development faces numerous challenges. One major issue is data quality and quantity. AI models require large amounts of high-quality data for training, which can be difficult and expensive to obtain. Another challenge is the interpretability of AI systems, especially deep learning models that operate as 'black boxes'. This lack of transparency can be problematic in critical applications like healthcare or finance.",
    "Ethical considerations pose significant challenges in AI development. Issues such as bias in AI systems, privacy concerns with data collection and use, and the potential for AI to displace human workers are all major ethical hurdles. Additionally, there are concerns about AI safety and the potential for advanced AI systems to act in ways that are harmful to humans, either intentionally or unintentionally.",
    "Technical challenges in AI development include the need for more advanced algorithms and computational resources. As AI systems become more complex, they require increasingly powerful hardware and more sophisticated software architectures. Scalability is also a major issue, as many AI solutions that work well in controlled environments struggle when deployed in real-world, large-scale applications."
]

compressed_prompt = compressor.compress_multiple_contexts(query, contexts, 200)
print(compressed_prompt)
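
One detail in dynamic_token_allocation is that astype(int) truncates each share, so the allocations can add up to less than total_tokens. A possible alternative, sketched here in plain Python, is largest-remainder rounding. The function name and the fallback to an even split are my own assumptions, not part of the code above.

```python
def allocate_tokens(scores, total_tokens):
    """Split total_tokens proportionally to scores without losing tokens to rounding."""
    # Negative similarities get no budget
    weights = [max(s, 0.0) for s in scores]
    total_weight = sum(weights)
    if total_weight == 0:
        # No usable signal: fall back to an even split
        weights = [1.0] * len(scores)
        total_weight = float(len(scores))
    exact = [w / total_weight * total_tokens for w in weights]
    allocation = [int(x) for x in exact]
    leftover = total_tokens - sum(allocation)
    # Hand the leftover tokens to the contexts with the largest fractional parts
    by_fraction = sorted(
        range(len(exact)), key=lambda i: exact[i] - allocation[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        allocation[i] += 1
    return allocation

allocation = allocate_tokens([5, 3, 2], 201)
print(allocation, sum(allocation))
```

Plain truncation would give 100, 60, and 40 tokens here, leaving one token of the budget unused. The difference is small per call but the total then matches the requested budget exactly.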

Conclusion

It's worth remembering the considerations for Pre-filtering, Post-filtering, Reranking, and Prompt Compression that can be applied during retrieval.

License: Attribution, NonCommercial
