ELSER (Elastic Learned Sparse EncodeR) Definition:

  • It is an NLP (Natural Language Processing) model trained by Elastic
  • It performs semantic search using sparse vector representations.

 

ELSER usage limitations:

  • Only the first 512 extracted tokens per field are considered for semantic search. Therefore, for long documents, the entire text may not be searchable.
  • For this reason, it is recommended to split long documents into smaller segments before indexing.
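The splitting step above can be sketched in a few lines. This is a minimal sketch, not part of Elasticsearch or ELSER itself: the word-based splitting, the 300-word limit (kept under the 512-token cap because one word often maps to more than one model token), and the overlap size are all illustrative assumptions.

```python
def chunk_text(text: str, max_words: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks.

    max_words is kept well below ELSER's 512-token limit because one
    word can expand to more than one model token (assumption).
    """
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# Each chunk would then be indexed as its own document through the
# inference pipeline described later in this article.
```

The overlap keeps a sentence that straddles a chunk boundary searchable from both sides.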

 

The importance of ELSER:

  • It can overcome the limitations of traditional keyword-based search by extracting important words based on context rather than simply matching word frequencies
  • It can better understand the user's search intent and provide more relevant results

 

Lack of auto-scaling capability in ELSER deployment:

  • To use ELSER, you need to deploy this model on an Elasticsearch cluster. (This will be introduced later)
  • This ELSER deployment does not provide auto-scaling according to resource requirements, so it does not automatically expand or shrink
  • You need to configure it manually using the Trained Models UI in Kibana or the Update Trained Model Deployment API
  • Configurable items:
    • Number of model allocations per node. Increasing this improves throughput.
    • Number of threads per allocation
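For example, the allocation count of a running deployment can be changed with the Update Trained Model Deployment API (the value 2 below is illustrative). Note that, at the time of writing, threads per allocation are set when the deployment is started rather than through this update call.

```
POST _ml/trained_models/.elser_model_2/deployment/_update
{
  "number_of_allocations": 2
}
```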

 

 

1. Requirements

This section covers the requirements for using ELSER.

 

The necessity of deploying the ELSER model:

  • To use ELSER, the model must first be deployed on a machine learning node in your Elasticsearch cluster.

 

Minimum hardware requirements:

  • The minimum dedicated ML node size for deploying and using the ELSER model in Elasticsearch Service is 4GB
  • This is the requirement when auto-scaling is turned off. (Currently, it seems that ELSER does not support auto-scaling)

 

 

2. Create the index mapping

Here, we will cover creating an Elasticsearch index mapping for using ELSER.

 

Purpose of index mapping:

  • Create the mapping for the target index to store the tokens generated by the ELSER model

 

Example of index mapping:

  • Description of key fields:
    • content_embedding: A sparse_vector type field to store the tokens generated by ELSER
    • content: A text type field to store the original text
PUT my-index
{
  "mappings": {
    "properties": {
      "content_embedding": { 
        "type": "sparse_vector" 
      },
      "content": { 
        "type": "text" 
      }
    }
  }
}

 

 

Field type requirements:

  • Fields for storing ELSER output must be of type sparse_vector or rank_features
  • Failure to adhere to this may result in errors like 'Limit of total fields [1000] has been exceeded'

 

Space optimization:

  • There is a method to save disk space by excluding ELSER tokens from the document source. This will be introduced later.

 

 

3. Create an ingest pipeline with an inference processor

Here, we will cover an example of creating an ingest pipeline for using the ELSER (Elastic Learned Sparse EncodeR) model.

 

Purpose of the ingest pipeline:

  • Ingest pipelines let you perform common transformations on your data before indexing. For example, you can use pipelines to remove fields, extract values from text, and enrich your data.
  • To process text with the ELSER model and automatically generate semantic embeddings as data is ingested into the index

 

Example of pipeline creation:

  • Create a new pipeline named 'elser-v2-test'
  • Define an inference processor within the processors array
  • model_id:
    • ".elser_model_2"
    • This involves specifying the ID of the ELSER model to be used.
  • input_output:
    • Define the input and output fields.
    • input_field: "content"
      • The name of the field to be used as input for the model
    • output_field: "content_embedding"
      • The name of the field where the model's output (embedding) will be stored.
  • This pipeline provides the text from the 'content' field as input to the ELSER model
  • The model's output (semantic embeddings) will be stored in the 'content_embedding' field
  • The names of these fields must match those defined in the previously created index mapping
PUT _ingest/pipeline/elser-v2-test
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_2",
        "input_output": [ 
          {
            "input_field": "content",
            "output_field": "content_embedding"
          }
        ]
      }
    }
  ]
}

 

 

 

4. Load Data

This section covers how to load data for use with the Elasticsearch ELSER (Elastic Learned Sparse EncodeR) model.

 

Data Set:

  • The tutorial uses the "msmarco-passagetest2019-top1000" data set.
  • This is a subset of the larger MS MARCO Passage Ranking data set.
  • It contains 200 queries, each with relevant text passages.
  • The data has been processed into a TSV (Tab-Separated Values) file containing unique passages and their IDs.

Data Loading Process:

  • Download the TSV file.
  • Upload it to your Elasticsearch cluster using the Data Visualizer in the Machine Learning UI.
  • When uploading, assign column names:
    • First column: "id"
    • Second column: "content"
  • Set the index name to "test-data".
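As an alternative to the Data Visualizer UI, the same TSV file could be loaded with a short script. This is a sketch under stated assumptions: the local file path and the use of the official elasticsearch Python client are assumptions, not part of the tutorial; the column names match the layout described above.

```python
import csv

def read_passages(path: str):
    """Yield {"id": ..., "content": ...} dicts from a two-column TSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                yield {"id": row[0], "content": row[1]}

# Each dict could then be sent to the "test-data" index, e.g. with the
# elasticsearch client's bulk helper (connection details omitted, assumption):
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch("http://localhost:9200")  # assumed local cluster
#   helpers.bulk(es, ({"_index": "test-data", "_source": d}
#                     for d in read_passages("msmarco-passagetest2019-top1000.tsv")))
```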

 

 

4.1 Ingest the data through the inference ingest pipeline

Create the tokens from the text by reindexing the data through the inference pipeline that uses ELSER as the inference model.

 

This operation reindexes data from a source index ("test-data") to a destination index ("my-index"), applying the ELSER model during the process.

  • The wait_for_completion=false parameter makes it an asynchronous operation.
  • source: Specifies the source index ("test-data") and a batch size of 50 documents, smaller than the default of 1000.
  • dest: Specifies the destination index ("my-index") and the inference pipeline ("elser-v2-test").
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "test-data",
    "size": 50 
  },
  "dest": {
    "index": "my-index",
    "pipeline": "elser-v2-test"
  }
}

 

 

After indexing text data in Elasticsearch, can the ELSER model automatically generate semantic embeddings using an ingest pipeline?

  • Correct.
  • Here's how it works:
    • Ingest Pipeline Setup:
      • You create an ingest pipeline that includes an inference processor.
      • This processor is configured to use the ELSER model.
    • Automatic Processing:
      • When documents are indexed, they pass through this ingest pipeline.
      • The inference processor applies the ELSER model to the specified text field(s).
    • Embedding Generation:
      • ELSER processes the text and generates sparse vector representations (embeddings).
      • These embeddings capture the semantic meaning of the text.
    • Storage:
      • The generated embeddings are stored in a designated field (usually of type sparse_vector).

 

 

4.2 Semantic search by using the text_expansion query

This section covers how to perform semantic search in Elasticsearch using ELSER.

 

Search Query:

  • This query searches the "my-index" index.
  • It uses the text_expansion query type, which is designed for semantic search.
  • The content_embedding field is where the ELSER embeddings are stored.
  • model_id specifies the ELSER model to use (".elser_model_2").
  • model_text is the actual search query in natural language.
GET my-index/_search
{
   "query": {
      "text_expansion": {
         "content_embedding": {
            "model_id": ".elser_model_2",
            "model_text": "How to avoid muscle soreness after running?"
         }
      }
   }
}

 

 

Search Process:

    1. Elasticsearch uses the ELSER model to convert the query text into a semantic representation.
    2. This representation is then compared with the pre-computed embeddings in the content_embedding field.
    3. Documents are ranked based on their semantic similarity to the query.
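Conceptually, the comparison step can be pictured as a weighted overlap between the query's token-weight pairs and each document's stored pairs. The toy sketch below illustrates this idea only; the tokens and weights are made up, and the actual Lucene scoring is more involved.

```python
def sparse_score(query_tokens: dict[str, float], doc_tokens: dict[str, float]) -> float:
    """Dot product over the tokens shared by the query and the document."""
    return sum(w * doc_tokens[t] for t, w in query_tokens.items() if t in doc_tokens)

# Toy token-weight pairs (illustrative, not real ELSER output):
query = {"muscle": 1.2, "soreness": 1.5, "running": 0.9}
doc_a = {"muscle": 0.8, "soreness": 1.1, "rest": 0.5}
doc_b = {"toxins": 0.7, "diet": 0.4}

# doc_a shares weighted tokens with the query while doc_b shares none,
# so doc_a receives the higher score and ranks first.
```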

 

 

Search Results:

"hits": {
  "total": {
    "value": 10000,
    "relation": "gte"
  },
  "max_score": 26.199875,
  "hits": [
    {
      "_index": "my-index",
      "_id": "FPr9HYsBag9jXmT8lEpI",
      "_score": 26.199875,
      "_source": {
        "content_embedding": {
          "muscular": 0.2821541,
          "bleeding": 0.37929374,
          "foods": 1.1718726,
          "delayed": 1.2112266,
          "cure": 0.6848574,
          "during": 0.5886185,
          "fighting": 0.35022718,
          "rid": 0.2752442,
          "soon": 0.2967024,
          "leg": 0.37649947,
          "preparation": 0.32974035,
          "advance": 0.09652356,
          (...)
        },
        "id": 1713868,
        "model_id": ".elser_model_2",
        "content": "For example, if you go for a run, you will mostly use the muscles in your lower body. Give yourself 2 days to rest those muscles so they have a chance to heal before you exercise them again. Not giving your muscles enough time to rest can cause muscle damage, rather than muscle development."
      }
    },
    (...)
  ]
}

 

 

Text Expansion Query:

  • The text expansion query uses a natural language processing model to convert the query text into a list of token-weight pairs which are then used in a query against a sparse vector or rank features field.

 

 

Combining semantic search with other queries

This section covers how to combine semantic search using ELSER with other types of queries in Elasticsearch.

 

Components:

  • a. Text Expansion Query (Semantic Search):
    • This uses ELSER for semantic search.
    • The boost is set to 1 (default), meaning no additional boosting.
  • b. Traditional Search:
    • This is a traditional full-text search for the term "toxins".
    • The boost is set to 4, increasing the relevance score of matches.
  • Minimum Score:
    • "min_score": 10 at the end of the query.
    • This filters out results with a score less than 10, pruning less relevant matches.
  • Combination Logic:
    • The should clause means documents can match either or both queries.
    • Documents matching both queries will likely score higher.
GET my-index/_search
{
  "query": {
    "bool": { 
      "should": [
        {
          "text_expansion": {
            "content_embedding": {
              "model_text": "How to avoid muscle soreness after running?",
              "model_id": ".elser_model_2",
              "boost": 1 
            }
          }
        },
        {
          "query_string": {
            "query": "toxins",
            "boost": 4 
          }
        }
      ]
    }
  },
  "min_score": 10 
}

 

 

Saving disk space by excluding the ELSER tokens from document source

This section covers how to optimize disk space usage when working with ELSER (Elastic Learned Sparse EncodeR) in Elasticsearch.

 

The goal is to save disk space by excluding ELSER-generated tokens from the document source while still keeping them indexed for search.

 

Key Concept:

  • ELSER tokens need to be indexed for the text_expansion query to work.
  • However, these tokens don't need to be stored in the document source (_source field).

 

 

It's not necessary to retain the embedding values in the _source field, because they are not meaningful when the document is returned, correct?

  • Exactly right.
  • The ELSER-generated embeddings (stored in the content_embedding field) are used for semantic search and matching. They represent the semantic meaning of the text in a format that's optimized for machine processing.
  • The embeddings are crucial for the search process (finding relevant documents).
  • However, when retrieving and displaying a document to a user, the original text (stored in the content field) is what's meaningful and useful.
  • By doing this, not only is disk space reduced, but response time during document retrieval also decreases.

 

 

The mapping that excludes content_embedding from the _source field can be created by the following API call:

PUT my-index
{
  "mappings": {
    "_source": {
      "excludes": [
        "content_embedding"
      ]
    },
    "properties": {
      "content_embedding": {
        "type": "sparse_vector"
      },
      "content": {
        "type": "text"
      }
    }
  }
}
