High-Dimensional Vector Search: A Practical Exploration of dense_vector in Elasticsearch 8.X

In recent years, with the rise of deep learning, vector search has attracted widespread attention. Elasticsearch introduced the dense_vector field type as early as version 7.2.0. It stores high-dimensional vectors, such as word embeddings or document embeddings, for operations such as similarity search. In this article, I will show how to use dense_vector for vector search in Elasticsearch 8.X.

1. Background introduction

First, we need to understand dense_vector. It is the field type Elasticsearch uses to store high-dimensional vectors, and it is commonly used in neural search, where embeddings generated by NLP and deep learning models are used to find similar texts. More information about dense_vector is available in the official Elasticsearch documentation.

In the following sections, I’ll show how to generate text embeddings and how to create a simple Elasticsearch index that supports vector search over them.

2. Generating vectors: processing with Python

First, we need to generate text embeddings using Python and the BERT model. Here’s an example of how we do this:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")


def get_bert_embedding(text):
    # Tokenize the text and pad or truncate it to a fixed length of 128 tokens.
    inputs = tokenizer(text, return_tensors="pt", max_length=128, truncation=True, padding="max_length")
    # Inference only, so gradient tracking is disabled.
    with torch.no_grad():
        outputs = model(**inputs)
    # Keep the hidden states of the first three tokens: shape (1, 3, 768).
    return outputs.last_hidden_state[:, :3, :].numpy()


def print_infos():
    # Inner double quotes are escaped so that each string literal is valid Python.
    docs = ["The barbecue city covering an area of 100 mu was successfully built in Zibo in just 20 days, and now it has become a popular place for thousands of people to compete for \"roasting seats\". ",
            "A newly built barbecue city covering an area of 100 mu in Zibo was built in just 20 days, attracting many barbecue lovers, and now \"roasting seats\" are hard to find.",
            "In Zibo, a 100-acre barbecue city that took 20 days to build has become the focus of everyone's attention. All kinds of delicious barbecues have attracted thousands of people to compete for \"roasting seats\". It can be said that it is hard to find a place.",
            "Zibo generally refers to Zibo City. Zibo City, referred to as \"Zi\", the former capital of Qi State, a prefecture-level city under the jurisdiction of Shandong Province, and a type II large city"]
    for doc in docs:
        print(f"Vector for '{doc}':", get_bert_embedding(doc))
    
if __name__ == '__main__':
    print_infos()

In the above script, we define a function get_bert_embedding that returns, for each document, the hidden states of its first three tokens as an array of shape (1, 3, 768). We then compute this array for four documents and print the results to the console.

Sample output (excerpt):

Vector for 'The barbecue city covering an area of 100 mu was successfully built in Zibo in just 20 days, and now it has become a popular place for thousands of people to compete for "roasting seats". ': [[[-0.2703271 0.38279012 -0.29274252 ... -0.24937081 0.7212287
    0.0751707]
  [ 0.01726123 0.1450473 0.16286954 ... -0.20245396 1.1556625
   -0.112049]
  [ 0.51697373 -0.01454506 0.1063835 ... -0.2986216 0.69151103
    0.13124703]]]
Vector for 'A newly built barbecue city covering an area of 100 mu in Zibo was built in just 20 days, attracting many barbecue lovers, and now "roasting seats" are hard to find.': [[[-0.22879271 0.43286988 -0.21742335 ... -0.22972387 0.75263715
    0.03716223]
  [ 0.1252176 -0.02892866 0.17054333 ... -0.30524847 0.94903445
   -0.46865308]
  [ 0.42650488 0.34019586 -0.01442122 ... -0.17345914 0.6688627
   -0.75012964]]]
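
Each printed array holds one row per token for the first three tokens, with 768 values per row. The three-number vectors indexed later in this article match the first three components of the first row. The reduction can be sketched as follows, assuming the array has been converted to nested lists with .tolist(). The helper name first_token_prefix is hypothetical and not part of the transformers library:

```python
def first_token_prefix(embedding, dims=3):
    """Return the first `dims` components of the first token's vector.

    `embedding` is a nested list shaped (batch, tokens, hidden), for example
    get_bert_embedding(text).tolist(). Hypothetical helper for this demo only.
    """
    first_token = embedding[0][0]  # batch item 0, token 0
    if len(first_token) < dims:
        raise ValueError("embedding has fewer components than requested")
    return [float(x) for x in first_token[:dims]]


# Tiny stand-in for a (1, 3, 768) array, shortened to 4 components per token.
sample = [[[-0.27, 0.38, -0.29, 0.07],
           [0.01, 0.14, 0.16, -0.11],
           [0.51, -0.01, 0.10, 0.13]]]
print(first_token_prefix(sample))
```

Truncating to three components keeps the requests readable but discards most of the information in the embedding. For real similarity search, the full vector (or a pooled one) would normally be indexed instead.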

3. Practical exploration: importing and searching vectors in Elasticsearch

3.1 Create index

We first need to create a new index in Elasticsearch to store our documents and their vector representations. Here is the API call to create the index:

PUT /my_vector_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "content_vector": {
        "type": "dense_vector",
        "dims": 3
      }
    }
  }
}

In the above code, we created an index named my_vector_index and defined two fields: title and content_vector. The content_vector field has the type dense_vector, and its dims value is 3. That value is 3 only because this demo stores the first three components of each document’s first-token vector. A full bert-base-uncased vector has 768 components and would require dims to be set to 768.
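
Because dims must equal the length of every vector stored in the field, it can help to derive it from a sample vector rather than typing it by hand. A minimal sketch, assuming the same field names as above. The function build_vector_mapping is a hypothetical helper that only builds the request body, and sending it to Elasticsearch is left out:

```python
import json


def build_vector_mapping(sample_vector, vector_field="content_vector"):
    """Build a create-index body whose dims equals the sample vector's length."""
    return {
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                vector_field: {
                    "type": "dense_vector",
                    "dims": len(sample_vector),
                },
            }
        }
    }


# The three-component demo vector of document 1.
body = build_vector_mapping([-0.2703271, 0.38279012, -0.29274252])
print(json.dumps(body, indent=2))
```

Passing a full 768-component vector to the same function would produce a mapping with dims set to 768.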

3.2 Import data

Next, we can import our documents and their corresponding vectors into the index. The following is an example bulk import API call:

POST my_vector_index/_bulk
{"index":{"_id":1}}
{"title":"The barbecue city covering an area of 100 mu was successfully built in Zibo in just 20 days, and now it has become a popular place for thousands of people to compete for \"roasting seats\". ","content_vector":[-0.2703271, 0.38279012, -0.29274252]}
{"index":{"_id":2}}
{"title":"A newly built barbecue city covering an area of 100 mu in Zibo was built in just 20 days, attracting many barbecue lovers, and now \"roasting seats\" are hard to find.","content_vector":[-0.22879271, 0.43286988, -0.21742335]}
{"index":{"_id":3}}
{"title":"In Zibo, a 100-acre barbecue city that took 20 days to build has become the focus of everyone's attention. All kinds of delicious barbecues have attracted thousands of people to compete for \"roasting seats\". It can be said that it is hard to find a place.","content_vector":[-0.24912262, 0.40769795, -0.26663426]}
{"index":{"_id":4}}
{"title":"Zibo generally refers to Zibo City. Zibo City, referred to as \"Zi\", the former capital of Qi State, a prefecture-level city under the jurisdiction of Shandong Province, and a type II large city","content_vector":[0.32247472, 0.19048998, -0.36749798]}

In this example, we use the Elasticsearch _bulk API to import data in batches. Each document takes two lines: an action line that carries the document’s ID, followed by a source line that carries its title and content vector. Double quotes inside a title must be escaped as \" for the line to be valid JSON. The vector values come from the output of the Python script: they are the first three components of each document’s first-token vector.
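
Hand-written bulk lines make it easy to leave a quote unescaped. As a minimal sketch, the body can instead be generated with Python's json module, which handles the escaping. Here build_bulk_body is a hypothetical helper, the titles are shortened stand-ins for the real ones, and sending the body to Elasticsearch is left out:

```python
import json


def build_bulk_body(index_docs):
    """Serialize (doc_id, title, vector) tuples into an NDJSON _bulk body."""
    lines = []
    for doc_id, title, vector in index_docs:
        # Action line: tells Elasticsearch which document ID to index.
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(
            {"title": title, "content_vector": list(vector)},
            ensure_ascii=False,
        ))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    (1, 'Thousands compete for "roasting seats" in Zibo.',
     [-0.2703271, 0.38279012, -0.29274252]),
    (4, "Zibo generally refers to Zibo City.",
     [0.32247472, 0.19048998, -0.36749798]),
]
bulk_body = build_bulk_body(docs)
print(bulk_body)
```

Each tuple becomes exactly two lines, and any double quotes in a title are written as \" automatically.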

3.3 Perform a search

After creating the index and importing the data, we can perform a similarity search. We will score documents with a script_score query, whose script calculates the cosine similarity between the query vector and each document’s content vector.

The following is an example of an API call:

GET my_vector_index/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'content_vector') + 1.0",
        "params": {
          "query_vector": [-0.2703271, 0.38279012, -0.29274252]
        }
      }
    }
  }
}

In the above query, we defined a script score query script_score. This query first executes a query (match_all) that matches all documents, and then scores each document according to our script.

The scoring script cosineSimilarity(params.query_vector, 'content_vector') + 1.0 calculates the cosine similarity between the query vector and the content_vector field of each document, and adds 1 to the result, because cosine similarity ranges from -1 to 1 while Elasticsearch scores must be non-negative.
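
To make the scoring concrete, the same formula can be reproduced outside Elasticsearch. This is a plain-Python sketch of cosine similarity plus 1.0 applied to the four demo vectors, with document 1's vector as the query as in the request above. It is not how Elasticsearch implements the function internally, and the names cosine_similarity and script_score are local to this example:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def script_score(query_vector, doc_vector):
    """Mirror the article's script: cosine similarity shifted into [0, 2]."""
    return cosine_similarity(query_vector, doc_vector) + 1.0


doc_vectors = {
    1: [-0.2703271, 0.38279012, -0.29274252],
    2: [-0.22879271, 0.43286988, -0.21742335],
    3: [-0.24912262, 0.40769795, -0.26663426],
    4: [0.32247472, 0.19048998, -0.36749798],
}
query_vector = doc_vectors[1]

# Sort document IDs by descending score, as a search response would.
ranking = sorted(
    doc_vectors,
    key=lambda doc_id: script_score(query_vector, doc_vectors[doc_id]),
    reverse=True,
)
for doc_id in ranking:
    print(doc_id, round(script_score(query_vector, doc_vectors[doc_id]), 4))
```

Elasticsearch stores vector components as 32-bit floats by default, so the scores it returns may differ from these in the last digits.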

We take the vector of document 1 as the query vector, so document 1 is expected to come back first, with a score close to 2.0.

4. Conclusion

Vector-based search methods continue to evolve, and Elasticsearch keeps improving and expanding its capabilities to keep pace.

To get the most out of Elasticsearch, follow its official documentation and release updates so that you know about the latest features and best practices. With the dense_vector field and the related search methods, we can implement vector search in Elasticsearch and give users a more precise and personalized search experience.