ES hybrid retrieval and LangChain retrieval enhancement

LangChain Retrievers

  • MultiQueryRetriever uses an LLM to generate several (three by default) reformulations of the question, runs retrieval with each of them, and returns the deduplicated union of the results.
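
    A minimal usage sketch, assuming a populated vector_store and an OpenAI chat model; imports may differ across langchain versions.

    from langchain.chat_models import ChatOpenAI
    from langchain.retrievers.multi_query import MultiQueryRetriever

    # vector_store is a hypothetical, already populated VectorStore
    llm = ChatOpenAI(temperature=0)
    multi_query_retriever = MultiQueryRetriever.from_llm(
        retriever=vector_store.as_retriever(),  # base retriever that runs each generated query
        llm=llm,                                # llm that rewrites the original question
    )
    docs = multi_query_retriever.get_relevant_documents("How does es hybrid retrieval work?")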

  • MultiVectorRetriever handles the case where the same document has several records in the vector store because several different embeddings were stored for it; after the similarity search it deduplicates the hits by id and returns the parent documents from a separate docstore. The code implementation is very simple. It is not obvious why this is stored as multiple documents instead of multiple vector fields on one document; presumably because langchain's vectorstore abstraction only supports retrieving against a single vector field.

    class MultiVectorRetriever(BaseRetriever):
        """Retrieve from a set of multiple embeddings for the same document."""
    
        vectorstore: VectorStore
        """The underlying vectorstore to use to store small chunks
        and their embedding vectors"""
        docstore: BaseStore[str, Document]
        """The storage layer for the parent documents"""
        id_key: str = "doc_id"
        search_kwargs: dict = Field(default_factory=dict)
        """Keyword arguments to pass to the search function."""
    
        def _get_relevant_documents(
            self, query: str, *, run_manager: CallbackManagerForRetrieverRun
        ) -> List[Document]:
            """Get documents relevant to a query.
            Args:
                query: String to find relevant documents for
                run_manager: The callbacks handler to use
            Returns:
                List of relevant documents
            """
            sub_docs = self.vectorstore.similarity_search(query, **self.search_kwargs)
            # We do this to maintain the order of the ids that are returned
            ids = []
            for d in sub_docs:
                if d.metadata[self.id_key] not in ids:
                    ids.append(d.metadata[self.id_key])
            docs = self.docstore.mget(ids)
            return [d for d in docs if d is not None]
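
    One possible way to use it, assuming each parent document also has an LLM-generated summary; vector_store, parent_docs and summaries are illustrative names.

    import uuid
    from langchain.retrievers.multi_vector import MultiVectorRetriever
    from langchain.schema import Document
    from langchain.storage import InMemoryStore

    # parent_docs, summaries and vector_store are hypothetical
    doc_ids = [str(uuid.uuid4()) for _ in parent_docs]
    docstore = InMemoryStore()
    docstore.mset(list(zip(doc_ids, parent_docs)))   # parent documents, keyed by doc_id

    # several small documents per parent, all tagged with the parent's doc_id
    child_docs = []
    for doc_id, parent, summary in zip(doc_ids, parent_docs, summaries):
        child_docs.append(Document(page_content=parent.page_content, metadata={"doc_id": doc_id}))
        child_docs.append(Document(page_content=summary, metadata={"doc_id": doc_id}))
    vector_store.add_documents(child_docs)

    retriever = MultiVectorRetriever(vectorstore=vector_store, docstore=docstore, id_key="doc_id")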
    
  • Contextual compression. The retrieved documents may contain a lot of irrelevant context; sending them to the LLM as-is adds noise and increases response time. Contextual compression is used to improve the relevance between the context and the question. The key question is how to compress the context, and langchain provides several implementations (a usage sketch follows the list below).

    • DocumentCompressorPipeline, a pipeline assembled from a sequence of BaseDocumentTransformer or BaseDocumentCompressor instances.

    • LLMChainExtractor, uses an LLM to extract only the relevant parts of each retrieved document.

    • LLMChainFilter, uses an LLM to drop documents that are irrelevant to the question.

    • CohereRerank, calls the Cohere Rerank API to re-score and reorder the documents.

    • EmbeddingsFilter, filters documents by embedding similarity to the query, essentially another vector similarity threshold.
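
    A minimal sketch with LLMChainExtractor as the compressor; vector_store and the chat model are illustrative, and imports may differ across langchain versions.

    from langchain.chat_models import ChatOpenAI
    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import LLMChainExtractor

    compressor = LLMChainExtractor.from_llm(ChatOpenAI(temperature=0))  # llm keeps only query-relevant passages
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=vector_store.as_retriever(),  # hypothetical base retriever
    )
    docs = compression_retriever.get_relevant_documents("what does rrf do?")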

  • EnsembleRetriever combines the result lists of several retrievers and fuses them with rrf. The most common combination is full-text search + vector search + reciprocal rank fusion. ES hybrid search follows the same process, but its rrf feature requires a paid license.
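
    A sketch of the common BM25 + vector combination; chunks and vector_store are illustrative, and BM25Retriever requires the rank_bm25 package.

    from langchain.retrievers import BM25Retriever, EnsembleRetriever

    bm25_retriever = BM25Retriever.from_documents(chunks)                  # keyword-style ranking
    vector_retriever = vector_store.as_retriever(search_kwargs={"k": 5})   # dense vector ranking
    ensemble_retriever = EnsembleRetriever(
        retrievers=[bm25_retriever, vector_retriever],
        weights=[0.4, 0.6],  # per-retriever weight used in the rrf fusion
    )
    docs = ensemble_retriever.get_relevant_documents("hybrid retrieval")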

  • Parent Document Retriever. When splitting documents we usually want short chunks so that full-text and vector retrieval stay accurate, but a chunk that is too short may carry too little information and cannot provide sufficient context, especially for questions that span several chunks. This retriever splits documents into small chunks, each associated with the id of its parent document: the small chunks are used for accurate retrieval, and the large parent documents are returned as context. Combining this with the contextual compression mentioned above may be a good way to improve retrieval quality.
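
    A sketch with arbitrarily chosen chunk sizes; vector_store and raw_docs are illustrative names.

    from langchain.retrievers import ParentDocumentRetriever
    from langchain.storage import InMemoryStore
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    retriever = ParentDocumentRetriever(
        vectorstore=vector_store,                                        # stores the small child chunks and their vectors
        docstore=InMemoryStore(),                                        # stores the larger parent chunks
        child_splitter=RecursiveCharacterTextSplitter(chunk_size=400),
        parent_splitter=RecursiveCharacterTextSplitter(chunk_size=2000),
    )
    retriever.add_documents(raw_docs)                # splits, indexes children, stores parents
    docs = retriever.get_relevant_documents("rrf")   # matches on small chunks, returns parent chunks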

  • SelfQueryRetriever, an LLM converts the natural-language question into a structured query (metadata filters plus a search string).
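
    A sketch, assuming the underlying vector store has a structured-query translator (e.g. Chroma or Elasticsearch) and that the documents carry year and source metadata; all names are illustrative.

    from langchain.chains.query_constructor.base import AttributeInfo
    from langchain.chat_models import ChatOpenAI
    from langchain.retrievers.self_query.base import SelfQueryRetriever

    metadata_field_info = [
        AttributeInfo(name="year", description="year the note was written", type="integer"),
        AttributeInfo(name="source", description="where the note came from", type="string"),
    ]
    retriever = SelfQueryRetriever.from_llm(
        llm=ChatOpenAI(temperature=0),
        vectorstore=vector_store,                      # hypothetical vector store
        document_contents="technical notes about retrieval",
        metadata_field_info=metadata_field_info,
    )
    # the llm turns this into a metadata filter (year > 2022) plus a semantic search string
    docs = retriever.get_relevant_documents("notes about rrf written after 2022")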

  • TimeWeightedVectorStoreRetriever records the time each document was last accessed; the longer a document goes without being accessed, the lower its score:

    semantic_similarity + (1.0 - decay_rate) ^ hours_passed
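
    A construction sketch; vector_store is illustrative. A decay_rate close to 0 means recency barely matters, while a value close to 1 makes rarely accessed documents fade quickly.

    from langchain.retrievers import TimeWeightedVectorStoreRetriever

    retriever = TimeWeightedVectorStoreRetriever(vectorstore=vector_store, decay_rate=0.01, k=4)
    retriever.add_documents(docs_to_index)   # stores documents and stamps access-time metadata
    results = retriever.get_relevant_documents("hybrid retrieval")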
    
  • WebResearchRetriever, retrieves content from the web to provide context.

    There are many retrieval enhancement classes under the langchain.retrievers package.

Elasticsearch vector retrieval

dense_vector type

The dense_vector type does not support aggregations or sorting, and an indexed dense_vector cannot be placed inside a nested field, otherwise it cannot be indexed.

{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 1023,
        "index": true,
        "similarity": "dot_product"
      }
    }
  }
}

Supported properties

  • element_type

    • float, the default, a 4-byte floating point number.
    • byte, a 1-byte integer, range -128~127.
  • dims, required field, vector dimension, cannot exceed 2048.

  • index, defaults to false; set it to true to enable kNN search.

  • similarity, the similarity measurement algorithm; required when index is true.

    • l2_norm, Euclidean distance
    • dot_product, dot product
    • cosine, cosine similarity

    It is recommended to normalize vectors and choose dot_product method to improve retrieval efficiency.

  • index_options, optional fields

    • type, required, the kNN algorithm; currently only hnsw is supported.
    • m, the number of neighbors each node keeps in the HNSW graph; defaults to 16.
    • ef_construction, the number of candidates tracked while assembling the neighbor list of each new node; defaults to 100. An example mapping that sets these options follows this list.
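
For reference, a mapping sketch that sets these HNSW options explicitly; the field name and dimension are illustrative.

{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "dot_product",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 100
        }
      }
    }
  }
}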

kNN retrieval

Search for the k nearest neighbor vectors according to a similarity measure. Newer versions of es ship with built-in models, so text fields and queries no longer have to be encoded inside the application; Elastic Cloud also supports uploading your own model, although this feature does not appear to be free.

Approximate kNN

Consumes fewer resources and responds faster, at the cost of some accuracy.

Notes
  • dot_product or cosine

    It is recommended to normalize vectors and choose dot_product method to improve retrieval efficiency; cosine does not need to be normalized and can be calculated directly.

  • enough memory

    Elasticsearch uses the HNSW algorithm for approximate kNN search. HNSW is a graph-based algorithm and only works efficiently when the vectors are kept in memory, so the data nodes must have enough memory for both the vector data and the index structures. To check the size of the vector data, es provides an API for analyzing index disk usage. As a rule of thumb (with the default HNSW configuration), float vectors take roughly num_vectors * 4 * (num_dimensions + 12) bytes and byte vectors roughly num_vectors * (num_dimensions + 12) bytes. The memory in question is the file system cache, not the Java heap.
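
    For example, under this rule of thumb one million 768-dimensional float vectors need roughly 1,000,000 * 4 * (768 + 12) ≈ 3.1 GB of file system cache; stored as byte, the same vectors need roughly 1,000,000 * (768 + 12) ≈ 0.78 GB.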

  • Warm up file system cache

    When es starts, the file system cache is empty, so the first searches may be slow. You can preload index data to warm the cache, but loading too much data into the file system cache may also slow retrieval down.

    Data files involved in approximate kNN retrieval, by suffix:

    • vec, vector value
    • vex, HNSW graph
    • vem, metadata
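
    A sketch of preloading these files through the index.store.preload setting (a static index setting; verify the extension list against the documentation for your es version):

    {
      "settings": {
        "index.store.preload": ["vec", "vex", "vem"]
      }
    }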
  • Reduce vector dimensions

    The larger the vector dimension, the more resources the calculation consumes. Some models can choose different encoding dimensions, or use dimensionality reduction methods to reduce dimensions, making a trade-off between accuracy and retrieval speed.

  • Don’t return vector fields

    Loading and returning vector data takes time, so use _source to exclude the vector field from the results. For how to exclude fields and the performance impact, see the comparison of _source, stored_fields, and doc_values in Elasticsearch and the official es documentation.
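
    For example, a search request body can exclude the vector field like this (field name illustrative):

    {
      "_source": {
        "excludes": ["my_vector"]
      }
    }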

  • There are also several points involving the underlying data structure of ES, which require certain tuning capabilities. You can check the official documentation.

kNN options
  • field, required, vector field name

  • filter, optional, a query DSL filter; only documents that satisfy both the vector search and the filter conditions are returned.

  • k, required, the number of nearest neighbors to return; it cannot exceed num_candidates.

  • num_candidates, the number of candidates considered on each shard: es retrieves num_candidates vectors from every shard, merges them by score, and returns the top k overall. Increasing this value improves accuracy at the cost of speed.

  • query_vector, optional, the vector to search for; it must have the same dimension as defined in the mapping.

  • query_vector_builder, optional, specifies model information so that es encodes the query text into a vector itself. Exactly one of query_vector and query_vector_builder must be provided.

  • similarity, optional, float, a threshold for deciding whether a vector counts as a hit. It is expressed in the chosen distance metric, not as the document _score; documents that pass are still scored, and boost is applied if set.

    If the metric is l2_norm, the distance must be less than or equal to similarity.

    If the metric is cosine or dot_product, the similarity must be greater than or equal to the threshold.

  • boost, a coefficient applied when computing the score. knn can be combined with a regular query; the two result sets are merged, each score is multiplied by its boost, and the boosted scores are summed.
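
Putting these options together, a sketch of an approximate kNN request combined with a full-text query; the field names, vector values, and boosts are illustrative.

{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.3, 0.1, 1.2],
    "k": 10,
    "num_candidates": 100,
    "filter": {
      "term": { "category": "blog" }
    },
    "boost": 0.7
  },
  "query": {
    "match": {
      "content": {
        "query": "hybrid retrieval",
        "boost": 0.3
      }
    }
  },
  "size": 10
}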

Accurate kNN

Compute the similarity against every document to guarantee accurate results. You can first narrow the candidates with a query filter and then run exact kNN on the filtered set to improve retrieval speed.

If it is determined that the field does not require approximate kNN, you can set the index attribute of the field to false to improve the indexing speed.

Exact kNN query using script_score

{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "range" : {
              "price" : {
                "gte": 1000
              }
            }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, 'product-vector') + 1.0",
        "params": {
          "queryVector": [-0.5, 90.0, -10, 14.8, -156.0]
        }
      }
    }
  }
}

Semantic retrieval

The so-called semantic retrieval in es is simply its own models plus vector retrieval. es provides a number of NLP models, covering both dense and sparse vectors; for Chinese search you have to upload and configure a model yourself. The usual path to better semantic retrieval is to pick a strong general-purpose model, accumulate a corpus, train or fine-tune the model, and iterate on quality, but training is not cheap. To offer something universal and easy to use, es provides ELSER, a sparse vector encoder that works out of the box with minimal fine-tuning; it is currently English-only.

To put it simply, semantic retrieval means that the encoding work is also handed over to es: there is no need to encode text in advance and then send vectors to es for distance calculation. It involves four steps: deploying the model, creating vector fields, generating embedding vectors, and retrieving data. This feature is not free; check the official documentation for details.
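
As a sketch of the last step, once a text embedding model has been deployed a kNN search can let es encode the query text itself via query_vector_builder; the model id and field names below are illustrative.

{
  "knn": {
    "field": "my_vector",
    "k": 10,
    "num_candidates": 100,
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__msmarco-minilm-l-12-v3",
        "model_text": "how does hybrid retrieval work"
      }
    }
  }
}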

Reciprocal fusion sorting (RRF)

rrf is used to merge multiple result sets into a single result set sorted by rrf_score. A combination of several ranking methods usually performs better than any single one, for example BM25 full-text ranking plus dense vector similarity ranking. Essentially, several ordered result sets are merged into one ordered result set. In theory you could normalize the scores of each result set (the raw scores live in completely different ranges) and take a weighted linear combination to rank the final results. But this requires choosing the right weights, which means understanding the statistical distribution of each method's scores and tuning the weights to the actual data, and that is not simple.

The other approach is the rrf algorithm. Compared with tuning the weight of each ranking method, rrf is simple and crude: it ignores the relevance scores and relies only on rank positions, bypassing the different score distributions of the methods. rrf_score is computed as follows:

RRFscore(d \in D) = \sum_{r \in R} \frac{1}{k + r(d)}

  • D, the document result sets for the query, e.g. the result set after BM25 ranking and the result set after vector retrieval.
  • R, the set of rankings over those result sets; r(d) denotes the rank position (1, 2, 3, ..., N) of document d within a result set.
  • k, controls how much lower-ranked documents in each result set influence the final ranking. Larger values give lower-ranked documents more influence. It must be greater than or equal to 1 and defaults to 60.

    The calculation accumulates an rrf_score for each document across every result set, then sorts the documents by rrf_score.

    Assuming k=10, the following is an example of the sorting.
Document | BM25 rank | Dense vector rank | BM25 rrf_score | Dense vector rrf_score | Final rank by total rrf_score
A | 1 | 3 | 1/(1+10) = 1/11 | 1/(3+10) = 1/13 | 1
B | - | 2 | - | 1/(2+10) = 1/12 | 3
C | 3 | 1 | 1/(3+10) = 1/13 | 1/(1+10) = 1/11 | 1
D | 2 | 4 | 1/(2+10) = 1/12 | 1/(4+10) = 1/14 | 2
In the example, documents A and C end up with the same total score. The original rrf_score calculation does not account for the weights of the different retrieval methods. If we believe the dense vector ranking is more accurate than the BM25 ranking, we can increase the weight of the dense vector result set; after that weighting, C becomes the first document in the rrf ordering of the sample data.

EnsembleRetriever in langchain contains a complete rrf implementation with per-retriever weights added; the body of its weighted_reciprocal_rank method is shown below.

"""
Perform weighted Reciprocal Rank Fusion on multiple rank lists.
You can find more details about RRF here:
https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf

Args:
    doc_lists: A list of rank lists, where each rank list contains unique items.

Returns:
    list: The final aggregated list of items sorted by their weighted RRF
            scores in descending order.
"""
if len(doc_lists) != len(self.weights):
    raise ValueError(
        "Number of rank lists must be equal to the number of weights."
    )

# Create a union of all unique documents in the input doc_lists
all_documents = set()
for doc_list in doc_lists:
    for doc in doc_list:
        all_documents.add(doc.page_content)

# Initialize the RRF score dictionary for each document
rrf_score_dic = {doc: 0.0 for doc in all_documents}

# Calculate RRF scores for each document
for doc_list, weight in zip(doc_lists, self.weights):
    for rank, doc in enumerate(doc_list, start=1):
        rrf_score = weight * (1 / (rank + self.c))
        # It would be better to use the document ID as the key. The Document of langchain only has the metadata dictionary and page_content
        rrf_score_dic[doc.page_content] += rrf_score

# Sort documents by their RRF scores in descending order
sorted_documents = sorted(
    rrf_score_dic.keys(), key=lambda x: rrf_score_dic[x], reverse=True
)

# Map the sorted page_content back to the original document objects
page_content_to_doc_map = {
    doc.page_content: doc for doc_list in doc_lists for doc in doc_list
}
sorted_docs = [
    page_content_to_doc_map[page_content] for page_content in sorted_documents
]

return sorted_docs

Langchain integrates Elasticsearch

self._embeddings = HuggingFaceBgeEmbeddings(model_name=Configuration.EMBEDDING_MODEL,
                                            model_kwargs={'device': Configuration.DEVICE},
                                            encode_kwargs={'normalize_embeddings': True})
self._es_client = Elasticsearch(hosts=f'http://{Configuration.ES_HOST}:{Configuration.ES_PORT}',
                                basic_auth=(Configuration.ES_USER, Configuration.ES_PASSWORD))

self._es_vector_store = ElasticsearchStore(index_name=Configuration.INDEX_NAME, embedding=self._embeddings,
                                           es_connection=self._es_client,
                                           distance_strategy=DistanceStrategy.DOT_PRODUCT,
                                           strategy=ApproxRetrievalStrategy(hybrid=True, rrf=True))  # hybrid retrieval with rrf fusion

If the index is not created in advance, es creates it automatically, adds the vector field, and infers the types of the other fields. If you need full-text search, you have to specify an analyzer when creating the index. In addition, only one vector field and one text field can be searched this way, and some parameters cannot be configured flexibly, so this is of limited use. It is recommended to implement the add-documents and search flows yourself; a sketch of creating the index up front follows.
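
A sketch of creating the index up front with the Python client, assuming an ik analyzer plugin is installed for Chinese full-text search; the field names, dims, and analyzer are illustrative and must match the embedding model and language actually used.

# create the index before handing it to ElasticsearchStore; self._es_client is the client created above
self._es_client.indices.create(
    index=Configuration.INDEX_NAME,
    mappings={
        "properties": {
            "text": {"type": "text", "analyzer": "ik_max_word"},  # full-text field with an explicit analyzer
            "vector": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "dot_product",
            },
        }
    },
)

ElasticsearchStore can then be pointed at this existing index (its query_field and vector_query_field arguments map it to custom field names), and you keep control over the analyzer, the vector parameters, and any extra metadata fields.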