Improving recall (Retrieval) and introducing reranking (Reranking) to improve LLM application quality under the RAG architecture

ully · AI engineering · 2023-08-24 21:08


The previous article introduced the origin and architecture of retrieval-augmented generation (RAG) for LLM applications. The RAG architecture works around the context-window limitation that large models face during prompting; it is concise, easy to implement, and already widely used. However, in practice there are many issues that need to be improved and optimized during implementation.

[Figure: RAG architecture as implemented in llamaindex]

Take RAG recall as an example. The most basic approach retrieves background chunks from the vector database with a top-k query and submits them directly to the LLM to generate an answer. The problem is that the retrieved chunks are not necessarily the most relevant to the context, which ultimately leads to poor-quality answers from the large model.

This problem is largely caused by insufficient recall relevance or too few recalled results. Thinking from the perspective of expanding recall, we can borrow from the practice of recommendation systems and introduce a coarse-ranking or reranking step. The basic idea is to enlarge the number of results returned by the original top-k vector retrieval, and then introduce a reranking model, which can be a rule-based strategy, a lightweight small model, or an LLM, to reorder the recalled results against the context. This improvement can effectively raise the quality of RAG output; a minimal sketch of the two-stage flow is shown below.
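As a rough, llamaindex-agnostic illustration of the idea, the following sketch assumes a generic vector_search function and a hypothetical rerank_score function (neither is a real library API); it simply over-fetches candidates and keeps the best few after rescoring:

# Hedged sketch of two-stage retrieval: over-fetch with vector search, then rerank.
# `vector_search` and `rerank_score` are hypothetical placeholders, not llamaindex APIs.
def retrieve_with_rerank(query, vector_search, rerank_score, fetch_k=40, top_n=5):
    # Stage 1: recall more candidates than we finally need.
    candidates = vector_search(query, top_k=fetch_k)
    # Stage 2: rescore each candidate against the query with a stronger model or strategy.
    scored = [(rerank_score(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only the top_n best-scoring chunks for the LLM prompt.
    return [chunk for _, chunk in scored[:top_n]]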


The following introduces some specific ideas and implementations of llamaindex in this regard.

1) LLM-based retrieval or reranking

Conceptually, this approach uses the LLM itself to decide which documents/text chunks are relevant to a given query. The prompt contains a set of candidate documents, and the LLM's task is to select the relevant subset and score its relevance. To avoid the content fragmentation caused by chunking large documents, certain optimizations can also be made at index-building time, for example using a summary index for large documents.

[Figure: Simple diagram of how LLM-based retrieval works]

One principle in LLM application development is to use the capabilities of the large model as much as possible. The LLM is not only for producing the final answer; it can also be used for keyword expansion, answer-consistency checking, and so on. Here the large model is used to judge which candidate documents best answer the question, and writing a good prompt is the key. Below is llamaindex's built-in prompt; as you can see, it relies on the model's few-shot ability:


A list of documents is shown below. Each document has a number next to it along with a summary of the document. A question is also provided.
Respond with the numbers of the documents you should consult to answer the question, in order of relevance, as well
as the relevance score. The relevance score is a number from 1-10 based on how relevant you think the document is to the question.
Do not include any documents that are not relevant to the question.
Example format:
Document 1:
<summary of document 1>
Document 2:
<summary of document 2>
...
Document 10:
<summary of document 10>
Question: <question>
Answer:
Doc: 9, Relevance: 7
Doc: 3, Relevance: 4
Doc: 7, Relevance: 3
Let's try this now:
{context_str}
Question: {query_str}
Answer:

In addition, this selection process can be run in batches, so that relevant documents can be recalled over a larger candidate set; the scores the large model returns for each batch are then aggregated to obtain the final candidate documents. A rough sketch of this batch-then-aggregate idea follows. llama-index provides two forms of this abstraction: a standalone retrieval module (ListIndexLLMRetriever) and a reranking module (LLMRerank).
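As an illustration of the batching idea only (llm_select_and_score is a hypothetical helper, not a llamaindex API), each batch of candidates is scored independently by the LLM and the results are merged:

# Hedged sketch: score candidates in batches with the LLM, then merge the scores.
# `llm_select_and_score(query, batch)` is a hypothetical helper returning
# a list of (chunk, relevance_score) pairs for one batch.
def batched_llm_rerank(query, candidates, llm_select_and_score, batch_size=5, top_n=3):
    scored = []
    for i in range(0, len(candidates), batch_size):
        batch = candidates[i:i + batch_size]
        scored.extend(llm_select_and_score(query, batch))
    # Note: scores come from separate LLM calls, so they are not globally calibrated.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in scored[:top_n]]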

  • LLM Retriever (ListIndexLLMRetriever)

This module is built on the list index, which simply stores a set of nodes as a flat list. You can build a list index over a set of documents and then use the LLM retriever to retrieve relevant documents from it.

from llama_index import GPTListIndex
from llama_index.indices.list.retrievers import ListIndexLLMRetriever

index = GPTListIndex.from_documents(documents, service_context=service_context)

# high-level API
query_str = "What did the author do during his time in college?"
retriever = index.as_retriever(retriever_mode="llm")
nodes = retriever.retrieve(query_str)

# lower-level API
retriever = ListIndexLLMRetriever()
response_synthesizer = ResponseSynthesizer.from_args()
query_engine = RetrieverQueryEngine(retriever=retriever, response_synthesizer=response_synthesizer)
response = query_engine.query(query_str)

Using this retrieval mode in place of traditional vector retrieval is relatively slow, so it suits cases where the number of candidate documents is small, but it removes the need for a separate reranking stage.

  • LLM Reranker (LLMRerank)

This is the typical implementation for this scenario. It is defined as part of the NodePostprocessor abstraction for second-stage processing after the initial retrieval pass. Postprocessors can be used on their own or as part of a RetrieverQueryEngine call. In the example below, we show how to use the postprocessor as a standalone module after an initial retrieval call over a vector index.

from llama_index.indices.query.schema import QueryBundle

query_bundle = QueryBundle(query_str)

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=vector_top_k,
)
retrieved_nodes = retriever.retrieve(query_bundle)

# configure reranker
reranker = LLMRerank(choice_batch_size=5, top_n=reranker_top_n, service_context=service_context)
retrieved_nodes = reranker.postprocess_nodes(retrieved_nodes, query_bundle)

Note that LLM-based retrieval or reranking has some drawbacks: first, it is slow; second, it increases the cost of LLM calls; third, because scoring is done per batch, the scores are not globally calibrated across batches.

Comparison demonstration

The following examples compare plain top-k retrieval with LLM-based retrieval/reranking on "The Great Gatsby" and the 2021 Lyft SEC 10-K, looking only at the recall stage.

1. The Great Gatsby

In this example, "The Great Gatsby" is loaded as a document object and a vector index is built over it (with the chunk size set to 512).

# LLM Predictor (gpt-3.5-turbo) + service context
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=512)

# load documents
documents = SimpleDirectoryReader('../../../examples/gatsby/data').load_data()
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)

We then define a get_retrieved_nodes function that can run either plain vector retrieval over the index or vector retrieval plus reranking.

def get_retrieved_nodes(
    query_str, vector_top_k=10, reranker_top_n=3, with_reranker=False
):
    query_bundle = QueryBundle(query_str)
    # configure retriever
    retriever = VectorIndexRetriever(
        index=index,
        similarity_top_k=vector_top_k,
    )
    retrieved_nodes = retriever.retrieve(query_bundle)
    if with_reranker:
        # configure reranker
        reranker = LLMRerank(choice_batch_size=5, top_n=reranker_top_n, service_context=service_context)
        retrieved_nodes = reranker.postprocess_nodes(retrieved_nodes, query_bundle)
    return retrieved_nodes

Then we ask some questions. For plain vector retrieval we set k=3; for the two-stage setup we set k=10 for vector retrieval and n=3 for the LLM-based reranking.

  • Test question: “Who was driving the car that hit Myrtle?”

For those unfamiliar with “The Great Gatsby,” the narrator later finds out from Gatsby that it was actually Daisy who was driving the car, but Gatsby took the blame for her.
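Using the get_retrieved_nodes helper defined above, the two configurations for this question can be compared roughly as follows (a sketch based on the parameters stated in the text):

query_str = "Who was driving the car that hit Myrtle?"

# baseline: plain vector retrieval, k=3
baseline_nodes = get_retrieved_nodes(query_str, vector_top_k=3, with_reranker=False)

# two-stage: vector retrieval with k=10, then LLM reranking down to n=3
reranked_nodes = get_retrieved_nodes(
    query_str, vector_top_k=10, reranker_top_n=3, with_reranker=True
)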

The top retrieved contexts are shown in the figure below. In the embedding-based retrieval, the first two texts mention the car accident but give no details about who was actually responsible; only the third text contains the correct answer.

[Figure: Context recalled with top-k vector retrieval (baseline)]

In contrast, the two-stage approach returns only one relevant context, and it contains the correct answer.

[Figure: Context obtained with vector retrieval + reranking]

2. 2021 Lyft SEC 10-K

This test asks some questions about the 2021 Lyft SEC 10-K, specifically about the impact of and response to COVID-19. The Lyft SEC 10-K is 238 pages long, and a ctrl-f search for "COVID-19" returns 127 hits.

We use a similar setup to the Gatsby example above. The main differences are a chunk size of 128 instead of 512, k=5 for the vector-retrieval baseline, and k=40 with reranker n=5 for the vector retrieval + reranking combination.

  • Test question: “What initiatives are the company focusing on independently of COVID-19?”

The baseline results are shown below. As can be seen, the results at indices 0, 1, 3, and 4 are all measures taken directly in response to COVID-19, even though the question is specifically about company initiatives independent of the COVID-19 pandemic.

[Figure: Context recalled with top-k vector retrieval (baseline)]

In the second approach, the top k is expanded to 40 and the LLM then filters down to the top 5, giving more relevant results. The independent company initiatives include "expansion of Light Vehicles" (1), "incremental investments in brand/marketing" (2), international expansion (3), and accounting for miscellaneous risks such as natural disasters and operational risks in terms of financial performance (4).

[Figure: Context obtained with vector retrieval + reranking]

It can be seen that LLM-based retrieval or reranking brings a fairly large improvement over plain top-k vector retrieval, but it also has the problems noted above, so the choice needs to be made according to the actual scenario.

2) Based on lightweight models and algorithms

This approach replaces the LLM with something simpler, using methods such as BM25 or Cohere Rerank to coarsely rank the recalled results, trading a little quality for better performance.


Usage example:

cohere_rerank = CohereRerank(api_key=os.environ["COHERE_API_KEY"], top_n=top_k)
reranking_query_engine = index.as_query_engine(
    similarity_top_k=top_k,
    node_postprocessors=[cohere_rerank],
)
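BM25, mentioned above, can be used in a similar spirit outside of llamaindex. The following is a minimal sketch using the rank_bm25 package; the whitespace tokenization and the chunk_texts / query_str / top_k names are illustrative assumptions:

from rank_bm25 import BM25Okapi

# chunk_texts: the texts of the recalled chunks (illustrative assumption)
tokenized_chunks = [text.lower().split() for text in chunk_texts]
bm25 = BM25Okapi(tokenized_chunks)

# score every recalled chunk against the query and keep the best ones
scores = bm25.get_scores(query_str.lower().split())
top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
reranked_texts = [chunk_texts[i] for i in top_indices]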

3) Rule-based

In the coarse-ranking stage you can also borrow from recommendation-system practice: before results enter a fine-ranking model, the coarse-ranking model can be replaced with a set of rules, and sometimes well-designed rules perform even better. In llamaindex this is done with postprocessors, which can modify query results after they are returned from the index.

For example, to add a strategy that prioritizes the most recent documents, you can define a FixedRecencyPostprocessor.

recency_postprocessor = FixedRecencyPostprocessor(service_context=service_context, top_k=1)
recency_query_engine = index.as_query_engine(
    similarity_top_k=top_k,
    node_postprocessors=[recency_postprocessor],
)

Here the FixedRecencyPostprocessor sorts and filters using the date field in the metadata of each chunk node, as in the output below:

> Source (Doc id: 24ec05e1-cb35-492e-8741-fdfe2c582e43): date: 2017-01-28 00:00:00
Under the category:
THE WORLDPOST:
World Leaders React To The Reality ...

> Source (Doc id: 098c2482-ce52-4e31-aa1c-825a385b56a1): date: 2015-01-18 00:00:00
Under the category:
POLITICS:
The Issue That's Looming Over The Final ...
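For this to work, each document needs a date in its metadata. Below is a minimal sketch of attaching it when building documents; the extra_info parameter and the "date" key are assumptions based on the llama_index versions of that period, so treat the exact names as illustrative:

from llama_index import Document

# Hedged sketch: give each document a "date" field so the recency
# postprocessor has something to sort on (field names are assumptions).
documents = [
    Document(text="World Leaders React To The Reality ...", extra_info={"date": "2017-01-28"}),
    Document(text="The Issue That's Looming Over The Final ...", extra_info={"date": "2015-01-18"}),
]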

It is worth mentioning that making good use of metadata can work wonders for many problems in the RAG architecture; later articles will cover some metadata use cases.


Moreover, in llamaindex these postprocessors can be combined into a chain of rules, for example combining the cohere_rerank and recency_postprocessor above to further refine the ordering.

query_engine = index.as_query_engine(
    similarity_top_k=top_k,
    node_postprocessors=[cohere_rerank, recency_postprocessor],
)

Summary

The RAG architecture grew out of real problems, and many of those problems look familiar. For quality optimization, we can borrow the optimization experience of traditional AI systems such as recommendation systems and transfer it over, which helps a lot in improving RAG results. The following articles will continue to cover usage issues in specific scenarios, so stay tuned.

Note: Parts of this article are adapted from the official blogs of llamaindex and qdrant.