Building a Multi-Agent RAG with LlamaIndex


Source: DeepHub IMBA
This article is about 3,000 words; recommended reading time: 6 minutes. It introduces how to build a multi-agent RAG with LlamaIndex.

Retrieval-augmented generation (RAG) has become a powerful technique for enhancing the capabilities of large language models (LLMs). By retrieving relevant information from knowledge sources and incorporating it into the prompt, RAG gives the LLM useful context for producing fact-based output.

However, existing single-agent RAG systems face the challenges of low retrieval efficiency, high latency, and suboptimal prompts. These issues limit real-world RAG performance. Multi-agent architecture provides an ideal framework to overcome these challenges and unlock the full potential of RAG. By dividing responsibilities, multi-agent systems allow specialized roles, parallel execution, and optimized collaboration.


Single-agent RAG

Current RAG systems use a single agent to handle the complete workflow: query analysis, passage retrieval, ranking, summarization, and prompt enhancement.

This all-in-one approach is simple, but relying on one agent for every task creates bottlenecks. The agent wastes time retrieving irrelevant passages from large corpora, summarizes long contexts poorly, and produces prompts that fail to optimally integrate the original question with the retrieved information.

These inefficiencies severely limit the scalability and speed of RAG for real-time applications.
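To make the bottleneck concrete, here is a minimal, framework-free sketch of a single-agent pipeline, where one component performs every step in sequence (all functions, names, and data are illustrative, not part of any library):

```python
import re

def words(text):
    # Tokenize to lowercase words, ignoring punctuation.
    return set(re.findall(r"[a-z]+", text.lower()))

def single_agent_rag(query, corpus):
    # A single agent handles the whole workflow, step by step.
    # 1. Query analysis
    terms = words(query)
    # 2. Passage retrieval: a naive scan of the whole corpus (the bottleneck)
    scored = [(p, len(terms & words(p))) for p in corpus]
    # 3. Ranking by term overlap
    ranked = [p for p, s in sorted(scored, key=lambda x: x[1], reverse=True) if s > 0]
    # 4. "Summarization": crude truncation to the top passages
    context = " ".join(ranked[:2])
    # 5. Prompt enhancement
    return f"Context: {context}\n\nQuestion: {query}"

corpus = ["Boston has many museums.", "Tokyo is in Japan."]
prompt = single_agent_rag("What museums are in Boston?", corpus)
```

Every stage runs serially inside one loop, so a slow retrieval scan delays everything downstream.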

Multi-agent RAG

Multi-agent architectures overcome these single-agent limitations by dividing the RAG workflow into modular roles that execute concurrently:

Retrieval: Dedicated retrieval agents focus on efficient passage retrieval using optimized search techniques, minimizing latency.

Search: With retrieval decoupled from the rest of the pipeline, searches can be parallelized across retrieval agents to reduce latency further.

Ranking: A separate ranking agent evaluates the retrieved passages for richness, specificity, and other relevance signals, filtering for the most relevant results.

Summarization: Summarization agents condense lengthy context into concise snippets containing only the most important facts.

Prompt optimization: Dedicated agents dynamically adjust how the original query and the retrieved information are combined in the prompt.

Flexibility: Agents can be replaced or added to customize the system, and visualization agents can provide insight into the workflow.

By dividing RAG into specialized, collaborative roles, multi-agent systems improve relevance, reduce latency, and optimize prompts, unlocking scalable, high-performance RAG.
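The division of labor above can be sketched in plain Python. Every name here is illustrative rather than part of any library: two stub retrieval agents run in parallel, and separate functions play the ranking, summarization, and prompt-optimization roles:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins for specialized agents; each "agent" is a
# callable with a single responsibility.
def vector_retriever(query):
    return [("Boston is known for its arts scene.", 0.9)]

def keyword_retriever(query):
    return [("Boston hosts the Boston Arts Festival.", 0.7)]

def rank(passages, top_k=2):
    # Ranking agent: keep the highest-scoring passages.
    return sorted(passages, key=lambda p: p[1], reverse=True)[:top_k]

def summarize(passages):
    # Summarization agent: condense ranked passages into a short context.
    return " ".join(text for text, _ in passages)

def build_prompt(query, context):
    # Prompt-optimization agent: merge the query and retrieved context.
    return f"Context: {context}\n\nQuestion: {query}"

def multi_agent_rag(query):
    # Retrieval agents run in parallel to reduce latency.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda r: r(query), [vector_retriever, keyword_retriever])
    passages = [p for batch in results for p in batch]
    return build_prompt(query, summarize(rank(passages)))

prompt = multi_agent_rag("Tell me about the arts in Boston")
```

Because each role is a separate component, any one of them can be swapped or parallelized without touching the others.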

Separating responsibilities also lets retrieval agents combine complementary techniques such as vector similarity, knowledge graphs, and web scraping. This multi-signal approach retrieves diverse content that captures different aspects of relevance.

By decomposing retrieval and ranking across collaborating agents, relevance can be optimized from multiple perspectives. Combined with reader and orchestration agents, this supports scalable, multi-angle RAG.

This modular architecture lets engineers mix and match different retrieval techniques across specialized agents.
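As a toy illustration of combining retrieval signals, the sketch below blends vector similarity with keyword overlap. The documents, vectors, and weights are made up for the example; a real system would use an embedding model and a proper index:

```python
import math

# Toy corpus: each entry pairs text with a hand-made "embedding" vector.
docs = {
    "d1": ("Boston arts and culture overview", [0.9, 0.1, 0.0]),
    "d2": ("Houston sports history", [0.1, 0.8, 0.2]),
}

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def keyword_score(query, text):
    # Fraction of query words that appear in the document text.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q)

def hybrid_score(query, query_vec, doc_id, alpha=0.5):
    # Blend semantic similarity with keyword overlap.
    text, vec = docs[doc_id]
    return alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text)

best = max(docs, key=lambda d: hybrid_score("Boston arts", [0.85, 0.15, 0.05], d))
```

The `alpha` weight controls how much each signal contributes; in a multi-agent setup, each signal could come from its own retrieval agent.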

Multi-agent RAG in LlamaIndex

LlamaIndex provides a concrete example of a multi-agent RAG:

Document agents – perform QA and summarization within a single document.

Vector index – enables semantic search for each document agent.

Summary index – enables summarization for each document agent.

Top-level agent – orchestrates the document agents, using tool retrieval to answer cross-document questions.

For multi-document QA, this shows real advantages over a single-agent RAG baseline: dedicated document agents coordinated by a top-level agent give more focused, relevant responses grounded in the specific documents.

Let’s take a look at how this is implemented with LlamaIndex.

We will download Wikipedia articles about different cities and store each article individually. We use only 18 cities; the corpus is not large, but it is enough to demonstrate the advanced document retrieval features.

from llama_index import (
    VectorStoreIndex,
    SummaryIndex,
    SimpleKeywordTableIndex,
    SimpleDirectoryReader,
    ServiceContext,
)
from llama_index.schema import IndexNode
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.llms import OpenAI

Below is a list of cities:

wiki_titles = [
    "Toronto",
    "Seattle",
    "Chicago",
    "Boston",
    "Houston",
    "Tokyo",
    "Berlin",
    "Lisbon",
    "Paris",
    "London",
    "Atlanta",
    "Munich",
    "Shanghai",
    "Beijing",
    "Copenhagen",
    "Moscow",
    "Cairo",
    "Karachi",
]

The following is the code to download each city document:

from pathlib import Path

import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

    data_path = Path("data")
    if not data_path.exists():
        Path.mkdir(data_path)

    # use utf-8 so non-ASCII article text writes cleanly on all platforms
    with open(data_path / f"{title}.txt", "w", encoding="utf-8") as fp:
        fp.write(wiki_text)

Load the downloaded document.

# Load all wiki documents
city_docs = {}
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()

Define the LLM and service context.

llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)

We define a “document agent” for each document: each gets a vector index (for semantic search) and a summary index (for summarization). The two query engines are then wrapped as tools and passed to an OpenAI function-calling agent.

Each document agent can dynamically choose between semantic search and summarization within its document. We create a separate document agent for each city.

from llama_index.agent import OpenAIAgent
from llama_index import load_index_from_storage, StorageContext
from llama_index.node_parser import SimpleNodeParser
import os

node_parser = SimpleNodeParser.from_defaults()

# Build agents dictionary
agents = {}
query_engines = {}

# this is for the baseline
all_nodes = []

for idx, wiki_title in enumerate(wiki_titles):
    nodes = node_parser.get_nodes_from_documents(city_docs[wiki_title])
    all_nodes.extend(nodes)

    if not os.path.exists(f"./data/{wiki_title}"):
        # build vector index
        vector_index = VectorStoreIndex(nodes, service_context=service_context)
        vector_index.storage_context.persist(
            persist_dir=f"./data/{wiki_title}"
        )
    else:
        vector_index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=f"./data/{wiki_title}"),
            service_context=service_context,
        )

    # build summary index
    summary_index = SummaryIndex(nodes, service_context=service_context)

    # define query engines
    vector_query_engine = vector_index.as_query_engine()
    summary_query_engine = summary_index.as_query_engine()

    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name="vector_tool",
                description=(
                    "Useful for questions related to specific aspects of"
                    f" {wiki_title} (e.g. the history, arts and culture,"
                    " sports, demographics, or more)."
                ),
            ),
        ),
        QueryEngineTool(
            query_engine=summary_query_engine,
            metadata=ToolMetadata(
                name="summary_tool",
                description=(
                    "Useful for any requests that require a holistic summary"
                    f" of EVERYTHING about {wiki_title}. For questions about"
                    " more specific sections, please use the vector_tool."
                ),
            ),
        ),
    ]

    # build agent
    function_llm = OpenAI(model="gpt-4")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=True,
        system_prompt=f"""\
You are a specialized agent designed to answer queries about {wiki_title}.
You must ALWAYS use at least one of the tools provided when answering a question; do NOT rely on prior knowledge.\
""",
    )

    agents[wiki_title] = agent
    query_engines[wiki_title] = vector_index.as_query_engine(
        similarity_top_k=2
    )

Below we define a top-level agent that orchestrates across the different document agents to answer any user query.

The top-level agent uses all document agents as tools, selected by a top-k retriever. For best results, you could plug in custom retrieval logic tailored to your needs.

# define tool for each document agent
all_tools = []
for wiki_title in wiki_titles:
    wiki_summary = (
        f"This content contains Wikipedia articles about {wiki_title}. Use"
        f" this tool if you want to answer any questions about {wiki_title}.\n"
    )
    doc_tool = QueryEngineTool(
        query_engine=agents[wiki_title],
        metadata=ToolMetadata(
            name=f"tool_{wiki_title}",
            description=wiki_summary,
        ),
    )
    all_tools.append(doc_tool)

# define an "object" index and retriever over these tools
from llama_index import VectorStoreIndex
from llama_index.objects import ObjectIndex, SimpleToolNodeMapping

tool_mapping = SimpleToolNodeMapping.from_objects(all_tools)
obj_index = ObjectIndex.from_objects(
    all_tools,
    tool_mapping,
    VectorStoreIndex,
)

from llama_index.agent import FnRetrieverOpenAIAgent

top_agent = FnRetrieverOpenAIAgent.from_retriever(
    obj_index.as_retriever(similarity_top_k=3),
    system_prompt="""\
You are an agent designed to answer queries about a set of given cities.
Please always use the tools provided to answer a question. Do not rely on prior knowledge.\
""",
    verbose=True,
)
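The code above relies on the built-in top-k object retriever. To give a feel for what custom tool retrieval could look like, here is a framework-free sketch that selects tools by keyword overlap with their descriptions (all names and descriptions are illustrative, not LlamaIndex APIs):

```python
# Illustrative tool retrieval: pick the k tools whose descriptions best
# match the query by word overlap (a stand-in for embedding similarity).
tools = {
    "tool_Boston": "Wikipedia article about Boston",
    "tool_Tokyo": "Wikipedia article about Tokyo",
    "tool_Paris": "Wikipedia article about Paris",
}

def retrieve_tools(query, k=1):
    def overlap(desc):
        q, d = set(query.lower().split()), set(desc.lower().split())
        return len(q & d)
    # Rank tool names by how well their descriptions match the query.
    return sorted(tools, key=lambda name: overlap(tools[name]), reverse=True)[:k]

selected = retrieve_tools("Tell me about the arts in Boston", k=1)
```

A custom retriever like this could, for example, always include certain tools, filter by metadata, or combine multiple similarity signals before handing tools to the top-level agent.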

For comparison, we also define a “simple” RAG pipeline that dumps all documents into a single vector index, with similarity_top_k = 4.

base_index = VectorStoreIndex(all_nodes)
base_query_engine = base_index.as_query_engine(similarity_top_k=4)

Let’s run some example queries, comparing single-document QA/summarization with multi-document QA/summarization.

response = top_agent.query("Tell me about the arts and culture in Boston")

The result is as follows:

=== Calling Function ===
Calling function: tool_Boston with args: {
  "input": "arts and culture"
}
=== Calling Function ===
Calling function: vector_tool with args: {
  "input": "arts and culture"
}
Got output: Boston is known for its vibrant arts and culture scene. The city is home to a number of performing arts organizations, including the Boston Ballet, Boston Lyric Opera Company, Opera Boston, Boston Baroque, and the Handel and Haydn Society. There are also several theaters in or near the Theater District, such as the Cutler Majestic Theatre, Citi Performing Arts Center, the Colonial Theater, and the Orpheum Theatre. Boston is a center for contemporary classical music, with groups like the Boston Modern Orchestra Project and Boston Musica Viva. The city also hosts major annual events, such as First Night, the Boston Early Music Festival, and the Boston Arts Festival. In addition, Boston has several art museums and galleries, including the Museum of Fine Arts, the Isabella Stewart Gardner Museum, and the Institute of Contemporary Art.
========================
Got output: Boston is renowned for its vibrant arts and culture scene. It is home to numerous performing arts organizations, including the Boston Ballet, Boston Lyric Opera Company, Opera Boston, Boston Baroque, and the Handel and Haydn Society. The city's Theater District houses several theaters, such as the Cutler Majestic Theatre, Citi Performing Arts Center, the Colonial Theater, and the Orpheum Theatre.

Boston is also a hub for contemporary classical music, with groups like the Boston Modern Orchestra Project and Boston Musica Viva. The city hosts major annual events, such as First Night, the Boston Early Music Festival, and the Boston Arts Festival, which contribute to its cultural richness.

In terms of visual arts, Boston boasts several art museums and galleries. The Museum of Fine Arts, the Isabella Stewart Gardner Museum, and the Institute of Contemporary Art are among the most notable. These institutions offer a wide range of art collections, from ancient to contemporary, attracting art enthusiasts from around the world.
========================

Now let’s look at the results of the simple RAG baseline defined above:

# baseline
response = base_query_engine.query(
    "Tell me about the arts and culture in Boston"
)
print(str(response))

Output:

Boston has a rich arts and culture scene. The city is home to a variety of performing arts organizations, such as the Boston Ballet, Boston Lyric Opera Company, Opera Boston, Boston Baroque, and the Handel and Haydn Society. Additionally, there are numerous contemporary classical music groups associated with the city's conservatories and universities, like the Boston Modern Orchestra Project and Boston Musica Viva. The Theater District in Boston is a hub for theater, with notable venues including the Cutler Majestic Theatre, Citi Performing Arts Center, the Colonial Theater, and the Orpheum Theatre. Boston also hosts several significant annual events, including First Night, the Boston Early Music Festival, the Boston Arts Festival, and the Boston gay pride parade and festival. The city is renowned for its historic sites connected to the American Revolution, as well as its art museums and galleries, such as the Museum of Fine Arts, Isabella Stewart Gardner Museum, and the Institute of Contemporary Art.

You can see that the multi-agent system's response is better organized and more focused on arts and culture, while the baseline mixes in less relevant details such as historic sites.

Summary

RAG systems need to evolve toward multi-agent architectures to achieve enterprise-class performance. As this example illustrates, dividing responsibilities yields gains in relevance, speed, summary quality, and prompt optimization. By decomposing RAG into specialized, collaborative roles, multi-agent systems overcome the limitations of a single agent and enable scalable, high-performance RAG.

Editor: Yu Tengkai

Proofreading: Lin Yilin
