A Clear Explanation of Retrieval Augmented Generation (RAG)

Article directory

  • What is retrieval augmented generation?
  • Working backwards: providing the LLM with additional knowledge to answer questions
  • Give the LLM custom instructions via a system prompt
  • Provide the LLM with specific knowledge sources
  • Put it all together and ask the question
  • Retrieval Step: Get the Right Information from Your Knowledge Base
  • What are embeddings, and how do they relate to knowledge retrieval?
  • Find the best pieces of knowledge using embeddings
  • Index your knowledge base
  • Review the entire process

Recommended reading: How do domain-specific chatbots work? An Overview of Retrieval Augmented Generation (RAG). This article is a translation of that piece.

There is a popular open source library called LangChain that can create chatbots, including doing Q&A over any website or document in three lines of code. Here’s an example from the LangChain documentation.

from langchain.document_loaders import WebBaseLoader
from langchain.indexes import VectorstoreIndexCreator

loader = WebBaseLoader("http://www.paulgraham.com/greatwork.html")
index = VectorstoreIndexCreator().from_loaders([loader])
index.query("What should I work on?")

It outputs answers specific to Paul Graham’s article:

The work you choose should ideally have three qualities: it should be something you have a natural aptitude for, something you have a deep interest in, and something that offers scope to do great work. If you’re unsure, you could start by working on your own projects that seem excitingly ambitious to you. It’s also beneficial to be curious, try lots of things, meet lots of people, read lots of books, and ask lots of questions. When in doubt, optimize for interestingness. It’s okay to guess and be wrong sometimes, as this can lead to discovering what you’re truly good at or interested in.

Note: If you’re interested, you can try a chatbot based on Paul Graham’s article.

The first time you run it, it feels like pure magic. How exactly does this work?

The answer is a process called Retrieval Augmented Generation, or RAG for short. It’s a very simple concept, but also has an incredible depth of implementation details.

This article will provide an in-depth overview of RAG. We’ll start with the overall workflow of what’s going on and then zoom in on all the individual parts. By the end, you should have a solid understanding of how these three lines of magic code work and all the principles involved in creating these Q&A bots.

If you’re a developer trying to build a bot like this, you’ll learn which knobs you can turn and how to turn them. If you’re a non-developer looking to use AI tools on your own dataset, you’ll gain knowledge that will help you get the most out of those tools. And if you’re just a curious person, you’ll hopefully learn a thing or two about some of the technology that is upending our lives.

Let’s take a closer look.

What is retrieval augmented generation?

Retrieval augmented generation is the process of supplementing a user’s input to a large language model (LLM) such as ChatGPT with additional information retrieved from elsewhere. The LLM can then use this information to augment the response it generates.

The image below shows how it works in practice:

It starts with the user’s question, for example: How do I do … ?

What happens first is the retrieval step. This is the process of taking the user’s question and searching the knowledge base for the content most likely to answer it. The retrieval step is by far the most important and most complex part of the RAG chain. But for now, think of it simply as a black box that knows how to extract the nuggets of information most relevant to the user’s query.

Can’t we just give the LLM the entire knowledge base?
You may be wondering why we bother with retrieval instead of just sending the entire knowledge base to the LLM. One reason is that models have built-in limits on how much text they can consume at once (although these limits are increasing rapidly). The second reason is cost: sending a lot of text can get quite expensive. Finally, there is evidence that sending only the most relevant information leads to better answers.

Once we have the relevant information from the knowledge base, we send it to the large language model (LLM) along with the user’s question. The LLM (most commonly ChatGPT) then “reads” the provided information and answers the question. This is the augmented generation step.

Pretty simple, right?
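
To make the two-step flow concrete before we dig into the details, here is a minimal sketch in Python. The helpers search_knowledge_base and ask_llm are hypothetical stand-ins for the retrieval step and the LLM call; both are fleshed out later in the article.

# A minimal sketch of the two-step RAG flow described above.
# search_knowledge_base and ask_llm are hypothetical helpers standing in
# for the retrieval step and the LLM call.

def answer_question(question: str) -> str:
    # 1. Retrieval: find the most relevant snippets in the knowledge base.
    relevant_snippets = search_knowledge_base(question)
    # 2. Augmented generation: send the snippets plus the question to the LLM.
    return ask_llm(question=question, context=relevant_snippets)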

Working backwards: providing the LLM with additional knowledge to answer questions

We’ll start with the last step: answer generation. That is, assume that we have extracted relevant information from the knowledge base that we believe will answer the question. How can we use this to generate answers?

This process may feel like black magic, but behind the scenes it’s just a language model. So, broadly speaking, the answer is “just ask the LLM”. How do we get large language models to do something like this?

We will use ChatGPT as an example. Just like regular ChatGPT, it all depends on prompts and messages.

Give the LLM custom instructions via a system prompt

The first component is the system prompt. The system prompt gives overall guidance to the language model. For ChatGPT, the system prompt is something like You are a helpful assistant.

In this case we want it to perform a more specific action. And, since it’s a language model, we can tell it what we want it to do. Here is a brief system prompt example that provides more detailed instructions for LLM:

You are a Knowledge Bot. You will be given the extracted parts of a knowledge base (labeled with DOCUMENT) and a question. Answer the question using information from the knowledge base.

We’re basically saying: Hey AI, we're gonna give you some stuff to read. Read it and then answer our question, k? Thx. And because the AI is very good at following our instructions, it kind of… works.
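
If it helps to make this concrete, here is one possible body for the get_system_prompt() helper that appears in the API call later in this article; the function name comes from that example, and the body is just the prompt above.

# One possible implementation of the get_system_prompt() helper used in
# the API call later in this article: it simply returns the instructions above.

def get_system_prompt() -> str:
    return (
        "You are a Knowledge Bot. You will be given the extracted parts of "
        "a knowledge base (labeled with DOCUMENT) and a question. "
        "Answer the question using information from the knowledge base."
    )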

Provide the LLM with specific knowledge sources

Next, we need to provide reading material for the AI. Again, modern AI is remarkably good at figuring things out, but we can help it with some structure and formatting.

The following is an example format you can use to pass documents to LLM:

------------ DOCUMENT 1 -------------

This document describes the blah blah blah...

------------ DOCUMENT 2 -------------

This document is another example of using x, y and z...

------------ DOCUMENT 3 -------------

[more documents here...]

Do you need all this formatting? Probably not, but it’s good to make things as explicit as possible. You can also use a machine-readable format such as JSON or YAML. Or, if you’re feeling adventurous, you can dump everything into one big blob of text. However, in more advanced use cases it becomes important to maintain some consistent formatting, for example if you want the LLM to cite its sources.

Once we’ve formatted the documents, we simply send them to the LLM as a normal chat message. Remember, in the system prompt we told it we were going to give it some documents, and that’s all we’re doing here.
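
For illustration, a hypothetical helper along these lines could sit behind the get_sources_prompt() call in the API example below. Here it takes the retrieved snippets as an argument and joins them with the DOCUMENT separators shown above.

# A hypothetical helper that could sit behind the get_sources_prompt() call
# in the API example below. It joins the retrieved snippets into one message
# using the DOCUMENT separators shown above.

def get_sources_prompt(documents: list[str]) -> str:
    sections = []
    for i, doc in enumerate(documents, start=1):
        sections.append(f"------------ DOCUMENT {i} -------------\n\n{doc}")
    return "\n\n".join(sections)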

Put it all together and ask the question

Once we have the system prompt and the document message, we simply send the user’s question along with them to the large language model. Here’s what that looks like in Python using the OpenAI ChatCompletion API:

import openai

openai_response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": get_system_prompt(),  # the system prompt as per above
        },
        {
            "role": "system",
            "content": get_sources_prompt(),  # the formatted documents as per above
        },
        {
            "role": "user",
            "content": user_question,  # the question we want to answer
        },
    ],
)

That’s it! One custom system prompt, two messages, and you get context-specific answers!

This is a simple use case that can be extended and improved. One thing we haven’t done yet is tell the AI what to do if it can’t find the answer in the source. We can add these instructions to the system prompt, typically telling it to refuse to answer, or to use its common sense, depending on the desired behavior of your bot. You can also have large language models reference the specific sources they use to answer questions. We’ll discuss these strategies in a future post, but for now, here are the basics of answer generation.
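
For illustration only (this is not from the original example), an extended system prompt with such an instruction might look like:

You are a Knowledge Bot. You will be given the extracted parts of a knowledge base (labeled with DOCUMENT) and a question. Answer the question using only information from the knowledge base. If the answer cannot be found in the documents, say "I don't know" rather than guessing.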

The easy part is done, it’s time to get back to that black box we skipped over…

Retrieval Step: Get the Right Information from Your Knowledge Base

Above we assumed we already had the right pieces of knowledge to send to the large language model. But how do we actually get them from the user’s question? This is the retrieval step, the core piece of infrastructure in any “chat with your data” system.

At its core, retrieval is a search operation: we want to find the most relevant information based on the user’s input. And just like search, it has two main parts:

  1. Indexing: Turning your knowledge base into something searchable/queryable.
  2. Querying: Extracting the most relevant knowledge from a search term.

It is worth noting that any search process can be used for retrieval. Anything that takes user input and returns some results will work. So, for example, you could try to find text that matches the user’s question and send it to the large language model, or you could Google the question and send back the top results, which, incidentally, is roughly how the Bing chatbot works.

Most RAG systems today rely on semantic search, which uses another core part of AI technology: embeddings. Here we will focus on this use case.

So… what exactly are embeddings?

What are embeddings, and how do they relate to knowledge retrieval?

LLMs are weird. One of the strangest things about them is that no one really knows how they understand language. Embeddings are an important part of that story.

If you ask someone how they turn words into meaning, they’ll likely fumble around and say something vague and self-referential like “because I just know what they mean.” Somewhere deep in our brains, there is a complex structure that knows “child” and “kid” are basically the same, that “red” and “green” are both colors, and that “pleased,” “happy,” and “elated” represent the same emotion to varying degrees. We can’t explain how it works; we just know it.

Language models have a similarly sophisticated understanding of language, except, because they are computers, it isn’t in a brain but is instead made of numbers. In the world of large language models, any piece of human language can be represented as a vector (a list) of numbers. This vector of numbers is an embedding.

A key part of LLM technology is the translator from human word-language to AI’s number-language. We call this translator an “embedding machine”, although behind the scenes it’s just an API call. Input human language and output AI numbers.
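
As a sketch, here is roughly what that API call looks like with OpenAI’s embeddings endpoint, using the same pre-1.0 openai library style as the ChatCompletion example above. The model name is an assumption, chosen to match the 1536-dimension model mentioned below.

import openai

# The "embedding machine" is just an API call: text goes in, a vector of
# numbers comes out.
response = openai.Embedding.create(
    model="text-embedding-ada-002",  # assumed model; it returns 1536-dimensional vectors
    input="Hello, how are you?",
)
embedding = response["data"][0]["embedding"]  # a list of 1536 floats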

What do these numbers mean? Nobody knows! They only make sense to the AI. What we do know is that similar words end up with similar sets of numbers, because behind the scenes the AI uses these numbers to “read” and “speak.” So the numbers encode some magical understanding of language, even if we can’t understand it ourselves. The embedding machine is our translator.

Now that we have these magic AI numbers, we can plot them. A simplified plot of the examples above might look like this, where the axes are just some abstract representation of human/AI language:

Once we plot them, we can see that the closer two points are to each other in this hypothetical language space, the more similar they are. Hello, how are you? and Hey, how’s it going? actually overlap with each other. Another greeting, Good morning!, is not far removed from these greetings. And I like cupcakes. is on a completely different island than the others.

Of course, you can’t represent the entire human language on a two-dimensional graph, but the theory is the same. In fact, the embedding has many more coordinates (the model currently used by OpenAI has 1536). But you can still do basic math to determine how close two embeddings (two pieces of text) are to each other.
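
That “basic math” is usually cosine similarity (the flip side of the cosine distance mentioned later). A minimal version with numpy, for the curious:

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # 1.0 means the embeddings point in the same direction (very similar text);
    # values near 0 mean the texts are unrelated.
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))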

These embeddings and determining “closeness” are core principles behind semantic search, powering the retrieval step.

Find the best pieces of knowledge using embeddings

Once we understand how search via embeddings works, we can build a high-level picture of the retrieval step.

On the indexing side, we first have to break the knowledge base into chunks of text. This process is an entire optimization problem in itself; we’ll cover it below, but for now assume we know how to do it.

Once this is done, we pass each piece of knowledge through the embedding machine (in practice the OpenAI API or something similar) and get back an embedded representation of that text. We then save the snippet along with its embedding in a vector database, a database optimized for working with vectors of numbers.
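
To make the idea concrete, here is a toy, in-memory stand-in for a vector database. embed() is a hypothetical wrapper around the embedding API call shown earlier; a real system would use a dedicated vector store instead.

# A toy, in-memory "vector database": a list of (snippet, embedding) pairs.
# embed() is a hypothetical wrapper around the embedding API call shown earlier.
vector_db = []

def index_snippets(snippets: list[str]) -> None:
    for snippet in snippets:
        vector_db.append((snippet, embed(snippet)))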

Now we have a database with all our content embedded in it. Conceptually, you can think of it as a plot of our entire knowledge base on the “language” graph:

Once we have this graph, we perform a similar process on the query side. First we get the embedding of the user input:

We then plot it in the same vector space and find the closest snippets (in this case, 1 and 2):

The magic embedding machine thinks these are the most relevant answers to the question asked, so these are the snippets we extract and send to the large language model!

In fact, the question “what is the closest point” is solved by querying our vector database. So the actual process looks more like this:

The query itself involves some semi-complex math, usually using something called cosine distance, although there are other ways to calculate it. The math is a whole topic of its own, but it’s beyond the scope of this article, and from a practical perspective it can largely be offloaded to a library or database.
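
For intuition, here is what that vector database query boils down to in brute-force form, reusing the toy vector_db, embed(), and cosine_similarity helpers sketched earlier. Real vector databases do this far more efficiently.

# Brute-force version of the vector database query: embed the question, score
# every stored snippet by cosine similarity, and return the top k.
def search_knowledge_base(question: str, k: int = 2) -> list[str]:
    question_embedding = embed(question)
    scored = [
        (cosine_similarity(question_embedding, emb), snippet)
        for snippet, emb in vector_db
    ]
    scored.sort(reverse=True)
    return [snippet for _, snippet in scored[:k]]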

Back to LangChain
In our LangChain example, we have now covered everything done by the following line of code. This little function call hides a lot of complexity!
index.query("What should I work on?")

Index your knowledge base

Okay, we’re almost done. We now understand how to use embeddings to find the most relevant parts of a knowledge base, pass everything to a large language model, and get enhanced answers. The last step we’ll cover is creating the initial index from your knowledge base. In other words, it is the “knowledge splitting machine” in the picture below.

Perhaps surprisingly, indexing your knowledge base is often the hardest and most important part of the whole thing. Unfortunately, it’s more art than science and involves a lot of trial and error. Overall, the indexing process boils down to two high-level steps.

  1. Loading: Retrieving the content of the knowledge base from wherever it normally lives.
  2. Splitting: Splitting the knowledge into snippet-sized chunks suitable for embedding-based search.

Technical Clarification
Technically speaking, the distinction between “loaders” and “splitters” is somewhat arbitrary. You can imagine a single component doing all the work at once, or breaking the loading phase into multiple sub-components.
That said, “loaders” and “splitters” are how things are done in LangChain, and they provide useful abstractions on top of the basic concepts.

Let’s take my own use case as an example. I want to build a chatbot to answer questions about my SaaS boilerplate product, SaaS Pegasus. The first thing I want to add to my knowledge base is the documentation site. The loader is the infrastructure that accesses my documentation, figures out what pages are available, and then pulls down each page. Once the loader is finished, it will output individual documents, one for each page on the site.

There’s a lot going on inside the loader! We need to crawl all the pages, grab the content of each one, and then format the HTML into usable text. There are also different loaders for other things, like PDFs or Google Drive. There’s also parallelization, error handling, and more to address. Again, this is an almost infinitely complex topic, but for the purposes of this article we’ll mostly offload it to a library. So for now, let’s again assume we have a magic box: a “knowledge base” goes in, and individual “documents” come out.

Loaders in LangChain
Built-in loaders are one of the most useful components of LangChain. They offer a range of loaders that can extract content from anything from a Microsoft Word document to an entire Notion site.
The interface of a LangChain loader is exactly as described above: a “knowledge base” goes in, and a list of “documents” comes out.
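
Used on its own (rather than through VectorstoreIndexCreator), the loader from the opening example looks roughly like this:

from langchain.document_loaders import WebBaseLoader

# The loader interface: point it at a source, call load(), get documents back.
loader = WebBaseLoader("http://www.paulgraham.com/greatwork.html")
documents = loader.load()  # a list of Document objects with page_content and metadata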

Out of the loader, we’ll have a collection of documents, one for each page of the documentation site. Ideally, at this point the extra markup has been removed and only the underlying structure and text remain.

Now, we could pass these entire pages to the embedding machine and use them as our knowledge snippets. However, each page may cover a lot of ground! And the more content a page contains, the less specific its embedding becomes, which means our “closeness” search may be less effective.

It’s more likely that the user’s question matches the topic of some piece of text within a page. This is where splitting comes in, as shown in the picture below. With splitting, we break any single document into small, embeddable chunks that are better suited for search.

Note again that splitting documents is an entire art in itself, including how big the average snippet should be (too big and it won’t match queries well; too small and it won’t have enough useful context to generate an answer), how the content is split (usually by headings, if there are any), and so on. However, some reasonable defaults are enough to get started, and you can refine them as you learn from your data.

Splitters in LangChain
In LangChain, splitters belong to a larger category called document transformers. In addition to providing various strategies for splitting documents, they also provide tools for removing redundant content, translating, adding metadata, and more. We focus only on splitters here since they represent the vast majority of document transformations.
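
As one example, a commonly used LangChain splitter is RecursiveCharacterTextSplitter. A minimal sketch, continuing from the documents produced by the loader above (the chunk sizes are just illustrative starting values, not tuned recommendations):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the loaded documents into overlapping chunks suitable for embedding.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)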

Once we have the document snippets, we save them into our vector database as described above, and we’re finally done!

This is a complete picture of the indexed knowledge base.

Back to LangChain
In LangChain, the entire indexing process is encapsulated in these two lines of code. First we initialize the website loader and tell it what content we want to use:
loader = WebBaseLoader("http://www.paulgraham.com/greatwork.html")
We then build the entire index from the loader and save it to our vector database:
index = VectorstoreIndexCreator().from_loaders([loader])
Loading, splitting, embedding and saving all happen behind the scenes.

Review the entire process

Finally, let’s review the entire RAG workflow. It looks like this:

First, we index our knowledge base. We take the knowledge and use a loader to turn it into individual documents, then use a splitter to break those into small chunks or snippets. Once we have these, we pass them to the embedding machine, which converts them into vectors that can be used for semantic search. We save these embeddings, alongside their text snippets, in our vector database.

Next comes retrieval. The question is sent through the same embedding machine and passed to our vector database to find the closest matching snippets, which we will use to answer the question.

Finally, augmented answer generation. We take the knowledge snippets, format them alongside our custom system prompt and the question, and get our context-specific answer.

Wow! Hopefully you now have a basic understanding of how retrieval augmented generation works. If you want to try it out on your own knowledge base without having to do all the setup work, check out Scriv.ai, which lets you build a domain-specific chatbot in just minutes without any coding skills.

In future articles we will expand on many of these concepts to include all the ways in which the “default” settings outlined here can be improved. As I mentioned, each of these sections has almost infinite depth, and we will delve into these sections one at a time in the future.