Article directory
- Foreword
- Environment
- 1. Build a knowledge base
- 2. Vectorize the knowledge base
- 3. Recall
- 4. Use LLM for reading comprehension
- 5. Effect
- Summary
Foreword
In this article you will learn how to use langchain to build your own knowledge-base Q&A system. In fact, most chatpdf-style products work on similar principles, which I will divide into four steps:
- Build the knowledge base
- Vectorize the knowledge base
- Recall
- Use an LLM for reading comprehension
Next, let's see how to turn our own blog into a ChatBlog.
The knowledge-base data used in this article comes from a blog I wrote before: Retrieval question answering system based on Sentence-Bert.
Environment
As usual, the environment is an essential part:

```
langchain==0.0.148
openai==0.27.4
chromadb==0.3.21
```
1. Build a knowledge base
This step is relatively simple; let's go straight to the code:
```python
import re

from bs4 import BeautifulSoup
from langchain.docstore.document import Document


def get_blog_text():
    data_path = 'blog.txt'
    with open(data_path, 'r') as f:
        data = f.read()
    soup = BeautifulSoup(data, 'lxml')
    text = soup.get_text()
    return text


# Customize sentence segmentation to make sure sentences are not truncated
def split_paragraph(text, max_length=300):
    text = text.replace('\n', '')
    text = re.sub(r'\s+', ' ', text)
    """Split the article into paragraphs"""
    # First split the article by sentence-ending punctuation (Chinese and English)
    sentences = re.split(r'([;；!！.。?？])', text)
    new_sents = []
    for i in range(int(len(sentences) / 2)):
        sent = sentences[2 * i] + sentences[2 * i + 1]
        new_sents.append(sent)
    if len(sentences) % 2 == 1:
        new_sents.append(sentences[len(sentences) - 1])

    # Greedily merge sentences into paragraphs no longer than max_length
    paragraphs = []
    current_length = 0
    current_paragraph = ""
    for sentence in new_sents:
        sentence_length = len(sentence)
        if current_length + sentence_length <= max_length:
            current_paragraph += sentence
            current_length += sentence_length
        else:
            paragraphs.append(current_paragraph.strip())
            current_paragraph = sentence
            current_length = sentence_length
    paragraphs.append(current_paragraph.strip())

    documents = []
    for paragraph in paragraphs:
        new_doc = Document(page_content=paragraph)
        print(new_doc)
        documents.append(new_doc)
    return documents


content = get_blog_text()
documents = split_paragraph(content)
```
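The core idea of `split_paragraph` (split on sentence-ending punctuation, then greedily pack sentences into chunks no longer than `max_length`) can be exercised on its own, without langchain or BeautifulSoup. A minimal sketch, with an illustrative `max_length` of 30:

```python
import re


def merge_sentences(text, max_length=30):
    # Split on sentence-ending punctuation, keeping the delimiter attached
    parts = re.split(r'([.!?。！？])', text)
    sentences = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
    if len(parts) % 2 == 1 and parts[-1]:
        sentences.append(parts[-1])
    # Greedily pack sentences into paragraphs no longer than max_length
    paragraphs, current = [], ""
    for sent in sentences:
        if len(current) + len(sent) <= max_length:
            current += sent
        else:
            paragraphs.append(current.strip())
            current = sent
    paragraphs.append(current.strip())
    return paragraphs


print(merge_sentences("One two three. Four five. Six seven eight nine ten."))
```

Because the merge is sentence-at-a-time, no sentence is ever cut in half, which is the whole point of writing a custom splitter.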
One thing to explain here: I did not use the document-splitting functions that langchain provides. langchain ships many document splitters; interested readers can check the source code in langchain's text_splitter module. They are all broadly similar, and their common goal is to split the text into more reasonable segments.
Here we set a max_length. If you use chatgpt, the upper bound is 4096, because chatgpt allows at most 4096 input tokens. Converted to Chinese text this is effectively shorter, and the prompt's own tokens also count toward the limit, so a certain amount of headroom must be reserved.
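The headroom arithmetic can be made explicit. A sketch, where the overhead and reserve numbers are illustrative assumptions rather than measured values:

```python
def max_chunk_tokens(context_limit=4096, prompt_overhead=200,
                     answer_reserve=500, k=5):
    # Tokens left for recalled text after reserving room for the
    # prompt template itself and for the model's answer
    budget = context_limit - prompt_overhead - answer_reserve
    # If k recalled chunks must fit, each chunk gets at most this many tokens
    return budget // k


print(max_chunk_tokens())
```

With these assumed numbers, each of 5 recalled chunks could be at most 679 tokens; in practice Chinese text uses more tokens per character, so a smaller `max_length` like the 300 above is a safer choice.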
If the segmentation is poor, the impact on the output is considerable. We split by sentence here; in fact, splitting by the blog's subheadings is more reasonable, and that is what CSDN's Q&A bot does. A shameless plug here: its results are very good, surpassing all humans; if you are not convinced, go challenge it:
https://ask.csdn.net/
Later I will also find time to write a blog sharing the implementation details of the CSDN Q&A bot, so stay tuned.
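The "split by subtitle" idea mentioned above can be sketched as well. A minimal version, assuming markdown-style `#` headings (the function name and the sample document are illustrative):

```python
import re


def split_by_heading(markdown_text):
    # Split a markdown document into (heading, body) chunks at '#'-style headings
    chunks = []
    current_heading, current_lines = None, []
    for line in markdown_text.splitlines():
        if re.match(r'#{1,6}\s', line):
            # Close out the previous section before starting a new one
            if current_heading is not None or current_lines:
                chunks.append((current_heading, "\n".join(current_lines).strip()))
            current_heading, current_lines = line.lstrip('#').strip(), []
        else:
            current_lines.append(line)
    chunks.append((current_heading, "\n".join(current_lines).strip()))
    return chunks


doc = "## Intro\nHello.\n## Method\nWe use BERT.\nIt works."
print(split_by_heading(doc))
```

Each chunk then stays on one topic, which tends to make both recall and the final answer cleaner than fixed-length sentence packing.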
2. Vectorize the knowledge base
```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma


# Persist the embedding data to local disk
def persist_embedding(documents):
    persist_directory = 'db'
    embedding = OpenAIEmbeddings()
    vectordb = Chroma.from_documents(documents=documents, embedding=embedding,
                                     persist_directory=persist_directory)
    vectordb.persist()
    vectordb = None
```
Here OpenAIEmbeddings defaults to the text-embedding-ada-002 model for embedding; you can change it to something else, and langchain provides many other embedding integrations as well.
You can also load a local sentence-vector model for embedding. One thing to note: if you use openai's embedding model, you need network access to the OpenAI API.
After vectorization we need to save the result, so that next time we can simply load it. Here I use Chroma to store the vectorized data, but langchain also supports many other vector databases.
This is also my first time using Chroma; interested readers can dig into it themselves. FAISS is probably used more often. In the Q&A bot I use pgvector, because our database is PostgreSQL and pgvector is a vector-storage plugin for PG; there is no special reason beyond that. In fact, the various vector databases are all similar. What affects recall speed and quality is the index construction method, the best known being HNSW; look it up if you are interested.
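Whatever the store, recall ultimately reduces to nearest-neighbour search over vectors. A brute-force cosine-similarity sketch with toy 2-d vectors (illustrative only; real stores like FAISS or HNSW build indexes so they do not have to scan everything):

```python
import math


def cosine(a, b):
    # Cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def top_k(query_vec, doc_vecs, k=2):
    # Rank documents by similarity to the query, return indices of the best k
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.05], docs, k=2))
```

This brute-force scan is O(n) per query; index structures like HNSW trade a little accuracy for much faster approximate search on large collections.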
3. Recall
```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

retriever = None


def load_embedding():
    global retriever
    embedding = OpenAIEmbeddings()
    vectordb = Chroma(persist_directory='db', embedding_function=embedding)
    retriever = vectordb.as_retriever(search_kwargs={"k": 5})
```
k=5 means recalling the top-5 results.
The as_retriever function also has a search_type parameter, which defaults to similarity. The parameter is explained as follows:
search_type: search type, either "similarity" or "mmr". search_type="similarity" performs a similarity search in the retriever, selecting the text-chunk vectors most similar to the question vector. search_type="mmr" uses maximal marginal relevance search, which optimizes for diversity among the selected documents as well as similarity to the query.
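A minimal sketch of what mmr does: greedily trade off similarity to the query against similarity to the documents already selected. This mirrors the standard MMR formulation with a `lambda_mult` trade-off weight, not langchain's internal code; the toy similarity matrices are made up for illustration:

```python
def mmr_select(query_sim, doc_sim, k=2, lambda_mult=0.5):
    """query_sim[i]: similarity of doc i to the query.
    doc_sim[i][j]: similarity between docs i and j."""
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            # Penalize candidates that resemble an already-selected doc
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Doc 1 is almost a duplicate of doc 0; doc 2 is less relevant but diverse
query_sim = [0.9, 0.85, 0.6]
doc_sim = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
print(mmr_select(query_sim, doc_sim, k=2))
```

Plain similarity search would return docs 0 and 1 (near-duplicates); MMR instead picks 0 and 2, trading a little relevance for diversity.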
4. Use LLM for reading comprehension
```python
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate


def prompt(query):
    prompt_template = """Please note: carefully evaluate the relevance between the query and the Context below, and answer only based on the text in this Context. If the query is unrelated to the provided material, answer "I don't know" and do not give an irrelevant answer:
    Context: {context}
    Question: {question}
    Answer:"""
    PROMPT = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"]
    )
    docs = retriever.get_relevant_documents(query)
    # Build the prompt from the recalled docs and let the LLM answer
    chain = load_qa_chain(ChatOpenAI(temperature=0), chain_type="stuff", prompt=PROMPT)
    result = chain({"input_documents": docs, "question": query},
                   return_only_outputs=True)
    return result['output_text']
```
In effect, the recalled text becomes part of the prompt, and chatgpt then summarizes the answer from that prompt, which is exactly reading comprehension.
The segmentation issue mentioned above also shows up here: if the segmentation is poor, the recalled data is poor, and it is hard for chatgpt to summarize a good answer.
Note: network access to the OpenAI API is also required here.
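What the stuff chain does above can be mimicked with plain string formatting: concatenate the recalled chunks and substitute them into the template. A sketch (the real chain also handles document separators and token limits; the sample docs are illustrative):

```python
def build_prompt(template, docs, question):
    # Concatenate recalled chunks and substitute them into the template
    context = "\n\n".join(docs)
    return template.format(context=context, question=question)


template = ("Answer only from the context.\n"
            "Context: {context}\n"
            "Question: {question}\n"
            "Answer:")
docs = ["Sentence-BERT encodes sentences into vectors.",
        "Recall picks the top-k chunks."]
p = build_prompt(template, docs, "What does Sentence-BERT do?")
print(p)
```

The assembled string is what actually gets sent to the model, which is why recall quality translates so directly into answer quality.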
5. Effect
The answer it gives is correct.
Summary
1. The whole pipeline is essentially reading comprehension, but you can tune the prompt, for example: "Please combine the Context with your own knowledge to answer the following question".
2. All codes: https://github.com/seanzhang-zhichen/ChatBlog