Build a ChatBlog with LangChain

Article directory

  • Foreword
  • Environment
  • 1. Build a knowledge base
  • 2. Vectorize the knowledge base
  • 3. Recall
  • 4. Use LLM for reading comprehension
  • 5. Effect
  • Summary

Foreword

This article shows how to use langchain to build your own knowledge-base Q&A system.
The principles behind most ChatPDF-style products are similar, and I break the process into four steps:

  1. Build the knowledge base
  2. Vectorize the knowledge base
  3. Recall
  4. Reading comprehension with an LLM

Next, let's see how to turn one of my own blog posts into a ChatBlog.
The knowledge base used in this article comes from a blog I wrote earlier: Retrieval question answering system based on Sentence-Bert.

Environment

As usual, the environment is an essential part:

langchain==0.0.148
openai==0.27.4
chromadb==0.3.21
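
The code below also uses beautifulsoup4 and lxml for HTML parsing, which are not pinned in the original post; a minimal install might look like this:

pip install langchain==0.0.148 openai==0.27.4 chromadb==0.3.21 beautifulsoup4 lxml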

1. Build a knowledge base

This step is fairly simple, so straight to the code:

import re

from bs4 import BeautifulSoup
from langchain.docstore.document import Document


def get_blog_text():
    data_path = 'blog.txt'
    with open(data_path, 'r') as f:
        data = f.read()
    soup = BeautifulSoup(data, 'lxml')
    text = soup.get_text()
    return text


# Custom sentence segmentation to make sure sentences are not truncated mid-way
def split_paragraph(text, max_length=300):
    """Split the article into paragraphs of at most max_length characters."""
    text = text.replace('\n', '')
    text = text.replace('\n\n', '')
    text = re.sub(r'\s+', ' ', text)

    # First split the article into sentences, keeping the punctuation
    sentences = re.split(r'([;；。！？!?.])', text)

    new_sents = []
    for i in range(int(len(sentences) / 2)):
        sent = sentences[2 * i] + sentences[2 * i + 1]
        new_sents.append(sent)
    if len(sentences) % 2 == 1:
        new_sents.append(sentences[len(sentences) - 1])

    # Then merge sentences into paragraphs of at most max_length characters
    paragraphs = []
    current_length = 0
    current_paragraph = ""
    for sentence in new_sents:
        sentence_length = len(sentence)
        if current_length + sentence_length <= max_length:
            current_paragraph += sentence
            current_length += sentence_length
        else:
            paragraphs.append(current_paragraph.strip())
            current_paragraph = sentence
            current_length = sentence_length
    paragraphs.append(current_paragraph.strip())

    documents = []
    for paragraph in paragraphs:
        new_doc = Document(page_content=paragraph)
        print(new_doc)
        documents.append(new_doc)
    return documents

content = get_blog_text()
documents = split_paragraph(content)

Note that I did not use the document-splitting functions that langchain provides. langchain ships with many text splitters; interested readers can look at the source code in langchain's text_splitter module. They are all broadly similar, and the goal of each is to split the text into more reasonable chunks.
[Screenshot: text splitter classes provided by langchain]
We also set a max_length here. If you use ChatGPT, the model's context window is 4096 tokens; Chinese text uses up tokens faster, and the prompt itself also counts toward the limit, so you need to leave some headroom when choosing this length.
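
If you would rather use langchain's built-in splitters than the hand-rolled function above, here is a minimal sketch with RecursiveCharacterTextSplitter; the chunk_size and separators are illustrative assumptions, not values from the original post:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative settings: chunk_size mirrors max_length above, and the separators
# favour sentence-ending punctuation so chunks are less likely to cut sentences in half
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=0,
    separators=["\n\n", "\n", "。", "！", "？", ".", "!", "?", " ", ""],
)
documents = splitter.create_documents([get_blog_text()])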

If the segmentation is poor, it has a big impact on the output. Here we split by sentence, but for a blog it is actually more reasonable to split by subheading, which is what CSDN's Q&A bot does. Shameless plug: it works very well, better than any human, and if you don't believe it you can go challenge it yourself:
https://ask.csdn.net/

Later I will also find time to write a blog post about the CSDN Q&A bot and share the implementation details, so stay tuned.


2. Vectorize the knowledge base

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma


# Persist the vectorized data
def persist_embedding(documents):
    # Persist the embeddings to local disk
    persist_directory = 'db'
    embedding = OpenAIEmbeddings()
    vectordb = Chroma.from_documents(documents=documents, embedding=embedding, persist_directory=persist_directory)
    vectordb.persist()
    vectordb = None

Here OpenAIEmbeddings uses the text-embedding-ada-002 model by default for the embedding; you can switch to something else. langchain provides the following embedding integrations:
[Screenshot: embedding integrations provided by langchain]
You can also load a local sentence-embedding model for the embedding step. Note that if you use OpenAI's embedding model, you need network access to the OpenAI API (in mainland China that means a VPN).
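
For example, here is a minimal sketch of swapping in a local sentence-transformers model via HuggingFaceEmbeddings; the model name is an illustrative assumption, not the one from the original post:

from langchain.embeddings import HuggingFaceEmbeddings

# Illustrative model choice; requires the sentence-transformers package installed locally
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
vectordb = Chroma.from_documents(documents=documents, embedding=embedding, persist_directory='db')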

After vectorization we save the result, so that next time we can simply load it instead of recomputing. Here I use Chroma to store the vectorized data, but langchain also supports other vector databases:
[Screenshot: vector stores supported by langchain]
This is also my first time using Chroma; interested readers can dig into it themselves. FAISS is probably used more often. In the CSDN Q&A bot I actually use pgvector, simply because our database is PostgreSQL and pgvector is a vector-storage extension for it; there is no special reason beyond that. In practice the various vector databases are much alike. What really affects recall speed and quality is how the index is built, the best-known method being HNSW; look it up if you are interested.
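
As a hedged sketch of the FAISS route (the index path is an illustrative assumption, not from the original code):

from langchain.vectorstores import FAISS

# Build an in-memory FAISS index from the same documents and embeddings,
# save it to disk, and reload it later instead of recomputing
faiss_db = FAISS.from_documents(documents, OpenAIEmbeddings())
faiss_db.save_local("faiss_index")   # path chosen for illustration
faiss_db = FAISS.load_local("faiss_index", OpenAIEmbeddings())
docs = faiss_db.similarity_search("an example query", k=5)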

3. Recall

retriever = None


def load_embedding():
    global retriever
    embedding = OpenAIEmbeddings()
    vectordb = Chroma(persist_directory='db', embedding_function=embedding)
    retriever = vectordb.as_retriever(search_kwargs={"k": 5})

k=5 means the top-5 most similar chunks are recalled.

The as_retriever function also has a search_type parameter, which defaults to "similarity". The options are:

search_type: "similarity" or "mmr". With search_type="similarity" the retriever selects the text-chunk vectors most similar to the question vector. With search_type="mmr" it uses maximal marginal relevance search, which optimizes for similarity to the query while also promoting diversity among the selected documents.
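
For example, switching the retriever inside load_embedding above to MMR is a one-line change:

# Use maximal marginal relevance instead of plain similarity search
retriever = vectordb.as_retriever(search_type="mmr", search_kwargs={"k": 5})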

4. Use LLM for reading comprehension

from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI


def prompt(query):
    prompt_template = """Please note: carefully evaluate the relevance between the query and the Context below, and answer only based on the Context provided. If the query is unrelated to the provided material, answer "I don't know" instead of giving an irrelevant answer:
    Context: {context}
    Question: {question}
    Answer:"""
    PROMPT = PromptTemplate(
        template=prompt_template, input_variables=["context", "question"]
    )
    docs = retriever.get_relevant_documents(query)
    # Stuff the recalled docs into the prompt and let the LLM answer
    chain = load_qa_chain(ChatOpenAI(temperature=0), chain_type="stuff", prompt=PROMPT)
    result = chain({"input_documents": docs, "question": query}, return_only_outputs=True)

    return result['output_text']

In essence, the recalled text becomes part of the prompt, and ChatGPT then extracts the answer from that prompt, exactly like a reading-comprehension task.
The segmentation issue mentioned earlier shows up here again: if the splitting is poor, the recalled chunks are poor, and it becomes hard for ChatGPT to extract a good answer.
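
Putting the pieces together, a minimal end-to-end sketch might look like this (the question string is just an illustrative assumption):

if __name__ == '__main__':
    content = get_blog_text()
    documents = split_paragraph(content)
    persist_embedding(documents)   # only needed on the first run
    load_embedding()
    print(prompt("What is Sentence-Bert used for in the retrieval QA system?"))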

Note: network access to the OpenAI API is also required here.

5. Effect

[Screenshot: the ChatBlog answering a question about the source blog post]
The answer is spot on.

Summary

1. The whole pipeline is essentially reading comprehension, but you can adjust the prompt, for example: "Please answer the following question by combining the Context with your own existing knowledge" (see the sketch after this list)
2. Full code: https://github.com/seanzhang-zhichen/ChatBlog
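
A hedged sketch of such a relaxed prompt template (the wording is mine, not taken from the original repo):

# Illustrative variant that lets the model fall back on its own knowledge
relaxed_prompt_template = """Please answer the question by combining the Context below with your own existing knowledge. If neither is sufficient, say you are not sure:
Context: {context}
Question: {question}
Answer:"""
RELAXED_PROMPT = PromptTemplate(
    template=relaxed_prompt_template, input_variables=["context", "question"]
)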