LLM application development based on LangChain 6 – Evaluation

I examine myself three times a day.

In the pre-AI era, most programming languages already had mature unit testing support for verifying applications.

Unit testing is a software testing method that verifies that individual units of code in a program work as expected. A unit of code is the smallest testable part of software. In object-oriented programming, this is usually a method, whether in a base class (superclass), an abstract class, or a derived class (subclass). Unit tests are typically written by software developers to ensure that the code they write meets software requirements and follows development goals.
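For example, a minimal Python unit test in the standard unittest style might look like this (the add function and its expected values are made up purely for illustration):

import unittest

def add(a: int, b: int) -> int:
    """The unit under test: a deliberately trivial function."""
    return a + b

class TestAdd(unittest.TestCase):
    def test_add_positive_numbers(self):
        # assertEqual compares the actual result with the expected one
        self.assertEqual(add(2, 3), 5)

    def test_add_negative_numbers(self):
        self.assertEqual(add(-1, -1), -2)

if __name__ == "__main__":
    unittest.main()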

In the AI era, unit tests can even be generated automatically by tools such as GitHub Copilot and Cursor.

When building complex applications on top of an LLM, we face a similar problem: how do we evaluate the application's performance? Does it meet our acceptance criteria, and does it actually work as expected? This step is important and sometimes a little tricky. Moreover, we may decide to change the implementation, for example switching to a different LLM (very likely, since LLMs are constantly being upgraded and the differences between versions can be substantial, or because force majeure pushes us from a foreign LLM to a domestic one), changing how we use the vector database or swapping it for another one, using a different retrieval method, or tuning other parameters of the system. How do we know whether the results got better or worse?

In this article, we will discuss how to evaluate LLM-based applications, introduce some tools to help with evaluation, and finally introduce the evaluation platform under development.

If you develop an application with LangChain, the application is really a composition of many different chains run in sequence. The first task, then, is to understand what the input and output of each step are, and we will use some visualization and debugging tools to help. Testing the model against a large number of different data sets then gives us a comprehensive picture of its performance.

One way is to inspect the results with our own eyes; a natural next idea is to use large language models and chains to evaluate other language models, chains, and applications.

Preparation

Similarly, first initialize the environment variables through the .env file. Remember that we are using Microsoft Azure’s GPT. For details, please refer to the first article of this column.

from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv()) # read local .env file

deployment = "gpt-4"
model = "gpt-3.5-turbo"

We use the documentation Q&A program from the previous article for evaluation. For the convenience of reading, the corresponding source code is listed here. For specific explanation, please refer to the previous article in the column.

from langchain.chains import RetrievalQA
# from langchain.chat_models import ChatOpenAI
from langchain.chat_models import AzureChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])
# llm = ChatOpenAI(temperature = 0.0, model=llm_model)
llm = AzureChatOpenAI(temperature=0, model_name=model, deployment_name=deployment)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs={
        "document_separator": "<<<<>>>>>"
    }
)

Generate data set

To evaluate, first figure out what kind of data set to use.

Manually generate data sets

The first way is to write some good example data sets by hand. Since large language models are still prone to hallucination, I think this is essential for now; hand-crafted examples give us more confidence.

Take a look at two rows of data from the CSV document:

Document(page_content="shirt id: 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 10})

Document(page_content='shirt id: 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.', metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 11})

For these two rows of data, you can write the following questions and answers.

examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

This method does not scale well: it takes time to read through the documents and understand their contents. In the AI era, shouldn't this be more automated?

Let’s take a look at how to automatically generate a test data set.

Automatically generate test data set

LangChain provides a QAGenerateChain chain that reads documents and generates a question-and-answer pair from each one.

from langchain.evaluation.qa import QAGenerateChain
example_gen_chain = QAGenerateChain.from_llm(AzureChatOpenAI(deployment_name=deployment))
new_examples = example_gen_chain.apply_and_parse(
    [{<!-- -->"doc": t} for t in data[:3]]
)
for example in new_examples:
    print(example)

The program above uses the apply_and_parse method so that the results are parsed into Python dictionaries, which makes subsequent processing easier.

Running the program yields the following results:

{'qa_pairs': {'query': "What is the approximate weight of the Women's Campside Oxfords?", 'answer': 'The approximate weight is 1 lb. 1 oz. per pair.'}}
{'qa_pairs': {'query': 'What are the dimensions of the small and medium Recycled Waterhog Dog Mat?', 'answer': 'The small Recycled Waterhog Dog Mat measures 18" x 28", while the medium one measures 22.5" x 34.5".'}}
{'qa_pairs': {'query': 'What is the name of the shirt with id: 2?', 'answer': "Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece"}}

The automatically generated examples are then appended to the manually written ones.

examples.extend([example["qa_pairs"] for example in new_examples])

Evaluation

Now that we have the data set, how do we evaluate it?

Track a single call

Let’s first observe an example:

qa.run(examples[0]["query"])

Pass in the first question from the data set and run it; you will get the result:

‘Yes, the Cozy Comfort Pullover Set does have side pockets. The pull-on pants in the set feature a wide elastic waistband, drawstring, side pockets, and a modern slim leg.’

But we cannot observe the internal details of the chain: What prompt was passed to the large language model? Which documents were retrieved from the vector database? For a complex chain with multiple steps, what are the intermediate results of each step? Seeing only the final result is not enough. (Compare it to being unable to see a stack trace when a program fails, or the call chain when calling microservices.)

To solve this problem, LangChain provides a debug switch. Enable debug mode with langchain.debug = True and execute qa.run(examples[0]["query"]). Remember to turn it off afterwards with langchain.debug = False, otherwise the debug output will keep flooding the screen.

import langchain
langchain.debug = True
qa.run(examples[0]["query"])
# Turn off the debug mode
langchain.debug = False

The debug output shows every step of the call: first the RetrievalQA chain is invoked, then StuffDocumentsChain (when building the RetrievalQA chain we passed chain_type="stuff"), and finally LLMChain. There you can see the incoming question (Do the Cozy Comfort Pullover Set have side pockets?) and the context, which is assembled from the four document chunks retrieved for the question: shirt id: 10, shirt id: 73, shirt id: 419 and shirt id: 632. The chunks are separated by <<<<>>>>>, which we configured earlier through chain_type_kwargs = {"document_separator": "<<<<>>>>>"}.

When building a document question answering system, keep in mind that a wrong answer is not necessarily the fault of the large language model itself; things may already have gone wrong at the retrieval step. Looking at the context returned by retrieval helps pinpoint the problem.
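To check the retrieval step in isolation, you can query the vector store directly and skip the LLM entirely. A minimal sketch reusing the index built earlier (similarity_search is the generic vector-store query method; k=4 here simply mirrors the number of chunks seen in the debug output):

# Inspect retrieval on its own: fetch the chunks for the first test question
docs = index.vectorstore.similarity_search(examples[0]["query"], k=4)
for i, doc in enumerate(docs):
    print(f"--- chunk {i} (row {doc.metadata.get('row')}) ---")
    print(doc.page_content[:200])  # print only the beginning of each chunk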

Remember, what we are doing now is essentially prompt-based programming, so we should keep looking at the prompts passed to the LLM; it is also a good opportunity to learn from the AI experts. Take a look at the "persona" given to the system: "Use the following pieces of context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer." This instruction is very important for a question answering system: it keeps the LLM from making things up. If it knows the answer, it answers; if it does not, it simply says "I don't know."
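If you prefer to read that prompt from code rather than from the debug output, you can usually pull it out of the chain object. A sketch that assumes the RetrievalQA chain wraps a StuffDocumentsChain internally; the attribute path may differ between LangChain versions:

# Print the prompt used by the underlying LLMChain of the "stuff" chain
print(qa.combine_documents_chain.llm_chain.prompt)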

Have you noticed why the call chain is numbered 1, 3, 4, 5? This is probably a bug in the newer version of LangChain; in Andrew Ng's video the numbers are 1, 2, 3, 4.
[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain > 5:llm:AzureChatOpenAI]

Batch check

We now know how to check the status of a single call, so for all the examples we created manually and automatically, how do we batch check whether they are correct?

LangChain provides the QAEvalChain chain to help determine whether the answers produced by the question answering system match the answers in the test data set. It plays the same role as assertEqual in ordinary unit tests.

predictions = qa.apply(examples)
from langchain.evaluation.qa import QAEvalChain
llm = AzureChatOpenAI(temperature=0, model_name=model, deployment_name=deployment)
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(examples, predictions)
for i, eg in enumerate(examples):
    print(f"Example {<!-- -->i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    # print("Predicted Grade: " + graded_outputs[i]['text'])
    print("Predicted Grade: " + graded_outputs[i]['results'])
    print()

First run the retrieval chain to generate the predictions: predictions = qa.apply(examples). Each prediction has the following structure (taking the first element as an example): query is the question, answer is the reference answer, and result is the answer produced by our question answering system.

{'query': 'Do the Cozy Comfort Pullover Set have side pockets?',
 'answer': 'Yes',
 'result': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'}

Then create eval_chain and call eval_chain.evaluate(examples, predictions) to grade the results (passing in examples is a bit redundant, since predictions already contains the questions and answers). The verdict for each example is placed in graded_outputs[i]['results']: CORRECT or INCORRECT.

Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set does have side pockets.
Predicted Grade: CORRECT
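Beyond reading each example by eye, you can roll the grades up into a single accuracy figure. A minimal sketch over the graded_outputs computed above (it assumes the 'results' key shown earlier; older versions may return the grade under 'text' instead):

# Aggregate the CORRECT / INCORRECT grades into an accuracy score
correct = sum(1 for g in graded_outputs if g["results"].strip().startswith("CORRECT"))
print(f"Accuracy: {correct}/{len(graded_outputs)} = {correct / len(graded_outputs):.0%}")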

Evaluation Platform

Everything above runs locally. LangChain is also developing an online evaluation platform. The original URL was https://www.langchain.plus/ and it is now https://smith.langchain.com/ (Agent Smith of The Matrix?).

The evaluation platform currently requires an invitation code to register. You can try it with this invitation code: lang_learners_2023. At the time of writing this article, this invitation code is still available.

The evaluation platform is easy to use: simply set a few environment variables, run the application you developed with LangChain, and the call chain appears on the platform. The interface is quite pleasing to the eye.
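As a rough sketch, these are the tracing-related environment variables commonly used at the time of writing; the exact names and endpoint may have changed since, so check the platform's documentation, and the API key below is just a placeholder:

import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"     # placeholder
os.environ["LANGCHAIN_PROJECT"] = "doc-qa-evaluation"  # any project name you like

# Run the chain as usual; the trace shows up on the platform
qa.run(examples[0]["query"])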

Note that a considerable amount of information will be uploaded. If it is a sensitive project, do not enable this.

Reference

  1. Short course: https://learn.deeplearning.ai/langchain/lesson/6/evaluation