Automated tool systems for LLMs (HuggingGPT, AutoGPT, WebGPT, WebCPM)

Augmented language models and tool learning were introduced briefly in the previous two blog posts. This article looks at four representative automation frameworks: HuggingGPT, AutoGPT, WebGPT, and WebCPM.

  • Augmented Language Models
  • Toolformer and Tool Learning (How LLMs use tools)

HuggingGPT
HuggingGPT
HuggingGPT belongs to the tool-augmented branch of tool learning. Specifically, it is a framework that uses an LLM as a controller to manage the many small expert models in the Hugging Face community. The user's natural-language request serves as a general interface: the LLM analyzes and plans the request, selects models according to the descriptions of the Hugging Face models, executes each subtask, and finally processes the results and returns them to the user.

As shown in the figure above, execution is divided into four steps:

  • Task planning. ChatGPT analyzes the user request and decomposes it into a series of solvable subtasks. Complex requests often involve multiple tasks, whose dependencies and execution order must be determined, so HuggingGPT uses specification-based instructions and demonstration-based parsing in its prompt design, as shown in the figure below. For the input "Look at /exp1.jpg, can you tell me how many objects are in the picture?", the model ends up with two subtasks:
[{"task": "image-to-text",
  "id": 0, "dep": [-1],
  "args": {"image": "/exp1.jpg"}},

 {"task": "object-detection",
  "id": 1, "dep": [-1],
  "args": {"image": "/exp1.jpg"}}]

  • Model selection. Based on the decomposed subtasks and its understanding of the model descriptions (function, architecture, supported languages and domains, license, etc.), ChatGPT selects among the expert models hosted on Hugging Face. As shown in the figure below, selection first recalls models that can handle the current subtask, then keeps the top-K models by download count as candidates, and finally feeds them as prompt context to ChatGPT for the final choice.

  • Task execution. Each selected expert model is called and executed, and its result is returned to ChatGPT.
  • Response generation. HuggingGPT first integrates all information from the first three stages (task planning, model selection, and task execution) into a concise summary, including the list of planned tasks, the selected models, and the inference results. ChatGPT then integrates the predictions of all the small models into a final response for the user, as shown below.
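The four stages above can be sketched as a minimal pipeline. This is an illustrative stand-in only: `plan_tasks`, `select_model`, `execute`, and `respond` are stubs for the ChatGPT calls and the Hugging Face hub, and the `MODEL_HUB` entries (names and download counts) are hypothetical.

```python
# Minimal sketch of HuggingGPT's four-stage loop; all LLM calls are stubbed.
# MODEL_HUB is a hypothetical stand-in for the Hugging Face model index.
MODEL_HUB = {
    "image-to-text": [
        {"name": "nlpconnect/vit-gpt2-image-captioning", "downloads": 500_000},
        {"name": "some/other-captioner", "downloads": 1_000},
    ],
    "object-detection": [
        {"name": "facebook/detr-resnet-50", "downloads": 800_000},
    ],
}

def plan_tasks(request: str) -> list[dict]:
    """Stage 1: stand-in for ChatGPT decomposing the request into subtasks."""
    return [
        {"task": "image-to-text", "id": 0, "dep": [-1], "args": {"image": "/exp1.jpg"}},
        {"task": "object-detection", "id": 1, "dep": [-1], "args": {"image": "/exp1.jpg"}},
    ]

def select_model(task: dict, top_k: int = 3) -> str:
    """Stage 2: recall candidates for the task, keep the top-k by downloads;
    the real system then lets ChatGPT pick one from the prompt context."""
    candidates = sorted(MODEL_HUB[task["task"]], key=lambda m: m["downloads"], reverse=True)
    return candidates[:top_k][0]["name"]  # stub: just take the most downloaded

def execute(task: dict, model: str) -> str:
    """Stage 3: stand-in for running the expert model."""
    return f"{model} ran {task['task']} on {task['args']['image']}"

def respond(results: list[str]) -> str:
    """Stage 4: stand-in for ChatGPT summarizing all inference results."""
    return " | ".join(results)

def hugginggpt(request: str) -> str:
    tasks = plan_tasks(request)
    results = [execute(t, select_model(t)) for t in tasks]
    return respond(results)
```

In the real framework every stage is itself a ChatGPT call, which is exactly what causes the efficiency problems discussed below.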

Since HuggingGPT can invoke any model "tool" in the community, it can also perform multimodal tasks. Its main disadvantages are:

  • Efficiency. Every stage of HuggingGPT needs to interact with the LLM, which makes the pipeline slow.
  • Context length. LLMs have a limited maximum number of tokens.
  • System stability. Failures come both from prediction and output errors of the LLM and from the uncontrollability and errors of the small expert models.

paper: HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
arxiv: https://arxiv.org/abs/2303.17580
code: https://huggingface.co/spaces/microsoft/HuggingGPT
code (unofficial alternative implementation, ChatGLM + Baidu small models): https://github.com/SolarWindRider/All-In-One

AutoGPT
As an open-source project that passed 200,000 stars on GitHub in a short time, AutoGPT is extremely popular. Like HuggingGPT, it decomposes tasks and executes them, but it is more powerful: it requires no mandatory human input (unattended operation), assigns itself goals, uses the internet and other tools to complete tasks in automatic loops, and manages short-term and long-term memory with databases and files. With search engines, web browsing, voice output, and more, it is more versatile and smarter.

AutoGPT mainly has the following characteristics:

  • Based on GPT-4. OpenAI's GPT-4 serves as the core of the system, responsible for completing tasks, generating new tasks from the completed results, and re-prioritizing tasks in real time.
  • Based on Pinecone. Pinecone is a vector search platform that provides efficient search and storage capabilities for high-dimensional vector data. In AutoGPT’s system, Pinecone is used to store and retrieve task-related data, such as task description, constraints, and results.
  • Based on LangChain. The LangChain framework allows AI agents to be data-aware and interact with their environment, resulting in more powerful and differentiated systems.
  • Task management. The system maintains a task list, represented as a double-ended queue, for managing and prioritizing tasks. It automatically creates new tasks based on completed results and re-prioritizes the list accordingly.
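The Pinecone-backed memory can be approximated with a toy in-memory vector store. This is a sketch only: the hash-based `embed` function is a fake embedding for illustration, standing in for a real embedding model, and `MemoryStore` mimics only the upsert/query pattern of a vector index.

```python
import math

# Toy stand-in for the Pinecone vector index AutoGPT uses as long-term
# memory: store (text, embedding) pairs, retrieve by cosine similarity.
# embed() is a fake byte-sum embedding purely for illustration.

def embed(text: str, dim: int = 8) -> list[float]:
    vec = [0.0] * dim
    for i, ch in enumerate(text.encode()):
        vec[i % dim] += ch
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit-normalize so dot = cosine

class MemoryStore:
    def __init__(self):
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def query(self, text: str, top_k: int = 1) -> list[str]:
        q = embed(text)
        scored = sorted(
            self.items,
            key=lambda it: sum(a * b for a, b in zip(q, it[1])),
            reverse=True,
        )
        return [t for t, _ in scored[:top_k]]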

As the execution diagram above (drawn with GPT-4) shows, AutoGPT includes the following steps:

  • Provide objective & task. Ask a question and set a goal.
  • Complete task. Decompose the main task and maintain a priority task queue.
  • Send task result. Execute the highest-priority task in the queue and obtain its result, storing it in Pinecone if necessary.
  • Add new tasks. When a task is completed, new subtasks are generated from its results without overlapping existing tasks. The system then re-prioritizes the task list according to the new tasks and their priorities, using GPT-4 to assist in prioritization.
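The loop above can be sketched with a `deque`-based task queue. Everything LLM-related here is a stub: `fake_llm_subtasks` stands in for GPT-4 proposing follow-up tasks, execution is faked, and re-prioritization is left as plain FIFO order.

```python
from collections import deque

# Sketch of AutoGPT's task loop: pop a task from the front of the queue,
# "execute" it, spawn non-duplicate subtasks from the result, repeat.
# The LLM is replaced by a deterministic stub for illustration.

def fake_llm_subtasks(task: str, result: str) -> list[str]:
    # Stand-in for GPT-4 proposing follow-up tasks from the result.
    return [f"review: {task}"] if not task.startswith("review:") else []

def run_agent(objective: str, max_steps: int = 10) -> list[str]:
    queue = deque([objective])
    done: list[str] = []
    for _ in range(max_steps):
        if not queue:
            break
        task = queue.popleft()        # execute the highest-priority task
        result = f"result of {task}"  # stub execution
        done.append(task)
        for new in fake_llm_subtasks(task, result):
            if new not in queue and new not in done:  # avoid duplicate tasks
                queue.append(new)
        # re-prioritization stub: real AutoGPT asks GPT-4 to reorder here
    return done
```

The duplicate check and the reorder hook correspond to the "do not overlap with existing tasks" and "re-prioritize with GPT-4" behaviors described above.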

Some of the operations AutoGPT can perform, backed by search engines and web browsing, are listed below:

Google Search: "google", args: "input": "<search>"
Browse Website: "browse_website", args: "url": "<url>", "question": "<what_you_want_to_find_on_website>"
Start GPT Agent: "start_agent", args: "name": "<name>", "task": "<short_task_desc>", "prompt": "<prompt>"
Message GPT Agent: "message_agent", args: "key": "<key>", "message": "<message>"
List GPT Agents: "list_agents", args: ""
Delete GPT Agent: "delete_agent", args: "key": "<key>"
Write to file: "write_to_file", args: "file": "<file>", "text": "<text>"
Read file: "read_file", args: "file": "<file>"
Append to file: "append_to_file", args: "file": "<file>", "text": "<text>"
Delete file: "delete_file", args: "file": "<file>"
Search Files: "search_files", args: "directory": "<directory>"
Evaluate Code: "evaluate_code", args: "code": "<full_code_string>"
Get Improved Code: "improve_code", args: "suggestions": "<list_of_suggestions>", "code": "<full_code_string>"
Write Tests: "write_tests", args: "code": "<full_code_string>", "focus": "<list_of_focus_areas>"
Execute Python File: "execute_python_file", args: "file": "<file>"
Task Complete (Shutdown): "task_complete", args: "reason": "<reason>"
Generate Image: "generate_image", args: "prompt": "<prompt>"
Do Nothing: "do_nothing", args: ""
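A listing like the one above is essentially a registry mapping command names to handlers with named arguments. A minimal dispatcher sketch follows; the handlers are toy stubs (no real file I/O), and the response shape mimics the JSON an agent emits, not AutoGPT's exact schema.

```python
# Minimal sketch of command dispatch for an AutoGPT-style command list:
# the LLM's response names a command and its args; a registry maps the
# name to a handler. Handlers here are illustrative stubs.

COMMANDS = {}

def command(name):
    def register(fn):
        COMMANDS[name] = fn
        return fn
    return register

@command("write_to_file")
def write_to_file(file: str, text: str) -> str:
    return f"wrote {len(text)} chars to {file}"  # stub, no real I/O

@command("do_nothing")
def do_nothing() -> str:
    return "no action"

def dispatch(response: dict) -> str:
    """response mimics agent output: {"command": {"name": ..., "args": {...}}}."""
    cmd = response["command"]
    handler = COMMANDS.get(cmd["name"])
    if handler is None:
        return f"unknown command: {cmd['name']}"
    return handler(**cmd.get("args", {}))
```

Unknown command names are reported back to the model rather than raising, which lets the loop recover from hallucinated commands.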

The prompt setting is shown below, including the constraints on task execution (constraints), the available resources (resources), and self-evaluation and reflection (performance_evaluations).

constraints: [
  '~4000 word limit for short term memory. Your short term memory is short, so immediately save important information to files.',
  'If you are unsure how you previously did something or want to recall past events, thinking about similar events will help you remember.',
  'No user assistance',
  'Exclusively use the commands listed below e.g. command_name'
]
resources: [
  'Internet access for searches and information gathering.',
  'Long Term memory management.',
  'GPT-3.5 powered Agents for delegation of simple tasks.',
  'File output.'
]
performance_evaluations: [
  'Continuously review and analyze your actions to ensure you are performing to the best of your abilities.',
  'Constructively self-criticize your big-picture behavior constantly.',
  'Reflect on past decisions and strategies to refine your approach.',
  'Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.',
  'Write all code to a file.'
]
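Lists like these are typically joined into one numbered system prompt. A minimal sketch of that assembly (the section titles and formatting are illustrative, not AutoGPT's exact template):

```python
# Sketch of assembling constraint/resource/evaluation lists into a single
# numbered system prompt, in the general style AutoGPT uses.

def build_prompt(sections: dict[str, list[str]]) -> str:
    parts = []
    for title, items in sections.items():
        lines = [f"{i}. {item}" for i, item in enumerate(items, 1)]
        parts.append(title.capitalize() + ":\n" + "\n".join(lines))
    return "\n\n".join(parts)
```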

github: https://github.com/Significant-Gravitas/Auto-GPT

WebGPT
WebGPT, produced by OpenAI, is based on GPT-3 and imitates human web-browsing behavior (clicking, scrolling, etc.) to find answers by searching for information. The scenario it targets is long-form question answering (LFQA). Compared with traditional machine reading comprehension or text question answering, LFQA requires not only finding the answer quickly and accurately in a document collection or even the whole web (retrieval), but also integrating the retrieved information into long passages (synthesis).

But the biggest problem with current LFQA systems is that they start only from the original question and are not interactive. Humans, by contrast, filter out high-quality information through real-time search; for complex questions in particular, they split off subtasks and search for them sequentially, so a model that can search interactively is very important. WebGPT therefore aims to answer complex questions by imitating human browsing behavior to search for and organize information. The training environment is based on Microsoft's Bing search, and the data is based on ELI5 (Explain Like I'm Five). The interface is shown below.

The model imitates human behaviors as shown in the figure below: issuing Bing API queries, clicking links, scrolling, quoting passages, etc. In this way the model collects paragraphs from web pages and then uses them to compose the answer.
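The browsing episode can be sketched as a small action loop over a toy environment. Everything here is a stand-in: `PAGES` and `CONTENT` fake a search index, and the comments name the corresponding WebGPT-style actions (search, click, quote, end); the real environment is Bing plus rendered web pages.

```python
# Sketch of WebGPT-style browsing as a sequence of text actions over a
# toy environment. The fake index below replaces Bing for illustration.

PAGES = {
    "llm tools": ["page-a", "page-b"],  # query -> search results
}
CONTENT = {
    "page-a": "LLMs can call external tools.",
    "page-b": "Browsing collects references.",
}

def browse(question: str, max_quotes: int = 2) -> dict:
    quotes: list[str] = []
    results = PAGES.get(question, [])  # action: Search <query>
    for page in results:               # action: Clicked on link <id>
        if len(quotes) >= max_quotes:
            break
        quotes.append(CONTENT[page])   # action: Quote
    return {"question": question, "quotes": quotes}  # action: End (answer)
```

The collected quotes are what the generation model conditions on when writing the long-form answer.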


The training method resembles OpenAI's InstructGPT and ChatGPT line of work, using human feedback and RLHF. Specifically, WebGPT comes in three sizes based on GPT-3: 760M, 13B, and 175B. The training stages are the same as InstructGPT's and ChatGPT's, with slightly different names.

  • Behavior cloning (BC). Supervised fine-tuning on human demonstrations, with the demonstrators' actions serving as labels.
  • Reward modeling (RM). A reward model is trained on top of the BC model to predict the probability that one answer is better than another.
  • Reinforcement learning (RL). The BC model is fine-tuned with the PPO algorithm against the reward model.
  • Rejection sampling (best-of-n). An alternative to RL against the RM: a fixed number of answers (4, 16, or 64) is sampled from the BC or RL model, and the RM selects the highest-scoring one.
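Best-of-n is simple enough to sketch directly. The toy policy and reward model below are illustrative stubs (a canned-answer generator and a length-based scorer), standing in for the BC/RL model and the trained RM.

```python
import itertools

# Sketch of best-of-n (rejection) sampling: draw n answers from the
# generation model, score each with the reward model, keep the best.
# No gradient step is taken, which is why it is an alternative to RL.

def best_of_n(sample_answer, reward_model, question: str, n: int = 16) -> str:
    samples = [sample_answer(question) for _ in range(n)]
    return max(samples, key=reward_model)

# Toy stand-ins for the policy and the RM, purely for illustration.
def make_toy_policy(answers):
    cycle = itertools.cycle(answers)
    return lambda question: next(cycle)

def toy_reward(answer: str) -> float:
    return float(len(answer))  # pretend longer answers score higher
```

In WebGPT this simple procedure is competitive with PPO, at the cost of n forward passes per question at inference time.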

paper: https://cdn.openai.com/WebGPT.pdf

WebCPM
WebCPM is Tsinghua's model based on BMTools, from the same group as the Tool Learning authors of the previous blog post. It is also the first open-source framework for Chinese question answering based on interactive web search. The idea is similar to WebGPT's, but it is open source.

The target scenario is the same as WebGPT's: long-form question answering (LFQA). Current approaches to LFQA generally adopt a retrieve-then-synthesize paradigm with two core steps: information retrieval (collecting relevant information from search engines) and information synthesis (integrating the collected information into an answer). But these are non-interactive methods and cannot, like humans, search for more diverse information through multiple rounds of collection and filtering. Moreover, the relevant details of WebGPT have not been fully disclosed, so the fully open-source WebCPM is very valuable.

Its interface is shown below, and its actions are similar to WebGPT's, including searching Bing, going back, browsing pages, scrolling, quoting, etc.

The model framework is shown in the figure below. It still comprises a search stage (question and facts) and a synthesis stage (answer), built from four small modules.

  • Action Prediction Module (gray). Predicts the next search behavior, i.e. decides which specific action to perform, via a simple 10-way classification over the action space.
  • Query Generation Module (blue). Generates the query Q_{t+1} for the next Bing search.

  • Fact Extraction Module (purple). Extracts supporting facts, quoting relevant information while browsing web pages.
  • Synthesis Model (green). Generate coherent answers based on the information gathered.
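The four modules chain into one search loop, which can be sketched as follows. Every module body here is a stub for the corresponding fine-tuned model; the 10-action list and the wiring follow the description above, while the internals (and the stop condition) are illustrative assumptions.

```python
# Sketch of WebCPM's four modules chained into one loop. Each function is
# a stub for a fine-tuned model; only the wiring is meant to be faithful.

ACTIONS = [
    "search", "load_page", "scroll_down", "scroll_up", "go_back",
    "quote", "merge", "next_page", "prev_page", "finish",
]  # the 10-way action space the prediction module classifies over

def predict_action(state: dict) -> str:
    # Action prediction stub: search first, quote once, then finish.
    if not state["queries"]:
        return "search"
    if not state["facts"]:
        return "quote"
    return "finish"

def generate_query(state: dict) -> str:
    # Query generation stub (produces Q_{t+1} for the next Bing search).
    return "query for: " + state["question"]

def extract_fact(state: dict) -> str:
    # Fact extraction stub: quote supporting information from a page.
    return "fact found via " + state["queries"][-1]

def synthesize(state: dict) -> str:
    # Synthesis stub: compose an answer from the collected facts.
    return "answer from " + "; ".join(state["facts"])

def webcpm(question: str, max_steps: int = 10) -> str:
    state = {"question": question, "queries": [], "facts": []}
    for _ in range(max_steps):
        action = predict_action(state)
        if action == "search":
            state["queries"].append(generate_query(state))
        elif action == "quote":
            state["facts"].append(extract_fact(state))
        elif action == "finish":
            break
    return synthesize(state)
```

The separation matters: search-stage modules decide *what to look for next*, while the synthesis module only ever sees the accumulated facts.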

This process can be understood in detail with the examples in the paper.

paper: https://arxiv.org/abs/2305.06849
code: https://github.com/thunlp/WebCPM