PromptAgent: Strategic planning with LLMs for expert-level prompt optimization

A team from the University of California has proposed PromptAgent, a framework for automatic prompt optimization. It combines the self-reflection ability of large language models with Monte Carlo tree search (MCTS) planning, iteratively critiquing and revising prompts to find a path toward an optimal prompt.
For open-source LLMs, some papers point out that their internal states or gradients can be used to train additional parameters such as soft prompts, or to obtain discrete prompts via gradient-based search or RL-based methods. However, these approaches are unavailable for closed-source models accessed only through an API (e.g., GPT-3.5-turbo, GPT-4). Gradient-free prompt optimization methods instead follow an iterative, sampling-based procedure: starting from an initial prompt, they repeatedly sample candidate prompts, score them, and select the best one for the next iteration. Such methods emphasize generating diverse prompt candidates, including editing-based methods (deleting or swapping phrases, back-translation, etc.) and sampling-based optimization methods (Monte Carlo search, Gibbs sampling, beam search, etc.). PromptAgent differs from all of the methods above: it can turn a mediocre initial prompt into one comparable to prompts designed by human experts.

Paper title: PROMPTAGENT: STRATEGIC PLANNING WITH LANGUAGE MODELS ENABLES EXPERT-LEVEL PROMPT OPTIMIZATION

Paper link: https://arxiv.org/pdf/2310.16427.pdf

First, let's look at the differences between an ordinary user's prompt, a prompt produced by sampling-based methods, and an expert-level prompt, using the example given in the paper.

Take the named entity recognition (NER) task in the biomedical domain as an example: it requires extracting entities such as diseases from sentences.
(Figure: a normal user prompt vs. an expert-level prompt)
In this example, an ordinary user's prompt is:

Extract the disease or condition (if any) from the sentence

A simple, rough prompt like this may handle some easy cases, but its performance is poor.
Compared with prompts produced by sampling-based methods, an expert-level prompt provides richer domain-specific detail and structured guidance, including a task description, domain knowledge, solution guidance, exception handling, and output format.
PromptAgent identifies errors in the results produced by the current prompt and keeps optimizing it. The final prompt it arrives at is as follows:

Your task is to extract a disease or condition... Avoid including any relevant elements such as inheritance patterns (such as autosomal dominant inheritance), genes or gene loci (such as PAH), proteins or biological pathways...
Think about specific diseases as well as broader categories, and remember that diseases and conditions can also appear as common abbreviations or variations. Provide the identified disease or condition in the following format: {entity_1,entity_2,....} …
Note that the term "locus" should be considered a genomic location rather than a disease name.

(Figure: MCTS state-action transition trajectory on the NCBI task)
The figure above depicts the MCTS state-action transition trajectory of the highest average reward path on the NCBI task in the PromptAgent framework. The initial state $s_0$ is the human-written prompt. At each state transition step, the previous state is revised based on error feedback to produce a new prompt; highlighted colors mark similar domain-specific insights. The final state integrates the information accumulated along the whole trajectory, and the F1 score improves from 0.521 to 0.645.

Method

PromptAgent formulates the prompt optimization problem as a Markov decision process (MDP) given by the tuple $(S, A, T, r)$. Here, $S$ is the state space, $A$ is the action space, $T$ is the transition function $T: S \times A \to S$, and $r$ is the reward function $r: S \times A \to \mathbb{R}$.
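
To make the notation concrete, here is a minimal sketch of how these four components might be typed in Python, assuming the state is simply the prompt text and the action is an error-feedback string (the concrete type choices are illustrative assumptions, not definitions from the paper):

```python
from typing import Callable

State = str    # s: the current task prompt text
Action = str   # a: natural-language error feedback from the optimizer LLM

# Transition function T : S x A -> S, realized by an LLM that rewrites the prompt.
Transition = Callable[[State, Action], State]

# Reward function r : S x A -> R, e.g., task performance of the new prompt on a held-out set.
Reward = Callable[[State, Action], float]
```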
As shown in Figure (a) below, for any given current state $s_t$, PromptAgent iteratively generates an action $a_t \sim p_{O}(a \mid s_t, m_1)$, where $m_1$ is a meta-prompt used by the optimizer LLM $O$ to facilitate action generation. Specifically, Figure (b) below shows the two-step process of action generation: collecting errors made by the base model on training samples (step 1), and reflecting on these errors to derive useful error feedback (step 2). PromptAgent then obtains a new state through the transition function $p_{O}(s_{t+1} \mid s_t, a_t, m_2)$, where $m_2$ is another meta-prompt, also operated by $O$, that guides the state transition to update the prompt. More specifically, given the current error feedback as the action, $m_2$ asks the optimizer to generate a new prompt (state) that draws on relevant domain knowledge and effectively resolves the model errors, much like a prompt expert revising a prompt in light of error feedback. Finally, the quality of each newly generated state $s_t$ obtained after applying action $a_t$ is determined by the reward function $r_t = r(s_t, a_t)$. As with reward engineering in reinforcement learning (RL), the reward design can be intricate if it must encode domain-specific knowledge or preferences for the task of interest. To keep the framework general across tasks, the reward is defined directly as task performance on a held-out set separate from the given training samples; its exact definition depends on the task-specific metric.
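
As a concrete illustration of a single optimization step, the sketch below strings these pieces together: collect base-model errors on a training batch, reflect on them to obtain error feedback (the action $a_t$), rewrite the prompt into a new state $s_{t+1}$, and score it on a held-out set as the reward. The `call_llm` wrapper and the two meta-prompt strings are hypothetical stand-ins, not the paper's actual meta-prompts $m_1$ and $m_2$:

```python
def call_llm(model: str, prompt: str) -> str:
    """Hypothetical wrapper around whatever chat-completion API is available."""
    raise NotImplementedError

# Simplified, illustrative stand-ins for the meta-prompts m_1 (reflection) and m_2 (rewriting).
META_REFLECT = (
    "The prompt below made the following errors on training examples.\n"
    "Prompt: {prompt}\nErrors:\n{errors}\n"
    "Explain what is wrong with the prompt and how it should be improved."
)
META_REWRITE = (
    "Revise the prompt so that the feedback below is addressed.\n"
    "Prompt: {prompt}\nFeedback: {feedback}\nNew prompt:"
)

def generate_action(prompt: str, batch: list[dict]) -> str:
    """Steps 1-2: collect base-model errors, then reflect on them to get error feedback (a_t)."""
    errors = []
    for ex in batch:
        pred = call_llm("gpt-3.5-turbo", f"{prompt}\n{ex['input']}")
        if pred.strip() != ex["answer"]:
            errors.append(f"input: {ex['input']} | predicted: {pred} | expected: {ex['answer']}")
    return call_llm("gpt-4", META_REFLECT.format(prompt=prompt, errors="\n".join(errors)))

def transition(prompt: str, feedback: str) -> str:
    """State transition: rewrite the prompt into s_{t+1} using the feedback a_t."""
    return call_llm("gpt-4", META_REWRITE.format(prompt=prompt, feedback=feedback))

def reward(prompt: str, holdout: list[dict]) -> float:
    """Reward r(s_t, a_t): accuracy of the new prompt on a held-out set."""
    correct = sum(
        call_llm("gpt-3.5-turbo", f"{prompt}\n{ex['input']}").strip() == ex["answer"]
        for ex in holdout
    )
    return correct / len(holdout)
```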

PromptAgent framework design

The goal of PromptAgent is to effectively integrate expert-level prior knowledge into task prompts while searching the vast prompt space efficiently and strategically. The expert knowledge is elicited from large models such as GPT-4, and the search strategy is the well-known Monte Carlo tree search. The overall framework is shown in the figure below:
(Figure: overall PromptAgent framework)
This article defines the task prompt as the state $s_t$, and the act of modifying the prompt as an action $a_t$. As shown in (b) above:

  1. Given the current state $s_t$ (i.e., the initial prompt), the base model (gpt-3.5-turbo) produces initial outputs on the task dataset. These outputs are often unsatisfactory and need further optimization.
  2. The optimizer model (gpt-4) provides error feedback and suggests improvements.
  3. The optimizer model updates the prompt based on the feedback and transitions to the next state $s_{t+1}$.

This cycle repeats until it finally yields an expert-level prompt.
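
Under the same assumptions as the earlier sketch, this cycle can be written as a simple loop; this is a purely greedy simplification, before any tree search is added on top:

```python
def greedy_optimize(prompt: str, train: list[dict], holdout: list[dict], steps: int = 5) -> str:
    """Repeat feedback -> rewrite -> evaluate, keeping the best prompt seen so far."""
    best_prompt, best_score = prompt, reward(prompt, holdout)
    for _ in range(steps):
        feedback = generate_action(prompt, train)   # optimizer LLM reflects on base-model errors
        prompt = transition(prompt, feedback)       # optimizer LLM rewrites the prompt
        score = reward(prompt, holdout)             # evaluate the new prompt on the held-out set
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt
```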

This prompt optimization process allows PromptAgent to be combined seamlessly with principled planning algorithms, in particular Monte Carlo tree search (MCTS), to obtain the most general expert-level prompt.

Strategy optimization process

Monte Carlo tree search (MCTS) performs strategic search by incrementally building a tree structure, as shown in (a) above, where each node represents a state and each edge represents an action triggering a state transition. MCTS searches iteratively through four steps: selection, expansion, simulation, and back-propagation. The iteration ends after a predefined number of iterations, and the path with the highest reward is used to produce the final prompt.
Figure 1

  1. Selection: select the most promising node at each level for further expansion and exploration. Each iteration starts from the root node, traverses the tree level by level, choosing one child node per level, and stops at a leaf node. When choosing child nodes, the upper confidence bounds applied to trees (UCT) algorithm is used to strike a good balance between "exploiting the most promising path" and "exploring new paths":

    $$a_t^{*} = \arg\max_{a_t' \in A(s_t)} \left( Q(s_t, a_t') + c \cdot \sqrt{\frac{\ln N(s_t)}{N(\mathrm{ch}(s_t, a_t'))}} \right)$$

    where $Q(s_t, a_t')$ denotes the expected return of executing action $a_t'$ in state $s_t$, $A(s_t)$ is the action set of the node, $N(s_t)$ is the number of visits to node $s_t$, $\mathrm{ch}(s_t, a_t')$ is the child node obtained by applying action $a_t'$ to node $s_t$, and $c$ is a constant used to adjust the degree of exploration.

The first term in the formula measures the value of the path, while the second term measures the uncertainty of the visited nodes. In other words, if a node is less explored and its children are less visited, the value of the second term will be higher.

  2. Expansion: add new child nodes under the leaf node selected in the previous step, expanding the tree. This is done by applying action generation and state transition multiple times (Figure (b) above), producing several new actions and states. Note that multiple training batches are sampled to obtain diverse error feedback (actions). Among the new nodes, the one with the highest reward is selected as the input to the next simulation step.
  3. Simulation: simulate the future trajectory of the node selected in the expansion phase and estimate the reward achievable along this path. The simulation strategy is flexible, e.g., playing random moves until a terminal state is reached. To reduce the computational cost of simulation and simplify the process, this article keeps generating multiple actions and selects the highest-reward node among them to advance quickly to the next tree level.
  4. Back-propagation: back-propagation is triggered when a terminal state is reached during simulation; the terminal state is determined by a preset maximum depth or an early-stopping condition. The future rewards are then propagated back along the path from the root node to the terminal node by updating the Q-value function. For each state-action pair on this path, $Q(s_t, a_t)$ is updated by aggregating the returns of all future trajectories starting from state $s_t$ and action $a_t$:

    $$Q(s_t, a_t) = \frac{1}{M} \sum_{j=1}^{M} \sum_{s' \in S^{j}_{s_t},\, a' \in A^{j}_{a_t}} r(s', a')$$

    where $M$ is the number of future trajectories starting from $s_t$, and $S^{j}_{s_t}$ and $A^{j}_{a_t}$ denote the $j$-th state sequence and action sequence starting from state $s_t$ and action $a_t$, respectively.

PromptAgent repeats these four operations for a preset number of iterations. Once the limit is reached, the best node (i.e., the best prompt) on the highest-reward path is selected for final evaluation.
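
To connect the four steps, here is a compact sketch of the whole loop. It is only an illustrative simplification under the same assumptions as the earlier snippets (the hypothetical `generate_action`, `transition`, and `reward` helpers); batch sampling, early stopping, depth limits, and the paper's exact Q-value aggregation are reduced to running averages and fixed parameters, so this is not the authors' reference implementation:

```python
import math

# Assumes the hypothetical helpers generate_action, transition, and reward
# from the earlier sketches are available in scope.

class Node:
    """A node in the search tree; the state is simply the prompt text."""
    def __init__(self, prompt: str, parent: "Node | None" = None):
        self.prompt = prompt                  # state s_t
        self.parent = parent
        self.children: list["Node"] = []
        self.visits = 0                       # N(s_t)
        self.q_value = 0.0                    # running average return, stands in for Q(s_t, a_t)

def uct_select(node: Node, c: float = 1.0) -> Node:
    """Selection: pick the child maximizing Q + c * sqrt(ln N(parent) / N(child))."""
    def score(child: Node) -> float:
        if child.visits == 0:
            return float("inf")               # always try unvisited children first
        return child.q_value + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children, key=score)

def expand(node: Node, train: list[dict], width: int = 3) -> list[Node]:
    """Expansion: sample several error feedbacks (actions) and create child prompts."""
    for _ in range(width):
        feedback = generate_action(node.prompt, train)                  # action a_t
        node.children.append(Node(transition(node.prompt, feedback), parent=node))
    return node.children

def simulate(prompt: str, train: list[dict], holdout: list[dict], depth: int = 2) -> float:
    """Simulation: greedily roll out a few levels and return the best reward observed."""
    best = reward(prompt, holdout)
    for _ in range(depth):
        candidates = [transition(prompt, generate_action(prompt, train)) for _ in range(2)]
        prompt = max(candidates, key=lambda p: reward(p, holdout))
        best = max(best, reward(prompt, holdout))
    return best

def backpropagate(node: "Node | None", ret: float) -> None:
    """Back-propagation: update visit counts and average returns up to the root."""
    while node is not None:
        node.visits += 1
        node.q_value += (ret - node.q_value) / node.visits
        node = node.parent

def mcts_optimize(init_prompt: str, train: list[dict], holdout: list[dict],
                  iterations: int = 12) -> str:
    root = Node(init_prompt)
    for _ in range(iterations):
        node = root
        while node.children:                                            # 1. selection
            node = uct_select(node)
        children = expand(node, train)                                  # 2. expansion
        best_child = max(children, key=lambda ch: reward(ch.prompt, holdout))
        ret = simulate(best_child.prompt, train, holdout)               # 3. simulation
        backpropagate(best_child, ret)                                  # 4. back-propagation
    # Follow the highest-Q path from the root and return the best prompt on it.
    path, node = [root], root
    while node.children:
        node = max(node.children, key=lambda ch: ch.q_value)
        path.append(node)
    return max(path, key=lambda n: reward(n.prompt, holdout)).prompt
```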

Experiments

Experimental settings

To comprehensively evaluate PromptAgent across applications, the authors selected 12 tasks from three different domains for in-depth experiments:

  • 6 BIG-Bench Hard (BBH) tasks that emphasize domain knowledge (e.g., Geometric Shapes, Causal Judgement) and complex reasoning (e.g., Penguins in a Table, Object Counting, Epistemic Reasoning, Temporal Sequences).
  • 3 biomedical domain-specific tasks: disease named entity recognition (NCBI), biomedical sentence similarity (BIOSSES), and medical question answering (MedQA).
  • 3 well-known natural language understanding tasks, including two text classification tasks (TREC and Subj) and a natural language inference task (CB).

Experimental results and analysis

Overall effect

The table below shows that PromptAgent significantly outperforms all baselines on the BBH tasks: compared with human prompts (ZS), CoT, and APE, the improvements are 28.9%, 9.5%, and 11.2%, respectively.
Table 1

Human prompts and CoT prompts do not work well on biomedical tasks, which require extensive domain knowledge and deep prompt-engineering experience. APE incorporates some domain knowledge through automatic prompt sampling and optimization, reducing manual intervention and improving results. PromptAgent, however, improves over APE by 7.3% on average, showing that it can better elicit effective domain knowledge, generate expert-level prompts, and bridge the knowledge gap between novice and expert prompt engineers.

On general NLU tasks, PromptAgent also beats all baselines, demonstrating its capability and versatility.
Table 2

Prompt generalization

The authors also evaluate whether prompts optimized by PromptAgent generalize to other base LLMs. Since less capable, smaller-scale models (such as GPT-2 or LLaMA) may not grasp the nuances of these expert-level prompts, significant performance degradation can be expected there. The evaluation therefore uses a model more powerful than GPT-3.5 (GPT-4) and a weaker one (PaLM 2). The results show that PromptAgent has great potential:

  • The optimized Expert Prompt achieved further improvements in almost all tasks (11/12) when using the more powerful GPT-4.
  • When transferring expert prompts to PaLM 2, performance may not be as good as more powerful models, but improvements can still be gained in certain tasks such as Penguins.

Table 3

Ablation experiment

This article also compares several search strategies: a single Monte Carlo (MC) search that randomly samples and selects one action at a time, a greedy depth-first search (Greedy) that always picks the best among multiple samples, and beam search (Beam), which retains multiple promising paths at each level. The table below shows that:

  • Both Greedy and Beam significantly improve over the MC baseline, indicating that structured iterative exploration is necessary.
  • Beam and Greedy operate strictly in the forward direction, without conducting strategic searches in the Prompt space, and lack the ability to foresee future results and review past decisions. In contrast, MCTS’s strategy planning allows PromptAgent to traverse the complex expert prompt space more efficiently, significantly outperforming all search variants on all tasks.
    Table 4

Search efficiency analysis

In addition to its strong performance, a key advantage of PromptAgent is its ability to search efficiently through strategic planning. Search efficiency is measured by the number of prompts explored during the search, i.e., the number of nodes generated. The figure below plots search efficiency against task performance: PromptAgent's points cluster in the upper-left corner, indicating higher accuracy with fewer explored nodes.
Table 5

Conclusion

This article introduces PromptAgent, a novel prompt optimization framework that uses the self-reflection ability of LLMs to incorporate task-specific domain knowledge into newly generated prompts, and uses MCTS planning to efficiently traverse the complex prompt space in search of expert-level prompts. The prompts produced by PromptAgent consistently exhibit expert-level characteristics, rich in domain-specific detail and guidance.

In the future, ever more powerful large language models will be able to understand and follow increasingly complex instructions. Relying solely on manually crafted expert prompts will not be enough; automatically constructing expert-level prompts is a very promising direction.