Commit f167890

google-genai-bot authored and copybara-github committed
feat: Add documentation and instructions to help configure gepa experiments

PiperOrigin-RevId: 828911323
1 parent e3caf79 commit f167890

File tree

3 files changed

+210
-34
lines changed

3 files changed

+210
-34
lines changed
contributing/samples/gepa/README.md

Lines changed: 117 additions & 2 deletions
# Example: optimizing an ADK agent with Genetic-Pareto

This directory contains an example demonstrating how to use the Agent
Development Kit (ADK) to run and optimize an LLM-based agent in a simulated
environment with the Genetic-Pareto prompt optimization algorithm
([GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning](https://arxiv.org/abs/2507.19457))
on benchmarks like Tau-bench.

## Goal

The goal of this demo is to take an agent with a simple, underperforming prompt
and automatically improve it using GEPA, increasing the agent's reliability on
a customer support task.

## Tau-Bench Retail Environment

We use the `'retail'` environment from
[Tau-bench](https://github.com/sierra-research/tau-bench), a benchmark designed
to test agents in realistic, conversational scenarios involving tool use and
adherence to policies. In this environment, our agent acts as a customer
support agent for an online store. It needs to use a set of tools (like
`check_order_status`, `issue_refund`, etc.) to help a simulated user resolve
their issues, while following specific support policies (e.g., only refunding
orders less than 30 days old). The agent is built with ADK using a standard
tool-calling strategy. It receives the conversation history and a list of
available tools, and it must decide whether to respond to the user or call a
tool.
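The sketch below illustrates this kind of wiring in ADK. It is a simplified,
hypothetical example: the tool bodies are placeholder stubs, the model name is
only an assumption, and the real sample connects the agent to the actual
Tau-bench retail tools and policy rather than these toy functions.

```python
# Hypothetical sketch of a tool-calling support agent in ADK (not the exact
# setup used by this sample). The tools are placeholder stubs.
from google.adk.agents import Agent


def check_order_status(order_id: str) -> dict:
  """Looks up the status and age of an order (stub)."""
  return {'order_id': order_id, 'status': 'delivered', 'age_days': 12}


def issue_refund(order_id: str) -> dict:
  """Issues a refund for an eligible order (stub)."""
  return {'order_id': order_id, 'refunded': True}


retail_support_agent = Agent(
    name='retail_support_agent',
    model='gemini-2.5-flash',  # Assumed model name, for illustration only.
    instruction=(
        'You are a customer support agent for an online store. Use the'
        ' available tools to resolve the user issue, and only refund orders'
        ' less than 30 days old.'
    ),
    tools=[check_order_status, issue_refund],
)
```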

## GEPA Overview

**GEPA (Genetic-Pareto)** is a prompt optimization algorithm that learns from
trial and error, using LLM-based reflection to understand failures and guide
prompt evolution. Here's a simplified view of how it works:

1. **Run & Collect:** It runs the agent with a candidate prompt on a few
   training examples to collect interaction trajectories.
2. **Reflect:** It gives the trajectories of failed rollouts to a "reflection"
   model, which analyzes what went wrong and generates high-level insights or
   "rules" for improvement. For example, it might notice *"The agent should
   always confirm the order number before issuing a refund."*
3. **Evolve:** It uses these insights to propose new candidate prompts by
   editing existing prompts or combining ideas from different successful ones,
   inspired by genetic algorithms.
4. **Evaluate & Select:** It evaluates these new prompts on a validation set
   and keeps only the best-performing, diverse set of prompts (the "Pareto
   frontier").
5. **Repeat:** It repeats this loop—collect, reflect, evolve, evaluate—until
   it reaches its budget (`max_metric_calls`).

This can result in a more detailed and robust prompt that has learned from its
mistakes, capturing nuances that are sometimes difficult to discover through
manual prompt engineering.
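As a rough mental model, the loop can be sketched as follows. This is
illustrative pseudocode under simplifying assumptions, not the GEPA library's
API: the `rollout`, `reflect`, and `mutate` callables are hypothetical
stand-ins, and the selection step keeps a single pool of improving prompts
rather than a true Pareto frontier.

```python
# Illustrative sketch of the GEPA loop described above (not the gepa package's
# actual API). `rollout`, `reflect`, and `mutate` are hypothetical callables.
import random
from typing import Callable, Sequence


def gepa_sketch(
    seed_prompt: str,
    train_tasks: Sequence,
    dev_tasks: Sequence,
    rollout: Callable,  # (prompt, task) -> (score, trajectory)
    reflect: Callable,  # (failed trajectories) -> insights text
    mutate: Callable,   # (prompt, insights) -> list of candidate prompts
    max_metric_calls: int = 100,
    train_batch_size: int = 3,
) -> str:
  """Returns the best prompt found within the evaluation budget."""
  calls = 0

  def dev_score(prompt):
    nonlocal calls
    calls += len(dev_tasks)
    return sum(rollout(prompt, task)[0] for task in dev_tasks) / len(dev_tasks)

  candidates = {seed_prompt: dev_score(seed_prompt)}
  while calls < max_metric_calls:
    parent = max(candidates, key=candidates.get)
    # 1. Run & Collect: gather trajectories on a small training mini-batch.
    batch = random.sample(list(train_tasks), min(train_batch_size, len(train_tasks)))
    results = [rollout(parent, task) for task in batch]
    calls += len(batch)
    # 2. Reflect: let a reflection model analyze the failed rollouts.
    insights = reflect([traj for score, traj in results if score == 0])
    # 3. Evolve: propose edited prompts based on the insights.
    for child in mutate(parent, insights):
      # 4. Evaluate & Select: keep the child only if it improves the dev score.
      score = dev_score(child)
      if score >= max(candidates.values()):
        candidates[child] = score
  return max(candidates, key=candidates.get)
```

Unlike this sketch, GEPA keeps a Pareto frontier of prompts that excel on
different validation tasks rather than a single global winner, which preserves
diversity across candidate prompts.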

## Running the experiment

The easiest way to run this demo is through the provided Colab notebook:
[`gepa_tau_bench.ipynb`](https://colab.research.google.com/github/google/adk-python/blob/main/contributing/samples/gepa/gepa_tau_bench.ipynb).

Alternatively, you can run GEPA optimization using the `run_experiment.py`
script:

```bash
python -m run_experiment \
    --output_dir=/path/to/gepa_experiments/ \
    --num_eval_trials=8 \
    --max_concurrency=32 \
    --train_batch_size=8
```

To run only evaluation with the seed prompt, use `--eval_mode`:

```bash
python -m run_experiment \
    --output_dir=/path/to/gepa_experiments/ \
    --num_eval_trials=8 \
    --max_concurrency=32 \
    --eval_mode
```

## Choosing Hyperparameters

Setting the right hyperparameters is crucial for a successful and efficient
run. The following hyperparameters can be set via command-line flags in
`run_experiment.py` (a rough cost estimate follows the list):

* `--max_metric_calls`: Total budget for GEPA prompt evaluations. This is the
  main control for runtime/cost. One could start with 100 and increase to
  500+ for further optimization.
* `--eval_set_size`: Size of the dev set to use for Pareto frontier
  evaluation in GEPA. If None, uses all available dev tasks. A larger size
  gives a more stable, less noisy fitness score with more coverage, but is
  more expensive and slows down the GEPA run. A few tens of examples may
  suffice for simpler tasks; a few hundred may be needed for more complex and
  variable tasks.
* `--train_batch_size`: Number of trajectories sampled from rollouts to be
  used by the reflection model in each GEPA step to generate prompt
  improvements. This corresponds to the mini-batch size in GEPA, used as a
  fast, preliminary filter for new candidate prompts. It trades off signal
  quality against evaluation cost. The GEPA paper uses a default of 3.
  Increasing the batch size can provide a more stable signal and a better
  estimate of a prompt's quality, but it entails higher cost and fewer
  iterations for a fixed budget. One can start with a low value and increase
  it if significant variation is observed.
* `--num_eval_trials`: Number of times each task is run during evaluation.
  Higher values give more stable evaluation metrics but increase runtime.
  Recommended: 4-8.
* `--num_test_records`: Size of the test set for final evaluation of the
  optimized prompt. If None, uses all available test tasks.
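To make the budget trade-offs concrete, here is a rough back-of-the-envelope
estimate. It assumes, for simplicity, that every scored rollout (mini-batch or
dev set) counts as one metric call; the exact accounting in GEPA may differ,
and the numbers below are only illustrative.

```python
# Back-of-the-envelope estimate of how far a GEPA budget stretches.
# Assumption (illustrative only): each scored rollout counts as one metric call.
max_metric_calls = 500   # total budget (--max_metric_calls)
eval_set_size = 40       # dev tasks scored when a candidate is fully evaluated
train_batch_size = 8     # rollouts per reflection mini-batch (--train_batch_size)

# An iteration costs at least one mini-batch; a full dev-set evaluation is paid
# only for candidates that pass the mini-batch filter. Assuming every iteration
# pays both gives a conservative lower bound on the number of iterations.
calls_per_iteration = train_batch_size + eval_set_size
print('iterations (lower bound):', max_metric_calls // calls_per_iteration)  # -> 10
```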

## LLM-based Rater

When agent reward signals are not available, you can instead use an LLM rater
by setting the `--use_rater` flag.

This rater evaluates agent trajectories based on a rubric assessing whether
"The agent fulfilled the user's primary request." It provides a score (0 or 1)
and detailed feedback including evidence and rationale for its verdict. This
score is then used by GEPA as the fitness function to optimize. The rater is
implemented in `rater_lib.py`.
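For intuition, a rubric-style rater of this kind might look roughly like the
sketch below. This is a hypothetical illustration, not the code in
`rater_lib.py`: the `call_model` callable and the JSON response format are
assumptions.

```python
# Hypothetical sketch of a rubric-based LLM rater (the actual implementation
# in rater_lib.py may differ). `call_model` is an assumed helper that sends a
# prompt to an LLM and returns its text reply.
import dataclasses
import json
from typing import Callable

RUBRIC = "The agent fulfilled the user's primary request."


@dataclasses.dataclass
class RaterVerdict:
  score: int  # 1 if the rubric is satisfied, 0 otherwise.
  evidence: str  # Quotes from the trajectory supporting the verdict.
  rationale: str  # Why the rater reached this verdict.


def rate_trajectory(
    trajectory_text: str, call_model: Callable[[str], str]
) -> RaterVerdict:
  """Scores one agent trajectory against the rubric with an LLM rater."""
  prompt = (
      f'Rubric: {RUBRIC}\n\n'
      f'Conversation:\n{trajectory_text}\n\n'
      'Reply with JSON: {"score": 0 or 1, "evidence": "...", "rationale": "..."}'
  )
  reply = json.loads(call_model(prompt))
  return RaterVerdict(
      score=int(reply['score']),
      evidence=reply['evidence'],
      rationale=reply['rationale'],
  )
```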

contributing/samples/gepa/gepa_tau_bench.ipynb

Lines changed: 32 additions & 21 deletions
@@ -14,10 +14,12 @@
  "test agents in realistic, conversational scenarios involving tool use and\n",
  "adherence to policies.\n",
  "\n",
- "**Our Goal:** To take a simple, underperforming prompt and automatically\n",
+ "**Goal:** To take a simple, underperforming prompt and automatically\n",
  "improve it using GEPA, increasing the agent's reliability on a customer\n",
  "support task.\n",
  "\n",
+ "**Note:** You can find more options to run GEPA with an ADK agent in the [README file](https://github.com/google/adk-python/blob/main/contributing/samples/gepa/README.md).\n",
+ "\n",
  "## Prerequisites\n",
  "\n",
  "* **Google Cloud Project:** You'll need access to a Google Cloud Project with\n",
@@ -36,7 +38,7 @@
  },
  "outputs": [],
  "source": [
- "#@title Install Tau-bench and GEPA\n",
+ "# @title Install Tau-bench and GEPA\n",
  "!git clone https://github.com/google/adk-python.git\n",
  "!git clone https://github.com/sierra-research/tau-bench.git\n",
  "%cd tau-bench/\n",
@@ -45,13 +47,13 @@
  "%cd ..\n",
  "!pip install gepa --quiet\n",
  "\n",
- "!pip install retry --quiet\n"
+ "!pip install retry --quiet"
  ]
  },
  {
  "cell_type": "code",
  "source": [
- "#@title Configure python dependencies\n",
+ "# @title Configure python dependencies\n",
  "import sys\n",
  "\n",
  "sys.path.append('/content/tau-bench')\n",
@@ -67,8 +69,9 @@
  {
  "cell_type": "code",
  "source": [
- "#@title Authentication\n",
+ "# @title Authentication\n",
  "from google.colab import auth\n",
+ "\n",
  "auth.authenticate_user()"
  ],
  "metadata": {
@@ -87,23 +90,23 @@
  },
  "outputs": [],
  "source": [
- "#@title Setup\n",
+ "# @title Setup\n",
  "from datetime import datetime\n",
  "import json\n",
  "import logging\n",
  "import os\n",
  "\n",
- "from google.genai import types\n",
  "import experiment as experiment_lib\n",
+ "from google.genai import types\n",
  "\n",
  "\n",
  "# @markdown ### ☁️ Configure Vertex AI Access\n",
  "# @markdown Enter your Google Cloud Project ID and Location.\n",
  "\n",
- "#@markdown Configure Vertex AI Access\n",
+ "# @markdown Configure Vertex AI Access\n",
  "\n",
- "GCP_PROJECT = '' #@param {type: 'string'}\n",
- "GCP_LOCATION = 'us-central1' #@param {type: 'string'}\n",
+ "GCP_PROJECT = '' # @param {type: 'string'}\n",
+ "GCP_LOCATION = 'us-central1' # @param {type: 'string'}\n",
  "\n",
  "# @markdown ---\n",
  "# @markdown ### 🧠 Configure LLM Models\n",
@@ -116,14 +119,19 @@
  "\n",
  "# @markdown ---\n",
  "# @markdown ### ⚙️ Configure Experiment Parameters\n",
- "# @markdown These control the dataset size, evaluation runs, and GEPA budget.\n",
- "# @markdown For a quick demo, keep these values small. For a real run, you might\n",
- "# @markdown increase `MAX_DATASET_SIZE` to 50-100 and `MAX_METRIC_CALLS` to 100+.\n",
+ "# @markdown Number of trajectories sampled from rollouts to be used by the reflection model in each GEPA step:\n",
+ "MINI_BATCH_SIZE = 8 # @param {type: 'integer'}\n",
+ "# @markdown Size of the pareto and feedback datasets (small setting for demo purposes):\n",
  "MAX_DATASET_SIZE = 10 # @param {type: 'integer'}\n",
+ "# @markdown Number of times each task is run during evaluation:\n",
  "NUM_EVAL_TRIALS = 4 # @param {type: 'integer'}\n",
+ "# @markdown Total budget for GEPA prompt evaluations:\n",
  "MAX_METRIC_CALLS = 100 # @param {type: 'integer'}\n",
+ "# @markdown Maximum number of parallel agent-environment interactions\n",
  "MAX_CONCURRENCY = 4 # @param {type: 'integer'}\n",
  "\n",
+ "# @markdown **Note:** You can find more information on how to configure GEPA in the [README file](https://github.com/google/adk-python/blob/main/contributing/samples/gepa/README.md).\n",
+ "\n",
  "# The ADK uses these environment variables to connect to Vertex AI via the\n",
  "# Google GenAI SDK.\n",
  "os.environ['GOOGLE_GENAI_USE_VERTEXAI'] = 'true'\n",
@@ -165,7 +173,7 @@
  {
  "cell_type": "code",
  "source": [
- "#@title Define an initial instruction\n",
+ "# @title Define an initial instruction\n",
  "\n",
  "# @markdown This is our starting \"seed\" prompt. It's very generic and doesn't give the agent much guidance on how to behave or use tools.\n",
  "BASE_SYSTEM_INSTRUCTION = 'you are a customer support agent helping customers resolve their issues by using the right tools' # @param {type: 'string'}\n",
@@ -226,7 +234,7 @@
  }
  ],
  "source": [
- "#@title Initial Inference: A First Look at Our Agent\n",
+ "# @title Initial Inference: A First Look at Our Agent\n",
  "\n",
  "from tau_bench.types import EnvRunResult, RunConfig\n",
  "\n",
@@ -373,7 +381,8 @@
  }
  ],
  "source": [
- "#@title Let's visualize one of the sampled trajectory\n",
+ "# @title Let's visualize one of the sampled trajectory\n",
+ "\n",
  "\n",
  "def display_trajectory(trajectory):\n",
  " \"\"\"Formats and prints a trajectory for display in Colab.\"\"\"\n",
@@ -400,7 +409,7 @@
  " f'**{role.upper()}**: ↪️ Tool Response from'\n",
  " f' `{fr[\"name\"]}`: `{fr[\"response\"][\"result\"]}`'\n",
  " )\n",
- " print() # new line after each turn\n",
+ " print()  # new line after each turn\n",
  "\n",
  "\n",
  "# Let's inspect the \"trajectory\" of the first run. A trajectory is the full\n",
@@ -485,7 +494,9 @@
  " rnd_seed=42,\n",
  " max_metric_calls=MAX_METRIC_CALLS, # GEPA budget: max prompt evaluations\n",
  " reflection_model=REFLECTION_MODEL_NAME, # Model for GEPA's reflection step\n",
- " reflection_minibatch_size=8,\n",
+ " # Number of trajectories sampled from failed rollouts to be used by the\n",
+ " # reflection model in each GEPA step to generate prompt improvements.\n",
+ " reflection_minibatch_size=MINI_BATCH_SIZE,\n",
  " use_rater=False, # Optional: LLM rater for nuanced feedback\n",
  " # For this demo, we use the same small dataset for all splits.\n",
  " # In a real optimization run, you would use separate datasets:\n",
@@ -1330,7 +1341,7 @@
  }
  ],
  "source": [
- "#@title Run GEPA (this might take ~10 minutes)\n",
+ "# @title Run GEPA (this might take ~10 minutes)\n",
  "# This process can take around 10 minutes for the demo settings, as it\n",
  "# involves multiple rounds of running the agent and calling the reflection model.\n",
  "# A real run with more metric calls will take longer.\n",
@@ -1424,7 +1435,7 @@
  }
  ],
  "source": [
- "#@title Visualize the optimized prompt\n",
+ "# @title Visualize the optimized prompt\n",
  "# Now, let's look at the final, optimized prompt that GEPA produced.\n",
  "# It should be much more detailed than our initial one-line prompt!\n",
  "print('\\n--- Optimized Prompt from GEPA ---')\n",
@@ -1489,7 +1500,7 @@
  }
  ],
  "source": [
- "#@title Run evaluation\n",
+ "# @title Run evaluation\n",
  "\n",
  "# Let's create a new directory for this final evaluation run.\n",
  "final_eval_dir = os.path.join(\n",

contributing/samples/gepa/run_experiment.py

Lines changed: 61 additions & 11 deletions
@@ -26,15 +26,67 @@
  import experiment
  from google.genai import types

- _OUTPUT_DIR = flags.DEFINE_string('output_dir', None, '')
- _EVAL_SET_SIZE = flags.DEFINE_integer('eval_set_size', None, '')
- _MAX_METRIC_CALLS = flags.DEFINE_integer('max_metric_calls', 500, '')
- _NUM_TEST_RECORDS = flags.DEFINE_integer('num_test_records', None, '')
- _NUM_EVAL_TRIALS = flags.DEFINE_integer('num_eval_trials', 4, '')
- _MAX_CONCURRENCY = flags.DEFINE_integer('max_concurrency', 8, '')
- _EVAL_MODE = flags.DEFINE_bool('eval_mode', False, '')
- _USE_RATER = flags.DEFINE_bool('use_rater', False, '')
- _TRAIN_BATCH_SIZE = flags.DEFINE_integer('train_batch_size', 3, '')
+ _OUTPUT_DIR = flags.DEFINE_string(
+     'output_dir',
+     None,
+     'Directory to save experiment results and artifacts.',
+     required=True,
+ )
+ _EVAL_SET_SIZE = flags.DEFINE_integer(
+     'eval_set_size',
+     None,
+     'Size of the dev set to use for Pareto frontier evaluation in GEPA. If'
+     ' None, uses all available dev tasks. A few tens of examples might'
+     ' suffice for simpler tasks and up to a few hundred for more complex'
+     ' and variable tasks. Increase the size to mitigate the effect of'
+     ' variability at greater cost.',
+ )
+ _MAX_METRIC_CALLS = flags.DEFINE_integer(
+     'max_metric_calls',
+     500,
+     'Total budget for GEPA prompt evaluations. This is the main control for'
+     ' runtime/cost. One could start with 100 and increase to 500+ for further'
+     ' optimization.',
+ )
+ _NUM_TEST_RECORDS = flags.DEFINE_integer(
+     'num_test_records',
+     None,
+     'Size of the test set for final evaluation of the optimized prompt. If'
+     ' None, uses all available test tasks.',
+ )
+ _NUM_EVAL_TRIALS = flags.DEFINE_integer(
+     'num_eval_trials',
+     4,
+     'Number of times each task is run during evaluation. Higher values give'
+     ' more stable evaluation metrics but increase runtime. Recommended: 4-8.',
+ )
+ _MAX_CONCURRENCY = flags.DEFINE_integer(
+     'max_concurrency',
+     8,
+     'Maximum number of parallel agent-environment interactions. Increase if'
+     ' you have sufficient API quota.',
+ )
+ _EVAL_MODE = flags.DEFINE_bool(
+     'eval_mode',
+     False,
+     'If set, run evaluation only using the seed prompt, skipping GEPA'
+     ' optimization.',
+ )
+ _USE_RATER = flags.DEFINE_bool(
+     'use_rater',
+     False,
+     'If set, use an LLM rater to score trajectories.',
+ )
+ _TRAIN_BATCH_SIZE = flags.DEFINE_integer(
+     'train_batch_size',
+     3,
+     'Number of trajectories sampled from rollouts to be used by the'
+     ' reflection model in each GEPA step to generate prompt improvements.'
+     ' Increasing the batch size may help provide a more stable signal and'
+     ' estimate of prompt quality but entails higher cost. One can start with'
+     ' a low value and increase the size if significant variations are'
+     ' observed.',
+ )


  def main(argv: Sequence[str]) -> None:
@@ -53,8 +105,6 @@ def main(argv: Sequence[str]) -> None:
  logger.setLevel(logging.WARNING)

  types.logger.addFilter(experiment.FilterInferenceWarnings())
- if not _OUTPUT_DIR.value:
-   raise ValueError('outptut dir must be specified')
  output_dir = os.path.join(
      _OUTPUT_DIR.value, datetime.now().strftime('%Y%m%d%H%M%S%f')
  )
