Commit f167890

google-genai-bot authored and copybara-github committed
feat: Add documentation and instructions to help configure gepa experiments

PiperOrigin-RevId: 828911323
1 parent e3caf79 commit f167890

File tree

3 files changed

+210
-34
lines changed

3 files changed

+210
-34
lines changed
contributing/samples/gepa/README.md

Lines changed: 117 additions & 2 deletions
# Example: optimizing an ADK agent with Genetic-Pareto

This directory contains an example demonstrating how to use the Agent
Development Kit (ADK) to run and optimize an LLM-based agent in a simulated
environment with the Genetic-Pareto prompt optimization algorithm
([GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning](https://arxiv.org/abs/2507.19457))
on benchmarks like Tau-bench.

## Goal

The goal of this demo is to take an agent with a simple, underperforming prompt
and automatically improve it using GEPA, increasing the agent's reliability on
a customer support task.

## Tau-Bench Retail Environment

We use the `'retail'` environment from
[Tau-bench](https://github.com/sierra-research/tau-bench), a benchmark designed
to test agents in realistic, conversational scenarios involving tool use and
adherence to policies. In this environment, our agent acts as a customer
support agent for an online store. It needs to use a set of tools (like
`check_order_status`, `issue_refund`, etc.) to help a simulated user resolve
their issues, while following specific support policies (e.g., only refunding
orders less than 30 days old). The agent is built with ADK using a standard
tool-calling strategy. It receives the conversation history and a list of
available tools, and it must decide whether to respond to the user or call a
tool.
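The sketch below illustrates this kind of wiring in ADK. It is a simplified,
hypothetical example: the tool bodies are placeholder stubs, the model name is
only an assumption, and the real sample connects the agent to the actual
Tau-bench retail tools and policy rather than these toy functions.

```python
# Hypothetical sketch of a tool-calling support agent in ADK (not the exact
# setup used by this sample). The tools are placeholder stubs.
from google.adk.agents import Agent


def check_order_status(order_id: str) -> dict:
  """Looks up the status and age of an order (stub)."""
  return {'order_id': order_id, 'status': 'delivered', 'age_days': 12}


def issue_refund(order_id: str) -> dict:
  """Issues a refund for an eligible order (stub)."""
  return {'order_id': order_id, 'refunded': True}


retail_support_agent = Agent(
    name='retail_support_agent',
    model='gemini-2.5-flash',  # Assumed model name, for illustration only.
    instruction=(
        'You are a customer support agent for an online store. Use the'
        ' available tools to resolve the user issue, and only refund orders'
        ' less than 30 days old.'
    ),
    tools=[check_order_status, issue_refund],
)
```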

## GEPA Overview

**GEPA (Genetic-Pareto)** is a prompt optimization algorithm that learns from
trial and error, using LLM-based reflection to understand failures and guide
prompt evolution. Here's a simplified view of how it works:

1. **Run & Collect:** It runs the agent with a candidate prompt on a few
   training examples to collect interaction trajectories.
2. **Reflect:** It gives the trajectories of failed rollouts to a "reflection"
   model, which analyzes what went wrong and generates high-level insights or
   "rules" for improvement. For example, it might notice *"The agent should
   always confirm the order number before issuing a refund."*
3. **Evolve:** It uses these insights to propose new candidate prompts by
   editing existing prompts or combining ideas from different successful ones,
   inspired by genetic algorithms.
4. **Evaluate & Select:** It evaluates these new prompts on a validation set
   and keeps only the best-performing, diverse set of prompts (the "Pareto
   frontier").
5. **Repeat:** It repeats this loop—collect, reflect, evolve, evaluate—until
   it reaches its budget (`max_metric_calls`).

This can result in a more detailed and robust prompt that has learned from its
mistakes, capturing nuances that are sometimes difficult to discover through
manual prompt engineering.
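As a rough mental model, the loop can be sketched as follows. This is
illustrative pseudocode under simplifying assumptions, not the GEPA library's
API: the `rollout`, `reflect`, and `mutate` callables are hypothetical
stand-ins, and the selection step keeps a single pool of improving prompts
rather than a true Pareto frontier.

```python
# Illustrative sketch of the GEPA loop described above (not the gepa package's
# actual API). `rollout`, `reflect`, and `mutate` are hypothetical callables.
import random
from typing import Callable, Sequence


def gepa_sketch(
    seed_prompt: str,
    train_tasks: Sequence,
    dev_tasks: Sequence,
    rollout: Callable,  # (prompt, task) -> (score, trajectory)
    reflect: Callable,  # (failed trajectories) -> insights text
    mutate: Callable,   # (prompt, insights) -> list of candidate prompts
    max_metric_calls: int = 100,
    train_batch_size: int = 3,
) -> str:
  """Returns the best prompt found within the evaluation budget."""
  calls = 0

  def dev_score(prompt):
    nonlocal calls
    calls += len(dev_tasks)
    return sum(rollout(prompt, task)[0] for task in dev_tasks) / len(dev_tasks)

  candidates = {seed_prompt: dev_score(seed_prompt)}
  while calls < max_metric_calls:
    parent = max(candidates, key=candidates.get)
    # 1. Run & Collect: gather trajectories on a small training mini-batch.
    batch = random.sample(list(train_tasks), min(train_batch_size, len(train_tasks)))
    results = [rollout(parent, task) for task in batch]
    calls += len(batch)
    # 2. Reflect: let a reflection model analyze the failed rollouts.
    insights = reflect([traj for score, traj in results if score == 0])
    # 3. Evolve: propose edited prompts based on the insights.
    for child in mutate(parent, insights):
      # 4. Evaluate & Select: keep the child only if it improves the dev score.
      score = dev_score(child)
      if score >= max(candidates.values()):
        candidates[child] = score
  return max(candidates, key=candidates.get)
```

Unlike this sketch, GEPA keeps a Pareto frontier of prompts that excel on
different validation tasks rather than a single global winner, which preserves
diversity across candidate prompts.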

## Running the experiment

The easiest way to run this demo is through the provided Colab notebook:
[`gepa_tau_bench.ipynb`](https://colab.research.google.com/github/google/adk-python/blob/main/contributing/samples/gepa/gepa_tau_bench.ipynb).

Alternatively, you can run GEPA optimization using the `run_experiment.py`
script:

```bash
python -m run_experiment \
    --output_dir=/path/to/gepa_experiments/ \
    --num_eval_trials=8 \
    --max_concurrency=32 \
    --train_batch_size=8
```

To run only evaluation with the seed prompt, use `--eval_mode`:

```bash
python -m run_experiment \
    --output_dir=/path/to/gepa_experiments/ \
    --num_eval_trials=8 \
    --max_concurrency=32 \
    --eval_mode
```

## Choosing Hyperparameters

Setting the right hyperparameters is crucial for a successful and efficient
run. The following hyperparameters can be set via command-line flags in
`run_experiment.py` (a rough cost estimate follows the list):

* `--max_metric_calls`: Total budget for GEPA prompt evaluations. This is the
  main control for runtime/cost. One could start with 100 and increase to
  500+ for further optimization.
* `--eval_set_size`: Size of the dev set to use for Pareto frontier
  evaluation in GEPA. If None, uses all available dev tasks. A larger size
  gives a more stable, less noisy fitness score with more coverage, but is
  more expensive and slows down the GEPA run. A few tens of examples may
  suffice for simpler tasks; a few hundred may be needed for more complex and
  variable tasks.
* `--train_batch_size`: Number of trajectories sampled from rollouts to be
  used by the reflection model in each GEPA step to generate prompt
  improvements. This corresponds to the mini-batch size in GEPA, used as a
  fast, preliminary filter for new candidate prompts. It trades off signal
  quality against evaluation cost. The GEPA paper uses a default of 3.
  Increasing the batch size can provide a more stable signal and a better
  estimate of a prompt's quality, but it entails higher cost and fewer
  iterations for a fixed budget. One can start with a low value and increase
  it if significant variation is observed.
* `--num_eval_trials`: Number of times each task is run during evaluation.
  Higher values give more stable evaluation metrics but increase runtime.
  Recommended: 4-8.
* `--num_test_records`: Size of the test set for final evaluation of the
  optimized prompt. If None, uses all available test tasks.
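To make the budget trade-offs concrete, here is a rough back-of-the-envelope
estimate. It assumes, for simplicity, that every scored rollout (mini-batch or
dev set) counts as one metric call; the exact accounting in GEPA may differ,
and the numbers below are only illustrative.

```python
# Back-of-the-envelope estimate of how far a GEPA budget stretches.
# Assumption (illustrative only): each scored rollout counts as one metric call.
max_metric_calls = 500   # total budget (--max_metric_calls)
eval_set_size = 40       # dev tasks scored when a candidate is fully evaluated
train_batch_size = 8     # rollouts per reflection mini-batch (--train_batch_size)

# An iteration costs at least one mini-batch; a full dev-set evaluation is paid
# only for candidates that pass the mini-batch filter. Assuming every iteration
# pays both gives a conservative lower bound on the number of iterations.
calls_per_iteration = train_batch_size + eval_set_size
print('iterations (lower bound):', max_metric_calls // calls_per_iteration)  # -> 10
```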

## LLM-based Rater

When agent reward signals are not available, you can instead use an LLM rater
by setting the `--use_rater` flag.

This rater evaluates agent trajectories based on a rubric assessing whether
"The agent fulfilled the user's primary request." It provides a score (0 or 1)
and detailed feedback including evidence and rationale for its verdict. This
score is then used by GEPA as the fitness function to optimize. The rater is
implemented in `rater_lib.py`.
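For intuition, a rubric-style rater of this kind might look roughly like the
sketch below. This is a hypothetical illustration, not the code in
`rater_lib.py`: the `call_model` callable and the JSON response format are
assumptions.

```python
# Hypothetical sketch of a rubric-based LLM rater (the actual implementation
# in rater_lib.py may differ). `call_model` is an assumed helper that sends a
# prompt to an LLM and returns its text reply.
import dataclasses
import json
from typing import Callable

RUBRIC = "The agent fulfilled the user's primary request."


@dataclasses.dataclass
class RaterVerdict:
  score: int  # 1 if the rubric is satisfied, 0 otherwise.
  evidence: str  # Quotes from the trajectory supporting the verdict.
  rationale: str  # Why the rater reached this verdict.


def rate_trajectory(
    trajectory_text: str, call_model: Callable[[str], str]
) -> RaterVerdict:
  """Scores one agent trajectory against the rubric with an LLM rater."""
  prompt = (
      f'Rubric: {RUBRIC}\n\n'
      f'Conversation:\n{trajectory_text}\n\n'
      'Reply with JSON: {"score": 0 or 1, "evidence": "...", "rationale": "..."}'
  )
  reply = json.loads(call_model(prompt))
  return RaterVerdict(
      score=int(reply['score']),
      evidence=reply['evidence'],
      rationale=reply['rationale'],
  )
```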

contributing/samples/gepa/gepa_tau_bench.ipynb

Lines changed: 32 additions & 21 deletions
@@ -14,10 +14,12 @@
  "test agents in realistic, conversational scenarios involving tool use and\n",
  "adherence to policies.\n",
  "\n",
- "**Our Goal:** To take a simple, underperforming prompt and automatically\n",
+ "**Goal:** To take a simple, underperforming prompt and automatically\n",
  "improve it using GEPA, increasing the agent's reliability on a customer\n",
  "support task.\n",
  "\n",
+ "**Note:** You can find more options to run GEPA with an ADK agent in the [README file](https://github.com/google/adk-python/blob/main/contributing/samples/gepa/README.md).\n",
+ "\n",
  "## Prerequisites\n",
  "\n",
  "* **Google Cloud Project:** You'll need access to a Google Cloud Project with\n",
@@ -36,7 +38,7 @@
  },
  "outputs": [],
  "source": [
- "#@title Install Tau-bench and GEPA\n",
+ "# @title Install Tau-bench and GEPA\n",
  "!git clone https://github.com/google/adk-python.git\n",
  "!git clone https://github.com/sierra-research/tau-bench.git\n",
  "%cd tau-bench/\n",
@@ -45,13 +47,13 @@
  "%cd ..\n",
  "!pip install gepa --quiet\n",
  "\n",
- "!pip install retry --quiet\n"
+ "!pip install retry --quiet"
  ]
  },
  {
  "cell_type": "code",
  "source": [
- "#@title Configure python dependencies\n",
+ "# @title Configure python dependencies\n",
  "import sys\n",
  "\n",
  "sys.path.append('/content/tau-bench')\n",
@@ -67,8 +69,9 @@
  {
  "cell_type": "code",
  "source": [
- "#@title Authentication\n",
+ "# @title Authentication\n",
  "from google.colab import auth\n",
+ "\n",
  "auth.authenticate_user()"
  ],
  "metadata": {
@@ -87,23 +90,23 @@
  },
  "outputs": [],
  "source": [
- "#@title Setup\n",
+ "# @title Setup\n",
  "from datetime import datetime\n",
  "import json\n",
  "import logging\n",
  "import os\n",
  "\n",
- "from google.genai import types\n",
  "import experiment as experiment_lib\n",
+ "from google.genai import types\n",
  "\n",
  "\n",
  "# @markdown ### ☁️ Configure Vertex AI Access\n",
  "# @markdown Enter your Google Cloud Project ID and Location.\n",
  "\n",
- "#@markdown Configure Vertex AI Access\n",
+ "# @markdown Configure Vertex AI Access\n",
  "\n",
- "GCP_PROJECT = '' #@param {type: 'string'}\n",
- "GCP_LOCATION = 'us-central1' #@param {type: 'string'}\n",
+ "GCP_PROJECT = '' # @param {type: 'string'}\n",
+ "GCP_LOCATION = 'us-central1' # @param {type: 'string'}\n",
  "\n",
  "# @markdown ---\n",
  "# @markdown ### 🧠 Configure LLM Models\n",
@@ -116,14 +119,19 @@
  "\n",
  "# @markdown ---\n",
  "# @markdown ### ⚙️ Configure Experiment Parameters\n",
- "# @markdown These control the dataset size, evaluation runs, and GEPA budget.\n",
- "# @markdown For a quick demo, keep these values small. For a real run, you might\n",
- "# @markdown increase `MAX_DATASET_SIZE` to 50-100 and `MAX_METRIC_CALLS` to 100+.\n",
+ "# @markdown Number of trajectories sampled from rollouts to be used by the reflection model in each GEPA step:\n",
+ "MINI_BATCH_SIZE = 8 # @param {type: 'integer'}\n",
+ "# @markdown Size of the pareto and feedback datasets (small setting for demo purposes):\n",
  "MAX_DATASET_SIZE = 10 # @param {type: 'integer'}\n",
+ "# @markdown Number of times each task is run during evaluation:\n",
  "NUM_EVAL_TRIALS = 4 # @param {type: 'integer'}\n",
+ "# @markdown Total budget for GEPA prompt evaluations:\n",
  "MAX_METRIC_CALLS = 100 # @param {type: 'integer'}\n",
+ "# @markdown Maximum number of parallel agent-environment interactions\n",
  "MAX_CONCURRENCY = 4 # @param {type: 'integer'}\n",
  "\n",
+ "# @markdown **Note:** You can find more information on how to configure GEPA in the [README file](https://github.com/google/adk-python/blob/main/contributing/samples/gepa/README.md).\n",
+ "\n",
  "# The ADK uses these environment variables to connect to Vertex AI via the\n",
  "# Google GenAI SDK.\n",
  "os.environ['GOOGLE_GENAI_USE_VERTEXAI'] = 'true'\n",
@@ -165,7 +173,7 @@
  {
  "cell_type": "code",
  "source": [
- "#@title Define an initial instruction\n",
+ "# @title Define an initial instruction\n",
  "\n",
  "# @markdown This is our starting \"seed\" prompt. It's very generic and doesn't give the agent much guidance on how to behave or use tools.\n",
  "BASE_SYSTEM_INSTRUCTION = 'you are a customer support agent helping customers resolve their issues by using the right tools' # @param {type: 'string'}\n",
@@ -226,7 +234,7 @@
  }
  ],
  "source": [
- "#@title Initial Inference: A First Look at Our Agent\n",
+ "# @title Initial Inference: A First Look at Our Agent\n",
  "\n",
  "from tau_bench.types import EnvRunResult, RunConfig\n",
  "\n",
@@ -373,7 +381,8 @@
  }
  ],
  "source": [
- "#@title Let's visualize one of the sampled trajectory\n",
+ "# @title Let's visualize one of the sampled trajectory\n",
+ "\n",
  "\n",
  "def display_trajectory(trajectory):\n",
  " \"\"\"Formats and prints a trajectory for display in Colab.\"\"\"\n",
@@ -400,7 +409,7 @@
  " f'**{role.upper()}**: ↪️ Tool Response from'\n",
  " f' `{fr[\"name\"]}`: `{fr[\"response\"][\"result\"]}`'\n",
  " )\n",
- " print() # new line after each turn\n",
+ " print()  # new line after each turn\n",
  "\n",
  "\n",
  "# Let's inspect the \"trajectory\" of the first run. A trajectory is the full\n",
@@ -485,7 +494,9 @@
  " rnd_seed=42,\n",
  " max_metric_calls=MAX_METRIC_CALLS, # GEPA budget: max prompt evaluations\n",
  " reflection_model=REFLECTION_MODEL_NAME, # Model for GEPA's reflection step\n",
- " reflection_minibatch_size=8,\n",
+ " # Number of trajectories sampled from failed rollouts to be used by the\n",
+ " # reflection model in each GEPA step to generate prompt improvements.\n",
+ " reflection_minibatch_size=MINI_BATCH_SIZE,\n",
  " use_rater=False, # Optional: LLM rater for nuanced feedback\n",
  " # For this demo, we use the same small dataset for all splits.\n",
  " # In a real optimization run, you would use separate datasets:\n",
@@ -1330,7 +1341,7 @@
  }
  ],
  "source": [
- "#@title Run GEPA (this might take ~10 minutes)\n",
+ "# @title Run GEPA (this might take ~10 minutes)\n",
  "# This process can take around 10 minutes for the demo settings, as it\n",
  "# involves multiple rounds of running the agent and calling the reflection model.\n",
  "# A real run with more metric calls will take longer.\n",
@@ -1424,7 +1435,7 @@
  }
  ],
  "source": [
- "#@title Visualize the optimized prompt\n",
+ "# @title Visualize the optimized prompt\n",
  "# Now, let's look at the final, optimized prompt that GEPA produced.\n",
  "# It should be much more detailed than our initial one-line prompt!\n",
  "print('\\n--- Optimized Prompt from GEPA ---')\n",
@@ -1489,7 +1500,7 @@
  }
  ],
  "source": [
- "#@title Run evaluation\n",
+ "# @title Run evaluation\n",
  "\n",
  "# Let's create a new directory for this final evaluation run.\n",
  "final_eval_dir = os.path.join(\n",

contributing/samples/gepa/run_experiment.py

Lines changed: 61 additions & 11 deletions
@@ -26,15 +26,67 @@
  import experiment
  from google.genai import types

- _OUTPUT_DIR = flags.DEFINE_string('output_dir', None, '')
- _EVAL_SET_SIZE = flags.DEFINE_integer('eval_set_size', None, '')
- _MAX_METRIC_CALLS = flags.DEFINE_integer('max_metric_calls', 500, '')
- _NUM_TEST_RECORDS = flags.DEFINE_integer('num_test_records', None, '')
- _NUM_EVAL_TRIALS = flags.DEFINE_integer('num_eval_trials', 4, '')
- _MAX_CONCURRENCY = flags.DEFINE_integer('max_concurrency', 8, '')
- _EVAL_MODE = flags.DEFINE_bool('eval_mode', False, '')
- _USE_RATER = flags.DEFINE_bool('use_rater', False, '')
- _TRAIN_BATCH_SIZE = flags.DEFINE_integer('train_batch_size', 3, '')
+ _OUTPUT_DIR = flags.DEFINE_string(
+     'output_dir',
+     None,
+     'Directory to save experiment results and artifacts.',
+     required=True,
+ )
+ _EVAL_SET_SIZE = flags.DEFINE_integer(
+     'eval_set_size',
+     None,
+     'Size of the dev set to use for Pareto frontier evaluation in GEPA. If'
+     ' None, uses all available dev tasks. A few tens of examples might'
+     ' suffice for simpler tasks and up to a few hundred for more complex'
+     ' and variable tasks. Increase the size to mitigate the effect of'
+     ' variability at greater cost.',
+ )
+ _MAX_METRIC_CALLS = flags.DEFINE_integer(
+     'max_metric_calls',
+     500,
+     'Total budget for GEPA prompt evaluations. This is the main control for'
+     ' runtime/cost. One could start with 100 and increase to 500+ for further'
+     ' optimization.',
+ )
+ _NUM_TEST_RECORDS = flags.DEFINE_integer(
+     'num_test_records',
+     None,
+     'Size of the test set for final evaluation of the optimized prompt. If'
+     ' None, uses all available test tasks.',
+ )
+ _NUM_EVAL_TRIALS = flags.DEFINE_integer(
+     'num_eval_trials',
+     4,
+     'Number of times each task is run during evaluation. Higher values give'
+     ' more stable evaluation metrics but increase runtime. Recommended: 4-8.',
+ )
+ _MAX_CONCURRENCY = flags.DEFINE_integer(
+     'max_concurrency',
+     8,
+     'Maximum number of parallel agent-environment interactions. Increase if'
+     ' you have sufficient API quota.',
+ )
+ _EVAL_MODE = flags.DEFINE_bool(
+     'eval_mode',
+     False,
+     'If set, run evaluation only using the seed prompt, skipping GEPA'
+     ' optimization.',
+ )
+ _USE_RATER = flags.DEFINE_bool(
+     'use_rater',
+     False,
+     'If set, use an LLM rater to score trajectories.',
+ )
+ _TRAIN_BATCH_SIZE = flags.DEFINE_integer(
+     'train_batch_size',
+     3,
+     'Number of trajectories sampled from rollouts to be used by the'
+     ' reflection model in each GEPA step to generate prompt improvements.'
+     ' Increasing the batch size may help provide a more stable signal and'
+     ' estimate of prompt quality but entails higher cost. One can start with'
+     ' a low value and increase the size if significant variations are'
+     ' observed.',
+ )


  def main(argv: Sequence[str]) -> None:
@@ -53,8 +105,6 @@ def main(argv: Sequence[str]) -> None:
  logger.setLevel(logging.WARNING)

  types.logger.addFilter(experiment.FilterInferenceWarnings())
- if not _OUTPUT_DIR.value:
-   raise ValueError('outptut dir must be specified')
  output_dir = os.path.join(
      _OUTPUT_DIR.value, datetime.now().strftime('%Y%m%d%H%M%S%f')
  )
