feat(testing): Enable sandboxed prompt evaluations

To ensure prompt evaluation tests are hermetic, this change runs
gemini-cli within a sandbox by default. This prevents tests from
having side effects on the host system, which is critical for
running on CI/CQ bots.

The test runner now pre-fetches the sandbox image and adds the
--sandbox flag to gemini-cli calls.

A --no-sandbox flag has been added to allow developers to run tests
locally without a container runtime.

Bug: 441944057
Change-Id: If8567383f519b7027e2970b44965b9f5eb8a2033
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6955305
Commit-Queue: Struan Shrimpton <[email protected]>
Reviewed-by: Brian Sheedy <[email protected]>
Reviewed-by: Struan Shrimpton <[email protected]>
Auto-Submit: James Woo <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1517443}
GitOrigin-RevId: 6661d9b2cf29df839cec7c88d046a6d0ead37995
README.md

Prompt Evaluation

This directory contains an experimental script for running prompt evaluation tests on extensions and prompts under //agents. It currently only works locally and will make temporary changes to your Chromium repo.

Usage

Existing tests can be run via the //agents/testing/eval_prompts.py script. It should handle everything automatically, although you are advised to commit any local changes before running it. The script retrieves a temporary copy of promptfoo, performs repo setup, runs the configured tests, and performs teardown.

By default, the script builds promptfoo from tip-of-tree (ToT), but this behavior can be configured via command-line arguments; using a stable release from npm instead will likely result in faster setup.
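For example, a default run from the root of the checkout might look like the following (a minimal sketch; the full set of supported arguments can be listed with --help, assuming the script uses standard argparse behavior):

# Run all configured prompt evals with default settings (builds
# promptfoo from ToT).
python3 agents/testing/eval_prompts.py

# List the supported command-line arguments (assumes standard
# argparse --help behavior).
python3 agents/testing/eval_prompts.py --help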

Running without a container runtime

The script uses sandboxing by default to isolate the test environment, so if you are running eval_prompts.py on a system without a container runtime such as Docker or Podman, you will need to pass the --no-sandbox flag.
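For example:

# Run the evals without a container runtime; --no-sandbox disables the
# default sandboxing described above.
python3 agents/testing/eval_prompts.py --no-sandbox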

btrfs Chromium Setup

The prompt evals are intended to be run with Chromium on a btrfs file system. The tests should still run in a normal checkout, but they will be significantly slower and take up significantly more disk space. The steps below fetch a new Chromium solution into a virtual btrfs file system mounted in your home directory.

The following commands can be used to set up the environment:

# 1. Ensure btrfs is installed
sudo apt install btrfs-progs

# 2. Create the virtual image file
truncate -s 500G ~/btrfs_virtual_disk.img

# 3. Format the image with btrfs
mkfs.btrfs ~/btrfs_virtual_disk.img

# 4. Mount the image
mkdir ~/btrfs
sudo mount -o loop ~/btrfs_virtual_disk.img ~/btrfs

# 5. Update owner
sudo chown $(whoami):$(id -ng) ~/btrfs

# 6. Create a btrfs subvolume for the checkout
btrfs subvolume create ~/btrfs/chromium

# 7. Fetch a new Chromium checkout into the subvolume.
# This will place the 'src' directory inside '~/btrfs/chromium/'.
cd ~/btrfs/chromium
fetch chromium

# For an existing checkout, you would instead move the contents, e.g.:
# mv ~/your_old_chromium/* ~/btrfs/chromium/

After Chromium is checked out, agents/testing/eval_prompts.py can be run from ~/btrfs/chromium/src/.
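For example (assuming the setup above completed successfully):

# Run the evals from inside the btrfs-backed checkout.
cd ~/btrfs/chromium/src
python3 agents/testing/eval_prompts.py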

Adding Extensions

The script only installs the extensions listed in EXTENSIONS_TO_INSTALL at the top of eval_prompts.py. If an extension should be present for testing, add its name to this list; a quick way to locate it is sketched below.
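(A sketch; only the list name is documented in this README, so the edit itself is done by hand.)

# Find where EXTENSIONS_TO_INSTALL is defined, then add your
# extension's name to the list.
grep -n "EXTENSIONS_TO_INSTALL" agents/testing/eval_prompts.py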

Adding Tests

Each independent test case should have its own promptfoo YAML config file; see the promptfoo documentation for details. If multiple prompts are expected to result in the same behavior, and thus can be tested in the same way, a single config file can contain multiple prompts, and promptfoo will automatically test each prompt individually.

Config files should be placed in a tests/promptfoo/ subdirectory of the relevant prompt or extension directory. New YAML files must also be added to the PROMPTFOO_CONFIG_COMPONENTS list at the top of the script for the tests to actually run; a hypothetical layout is sketched below.
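(The extension name and config file name here are illustrative, not from the source.)

# Create the config directory for a hypothetical extension.
mkdir -p agents/extensions/my_extension/tests/promptfoo
# Add the new config there, e.g. my_test.promptfoo.yaml, then append
# its path to PROMPTFOO_CONFIG_COMPONENTS at the top of
# agents/testing/eval_prompts.py so the test is picked up.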

Advanced Usage: Testing Custom Options

gemini_provider.py supports several custom options for advanced testing scenarios, such as applying file changes or loading specific templates. Below is an example promptfoo.yaml file demonstrating how to use the changes option to patch and stage files before a test prompt is run.

This example can be used as a template for writing tests that require a specific file state.

Example: custom_options.promptfoo.yaml

prompts:
  - "What is the staged content of the file `path/to/dummy.txt`?"
providers:
  - id: "python:../../../testing/gemini_provider.py"
    config:
      extensions:
        - depot_tools
      changes:
        - apply: "path/to/add_dummy_content.patch"
        - stage: "path/to/dummy.txt"
tests:
  - description: "Test with custom options"
    assert:
      # Check that the agent ran git diff and found the new content.
      - type: icontains
        value: "dummy content"

Example Patch File

The changes field points to standard .patch files, which the test runner applies before the prompt is run.

add_dummy_content.patch

diff --git a/path/to/dummy.txt b/path/to/dummy.txt
index e69de29..27332d3 100644
--- a/path/to/dummy.txt
+++ b/path/to/dummy.txt
@@ -0,0 +1 @@
+dummy content
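One way to produce a patch like the one above (a sketch, assuming the dummy content has already been added to the file in the working tree but not yet committed):

# Capture the uncommitted edit to dummy.txt as a patch file.
git diff -- path/to/dummy.txt > add_dummy_content.patch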