# Tests
Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional evaluator overrides.
## Basic Structure

```yaml
tests:
  - id: addition
    criteria: Correctly calculates 15 + 27 = 42
    input: What is 15 + 27?
    expected_output: "42"
```

## Fields
| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier for the test |
| `criteria` | Yes | Description of what a correct response should contain |
| `input` | Yes | Input sent to the target (string, object, or message array) |
| `expected_output` | No | Expected response for comparison (string, object, or message array) |
| `execution` | No | Per-case execution overrides (target, evaluators) |
| `workspace` | No | Per-case workspace config (overrides suite-level) |
| `metadata` | No | Arbitrary key-value pairs passed to setup/teardown scripts |
| `rubrics` | No | Structured evaluation criteria |
| `assert` | No | Per-test assertion evaluators (alternative to `execution.evaluators`) |
| `sidecar` | No | Additional metadata passed to evaluators |
## Input

The simplest form is a string, which expands to a single user message:
```yaml
input: What is 15 + 27?
```

For multi-turn or system messages, use a message array:
```yaml
input:
  - role: system
    content: You are a helpful math tutor.
  - role: user
    content: What is 15 + 27?
```

## Expected Output
Optional reference response used by evaluators for comparison. A string expands to a single assistant message:
expected_output: "42"For structured or multi-message expected output, use a message array:
```yaml
expected_output:
  - role: assistant
    content: "42"
```
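To make the expansion rules concrete, here is a minimal Python sketch of how a bare string normalizes into a message array. This is an illustration of the rules above, not the framework's actual code:

```python
def normalize_input(value):
    """A bare string expands to a single user message (sketch of the
    expansion rule described above)."""
    if isinstance(value, str):
        return [{"role": "user", "content": value}]
    return value  # already a message array (or structured object)

def normalize_expected_output(value):
    """A bare string expands to a single assistant message."""
    if isinstance(value, str):
        return [{"role": "assistant", "content": value}]
    return value

assert normalize_input("What is 15 + 27?") == [
    {"role": "user", "content": "What is 15 + 27?"}
]
assert normalize_expected_output("42") == [
    {"role": "assistant", "content": "42"}
]
```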
## Per-Case Execution Overrides

Override the default target or evaluators for specific tests:
```yaml
tests:
  - id: complex-case
    criteria: Provides detailed explanation
    input: Explain quicksort algorithm
    execution:
      target: gpt4_target
      evaluators:
        - name: depth_check
          type: llm_judge
          prompt: ./judges/depth.md
```

Per-case evaluators are merged with root-level `execution.evaluators`: test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level evaluators for a specific test, set `skip_defaults: true`:
```yaml
execution:
  evaluators:
    - name: latency_check
      type: latency
      threshold: 5000

tests:
  - id: normal-case
    criteria: Returns correct answer
    input: What is 2+2?
    # Gets latency_check from root-level evaluators

  - id: special-case
    criteria: Handles edge case
    input: Handle this edge case
    execution:
      skip_defaults: true
      evaluators:
        - name: custom_eval
          type: llm_judge
    # Does NOT get latency_check
```

## Per-Case Workspace Config
Override the suite-level workspace config for individual tests. Test-level fields replace suite-level fields:
```yaml
workspace:
  setup:
    script: ["bun", "run", "default-setup.ts"]

tests:
  - id: case-1
    criteria: Should work
    input: Do something
    workspace:
      setup:
        script: ["bun", "run", "custom-setup.ts"]

  - id: case-2
    criteria: Should also work
    input: Do something else
    # Inherits suite-level setup
```

See Workspace Setup/Teardown for the full workspace config reference.
## Per-Case Metadata
Pass arbitrary key-value pairs to setup/teardown scripts via the `metadata` field. This is useful for benchmark datasets where each case needs repo info, commit hashes, or other context:
```yaml
tests:
  - id: sympy-20590
    criteria: Bug should be fixed
    input: Fix the diophantine equation bug
    metadata:
      repo: sympy/sympy
      base_commit: "abc123def"
    workspace:
      setup:
        script: ["python", "checkout_repo.py"]
```

The `metadata` field is included in the stdin JSON passed to setup and teardown scripts as `case_metadata`.
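For example, a setup script can read the stdin JSON and act on `case_metadata`. The sketch below is illustrative only: the `case_metadata` key is documented above, but the rest of the payload shape and the checkout commands are assumptions:

```python
#!/usr/bin/env python3
"""Illustrative checkout_repo.py: reads the setup payload from stdin.
Only the case_metadata key is documented; other details are assumed."""
import json
import subprocess
import sys

payload = json.load(sys.stdin)
meta = payload["case_metadata"]  # the test's `metadata` block

repo = meta["repo"]               # e.g. "sympy/sympy"
base_commit = meta["base_commit"]

# Clone the repo and pin it to the benchmark's base commit (assumed workflow).
subprocess.run(["git", "clone", f"https://github.com/{repo}", "repo"], check=True)
subprocess.run(["git", "-C", "repo", "checkout", base_commit], check=True)
```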
## Per-Test Assertions
The `assert` field provides a concise way to attach evaluators directly to a test, as an alternative to nesting them inside `execution.evaluators`. It supports both deterministic assertion types and LLM-based rubric evaluation.
### Deterministic Assertions
These evaluators run without an LLM call and produce binary (0 or 1) scores:
| Type | Field | Description |
|---|---|---|
| `contains` | `value` | Checks if output contains the substring |
| `regex` | `value` | Checks if output matches the regex pattern |
| `is_json` | — | Checks if output is valid JSON |
| `equals` | `value` | Checks if output exactly equals the value (trimmed) |
```yaml
tests:
  - id: json-api
    criteria: Returns valid JSON with status field
    input: Return the system status as JSON
    assert:
      - type: is_json
      - type: contains
        value: '"status"'
```

Assertion evaluators auto-generate a name when one is not provided (e.g., `contains-DENIED`, `is_json`).
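For reference, the scoring rules in the table can be expressed as a short Python sketch. This is a conceptual illustration of the semantics, not the framework's implementation (in particular, applying trimming only to the output under `equals` is an assumption):

```python
import json
import re

def score(assertion_type: str, output: str, value: str | None = None) -> int:
    """Return a binary score (0 or 1) per the deterministic assertion rules."""
    if assertion_type == "contains":
        return int(value in output)
    if assertion_type == "regex":
        return int(re.search(value, output) is not None)
    if assertion_type == "is_json":
        try:
            json.loads(output)
            return 1
        except json.JSONDecodeError:
            return 0
    if assertion_type == "equals":
        return int(output.strip() == value)
    raise ValueError(f"unknown assertion type: {assertion_type}")

assert score("contains", '{"status": "ok"}', '"status"') == 1
assert score("is_json", '{"status": "ok"}') == 1
assert score("equals", "  42  ", "42") == 1
```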
### Rubric Assertions
Use `type: rubrics` with a `criteria` array to define structured LLM-judged evaluation criteria inline:
```yaml
tests:
  - id: denied-party
    criteria: Must identify denied party
    input:
      - role: user
        content: Screen "Acme Corp" against denied parties list
    expected_output:
      - role: assistant
        content: "DENIED"
    assert:
      - type: contains
        value: "DENIED"
        required: true
      - type: rubrics
        criteria:
          - id: accuracy
            outcome: Correctly identifies the denied party
            weight: 5.0
          - id: reasoning
            outcome: Provides clear reasoning for the decision
            weight: 3.0
```

## Required Gates
Any evaluator (in `assert` or `execution.evaluators`) can be marked as required. When a required evaluator fails, the overall test verdict is `fail` regardless of the aggregate score.
| Value | Behavior |
|---|---|
| `required: true` | Must score >= 0.8 (default threshold) to pass |
| `required: 0.6` | Must score >= 0.6 to pass (custom threshold between 0 and 1) |
```yaml
assert:
  - type: contains
    value: "DENIED"
    required: true   # must pass (>= 0.8)
  - type: rubrics
    required: 0.6    # must score at least 0.6
    criteria:
      - id: quality
        outcome: Response is well-structured
        weight: 1.0
```

Required gates are evaluated after all evaluators run. If any required evaluator falls below its threshold, the verdict is forced to `fail`.
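Conceptually, the gating step looks like this. The sketch assumes scores are normalized to the 0..1 range and result shapes are hypothetical; it illustrates the verdict rule described above, not the framework's code:

```python
DEFAULT_REQUIRED_THRESHOLD = 0.8  # applied when `required: true`

def verdict(evaluator_results: list[dict], aggregate_pass: bool) -> str:
    """Force a fail if any required evaluator scores below its threshold.

    Each result is assumed (for illustration) to look like:
      {"score": 0.9, "required": True}   # default threshold 0.8
      {"score": 0.5, "required": 0.6}    # custom threshold
      {"score": 0.7}                     # not a gate
    """
    for result in evaluator_results:
        required = result.get("required")
        if required in (None, False):
            continue
        threshold = DEFAULT_REQUIRED_THRESHOLD if required is True else float(required)
        if result["score"] < threshold:
            return "fail"  # a failed gate overrides the aggregate score
    return "pass" if aggregate_pass else "fail"

assert verdict([{"score": 1.0, "required": True}], aggregate_pass=True) == "pass"
assert verdict([{"score": 0.5, "required": 0.6}], aggregate_pass=True) == "fail"
```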
## Assert vs execution.evaluators
`assert` and `execution.evaluators` serve the same purpose: both define evaluators for a test. Choose whichever fits your style:
- `assert`: Flat, concise syntax at the test level. Best for assertion-heavy evals.
- `execution.evaluators`: Nested under `execution`. Useful when you also set `target` or `skip_defaults`.
When both are present on the same test, `assert` takes precedence over `execution.evaluators`. Suite-level `assert` and `execution.evaluators` follow the same merge rules: they are appended to per-test evaluators unless `skip_defaults: true` is set.
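The resolution order can be expressed as a short sketch. This illustrates the precedence and merge rules above under assumed data shapes; in particular, the relative order of suite-level `assert` and `execution.evaluators` in the appended defaults is an assumption:

```python
def resolve_evaluators(test: dict, suite: dict) -> list:
    """Sketch of the evaluator resolution order described above.

    Assumed shapes (hypothetical, for illustration):
      test  = {"assert": [...], "execution": {"evaluators": [...], "skip_defaults": bool}}
      suite = {"assert": [...], "execution": {"evaluators": [...]}}
    """
    execution = test.get("execution") or {}
    # On the same test, `assert` takes precedence over `execution.evaluators`.
    per_test = test.get("assert") or execution.get("evaluators") or []
    if execution.get("skip_defaults"):
        return list(per_test)
    # Suite-level defaults are appended after per-test evaluators.
    suite_defaults = (suite.get("assert") or []) + \
        ((suite.get("execution") or {}).get("evaluators") or [])
    return list(per_test) + suite_defaults
```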
## Sidecar Metadata
Pass additional context to evaluators via the `sidecar` field:
```yaml
tests:
  - id: code-gen
    criteria: Generates valid Python
    sidecar:
      language: python
      difficulty: medium
    input: Write a function to sort a list
```