AKOS

Evals

Run structured test suites against your workflows and track pass rates over time.

The Evals screen lets you define test cases for your workflows, run them on demand, and review the results. Use evals to catch regressions before promoting a flow to production.

The screen has three sections in the section rail: Suites, Results, and Benchmarks.

Suites

A suite is a named collection of test cases for a specific workflow. Each test case defines an input and an expected output (or assertion criteria).

Suite list

Each row shows:

  • Suite name and ID.
  • The workflow it is bound to.
  • The number of test cases in the suite.
  • Run — execute all cases in the suite against the current workflow version.
  • Delete — remove the suite.

If no suites exist, an empty state links to the Config editor where you can define them.

Creating a suite

Suites are defined in the workspace configuration file (Config editor). Each suite specifies:

  • flowId — the ID of the workflow to test.
  • cases — an array of test cases, each with an input object and optionally an expected object or assertion expression.
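
As a rough illustration of the shape described above, a suite could be modelled in Python as shown here. Only flowId, cases, input, and expected come from this page. The id, name, and assert keys, the sample values, and the validate_suite helper are assumptions made for the example, and the actual syntax of the workspace configuration file may differ.

```python
# Hypothetical suite definition. Only flowId, cases, input, and expected
# are named in the documentation; every other key and value is illustrative.
suite = {
    "id": "refund-triage-smoke",           # assumed key
    "name": "Refund triage smoke tests",   # assumed key
    "flowId": "refund-triage",
    "cases": [
        # Case with an expected output object.
        {"input": {"message": "I want my money back"},
         "expected": {"intent": "refund"}},
        # Case with an assertion expression (key name and syntax are assumed).
        {"input": {"message": "Where is my order?"},
         "assert": "output.intent == 'tracking'"},
    ],
}


def validate_suite(suite):
    """Return a list of problems found in a suite definition (empty if none)."""
    problems = []
    flow_id = suite.get("flowId")
    if not isinstance(flow_id, str) or not flow_id:
        problems.append("flowId must be a non-empty string")
    cases = suite.get("cases")
    if not isinstance(cases, list):
        problems.append("cases must be an array")
        return problems
    for i, case in enumerate(cases):
        # Every case needs an input object; expected and assert are optional.
        if not isinstance(case, dict) or not isinstance(case.get("input"), dict):
            problems.append(f"cases[{i}].input must be an object")
    return problems


print(validate_suite(suite))
```

A check like this is only a local sanity test before pasting a definition into the Config editor; it does not reflect any validation the product itself performs.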

Results

The Results section shows the outcomes of suite runs. Each row shows:

  • The suite ID and the time the run started.
  • Pass/fail counts and a percentage badge (green for 100% pass, red for any failures).
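
The counts and badge in each row can be sketched as follows. This is an illustration of the rule stated above (green only at 100% pass, red for any failure), not the product's implementation; the summarize_run function, its field names, and the one-decimal rounding are assumptions.

```python
def summarize_run(case_passed):
    """Summarize a run from a list of per-case pass/fail booleans."""
    if not case_passed:
        raise ValueError("a run needs at least one case")
    passed = sum(1 for ok in case_passed if ok)
    failed = len(case_passed) - passed
    return {
        "passed": passed,
        "failed": failed,
        "percent": round(100 * passed / len(case_passed), 1),
        # Decide the colour from the failure count, not the rounded
        # percentage, so a single failure can never show as green.
        "badge": "green" if failed == 0 else "red",
    }


print(summarize_run([True, True, False]))
```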

Click a result to expand the per-case breakdown, showing which cases passed, which failed, and the actual vs expected output for each failure.
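
To make the per-case breakdown concrete, here is a minimal sketch that compares actual and expected outputs. It assumes every case has an expected object and that a case passes only on exact equality; the case_breakdown function, the row fields, and the sample data are hypothetical, and the product may compare outputs differently (for example, through assertion expressions).

```python
def case_breakdown(cases, actual_outputs):
    """Build one row per case; failed rows carry expected and actual output."""
    rows = []
    for index, (case, actual) in enumerate(zip(cases, actual_outputs)):
        passed = actual == case["expected"]  # assumed rule: exact equality
        row = {"case": index, "passed": passed}
        if not passed:
            row["expected"] = case["expected"]
            row["actual"] = actual
        rows.append(row)
    return rows


cases = [
    {"input": {"message": "I want my money back"}, "expected": {"intent": "refund"}},
    {"input": {"message": "Where is my order?"}, "expected": {"intent": "tracking"}},
]
actual_outputs = [{"intent": "refund"}, {"intent": "refund"}]

for row in case_breakdown(cases, actual_outputs):
    print(row)
```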

If no results exist yet, an empty state prompts you to go to Suites and run one.

Benchmarks

The Benchmarks section shows standardised benchmark scores for your agents and workflows. Benchmarks measure performance on pre-defined industry datasets or task sets.

Each benchmark row shows:

  • The benchmark name and provider.
  • The agent or workflow being measured.
  • The score and the date it was last run.
  • A Run button to run the benchmark again.

Benchmarks require a connected evaluation provider. Configure the provider in the Connections screen.
