Task eval command reference
runme eval is an experimental command group for running repeatable task datasets against local coding agents.
It builds on Harbor's task, dataset, trial, and job model, but Runme presents the workflow through the runme eval CLI.
runme eval does not run Markdown cells from a notebook. It runs evaluation task datasets, collects eval job evidence, compares results, and optionally promotes changes backed by eval output.
runme eval
Run agent task evals with Runme.
runme eval [dataset-path] [flags] [-- harbor-flags...]
When dataset-path is omitted, Runme uses ./evals/tasks.
Important flags:
--agent: Harbor agent to use. Defaults tooracle, Harbor's built-in oracle solution runner.--task-dir: task directory name to include from the Harbor dataset.--jobs-dir: eval jobs directory. Defaults to.runme/evals/jobs.--ask: do not auto-accept Harbor confirmation prompts.--agent-env,--ae: environment variable to pass to the agent inKEY=VALUEformat. Can be repeated.--agent-kwarg,--ak: Harbor agent kwarg. Can be repeated.--model: Harbor agent model.--env,-e: Harbor environment to use. When omitted, Runme's environment is used. Pass a Harbor environment name when the selected task dataset provides one. For Docker environment configuration, see Harbor's task environment docs.--runme-bin: Runme binary used by the Harbor environment.--runme-arg: additional Runme argument used by the Harbor environment. Can be repeated.--runme-harbor-bin:runme-harborexecutable.--debug: print delegated commands.
Supported agents:
oracle: Harbor's built-in oracle solution runner. It runs the task's provided solution and is not an external agent CLI.codexclaude-codecursor-cliopenclaw
External agent runs require the selected agent CLI to be installed and authenticated locally.
Runtime environment paths
Runme exposes runtime path metadata to task and verifier commands as environment variables.
Use these variables in task scripts, verifiers, and artifact writers instead of hardcoding workspace, log, or reward paths:
RUNME_TASK_WORKDIR: resolved task workspace.RUNME_TASK_DIR: task definition directory.RUNME_TASK_NAME: task name.RUNME_TESTS_DIR: uploaded verifier tests directory.RUNME_LOGS_DIR: base trial logs directory.RUNME_AGENT_LOG_DIR: agent logs directory.RUNME_ARTIFACTS_DIR: artifact output directory.RUNME_VERIFIER_DIR: verifier output directory.RUNME_REWARD_PATH: canonical reward JSON path.RUNME_REWARD_DETAILS_PATH: optional detailed reward JSON path.
Runtime path variables are set by Runme and take precedence over task environment values with the same names.
runme eval view
View eval jobs in the dashboard.
runme eval view [jobs-dir] [flags]
Important flags:
--port: dashboard port. Defaults to the first open port from8080.--no-open: do not open the dashboard in the default browser.--runme-harbor-bin:runme-harborexecutable.--debug: print delegated commands.
runme eval compare
Compare eval jobs for a dataset.
runme eval compare [dataset-path] [flags]
When dataset-path is omitted, Runme uses ./evals/tasks. Runme compares the latest Git-tracked eval job for that dataset with the latest local eval job for the same dataset under the jobs directory.
Important flags:
--jobs-dir: eval jobs directory. Defaults to.runme/evals/jobsunder the project root.--job: compare against a specific local eval job.--base: Git ref used to find the tracked baseline eval job. Defaults toHEAD.--format: output format, eithertextorjson. Defaults totext.--include-oracle: allow comparing eval jobs that only used Harbor's built-in oracle solution runner.--allow-errors: allow comparing eval jobs with errored trials.
Behavior:
- The command is read-only.
- It filters tracked baselines, local candidates, and explicit
--jobvalues by dataset path. - It uses job counters and overlapping result rewards.
- It prints an advisory recommendation.
- It does not commit, promote, or enforce policy.
runme eval promote
Commit staged changes with eval job evidence.
runme eval promote [dataset-path] [flags]
When dataset-path is omitted, Runme uses ./evals/tasks. Runme selects, compares, and promotes eval jobs for that dataset.
Important flags:
--job: eval job directory to promote.--latest: promote the latest eval job for the dataset under--jobs-dir.--dry-run: print what would be committed without staging or committing.--evidence-only: commit only the selected eval job evidence when no source changes are staged.--artifacts: include full eval artifacts such as logs and trial outputs. These may contain sensitive information.--include-oracle: allow promoting eval jobs that only used Harbor's built-in oracle solution runner.--allow-errors: allow promoting eval jobs with errored trials.--promote-anyway: promote even when eval comparison blocks promotion.--message: commit subject line. Eval evidence is added to the commit body.
Behavior:
- Requires either
--jobor--latest. - It filters latest-job selection, explicit
--jobvalues, comparison baselines, and newer-job warnings by dataset path. - Adds eval evidence to a commit.
- Requires staged changes unless
--evidence-onlyis used. - Blocks promotion on comparison failures unless
--promote-anywayis used. - In
--dry-runmode, prints the selected job, evidence mode, files to add, comparison result, and proposed commit message.