Reproducible benchmark framework for SolverForge evidence across routing and scalar-assignment problems.
solverforge-bench
solverforge-bench is the official SolverForge benchmark surface. It groups
benchmarks by SolverForge solve shape and runs them through one shared Python
harness so timing, watchdog handling, result rows, logging, CSV output, and
optional PostgreSQL persistence stay comparable across problem families.
Problem packages are adapters. They load cases, build solver callables, validate returned solutions, compute benchmark-specific fields, and hand the result back to the shared framework.
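The adapter role described above can be pictured with a small sketch. Every name here (ProblemAdapter, AdapterResult, ToyAdapter, run_adapter) is hypothetical and chosen for illustration; the real package interfaces may look quite different.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional, Protocol


@dataclass
class AdapterResult:
    """Hypothetical container for what an adapter hands back to the harness."""
    hard_feasible: bool
    cost: Optional[float]
    validation_error: Optional[str] = None
    extra_fields: dict = field(default_factory=dict)  # benchmark-specific fields


class ProblemAdapter(Protocol):
    """Hypothetical interface; not the repository's actual API."""

    def load_cases(self) -> list: ...
    def build_solver(self, solver_name: str) -> Callable[[Any, float], Any]: ...
    def validate(self, case: Any, solution: Any) -> AdapterResult: ...


class ToyAdapter:
    """Toy problem used only for illustration: pick the smallest number."""

    def load_cases(self):
        return [[3, 1, 2], [9, 7]]

    def build_solver(self, solver_name):
        # The callable receives a case and a time limit in seconds.
        return lambda case, time_limit_seconds: min(case)

    def validate(self, case, solution):
        feasible = solution in case
        return AdapterResult(
            hard_feasible=feasible,
            cost=float(solution) if feasible else None,
            validation_error=None if feasible else "solution not in case",
            extra_fields={"case_size": len(case)},
        )


def run_adapter(adapter: ProblemAdapter, solver_name: str, time_limit_seconds: float):
    """Stand-in for the shared harness loop: solve each case, then validate it."""
    solve = adapter.build_solver(solver_name)
    return [
        adapter.validate(case, solve(case, time_limit_seconds))
        for case in adapter.load_cases()
    ]
```

The point of the split is that the loop in run_adapter never needs to know what a case or a solution looks like; only the adapter does.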
What It Provides
- List-variable CVRP benchmarks under list-variable/cvrp/, exercising SolverForge route-list solving on vehicle-routing instances
- Scalar-variable employee scheduling benchmarks under scalar-variable/employee-scheduling/, using the bundled INRC-II nurse rostering corpus
- One shared harness for solver registration, run matrices, wall-clock timing, overshoot calculation, watchdog containment, result rows, CSV files, run logs, and solver stdout/stderr capture
- TOML configuration for benchmark selection, solvers, time limits, run kind, logging, watchdog settings, and PostgreSQL persistence
- PostgreSQL warehouse output for cron-driven and release-tagged runs, backed by checked-in SQL migrations and latest-run views
Setup
The repository currently targets Python 3.14:
python3.14 -m venv .venv
. .venv/bin/activate
pip install -e .
The root Makefile uses the same .venv for CVRP, employee scheduling,
normalization, and nightly runs. Use make install-python-deps to create or
refresh it through the repository workflow.
Benchmark Families
| Family | Path | Scope |
|---|---|---|
| CVRP | list-variable/cvrp/ | List-variable route planning |
| Employee scheduling | scalar-variable/employee-scheduling/ | Scalar-variable shift assignment |
CVRP is the canonical list-variable benchmark. Employee scheduling is the canonical scalar-variable benchmark and uses INRC-II nurse-to-shift assignment instances.
Common Commands
Validate adapters and reference data before treating a benchmark run as evidence:
make validate-cvrp
make validate-employee-scheduling
make validate-employee-model-parity
Run quick smoke benchmarks:
make bench-cvrp-quick
make bench-cvrp-solverforge-quick
make bench-employee-scheduling-quick
make bench-employee-scheduling-solverforge-quick
Run canonical benchmark sets:
make bench-cvrp
make bench-employee-scheduling
The quick CVRP target runs three instances with 1- and 10-second time limits.
The quick employee-scheduling target runs n005w4 with the same two limits. The
canonical employee-scheduling target runs the canonical group from the bundled
INRC-II manifest with 1-, 10-, and 60-second time limits.
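To make the size of a run concrete, a run matrix can be thought of as the cross product of instances, solvers, and time limits. The sketch below is an illustration only: build_run_matrix is a hypothetical helper, and the instance names are placeholders rather than real CVRP instance names.

```python
from itertools import product


def build_run_matrix(instances, solvers, time_limits):
    """Return one (instance, solver, time_limit_seconds) tuple per planned run."""
    return list(product(instances, solvers, time_limits))


# Placeholder instance names; the quick CVRP target picks its own three.
quick_cvrp = build_run_matrix(
    instances=["instance-a", "instance-b", "instance-c"],
    solvers=["solverforge"],
    time_limits=[1, 10],
)
print(len(quick_cvrp))  # 6 solver invocations: 3 instances x 1 solver x 2 limits
```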
Unified Harness
The Make targets call the same root benchmark entrypoint:
PYTHONPATH=src:list-variable/cvrp/src:scalar-variable/employee-scheduling/src \
.venv/bin/python3 scripts/run_benchmark.py cvrp \
--run-kind quick \
--num-instances 3 \
--time-limits 1 10
PYTHONPATH=src:list-variable/cvrp/src:scalar-variable/employee-scheduling/src \
.venv/bin/python3 scripts/run_benchmark.py employee-scheduling \
--run-kind quick \
--datasets n005w4 \
--time-limits 1 10
Shared options include:
--config CONFIG
--solver SOLVER
--time-limits SECONDS...
--run-kind quick|candidate|tag
--nightly | --no-nightly
--release-tag TAG
--save-postgres | --no-save-postgres
--postgres-url URL
--log-level LEVEL
--show-solver-output | --no-show-solver-output
--capture-solver-output | --no-capture-solver-output
The cvrp benchmark adds --num-instances. The employee-scheduling benchmark adds
--dataset-set and --datasets.
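For readers who want to see how a command line of this shape can be laid out, here is a minimal argparse sketch. It is not the contents of scripts/run_benchmark.py. Treating --solver as repeatable, parsing --time-limits as floats, and leaving every default as None are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Shared options live on a parent parser so that each benchmark
    # subcommand accepts them after the benchmark name.
    shared = argparse.ArgumentParser(add_help=False)
    shared.add_argument("--config")
    shared.add_argument("--solver", action="append")  # assumption: repeatable
    shared.add_argument("--time-limits", nargs="+", type=float, metavar="SECONDS")
    shared.add_argument("--run-kind", choices=["quick", "candidate", "tag"])
    shared.add_argument("--nightly", action=argparse.BooleanOptionalAction)
    shared.add_argument("--release-tag")
    shared.add_argument("--save-postgres", action=argparse.BooleanOptionalAction)
    shared.add_argument("--postgres-url")
    shared.add_argument("--log-level")
    shared.add_argument("--show-solver-output", action=argparse.BooleanOptionalAction)
    shared.add_argument("--capture-solver-output", action=argparse.BooleanOptionalAction)

    parser = argparse.ArgumentParser(prog="run_benchmark.py")
    sub = parser.add_subparsers(dest="benchmark", required=True)

    cvrp = sub.add_parser("cvrp", parents=[shared])
    cvrp.add_argument("--num-instances", type=int)

    employee = sub.add_parser("employee-scheduling", parents=[shared])
    employee.add_argument("--dataset-set")
    employee.add_argument("--datasets", nargs="+")
    return parser


args = build_parser().parse_args(
    ["cvrp", "--run-kind", "quick", "--num-instances", "3", "--time-limits", "1", "10"]
)
```

argparse.BooleanOptionalAction generates the paired --flag / --no-flag forms, and an omitted flag stays None, which lets a later step tell "not given" apart from "explicitly disabled".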
Timing Contract
The requested time limit is passed to the solver and measured with wall-clock timing around the solver call. The watchdog is separate: it exists to contain runaway invocations, not to discard every solution that returns slightly after the nominal budget.
The harness records:
overshoot_seconds = max(0, actual_time_seconds - time_limit_seconds)
overshoot_ratio = overshoot_seconds / time_limit_seconds
wall_time_over_limit = actual_time_seconds > time_limit_seconds * 1.1
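The three formulas translate directly into Python. The function name timing_fields is hypothetical; the arithmetic is exactly the one listed above. As a worked case, a run that returns after 10.5 seconds against a 10-second limit overshoots by 0.5 seconds (ratio 0.05) and is not flagged, because 10.5 does not exceed 11.

```python
def timing_fields(actual_time_seconds: float, time_limit_seconds: float) -> dict:
    """Compute the overshoot fields from measured and requested time."""
    overshoot_seconds = max(0.0, actual_time_seconds - time_limit_seconds)
    return {
        "overshoot_seconds": overshoot_seconds,
        "overshoot_ratio": overshoot_seconds / time_limit_seconds,
        # Flag only runs that exceed the limit by more than 10 percent.
        "wall_time_over_limit": actual_time_seconds > time_limit_seconds * 1.1,
    }


slightly_late = timing_fields(actual_time_seconds=10.5, time_limit_seconds=10.0)
clearly_late = timing_fields(actual_time_seconds=12.0, time_limit_seconds=10.0)
early = timing_fields(actual_time_seconds=9.0, time_limit_seconds=10.0)
```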
Only watchdog-killed invocations lose the returned solution, because the child process was forcibly terminated.
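The separation between the time limit and the watchdog can be sketched with the standard library. This is an illustration under assumptions, not the harness's implementation: it runs the solver as a subprocess command, uses the hypothetical name run_with_watchdog, and only records the time limit instead of passing it to a real solver.

```python
import subprocess
import sys
import time


def run_with_watchdog(command, time_limit_seconds, watchdog_limit_seconds):
    """Time a child process and kill it only if it exceeds the watchdog limit."""
    start = time.perf_counter()
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=watchdog_limit_seconds,
        )
        watchdog_killed = False
        solver_output = completed.stdout
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing it
        # would have returned survives.
        watchdog_killed = True
        solver_output = None
    return {
        "time_limit_seconds": time_limit_seconds,
        "actual_time_seconds": time.perf_counter() - start,
        "watchdog_limit_seconds": watchdog_limit_seconds,
        "watchdog_killed": watchdog_killed,
        "solver_output": solver_output,
    }


# Returns after the nominal limit but well inside the watchdog: result is kept.
late = run_with_watchdog(
    [sys.executable, "-c", "import time; time.sleep(0.3); print('late')"],
    time_limit_seconds=0.1,
    watchdog_limit_seconds=30.0,
)

# Runaway invocation: the watchdog terminates it and the result is lost.
runaway = run_with_watchdog(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    time_limit_seconds=0.1,
    watchdog_limit_seconds=0.5,
)
```

In the first call the overshoot is recorded but the output is still available for validation; only the second call ends with no solution.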
TOML Configuration
Use a TOML file when a run needs to be repeatable outside a Make target:
benchmark = "cvrp"
solver = ["solverforge"]
time_limits = [1]
run_kind = "quick"
nightly = false
[postgres]
save = false
url = "postgresql://postgres@localhost/solverforge_bench"
[logging]
level = "INFO"
show_solver_output = true
capture_solver_output = true
[benchmarks.cvrp]
num_instances = 3
[benchmarks.employee-scheduling]
dataset_set = "quick"
datasets = ["n005w4"]
Command-line options override TOML values. Make targets accept the same file
through BENCH_CONFIG:
make bench-cvrp-quick BENCH_CONFIG=benchmark.example.toml
Use benchmark.nightly.example.toml as the cron-oriented template for the
combined nightly job.
PostgreSQL Results
Normal benchmark targets write CSV evidence artifacts. The -db targets apply
migrations and save the same run to PostgreSQL:
make db-check
make db-create
make db-migrate
make bench-cvrp-quick-db
make bench-employee-scheduling-quick-db
make bench-nightly-db
make bench-nightly-db is the cron entrypoint. It builds both benchmark stacks,
applies migrations once, then runs full CVRP and canonical employee scheduling
as one PostgreSQL-saving nightly job.
PostgreSQL stores run catalog data, solver versions, and per-result rows.
Display consumers should read benchmark_result_facts,
latest_benchmark_runs, or latest_benchmark_result_facts instead of
reconstructing the run/result join.
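A display consumer might read one of these views through any DB-API 2.0 connection, roughly as sketched below. The function name is hypothetical, and the selected column names are an assumption: they are borrowed from the CSV schema and may not match the view definition in the checked-in migrations. The %s placeholder style is the one used by psycopg.

```python
def fetch_latest_results(conn, benchmark_name):
    """Read rows from the latest-run view instead of joining runs and results."""
    query = (
        "SELECT instance, solver, time_limit_seconds, cost "  # assumed columns
        "FROM latest_benchmark_result_facts "
        "WHERE benchmark_name = %s "
        "ORDER BY instance, solver, time_limit_seconds"
    )
    cur = conn.cursor()
    try:
        cur.execute(query, (benchmark_name,))
        return cur.fetchall()
    finally:
        cur.close()


# Example use, assuming psycopg is installed and the database exists:
#   import psycopg
#   with psycopg.connect("postgresql://postgres@localhost/solverforge_bench") as conn:
#       rows = fetch_latest_results(conn, "cvrp")
```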
Result Columns
Generated CSV rows use one global snake_case schema. Core columns include:
run_id, benchmark_name, benchmark_category, dataset, dataset_set, instance,
instance_size, solver, solver_version, time_limit_seconds,
actual_time_seconds, overshoot_seconds, overshoot_ratio,
wall_time_over_limit, watchdog_limit_seconds, watchdog_killed, run_error,
solver_stdout_path, solver_stderr_path, hard_feasible, cost, reported_cost,
fresh_cost, reference_cost, quality_ratio, validation_error,
solution_artifact
Benchmark-specific fields such as nurses, weeks,
validator_model_delta, and score_drift are preserved as optional native
columns.
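Because every row shares the same schema, a CSV file can be summarized with the standard library alone. The sketch below averages quality_ratio per solver and time limit, skipping watchdog-killed and infeasible rows. The helper name is hypothetical, the sample rows are invented placeholders rather than real results, and the lowercase true/false encoding of booleans is an assumption about the CSV output.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Placeholder rows for illustration only; not measured results.
SAMPLE_CSV = """instance,solver,time_limit_seconds,watchdog_killed,hard_feasible,quality_ratio
inst-a,solverforge,1,false,true,1.10
inst-b,solverforge,1,false,true,1.30
inst-a,solverforge,10,false,true,1.02
inst-b,solverforge,10,true,false,
"""


def mean_quality_by_budget(csv_file):
    """Average quality_ratio per (solver, time limit) over usable rows."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        if row["watchdog_killed"].lower() == "true":
            continue  # no solution was returned
        if row["hard_feasible"].lower() != "true":
            continue  # infeasible solutions have no comparable quality
        key = (row["solver"], float(row["time_limit_seconds"]))
        groups[key].append(float(row["quality_ratio"]))
    return {key: mean(values) for key, values in groups.items()}


summary = mean_quality_by_budget(io.StringIO(SAMPLE_CSV))
```

To run it against a real artifact, pass an open file object instead of the io.StringIO wrapper.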