Benchmark Program
Frontier Leaderboard
One point per model line and reasoning level. We keep the best published config in each bucket, then plot first-pass solve rate against total time spent on solved questions, so both quality and spend stay visible.
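A minimal sketch of that one-point-per-bucket reduction, assuming a hypothetical RunSummary record; every field name here is illustrative, not the harness schema.

// One frontier point per (model line, reasoning level) bucket.
interface RunSummary {
  modelLine: string;        // model family, collapsing checkpoint variants
  reasoningLevel: string;   // e.g. "low" | "medium" | "high"
  config: string;           // published config identifier
  firstPassRate: number;    // fraction of questions solved on the first pass
  solvedTimeSec: number;    // total time spent on solved questions
}

function frontierPoints(runs: RunSummary[]): RunSummary[] {
  const best = new Map<string, RunSummary>();
  for (const run of runs) {
    const key = `${run.modelLine}::${run.reasoningLevel}`;
    const cur = best.get(key);
    // Higher first-pass rate wins the bucket; ties go to the cheaper run.
    if (
      !cur ||
      run.firstPassRate > cur.firstPassRate ||
      (run.firstPassRate === cur.firstPassRate &&
        run.solvedTimeSec < cur.solvedTimeSec)
    ) {
      best.set(key, run);
    }
  }
  // Each survivor plots as (x = solvedTimeSec, y = firstPassRate).
  return [...best.values()];
}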
Benchmark Atlas
The aim is representative investigation work, not toy prompts. The atlas makes that breadth visible: 100+ overlapping log and alert providers, incident families grounded in the BOTSv3 IR corpus, and question-level hardness.
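For concreteness, one way the per-question metadata behind the atlas could be shaped; the fields and values below are hypothetical, not the corpus schema.

interface BenchmarkQuestion {
  id: string;
  track: string;            // the color dimension used across the boards
  incidentFamily: string;   // grounded in the BOTSv3 IR corpus
  providers: string[];      // which of the 100+ log/alert providers it touches
  hardness: number;         // question-level hardness, higher is harder
}

// Example entry with made-up values:
const example: BenchmarkQuestion = {
  id: "q-204",
  track: "endpoint",
  incidentFamily: "ransomware",
  providers: ["Sysmon", "WinEventLog"],
  hardness: 0.7,
};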
Chart Wall
Each dot is a benchmark question: further left, fewer configs solve it; further up, it burns more time or more attempts. Dot size grows with repeated attempts, and color stays tied to track.
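A sketch of how those dot coordinates could fall out of per-attempt records; the shapes and names are assumptions, not the real pipeline.

interface Attempt {
  questionId: string;
  config: string;
  solved: boolean;
  timeSec: number;
}

interface Dot {
  questionId: string;
  solveRate: number;   // x: fraction of configs that solve it (left = fewer)
  meanTimeSec: number; // y: average time burned per attempt (up = more)
  attempts: number;    // size: total repeated attempts
}

function toDots(attempts: Attempt[], configCount: number): Dot[] {
  const byQuestion = new Map<string, Attempt[]>();
  for (const a of attempts) {
    const bucket = byQuestion.get(a.questionId);
    if (bucket) bucket.push(a);
    else byQuestion.set(a.questionId, [a]);
  }
  return [...byQuestion.entries()].map(([questionId, qa]) => {
    const solvers = new Set(qa.filter((a) => a.solved).map((a) => a.config));
    const totalTime = qa.reduce((sum, a) => sum + a.timeSec, 0);
    return {
      questionId,
      solveRate: solvers.size / configCount,
      meanTimeSec: totalTime / qa.length,
      attempts: qa.length,
    };
  });
}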
Benchmark Hygiene
No-tools priors versus scored runs
This board asks whether a model can answer the benchmark from prior exposure or memorization, before it uses the investigation tools, and it tracks literal-answer coverage and traceability status per question. We keep the same score axis as the main benchmark so you can see how much of the corpus a model reaches before any tool use.
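A hedged sketch of that no-tools pass; askModel and scoreAnswer stand in for the real harness hooks, which this page does not define.

async function noToolsPrior(
  questions: { id: string; prompt: string; answer: string }[],
  askModel: (prompt: string) => Promise<string>,
  scoreAnswer: (got: string, want: string) => boolean,
): Promise<number> {
  let solved = 0;
  for (const q of questions) {
    // No investigation tools here: the model answers from prior exposure alone.
    const got = await askModel(q.prompt);
    if (scoreAnswer(got, q.answer)) solved += 1;
  }
  // Same score axis as the main benchmark: fraction of the corpus reached.
  return solved / questions.length;
}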
MCP vs CLI
Publishable same-model comparisons where the only change is tool transport: MCP access versus raw curl-style CLI calls.
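One way those pairs could be assembled, assuming a hypothetical transport field on each run summary; only models with a publishable run on both transports produce a row.

interface ToolRun {
  model: string;
  transport: "mcp" | "cli";
  passRate: number;
}

function pairByModel(
  runs: ToolRun[],
): { model: string; mcp: number; cli: number }[] {
  const byModel = new Map<string, Partial<Record<"mcp" | "cli", number>>>();
  for (const r of runs) {
    const entry = byModel.get(r.model) ?? {};
    entry[r.transport] = r.passRate;
    byModel.set(r.model, entry);
  }
  // Keep only models covered on both transports, so the comparison stays fair.
  return [...byModel.entries()]
    .filter(([, e]) => e.mcp !== undefined && e.cli !== undefined)
    .map(([model, e]) => ({ model, mcp: e.mcp!, cli: e.cli! }));
}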
Prompt Effects
Same-model pairs that isolate prompt changes, covering clean cross-validation and OODA / OSCAR-style planning-loop prompts.
Latest Runs
The latest local run artifacts.
Retry Lab
These rows show configs where the evaluator allows extra retry passes after a miss. That can improve score, but it also blows out time and token budgets.
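A sketch of the accounting behind those rows, with illustrative names: the score counts a question once any pass solves it, but time and tokens accrue on every pass, including the misses.

interface Pass {
  solved: boolean;
  timeSec: number;
  tokens: number;
}

function retryRollup(passesPerQuestion: Pass[][]) {
  let solved = 0;
  let timeSec = 0;
  let tokens = 0;
  for (const passes of passesPerQuestion) {
    for (const p of passes) {
      // Budgets charge every attempt, not just the winning one.
      timeSec += p.timeSec;
      tokens += p.tokens;
      if (p.solved) {
        solved += 1;
        break; // Stop retrying once the question is solved.
      }
    }
  }
  return { solved, timeSec, tokens };
}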
Config Microscope
The map and microscope share one selection. Pick a config here, or click a point in the solve-versus-time map, and inspect its question-level pattern instead of only looking at rollups.
Config Map
Human-readable config labels, not raw run IDs. Pick the score dimension and the time dimension you care about, then click a point to sync the microscope below.
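A minimal sketch of the one selection the map and microscope share; the class and names are illustrative, not the page's actual state store.

type Listener = (configId: string | null) => void;

class SharedSelection {
  private configId: string | null = null;
  private listeners: Listener[] = [];

  // Called by a click on a map point or by the microscope's config picker.
  select(configId: string): void {
    this.configId = configId;
    for (const notify of this.listeners) notify(configId);
  }

  // Both boards subscribe, so one pick re-renders the pair in sync.
  subscribe(listener: Listener): void {
    this.listeners.push(listener);
    listener(this.configId);
  }
}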
Supporting Boards
These stay here as supporting diagnostics, not the headline benchmark verdict.
Experiment Surface
This stays on the page for auditability. It shows what we are varying, but it is not the headline result.