Subset 17-ID control · with tools
- Opus 4.6 high: 14/17
- Opus 4.7 high: 5/17
- Paired delta: +9 in favor of 4.6 on the same IDs, same envelope, post-window.
Matched envelope: MCP, parallel=2, timeout=360s, no session persistence, high effort.
bots-bench / AI SOC evals / Splunk BOTSv3
Loading benchmark corpus and run snapshots…
Frontier Leaderboard
Each entry pairs an AI agent (Claude Code, OpenAI Codex) with a model. One point per agent-model-reasoning combination, best published config in each bucket.
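A minimal sketch of the bucketing rule, assuming each published run is a flat record with agent, model, reasoning-tier, and score fields; the field names and values below are illustrative, not the snapshot's real schema.

```python
# Hypothetical published runs; names and scores are placeholders.
runs = [
    {"agent": "Claude Code", "model": "model-x", "reasoning": "high", "score": 61},
    {"agent": "Claude Code", "model": "model-x", "reasoning": "high", "score": 58},
    {"agent": "OpenAI Codex", "model": "model-y", "reasoning": "medium", "score": 49},
]

# One leaderboard point per (agent, model, reasoning) bucket:
# keep only the best published config in each bucket.
best = {}
for run in runs:
    key = (run["agent"], run["model"], run["reasoning"])
    if key not in best or run["score"] > best[key]["score"]:
        best[key] = run

for key, run in sorted(best.items()):
    print(key, "->", run["score"])
```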
Frontier Follow-Up
Each polygon uses one representative config: the best-scoring run on the latest tested version of each model family.
Model progress over time
One point per model line — strongest effort tier kept. Lines connect chronologically within each model family so successive generations are visible.
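A sketch of how these points and lines could be built, assuming hypothetical records with a family, release date, effort tier, and score (all placeholders): keep the strongest effort tier per model line, then sort within each family by date before drawing the connecting line.

```python
from datetime import date

# Hypothetical model-line results; families, dates, and scores are placeholders.
points = [
    {"family": "family-a", "model": "a-1", "released": date(2024, 6, 1), "effort": "high", "score": 44},
    {"family": "family-a", "model": "a-2", "released": date(2025, 2, 1), "effort": "medium", "score": 51},
    {"family": "family-a", "model": "a-2", "released": date(2025, 2, 1), "effort": "high", "score": 57},
]

# One point per model line: keep the strongest effort tier by score.
best = {}
for p in points:
    if p["model"] not in best or p["score"] > best[p["model"]]["score"]:
        best[p["model"]] = p

# Connect points chronologically within each family so successive generations line up.
families = {}
for p in best.values():
    families.setdefault(p["family"], []).append(p)
for family, pts in families.items():
    pts.sort(key=lambda q: q["released"])
    print(family, [(q["model"], q["effort"], q["score"]) for q in pts])
```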
Efficiency
Switch among cost-vs-speed, cost-vs-solve-rate, and cost-vs-combined-score modes. Bubble size encodes first-pass solved-count so cost can stay on the axes.
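A minimal matplotlib sketch of one mode (cost vs solve rate), assuming per-config rollups with cost, solve-rate, and first-pass-solved fields; all labels and numbers are placeholders.

```python
import matplotlib.pyplot as plt

# Hypothetical per-config rollups; every value here is a placeholder.
configs = [
    {"label": "agent-a / model-x high", "cost_usd": 12.0, "solve_rate": 0.62, "first_pass_solved": 31},
    {"label": "agent-b / model-y medium", "cost_usd": 5.5, "solve_rate": 0.41, "first_pass_solved": 20},
]

# Cost stays on the x axis and solve rate on y; bubble area encodes
# first-pass solved count so a third dimension fits without a third axis.
xs = [c["cost_usd"] for c in configs]
ys = [c["solve_rate"] for c in configs]
sizes = [c["first_pass_solved"] * 20 for c in configs]  # scaled for visibility

plt.scatter(xs, ys, s=sizes, alpha=0.6)
for c in configs:
    plt.annotate(c["label"], (c["cost_usd"], c["solve_rate"]))
plt.xlabel("Cost (USD per run)")
plt.ylabel("Solve rate")
plt.show()
```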
Benchmark Program
Systems Under Test
Provider Upgrades
Summarizing version-step changes from publishable runs in this snapshot.
Benchmark Atlas
The goal is representative investigation work, not toy prompts. This view makes the breadth visible: 100+ overlapping log and alert providers, incident families grounded in the BOTSv3 IR corpus, and question-level hardness.
Chart Wall
Each dot is a benchmark question. Left means fewer configs solve it. Up means it burns more time or more attempts. Bigger dots mean more repeated attempts, and color stays tied to track.
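A sketch of how each dot's coordinates could be derived, assuming hypothetical per-(config, question) attempt records; the field names and numbers are illustrative only.

```python
from statistics import median

# Hypothetical attempt records: one row per (config, question); values are placeholders.
attempts = [
    {"question": "q1", "config": "a", "solved": True, "minutes": 3.0, "tries": 1},
    {"question": "q1", "config": "b", "solved": False, "minutes": 6.0, "tries": 3},
    {"question": "q2", "config": "a", "solved": True, "minutes": 1.5, "tries": 1},
]

questions = {}
for row in attempts:
    questions.setdefault(row["question"], []).append(row)

for q, rows in questions.items():
    solve_frac = sum(r["solved"] for r in rows) / len(rows)  # x: left = fewer configs solve it
    med_minutes = median(r["minutes"] for r in rows)          # y: up = burns more time
    total_tries = sum(r["tries"] for r in rows)               # size: more repeated attempts
    print(q, solve_frac, med_minutes, total_tries)
```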
Claude 4.6 Controversy
Anthropic published an April 23 postmortem describing three product/harness-level regressions affecting recent Claude runs. We reran the same 17 previously-contaminated IDs under a matched post-window envelope to isolate model effects from window confounds.
Matched envelope: MCP, parallel=2, timeout=360s, no session persistence, high effort.
q2_sq201_id_cl solves for 4.7 because it is answerable from general knowledge. Evidence: Anthropic postmortem · findings-09 · short report.
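A minimal sketch of the matched-envelope paired comparison. The envelope fields mirror the text (MCP, parallel=2, 360 s timeout, no session persistence, high effort); the per-ID outcomes below are placeholders that only reproduce the published totals (14/17 vs 5/17), not real per-question results.

```python
# Matched envelope from the text.
ENVELOPE = {
    "transport": "mcp",
    "parallel": 2,
    "timeout_s": 360,
    "session_persistence": False,
    "effort": "high",
}

# Placeholder outcomes for the 17-ID control subset (only the totals are real).
control_ids = [f"q{i:02d}" for i in range(17)]
solved_46 = {q: i < 14 for i, q in enumerate(control_ids)}  # 14/17
solved_47 = {q: i < 5 for i, q in enumerate(control_ids)}   # 5/17

assert solved_46.keys() == solved_47.keys()  # paired: same IDs, same envelope, post-window
delta = sum(solved_46.values()) - sum(solved_47.values())
print(delta)  # +9 in favor of 4.6
```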
Benchmark Hygiene
This asks whether a model can answer the benchmark from prior exposure or memory before using the investigation tools.
We keep the same score axis as the main benchmark so you can see how much of the corpus a model reaches before any tool use.
Loading literal-answer coverage and traceability statuses...
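A sketch of the hygiene probe under stated assumptions: ask each corpus question with the investigation tools disabled and score the answers on the same axis as the main benchmark. `answer_without_tools` is a hypothetical stand-in, not the project's real API.

```python
def answer_without_tools(model: str, question: str) -> str:
    """Hypothetical stand-in for querying the model with investigation tools disabled."""
    raise NotImplementedError

def hygiene_coverage(model: str, corpus: list[dict]) -> float:
    # Fraction of the corpus the model reaches from prior exposure or memory,
    # before any tool use; kept on the same score axis as the main benchmark.
    solved = 0
    for item in corpus:
        guess = answer_without_tools(model, item["question"])
        solved += guess.strip().lower() == item["answer"].strip().lower()
    return solved / len(corpus)
```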
Harness Versions
Comparing earliest vs latest MCP runs for the same model (date-based proxy for harness evolution).
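A sketch of the date-based proxy, assuming each MCP run record carries a model, run date, and score (placeholders below): pair each model's earliest and latest run and diff the scores.

```python
from datetime import date

# Hypothetical MCP runs; dates and scores are placeholders.
mcp_runs = [
    {"model": "model-x", "date": date(2025, 3, 1), "score": 40},
    {"model": "model-x", "date": date(2025, 8, 1), "score": 47},
    {"model": "model-y", "date": date(2025, 4, 15), "score": 35},
]

by_model = {}
for run in mcp_runs:
    by_model.setdefault(run["model"], []).append(run)

for model, runs in by_model.items():
    runs.sort(key=lambda r: r["date"])
    earliest, latest = runs[0], runs[-1]
    # Date-based proxy for harness evolution: same model, earliest vs latest MCP run.
    print(model, latest["score"] - earliest["score"])
```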
Prompt Effects
Looking for clean cross-validation and OODA / OSCAR-style planning-loop pairs...
Latest Runs
Looking for local run artifacts…
Retry Lab
These rows show configs where the evaluator allows extra retry passes after a miss. That can improve score, but it also blows out time and token budgets.
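A sketch of the trade-off these rows surface, using hypothetical per-question pass logs (all values placeholders): extra passes can lift the solved count, but every pass adds to the time and token totals.

```python
# Hypothetical attempt logs for one config; values are placeholders.
attempts = [
    {"question": "q1", "passes": [{"solved": False, "minutes": 4, "tokens": 20_000},
                                  {"solved": True,  "minutes": 5, "tokens": 25_000}]},
    {"question": "q2", "passes": [{"solved": True,  "minutes": 2, "tokens": 8_000}]},
]

first_pass_solved = sum(a["passes"][0]["solved"] for a in attempts)
with_retries_solved = sum(any(p["solved"] for p in a["passes"]) for a in attempts)
total_minutes = sum(p["minutes"] for a in attempts for p in a["passes"])
total_tokens = sum(p["tokens"] for a in attempts for p in a["passes"])

# Retries lift the score (2 vs 1 here), but the extra passes inflate the
# time and token budgets, which is what the Retry Lab rows make visible.
print(first_pass_solved, with_retries_solved, total_minutes, total_tokens)
```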
Config Microscope
The map and microscope share one selection. Pick a config here, or click a point in the solve-versus-time map, and inspect its question-level pattern instead of only looking at rollups.
Config Map
Human-readable config labels, not raw run IDs. Pick the score dimension and the time dimension you care about, then click a point to sync the microscope below.
Supporting Boards
Additional ways to slice the same benchmark snapshot.
Benchmark Rules
Experiment Surface
The dimensions we vary across benchmark runs. Stay tuned as we add more databases, agents, models, and benchmarks.
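A sketch of the experiment surface as a config grid, where every run is one cell; the dimension names and values below are illustrative, not the snapshot's actual surface.

```python
from itertools import product

# Illustrative experiment surface; the real dimensions and values may differ.
SURFACE = {
    "database": ["splunk-botsv3"],
    "agent": ["claude-code", "openai-codex"],
    "model": ["model-x", "model-y"],
    "effort": ["medium", "high"],
    "retries": [0, 1],
}

# Enumerate every cell of the grid as one candidate run config.
for combo in product(*SURFACE.values()):
    print(dict(zip(SURFACE.keys(), combo)))
```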