Subset 17-ID control · with tools
- Opus 4.6 high: 14/17
- Opus 4.7 high: 5/17
- Paired delta: +9 in favor of 4.6 on the same IDs, same envelope, post-window.
Matched envelope: MCP, parallel=2, timeout=360s, no session persistence, high effort.
bots-bench / AI SOC evals / Splunk BOTSv3
Loading benchmark corpus and run snapshots…
Frontier Leaderboard
Each entry pairs an AI agent (Claude Code, OpenAI Codex) with a model. One point per agent-model-reasoning combination, best published config in each bucket.
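A minimal sketch of the bucketing rule, assuming each published run is a flat record with agent, model, reasoning-tier, and score fields; the field names and values below are illustrative, not the snapshot's real schema.

```python
# Hypothetical published runs; names and scores are placeholders.
runs = [
    {"agent": "Claude Code", "model": "model-x", "reasoning": "high", "score": 61},
    {"agent": "Claude Code", "model": "model-x", "reasoning": "high", "score": 58},
    {"agent": "OpenAI Codex", "model": "model-y", "reasoning": "medium", "score": 49},
]

# One leaderboard point per (agent, model, reasoning) bucket:
# keep only the best published config in each bucket.
best = {}
for run in runs:
    key = (run["agent"], run["model"], run["reasoning"])
    if key not in best or run["score"] > best[key]["score"]:
        best[key] = run

for key, run in sorted(best.items()):
    print(key, "->", run["score"])
```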
Frontier Follow-Up
Each polygon uses one representative config: the best-scoring run on the latest tested version of each model family.
Model progress over time
One point per model line — strongest effort tier kept. Lines connect chronologically within each model family so successive generations are visible.
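A sketch of how these points and lines could be built, assuming hypothetical records with a family, release date, effort tier, and score (all placeholders): keep the strongest effort tier per model line, then sort within each family by date before drawing the connecting line.

```python
from datetime import date

# Hypothetical model-line results; families, dates, and scores are placeholders.
points = [
    {"family": "family-a", "model": "a-1", "released": date(2024, 6, 1), "effort": "high", "score": 44},
    {"family": "family-a", "model": "a-2", "released": date(2025, 2, 1), "effort": "medium", "score": 51},
    {"family": "family-a", "model": "a-2", "released": date(2025, 2, 1), "effort": "high", "score": 57},
]

# One point per model line: keep the strongest effort tier by score.
best = {}
for p in points:
    if p["model"] not in best or p["score"] > best[p["model"]]["score"]:
        best[p["model"]] = p

# Connect points chronologically within each family so successive generations line up.
families = {}
for p in best.values():
    families.setdefault(p["family"], []).append(p)
for family, pts in families.items():
    pts.sort(key=lambda q: q["released"])
    print(family, [(q["model"], q["effort"], q["score"]) for q in pts])
```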
Efficiency
Switch among cost-vs-speed, cost-vs-solve-rate, and cost-vs-combined-score modes. Bubble size encodes first-pass solved-count so cost can stay on the axes.
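A minimal matplotlib sketch of one mode (cost vs solve rate), assuming per-config rollups with cost, solve-rate, and first-pass-solved fields; all labels and numbers are placeholders.

```python
import matplotlib.pyplot as plt

# Hypothetical per-config rollups; every value here is a placeholder.
configs = [
    {"label": "agent-a / model-x high", "cost_usd": 12.0, "solve_rate": 0.62, "first_pass_solved": 31},
    {"label": "agent-b / model-y medium", "cost_usd": 5.5, "solve_rate": 0.41, "first_pass_solved": 20},
]

# Cost stays on the x axis and solve rate on y; bubble area encodes
# first-pass solved count so a third dimension fits without a third axis.
xs = [c["cost_usd"] for c in configs]
ys = [c["solve_rate"] for c in configs]
sizes = [c["first_pass_solved"] * 20 for c in configs]  # scaled for visibility

plt.scatter(xs, ys, s=sizes, alpha=0.6)
for c in configs:
    plt.annotate(c["label"], (c["cost_usd"], c["solve_rate"]))
plt.xlabel("Cost (USD per run)")
plt.ylabel("Solve rate")
plt.show()
```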
Benchmark Program
Systems Under Test
Provider Upgrades
Summarizing version-step changes from publishable runs in this snapshot.
Benchmark Atlas
The goal is representative investigation work, not toy prompts. This view makes the breadth visible: 100+ overlapping log and alert providers, incident families grounded in the BOTSv3 IR corpus, and question-level hardness.
Chart Wall
Each dot is a benchmark question. Left means fewer configs solve it. Up means it burns more time or more attempts. Bigger dots mean more repeated attempts, and color stays tied to track.
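A sketch of how each dot's coordinates could be derived, assuming hypothetical per-(config, question) attempt records; the field names and numbers are illustrative only.

```python
from statistics import median

# Hypothetical attempt records: one row per (config, question); values are placeholders.
attempts = [
    {"question": "q1", "config": "a", "solved": True, "minutes": 3.0, "tries": 1},
    {"question": "q1", "config": "b", "solved": False, "minutes": 6.0, "tries": 3},
    {"question": "q2", "config": "a", "solved": True, "minutes": 1.5, "tries": 1},
]

questions = {}
for row in attempts:
    questions.setdefault(row["question"], []).append(row)

for q, rows in questions.items():
    solve_frac = sum(r["solved"] for r in rows) / len(rows)  # x: left = fewer configs solve it
    med_minutes = median(r["minutes"] for r in rows)          # y: up = burns more time
    total_tries = sum(r["tries"] for r in rows)               # size: more repeated attempts
    print(q, solve_frac, med_minutes, total_tries)
```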
Claude 4.6 Controversy
Anthropic published an April 23 postmortem describing three product/harness-level regressions affecting recent Claude runs. We reran the same 17 previously-contaminated IDs under a matched post-window envelope to isolate model effects from window confounds.
Matched envelope: MCP, parallel=2, timeout=360s, no session persistence, high effort.
q2_sq201_id_cl solves for 4.7 because it is answerable from general knowledge. Evidence: Anthropic postmortem · findings-09 · short report.
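A minimal sketch of the matched-envelope paired comparison. The envelope fields mirror the text (MCP, parallel=2, 360 s timeout, no session persistence, high effort); the per-ID outcomes below are placeholders that only reproduce the published totals (14/17 vs 5/17), not real per-question results.

```python
# Matched envelope from the text.
ENVELOPE = {
    "transport": "mcp",
    "parallel": 2,
    "timeout_s": 360,
    "session_persistence": False,
    "effort": "high",
}

# Placeholder outcomes for the 17-ID control subset (only the totals are real).
control_ids = [f"q{i:02d}" for i in range(17)]
solved_46 = {q: i < 14 for i, q in enumerate(control_ids)}  # 14/17
solved_47 = {q: i < 5 for i, q in enumerate(control_ids)}   # 5/17

assert solved_46.keys() == solved_47.keys()  # paired: same IDs, same envelope, post-window
delta = sum(solved_46.values()) - sum(solved_47.values())
print(delta)  # +9 in favor of 4.6
```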
Benchmark Hygiene
This asks whether a model can answer the benchmark from prior exposure or memory before using the investigation tools.
We keep the same score axis as the main benchmark so you can see how much of the corpus a model reaches before any tool use.
Loading literal-answer coverage and traceability statuses...
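A sketch of the hygiene probe under stated assumptions: ask each corpus question with the investigation tools disabled and score the answers on the same axis as the main benchmark. `answer_without_tools` is a hypothetical stand-in, not the project's real API.

```python
def answer_without_tools(model: str, question: str) -> str:
    """Hypothetical stand-in for querying the model with investigation tools disabled."""
    raise NotImplementedError

def hygiene_coverage(model: str, corpus: list[dict]) -> float:
    # Fraction of the corpus the model reaches from prior exposure or memory,
    # before any tool use; kept on the same score axis as the main benchmark.
    solved = 0
    for item in corpus:
        guess = answer_without_tools(model, item["question"])
        solved += guess.strip().lower() == item["answer"].strip().lower()
    return solved / len(corpus)
```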
Harness Versions
Comparing earliest vs latest MCP runs for the same model (date-based proxy for harness evolution).
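A sketch of the date-based proxy, assuming each MCP run record carries a model, run date, and score (placeholders below): pair each model's earliest and latest run and diff the scores.

```python
from datetime import date

# Hypothetical MCP runs; dates and scores are placeholders.
mcp_runs = [
    {"model": "model-x", "date": date(2025, 3, 1), "score": 40},
    {"model": "model-x", "date": date(2025, 8, 1), "score": 47},
    {"model": "model-y", "date": date(2025, 4, 15), "score": 35},
]

by_model = {}
for run in mcp_runs:
    by_model.setdefault(run["model"], []).append(run)

for model, runs in by_model.items():
    runs.sort(key=lambda r: r["date"])
    earliest, latest = runs[0], runs[-1]
    # Date-based proxy for harness evolution: same model, earliest vs latest MCP run.
    print(model, latest["score"] - earliest["score"])
```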
Prompt Effects
Looking for clean cross-validation and OODA / OSCAR-style planning-loop pairs...
Latest Runs
Looking for local run artifacts…
Retry Lab
These rows show configs where the evaluator allows extra retry passes after a miss. That can improve score, but it also blows out time and token budgets.
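A sketch of the trade-off these rows surface, using hypothetical per-question pass logs (all values placeholders): extra passes can lift the solved count, but every pass adds to the time and token totals.

```python
# Hypothetical attempt logs for one config; values are placeholders.
attempts = [
    {"question": "q1", "passes": [{"solved": False, "minutes": 4, "tokens": 20_000},
                                  {"solved": True,  "minutes": 5, "tokens": 25_000}]},
    {"question": "q2", "passes": [{"solved": True,  "minutes": 2, "tokens": 8_000}]},
]

first_pass_solved = sum(a["passes"][0]["solved"] for a in attempts)
with_retries_solved = sum(any(p["solved"] for p in a["passes"]) for a in attempts)
total_minutes = sum(p["minutes"] for a in attempts for p in a["passes"])
total_tokens = sum(p["tokens"] for a in attempts for p in a["passes"])

# Retries lift the score (2 vs 1 here), but the extra passes inflate the
# time and token budgets, which is what the Retry Lab rows make visible.
print(first_pass_solved, with_retries_solved, total_minutes, total_tokens)
```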
Config Microscope
The map and microscope share one selection. Pick a config here, or click a point in the solve-versus-time map, and inspect its question-level pattern instead of only looking at rollups.
Config Map
Human-readable config labels, not raw run IDs. Pick the score dimension and the time dimension you care about, then click a point to sync the microscope below.
Supporting Boards
Additional ways to slice the same benchmark snapshot.
Benchmark Rules
Experiment Surface
The dimensions we vary across benchmark runs. Stay tuned as we add more databases, agents, models, and benchmarks.
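A sketch of the experiment surface as a config grid, where every run is one cell; the dimension names and values below are illustrative, not the snapshot's actual surface.

```python
from itertools import product

# Illustrative experiment surface; the real dimensions and values may differ.
SURFACE = {
    "database": ["splunk-botsv3"],
    "agent": ["claude-code", "openai-codex"],
    "model": ["model-x", "model-y"],
    "effort": ["medium", "high"],
    "retries": [0, 1],
}

# Enumerate every cell of the grid as one candidate run config.
for combo in product(*SURFACE.values()):
    print(dict(zip(SURFACE.keys(), combo)))
```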