bots-bench / Cyber Blue Team CTF

CyBT-CTF: Fable 5

Compare systems: Louie.ai · Claude Code · Codex · opencode

Reader Questions

What this page is trying to separate

Leaderboard

Score, solve time, and cost on the selected test

This section ranks measured stock routes, projected unlocked Mythos rows, Opus, Codex, and open-weight systems on one benchmark at a time. CyBT-CTF is the primary blinded read; BOTSv3 is a public-benchmark cross-check.

Solve Rate

Solve rates on the selected test

Solid rows are measured; dotted Mythos rows estimate unlocked Fable behavior from same-task Fable-versus-Opus evidence. Bars are sorted so better results sit farther right.

Tradeoff Map

Interactive score, time, and cost scatter

Use the toggles to compare accuracy, solve time, median time, and cost. Lower-is-better metrics are oriented so better points move right or up.
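The orientation rule can be sketched as a sign flip applied before plotting. This is a minimal illustration, not the page's actual chart code: the metric names and the set of lower-is-better metrics are assumptions, and the values are made up.

```python
# Hypothetical metric names; which metrics the real chart treats as
# lower-is-better is an assumption based on the description above.
LOWER_IS_BETTER = {"solve_time_s", "median_time_s", "cost_usd"}


def oriented(metric: str, value: float) -> float:
    """Return a plotting coordinate where larger always means better."""
    return -value if metric in LOWER_IS_BETTER else value


# Illustrative values only, not benchmark results.
fast = oriented("solve_time_s", 30.0)
slow = oriented("solve_time_s", 90.0)
```

With this convention the faster run gets the larger coordinate, so it lands farther right or higher, while a higher-is-better metric such as accuracy passes through unchanged.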

Pass-through Risk

How often Fable answered versus passed to Opus

Pass-throughs and no-model outcomes are measured only for Fable here, so this chart does not plot comparison systems. It shows how often Fable answered itself, passed the attempt to Opus, or produced no answer across the available summary runs.
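The three-way split can be sketched as a simple tally over per-attempt outcomes. The outcome labels and the sample list below are hypothetical; the real summary runs may record outcomes under different names.

```python
from collections import Counter

# Hypothetical outcome labels for one Fable attempt; the real run
# summaries may use different names.
OUTCOMES = ("fable_answered", "opus_passthrough", "no_answer")


def outcome_shares(attempts):
    """Return the fraction of attempts that ended in each outcome."""
    counts = Counter(attempts)
    total = len(attempts)
    if total == 0:
        # No attempts recorded: report zero for every outcome.
        return {name: 0.0 for name in OUTCOMES}
    return {name: counts[name] / total for name in OUTCOMES}


# Illustrative data only, not measured results.
sample = ["fable_answered", "fable_answered", "opus_passthrough", "no_answer"]
shares = outcome_shares(sample)
```

Reporting shares rather than raw counts keeps runs with different numbers of attempts comparable on one chart.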

Coverage

Blinded score coverage and public BOTSv3 ATT&CK cross-check

Proprietary Systems

Fable vs Opus and Codex

Open-Weight Systems

Fable vs opencode open-weight models

Refusals and Opus Pass-throughs

How often Fable declined regular cyber tasks

This is the operational caveat: Fable does not answer every attempt itself. Some attempts are confirmed refusal pass-throughs to Opus; others ended as provider or no-model failures before a clean Fable answer existed. BOTSv3 has pass-through counts, but older records do not preserve the exact reason, and refusal counts by ATT&CK tactic are not available yet.

Answer Path

Fable answers vs Opus pass-throughs vs no answer

Reasons

Why attempts were passed to Opus

Benchmark Validity

CyBT-CTF versus public BOTSv3

CyBT-CTF is the primary read because it is blinded. BOTSv3 is public and useful only as a contamination check: if a model jumps on BOTSv3 but not CyBT-CTF, treat that as a warning sign rather than a clean capability gain.
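The warning rule can be written as a comparison of gains on the two benchmarks. This is only a sketch of the reasoning: the function name, the gain inputs, and the `min_gap` threshold are all hypothetical, and the page does not state a numeric cutoff.

```python
def contamination_warning(public_gain, blinded_gain, min_gap=0.10):
    """Flag a model whose public-benchmark gain outruns its blinded gain.

    Gains are changes in solve rate (0.0 to 1.0) between two model versions:
    public_gain on BOTSv3, blinded_gain on CyBT-CTF.
    min_gap is a hypothetical threshold, not a value from the benchmark.
    """
    return (public_gain - blinded_gain) >= min_gap


# Illustrative numbers only, not measured results.
suspicious = contamination_warning(public_gain=0.20, blinded_gain=0.02)
consistent = contamination_warning(public_gain=0.20, blinded_gain=0.18)
```

A flagged model is not proven contaminated; the flag only marks a BOTSv3 jump that the blinded benchmark does not corroborate.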