
Model Agreement Audit

Agreement can be good or bad. Within a model family, it helps show which behaviors survive from one generation or sibling model to the next. Across providers, unusually close behavior can flag possible distillation, copying, or benchmark contamination. This page is a screening view, not proof by itself.


Executive Read

What agreement can and cannot say

Use this as a triage screen. The positive case is continuity: sibling or next-generation models should preserve some useful behavior, and this shows where that happens or breaks. The negative case is imitation: a cross-provider row that looks too close, especially on wrong answers, deserves private follow-up.

MCC / phi

The Matthews correlation coefficient (MCC) is a -1 to +1 agreement score over what two systems solved and missed. For two solved/missed outcome vectors it is the same quantity as the phi coefficient, which is why the two names appear together on this page. Near 1 means very similar outcomes, near 0 means little relationship, and negative means opposite patterns. We use it as the headline affinity score because it handles uneven solve rates better than raw overlap.
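
As a minimal sketch of how such a score can be computed, the function below takes two equal-length lists of solved/missed flags (one entry per task) and returns MCC from the 2x2 contingency counts. The function name, the input layout, and the choice to return 0.0 when a system solved everything or nothing are assumptions for illustration, not the page's actual pipeline.

```python
import math

def mcc(solved_a, solved_b):
    """Matthews correlation (phi) between two lists of solved/missed booleans."""
    if len(solved_a) != len(solved_b):
        raise ValueError("both systems must be scored on the same tasks")
    pairs = list(zip(solved_a, solved_b))
    n11 = sum(1 for a, b in pairs if a and b)          # both solved
    n10 = sum(1 for a, b in pairs if a and not b)      # only A solved
    n01 = sum(1 for a, b in pairs if not a and b)      # only B solved
    n00 = sum(1 for a, b in pairs if not a and not b)  # both missed
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Assumed convention: undefined MCC (a constant outcome vector) reads as 0.
        return 0.0
    return (n11 * n00 - n10 * n01) / denom

# Two systems that each solve 9 of 10 tasks but miss different ones.
a = [True] * 8 + [True, False]
b = [True] * 8 + [False, True]
raw_overlap = sum(x == y for x, y in zip(a, b)) / len(a)
print(raw_overlap, mcc(a, b))
```

In this toy case raw overlap is 0.8, yet MCC is slightly negative (about -0.11), because the two systems never miss the same task. That is the sense in which MCC is less flattered by high solve rates than raw overlap.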

Cohen’s kappa

Cohen’s kappa is another chance-adjusted agreement score. It is useful as a second read: if MCC and kappa both look high, the agreement signal is stronger; if they diverge, the pair needs closer review.
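
For comparison, here is a small sketch of Cohen's kappa on the same kind of solved/missed input. It uses the standard definition, observed agreement minus chance agreement, divided by one minus chance agreement. The function name and the 0.0 fallback for the degenerate case are illustrative assumptions.

```python
def cohen_kappa(solved_a, solved_b):
    """Cohen's kappa between two lists of solved/missed booleans."""
    if len(solved_a) != len(solved_b) or not solved_a:
        raise ValueError("need two non-empty lists over the same tasks")
    n = len(solved_a)
    # Observed agreement: share of tasks with the same outcome.
    p_observed = sum(1 for a, b in zip(solved_a, solved_b) if bool(a) == bool(b)) / n
    # Chance agreement from each system's own solve rate.
    rate_a = sum(1 for a in solved_a if a) / n
    rate_b = sum(1 for b in solved_b if b) / n
    p_chance = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_chance == 1:
        # Assumed convention: both systems constant and identical, kappa undefined.
        return 0.0
    return (p_observed - p_chance) / (1 - p_chance)

# Uneven solve rates: A solves 3 of 4 tasks, B solves 1 of 4.
a = [True, True, True, False]
b = [True, False, False, False]
print(cohen_kappa(a, b))
```

On this toy pair kappa is 0.2, while MCC on the same data works out to 1/3. Gaps of this kind tend to appear when the two solve rates differ, which is one reason a divergence between the two scores is worth a closer look.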

Featured Triangles

GLM, frontier systems, and normal controls

This isolates one same-benchmark triangle at a time. Titles spell out the actual comparison: a cross-company model screen, an Opus/Sonnet related-model control, or an Opus generation check when the safe data supports it. Line width shows MCC/phi strength and color marks how much follow-up the edge deserves.

MCC / phi edges

Same-benchmark affinity triangles

Line width shows MCC/phi. Orange means an unusually high match across companies, yellow means an elevated cross-company match, and blue means expected similarity inside a company or model family. This is still a screening view, not proof of distillation.

Same-Benchmark Pairs

Highest affinity rows

The main table never cross-ranks BOTSv3 and CyBT-CTF. Cross-benchmark pairs are held for the exploratory section.
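
One way to enforce that rule is to build pairs benchmark by benchmark, so a cross-benchmark pair can never enter the ranked table. The sketch below assumes a hypothetical input of (benchmark, system) tuples; the system names are placeholders.

```python
from collections import defaultdict
from itertools import combinations

def same_benchmark_pairs(runs):
    """Return (benchmark, system_a, system_b) for systems that share a benchmark."""
    by_benchmark = defaultdict(list)
    for benchmark, system in runs:
        by_benchmark[benchmark].append(system)
    pairs = []
    for benchmark in sorted(by_benchmark):
        # combinations() only mixes systems inside one benchmark's list.
        for a, b in combinations(sorted(by_benchmark[benchmark]), 2):
            pairs.append((benchmark, a, b))
    return pairs

# Placeholder system names; only the benchmark names come from this page.
runs = [
    ("BOTSv3", "system-A"),
    ("BOTSv3", "system-B"),
    ("CyBT-CTF", "system-A"),
    ("CyBT-CTF", "system-C"),
]
print(same_benchmark_pairs(runs))
```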

MCC / phi

Top same-benchmark affinity

Higher MCC means the two systems solved and missed a more similar set of tasks. Same-family similarity can be expected; cross-provider similarity is a review signal when it exceeds controls or concentrates on wrong answers.
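
The page does not say how concentration on wrong answers is measured. One plausible follow-up check, sketched here under that caveat, is the share of jointly missed tasks where both systems gave the same wrong answer. The function name and the dictionaries mapping task id to answer are hypothetical.

```python
def shared_wrong_rate(answers_a, answers_b, gold):
    """Among tasks both systems missed, the fraction with identical wrong answers.

    Returns None when there is no task that both systems answered and missed.
    """
    both_wrong = [
        task for task in gold
        if task in answers_a and task in answers_b
        and answers_a[task] != gold[task]
        and answers_b[task] != gold[task]
    ]
    if not both_wrong:
        return None
    same_wrong = sum(1 for task in both_wrong if answers_a[task] == answers_b[task])
    return same_wrong / len(both_wrong)

# Illustrative data only.
gold = {"t1": "a", "t2": "b", "t3": "c"}
answers_a = {"t1": "a", "t2": "x", "t3": "y"}
answers_b = {"t1": "z", "t2": "x", "t3": "q"}
print(shared_wrong_rate(answers_a, answers_b, gold))
```

Here both systems miss t2 and t3 and share the wrong answer only on t2, so the rate is 0.5. Exact string matching is a deliberate simplification; real answers would likely need normalization first.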

Overlap Map

Affinity versus solved overlap

Same-provider and same-family controls show normal continuity. Cross-provider points help spot behavior that may deserve a private distillation, contamination, or evaluation-leak review. The strongest review signals are labeled.

Report-Level Signal

Same-provider versus cross-provider max affinity

A cross-provider maximum above same-provider controls is a review candidate, not a conclusion. It means the pair deserves private, row-level review if the research question matters.
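
The comparison can be sketched as a simple filter: take the highest same-provider MCC as the control, then list cross-provider pairs that exceed it. The record layout, provider labels, and scores below are made up for illustration, and using the maximum as the control is one reading of the rule above.

```python
def review_candidates(pairs):
    """Cross-provider pairs whose MCC exceeds the best same-provider control."""
    controls = [p["mcc"] for p in pairs if p["provider_a"] == p["provider_b"]]
    if not controls:
        # Without a same-provider control there is nothing to compare against.
        return []
    control_max = max(controls)
    flagged = [
        p for p in pairs
        if p["provider_a"] != p["provider_b"] and p["mcc"] > control_max
    ]
    return sorted(flagged, key=lambda p: p["mcc"], reverse=True)

# Illustrative scores only, not results from this audit.
pairs = [
    {"pair": "P1-small vs P1-large", "provider_a": "P1", "provider_b": "P1", "mcc": 0.62},
    {"pair": "P1-large vs P2-large", "provider_a": "P1", "provider_b": "P2", "mcc": 0.71},
    {"pair": "P1-small vs P3-base", "provider_a": "P1", "provider_b": "P3", "mcc": 0.40},
]
for candidate in review_candidates(pairs):
    print(candidate["pair"], candidate["mcc"])
```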

Provider Groups

Provider and family summaries

Group rows summarize same-benchmark pairs only. Use the max row to navigate, and the mean row to avoid over-reading a single pair.
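
A group summary of this kind could be computed as below: the maximum keeps a pointer to the specific pair to open next, and the mean gives the group-wide picture. The row layout and the group names are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

def group_summary(rows):
    """Summarize (group, pair_label, mcc) rows into max and mean per group."""
    by_group = defaultdict(list)
    for group, pair_label, score in rows:
        by_group[group].append((score, pair_label))
    summary = {}
    for group, scored in by_group.items():
        top_score, top_pair = max(scored)  # highest MCC and the pair it came from
        summary[group] = {
            "max": top_score,
            "max_pair": top_pair,
            "mean": mean(score for score, _ in scored),
            "pairs": len(scored),
        }
    return summary

# Illustrative values only.
rows = [
    ("same-provider", "pair-1", 0.5),
    ("same-provider", "pair-2", 0.25),
    ("cross-provider", "pair-3", 0.75),
]
print(group_summary(rows))
```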

Exploratory Only

Cross-benchmark inventory

Cross-benchmark pairs are not scored with the same MCC leaderboard logic because the task sets differ. They are kept only as navigation for private follow-up.