
CyBT-CTF: GLM 5.2

GLM 5.2 sits in the top open-weight tier in the current CyBT-CTF view, tying the frontier proprietary references opencode / Opus 4.8 and Codex / GPT-5.5 at 28/59. Qwen3.7 Plus, a proprietary Alibaba model, now shares that score with lower aggregate time and cost. GLM 5.2 remains unusually important because it shows a high agreement signal with GPT-5.5 on the blinded test: MCC 0.797 and Cohen's kappa 0.795. That is serious enough to warrant a distillation/copying review, but it is not proof by itself: the number of shared wrong answers with exactly matching text is 0.

Frontier comparison

GLM 5.2 and Qwen3.7 Plus share the 28/59 tier

On the blinded test, both opencode / Fireworks / GLM 5.2 and opencode / Fireworks / Qwen3.7 Plus solve 28/59. GLM remains the agreement-audit focus because it shows a high signal with GPT-5.5: MCC 0.797 and Cohen's kappa 0.795.


Executive Read

Where GLM fits

Use CyBT-CTF as the main read because it is blinded. BOTSv3 is included as a public benchmark cross-check, not the headline result.

Score

GLM versus Opus, Sonnet, Codex, and open-weight rows

Bars are sorted by solve rate. GLM rows are highlighted; score ties are called out in the table and ordered by solved time, then cost. Projected or raw private artifacts are not included.

Cost and Time

How much does the score cost?

The scatter shows solve rate versus total time spent on solved tasks. Marker size follows total model cost when available.

Agreement Audit

Does GLM look unusually close to frontier closed models?

This is a compact extract from the dedicated agreement audit. High cross-company agreement can be a distillation or copying review signal, especially when wrong answers overlap, but it is not proof by itself.
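Wrong-answer overlap can be counted in more than one way. The sketch below counts shared wrong answers both verbatim and after a light normalization. The `canonical` rule (casefold, collapse punctuation and whitespace), the function names, and the toy answers are all assumptions for illustration; the audit's actual canonicalization is not described here.

```python
import re

def canonical(answer: str) -> str:
    # Hypothetical normalization: lowercase, then collapse any run of
    # non-alphanumeric characters into a single space.
    return re.sub(r"[^a-z0-9]+", " ", answer.casefold()).strip()

def same_wrong_counts(answers_a, answers_b, gold):
    """Return (canonical_matches, exact_matches) over tasks both systems got wrong."""
    canon_matches = exact_matches = 0
    for task, truth in gold.items():
        x, y = answers_a.get(task), answers_b.get(task)
        if x is None or y is None:
            continue  # one system gave no answer for this task
        if canonical(x) == canonical(truth) or canonical(y) == canonical(truth):
            continue  # at least one system was right, so not a shared wrong call
        if x == y:
            exact_matches += 1
        if canonical(x) == canonical(y):
            canon_matches += 1
    return canon_matches, exact_matches

# Toy data, not taken from the benchmark.
gold = {"t1": "10.0.0.5", "t2": "svchost.exe", "t3": "admin"}
answers_a = {"t1": "10.0.0.9", "t2": "Powershell.exe", "t3": "admin"}
answers_b = {"t1": "10.0.0.9 ", "t2": "powershell.exe", "t3": "root"}

counts = same_wrong_counts(answers_a, answers_b, gold)
print(counts)
```

In this toy case the two systems share two wrong answers after normalization but none verbatim, which is the same shape as a "canonical but not exact text" result.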

CyBT-CTF same-benchmark triangle

GLM 5.2 vs GPT-5.5 is the standout edge

Figure: pairwise agreement triangle among Opus 4.8 (opencode), GLM 5.2 (Fireworks), and GPT-5.5 (Codex). Edge labels: MCC 0.763 / kappa 0.763; MCC 0.797 / kappa 0.795 (GLM 5.2 vs GPT-5.5); MCC 0.630 / kappa 0.626.

Strongest review signal

Z.ai / GLM 5.2 vs OpenAI / GPT-5.5

Cohen's kappa: 0.795
MCC / phi: 0.797
Solved overlap: 80.0%
Observed / expected: 1.95x
Same wrong calls: 10 canonical / 0 exact text

A kappa near 0.80 is unusually high for systems from different companies on the same blinded task set. Read it as a serious distillation/copying review signal, not a public accusation or proof of copying or distillation.
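For readers who want to see how these statistics relate, the sketch below computes Cohen's kappa, MCC (phi), a solved-overlap ratio, and an observed/expected ratio from a 2x2 solved/unsolved table. Two things are assumptions: that "solved overlap" is the Jaccard ratio of solved sets, and the example counts themselves, which are hypothetical and were chosen only because they reproduce the reported figures. The real confusion table is not published here, and other tables may fit equally well.

```python
from math import sqrt

def agreement_stats(both, only_a, only_b, neither):
    """Agreement statistics for two systems over the same task set.

    both/neither: tasks solved by both / by neither system.
    only_a/only_b: tasks solved by exactly one system.
    Assumes every marginal is non-zero (no division-by-zero handling).
    """
    n = both + only_a + only_b + neither
    a_solved, b_solved = both + only_a, both + only_b
    a_failed, b_failed = n - a_solved, n - b_solved

    p_obs = (both + neither) / n                                   # observed agreement
    p_exp = (a_solved * b_solved + a_failed * b_failed) / n ** 2   # chance agreement
    kappa = (p_obs - p_exp) / (1 - p_exp)

    mcc = (both * neither - only_a * only_b) / sqrt(
        a_solved * a_failed * b_solved * b_failed
    )

    jaccard = both / (both + only_a + only_b)   # assumed meaning of "solved overlap"
    lift = both / (a_solved * b_solved / n)     # observed / expected joint solves
    return {"kappa": kappa, "mcc": mcc, "overlap": jaccard, "obs_over_exp": lift}

# Hypothetical counts over 59 tasks, not taken from the audit data.
stats = agreement_stats(both=24, only_a=4, only_b=2, neither=29)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With these counts the sketch gives kappa of about 0.795, MCC of about 0.797, an overlap of 80.0%, and an observed/expected ratio of about 1.95. The implied solve totals (28 and 26) come from the hypothetical table and need not match the leaderboard rows.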


Leaderboard

Rows behind the charts

Only sanitized aggregate rows are shown. Tied ranks share the same solve count; rows inside a tied score band are ordered by solved time, then total cost. No task text, raw traces, prompts, answers, SPL, or private benchmark markers are included.
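The ordering rule above can be sketched as a single sort key plus a shared rank per solve count. The field names (`solved`, `solved_time_s`, `cost`), the placeholder rows, and the choice to sort rows with no cost last are assumptions for illustration, not the leaderboard's actual schema or data.

```python
from math import inf

def order_rows(rows):
    """Sort by solves (descending), then solved time, then cost; tied solve counts share a rank."""
    def key(row):
        # Rows without a reported cost sort after rows with one (an assumption).
        cost = row["cost"] if row["cost"] is not None else inf
        return (-row["solved"], row["solved_time_s"], cost)

    ranked = []
    prev_solved, rank = None, 0
    for position, row in enumerate(sorted(rows, key=key), start=1):
        if row["solved"] != prev_solved:
            rank = position          # a new score band starts at this position
            prev_solved = row["solved"]
        ranked.append({**row, "rank": rank})
    return ranked

# Placeholder rows, not real leaderboard entries.
rows = [
    {"name": "system_a", "solved": 28, "solved_time_s": 5200.0, "cost": 41.0},
    {"name": "system_b", "solved": 28, "solved_time_s": 3100.0, "cost": None},
    {"name": "system_c", "solved": 28, "solved_time_s": 3100.0, "cost": 12.5},
    {"name": "system_d", "solved": 19, "solved_time_s": 900.0, "cost": 3.0},
]

ranked = order_rows(rows)
for row in ranked:
    print(row["rank"], row["name"], row["solved"])
```

Here the three 28-solve rows all receive rank 1 and the next row receives rank 4, so the tie is visible in the rank while the within-band order still reflects time and cost.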