CyBT-CTF: GLM 5.2

Executive Read

Where GLM fits

Use CyBT-CTF as the main read because it is blinded. BOTSv3 is included as a public benchmark cross-check, not the headline result.

Score

GLM versus Opus, Sonnet, Codex, and open-weight rows

Bars are sorted by solve rate. GLM rows are highlighted; score ties are called out in the table and ordered by solved time, then cost. Projected or raw private artifacts are not included.

Cost and Time

How much does the score cost?

The scatter shows solve rate versus total time spent on solved tasks. Marker size follows total model cost when available.

Agreement Audit

Does GLM look unusually close to frontier closed models?

This is a compact extract from the dedicated agreement audit. High agreement between models from different companies can signal that a distillation or copying review is warranted, especially when wrong answers overlap, but it is not proof by itself.

CyBT-CTF same-benchmark triangle

GLM 5.2 vs GPT-5.5 is the standout edge

Strongest review signal

Z.ai / GLM 5.2 vs OpenAI / GPT-5.5

Cohen's kappa: 0.795

MCC / phi: 0.797
Solved overlap: 80.0%
Observed / expected: 1.95x
Same wrong calls: 10 canonical / 0 exact text

A kappa near 0.80 is unusually high for systems from different companies on the same blinded task set. Treat it as a serious signal that warrants a distillation or copying review, not as an accusation or as proof of either.
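For readers who want to reproduce this kind of statistic on their own data, here is a minimal sketch of Cohen's kappa and MCC / phi for two models scored on the same tasks. It uses the standard textbook definitions on per-task solved / not-solved outcomes. The page does not state how its own values were computed, so the function name `agreement_stats` and the boolean-list input format are assumptions for illustration. The "Solved overlap" and "Observed / expected" figures are not covered because their definitions are not given here.

```python
from math import sqrt


def agreement_stats(a, b):
    """Cohen's kappa and phi (MCC) for two binary outcome vectors.

    a, b: equal-length sequences of booleans, True if the model solved task i.
    Hypothetical helper: input format and name are assumptions, not the
    audit's actual pipeline.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and the same length")

    n = len(a)
    # 2x2 contingency table of solved / not-solved outcomes.
    n11 = sum(1 for x, y in zip(a, b) if x and y)          # both solved
    n10 = sum(1 for x, y in zip(a, b) if x and not y)      # only a solved
    n01 = sum(1 for x, y in zip(a, b) if not x and y)      # only b solved
    n00 = n - n11 - n10 - n01                              # neither solved

    # Observed agreement and agreement expected by chance from the marginals.
    p_observed = (n11 + n00) / n
    p_a = (n11 + n10) / n
    p_b = (n11 + n01) / n
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)

    # Kappa is undefined when chance agreement is already 1.
    kappa = (
        (p_observed - p_expected) / (1 - p_expected)
        if p_expected < 1
        else float("nan")
    )

    # Phi / MCC is undefined when any marginal total is zero.
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    phi = (n11 * n00 - n10 * n01) / denom if denom else float("nan")

    return {"kappa": kappa, "phi": phi}
```

When both models solve a similar number of tasks, kappa and phi land close together, which is consistent with the two values reported above being nearly equal.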


Leaderboard

Rows behind the charts

Only sanitized aggregate rows are shown. Tied ranks share the same solve count; rows inside a tied score band are ordered by solved time, then total cost. No task text, raw traces, prompts, answers, SPL, or private benchmark markers are included.
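The ordering rule above can be sketched as a sort key plus a rank assignment. This is an illustration only: the field names (`model`, `solved`, `solved_time_s`, `total_cost`) are hypothetical, rows with no cost are assumed to sort last within a tie, and tied rows are assumed to share the rank of the first row in their band, with the next band continuing from its list position. None of these details are confirmed by the page.

```python
def rank_rows(rows):
    """Order leaderboard rows and assign shared ranks to tied solve counts.

    rows: list of dicts with hypothetical keys 'model', 'solved',
    'solved_time_s', and optionally 'total_cost' (None when unavailable).
    """

    def sort_key(row):
        cost = row.get("total_cost")
        return (
            -row["solved"],                       # more solves first
            row["solved_time_s"],                 # then less time on solved tasks
            cost is None,                         # rows without a cost go last (assumption)
            cost if cost is not None else 0.0,    # then lower total cost
        )

    ordered = sorted(rows, key=sort_key)

    ranked = []
    for position, row in enumerate(ordered):
        if position > 0 and row["solved"] == ordered[position - 1]["solved"]:
            # Same solve count as the previous row: share its rank.
            rank = ranked[-1]["rank"]
        else:
            # New score band: rank is the 1-based position in the list.
            rank = position + 1
        ranked.append({**row, "rank": rank})
    return ranked
```

Sorting on a single tuple key keeps the tie-breaking order explicit and makes it easy to change if the leaderboard's actual rule differs.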