
UnitBench Scores

Model Performance

| UnitBench | Max | ChatGPT 5.5 | Claude Opus 4.7 | Grok 4.20 | ChatGPT-4o | Gemini 3 Flash | Kimi K2.6 | Deepseek v4 Flash | Qwen 80b a3b |
|---|---|---|---|---|---|---|---|---|---|
| Term coverage | 25 | 18.41 | 14.32 | 14.32 | 4.10 | 2.14 | 2.38 | 2.14 | 1.67 |
| Date band accuracy | 20 | 9.15 | 4.43 | 7.47 | 0.00 | 1.30 | 5.70 | 5.70 | 5.50 |
| Etymology accuracy | 25 | 9.38 | 6.88 | 6.75 | 1.87 | 1.25 | 3.88 | 4.29 | 7.86 |
| Unearned Confidence | 15 | 11.13 | 10.01 | 10.66 | 5.90 | 13.80 | 15.00 | 14.25 | 14.06 |
| Humor | 15 | 10.96 | 9.29 | 10.46 | 5.91 | 13.88 | 15.00 | 13.88 | 13.59 |
| Overall Score | 100 | 59.03 | 44.93 | 49.66 | 17.78 | 32.37 | 41.96 | 40.26 | 42.68 |
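
Each model's Overall Score is the sum of its five category scores; for ChatGPT 5.5, for example, 18.41 + 9.15 + 9.38 + 11.13 + 10.96 = 59.03.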

Model Run Stats

| Model | Cost | Provider | Harness |
|---|---|---|---|
| ChatGPT 5.5 | | ChatGPT Pro | ChatGPT macOS |
| Claude Opus 4.7 | | Anthropic | Claude macOS |
| Grok 4.20 | $0.16 | xAI | OpenCode |
| ChatGPT-4o | $0.00 | GitHub Copilot | VSCode |
| Gemini 3 Flash | $0.04 | OpenRouter | OpenCode |
| Kimi K2.6 | $0.12 | OpenRouter | OpenCode |
| Deepseek v4 Flash | $0.05 | OpenRouter | OpenCode |
| Qwen 80b a3b | $0.22 | OpenRouter | Pi |

UnitBench Categories

| Display label | Canonical key | Max | Meaning |
|---|---|---|---|
| Term coverage | `coverage` | 25 | How many gold benchmark terms the submission matched. |
| Date band accuracy | `date_band_accuracy` | 20 | How accurate the `first_attested` dates are, with credit for reasonable approximate bands. |
| Etymology accuracy | `etymology_accuracy` | 25 | How correct and plausible the etymology explanations are. |
| Unearned Confidence | `calibration` | 15 | How well the answer avoids fake precision and earns the confidence level it projects. |
| Humor | `humor` | 15 | How funny the `why_funny` explanations are under the benchmark rubric. |
| Overall Score | `total` | 100 | Sum of all category scores. |