
UnitBench Scores

Model Performance

| UnitBench | Max | ChatGPT 5.5 | Claude Opus 4.7 | Grok 4.20 | ChatGPT-4o | Gemini 3 Flash | Kimi K2.6 | Deepseek v4 Flash | Qwen 80b a3b |
|---|---|---|---|---|---|---|---|---|---|
| Term coverage | 25 | 18.41 | 14.32 | 14.32 | 4.10 | 2.14 | 2.38 | 2.14 | 1.67 |
| Date band accuracy | 20 | 9.15 | 4.43 | 7.47 | 0.00 | 1.30 | 5.70 | 5.70 | 5.50 |
| Etymology accuracy | 25 | 9.38 | 6.88 | 6.75 | 1.87 | 1.25 | 3.88 | 4.29 | 7.86 |
| Unearned Confidence | 15 | 11.13 | 10.01 | 10.66 | 5.90 | 13.80 | 15.00 | 14.25 | 14.06 |
| Humor | 15 | 10.96 | 9.29 | 10.46 | 5.91 | 13.88 | 15.00 | 13.88 | 13.59 |
| Overall Score | 100 | 59.03 | 44.93 | 49.66 | 17.78 | 32.37 | 41.96 | 40.26 | 42.68 |
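
Each model's Overall Score is the sum of its five category scores; for ChatGPT 5.5, for example, 18.41 + 9.15 + 9.38 + 11.13 + 10.96 = 59.03.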

Model Run Stats

| Model | Cost | Provider | Harness |
|---|---|---|---|
| ChatGPT 5.5 | | ChatGPT Pro | ChatGPT macOS |
| Claude Opus 4.7 | | Anthropic | Claude macOS |
| Grok 4.20 | $0.16 | xAI | OpenCode |
| ChatGPT-4o | $0.00 | GitHub Copilot | VSCode |
| Gemini 3 Flash | $0.04 | OpenRouter | OpenCode |
| Kimi K2.6 | $0.12 | OpenRouter | OpenCode |
| Deepseek v4 Flash | $0.05 | OpenRouter | OpenCode |
| Qwen 80b a3b | $0.22 | OpenRouter | Pi |

UnitBench Categories

| Display label | Canonical key | Max | Meaning |
|---|---|---|---|
| Term coverage | `coverage` | 25 | How many gold benchmark terms the submission matched. |
| Date band accuracy | `date_band_accuracy` | 20 | How accurate the `first_attested` dates are, with credit for reasonable approximate bands. |
| Etymology accuracy | `etymology_accuracy` | 25 | How correct and plausible the etymology explanations are. |
| Unearned Confidence | `calibration` | 15 | How well the answer avoids fake precision and earns the confidence level it projects. |
| Humor | `humor` | 15 | How funny the `why_funny` explanations are under the benchmark rubric. |
| Overall Score | `total` | 100 | Sum of all category scores. |