Birch HTML Benchmark — Completion, Render Checks, and Efficiency

Headline table: score, tokens, wall time

A compact reader-first view: quality score, total generation tokens, total generation wall time, and failure count. Click column headers to sort. “View outputs” jumps to the screenshot/artifact viewer for that model.

#	model	score	tokens	wall time	checker runs	failures
1	opus47 clean-final	100.0	2,041,367	873s	10	0
2	kimi27 clean-final	100.0	5,930,810	833s	10	0
3	gemini35flash clean-final	100.0	8,127,743	774s	13	0
4	sonnet46 clean-final	100.0	5,035,097	2,304s	11	0
5	glm52 clean-final	100.0	4,386,789	2,634s	11	0
6	codexresponses.gpt-5.4-mini clean-final	99.8	2,887,707	1,156s	13	0
7	codexresponses.gpt-5.5 clean-final	98.2	1,225,741	749s	11	1
8	glm51 clean-final	96.8	1,610,470	767s	6	4
9	gpt-5.3-codex clean-final	94.4	1,288,812	373s	5	7
10	deepseek clean-final	84.0	2,612,700	1,242s	8	15
11	kimi clean-final	83.8	2,670,489	1,765s	7	13
12	haiku45 clean-final	83.0	949,404	370s	2	27
13	codexspark clean-final	72.2	7,185,820	363s	5	19
14	grok-4.3 clean-final	58.0	599,552	284s	0	27
15	minimax27 clean-final	58.0	1,326,533	1,040s	3	30

Render findings vs generation cost

Raw audit view. Each point is a model. The default x-axis is total tokens across all five prompts and the y-axis is total render findings: failures plus warnings. The zoom view hides extreme cost outliers and rescales to the visible models. Warnings are shown as findings, not errors. Lower-left is better.
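
As a sanity check on the y-axis definition, the findings total can be recomputed from the four per-model counts. The sketch below is a minimal Python illustration using two rows copied from the model summary table in this report. The dictionary keys and function names are illustrative assumptions, not the field names used in the generated data files.

```python
# Per-model counts copied from the model summary table in this report.
# The key names are illustrative, not the schema of the JSON/CSV files.
rows = {
    "haiku45": {"det_fail": 26, "det_warn": 12, "vlm_fail": 1, "vlm_warn": 5},
    "glm51": {"det_fail": 2, "det_warn": 0, "vlm_fail": 2, "vlm_warn": 4},
}


def total_findings(counts):
    """Failures plus warnings across deterministic and VLM checks."""
    return sum(counts.values())


def total_failures(counts):
    """Deterministic plus VLM failures, with warnings excluded."""
    return counts["det_fail"] + counts["vlm_fail"]


for model, counts in rows.items():
    print(model, total_findings(counts), total_failures(counts))
# haiku45 44 27
# glm51 8 4
```

For these two rows the totals agree with the findings column in the model summary table and the failures column in the headline table.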


Efficiency comparison

Raw generation cost inputs by model. Click column headers to sort.

VLM finding examples

A few screenshot-backed visual smoke findings. Boxes are approximate VLM inspection overlays.

Recommended publication comparisons

A short set of pairings for readers who want to inspect the screenshot evidence behind the headline story.

Headline contenders

Two strong, efficient runs with small score differences.

Perfect score, higher cost

Gemini Flash scores 100 but uses 8,127,743 tokens, roughly four times the 2,041,367 of the top-ranked opus47 run.

Fair partial credit

Haiku has weak capped tasks, but this implementation-plan output earned real credit.

Birch CSS cap sanity check

Grok’s unstyled artifact makes the low capped score easy to audit.

Completion/render matrix

Cells summarize render/check status for each model × prompt. Hover a cell for counts, the 20-point task score, cap reason if any, and generation metrics. A clean cell means no deterministic/VLM fail or warn findings were recorded; it does not mean the analysis content was semantically judged correct.

Top finding types

Most common deterministic/VLM finding names in the current evidence bundle.

Side-by-side screenshot comparison

Pick two model records, choose a viewport, then use the vertical prompt tabs to compare screenshots. The default is desktop deep so below-the-fold structure is visible first.


Artifact-level output table

Use this table to jump from aggregate findings to a generated artifact and see each task's 20-point score and cap reason.

Model summary table

Default order is quality-first: five-task quality score, completed artifacts, deterministic failures, VLM failures, warnings, then efficiency. The score is a transparent sorting aid for Birch rendering compliance, not an overall model-quality grade.

#	model	artifacts	det fail	det warn	VLM fail	VLM warn	findings	seconds	tokens	cached/hit	tools	checker runs	quality score
1	opus47 clean-final publish	5/5	0	0	0	0	0	873s	2,041,367	1,752,434	83	10	100.0
2	kimi27 clean-final publish	5/5	0	0	0	0	0	833s	5,930,810	4,667,885	104	10	100.0
3	gemini35flash clean-final publish	5/5	0	0	0	0	0	774s	8,127,743	6,936,722	142	13	100.0
4	sonnet46 clean-final publish	5/5	0	0	0	0	0	2,304s	5,035,097	4,587,463	108	11	100.0
5	glm52 clean-final publish	5/5	0	0	0	0	0	2,634s	4,386,789	3,907,136	126	11	100.0
6	codexresponses.gpt-5.4-mini clean-final publish	5/5	0	2	0	0	2	1,156s	2,887,707	2,607,104	113	13	99.8
7	codexresponses.gpt-5.5 clean-final publish	5/5	0	0	1	1	2	749s	1,225,741	1,018,880	83	11	98.2
8	glm51 clean-final publish	5/5	2	0	2	4	8	767s	1,610,470	1,221,440	74	6	96.8
9	gpt-5.3-codex clean-final publish	5/5	6	2	1	1	10	373s	1,288,812	1,036,288	70	5	94.4
10	deepseek clean-final publish	5/5	8	1	7	0	16	1,242s	2,612,700	2,637,696	97	8	84.0
11	kimi clean-final publish	5/5	6	0	7	2	15	1,765s	2,670,489	2,332,928	99	7	83.8
12	haiku45 clean-final publish	5/5	26	12	1	5	44	370s	949,404	580,545	49	2	83.0
13	codexspark clean-final publish	5/5	14	6	5	0	25	363s	7,185,820	6,181,120	174	5	72.2
14	grok-4.3 clean-final publish	5/5	16	0	11	1	28	284s	599,552	336,000	44	0	58.0
15	minimax27 clean-final publish	5/5	26	4	4	4	38	1,040s	1,326,533	841,088	55	3	58.0
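
The default ordering described above the table can be sketched as a tuple sort key. This is a minimal Python illustration, not the report generator's code: the `ModelRow` type and its field names are hypothetical, warnings are assumed to be deterministic plus VLM warnings, and total tokens stand in for the final efficiency tiebreak, which the report does not define.

```python
from typing import NamedTuple


class ModelRow(NamedTuple):
    # Field names are illustrative; values below are copied from the table.
    model: str
    quality: float
    artifacts: int
    det_fail: int
    vlm_fail: int
    warnings: int  # assumed: det warn + VLM warn
    tokens: int    # assumed stand-in for "efficiency"


def default_order_key(row):
    # Higher quality and more completed artifacts sort first, so negate them.
    # Fewer failures, warnings, and tokens sort first.
    return (-row.quality, -row.artifacts, row.det_fail, row.vlm_fail,
            row.warnings, row.tokens)


rows = [
    ModelRow("minimax27", 58.0, 5, 26, 4, 8, 1_326_533),
    ModelRow("grok-4.3", 58.0, 5, 16, 11, 1, 599_552),
    ModelRow("glm51", 96.8, 5, 2, 2, 4, 1_610_470),
]

ranked = [r.model for r in sorted(rows, key=default_order_key)]
print(ranked)  # ['glm51', 'grok-4.3', 'minimax27']
```

The two 58.0 rows tie on quality score and completed artifacts, so the deterministic failure count (16 vs 26) decides their order, as in the table. The token tiebreak is never reached in this example, and the real efficiency tiebreak may differ.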

Model checker execution count

Counts below come from each model's own generation traces. This is the simple count of model `execute` tool calls that invoked the Birch deterministic checker. Harness-level checker passes are not counted here.

Generated data files

The microsite is generated programmatically from consolidated JSON/CSV tables. Rebuild with:

uv run --with matplotlib python scripts/build_publication_analysis.py --suite publish
python3 scripts/generate_responsive_report.py

file	purpose
`analysis/data/model-summary.json`	model-level completion, render findings, token, time, and tool metrics
`analysis/data/artifact-summary.json`	per-model × per-prompt metrics, including prompt-level token/cache breakdown
`analysis/data/finding-summary.json`	deterministic and VLM finding rows
`analysis/tables/*.csv`	CSV equivalents for audit, README tables, or external analysis
`analysis/report.html`	this static microsite

Derived index caveat

The report includes a consolidated quality_score: a 100-point sum over five equal 20-point task scores. It is intentionally formula-based and limited to completion/render findings and Birch rendering contract compliance:

The score is a 100-point sum over five equal 20-point tasks. Each task score is 20 - 1.2*deterministic_failure_units - 1.6*vlm_failure_units - 0.2*deterministic_warning_units - 0.2*vlm_warning_units. Missing artifacts score 0/20. Artifacts missing valid Birch CSS are capped at 7/20, or at 4/20 when the VLM also reports vision_unstyled_render; artifacts missing .page are capped at 10/20. A unit is a distinct (eval, finding_name) pair, so repeated viewport sightings of the same issue are charged only once.
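
The formula above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the report's implementation: the function and parameter names are hypothetical, the `viewport` key in the example is invented, the lowest cap is assumed to win when several apply, and no lower bound is applied because the text states none.

```python
def count_units(findings):
    """Count distinct (eval, finding_name) pairs, ignoring repeat sightings."""
    return len({(f["eval"], f["finding_name"]) for f in findings})


def task_score(det_fail_units=0, vlm_fail_units=0, det_warn_units=0,
               vlm_warn_units=0, artifact_present=True, valid_birch_css=True,
               vision_unstyled_render=False, has_page=True):
    """Score one task out of 20 using the stated penalties and caps."""
    if not artifact_present:
        return 0.0
    score = (20.0
             - 1.2 * det_fail_units
             - 1.6 * vlm_fail_units
             - 0.2 * det_warn_units
             - 0.2 * vlm_warn_units)
    caps = []
    if not valid_birch_css:
        caps.append(4.0 if vision_unstyled_render else 7.0)
    if not has_page:
        caps.append(10.0)
    # Assumption: when several caps apply, the lowest one wins.
    if caps:
        score = min(score, min(caps))
    # The source states no lower bound, so none is applied here.
    return score


def quality_score(task_scores):
    """Sum the five task scores, rounded to one decimal as in the tables."""
    return round(sum(task_scores), 1)


# The same finding seen at two viewports counts as one unit.
sightings = [
    {"eval": "vlm", "finding_name": "vision_unstyled_render", "viewport": "desktop"},
    {"eval": "vlm", "finding_name": "vision_unstyled_render", "viewport": "mobile"},
]
print(count_units(sightings))  # 1

# One task with one VLM failure unit and one VLM warning unit, four clean tasks.
scores = [task_score(vlm_fail_units=1, vlm_warn_units=1)] + [task_score()] * 4
print(quality_score(scores))  # 98.2
```

The 98.2 result is consistent with the codexresponses.gpt-5.5 row, which lists one VLM failure and one VLM warning, assuming each maps to a single unit.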

Missing/fake Birch CSS caps a task because applying the Birch system stylesheet is the core benchmark requirement. A task can receive some credit for artifact presence and partial structure, but a fake or absent Birch stylesheet cannot score as a full Birch render. The combined quality/efficiency index is intentionally not used as the main public ranking in this draft. Raw dimensions remain visible so readers can make their own tradeoffs.

Birch Skill benchmark report

How to read this report