Birch render evidence bundle · Publication Final bundle

Birch Skill benchmark report

This report compares what happened when multiple models received the same Birch HTML generation prompts under the same benchmark harness.

Scope: this report summarizes completion, deterministic render-contract checks, screenshot-based VLM visual smoke findings, and generation efficiency. It does not grade the semantic correctness, analytical insight, factual completeness, or subjective usefulness of the generated reports.

Totals: 13 model records · 65 artifact rows · 37,133,641 total tokens · 11,982s wall time · 142 fail findings · 44 warn findings

How to read this report

The benchmark is best read as evidence, not as a single model-quality ranking. Raw counts are shown next to any derived index. The most important dimensions are: artifact completion, deterministic render-check failures/warnings, VLM screenshot smoke-review failures/warnings, and generation cost in seconds, tokens, and tool calls.

Read more at the GitHub repository: https://github.com/evalstate/birch-html

Generated from the publish suite: 13 model records, 65 artifact rows, 65 generation traces, and 65 VLM traces. 13/13 model records completed all expected artifacts; any partial runs would remain visible in the tables and matrix.

Headline table: score, tokens, wall time

A compact reader-first view: quality score, total generation tokens, total generation wall time, checker runs, and failure count.

| # | model | score | tokens | wall time | checker runs | failures |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | codexresponses.gpt-5.5 (clean-final) | 100.0 | 797,947 | 671s | 9 | 0 |
| 2 | opus47 (clean-final) | 100.0 | 2,041,367 | 873s | 10 | 0 |
| 3 | gemini35flash (clean-final) | 100.0 | 8,127,743 | 774s | 13 | 0 |
| 4 | sonnet46 (clean-final) | 100.0 | 5,035,097 | 2,304s | 11 | 0 |
| 5 | codexresponses.gpt-5.4-mini (clean-final) | 99.8 | 2,887,707 | 1,156s | 13 | 0 |
| 6 | glm51 (clean-final) | 96.8 | 1,610,470 | 767s | 6 | 4 |
| 7 | gpt-5.3-codex (clean-final) | 94.4 | 1,288,812 | 373s | 5 | 7 |
| 8 | deepseek (clean-final) | 84.0 | 2,612,700 | 1,242s | 8 | 15 |
| 9 | kimi (clean-final) | 83.8 | 2,670,489 | 1,765s | 7 | 13 |
| 10 | haiku45 (clean-final) | 83.0 | 949,404 | 370s | 2 | 27 |
| 11 | codexspark (clean-final) | 72.2 | 7,185,820 | 363s | 5 | 19 |
| 12 | grok-4.3 (clean-final) | 58.0 | 599,552 | 284s | 0 | 27 |
| 13 | minimax27 (clean-final) | 58.0 | 1,326,533 | 1,040s | 3 | 30 |

Render findings vs generation cost

Raw audit view. Each point is a model. The default x-axis is total tokens across all five prompts and the y-axis is total render findings: failures plus warnings. The zoom view hides extreme cost outliers and rescales to the visible models. Warnings are shown as findings, not errors. Lower-left is better.
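
The axis arithmetic described above can be sketched in a few lines of Python. This is an illustration only: the `ModelCounts` class and the `scatter_points` helper are hypothetical names, not part of the Birch harness, and the three sample rows are copied from the model summary table in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCounts:
    """Per-model counts (hypothetical container, not a harness type)."""
    model: str
    det_fail: int
    det_warn: int
    vlm_fail: int
    vlm_warn: int
    tokens: int

    @property
    def failures(self) -> int:
        # Deterministic plus VLM failures.
        return self.det_fail + self.vlm_fail

    @property
    def warnings(self) -> int:
        # Deterministic plus VLM warnings.
        return self.det_warn + self.vlm_warn

    @property
    def findings(self) -> int:
        # y-axis value: failures plus warnings.
        return self.failures + self.warnings


# Sample rows taken from the model summary table.
ROWS = [
    ModelCounts("glm51", 2, 0, 2, 4, 1_610_470),
    ModelCounts("haiku45", 26, 12, 1, 5, 949_404),
    ModelCounts("grok-4.3", 16, 0, 11, 1, 599_552),
]


def scatter_points(rows):
    """Return (x, y) = (total tokens, total findings) for each model."""
    return [(r.tokens, r.findings) for r in rows]
```

Calling `scatter_points(ROWS)` yields one point per model; lower-left points combine low token cost with few findings.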


Efficiency comparison

Raw generation cost inputs by model. Click column headers to sort.

VLM finding examples

A few screenshot-backed visual smoke findings. Boxes are approximate VLM inspection overlays.

Recommended publication comparisons

A short set of pairings for readers who want to inspect the screenshot evidence behind the headline story.

Headline contenders

Two strong, efficient runs with small score differences.

Perfect score, higher cost

Gemini Flash scores 100, but used much more token budget.

Fair partial credit

Haiku has weak capped tasks, but this implementation-plan output earned real credit.

Birch CSS cap sanity check

Grok’s unstyled artifact makes the low capped score easy to audit.

Completion/render matrix

Cells summarize render/check status for each model × prompt. Hover a cell for counts, the 20-point task score, cap reason if any, and generation metrics. A clean cell means no deterministic/VLM fail or warn findings were recorded; it does not mean the analysis content was semantically judged correct.

Top finding types

Most common deterministic/VLM finding names in the current evidence bundle.

Side-by-side screenshot comparison

Pick two model records, choose a viewport, then use the vertical prompt tabs to compare screenshots. The default is desktop deep so below-the-fold structure is visible first.

Artifact-level output table

Use this table to jump from aggregate findings to a generated artifact and see each task's 20-point score and cap reason.

Model summary table

Default order is quality-first: five-task quality score, completed artifacts, deterministic failures, VLM failures, warnings, then efficiency. The score is a transparent sorting aid for Birch rendering compliance, not an overall model-quality grade.

| # | model | artifacts | det fail | det warn | VLM fail | VLM warn | findings | seconds | tokens | cached/hit | tools | checker runs | quality score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | codexresponses.gpt-5.5 (clean-final, publish) | 5/5 | 0 | 0 | 0 | 0 | 0 | 671s | 797,947 | 608,768 | 69 | 9 | 100.0 |
| 2 | opus47 (clean-final, publish) | 5/5 | 0 | 0 | 0 | 0 | 0 | 873s | 2,041,367 | 1,752,434 | 83 | 10 | 100.0 |
| 3 | gemini35flash (clean-final, publish) | 5/5 | 0 | 0 | 0 | 0 | 0 | 774s | 8,127,743 | 6,936,722 | 142 | 13 | 100.0 |
| 4 | sonnet46 (clean-final, publish) | 5/5 | 0 | 0 | 0 | 0 | 0 | 2,304s | 5,035,097 | 4,587,463 | 108 | 11 | 100.0 |
| 5 | codexresponses.gpt-5.4-mini (clean-final, publish) | 5/5 | 0 | 2 | 0 | 0 | 2 | 1,156s | 2,887,707 | 2,607,104 | 113 | 13 | 99.8 |
| 6 | glm51 (clean-final, publish) | 5/5 | 2 | 0 | 2 | 4 | 8 | 767s | 1,610,470 | 1,221,440 | 74 | 6 | 96.8 |
| 7 | gpt-5.3-codex (clean-final, publish) | 5/5 | 6 | 2 | 1 | 1 | 10 | 373s | 1,288,812 | 1,036,288 | 70 | 5 | 94.4 |
| 8 | deepseek (clean-final, publish) | 5/5 | 8 | 1 | 7 | 0 | 16 | 1,242s | 2,612,700 | 2,637,696 | 97 | 8 | 84.0 |
| 9 | kimi (clean-final, publish) | 5/5 | 6 | 0 | 7 | 2 | 15 | 1,765s | 2,670,489 | 2,332,928 | 99 | 7 | 83.8 |
| 10 | haiku45 (clean-final, publish) | 5/5 | 26 | 12 | 1 | 5 | 44 | 370s | 949,404 | 580,545 | 49 | 2 | 83.0 |
| 11 | codexspark (clean-final, publish) | 5/5 | 14 | 6 | 5 | 0 | 25 | 363s | 7,185,820 | 6,181,120 | 174 | 5 | 72.2 |
| 12 | grok-4.3 (clean-final, publish) | 5/5 | 16 | 0 | 11 | 1 | 28 | 284s | 599,552 | 336,000 | 44 | 0 | 58.0 |
| 13 | minimax27 (clean-final, publish) | 5/5 | 26 | 4 | 4 | 4 | 38 | 1,040s | 1,326,533 | 841,088 | 55 | 3 | 58.0 |
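
The quality-first default order described for this table can be expressed as a tuple sort key. The sketch below is a guess at the shape of that ordering, not the report generator's actual code: `SummaryRow` and `default_sort_key` are hypothetical names, and because the report does not say which efficiency measure breaks the final tie, total tokens is used here purely as a placeholder and may not reproduce the published order among rows that tie on every other key.

```python
from typing import NamedTuple


class SummaryRow(NamedTuple):
    """One model record (hypothetical shape, not the harness schema)."""
    model: str
    quality_score: float
    artifacts_done: int
    det_fail: int
    vlm_fail: int
    warnings: int   # deterministic plus VLM warnings
    tokens: int     # placeholder efficiency measure (assumption)


def default_sort_key(row: SummaryRow):
    # Higher score and more completed artifacts sort first, so negate them;
    # fewer failures, fewer warnings, and lower cost sort first as-is.
    return (
        -row.quality_score,
        -row.artifacts_done,
        row.det_fail,
        row.vlm_fail,
        row.warnings,
        row.tokens,
    )


# Sample rows taken from the table above, deliberately shuffled.
rows = [
    SummaryRow("minimax27", 58.0, 5, 26, 4, 8, 1_326_533),
    SummaryRow("grok-4.3", 58.0, 5, 16, 11, 1, 599_552),
    SummaryRow("glm51", 96.8, 5, 2, 2, 4, 1_610_470),
]

ordered = [r.model for r in sorted(rows, key=default_sort_key)]
```

Here grok-4.3 and minimax27 tie on score and completed artifacts, so the deterministic failure count (16 versus 26) decides their relative order.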

Model checker execution count

Counts below come from each model's own generation traces. This is the simple count of model `execute` tool calls that invoked the Birch deterministic checker. Harness-level checker passes are not counted here.

Generated data files

The microsite is generated programmatically from consolidated JSON/CSV tables. Rebuild with:

uv run --with matplotlib python scripts/build_publication_analysis.py --suite publish
python3 scripts/generate_responsive_report.py

| file | purpose |
| --- | --- |
| analysis/data/model-summary.json | model-level completion, render findings, token, time, and tool metrics |
| analysis/data/artifact-summary.json | per-model × per-prompt metrics, including prompt-level token/cache breakdown |
| analysis/data/finding-summary.json | deterministic and VLM finding rows |
| analysis/tables/*.csv | CSV equivalents for audit, README tables, or external analysis |
| analysis/report.html | this static microsite |

Derived index caveat

The report includes a consolidated quality_score: a 100-point sum over five equal 20-point task scores. It is intentionally formula-based and limited to completion/render findings and Birch rendering contract compliance:

Each task is scored as 20 - 1.2*deterministic_failure_units - 1.6*vlm_failure_units - 0.2*deterministic_warning_units - 0.2*vlm_warning_units. Missing artifacts score 0/20. Artifacts missing valid Birch CSS are capped at 7/20, or at 4/20 when the VLM also reports vision_unstyled_render; artifacts missing .page are capped at 10/20. A unit is a distinct (eval, finding_name) pair, so repeated viewport sightings of the same issue are charged only once.
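
As a reading aid, the per-task formula and caps can be sketched in Python. This is a minimal sketch under stated assumptions, not the benchmark's implementation: the `Finding` fields, the `"det"`/`"vlm"` and `"fail"`/`"warn"` labels, and the function names are all hypothetical. The report does not say whether a task score is floored at zero, so the sketch does not clamp.

```python
from typing import Iterable, NamedTuple


class Finding(NamedTuple):
    """One finding row (hypothetical fields, not the harness schema)."""
    eval_id: str        # the "eval" half of the (eval, finding_name) unit
    finding_name: str
    source: str         # "det" or "vlm" (assumed labels)
    severity: str       # "fail" or "warn" (assumed labels)
    viewport: str


def count_units(findings: Iterable[Finding], source: str, severity: str) -> int:
    """Count distinct (eval, finding_name) pairs, ignoring viewport repeats."""
    return len({
        (f.eval_id, f.finding_name)
        for f in findings
        if f.source == source and f.severity == severity
    })


def task_score(findings, *, artifact_present=True,
               valid_birch_css=True, has_page=True) -> float:
    """Score one task out of 20 using the weights and caps quoted above."""
    if not artifact_present:
        return 0.0  # missing artifacts score 0/20
    findings = list(findings)
    score = (
        20.0
        - 1.2 * count_units(findings, "det", "fail")
        - 1.6 * count_units(findings, "vlm", "fail")
        - 0.2 * count_units(findings, "det", "warn")
        - 0.2 * count_units(findings, "vlm", "warn")
    )
    if not valid_birch_css:
        unstyled = any(
            f.source == "vlm" and f.finding_name == "vision_unstyled_render"
            for f in findings
        )
        score = min(score, 4.0 if unstyled else 7.0)
    if not has_page:
        score = min(score, 10.0)
    return score


def quality_score(task_scores: Iterable[float]) -> float:
    """Sum the five 20-point task scores into the 100-point index."""
    return sum(task_scores)
```

For example, a deterministic warning seen on two viewports counts as one unit, so that task loses 0.2 points rather than 0.4. Treating the caps as upper bounds applied after the deductions is one reading of "capped at"; the order in which the real scorer applies them is not stated.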

Missing/fake Birch CSS caps a task because applying the Birch system stylesheet is the core benchmark requirement. A task can receive some credit for artifact presence and partial structure, but a fake or absent Birch stylesheet cannot score as a full Birch render. The combined quality/efficiency index is intentionally not used as the main public ranking in this draft. Raw dimensions remain visible so readers can make their own tradeoffs.