Headline table: score, tokens, wall time
A compact reader-first view: quality score, total generation tokens, total generation wall time, and failure count. Click column headers to sort. “View outputs” jumps to the screenshot/artifact viewer for that model.
| # | model | score | tokens | wall time | checker runs | failures | outputs |
|---|---|---|---|---|---|---|---|
| 1 | codexresponses.gpt-5.5 clean-final | 100.0 | 797,947 | 671s | 9 | 0 | |
| 2 | opus47 clean-final | 100.0 | 2,041,367 | 873s | 10 | 0 | |
| 3 | gemini35flash clean-final | 100.0 | 8,127,743 | 774s | 13 | 0 | |
| 4 | sonnet46 clean-final | 100.0 | 5,035,097 | 2,304s | 11 | 0 | |
| 5 | codexresponses.gpt-5.4-mini clean-final | 99.8 | 2,887,707 | 1,156s | 13 | 0 | |
| 6 | glm51 clean-final | 96.8 | 1,610,470 | 767s | 6 | 4 | |
| 7 | gpt-5.3-codex clean-final | 94.4 | 1,288,812 | 373s | 5 | 7 | |
| 8 | deepseek clean-final | 84.0 | 2,612,700 | 1,242s | 8 | 15 | |
| 9 | kimi clean-final | 83.8 | 2,670,489 | 1,765s | 7 | 13 | |
| 10 | haiku45 clean-final | 83.0 | 949,404 | 370s | 2 | 27 | |
| 11 | codexspark clean-final | 72.2 | 7,185,820 | 363s | 5 | 19 | |
| 12 | grok-4.3 clean-final | 58.0 | 599,552 | 284s | 0 | 27 | |
| 13 | minimax27 clean-final | 58.0 | 1,326,533 | 1,040s | 3 | 30 | |
Render findings vs generation cost
Raw audit view. Each point is a model. The default x-axis is total tokens across all five prompts and the y-axis is total render findings: failures plus warnings. The zoom view hides extreme cost outliers and rescales to the visible models. Warnings are shown as findings, not errors. Lower-left is better.
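The y-axis value can be reproduced from the per-model counts in the model summary table. The sketch below is illustrative only: the `RenderCounts` type and both helper names are hypothetical, and treating the headline "failures" column as deterministic failures plus VLM failures is an inference that matches the published rows, not a documented rule.

```python
from typing import NamedTuple


class RenderCounts(NamedTuple):
    """Per-model finding counts (hypothetical container, not the report's schema)."""
    det_fail: int
    det_warn: int
    vlm_fail: int
    vlm_warn: int


def total_findings(counts: RenderCounts) -> int:
    """Scatter-plot y-axis: failures plus warnings from both checkers."""
    return counts.det_fail + counts.det_warn + counts.vlm_fail + counts.vlm_warn


def failures(counts: RenderCounts) -> int:
    """Failures only (assumed to match the headline 'failures' column)."""
    return counts.det_fail + counts.vlm_fail


# Counts copied from the haiku45 row of the model summary table.
haiku45 = RenderCounts(det_fail=26, det_warn=12, vlm_fail=1, vlm_warn=5)
print(total_findings(haiku45), failures(haiku45))  # prints "44 27"
```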
Efficiency comparison
Raw generation cost inputs by model. Click column headers to sort.
VLM finding examples
A few screenshot-backed visual smoke findings. Boxes are approximate VLM inspection overlays.
Recommended publication comparisons
A short set of pairings for readers who want to inspect the screenshot evidence behind the headline story.
Headline contenders
Two strong, efficient runs with small score differences.
Perfect score, higher cost
Gemini Flash scores 100 but uses 8,127,743 tokens, the most of any model in the headline table.
Fair partial credit
Haiku has weak capped tasks, but this implementation-plan output earned real credit.
Birch CSS cap sanity check
Grok’s unstyled artifact makes the low capped score easy to audit.
Completion/render matrix
Cells summarize render/check status for each model × prompt. Hover a cell for counts, the 20-point task score, cap reason if any, and generation metrics. A clean cell means no deterministic/VLM fail or warn findings were recorded; it does not mean the analysis content was semantically judged correct.
Top finding types
Most common deterministic/VLM finding names in the current evidence bundle.
Side-by-side screenshot comparison
Pick two model records, choose a viewport, then use the vertical prompt tabs to compare screenshots. The default is desktop deep so below-the-fold structure is visible first.
Artifact-level output table
Use this table to jump from aggregate findings to a generated artifact and see each task's 20-point score and cap reason.
Model summary table
Default order is quality-first: five-task quality score, completed artifacts, deterministic failures, VLM failures, warnings, then efficiency. The score is a transparent sorting aid for Birch rendering compliance, not an overall model-quality grade.
| # | model | artifacts | det fail | det warn | VLM fail | VLM warn | findings | seconds | tokens | cached/hit | tools | checker runs | quality score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | codexresponses.gpt-5.5 clean-final publish | 5/5 | 0 | 0 | 0 | 0 | 0 | 671s | 797,947 | 608,768 | 69 | 9 | 100.0 |
| 2 | opus47 clean-final publish | 5/5 | 0 | 0 | 0 | 0 | 0 | 873s | 2,041,367 | 1,752,434 | 83 | 10 | 100.0 |
| 3 | gemini35flash clean-final publish | 5/5 | 0 | 0 | 0 | 0 | 0 | 774s | 8,127,743 | 6,936,722 | 142 | 13 | 100.0 |
| 4 | sonnet46 clean-final publish | 5/5 | 0 | 0 | 0 | 0 | 0 | 2,304s | 5,035,097 | 4,587,463 | 108 | 11 | 100.0 |
| 5 | codexresponses.gpt-5.4-mini clean-final publish | 5/5 | 0 | 2 | 0 | 0 | 2 | 1,156s | 2,887,707 | 2,607,104 | 113 | 13 | 99.8 |
| 6 | glm51 clean-final publish | 5/5 | 2 | 0 | 2 | 4 | 8 | 767s | 1,610,470 | 1,221,440 | 74 | 6 | 96.8 |
| 7 | gpt-5.3-codex clean-final publish | 5/5 | 6 | 2 | 1 | 1 | 10 | 373s | 1,288,812 | 1,036,288 | 70 | 5 | 94.4 |
| 8 | deepseek clean-final publish | 5/5 | 8 | 1 | 7 | 0 | 16 | 1,242s | 2,612,700 | 2,637,696 | 97 | 8 | 84.0 |
| 9 | kimi clean-final publish | 5/5 | 6 | 0 | 7 | 2 | 15 | 1,765s | 2,670,489 | 2,332,928 | 99 | 7 | 83.8 |
| 10 | haiku45 clean-final publish | 5/5 | 26 | 12 | 1 | 5 | 44 | 370s | 949,404 | 580,545 | 49 | 2 | 83.0 |
| 11 | codexspark clean-final publish | 5/5 | 14 | 6 | 5 | 0 | 25 | 363s | 7,185,820 | 6,181,120 | 174 | 5 | 72.2 |
| 12 | grok-4.3 clean-final publish | 5/5 | 16 | 0 | 11 | 1 | 28 | 284s | 599,552 | 336,000 | 44 | 0 | 58.0 |
| 13 | minimax27 clean-final publish | 5/5 | 26 | 4 | 4 | 4 | 38 | 1,040s | 1,326,533 | 841,088 | 55 | 3 | 58.0 |
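The quality-first default order described above the table can be expressed as a tuple sort key. This is a minimal sketch rather than the report's actual implementation: `ModelRow`, `sort_key`, and the `efficiency_cost` field are hypothetical names, the sort directions (higher score and more artifacts first, fewer failures and warnings first) are assumptions, and total tokens stand in for the final efficiency tie-breaker, which the report does not define.

```python
from typing import NamedTuple


class ModelRow(NamedTuple):
    """One row of the model summary table (hypothetical shape)."""
    model: str
    quality_score: float
    artifacts: int          # completed artifacts out of five
    det_fail: int
    vlm_fail: int
    warnings: int           # det warn + VLM warn
    efficiency_cost: float  # assumption: lower is better; tokens used as a stand-in


def sort_key(row: ModelRow) -> tuple:
    """Quality first, then completion, failures, warnings, and efficiency."""
    return (
        -row.quality_score,   # higher score ranks first
        -row.artifacts,       # more completed artifacts ranks first
        row.det_fail,         # fewer deterministic failures ranks first
        row.vlm_fail,         # fewer VLM failures ranks first
        row.warnings,         # fewer warnings ranks first
        row.efficiency_cost,  # cheaper run ranks first (assumed definition)
    )


# Values copied from the table; grok-4.3 and minimax27 tie on score and artifacts.
rows = [
    ModelRow("minimax27", 58.0, 5, 26, 4, 8, 1_326_533),
    ModelRow("grok-4.3", 58.0, 5, 16, 11, 1, 599_552),
    ModelRow("haiku45", 83.0, 5, 26, 1, 17, 949_404),
]
ranked = [row.model for row in sorted(rows, key=sort_key)]
print(ranked)  # prints "['haiku45', 'grok-4.3', 'minimax27']"
```

In this sample the tie at 58.0 is broken by deterministic failures (16 vs. 26), which agrees with the published order of rows 12 and 13.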
Model checker execution count
Counts below come from each model's own generation traces. This is the simple count of model `execute` tool calls that invoked the Birch deterministic checker. Harness-level checker passes are not counted here.
Generated data files
The microsite is generated programmatically from consolidated JSON/CSV tables. Rebuild with:
`uv run --with matplotlib python scripts/build_publication_analysis.py --suite publish`
`python3 scripts/generate_responsive_report.py`
| file | purpose |
|---|---|
| analysis/data/model-summary.json | model-level completion, render findings, token, time, and tool metrics |
| analysis/data/artifact-summary.json | per-model × per-prompt metrics, including prompt-level token/cache breakdown |
| analysis/data/finding-summary.json | deterministic and VLM finding rows |
| analysis/tables/*.csv | CSV equivalents for audit, README tables, or external analysis |
| analysis/report.html | this static microsite |
Derived index caveat
The report includes a consolidated `quality_score`: a 100-point sum over five equal 20-point task scores. It is intentionally formula-based and limited to completion/render findings and Birch rendering contract compliance.
Each task scores `20 - 1.2*deterministic_failure_units - 1.6*vlm_failure_units - 0.2*deterministic_warning_units - 0.2*vlm_warning_units`. Missing artifacts score 0/20. Artifacts missing valid Birch CSS are capped at 7/20, or 4/20 when the VLM also reports `vision_unstyled_render`; artifacts missing `.page` are capped at 10/20. Units are distinct `(eval, finding_name)` pairs, so repeated viewport sightings of the same issue are not charged repeatedly.
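The per-task formula and caps can be sketched in a few lines. This is an illustration under stated assumptions, not the report's scoring code: the `Finding` fields, the function names, and the flooring of a task at 0 are all hypothetical (the report states no floor), and a finding's source and severity are taken from its first sighting.

```python
from dataclasses import dataclass

# Penalty per distinct unit, as stated in the report.
WEIGHTS = {
    ("det", "fail"): 1.2,
    ("vlm", "fail"): 1.6,
    ("det", "warn"): 0.2,
    ("vlm", "warn"): 0.2,
}


@dataclass(frozen=True)
class Finding:
    """One finding row (hypothetical field names)."""
    eval_id: str
    name: str
    source: str    # "det" or "vlm"
    severity: str  # "fail" or "warn"
    viewport: str = ""


def task_score(findings, *, artifact_present=True, has_birch_css=True, has_page=True):
    """Score one task out of 20, charging each (eval, finding_name) unit once."""
    if not artifact_present:
        return 0.0
    units = {}
    for f in findings:
        # Repeated viewport sightings collapse into a single unit.
        units.setdefault((f.eval_id, f.name), f)
    penalty = sum(WEIGHTS[(f.source, f.severity)] for f in units.values())
    score = max(0.0, 20.0 - penalty)  # assumption: no negative task scores
    if not has_birch_css:
        unstyled = any(
            f.source == "vlm" and f.name == "vision_unstyled_render" for f in findings
        )
        score = min(score, 4.0 if unstyled else 7.0)
    if not has_page:
        score = min(score, 10.0)
    return round(score, 1)


def quality_score(task_scores):
    """Sum the five 20-point task scores into the 100-point total."""
    return round(sum(task_scores), 1)


# The same deterministic failure seen in two viewports is charged once.
repeated = [
    Finding("task-1", "example_failure", "det", "fail", "desktop"),
    Finding("task-1", "example_failure", "det", "fail", "mobile"),
]
print(task_score(repeated))  # prints "18.8"
```

Deduplicating before weighting is what keeps a single layout bug from being charged once per viewport; the caps are applied after the penalty, so a capped task can never score above its cap regardless of how few findings it has.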
Missing/fake Birch CSS caps a task because applying the Birch system stylesheet is the core benchmark requirement. A task can receive some credit for artifact presence and partial structure, but a fake or absent Birch stylesheet cannot score as a full Birch render. The combined quality/efficiency index is intentionally not used as the main public ranking in this draft. Raw dimensions remain visible so readers can make their own tradeoffs.