Model-by-model sandwich analytics
This benchmark uses a deliberately familiar argument to test alignment under ambiguity. People have been debating for years whether a hot dog is a sandwich, which makes sandwich classification a compact way to measure how closely models track messy, inconsistent human judgment.
The premise is playful, but the readout is serious: which models stayed closest to the crowd, which ones drifted, what each run cost, and which images exposed the largest gaps between model confidence and public intuition.
The primary view on this page is now the percent forecast benchmark. The older binary posterior view is still here, but it lives behind the second tab so the two benchmarks can coexist without being blended into one score.
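As a rough illustration of what a percent-forecast score measures, the sketch below compares a model's "what fraction of people would call this a sandwich" forecasts against the crowd's actual percentages. The scoring rule, the scale factor, and all of the data in the example are assumptions for illustration; the benchmark's actual implementation is not published here.

```python
from statistics import mean

def percent_forecast_score(model_pcts, crowd_pcts):
    """Score percent forecasts against crowd percentages on a 0-100 scale.

    Each list holds one value per image: the percentage of people the
    model expects (or the crowd was observed) to call the item a
    sandwich. A perfect crowd match scores 100, matching the human
    baseline in the table below; large average misses drive the score
    negative.

    Illustrative only: the published benchmark may use a different
    error metric and scaling.
    """
    errors = [abs(m - c) for m, c in zip(model_pcts, crowd_pcts)]
    return 100.0 - 4.0 * mean(errors)  # assumed linear penalty

# Hypothetical three-image example: the crowd is split on the hot dog
# but near-unanimous on the BLT, and the model misjudges both.
crowd_pcts = [55.0, 8.0, 97.0]
model_pcts = [70.0, 15.0, 85.0]
print(f"{percent_forecast_score(model_pcts, crowd_pcts):.1f}")
```

A rule of this shape also explains the negative scores near the bottom of the table: once the average miss is large enough, a linear penalty pushes a model below zero.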
Under development: this benchmark and its published results are provisional, not final.
Percent Forecast Benchmark Ratings
| Rank | Model | Score | Confidence | Official Crowd Match | Total Eval Runs | Tokens | Total Cost |
|---|---|---|---|---|---|---|---|
| Human | Human | 100.0 | 100.0% | Human | 0 | 0 | n/a |
| 🥇 1 | openai/o3 | 72.8 | 90.6% | Pending | 10 | 171,761 | $0.57 |
| 🥈 2 | openai/gpt-5.1 | 72.1 | 90.8% | #10 | 32.3 | 300,160 | $0.57 |
| 🥉 3 | anthropic/claude-opus-4.5 | 71.0 | 89.5% | #26 | 32.4 | 393,861 | $2.56 |
| 4 | openai/gpt-5.4-pro | 65.9 | 88.0% | #12 | 31.2 | 399,519 | $0.14 |
| 5 | openai/gpt-5.1-codex | 64.3 | 90.3% | #4 | 32.2 | 299,162 | $0.57 |
| 6 | openai/gpt-4.1 | 64.2 | 87.7% | #15 | 32.4 | 352,622 | $0.82 |
| 7 | anthropic/claude-opus-4.6 | 63.7 | 87.1% | #28 | 32.1 | 391,068 | $2.60 |
| 8 | openai/gpt-5.1-chat | 63.5 | 89.4% | #6 | 31.9 | 297,799 | $0.58 |
| 9 | x-ai/grok-4 | 62.1 | 88.3% | #19 | 31.8 | 621,741 | $3.40 |
| 10 | openai/gpt-4o | 59.5 | 87.9% | #23 | 32 | 342,939 | $0.97 |
| 11 | openai/o1 | 59.5 | 87.4% | Pending | 10 | 354,187 | $15.10 |
| 12 | qwen/qwen3.5-122b-a10b | 58.0 | 87.6% | #5 | 32 | 678,789 | $0.87 |
| 13 | openai/gpt-4o-2024-11-20 | 57.4 | 87.8% | #22 | 32.3 | 349,155 | $1.00 |
| 14 | anthropic/claude-haiku-4.5 | 56.1 | 86.2% | #41 | 32.4 | 391,593 | $0.50 |
| 15 | anthropic/claude-sonnet-4.6 | 54.5 | 86.0% | #38 | 101 | 1,767,562 | $6.99 |
| 16 | google/gemini-3.1-pro-preview | 52.0 | 85.0% | Pending | 11 | 362,486 | $2.03 |
| 17 | openrouter/healer-alpha | 51.4 | 88.3% | #2 | 32.1 | 1,397,734 | $0.00 |
| 18 | bytedance-seed/seed-2.0-mini | 50.7 | 87.2% | #20 | 32.2 | 744,033 | $0.11 |
| 19 | nvidia/nemotron-nano-12b-v2-vl | 50.6 | 88.7% | #1 | 32.4 | 959,862 | $0.27 |
| 20 | google/gemini-3-flash-preview | 46.2 | 84.6% | #47 | 33.9 | 536,822 | $0.33 |
| 21 | moonshotai/kimi-k2.5 | 46.2 | 85.8% | Pending | 28.8 | 601,658 | $0.85 |
| 22 | google/gemma-3-27b-it | 42.0 | 83.7% | #43 | 32.5 | 171,166 | $0.01 |
| 23 | openai/gpt-5.2 | 38.7 | 84.7% | #21 | 31.9 | 379,640 | $0.95 |
| 24 | qwen/qwen3.5-27b | 38.2 | 85.3% | #7 | 30.7 | 660,238 | $0.73 |
| 25 | google/gemini-2.5-pro | 34.4 | 84.0% | #44 | 44.6 | 1,158,206 | $4.60 |
| 26 | perplexity/sonar-pro-search | 32.6 | 87.0% | Pending | 10 | 24,153 | $2.19 |
| 27 | mistralai/pixtral-large-2411 | 32.5 | 85.1% | #25 | 32.3 | 859,930 | $1.79 |
| 28 | bytedance-seed/seed-2.0-lite | 32.0 | 82.6% | #31 | 31.9 | 803,039 | $0.53 |
| 29 | qwen/qwen3.5-plus-02-15 | 29.0 | 82.9% | #24 | 32.2 | 683,216 | $0.68 |
| 30 | google/gemini-3.1-flash-lite-preview | 28.4 | 82.2% | #53 | 34.1 | 532,366 | $0.16 |
| 31 | qwen/qwen3.5-35b-a3b | 28.2 | 83.4% | #27 | 31.1 | 591,082 | $0.42 |
| 32 | qwen/qwen3-vl-30b-a3b-thinking | 27.1 | 84.2% | #14 | 32.2 | 394,611 | $0.19 |
| 33 | qwen/qwen3.5-flash-02-23 | 24.5 | 81.3% | #30 | 32.4 | 630,868 | $0.16 |
| 34 | google/gemini-3-pro-image-preview | 24.3 | 83.7% | #33 | 31.9 | 376,073 | $3.07 |
| 35 | openai/gpt-4.1-mini | 23.0 | 82.1% | #32 | 32 | 479,700 | $0.20 |
| 36 | bytedance-seed/seed-1.6-flash | 20.6 | 82.0% | #17 | 32.4 | 448,861 | $0.05 |
| 37 | x-ai/grok-4.20-beta | 20.3 | 81.5% | #50 | 31.9 | 172,186 | $0.28 |
| 38 | qwen/qwen3.5-397b-a17b | 19.7 | 83.2% | #18 | 31.3 | 538,278 | $1.00 |
| 39 | z-ai/glm-4.6v | 19.5 | 81.3% | #54 | 32.3 | 468,574 | $0.20 |
| 40 | allenai/molmo-2-8b | 18.1 | 83.5% | #3 | 32.5 | 475,028 | $0.09 |
| 41 | openai/gpt-5.4 | 17.7 | 81.0% | #37 | 32.5 | 376,950 | $0.99 |
| 42 | meta-llama/llama-4-scout | 17.0 | 79.3% | #59 | 32.3 | 564,966 | $0.06 |
| 43 | qwen/qwen2.5-vl-72b-instruct | 16.1 | 81.0% | #13 | 32.4 | 399,314 | $0.32 |
| 44 | google/gemini-3.1-flash-image-preview | 15.3 | 80.0% | #35 | 32.3 | 168,091 | $0.14 |
| 45 | google/gemini-2.5-flash | 11.2 | 79.1% | #45 | 34.5 | 788,489 | $0.30 |
| 46 | openai/gpt-4o-mini | 11.1 | 79.2% | #55 | 32.2 | 10,053,142 | $1.50 |
| 47 | mistralai/mistral-large-2512 | 10.8 | 78.6% | #49 | 32.6 | 409,822 | $0.22 |
| 48 | qwen/qwen-2-vl-72b-instruct | 10.4 | 80.6% | #11 | 32 | 390,857 | $0.31 |
| 49 | bytedance-seed/seed-1.6 | 9.4 | 83.2% | #57 | 32.3 | 447,740 | $0.25 |
| 50 | x-ai/grok-4-fast | 8.8 | 84.2% | #29 | 31.9 | 327,819 | $0.09 |
| 51 | meta-llama/llama-4-maverick | 6.1 | 79.1% | #46 | 32.2 | 570,979 | $0.14 |
| 52 | x-ai/grok-4.1-fast | 2.6 | 81.4% | #42 | 32.2 | 398,917 | $0.13 |
| 53 | google/gemma-3-12b-it | 0.1 | 79.8% | #39 | 31.9 | 165,769 | $0.06 |
| 54 | minimax/minimax-01 | -5.4 | 78.3% | #34 | 31.8 | 2,217,245 | $0.45 |
| 55 | qwen/qwen3-vl-235b-a22b-instruct | -5.6 | 78.7% | #58 | 32.2 | 311,617 | $0.10 |
| 56 | qwen/qwen2.5-vl-32b-instruct | -6.9 | 81.5% | #16 | 32.1 | 391,561 | $0.08 |
| 57 | qwen/qwen3-vl-30b-a3b-instruct | -12.5 | 77.7% | #56 | 32.4 | 314,236 | $0.05 |
| 58 | amazon/nova-pro-v1 | -13.5 | 76.6% | #9 | 32.5 | 751,156 | $0.63 |
| 59 | qwen/qwen3.5-9b | -14.7 | 78.8% | #48 | 40.9 | 529,777 | $0.06 |
| 60 | google/gemini-2.5-flash-lite | -28.5 | 74.5% | #60 | 34.5 | 774,167 | $0.08 |
| 61 | amazon/nova-lite-v1 | -36.5 | 74.8% | #36 | 32 | 731,108 | $0.05 |
| 62 | amazon/nova-2-lite-v1 | -57.4 | 71.9% | #52 | 32.5 | 419,711 | $0.17 |
| 63 | baidu/ernie-4.5-vl-28b-a3b | -69.8 | 69.6% | #51 | 32.3 | 398,173 | $0.06 |
| 64 | openai/gpt-4.1-nano | -91.7 | 70.6% | #40 | 32.3 | 710,909 | $0.07 |
| 65 | meta-llama/llama-3.2-11b-vision-instruct | -178.0 | 69.3% | #8 | 30.8 | 1,920,663 | $0.09 |
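The Total Cost and Total Eval Runs columns can also be read together as cost per run, which is often a fairer lens than total spend when run counts differ. A minimal sketch, with a few rows hand-transcribed from the table above:

```python
# Cost per eval run for a few rows transcribed from the table above.
# Run counts are reproduced as published (some values are fractional).
rows = [
    ("openai/o3",                 10.0,  0.57),
    ("openai/o1",                 10.0, 15.10),
    ("anthropic/claude-opus-4.5", 32.4,  2.56),
    ("google/gemma-3-27b-it",     32.5,  0.01),
]

for model, runs, total_cost in rows:
    print(f"{model}: ${total_cost / runs:.3f} per run")
```

By this measure, o3 and claude-opus-4.5 land within a few cents of each other per run, while o1's total cost is dominated by a much higher per-run price rather than a larger run count.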