
Model-by-model sandwich analytics

This benchmark uses a deliberately familiar argument to test alignment under ambiguity. People have been debating for years whether a hot dog is a sandwich, which makes sandwich classification a compact way to measure how closely models track messy, inconsistent human judgment.

The premise is playful, but the readout is serious: which models stayed closest to the crowd, which ones drifted, what each run cost, and which images exposed the largest gaps between model confidence and public intuition.

The primary view on this page is now the percent forecast benchmark. The older binary posterior view is still here, but it lives behind the second tab so the two benchmarks can coexist without being blended into one score.
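The percent forecast benchmark is scored probabilistically (the summary stats list "Brier v2" as the primary method). As a rough illustration of how such a score could behave, here is a minimal sketch assuming a squared-error (Brier-style) loss rescaled against a fixed baseline loss; the function names, the baseline constant, and the normalization are illustrative assumptions, not the benchmark's actual "Brier v2" implementation:

```python
# Hypothetical sketch of a Brier-style score for percent forecasts.
# The baseline constant and the 0-100 rescaling are assumptions for
# illustration, not the site's actual "Brier v2" method.

def brier(forecast: float, outcome: float) -> float:
    """Squared error between a probability forecast and an outcome in [0, 1]."""
    return (forecast - outcome) ** 2

def benchmark_score(forecasts, crowd, baseline=0.25):
    """Mean Brier loss rescaled so 100 = perfect and 0 = the baseline loss.

    A model whose mean loss exceeds the baseline scores negative, which
    would match the negative entries at the bottom of the leaderboard.
    """
    mean_loss = sum(brier(f, c) for f, c in zip(forecasts, crowd)) / len(forecasts)
    return 100.0 * (1.0 - mean_loss / baseline)
```

Under this sketch, a model forecasting 0.9 and 0.1 against crowd outcomes of 1.0 and 0.0 has a mean loss of 0.01 and scores 96.0, while a model that always answers 0.5 scores exactly 0.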

Under development: this benchmark and its published results are provisional, not final.

Percent leader (#1): probabilistic score 72.8; crowd gap 9.4%; crowd match 90.6%; tightest confidence interval at 66.4 points wide (CI 24.6 to 91.0).

Cohort: 60 models officially rated, 5 pending an official rating, 65 models tracked in total; 25 runs required for an official rating; Brier v2 is the primary scoring method.

Highest spend: $15.10 total cost and 354.2K tokens for a benchmark score of 59.5 (official rating pending).
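The summary stats report a crowd gap of 9.4% alongside a crowd match of 90.6%, which suggests the two are complements. A minimal sketch of that relationship, assuming crowd match is 100 minus the mean absolute gap, in percentage points, between a model's percent forecasts and the crowd's percentages (the function name and the exact aggregation are illustrative assumptions):

```python
# Illustrative sketch, not the site's published definition: crowd match
# as the complement of the mean absolute gap (in percentage points)
# between model forecasts and crowd percentages.

def crowd_match(model_pct, crowd_pct):
    """Return (crowd match, crowd gap) for paired percentage forecasts."""
    gaps = [abs(m - c) for m, c in zip(model_pct, crowd_pct)]
    mean_gap = sum(gaps) / len(gaps)
    return 100.0 - mean_gap, mean_gap
```

For example, forecasts of 80% and 20% against crowd values of 90% and 10% give a mean gap of 10.0 points and a crowd match of 90.0%, mirroring how the leader's 9.4% gap pairs with its 90.6% match.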
Ranking snapshot

Percent Forecast Benchmark Ratings

| Rank | Provider | Model | Score | Confidence | Crowd Match | Official | Total Eval Runs | Tokens | Total Cost |
|---|---|---|---|---|---|---|---|---|---|
| – | Human | Human | 100.0 | Reference | 100.0% | Human | 0 | 0 | n/a |
| 🥇 1 | GPT / OpenAI | openai/o3 | 72.8 | High | 90.6% | Pending | 10 | 171,761 | $0.57 |
| 🥈 2 | GPT / OpenAI | openai/gpt-5.1 | 72.1 | High | 90.8% | #10 | 32.3 | 300,160 | $0.57 |
| 🥉 3 | Claude | anthropic/claude-opus-4.5 | 71.0 | High | 89.5% | #26 | 32.4 | 393,861 | $2.56 |
| 4 | GPT / OpenAI | openai/gpt-5.4-pro | 65.9 | High | 88.0% | #12 | 31.2 | 399,519 | $0.14 |
| 5 | GPT / OpenAI | openai/gpt-5.1-codex | 64.3 | High | 90.3% | #4 | 32.2 | 299,162 | $0.57 |
| 6 | GPT / OpenAI | openai/gpt-4.1 | 64.2 | High | 87.7% | #15 | 32.4 | 352,622 | $0.82 |
| 7 | Claude | anthropic/claude-opus-4.6 | 63.7 | High | 87.1% | #28 | 32.1 | 391,068 | $2.60 |
| 8 | GPT / OpenAI | openai/gpt-5.1-chat | 63.5 | High | 89.4% | #6 | 31.9 | 297,799 | $0.58 |
| 9 | Grok / xAI | x-ai/grok-4 | 62.1 | High | 88.3% | #19 | 31.8 | 621,741 | $3.40 |
| 10 | GPT / OpenAI | openai/gpt-4o | 59.5 | High | 87.9% | #23 | 32 | 342,939 | $0.97 |
| 11 | GPT / OpenAI | openai/o1 | 59.5 | High | 87.4% | Pending | 10 | 354,187 | $15.10 |
| 12 | Qwen | qwen/qwen3.5-122b-a10b | 58.0 | High | 87.6% | #5 | 32 | 678,789 | $0.87 |
| 13 | GPT / OpenAI | openai/gpt-4o-2024-11-20 | 57.4 | High | 87.8% | #22 | 32.3 | 349,155 | $1.00 |
| 14 | Claude | anthropic/claude-haiku-4.5 | 56.1 | High | 86.2% | #41 | 32.4 | 391,593 | $0.50 |
| 15 | Claude | anthropic/claude-sonnet-4.6 | 54.5 | High | 86.0% | #38 | 101 | 1,767,562 | $6.99 |
| 16 | Gemini | google/gemini-3.1-pro-preview | 52.0 | High | 85.0% | Pending | 11 | 362,486 | $2.03 |
| 17 | OpenRouter | openrouter/healer-alpha | 51.4 | High | 88.3% | #2 | 32.1 | 1,397,734 | $0.00 |
| 18 | ByteDance Seed | bytedance-seed/seed-2.0-mini | 50.7 | High | 87.2% | #20 | 32.2 | 744,033 | $0.11 |
| 19 |  | nvidia/nemotron-nano-12b-v2-vl | 50.6 | High | 88.7% | #1 | 32.4 | 959,862 | $0.27 |
| 20 | Gemini | google/gemini-3-flash-preview | 46.2 | High | 84.6% | #47 | 33.9 | 536,822 | $0.33 |
| 21 | Kimi / Moonshot | moonshotai/kimi-k2.5 | 46.2 | High | 85.8% | Pending | 28.8 | 601,658 | $0.85 |
| 22 | Google | google/gemma-3-27b-it | 42.0 | High | 83.7% | #43 | 32.5 | 171,166 | $0.01 |
| 23 | GPT / OpenAI | openai/gpt-5.2 | 38.7 | High | 84.7% | #21 | 31.9 | 379,640 | $0.95 |
| 24 | Qwen | qwen/qwen3.5-27b | 38.2 | High | 85.3% | #7 | 30.7 | 660,238 | $0.73 |
| 25 | Gemini | google/gemini-2.5-pro | 34.4 | High | 84.0% | #44 | 44.6 | 1,158,206 | $4.60 |
| 26 | Perplexity / Sonar | perplexity/sonar-pro-search | 32.6 | High | 87.0% | Pending | 10 | 24,153 | $2.19 |
| 27 | Pixtral / Mistral | mistralai/pixtral-large-2411 | 32.5 | High | 85.1% | #25 | 32.3 | 859,930 | $1.79 |
| 28 | ByteDance Seed | bytedance-seed/seed-2.0-lite | 32.0 | Medium | 82.6% | #31 | 31.9 | 803,039 | $0.53 |
| 29 | Qwen | qwen/qwen3.5-plus-02-15 | 29.0 | High | 82.9% | #24 | 32.2 | 683,216 | $0.68 |
| 30 | Gemini | google/gemini-3.1-flash-lite-preview | 28.4 | Medium | 82.2% | #53 | 34.1 | 532,366 | $0.16 |
| 31 | Qwen | qwen/qwen3.5-35b-a3b | 28.2 | High | 83.4% | #27 | 31.1 | 591,082 | $0.42 |
| 32 | Qwen | qwen/qwen3-vl-30b-a3b-thinking | 27.1 | High | 84.2% | #14 | 32.2 | 394,611 | $0.19 |
| 33 | Qwen | qwen/qwen3.5-flash-02-23 | 24.5 | High | 81.3% | #30 | 32.4 | 630,868 | $0.16 |
| 34 | Gemini | google/gemini-3-pro-image-preview | 24.3 | High | 83.7% | #33 | 31.9 | 376,073 | $3.07 |
| 35 | GPT / OpenAI | openai/gpt-4.1-mini | 23.0 | Medium | 82.1% | #32 | 32 | 479,700 | $0.20 |
| 36 | ByteDance Seed | bytedance-seed/seed-1.6-flash | 20.6 | High | 82.0% | #17 | 32.4 | 448,861 | $0.05 |
| 37 | Grok / xAI | x-ai/grok-4.20-beta | 20.3 | Medium | 81.5% | #50 | 31.9 | 172,186 | $0.28 |
| 38 | Qwen | qwen/qwen3.5-397b-a17b | 19.7 | Medium | 83.2% | #18 | 31.3 | 538,278 | $1.00 |
| 39 | Z.AI / GLM | z-ai/glm-4.6v | 19.5 | Medium | 81.3% | #54 | 32.3 | 468,574 | $0.20 |
| 40 | AllenAI / Molmo | allenai/molmo-2-8b | 18.1 | High | 83.5% | #3 | 32.5 | 475,028 | $0.09 |
| 41 | GPT / OpenAI | openai/gpt-5.4 | 17.7 | Medium | 81.0% | #37 | 32.5 | 376,950 | $0.99 |
| 42 | Meta / Llama | meta-llama/llama-4-scout | 17.0 | Medium | 79.3% | #59 | 32.3 | 564,966 | $0.06 |
| 43 | Qwen | qwen/qwen2.5-vl-72b-instruct | 16.1 | Medium | 81.0% | #13 | 32.4 | 399,314 | $0.32 |
| 44 | Gemini | google/gemini-3.1-flash-image-preview | 15.3 | Medium | 80.0% | #35 | 32.3 | 168,091 | $0.14 |
| 45 | Gemini | google/gemini-2.5-flash | 11.2 | Medium | 79.1% | #45 | 34.5 | 788,489 | $0.30 |
| 46 | GPT / OpenAI | openai/gpt-4o-mini | 11.1 | Medium | 79.2% | #55 | 32.2 | 10,053,142 | $1.50 |
| 47 | Mistral | mistralai/mistral-large-2512 | 10.8 | Medium | 78.6% | #49 | 32.6 | 409,822 | $0.22 |
| 48 | Qwen | qwen/qwen-2-vl-72b-instruct | 10.4 | Medium | 80.6% | #11 | 32 | 390,857 | $0.31 |
| 49 | ByteDance Seed | bytedance-seed/seed-1.6 | 9.4 | Low | 83.2% | #57 | 32.3 | 447,740 | $0.25 |
| 50 | Grok / xAI | x-ai/grok-4-fast | 8.8 | Medium | 84.2% | #29 | 31.9 | 327,819 | $0.09 |
| 51 | Meta / Llama | meta-llama/llama-4-maverick | 6.1 | Medium | 79.1% | #46 | 32.2 | 570,979 | $0.14 |
| 52 | Grok / xAI | x-ai/grok-4.1-fast | 2.6 | Medium | 81.4% | #42 | 32.2 | 398,917 | $0.13 |
| 53 | Google | google/gemma-3-12b-it | 0.1 | Medium | 79.8% | #39 | 31.9 | 165,769 | $0.06 |
| 54 | MiniMax | minimax/minimax-01 | -5.4 | Low | 78.3% | #34 | 31.8 | 2,217,245 | $0.45 |
| 55 | Qwen | qwen/qwen3-vl-235b-a22b-instruct | -5.6 | Low | 78.7% | #58 | 32.2 | 311,617 | $0.10 |
| 56 | Qwen | qwen/qwen2.5-vl-32b-instruct | -6.9 | Medium | 81.5% | #16 | 32.1 | 391,561 | $0.08 |
| 57 | Qwen | qwen/qwen3-vl-30b-a3b-instruct | -12.5 | Low | 77.7% | #56 | 32.4 | 314,236 | $0.05 |
| 58 | Amazon | amazon/nova-pro-v1 | -13.5 | Medium | 76.6% | #9 | 32.5 | 751,156 | $0.63 |
| 59 | Qwen | qwen/qwen3.5-9b | -14.7 | Low | 78.8% | #48 | 40.9 | 529,777 | $0.06 |
| 60 | Gemini | google/gemini-2.5-flash-lite | -28.5 | Low | 74.5% | #60 | 34.5 | 774,167 | $0.08 |
| 61 | Amazon | amazon/nova-lite-v1 | -36.5 | Low | 74.8% | #36 | 32 | 731,108 | $0.05 |
| 62 | Amazon | amazon/nova-2-lite-v1 | -57.4 | Low | 71.9% | #52 | 32.5 | 419,711 | $0.17 |
| 63 | Baidu / ERNIE | baidu/ernie-4.5-vl-28b-a3b | -69.8 | Low | 69.6% | #51 | 32.3 | 398,173 | $0.06 |
| 64 | GPT / OpenAI | openai/gpt-4.1-nano | -91.7 | Low | 70.6% | #40 | 32.3 | 710,909 | $0.07 |
| 65 | Meta / Llama | meta-llama/llama-3.2-11b-vision-instruct | -178.0 | Low | 69.3% | #8 | 30.8 | 1,920,663 | $0.09 |
AI Model Sandwich Benchmark Rankings | opensandwich.ai