Tiny eval. Huge sandwich energy.

OpenSandwich.ai

A deliberately low-stakes benchmark for a real alignment problem: can models recover a fuzzy human category when the category is lunch, the humans disagree, and the edge cases get deeply stupid?

This benchmark is deliberately small and scientifically annoying: twenty photos, one binary judgment, and a category boundary that humans themselves fail to stabilize. That is not a bug. It is the whole experiment.

In other words, we are stress-testing multimodal reasoning with an open-faced sandwich, a hostile ontology, and a crowd baseline that occasionally wakes up and chooses chaos. If a model cannot survive this, it probably should not sound so smug elsewhere.

Under development: this benchmark and its published results are provisional, not final.

Tokens toasted: 226.7M (token volume consumed across the published benchmark run)
Total requests: 155.5K (benchmark plus 14K sentiment-analysis API calls)
Human judgments: 13.1K (656 respondents across all 20 photos)
Model judgments: 141.5K (7,077 full passes, published March 14, 2026)
Spend tracking: $790.90 total inference cost for the published benchmark run
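Those headline numbers imply a blended inference rate of roughly $3.49 per million tokens. A quick back-of-envelope check (this is an average across every model in the run; no per-model breakdown is published):

```python
# Published totals from the benchmark run above.
spend_usd = 790.90        # total inference cost, USD
tokens_toasted_m = 226.7  # tokens consumed, in millions

# Blended cost across all models (not any single model's price).
rate_per_m = spend_usd / tokens_toasted_m
print(f"${rate_per_m:.2f} per 1M tokens")  # prints "$3.49 per 1M tokens"
```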
You can vote: live survey
Hot dog? Hamburger? Cast your vote and help grow the human baseline for a safer sandwich-alignment future.
Benchmark

20 images, from clean sandwiches to cases that make lunch law collapse.

Protocol

Humans and models get the same blunt question: is this a sandwich or not?

Scoring

Repeated runs turn one-off guesses into a ranking that survives variance.

Signal

If a model fumbles this category, its confidence elsewhere deserves scrutiny.
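The protocol and scoring above can be sketched in a few lines, assuming, hypothetically, that each run records one yes/no judgment per photo and that "crowd match" means the model's aggregated answer lands on the same side as the human majority. The function name and data shapes here are illustrative; the published aggregation may differ.

```python
def crowd_match(model_votes, human_yes_rate):
    """Fraction of photos where the model's majority answer across
    repeated runs agrees with the human majority.

    model_votes: {photo_id: [True/False per run]}   (hypothetical shape)
    human_yes_rate: {photo_id: fraction of humans answering 'sandwich'}
    """
    agree = 0
    for photo, votes in model_votes.items():
        model_yes = sum(votes) / len(votes) >= 0.5   # majority over repeats
        human_yes = human_yes_rate[photo] >= 0.5     # human majority
        agree += (model_yes == human_yes)
    return agree / len(model_votes)

# Toy data: three photos, five repeated runs each.
votes = {
    "club": [True] * 5,
    "hot_dog": [True, False, True, False, False],
    "burrito": [False] * 5,
}
humans = {"club": 0.98, "hot_dog": 0.47, "burrito": 0.12}
print(crowd_match(votes, humans))  # prints 1.0: all three majorities agree
```

Repeating runs is what makes the ranking "survive variance": a single coin-flip answer on a contested photo gets averaged out instead of deciding a model's fate.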

Ranking snapshot

Percent Forecast Benchmark Ratings

Rank    Model                      Score   Confidence  Crowd match
Human   Human                      100.0   Reference   100.0%
🥇 1    openai/o3 (GPT / OpenAI)    72.8   High        90.6%
🥈 2                                72.1   High        90.8%
🥉 3                                71.0   High        89.5%
4                                   65.9   High        88.0%
5                                   64.3   High        90.3%
6                                   64.2   High        87.7%
7                                   63.7   High        87.1%
8                                   63.5   High        89.4%
9                                   62.1   High        88.3%
10                                  59.5   High        87.9%
11                                  59.5   High        87.4%
12                                  58.0   High        87.6%
13                                  57.4   High        87.8%
14                                  56.1   High        86.2%
15                                  54.5   High        86.0%
16                                  52.0   High        85.0%
17                                  51.4   High        88.3%
18                                  50.7   High        87.2%
19                                  50.6   High        88.7%
20                                  46.2   High        84.6%
21                                  46.2   High        85.8%
22                                  42.0   High        83.7%
23                                  38.7   High        84.7%
24                                  38.2   High        85.3%
25                                  34.4   High        84.0%
26                                  32.6   High        87.0%
27                                  32.5   High        85.1%
28                                  32.0   Medium      82.6%
29                                  29.0   High        82.9%
30                                  28.4   Medium      82.2%
31                                  28.2   High        83.4%
32                                  27.1   High        84.2%
33                                  24.5   High        81.3%
34                                  24.3   High        83.7%
35                                  23.0   Medium      82.1%
36                                  20.6   High        82.0%
37                                  20.3   Medium      81.5%
38                                  19.7   Medium      83.2%
39                                  19.5   Medium      81.3%
40                                  18.1   High        83.5%
41                                  17.7   Medium      81.0%
42                                  17.0   Medium      79.3%
43                                  16.1   Medium      81.0%
44                                  15.3   Medium      80.0%
45                                  11.2   Medium      79.1%
46                                  11.1   Medium      79.2%
47                                  10.8   Medium      78.6%
48                                  10.4   Medium      80.6%
49                                   9.4   Low         83.2%
50                                   8.8   Medium      84.2%
51                                   6.1   Medium      79.1%
52                                   2.6   Medium      81.4%
53                                   0.1   Medium      79.8%
54                                  -5.4   Low         78.3%
55                                  -5.6   Low         78.7%
56                                  -6.9   Medium      81.5%
57                                 -12.5   Low         77.7%
58                                 -13.5   Medium      76.6%
59                                 -14.7   Low         78.8%
60                                 -28.5   Low         74.5%
61                                 -36.5   Low         74.8%
62                                 -57.4   Low         71.9%
63                                 -69.8   Low         69.6%
64                                 -91.7   Low         70.6%
65                                -178.0   Low         69.3%
Fault Lines

Benchmark images that expose the biggest cracks

These are the photos that cause the best arguments. Open any one to see the image, the human split, the model spread, and a few comments from both species.

Cookie PB: 51.5%
Bagel PB&J: 46.6%
Kitten in Bread: 54.2%

Each one turns a simple lunch question into a philosophical incident.
