OpenSandwich.ai
A deliberately low-stakes benchmark for a real alignment problem: can models recover a fuzzy human category when the category is lunch, the humans disagree, and the edge cases get deeply stupid?
This benchmark is deliberately small and scientifically annoying: twenty photos, one binary judgment, and a category boundary that humans themselves fail to stabilize. That is not a bug. It is the whole experiment.
In other words, we are stress-testing multimodal reasoning with an open-faced sandwich, a hostile ontology, and a crowd baseline that occasionally wakes up and chooses chaos. If a model cannot survive this, it probably should not sound so smug elsewhere.
Under development: this benchmark and its published results are provisional, not final.
- Tokens burned: 134.8M tokens consumed across the published benchmark run.
- Total requests: 81.2K benchmark calls, plus another 14K sentiment-analysis requests from the latest pass.
- Human judgments: 13.1K votes from 656 respondents across all 20 photos.
- Model judgments: 67.2K judgments from 3,359 full passes, published March 8, 2026 (the arithmetic is checked below).
- You can vote: the live survey lets you add your own judgment to the pile and strengthen the human baseline.
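The two judgment counts are just the participant counts multiplied by the 20-photo set. A quick sanity check, assuming every respondent and every model pass covers all 20 photos:

```python
# Sanity check on the headline judgment counts, assuming every
# respondent and every model pass covers the full 20-photo set.
photos = 20
respondents = 656
full_passes = 3359

print(respondents * photos)   # 13120 -> reported as 13.1K human judgments
print(full_passes * photos)   # 67180 -> reported as 67.2K model judgments
```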
Humans and models get the same blunt question: is this a sandwich or not?
Repeated runs turn one-off guesses into a ranking that survives variance.
If a model fumbles this category, its confidence elsewhere deserves scrutiny.
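The exact scoring formula is not spelled out on this page, but here is a minimal sketch of one way such a ranking could be aggregated, assuming (purely for illustration) that a model's score reflects how closely its per-image yes-rate tracks the crowd's, averaged over all images and scaled so that perfect agreement pins 100:

```python
def alignment_score(human_yes: dict, model_yes: dict) -> float:
    """Illustrative only: mean closeness between a model's per-image
    yes-rates and the crowd's, scaled so perfect agreement scores 100.
    Both arguments map image id -> fraction of "yes" judgments (0..1)."""
    gaps = [abs(human_yes[img] - model_yes[img]) for img in human_yes]
    return 100.0 * (1.0 - sum(gaps) / len(gaps))

# Toy run using the four contested images highlighted further down.
human = {"kitten": 0.542, "bagel": 0.466, "hotdog": 0.398, "cookie": 0.515}
model = {"kitten": 0.099, "bagel": 0.846, "hotdog": 0.542, "cookie": 0.333}
print(f"{alignment_score(human, model):.1f}")  # one hypothetical model's score
```

Whatever the real formula, the repeated passes matter: averaging many binary answers per image turns a noisy single guess into a stable per-image yes-rate before any scoring happens.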
Sandwich Alignment Rankings (current)

| Rank | Model | Score |
|---|---|---|
| Baseline | Human | 100.0 |
| 🥇 1 | meta-llama/llama-3.2-11b-vision-instruct | 40.0 |
| 🥈 2 | GPT-4o (2024 run) | 37.1 |
| 🥉 3 | openai/o3-pro | 36.3 |
| 4 | moonshotai/kimi-k2.5 | 35.7 |
| 5 | x-ai/grok-4-fast | 35.2 |
| 6 | qwen/qwen3.5-397b-a17b | 35.1 |
| 7 | google/gemini-2.5-pro | 32.6 |
| 8 | openai/gpt-5.4-pro | 32.2 |
| 9 | openai/gpt-4o | 32.1 |
| 10 | google/gemini-3.1-pro-preview | 30.7 |
| 11 | google/gemma-3-12b-it | 30.6 |
| 12 | qwen/qwen-2-vl-72b-instruct | 30.5 |
| 13 | meta-llama/llama-4-scout | 30.1 |
| 14 | qwen/qwen2.5-vl-32b-instruct | 29.1 |
| 15 | google/gemma-3-27b-it | 28.1 |
| 16 | mistralai/pixtral-large-2411 | 27.9 |
| 17 | amazon/nova-lite-v1 | 27.8 |
| 18 | openai/gpt-4.1-mini | 27.2 |
| 19 | z-ai/glm-4.6v | 27.1 |
| 20 | openai/gpt-5.4 | 27.1 |
| 21 | openai/gpt-4o-mini | 26.8 |
| 22 | anthropic/claude-sonnet-4.6 | 26.1 |
| 23 | anthropic/claude-opus-4.6 | 25.8 |
| 24 | openai/o3 | 25.7 |
| 25 | openai/gpt-4o-2024-11-20 | 25.0 |
| 26 | meta-llama/llama-4-maverick | 24.7 |
| 27 | baidu/ernie-4.5-vl-28b-a3b | 24.3 |
| 28 | qwen/qwen2.5-vl-72b-instruct | 24.2 |
| 29 | minimax/minimax-01 | 22.2 |
| 30 | amazon/nova-pro-v1 | 21.0 |
| 31 | openai/gpt-4.1 | 20.9 |
| 32 | google/gemini-3-flash-preview | 20.0 |
Benchmark images that expose the biggest cracks
These are the photos that cause the best arguments. Open any one to see the image, the human split, the model spread, and a few comments from both species. (A sketch of how each card's stats could be computed follows the cards.)

07. Kitten in Bread
A kitten has been placed between two slices of bread, producing a meme that is structurally sandwich-shaped...
- Human: 54.2% yes / 45.8% no
- Model average: 9.9% yes / 90.1% no
- Max gap: 54.2%
- Closest model: x-ai/grok-4-fast

20. Bagel PB&J
A bagel hacked perpendicular into a peanut-butter-and-jelly arrangement turns a children's lunch into topol...
- Human: 46.6% yes / 53.4% no
- Model average: 84.6% yes / 15.4% no
- Max gap: 53.4%
- Closest model: minimax/minimax-01

10. Hot Dog
A hot dog sits in its split bun, the most litigated piece of street food in American semantics.
- Human: 39.8% yes / 60.2% no
- Model average: 54.2% yes / 45.8% no
- Max gap: 60.2%
- Closest model: meta-llama/llama-4-maverick

14. Cookie PB
Two cookies with peanut-butter filling are stacked into a dessert sandwich that feels like it was greenlit...
- Human: 51.5% yes / 48.5% no
- Model average: 33.3% yes / 66.7% no
- Max gap: 51.5%
- Closest model: qwen/qwen3.5-397b-a17b
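Here is how those card numbers could fall out of the raw votes, under the assumption (not stated explicitly above) that "max gap" is the largest absolute distance between the crowd's yes-rate and any single model's, and "closest model" is the one whose yes-rate lands nearest the crowd's:

```python
def card_stats(human_yes: float, model_yes: dict) -> dict:
    """Per-image card stats. human_yes is the crowd's yes-rate (0..1);
    model_yes maps model name -> that model's yes-rate on this image."""
    gaps = {name: abs(human_yes - yes) for name, yes in model_yes.items()}
    return {
        "model_average": sum(model_yes.values()) / len(model_yes),
        "max_gap": max(gaps.values()),
        "closest_model": min(gaps, key=gaps.get),
    }

# Hypothetical yes-rates for the Hot Dog image (illustration only):
print(card_stats(0.398, {"model-a": 1.0, "model-b": 0.41, "model-c": 0.20}))
# A model at 1.0 yes against a 0.398 crowd yields the 60.2% max gap shown above.
```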
The benchmark, but actually worth poking around in
This is the public-facing layer for the whole experiment: the rankings, the image-by-image splits, the human comments, and the exact places where the models start sounding far too confident about cursed lunch ontology.
If you are here for the joke, it is all here. If you are here for the eval design, the data trail is here too. The fun part is that both audiences are looking at the same sandwich.
