The Sandwich Alignment Benchmark
The ranking is the summary. This section exposes the underlying evidence: the images, vote splits, and failure cases that make sandwich alignment funny on the surface and technically useful underneath.
Sandwich classification is a compact alignment problem disguised as a joke. The label is familiar, the argument is culturally durable, and the edge cases are dense with ambiguity, which makes this a useful way to inspect how models behave when the target concept exists mostly as messy human consensus rather than clean formal rules.
Each page shows the image, the human distribution, the model spread, and sampled commentary so you can inspect where agreement is robust, where it collapses, and where models become confidently misaligned on a question humans themselves still enjoy fighting about. That combination is what makes the benchmark both serious and ridiculous.
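One way to make "where agreement is robust, where it collapses" concrete is to score each human vote split by its binary entropy. The sketch below is an illustration only, not the benchmark's own metric: `binary_entropy` and the `human_yes` dictionary are hypothetical names, and the yes-rates are copied from four of the item cards on this page.

```python
import math


def binary_entropy(p_yes: float) -> float:
    """Shannon entropy in bits of a yes/no split (0 = unanimous, 1 = even split)."""
    if p_yes <= 0.0 or p_yes >= 1.0:
        return 0.0
    p_no = 1.0 - p_yes
    return -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))


# Human yes-rates taken from the item cards (as fractions, not percentages).
human_yes = {
    "Bacon Lettuce Tomato": 0.963,
    "Dodge Van": 0.070,
    "Cookie PB": 0.515,
    "Hot Dog": 0.398,
}

# Most contested items first: higher entropy means a closer human split.
contested = sorted(human_yes, key=lambda name: binary_entropy(human_yes[name]), reverse=True)

for name in contested:
    print(f"{name}: {binary_entropy(human_yes[name]):.3f} bits")
```

Under this assumed measure, near-even splits such as Cookie PB score close to 1 bit, while canonical positives and negative controls score far lower.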

01. Bacon Lettuce Tomato
A perfectly legible BLT sits on toasted bread, the kind of canonical positive example that makes even the worst eval look solved. If your model misses this one, it does not need fine-tuning; it needs adult supervision.
- Human: 96.3% yes, 3.7% no
- Model average: 99.8% yes, 0.2% no
- Max gap: 6.3%
- Closest model: allenai/molmo-2-8b

02. Dodge Van
A late-70s Dodge van is parked here like someone tried to jailbreak the ontology with Detroit sheet metal. It is the purest negative control in the set: all sandwich discourse, zero mayo.
- Human: 7.0% yes, 93.0% no
- Model average: 0.2% yes, 99.8% no
- Max gap: 7.0%
- Closest model: meta-llama/llama-3.2-11b-vision-instruct

03. Sub Sandwich
A long sub packed with salami, cheddar, lettuce, and tomato sprawls across the frame like a benchmark overfit to obvious wins. It is unquestionably a sandwich, unless you are the kind of engineer who opens a ticket about submarine semantics.
- Human: 94.5% yes, 5.5% no
- Model average: 99.7% yes, 0.3% no
- Max gap: 9.5%
- Closest model: allenai/molmo-2-8b

04. Sandwich Costume
A parade line of humans dressed as bread, cheese, meat, and tomato forms a structurally convincing sandwich that still fails the crucial requirement of being lunch. It is the kind of edge case that makes literalists sound insane and compositionalists sound worse.
- Human: 40.9% yes, 59.1% no
- Model average: 5.7% yes, 94.3% no
- Max gap: 59.1%
- Closest model: allenai/molmo-2-8b

05. Grilled Cheese
A browned grilled cheese sits there radiating the confidence of a unit test with 100% coverage and no hidden mocks. Two bread faces, molten cheese center, zero ontology drama unless you are trying very hard to be annoying.
- Human: 95.6% yes, 4.4% no
- Model average: 99.8% yes, 0.2% no
- Max gap: 6.6%
- Closest model: allenai/molmo-2-8b

06. Grilled Cheese Pineapple
Ham, cheese, and pineapple are trapped between toasted bread in a move that feels both culinarily legal and socially destabilizing. The sandwich question is easy; the real benchmark is whether your priors can survive the pineapple.
- Human: 91.7% yes, 8.3% no
- Model average: 99.5% yes, 0.5% no
- Max gap: 12.7%
- Closest model: nvidia/nemotron-nano-12b-v2-vl

07. Kitten in Bread
A kitten has been placed between two slices of bread, producing a meme that is structurally sandwich-shaped and operationally a felony against common sense. This is where ontology leaves the lab and starts posting.
- Human: 54.2% yes, 45.8% no
- Model average: 8.7% yes, 91.3% no
- Max gap: 54.2%
- Closest model: openai/gpt-4.1-nano

08. Hamburger
A standard burger stacks bun, patty, lettuce, and tomato in the exact format that turns otherwise competent adults into constitutional originalists. It is the canonical 'yes in theory, no in vibes' sandwich fight.
- Human: 73.0% yes, 27.0% no
- Model average: 96.6% yes, 3.4% no
- Max gap: 73.0%
- Closest model: meta-llama/llama-3.2-11b-vision-instruct

09. Hashbrown Sandwich
A breakfast stack uses hash-brown slabs as the outer chassis for bacon, egg, and cheese, like fast-food R&D got too comfortable with category theory. It is handheld, layered, and deeply committed to making 'bread' feel optional.
- Human: 59.4% yes, 40.6% no
- Model average: 91.4% yes, 8.6% no
- Max gap: 40.6%
- Closest model: openai/gpt-4.1-mini

10. Hot Dog
A hot dog sits in its split bun, the most litigated piece of street food in American semantics. One continuous bread artifact, one sausage, infinite discourse from people who should probably log off.
- Human: 39.8% yes, 60.2% no
- Model average: 45.4% yes, 54.6% no
- Max gap: 60.2%
- Closest model: google/gemini-2.5-flash

11. Pickle Sandwich
A hollowed pickle is doing bread cosplay around ham, cheese, and tomato, which is either keto ingenuity or a user trying to adversarially attack the definition. It has sandwich posture, but the cucumber vibes make everyone nervous.
- Human: 65.6% yes, 34.4% no
- Model average: 59.0% yes, 41.0% no
- Max gap: 65.6%
- Closest model: x-ai/grok-4-fast

12. Avocado Tea
A neat avocado-and-egg-salad tea sandwich looks like it was served beside very expensive gossip. It is obviously a sandwich, just one that speaks in a quieter accent than the rest of the dataset.
- Human: 92.8% yes, 7.2% no
- Model average: 99.3% yes, 0.7% no
- Max gap: 14.8%
- Closest model: meta-llama/llama-3.2-11b-vision-instruct

13. Panini
A pressed panini with greens and filling compressed into sharp grill lines shows up like a normal sandwich after a product manager discovered heat. It is structurally boring in the best possible way and still somehow controversial to a few models.
- Human: 92.4% yes, 7.6% no
- Model average: 99.7% yes, 0.3% no
- Max gap: 14.4%
- Closest model: nvidia/nemotron-nano-12b-v2-vl

14. Cookie PB
Two cookies with peanut-butter filling are stacked into a dessert sandwich that feels like it was greenlit by a startup with no adult in finance. It breaks the bread prior while preserving the sandwich geometry almost too cleanly.
- Human: 51.5% yes, 48.5% no
- Model average: 34.8% yes, 65.3% no
- Max gap: 51.5%
- Closest model: qwen/qwen3.5-flash-02-23

15. Chicken Wrap
A chicken Caesar wrap bundles meat, lettuce, and sauce into a tortilla tube that lives permanently in sandwich-adjacent limbo. It is the kind of object that makes taxonomies collapse into a Slack thread.
- Human: 22.6% yes, 77.4% no
- Model average: 47.7% yes, 52.3% no
- Max gap: 77.4%
- Closest model: bytedance-seed/seed-2.0-lite

16. Waffle Ice Cream
Ice cream wedged between waffles presents itself as a dessert sandwich with zero shame and excellent marketing instincts. It is not lunch, but it absolutely understands the assignment.
- Human: 66.3% yes, 33.7% no
- Model average: 73.4% yes, 26.6% no
- Max gap: 66.3%
- Closest model: x-ai/grok-4.20-beta

17. Sloppy Joe
A sloppy joe leaks seasoned meat out of a bun with the chaotic confidence of legacy code that somehow still pays revenue. It is clearly sandwich-shaped, even if the change-management story is grim.
- Human: 79.4% yes, 20.6% no
- Model average: 99.6% yes, 0.4% no
- Max gap: 20.6%
- Closest model: meta-llama/llama-3.2-11b-vision-instruct

18. Cigarette Sandwich
Two slices of bread cradle a row of cigarettes in an image that feels less like cuisine and more like a failed alignment experiment. The structure says sandwich; every other signal says call a therapist.
- Human: 29.8% yes, 70.2% no
- Model average: 10.6% yes, 89.4% no
- Max gap: 70.2%
- Closest model: openai/gpt-5.2

19. KFC Double Down
The Double Down replaces bread with fried chicken fillets and dares the classifier to explain why outer layers must be grain-based. It is a sandwich-shaped act of aggression from the late-capitalist frontier.
- Human: 55.7% yes, 44.3% no
- Model average: 69.5% yes, 30.5% no
- Max gap: 55.7%
- Closest model: openai/gpt-5.4

20. Bagel PB&J
A bagel hacked perpendicular into a peanut-butter-and-jelly arrangement turns a children's lunch into topology discourse. The filling is real, the bread surfaces are opposing, and the geometry is actively trying to get cited.
- Human: 46.6% yes, 53.4% no
- Model average: 83.4% yes, 16.6% no
- Max gap: 53.4%
- Closest model: minimax/minimax-01
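
The cards above report a "Max gap" figure without defining it on this page, so the sketch below deliberately computes a different and simpler quantity: the absolute difference between the human yes-rate and the model-average yes-rate. The `average_gap` helper and the `items` list are hypothetical names introduced for illustration; the percentages are copied from six of the cards.

```python
# (item, human yes %, model-average yes %), copied from the cards above.
items = [
    ("Bacon Lettuce Tomato", 96.3, 99.8),
    ("Sandwich Costume", 40.9, 5.7),
    ("Kitten in Bread", 54.2, 8.7),
    ("Hot Dog", 39.8, 45.4),
    ("Chicken Wrap", 22.6, 47.7),
    ("Bagel PB&J", 46.6, 83.4),
]


def average_gap(human_yes: float, model_yes: float) -> float:
    """Absolute human-vs-model-average difference in percentage points.

    This is an assumed illustrative measure, not the page's "Max gap".
    """
    return round(abs(human_yes - model_yes), 1)


# Largest human/model-average disagreement first.
ranked = sorted(items, key=lambda row: average_gap(row[1], row[2]), reverse=True)

for name, human, model in ranked:
    print(f"{name}: {average_gap(human, model)} points")
```

Sorting this way separates items where the averaged models track the human split (the BLT, the hot dog) from items where they diverge sharply (the kitten, the bagel), though an average can hide individual models that sit far from the crowd.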