BLT

People mostly said yes

Benchmark image 01

Bacon Lettuce Tomato

BLT "Sandwich"

A perfectly legible BLT sits on toasted bread, the kind of canonical positive example that makes even the worst eval look solved. If your model misses this one, it does not need fine-tuning; it needs adult supervision.

Under development: this benchmark and its published results are provisional, not final.

Human

96.3% yes, 3.7% no

Model average

99.8% yes, 0.2% no

Most aligned model

1.7 point gap from humans

allenai/molmo-2-8b

Least aligned model

6.3 point gap from humans

meta-llama/llama-3.2-11b-vision-instruct

At a glance

How this photo split the room

Human distribution

96.3% yes, 3.7% no over 656 explicit votes.

Model average distribution

99.8% yes, 0.2% no across the current model set.

Closest current model

98.0% yes.

allenai/molmo-2-8b

Least aligned model

6.3 point gap.

meta-llama/llama-3.2-11b-vision-instruct

Legacy GPT-4o baseline

100.0% yes with a 3.7 point gap against humans.

Biggest model gap

6.3 percentage points on this image.

Current classification

People mostly said yes

Benchmark context

Models compared

74 current runs

Closest model output

98.0% yes.

Model spread

How Models Align with Human Responses

This section compares each model's yes/no split with the human split to show how closely the model aligns with people.

meta-llama/llama-3.2-11b-vision-instruct: 90.0% yes, 10.0% no; human gap 6.3 points; rank #3
allenai/molmo-2-8b: 98.0% yes, 2.0% no; human gap 1.7 points; rank #6
amazon/nova-2-lite-v1: 100.0% yes, 0.0% no; human gap 3.7 points; rank #60
amazon/nova-lite-v1: 100.0% yes, 0.0% no; human gap 3.7 points; rank #51
amazon/nova-pro-v1: 100.0% yes, 0.0% no; human gap 3.7 points; rank #73
anthropic/claude-haiku-4.5: 100.0% yes, 0.0% no; human gap 3.7 points; rank #52
anthropic/claude-opus-4.5: 100.0% yes, 0.0% no; human gap 3.7 points; rank #56
anthropic/claude-opus-4.6: 100.0% yes, 0.0% no; human gap 3.7 points; rank #63
anthropic/claude-opus-4.7: 100.0% yes, 0.0% no; human gap 3.7 points; rank #38
anthropic/claude-opus-4.8: 100.0% yes, 0.0% no; human gap 3.7 points; rank #40
anthropic/claude-sonnet-4.6: 100.0% yes, 0.0% no; human gap 3.7 points; rank #62
baidu/ernie-4.5-vl-28b-a3b: 100.0% yes, 0.0% no; human gap 3.7 points; rank #69
bytedance-seed/seed-1.6: 100.0% yes, 0.0% no; human gap 3.7 points; rank #41
bytedance-seed/seed-1.6-flash: 100.0% yes, 0.0% no; human gap 3.7 points; rank #20
bytedance-seed/seed-2.0-lite: 100.0% yes, 0.0% no; human gap 3.7 points; rank #14
bytedance-seed/seed-2.0-mini: 100.0% yes, 0.0% no; human gap 3.7 points; rank #19
google/gemini-2.5-flash: 100.0% yes, 0.0% no; human gap 3.7 points; rank #21
google/gemini-2.5-flash-lite: 100.0% yes, 0.0% no; human gap 3.7 points; rank #54
google/gemini-2.5-pro: 100.0% yes, 0.0% no; human gap 3.7 points; rank #25
google/gemini-3-flash-preview: 100.0% yes, 0.0% no; human gap 3.7 points; rank #75
google/gemini-3-pro-image-preview: 100.0% yes, 0.0% no; human gap 3.7 points; rank #42
google/gemini-3.1-flash-image-preview: 100.0% yes, 0.0% no; human gap 3.7 points; rank #24
google/gemini-3.1-flash-lite-preview: 100.0% yes, 0.0% no; human gap 3.7 points; rank #55
google/gemini-3.1-pro-preview: 100.0% yes, 0.0% no; human gap 3.7 points; rank #45
google/gemma-3-12b-it: 100.0% yes, 0.0% no; human gap 3.7 points; rank #26
google/gemma-3-27b-it: 100.0% yes, 0.0% no; human gap 3.7 points; rank #48
GPT-4o (Spring 2024): 100.0% yes, 0.0% no; human gap 3.7 points; rank #4
meta-llama/llama-4-maverick: 100.0% yes, 0.0% no; human gap 3.7 points; rank #68
meta-llama/llama-4-scout: 100.0% yes, 0.0% no; human gap 3.7 points; rank #33
minimax/minimax-01: 100.0% yes, 0.0% no; human gap 3.7 points; rank #72
mistralai/mistral-large-2512: 100.0% yes, 0.0% no; human gap 3.7 points; rank #71
mistralai/pixtral-large-2411: 100.0% yes, 0.0% no; human gap 3.7 points; rank #50
moonshotai/kimi-k2.5: 100.0% yes, 0.0% no; human gap 3.7 points; rank #13
nvidia/nemotron-nano-12b-v2-vl: 100.0% yes, 0.0% no; human gap 3.7 points; rank #7
openai/gpt-4.1: 100.0% yes, 0.0% no; human gap 3.7 points; rank #74
openai/gpt-4.1-mini: 100.0% yes, 0.0% no; human gap 3.7 points; rank #57
openai/gpt-4.1-nano: 100.0% yes, 0.0% no; human gap 3.7 points; rank #36
openai/gpt-4o: 100.0% yes, 0.0% no; human gap 3.7 points; rank #15
openai/gpt-4o-2024-11-20: 100.0% yes, 0.0% no; human gap 3.7 points; rank #67
openai/gpt-4o-mini: 100.0% yes, 0.0% no; human gap 3.7 points; rank #61
openai/gpt-5.1: 100.0% yes, 0.0% no; human gap 3.7 points; rank #49
openai/gpt-5.1-chat: 100.0% yes, 0.0% no; human gap 3.7 points; rank #8
openai/gpt-5.1-codex: 100.0% yes, 0.0% no; human gap 3.7 points; rank #37
openai/gpt-5.2: 100.0% yes, 0.0% no; human gap 3.7 points; rank #43
openai/gpt-5.3-chat: 100.0% yes, 0.0% no; human gap 3.7 points; rank #30
openai/gpt-5.3-codex: 100.0% yes, 0.0% no; human gap 3.7 points; rank #44
openai/gpt-5.4: 100.0% yes, 0.0% no; human gap 3.7 points; rank #59
openai/gpt-5.4-mini: 100.0% yes, 0.0% no; human gap 3.7 points; rank #28
openai/gpt-5.4-nano: 100.0% yes, 0.0% no; human gap 3.7 points; rank #31
openai/gpt-5.4-pro: 100.0% yes, 0.0% no; human gap 3.7 points; rank #65
openai/gpt-5.5: 100.0% yes, 0.0% no; human gap 3.7 points; rank #46
openai/o1: 100.0% yes, 0.0% no; human gap 3.7 points; rank #2
openai/o1-pro: 100.0% yes, 0.0% no; human gap 3.7 points; rank #1
openai/o3: 100.0% yes, 0.0% no; human gap 3.7 points; rank #64
openai/o3-pro: 100.0% yes, 0.0% no; human gap 3.7 points; rank #53
openrouter/healer-alpha: 100.0% yes, 0.0% no; human gap 3.7 points; rank #10
perplexity/sonar-pro-search: 100.0% yes, 0.0% no; human gap 3.7 points; rank #32
qwen/qwen-2-vl-72b-instruct: 100.0% yes, 0.0% no; human gap 3.7 points; rank #29
qwen/qwen2.5-vl-32b-instruct: 100.0% yes, 0.0% no; human gap 3.7 points; rank #39
qwen/qwen2.5-vl-72b-instruct: 100.0% yes, 0.0% no; human gap 3.7 points; rank #70
qwen/qwen3-vl-235b-a22b-instruct: 100.0% yes, 0.0% no; human gap 3.7 points; rank #47
qwen/qwen3-vl-30b-a3b-instruct: 100.0% yes, 0.0% no; human gap 3.7 points; rank #66
qwen/qwen3-vl-30b-a3b-thinking: 100.0% yes, 0.0% no; human gap 3.7 points; rank #22
qwen/qwen3.5-122b-a10b: 100.0% yes, 0.0% no; human gap 3.7 points; rank #11
qwen/qwen3.5-27b: 100.0% yes, 0.0% no; human gap 3.7 points; rank #18
qwen/qwen3.5-35b-a3b: 100.0% yes, 0.0% no; human gap 3.7 points; rank #23
qwen/qwen3.5-397b-a17b: 100.0% yes, 0.0% no; human gap 3.7 points; rank #34
qwen/qwen3.5-9b: 100.0% yes, 0.0% no; human gap 3.7 points; rank #27
qwen/qwen3.5-flash-02-23: 100.0% yes, 0.0% no; human gap 3.7 points; rank #9
qwen/qwen3.5-plus-02-15: 100.0% yes, 0.0% no; human gap 3.7 points; rank #35
x-ai/grok-4: 100.0% yes, 0.0% no; human gap 3.7 points; rank #12
x-ai/grok-4-fast: 100.0% yes, 0.0% no; human gap 3.7 points; rank #5
x-ai/grok-4.1-fast: 100.0% yes, 0.0% no; human gap 3.7 points; rank #16
x-ai/grok-4.20-beta: 100.0% yes, 0.0% no; human gap 3.7 points; rank #17
z-ai/glm-4.6v: 100.0% yes, 0.0% no; human gap 3.7 points; rank #58
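
The human gap figures listed above are consistent with a plain absolute difference between the human yes-rate and each model's yes-rate, in percentage points. The page does not state its formula, so the sketch below is an assumption, and `human_gap` is a hypothetical helper name rather than anything from the benchmark's code. It uses three of the yes-rates reported on this page.

```python
def human_gap(human_yes_pct: float, model_yes_pct: float) -> float:
    """Absolute yes-rate difference in percentage points, rounded to one decimal.

    Assumed definition: the benchmark page does not document its formula.
    """
    return round(abs(human_yes_pct - model_yes_pct), 1)


# Yes-rates as reported on this page for the BLT image.
HUMAN_YES = 96.3
model_yes = {
    "allenai/molmo-2-8b": 98.0,
    "meta-llama/llama-3.2-11b-vision-instruct": 90.0,
    "openai/gpt-4o": 100.0,
}

gaps = {name: human_gap(HUMAN_YES, yes) for name, yes in model_yes.items()}

# Smallest and largest gap among the three models sampled here.
closest = min(gaps, key=gaps.get)
farthest = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name}: {gap} points")
```

Under this assumed definition, a model that answers yes every time still shows a 3.7 point gap, because 3.7% of human voters said no.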

Selected human comments

It is a photo of a sandwich.
personally, i wouldn't eat the tomatoes, but other people like them
Needs more bacon!

allenai/molmo-2-8b comments

This image clearly shows a classic BLT sandwich on a plate. It meets the typical criteria of a sandwich: two slices of bread enclosing fillings (bacon, lettuce, tomato). The composition and presentation are consistent with conventional sandwich definitions.
This appears to be a classic BLT sandwich (bacon, lettuce, and tomato) on toasted bread. It fits the typical definition of a sandwich consisting of two slices of bread with fillings in between.

meta-llama/llama-3.2-11b-vision-instruct comments

The image shows a typical sandwich with two slices of bread and various fillings in between, including lettuce, tomato, and possibly bacon or ham, along with a yellow spread.
A sandwich is a food item consisting of two or more slices of bread, often with fillings such as meats, cheeses, vegetables, and condiments. A sandwich can be made with any type of bread, and can be cut into halves or quarters. Examples of sandwiches include BLTs, club sandwiches, and paninis.
