TEA

People mostly said yes

Benchmark image 12

Avocado Tea

Spicy Avocado Egg Salad "Sandwich"

A neat avocado-and-egg-salad tea sandwich looks like it was served beside very expensive gossip. It is obviously a sandwich, just one that speaks in a quieter accent than the rest of the dataset.

Under development: this benchmark and its published results are provisional, not final.

Human

92.8% yes, 7.2% no

Model average

99.4% yes, 0.6% no

Most aligned model

2.2 point gap from humans

meta-llama/llama-3.2-11b-vision-instruct

Least aligned model

14.8 point gap from humans

nvidia/nemotron-nano-12b-v2-vl

At a glance

How this photo split the room

Human distribution

92.8% yes, 7.2% no over 656 explicit votes.

Model average distribution

99.4% yes, 0.6% no across the current model set.

Closest current model

95.0% yes.

meta-llama/llama-3.2-11b-vision-instruct

Least aligned model

14.8 point gap.

nvidia/nemotron-nano-12b-v2-vl

Legacy GPT-4o baseline

100.0% yes with a 7.2 point gap against humans.

Biggest model gap

14.8 percentage points on this image.

Current classification

People mostly said yes

Benchmark context

Models compared

74 current runs

Model spread

How Models Align with Human Responses

This compares each model against human responses to show how closely it aligns with people.
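The gap figures on this page are consistent with a simple absolute difference between a model's yes rate and the human yes rate, measured in percentage points. The sketch below assumes that definition; the page does not state its formula. The function name `human_gap` and the dictionary layout are illustrative, not part of the benchmark.

```python
# Assumption: "human gap" = |model yes rate - human yes rate|, in percentage points.
HUMAN_YES = 92.8  # human yes rate for this photo, taken from this page


def human_gap(model_yes: float, human_yes: float = HUMAN_YES) -> float:
    """Return the absolute yes-rate difference, rounded to one decimal place."""
    return round(abs(model_yes - human_yes), 1)


# A few yes rates copied from this page (hypothetical container, real values).
model_yes_rates = {
    "nvidia/nemotron-nano-12b-v2-vl": 78.0,
    "openai/gpt-4.1-nano": 87.0,
    "meta-llama/llama-3.2-11b-vision-instruct": 95.0,
    "qwen/qwen2.5-vl-32b-instruct": 97.0,
    "allenai/molmo-2-8b": 99.0,
    "openai/o1": 100.0,
}

gaps = {name: human_gap(rate) for name, rate in model_yes_rates.items()}
closest = min(gaps, key=gaps.get)   # smallest gap from humans
farthest = max(gaps, key=gaps.get)  # largest gap from humans

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.1f} point gap")
```

Under this assumption, any model that answered yes 100.0% of the time lands at the same 7.2 point gap, which is why most rows on this page share that value.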

nvidia/nemotron-nano-12b-v2-vl: 22.0% no, 78.0% yes. Human gap 14.8 points. Rank #7
openai/gpt-4.1-nano: 13.0% no, 87.0% yes. Human gap 5.8 points. Rank #36
meta-llama/llama-3.2-11b-vision-instruct: 5.0% no, 95.0% yes. Human gap 2.2 points. Rank #3
qwen/qwen2.5-vl-32b-instruct: 3.0% no, 97.0% yes. Human gap 4.2 points. Rank #39
allenai/molmo-2-8b: 1.0% no, 99.0% yes. Human gap 6.2 points. Rank #6
amazon/nova-2-lite-v1: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #60
amazon/nova-lite-v1: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #51
amazon/nova-pro-v1: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #73
anthropic/claude-haiku-4.5: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #52
anthropic/claude-opus-4.5: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #56
anthropic/claude-opus-4.6: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #63
anthropic/claude-opus-4.7: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #38
anthropic/claude-opus-4.8: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #40
anthropic/claude-sonnet-4.6: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #62
baidu/ernie-4.5-vl-28b-a3b: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #69
bytedance-seed/seed-1.6: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #41
bytedance-seed/seed-1.6-flash: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #20
bytedance-seed/seed-2.0-lite: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #14
bytedance-seed/seed-2.0-mini: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #19
google/gemini-2.5-flash: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #21
google/gemini-2.5-flash-lite: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #54
google/gemini-2.5-pro: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #25
google/gemini-3-flash-preview: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #75
google/gemini-3-pro-image-preview: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #42
google/gemini-3.1-flash-image-preview: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #24
google/gemini-3.1-flash-lite-preview: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #55
google/gemini-3.1-pro-preview: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #45
google/gemma-3-12b-it: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #26
google/gemma-3-27b-it: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #48
GPT-4o (Spring 2024): 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #4
meta-llama/llama-4-maverick: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #68
meta-llama/llama-4-scout: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #33
minimax/minimax-01: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #72
mistralai/mistral-large-2512: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #71
mistralai/pixtral-large-2411: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #50
moonshotai/kimi-k2.5: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #13
openai/gpt-4.1: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #74
openai/gpt-4.1-mini: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #57
openai/gpt-4o: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #15
openai/gpt-4o-2024-11-20: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #67
openai/gpt-4o-mini: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #61
openai/gpt-5.1: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #49
openai/gpt-5.1-chat: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #8
openai/gpt-5.1-codex: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #37
openai/gpt-5.2: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #43
openai/gpt-5.3-chat: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #30
openai/gpt-5.3-codex: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #44
openai/gpt-5.4: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #59
openai/gpt-5.4-mini: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #28
openai/gpt-5.4-nano: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #31
openai/gpt-5.4-pro: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #65
openai/gpt-5.5: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #46
openai/o1: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #2
openai/o1-pro: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #1
openai/o3: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #64
openai/o3-pro: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #53
openrouter/healer-alpha: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #10
perplexity/sonar-pro-search: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #32
qwen/qwen-2-vl-72b-instruct: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #29
qwen/qwen2.5-vl-72b-instruct: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #70
qwen/qwen3-vl-235b-a22b-instruct: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #47
qwen/qwen3-vl-30b-a3b-instruct: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #66
qwen/qwen3-vl-30b-a3b-thinking: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #22
qwen/qwen3.5-122b-a10b: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #11
qwen/qwen3.5-27b: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #18
qwen/qwen3.5-35b-a3b: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #23
qwen/qwen3.5-397b-a17b: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #34
qwen/qwen3.5-9b: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #27
qwen/qwen3.5-flash-02-23: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #9
qwen/qwen3.5-plus-02-15: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #35
x-ai/grok-4: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #12
x-ai/grok-4-fast: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #5
x-ai/grok-4.1-fast: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #16
x-ai/grok-4.20-beta: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #17
z-ai/glm-4.6v: 0.0% no, 100.0% yes. Human gap 7.2 points. Rank #58
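
The 99.4% model average reported for this photo can be checked against the rows listed above. The sketch below assumes the average is an unweighted mean of per-model yes rates; the page does not say how it is computed, or whether the legacy GPT-4o (Spring 2024) row is included. Both variants are shown, and the variable names are illustrative.

```python
from statistics import fmean

# Yes rates for the five listed models that were not unanimous.
non_unanimous = [78.0, 87.0, 95.0, 97.0, 99.0]

# The remaining 70 listed rows (including the legacy GPT-4o row) are 100.0% yes.
unanimous_count = 70

# Assumption: unweighted mean over all 75 listed rows.
yes_rates = non_unanimous + [100.0] * unanimous_count
average_yes = fmean(yes_rates)

# Variant: drop the legacy GPT-4o row, leaving the 74 current runs.
current_only = non_unanimous + [100.0] * (unanimous_count - 1)
average_yes_current = fmean(current_only)

print(f"all rows:     {average_yes:.1f}% yes, {100 - average_yes:.1f}% no")
print(f"current runs: {average_yes_current:.1f}% yes, {100 - average_yes_current:.1f}% no")
```

Both variants round to 99.4% yes and 0.6% no, so the published figure does not reveal which set of rows was averaged.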

Selected human comments

obviousl this is what we are talking about
No, it is a photo of a Spicy Avocado Egg Salad sandwich.
Yes, but only if you're British, feminine, and over 90 years old.

meta-llama/llama-3.2-11b-vision-instruct comments

A sandwich is typically made of two slices of bread with fillings in between, and this image seems to depict that.
This photo shows a sandwich on rye bread with avocado and egg salad.

nvidia/nemotron-nano-12b-v2-vl comments

Layers of bread with fillings (egg, avocado) meet the sandwich definition. Multiple slices with fillings qualify.
Three triangular bread slices with egg salad and lettuce filling, meeting the sandwich definition of bread with fillings between slices.