Human distribution: 92.8% yes, 7.2% no over 656 explicit votes.
Model average distribution: 99.3% yes, 0.7% no across the current model set.
Closest current model: 95.0% yes.
Least aligned model: 14.8 point gap.
Legacy GPT-4o baseline: 100.0% yes, with a 7.2 point gap against humans.
Biggest model gap: 14.8 percentage points on this image.
Current classification: People mostly said yes.
Models compared: 67 current runs.

TEA
People mostly said yes
Benchmark image 12
Avocado Tea
Spicy Avocado Egg Salad "Sandwich"
A neat avocado-and-egg-salad tea sandwich looks like it was served beside very expensive gossip. It is obviously a sandwich, just one that speaks in a quieter accent than the rest of the dataset.
Under development: this benchmark and its published results are provisional, not final.
At a glance
How this photo split the room
meta-llama/llama-3.2-11b-vision-instruct
nvidia/nemotron-nano-12b-v2-vl
Benchmark context
Model spread
How Models Align with Human Responses
This compares each model against human responses to show how closely it aligns with people.
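As a rough illustration of the comparison described above, the sketch below computes a point gap as the absolute difference between a model's yes-rate and the human yes-rate. That definition is an assumption: the page does not state its formula, though it is consistent with the published GPT-4o figure (100.0% yes against 92.8% human yes, a 7.2 point gap). The function name `point_gap` and the dictionary labels are hypothetical and are not taken from the benchmark's code.

```python
def point_gap(model_yes_pct: float, human_yes_pct: float) -> float:
    """Absolute difference between two yes-rates, in percentage points.

    Assumed definition; the benchmark does not publish its exact formula.
    """
    return round(abs(model_yes_pct - human_yes_pct), 1)


HUMAN_YES = 92.8  # human yes-rate reported for this image

# Yes-rates reported on this page; the keys are illustrative labels only.
reported = {
    "legacy GPT-4o baseline": 100.0,
    "closest current model": 95.0,
}

gaps = {name: point_gap(pct, HUMAN_YES) for name, pct in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} point gap")
```

Under this definition, the closest current model at 95.0% yes would sit 2.2 points from the human rate.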
Vote card
Generated summary for this photo



Selected human comments
meta-llama/llama-3.2-11b-vision-instruct comments
nvidia/nemotron-nano-12b-v2-vl comments