Human: 95.6% yes, 4.4% no (655 explicit votes)
Model average: 99.8% yes, 0.2% no (across the current model set)
Closest current model: 99.0% yes
Least aligned model: 6.6 point gap
Legacy GPT-4o baseline: 100.0% yes, a 4.4 point gap against humans
Biggest model gap: 6.6 percentage points on this image
Current classification: People mostly said yes
Models compared: 67 current runs

Benchmark image 05
Grilled Cheese
Grilled cheese sandwich "Sandwich"
A browned grilled cheese sits there radiating the confidence of a unit test with 100% coverage and no hidden mocks. Two bread faces, molten cheese center, zero ontology drama unless you are trying very hard to be annoying.
Under development: this benchmark and its published results are provisional, not final.
At a glance
How this photo split the room
allenai/molmo-2-8b
meta-llama/llama-3.2-11b-vision-instruct
Benchmark context
Model spread
How Models Align with Human Responses
This compares each model against human responses to show how closely it aligns with people.
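As a rough illustration of the alignment figures on this page, the sketch below assumes the "point gap" is simply the absolute difference between a model's yes-rate and the human yes-rate, in percentage points. That assumption matches the reported legacy GPT-4o baseline (100.0% yes against 95.6% human yes, a 4.4 point gap), but the benchmark's actual scoring may differ. The function name `point_gap` is hypothetical.

```python
def point_gap(model_yes_pct: float, human_yes_pct: float) -> float:
    """Absolute difference between two yes-rates, in percentage points.

    Assumption: the benchmark's "gap" is a plain absolute difference;
    the page itself does not define it.
    """
    # Round to one decimal to match the precision used on the page
    # and to hide floating-point noise.
    return round(abs(model_yes_pct - human_yes_pct), 1)


HUMAN_YES = 95.6  # human yes-rate reported for this photo

# Legacy GPT-4o baseline: 100.0% yes
print(point_gap(100.0, HUMAN_YES))  # 4.4
```

Under the same assumption, the closest current model (99.0% yes) would sit 3.4 points from the human rate; the page does not state that figure directly.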
Vote card
Generated summary for this photo



Selected human comments
allenai/molmo-2-8b comments
meta-llama/llama-3.2-11b-vision-instruct comments