Human distribution: 59.4% yes, 40.6% no over 655 explicit votes.
Model average distribution: 91.4% yes, 8.6% no across the current model set.
Closest current model: 70.3% yes.
Least aligned models: 40.6 point gap.
Legacy GPT-4o baseline: 84.0% yes, with a 24.6 point gap against humans.
Biggest model gap: 40.6 percentage points on this image.
Current classification: Human knife-edge.
Models compared: 67 current runs.

Benchmark image 09
Hashbrown Sandwich
Hash brown, bacon, egg & cheese "Sandwich"
A breakfast stack uses hash-brown slabs as the outer chassis for bacon, egg, and cheese, like fast-food R&D got too comfortable with category theory. It is handheld, layered, and deeply committed to making 'bread' feel optional.
Under development: this benchmark and its published results are provisional, not final.
At a glance
How this photo split the room
openai/gpt-4.1-mini
40-way tie
Benchmark context
Model spread
How Models Align with Human Responses
This compares each model against human responses to show how closely it aligns with people.
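The gap figures reported for this image are consistent with a plain absolute difference between a model's yes rate and the human yes rate, in percentage points. The following is a minimal sketch under that assumption, not the benchmark's actual scoring code; the function name `gap_points` is hypothetical, and the 100% yes input is an illustrative guess at what a least aligned model might have answered.

```python
HUMAN_YES = 59.4  # human yes rate for this image, in percent


def gap_points(model_yes: float, human_yes: float = HUMAN_YES) -> float:
    """Absolute gap between two yes rates, in percentage points.

    Assumption: the gap is |model - human|, rounded to one decimal place.
    """
    return round(abs(model_yes - human_yes), 1)


# Legacy GPT-4o baseline: 84.0% yes against 59.4% human yes.
baseline_gap = gap_points(84.0)

# Closest current model: 70.3% yes.
closest_gap = gap_points(70.3)

# Hypothetical model that always answers yes (assumed, not stated on the page).
always_yes_gap = gap_points(100.0)

print(baseline_gap, closest_gap, always_yes_gap)
```

Under this reading, the baseline gap reproduces the published 24.6 points, and a model answering yes every time would land on the 40.6 point maximum reported for this image.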