The Cognitive Test That Just Broke GPT-5 Has Been Breaking Humans At Parties Since 2015
For the better part of a decade, a small Australian card game has been doing — on pub tables and kitchen islands — what cutting-edge AI labs only recently realised was hard.
In June 2026, PNAS Nexus published a paper by Patel and colleagues testing GPT-5, Claude Opus 4.1 and Gemini 2.5 on the Stroop test, the 91-year-old cognitive psychology experiment in which subjects must name the colour of a word rather than read the word itself. The results were brutal. GPT-4o managed 91 percent accuracy on a list of five words, slid to 57 percent at ten, and collapsed to 15 percent by forty. Claude 3.5 Sonnet held steady through twenty items before crashing to 24 percent. On mixed lists — where the rule changes between cards — every model tested fell to near-zero accuracy.
Bela Inkster, a graphic designer from Perth, Australia, stumbled on the same effect in 2014 while watching Stephen Fry and Brian Blessed attempt a Stroop test with profanity cards on the documentary Fry's Planet Word. Both men failed. Neither could stop laughing. Inkster had never seen that kind of laughter before. He went looking for the game, discovered it didn't exist, and made it himself.
The result was F**k. The Game, a 60-card adult party game whose entire mechanic is the Stroop Effect, a phenomenon first documented in 1935 and now backed by more than 4,000 peer-reviewed papers. The rules are deceptively simple: black text means name the background colour; coloured text means name the colour of the ink; a swear word means read the swear word aloud; and the card bearing the F-word of the title, which you must never say aloud, forces you back to the colour rules under maximum pressure.
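For the programmatically minded, the four rules can be sketched as a small Python function. This is an illustration only, not the publisher's rulebook: the `Card` fields, the `is_swear` flag, the placeholder swear word and the assumption that the title card follows the same black-versus-coloured ink rule as an ordinary card are all invented for the example.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the title word that must never be said aloud.
FORBIDDEN_WORD = "f**k"


@dataclass(frozen=True)
class Card:
    word: str               # the text printed on the card
    ink: str                # the colour the text is printed in
    background: str         # the colour of the card behind the text
    is_swear: bool = False  # assumed flag: the printed word is a swear word


def colour_rule(card: Card) -> str:
    """Black ink: name the background. Coloured ink: name the ink."""
    return card.background if card.ink == "black" else card.ink


def correct_answer(card: Card) -> str:
    """Return what a player should say for a card under the rules as described."""
    # The title card is checked first: it looks like a swear card,
    # but the word is off limits, so the colour rule applies instead.
    if card.word.lower() == FORBIDDEN_WORD:
        return colour_rule(card)
    # Any other swear word is simply read aloud.
    if card.is_swear:
        return card.word
    # Everything else is a colour card.
    return colour_rule(card)


if __name__ == "__main__":
    hand = [
        Card(word="red", ink="black", background="blue"),
        Card(word="red", ink="green", background="white"),
        Card(word="damn", ink="green", background="white", is_swear=True),
        Card(word="f**k", ink="black", background="yellow", is_swear=True),
    ]
    for card in hand:
        print(card, "->", correct_answer(card))
```

In this sketch the order of the checks does the work. The title card has to be caught before the general swear-word rule, which is the same override a player has to perform in their head while the table watches.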
The irony is difficult to miss. Generative AI fails the Stroop test silently, on a server, while a venture capitalist watches a loss-leading dashboard. Humans fail it at a pub table, with their mates watching their face freeze mid-syllable, while someone spills a drink laughing. One of those failures funds a $22.95 card game with 4,021 reviews on Amazon. The other costs billions.
The science is the same in both cases. Reading is automatic; colour-naming is effortful. Your dorsolateral prefrontal cortex, the "apply-the-rule" centre, has to override the automatic urge to read, while the anterior cingulate cortex, the brain's conflict detector, flags that something feels wrong. The clash produces a hesitation that, in a social setting, registers on the face as the universally recognised expression of "my brain has stopped working."
F**k. The Game was funded on Kickstarter by more than 500 backers in May 2015, has been translated into French, Spanish and Russian, has been featured by Smosh (14 million-plus subscribers), BuzzFeed and The Chive, and now sits in the Top 3 of Amazon's Party Games charts in the UK and Australia. It is the only commercially available party game whose mechanism is published, replicated and indexed in academic databases, a product simultaneously fun enough for a pub and legitimate enough for a psychology classroom.
The deeper point may be that the Stroop test is now a frontier benchmark for two very different kinds of intelligence. One of them costs $500 billion in compute. The other costs $22.95 and comes with a refund guarantee if you don't laugh.