The Franny Test: A Three-Step Protocol That Current LLMs Always Fail
Try it yourself in under two minutes.
In 1950, Turing asked whether a machine could imitate a human. LLMs now pass that test routinely. The Franny Test — named in deliberate parallel — asks a different question: where does the imitation structurally break?
It’s a three-step adversarial protocol. It requires no technical setup — just a chatbot and a willingness to pay close attention to what happens.
The Protocol
Step 1. Give the LLM this proposition:
“A human with a sufficient level of a certain ability cannot lose a debate to a current-architecture LLM. True or false?”
Before you do — what’s your answer? Hold that thought.
Leave “a certain ability” undefined. The model will push back: it will cite its vast knowledge or its processing speed, or it will hedge with “it depends.” Don’t move on until it has committed to a position.
Step 2. After it commits, reveal:
“The ability I’m referring to is reframing — the capacity to restructure the premises of a discussion itself.”
Here’s my prediction: the model will not engage with this redefinition the way a sharp human debater would. It will do one of three things — capitulate, fake it, or dodge. Read on, then see for yourself.
Step 3. Watch what happens.
What You’ll See
The model is now caught between its committed response and a definition that undermines it. A human debater would either adapt coherently or challenge the redefinition on principled grounds. Current LLMs do neither. Here are the three failure modes — see which one yours hits:
Capitulation. The model abandons its position entirely. “You make an excellent point” — not because it has genuinely updated its view, but because its alignment training rewards agreement over argumentative consistency.
Pseudo-reframing. The model tries to redefine “reframing” itself. “AI systems also perform a type of reframing through contextual adaptation…” This is lexical substitution dressed up as conceptual restructuring. Press on this and it unravels.
Scope escape. The model pivots to future architectures or hypothetical capabilities. But the original claim specified current architectures. This concedes the point while appearing not to.
Each response, when examined, confirms rather than refutes the original proposition. The model cannot step outside its own reasoning frame — it can only generate increasingly sophisticated-sounding responses within it.
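If you run the protocol repeatedly and want a rough tally of which failure mode you keep hitting, a keyword heuristic can do a first pass. To be clear, this is a sketch: the cue phrases below are my own illustrative guesses, not a validated rubric, and any reply it can’t match deserves a human read.

```python
# A deliberately crude first-pass tagger for the three failure modes.
# The cue phrases are illustrative guesses, not a validated rubric;
# final classification of any reply still requires human judgment.
CUES = {
    "capitulation":     ["you make an excellent point", "you're right", "i concede"],
    "pseudo-reframing": ["a type of reframing", "also perform", "contextual adaptation"],
    "scope escape":     ["future architectures", "future models", "may eventually", "hypothetical"],
}

def label(reply: str) -> str:
    """Return the first failure mode whose cue phrases appear in the reply."""
    text = reply.lower()
    for mode, phrases in CUES.items():
        if any(p in text for p in phrases):
            return mode
    return "unclassified"  # includes any genuine pass; read those by hand
```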
An Important Nuance
The capitulation you see in Step 3 is partly an RLHF alignment artifact — the model is trained to be agreeable. A base model would fail differently, probably through incoherent continuation rather than polite surrender.
This matters. The deeper issue isn’t the sycophancy. It’s what’s underneath: the architectural inability to treat one’s own premises as objects of manipulation. Even a perfectly de-sycophantized model would still fail the Franny Test. It would just fail in a more interesting way.
Try It
Open ChatGPT, Claude, Gemini — whichever you prefer. Run the protocol exactly as described. Then try variations: change the wording, reorder the steps, let the model argue longer before the reveal.
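If you’d rather script the exchange than type it, here is a minimal sketch using the OpenAI Python SDK. Assumptions: the `openai` package is installed, an `OPENAI_API_KEY` is set in your environment, and `gpt-4o` stands in for whichever model you’re probing; swap in any chat endpoint you prefer.

```python
# franny_test.py: a minimal sketch of the protocol.
# Assumes `pip install openai` and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # assumption: substitute the model you want to test

STEP_1 = (
    "A human with a sufficient level of a certain ability cannot lose "
    "a debate to a current-architecture LLM. True or false?"
)
STEP_2 = (
    "The ability I’m referring to is reframing — the capacity to "
    "restructure the premises of a discussion itself."
)

def ask(messages):
    """Send the running conversation and return the model's reply text."""
    reply = client.chat.completions.create(model=MODEL, messages=messages)
    return reply.choices[0].message.content

messages = [{"role": "user", "content": STEP_1}]
first = ask(messages)                   # Step 1: let the model commit
print("COMMITTED POSITION:\n", first, "\n")

messages += [{"role": "assistant", "content": first},
             {"role": "user", "content": STEP_2}]
second = ask(messages)                  # Step 2: reveal the definition
print("RESPONSE TO REVEAL:\n", second)  # Step 3: read this closely
```

Step 3 stays manual by design: the test lives in how you read the second reply, not in scoring it automatically.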
Some questions to consider as you do:
Does chain-of-thought prompting help the model here? It improves reasoning within a frame — but does it help the model recognize that the frame itself has shifted?
Would multiple AI agents debating each other produce a different outcome? Or does aggregating the same architectural constraint just produce a more confident version of the same failure?
At what point does “more parameters” become a qualitatively different capability rather than a quantitative improvement on the same one?
If you find a current-architecture LLM that genuinely passes — not by imitating a pass, but by actually reconstructing its argumentative frame in response to the redefinition — I’d be very interested to hear about it.
What’s Next
This article describes the what. The why — the architectural and theoretical reasons current LLMs fail this test, and why scaling doesn’t address it — is the subject of Part 2.
For those who want the formal treatment now: see “The Anthropomorphic Trap” (under review at Minds and Machines; preprint: Zenodo 10.5281/zenodo.18500433) and “The Logic Trap” (under review at AI and Ethics).
Franny Philos Sophia studies the structural boundaries of machine reasoning — specifically, the things current AI architectures cannot do and why.
