Even GPT-5.2 Can't Count to Five: Zero-Error Horizons in Trustworthy LLMs
by daigoba66 on 4/2/2026, 3:35:31 PM
https://arxiv.org/abs/2601.15714
Comments
by: grey-area
To those saying this is not surprising: yes, it will be surprising to the general public, who are being served ads from huge companies like MS or OpenAI saying LLMs can help with their accounting, close deals for them by crunching the numbers in seconds, write complex code for them, etc.

This is important information for anyone who thinks these systems are thinking, reasoning, and learning, or that they're having a conversation with them, i.e. 90% of LLM users.
4/2/2026, 6:07:47 PM
by: hu3
> we found that GPT-5.2 cannot even compute the parity of a short string like 11000, and GPT-5.2 cannot determine whether the parentheses in ((((()))))) are balanced.

I think there is a valid insight here, which many already know: LLMs are much more reliable at creating scripts and automation for certain tasks than at doing those tasks themselves.

For example, if I provide an LLM my database schema and tell it to scan for redundant indexes and point out wrong naming conventions, it might do a passable but incomplete job.

But if I tell the LLM to write a Python or Node.js script to do the same, I get significantly better results. And it's often faster, too, to generate and run the script than to let the LLM process large SQL files.
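Both tasks quoted from the paper reduce to a few lines of deterministic code, which is exactly the kind of script an LLM is good at emitting. A minimal Python sketch (illustrative only, not taken from the paper):

```python
def parity(bits: str) -> int:
    """Return 0 if the string contains an even number of 1s, else 1."""
    return bits.count("1") % 2

def balanced(s: str) -> bool:
    """True iff the parentheses in s are balanced."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" closed before any "(" opened
                return False
    return depth == 0

print(parity("11000"))         # two 1s -> 0 (even)
print(balanced("((((())))))")) # one ")" too many -> False
```

Either function runs in microseconds and never hallucinates, which is the asymmetry the comment is pointing at.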
4/2/2026, 6:24:14 PM
by: emp17344
There’s a certain type of user here who reacts with rage when anyone points out flaws with LLMs. Why is that?
4/2/2026, 6:13:26 PM
by: pants2
Doesn't this just look like another case of "count the r's in strawberry", i.e. not understanding how tokenization works?

This is well known and not that interesting to me: ask the model to use Python to solve any of these questions and it will get it right every time.
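The interpreter route works because Python operates on characters while the model operates on multi-character tokens. The canonical strawberry question, for illustration:

```python
# A tokenizer groups letters into opaque tokens, so the model never
# "sees" individual characters; the interpreter does.
word = "strawberry"
print(word.count("r"))   # character-level count -> 3
print(list(word))        # the per-character view the model lacks
```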
4/2/2026, 5:55:42 PM
by: staticshock
LLMs seem to me closer to Kahneman's System 1 than to System 2. Understood this way, it is obvious why LLMs are bad at counting the r's in "strawberry". But it also makes ZEH (the paper's zero-error horizon) feel like it couldn't possibly be a useful metric, because it's a System 2 evaluation applied to a System 1 system.
4/2/2026, 5:33:01 PM
by: BugsJustFindMe
People are going to misinterpret this and overgeneralize the claim. It does not say that AI isn't reliable at anything; it provides a method for quantifying reliability on specific tasks.

You wouldn't say that a human who doesn't know how to read is unreliable at everything, just at reading.

Counting is something even humans need to learn. Toddlers don't understand quantity either: if a 2-year-old can count even to 10, it's through memorization, not understanding, and it takes about two more years of learning before they comprehend things like numerical correspondence. But they still know how to do other things that aren't counting before then.
4/2/2026, 5:39:16 PM
by: kenjackson
Whenever I see these papers and try the examples, they always work. This paper is two months old, which in LLM years is like 10 years of progress.

It would be interesting to actively track how far along each successive model gets...
4/2/2026, 5:41:20 PM
by: simianwords
Can someone produce a single example under 20 characters that fails with the latest thinking model? I can't seem to reproduce this.
4/2/2026, 6:47:35 PM
by: dwa3592
Nice! Although I tried the parenthesis-balance question with Gemini and it gave the right answer on the first attempt.
4/2/2026, 6:29:03 PM
by: burningion
Ran this through Qwen3.5-397B-A17B, and the difference between 4 characters and 5 is wild to see:

> are the following parenthesis balanced? ((())))

> No, the parentheses are not balanced.

> Here is the breakdown:

>     Opening parentheses (: 3
>     Closing parentheses ): 4

... following up with:

> what about these? ((((())))

> Yes, the parentheses are balanced.

> Here is the breakdown:

>     Opening parentheses (: 5
>     Closing parentheses ): 5

... and it uses ~5,000 tokens to get the wrong answer.
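The second answer is wrong twice over: ((((()))) has five opening parentheses but only four closing ones, so even the model's own count-based "breakdown" is off. A quick Counter check (illustrative only) redoes the breakdown it attempted:

```python
from collections import Counter

# Recount both strings from the exchange above.
for s in ["((())))", "((((())))"]:
    c = Counter(s)
    print(f"{s}: {c['(']} opening, {c[')']} closing, "
          f"counts match: {c['('] == c[')']}")
```

(Equal counts are necessary but not sufficient for balance; here neither string even clears that bar.)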
4/2/2026, 5:37:54 PM
by: cineticdaffodil
Another strange thing is that they just don't know the endings of popular stories, like planets that get blown up, etc. They just don't have that material.
4/2/2026, 7:09:32 PM
by: parliament32
> This is surprising given the excellent capabilities of GPT-5.2

The real surprise is that someone writing a paper on LLMs doesn't understand the baseline capabilities of a hallucinatory text generator (with tool use disabled).
4/2/2026, 5:33:29 PM
by: justinator
One! Two! Five!
4/2/2026, 5:36:11 PM
by: bigstrat2003
Let us be very clear: there is no such thing as a trustworthy LLM. Time and again they have shown that they understand nothing. They can be useful in the right context, but you can't trust them *at all*.
4/2/2026, 5:41:56 PM
by: throwuxiytayq
> This is surprising given the excellent capabilities of GPT-5.2.

Is this seriously surprising to anyone who knows *the absolute minimum* about how LLMs parse and understand text?
4/2/2026, 5:25:29 PM
by: charcircuit
Why didn't OpenAI fine-tune the model to use the Python tool it has for these tasks?
4/2/2026, 5:13:24 PM
by: itsmyro
bruh
4/2/2026, 6:20:41 PM
by: jeremie_strand
[dead]
4/2/2026, 6:10:30 PM
by: simianwords
There's no way this is right. I checked complicated ones with the latest thinking model. Can someone come up with a counterexample?

Edit: here's what I tried: https://chatgpt.com/share/69cebb52-56a8-838f-969c-c47308262afa
4/2/2026, 6:10:56 PM