
LLMs’ “simulated reasoning” abilities are a “brittle mirage,” researchers find (arstechnica.com)

submitted 4 days ago by Catoblepas@piefed.blahaj.zone to c/technology@beehaw.org


Using supervised fine-tuning (SFT) to introduce even a small amount of relevant data to the training set can often lead to strong improvements in this kind of "out of domain" model performance. But the researchers say that this kind of "patch" for various logical tasks "should not be mistaken for achieving true generalization. ... Relying on SFT to fix every [out of domain] failure is an unsustainable and reactive strategy that fails to address the core issue: the model’s lack of abstract reasoning capability."

Rather than showing the capability for generalized logical inference, these chain-of-thought models are "a sophisticated form of structured pattern matching" that "degrades significantly" when pushed even slightly outside of its training distribution, the researchers write. Further, the ability of these models to generate "fluent nonsense" creates "a false aura of dependability" that does not stand up to a careful audit.

As such, the researchers warn heavily against "equating [chain-of-thought]-style output with human thinking" especially in "high-stakes domains like medicine, finance, or legal analysis." Current tests and benchmarks should prioritize tasks that fall outside of any training set to probe for these kinds of errors, while future models will need to move beyond "surface-level pattern recognition to exhibit deeper inferential competence," they write.


[-] jarfil@beehaw.org 8 points 4 days ago* (last edited 4 days ago)

chain-of-thought models

There are no "CoT LLMs"; a CoT means externally iterating an LLM. The strength of CoT resides in its ability to pull up external resources at each iteration, not in dogfooding the LLM with its own outputs.

"Researchers" didn't "find this out" just now; it was known from day one.

As for who needs to hear it... well, apparently people unable to tell an LLM apart from an AI.
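
To make "externally iterating an LLM" concrete, here is a minimal sketch of that kind of loop. Everything in it is an assumption for illustration: `toy_llm` is a hard-coded stand-in for a real model call, `lookup` is a hypothetical external resource, and the `LOOKUP:` / `RESULT:` / `ANSWER:` markers are an invented protocol, not any real framework's format.

```python
from typing import Callable


def toy_llm(context: str) -> str:
    """Hypothetical stand-in for a model call: requests one lookup, then answers."""
    if "RESULT:" not in context:
        return "LOOKUP: capital of France"
    return "ANSWER: Paris"


def lookup(query: str) -> str:
    """Hypothetical external resource (search index, database, calculator, ...)."""
    facts = {"capital of France": "Paris"}
    return facts.get(query, "unknown")


def external_cot(
    question: str,
    llm: Callable[[str], str] = toy_llm,
    tool: Callable[[str], str] = lookup,
    max_steps: int = 5,
) -> str:
    """Call the model repeatedly from outside, adding tool results between calls."""
    context = f"QUESTION: {question}"
    for _ in range(max_steps):
        step = llm(context)
        if step.startswith("ANSWER:"):
            return step[len("ANSWER:"):].strip()
        if step.startswith("LOOKUP:"):
            # The outer loop, not the model, fetches the external information.
            query = step[len("LOOKUP:"):].strip()
            context += f"\n{step}\nRESULT: {tool(query)}"
        else:
            # Plain intermediate step: just feed it back in.
            context += f"\n{step}"
    return "no answer within step budget"


print(external_cot("What is the capital of France?"))
```

The only point of the sketch is where the new information comes from: the `RESULT:` lines are produced outside the model and added to its context at each iteration.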

[-] CanadaPlus@lemmy.sdf.org 5 points 4 days ago

Yes, but it supports the jerk that everything called or associated with AI is bad, so it makes a popular Beehaw post.

[-] RoadTrain@lemdro.id 2 points 3 days ago

a CoT means externally iterating an LLM

Not necessarily. Yes, a chain of thought can be provided externally, for example through user prompting or another source, which can even be another LLM. One of the key observations behind the models commonly referred to as reasoning models is this: since an external LLM can be used to provide "thoughts", could an LLM provide those steps itself, without depending on external sources?

To do this, it generates "thoughts" around the user's prompt, essentially exploring the space around it and trying different options. These generated steps are added to the context window and are usually much larger than the prompt itself, which is why these models are sometimes referred to as long chain-of-thought models. Some frontends will show a summary of the long CoT, although this is normally not the raw context itself, but rather a version that is summarised and re-formatted.
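
A rough sketch of that self-generated loop, under loud assumptions: `toy_model` is a hard-coded stand-in for a real model, the `THOUGHT:` / `FINAL:` markers are invented for illustration, and the "summary" is just a step count rather than whatever a real frontend does.

```python
from typing import Callable, List, NamedTuple, Optional


class CotResult(NamedTuple):
    answer: Optional[str]  # final answer, or None if the step budget ran out
    raw_context: str       # prompt plus every generated step
    summary: str           # what a frontend might show instead of the raw context


def toy_model(context: str) -> str:
    """Hypothetical model: emits three 'thoughts', then a final answer."""
    seen = context.count("THOUGHT:")
    if seen < 3:
        return f"THOUGHT: trying option {seen + 1}"
    return "FINAL: option 3"


def long_cot(
    prompt: str,
    model: Callable[[str], str] = toy_model,
    max_steps: int = 10,
) -> CotResult:
    """The model's own outputs are appended to the context it sees next."""
    context = prompt
    thoughts: List[str] = []
    answer: Optional[str] = None
    for _ in range(max_steps):
        out = model(context)
        context += "\n" + out  # generated steps go into the context window
        if out.startswith("FINAL:"):
            answer = out[len("FINAL:"):].strip()
            break
        thoughts.append(out)
    summary = f"Considered {len(thoughts)} intermediate steps."
    return CotResult(answer, context, summary)


result = long_cot("Pick an option")
print(result.summary)
print(len(result.raw_context), "characters of context for a",
      len("Pick an option"), "character prompt")
```

Unlike an externally driven loop, nothing here comes from outside the model: the context grows only with the model's own generated steps, which is why it ends up much longer than the prompt.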

[-] interdimensionalmeme@lemmy.ml 1 points 4 days ago

I think of chain of thought as a self-prompting model.
I suspect that in the future, chain-of-thought models will run
a smaller tuned/dedicated chain-of-thought submodel just for the chain-of-thought tokens.

The point of this is that most users aren't very good at
prompting; they just don't have the feel for it.

Personally I get worse results, way less of what I wanted,
when CoT is enabled. I'm very annoyed that the
"chatgpt classic" model selector now just decides to use CoT
whenever it wants. I should be the one to decide that,
and I want it off almost all of the time!!

this post was submitted on 11 Aug 2025

136 points (99.3% liked)
