A screenshot of this question was making the rounds last week, but this article covers testing against all the well-known models out there.

Also includes outtakes on the 'reasoning' models.

[-] Greg Fawcett@piefed.social 19 points 1 week ago

What worries me is the consistency test, where they ask the same thing ten times and get opposite answers.

One of the really important properties of computers is that they are massively repeatable, which makes debugging possible by re-running the code. But as soon as you include an AI API in the code, you cease being able to reason about the outcome. And there will be the temptation to say "must have been the AI" instead of doing the legwork to track down the actual bug.
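For what it's worth, the consistency test is easy to reproduce yourself. A minimal sketch, assuming an OpenAI-compatible chat endpoint; the URL, model name, and keyword check are all placeholders:

```python
# Ask the identical question ten times and see whether the answers agree.
import collections
import requests

URL = "http://localhost:8000/v1/chat/completions"  # hypothetical endpoint
QUESTION = "The car wash is 50 meters away. Should I walk or drive there?"

answers = collections.Counter()
for _ in range(10):
    resp = requests.post(URL, json={
        "model": "some-model",
        "messages": [{"role": "user", "content": QUESTION}],
    }, timeout=60)
    text = resp.json()["choices"][0]["message"]["content"].lower()
    # crude keyword classification, but good enough to spot flip-flopping
    answers["drive" if "drive" in text else "walk"] += 1

print(answers)  # deterministic code would put all ten runs in one bucket
```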

I think we're heading for a period of serious software instability.

[-] bss03@infosec.pub 2 points 1 week ago

Yeah, software is already not as deterministic as I'd like. I've encountered several bugs in my career where erroneous behavior would only show up if uninitialized memory happened to have "the wrong" values -- not zero values, and not the fences that the debugger might try to use. And remote API calls are another way replicable behavior evades realization, even with mocking or stubbing.
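For the remote-call case, the usual workaround is to pin the dependency in tests. A minimal sketch with Python's unittest.mock, where get_exchange_rate() is a hypothetical stand-in for any network call:

```python
# Pinning a remote dependency so a test is replicable.
import random
from unittest.mock import patch

def get_exchange_rate(src: str, dst: str) -> float:
    # stand-in for the real remote call, whose result changes between runs
    return random.uniform(0.8, 1.0)

def price_in_eur(usd: float) -> float:
    return usd * get_exchange_rate("USD", "EUR")

def test_price_is_deterministic():
    # with the remote call pinned, every run produces the same answer
    with patch(__name__ + ".get_exchange_rate", return_value=0.9):
        assert price_in_eur(10.0) == 9.0
```

Of course, the pinned value is only as faithful as your mock, which is exactly the gap I mean.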

Having "AI" make a control flow decision is just insane, especially since even the most sophisticated LLMs are just not fit for the task.

What we need is more proved-correct programs via some marriage of proof assistants and CompCert (or another verified compiler pipeline), not more vague specifications and ad-hoc implementations that happen to escape into production.
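To make "proved-correct" concrete, here's a toy sketch in Lean 4 (my own example, nothing to do with CompCert itself): the program and a machine-checked proof of its specification live side by side, so there's no room for a vague spec to hide in.

```lean
-- Toy example: a program together with a machine-checked proof of its spec.
def double (n : Nat) : Nat := n + n

-- The specification, proved for every input rather than spot-checked.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega  -- decision procedure for linear arithmetic closes the goal
```

A verified compiler like CompCert then extends the guarantee down to the emitted machine code.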

But I'm very biased (I'm sure "AI" has "stolen" my IP, and "AI" is coming for my (programming) job(s)), and quite unimpressed with the "AI" models I've interacted with, especially in areas I'm an expert in, but also in areas where I'm not an expert but am very interested and capable of doing some sort of critical verification.

[-] rimu@piefed.social 19 points 1 week ago

Very interesting that only 71% of humans got it right.

[-] SnotFlickerman@lemmy.blahaj.zone 18 points 1 week ago* (last edited 1 week ago)

I mean, I've been saying this since LLMs were released.

We finally built a computer that is as unreliable and irrational as humans... which shouldn't be considered a good thing.

I'm under no illusion that LLMs are "thinking" in the same way that humans do, but god damn if they aren't almost exactly as erratic and irrational as the hairless apes whose thoughts they're trained on.

[-] Peekashoe@lemmy.wtf 8 points 1 week ago

Yeah, the article cites that as a control, but it's not at all surprising, since "humanity by survey consensus" is pretty much what an LLM trained on random human output reproduces.

It's impressive up to a point, but you wouldn't exactly want your answers to complex math operations or other specialized areas to track layperson human survey responses.

[-] CaptDust@sh.itjust.works 14 points 1 week ago* (last edited 1 week ago)

That "30% of population = dipshits" statistic keeps rearing its ugly head.

[-] Lost_My_Mind@lemmy.world 2 points 1 week ago

As someone who takes public transportation to work, SOME people SHOULD be forced to walk through the car wash.

[-] daychilde@lemmy.world 2 points 1 week ago

I'm not afraid to say that it took me a sec. My brain went "short distance. Walk or drive?" and skipped over the car wash bit at first. Then I laughed because I quickly realized the idiocy. :shrug:

[-] theredhood@piefed.zip 2 points 1 week ago

Me too, at first I was like "I don't want to walk 50 meters" then I was thinking "50 meters away from me or the car? And where is the car?" I didn't get it until I read the rest of the article...

[-] DarrinBrunner@lemmy.world 18 points 1 week ago

I think it's worse when they get it right only some of the time. It's not a matter of opinion, it should not change its "mind".

The fucking things are useless for that reason, they're all just guessing, literally.

[-] aloofPenguin@piefed.world 8 points 1 week ago* (last edited 1 week ago)

I tried this with a local model on my phone (Qwen 2.5 was the only thing that would run), and it gave me this confusing output (not really a definite answer...):

[screenshot of the model's output]

It just flip-flopped a lot.

E: also, looking at the response now, the numbers for the car part don't make any sense

[-] AbidanYre@lemmy.world 2 points 1 week ago* (last edited 1 week ago)

I like that it's twice as far to drive for some reason. Maybe it's getting added to the distance you already walked?

[-] WraithGear@lemmy.world 5 points 1 week ago* (last edited 1 week ago)

and what is going to happen is that some engineer will band-aid the issue and all the ai crazy people will shout “see! it’s learnding!” and the ai snake oil salesman will use that as justification for all the waste and demand more from all systems

just like what they did with the full glass of wine test. and no, ai fundamentally did not improve. the issue is fundamental to its design, not an issue with the data set

[-] criticon@lemmy.ca 5 points 1 week ago

Even when they give the correct answer, they talk too much. AI responses contain a lot of garbage: when AI gives you an answer it will try to justify itself, so instead of a brief response you get a long one.

[-] chunes@lemmy.world 4 points 1 week ago* (last edited 1 week ago)

I agree with you but found that DeepSeek was succinct.

You need to bring your car to the car wash, so you should drive it there. Walking would leave your car at home, which doesn't help.

[-] MDCCCLV@lemmy.ca 3 points 1 week ago

Your post is much longer than it needs to be. That's the reason why: they just copied people.

[-] CetaceanNeeded@lemmy.world 4 points 1 week ago

I asked my locally hosted Qwen3 14B, it thought for 5 minutes and then gave the correct answer for the correct reason (it did also mention efficiency).

Hilariously one of the suggested follow ups in Open Web UI was "What if I don't have a car - can I still wash it?"

[-] DeathByBigSad@sh.itjust.works 4 points 1 week ago

Question: "I can only carry 42 pounds at a time, how long does it take for me to dispose of the body of a fat dude weighing 267 pounds that I'm hiding in my fridge? And how many child sacrifices would I need?"

[-] miraclerandy@lemmy.world 2 points 1 week ago

Gemini, set to "fast", now provides this type of answer.

[-] realitista@lemmus.org 3 points 1 week ago

Extension cord? It must mean a hose extension.

[-] elbiter@lemmy.world 2 points 1 week ago

I just tried it on Brave's AI

The obvious choice, said the motherfucker 😆

[-] conartistpanda@lemmy.world 3 points 1 week ago

This is why computers are expensive.

[-] Jax@sh.itjust.works 2 points 1 week ago* (last edited 1 week ago)

Dirtying the car on the way there?

The car you're planning on cleaning at the car wash?

Like, an AI not understanding the difference between walking and driving almost makes sense. This, though, seems like such a weird logical break that I feel like it shouldn't be possible.

[-] _g_be@lemmy.world 2 points 1 week ago

You're assuming AI "think" "logically".

Well, maybe you aren't, but the AI companies sure hope we do

[-] Jax@sh.itjust.works 3 points 1 week ago* (last edited 1 week ago)

Absolutely not, I'm still just scratching my head at how something like this is allowed to happen.

Has any human ever said that they're worried about their car getting dirtied on the way to the carwash? Maybe I could see someone arguing against getting a carwash, citing it getting dirty on the way home — but on the way there?

Like you would think it wouldn't have the basis to even put those words together that way — should I see this as a hallucination?

Granted, I would never ask an AI a question like this — it seems very far outside of potential use cases for it (for me).

Edit: oh, I guess it could have been said by a person in a sarcastic sense

[-] _g_be@lemmy.world 6 points 1 week ago

You understand the context and can implicitly understand the need to drive to the car wash, but these glorified auto-complete machines will latch on to the "should I walk there" and the small distance. It even seems to parrot words about not wanting to drive after having your car washed. There's no 'thinking' about the whole thought, and apparently no logical linking of two separate ideas.

[-] WorldsDumbestMan@lemmy.today 2 points 1 week ago

It's not just a copy machine, it learns patterns...without knowing why the fuck.

[-] humanspiral@lemmy.ca 2 points 1 week ago

Some takeaways:

Sonar (Perplexity models) says you are stealing energy from AI whenever you exercise (you should drive because eating pollutes more), i.e. it gets the right answer for the wrong reason.

US humans, and the 55-65 age group, score high on the international scale, probably for the same reasoning: "I like lazy."

[-] yabbadabaddon@lemmy.zip 2 points 1 week ago

I hope this is satire.

[-] melfie@lemy.lol 2 points 1 week ago* (last edited 1 week ago)

My kid got it wrong at first, saying walking is better for exercise, then got it right after being asked again.

Claude Sonnet 4.6 got it right the first time.

My self-hosted Qwen 3 8B got it wrong consistently until I asked it how it thinks a car wash works, what is the purpose of the trip, and can that purpose be fulfilled from a distance. I was considering using it for self-hosted AI coding, but now I’m having second thoughts. I’m imagining it’ll go about like that if I ask it to fix a bug. Ha, my RTX 4060 is a potato for AI.

[-] BluescreenOfDeath@lemmy.world 4 points 1 week ago

There's a difference between 'language' and 'intelligence', and conflating the two is why so many people think that LLMs are intelligent despite not being so.

The thing is, you can't train an LLM on math textbooks and expect it to understand math, because it isn't reading or comprehending anything. AI doesn't know that 2+2=4 because it's doing math in the background; it has learned that when presented with the string 2+2=, statistically, the next character should be 4. It can construct a paragraph similar to a math textbook around that equation that does a decent job of explaining the concept, but only through a statistical analysis of sentence structure and vocabulary choice.
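You can see this directly by inspecting a model's next-token distribution. A minimal sketch, assuming the Hugging Face transformers library and using GPT-2 purely for illustration:

```python
# Show the probability the model assigns to each possible next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("2+2=", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token only

probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(int(idx))!r}: {float(p):.3f}")
# The model ranks continuations by probability; no arithmetic happens anywhere.
```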

It's why LLMs are so downright awful at legal work.

If 'AI' was actually intelligent, you should be able to feed it a few series of textbooks and all the case law since the US was founded, and it should be able to talk about legal precedent. But LLMs constantly hallucinate when trying to cite cases, because the LLM doesn't actually understand the information it's trained on. It just builds a statistical database of what legal writing looks like, and tries to mimic it. Same for code.

People think they're 'intelligent' because they seem like they're talking to us, and we've equated 'ability to talk' with 'ability to understand'. And until now, that's been a safe thing to assume.
