[-] frezik@lemmy.blahaj.zone 0 points 1 month ago

To those who have played around with LLM code generation more than me, how are they at debugging?

I'm thinking of Kernighan's Law: "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." If vibe coding reduces the complexity of writing code by 10x, but debugging remains just as difficult as before, then Kernighan's Law needs to be updated to say debugging is 20x as hard as vibe coding. Vibe coders have no hope of bridging that gap.

[-] very_well_lost@lemmy.world 2 points 1 month ago* (last edited 1 month ago)

The company I work for has recently mandated that we must start using AI tools in our workflow and is tracking our usage, so I've been experimenting with it a lot lately.

In my experience, it's worse than useless when it comes to debugging code. The class of errors that it can solve is generally simple stuff like typos and syntax errors — the sort of thing that a human would solve in 30 seconds by looking at a stack trace. The much more important class of problems, errors in the business logic, it really, really sucks at solving.

For those problems, it very confidently identifies the wrong answer about 95% of the time. And if you're a dev who's desperate enough to ask AI for help debugging something, you probably don't know what's wrong either, so it won't be immediately clear whether the AI just gave you garbage or whether its suggestion has any real merit. So you go check and manually confirm that the LLM is full of shit, which costs you time... then you go back to the LLM with more context and ask it to try again. Its second suggestion will sound even more confident than the first ("Aha! I see the real cause of the issue now!"), but it will still be nonsense. You waste more time ruling out the second suggestion, then go back to the AI to scold it for being wrong again.

Rinse and repeat this cycle enough times until your manager is happy you've hit the desired usage metrics, then go open your debugging tool of choice and do the actual work.

[-] hietsu@sopuli.xyz 1 points 1 month ago* (last edited 1 month ago)

I have next to zero experience with coding (unless you count a few months of Borland Delphi work back in the ’00s, which you shouldn’t). Yet, I’ve managed to create half a dozen really useful tools for my work, and a few more for my hobbies too.

The inflection point for me was Gemini 2.5 Pro. Before that I was only successful with smaller scripts, mostly using ChatGPT. But with Gemini I was able to do Deep Research as the initial step to plan out the overall architecture, interfaces, technologies etc., and fine-tune the actual coding prompt using that info.

The crucial step after the first generated (buggy) version is to copy-paste the code and errors into ChatGPT and/or Grok to get their take on it, then feed those ideas back to Gemini again. Some 5-10 iterations of this and I usually have a fully functional application or a component of a bigger piece of software. Problems at the moment usually arise if any particular file exceeds ~800 lines, and when there have been many, many iterations. Then LLMs tend to get forgetful, dropping comments, reintroducing faults from earlier iterations, etc. Better to start a new session at that point.

Thinking of LLMs as just a lossy compression algorithm for all of human knowledge, using them in parallel makes a kind of sense: all the companies train on approximately the same data, but each ends up with a slightly different-looking ”lossy big picture”. If I ”look at all these pictures” side by side, I can perhaps see more detail, or notice that where one picture is fuzzy, another is much clearer.

LLMs seem to be very good at spotting the correct solution when they are given a couple of options or hypotheses about the cause of an issue. Most surprising to me is that Grok has been the one to solve the majority of the most stubborn bugs that the others have gotten stuck on.

With (Edit:) Gemini I just had my first ”hole in one”, where it generated a flawless ~500-line web app on the very first try. I just gave it my Git codebase as a zip file and asked for a new module that interfaces with the existing stuff. Wild times.

[-] Pechente@feddit.org 0 points 1 month ago

Definitely not good. Sometimes they can solve issues, but you gotta point them in the direction of the problem. Other times they write hacky workarounds that do the job for the moment but crash catastrophically with the next major dependency update.

[-] HarkMahlberg@kbin.earth 1 points 1 month ago

I saw an LLM override the casting operator in C#. An evangelist would say "genius! what a novel solution!" I said "nobody at this company is going to know what this code is doing 6 months from now."

It didn't even solve our problem.
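For anyone who hasn't met the feature: below is a minimal, hypothetical sketch of what a user-defined conversion ("casting") operator looks like in C#. The type and names are invented for illustration, not taken from the code in question, but it shows why the trick is easy to miss when reading call sites.

```csharp
// Hypothetical illustration of a user-defined conversion operator in C#.
// The OrderId type and its members are invented for this example.
public sealed class OrderId
{
    public int Value { get; }

    public OrderId(int value) => Value = value;

    // Implicit conversion: any int silently becomes an OrderId.
    // No cast appears at the call site, which is exactly why the
    // intent is invisible to whoever reads the code six months later.
    public static implicit operator OrderId(int value) => new OrderId(value);

    // Explicit conversion back to int: at least this one requires a visible cast.
    public static explicit operator int(OrderId id) => id.Value;
}

public static class Demo
{
    public static void Main()
    {
        OrderId id = 42;       // the implicit operator runs here, invisibly
        int raw = (int)id;     // the explicit operator runs here
        System.Console.WriteLine($"{id.Value} == {raw}");
    }
}
```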

[-] hisao@ani.social -3 points 1 month ago

I saw an LLM override the casting operator in C#. An evangelist would say “genius! what a novel solution!” I said “nobody at this company is going to know what this code is doing 6 months from now.”

Before LLMs, people were often saying this about people smarter than the rest of the group: "Yeah, he was too smart and overengineered solutions that no one could understand after he left." This is, btw, one of the reasons why I've increasingly disliked programming as a field over the years and why I happily delegate the coding part to AI nowadays. This field celebrates conformism, and that's why humans shouldn't write code manually. Perfect field to automate away via LLMs.

[-] Feyd@programming.dev 1 points 1 month ago

Wow you just completely destroyed any credibility about your software development opinions.

[-] hisao@ani.social -2 points 1 month ago

Why, though? I think hating and maybe even disrespecting programming, and wanting as much of your job as possible to be made redundant and replaced, is actually the best mindset for a programmer. Maybe in the past that was the mindset for becoming a team lead or a project manager, but nowadays, with AI, it's a mindset for programmers.

[-] Feyd@programming.dev 2 points 1 month ago

Before LLMs, people were often saying this about people smarter than the rest of the group: “Yeah, he was too smart and overengineered solutions that no one could understand after he left.”

This part.

[-] hisao@ani.social 0 points 1 month ago

The fact that I dislike it that it turned out that software engineering is not a good place for self-expression or for demonstrating your power level or the beauty and depth of your intricate thought patterns through advanced constructs and structures you come up with, doesn't mean that I disagree that this is true.

[-] chunkystyles@sopuli.xyz 1 points 1 month ago

If your code is as comprehensible as that run-on sentence, I can understand why coworkers would ask you to please write simpler code.

[-] very_well_lost@lemmy.world 1 points 1 month ago

demonstrating your power level

lolwut? I'm so tired of tech people acting like they're the next Genghis Khan or Julius Caesar...

[-] Feyd@programming.dev 1 points 1 month ago

The problem is that you don't realize that writing code that is difficult to maintain is in fact not a sign of intelligence, or "power level".

[-] hisao@ani.social -1 points 1 month ago

It depends. If it's difficult to maintain because it's some terrible, careless spaghetti written by a person who didn't care enough, then it's definitely not a sign of intelligence or power level. But if it's difficult to maintain because the rest of the team can't wrap their heads around the type-level metaprogramming or the EDSL you came up with, then it's a different case.

[-] chunkystyles@sopuli.xyz 1 points 1 month ago

No. Both are hard to maintain. And in fact, I'd prefer the spaghetti. It can be untangled.

[-] very_well_lost@lemmy.world 1 points 1 month ago

Before LLMs people were often saying this about people smarter than the rest of the group.

Smarter by whose metric? If you can't write software that meets the bare minimum of comprehensibility, you're probably not as smart as you think you are.

Software engineering is an engineering discipline, and conformity is exactly what you want in engineering — because in engineering you don't call it 'conformity', you call it 'standardization'. Nobody wants to hire a maverick bridge-builder, they wanna hire the guy who follows standards and best practices because that's how you build a bridge that doesn't fall down. The engineers who don't follow standards and who deride others as being too stupid or too conservative to understand their vision are the ones who end up crushed to death by their imploding carbon fiber submarine at the bottom of the Atlantic.

AI has exactly the same "maverick" tendencies as human developers (because, surprise surprise, it's trained on human output), and until that gets ironed out, it's not suitable for writing anything more than the most basic boilerplate — which is stuff you can usually just copy-paste together in five minutes anyway.

[-] hisao@ani.social -2 points 1 month ago

You're right, of course, and engineering as a whole is a prime candidate for AI. Everything that has strict specs, standards, and invariants will benefit massively from it, and conforming is what AI inherently excels at, as opposed to humans. Complaints like the one this subthread started with are usually a matter of people being bad at writing requirements rather than AI being bad at following them. If you approach requirements like an actual engineering field does, you will get corresponding results, while humans will struggle to fully conform, or will even look for tricks and loopholes in your requirements to sidestep them and assert their will while technically remaining in "barely legal" territory.

[-] TechLich@lemmy.world 1 points 1 month ago

I feel like this isn't quite true, and it's something I hear a lot of people say about AI: that it's good at following requirements and conforming and being a mechanical, logical robot, because that's what computers are like and that's how it is in sci-fi.

In reality, that seems to be what they're worst at. They're great at seeing patterns and creating ideas but terrible at following instructions or staying on task. As soon as something is a bit bigger than they can track context for, they'll get "creative", and if they see a pattern they can complete, they will, even if it's not correct. I've had Copilot start writing poetry in my code because there was a string it could complete.

Get it to make a pretty-looking static web page with fancy CSS, where it gets to make all the decisions? It does it fast.

Give it an actual, specific programming task in a full-sized application with multiple interconnected pieces and strict requirements? It confidently breaks most of the requirements and spits out garbage. If it can't hold the entire thing in its context, or if there are a lot of strict rules to follow, it'll struggle and forget what it's doing or why. Like a particularly bad human programmer would.

This is why AI is automating art and music and writing and not more mundane/logical/engineering tasks. Great at being creative and balls at following instructions for more than a few steps.

[-] 0x01@lemmy.ml -1 points 1 month ago

I use it extensively daily.

It cannot step through code right now, so true debugging is not something you use it for. Most of the time the LLM will take the junior-engineer approach of "guess and check" unless you explicitly give it better guidance.

My process is generally to start with unit tests and type definitions, then write a large multi-page prompt for every segment of the app the LLM will be tasked with. Then I'll make a snapshot of the code, give the tool access to the markdown prompt, and validate its work. When there are failures and the project has extensive unit tests, it generally follows the same pattern of "I see that this failure should be added to the unit tests", which it does, and then it re-executes them during iterative development.

If tests are not available, or if it's not something directly accessible to the tool, then it will generally rely on logs, either directly generated or provided by the user.

My role these days is to provide long, well-thought-out prompts, verify the integrity of the code after every commit, and generally just treat the LLM as a reckless junior dev. Sometimes junior devs can surprise you. Like yesterday, I was very surprised by a one-shot result: I asked for a mobile RN app for taking my rambling voice recordings and summarizing them into prompts, and it was immediately, remarkably successful. Now I've been walking around mic'd up to generate prompts.
