Judging from all the comments here, it's going to be amazing seeing the fallout of all the LLM-generated code in a year or so. The number of people who seemingly relish the ability to stop thinking and let the model generate giant chunks of their code base is, uh, something else lol.
It entirely depends on the exposure and reliability the code needs. Some code is just a one-off to show a customer what something might look like. I don't care at all how well the code works or what it looks like for something like that. Rapid prototyping is a valid use case for that.
I have also written C++ code that has to have a runtime of years, meaning there can be absolutely no memory leaks or bugs whatsoever, or the TV stops working. I wouldn't have a language model write any of that, at least not without testing the hell out of it and making sure it makes sense to me.
It's not all or nothing here. These things are tools and should be used as such.
> It entirely depends on the exposure and reliability the code needs.
Ahh, sweet summer child, if I had a nickel for every time I've heard "just hack something together quickly, that's throwaway code", that ended up being a critical lynchpin of a production system - well, I'd probably have at least like a buck or so.
Obviously, to emphasize, this kind of thing happens all the time with human-generated code, but LLMs make the issue a lot worse because it lets you generate a ton of eventual mess so much faster.
Also, I do agree with your primary point (my comment was a bit tongue in cheek) - it's very helpful to know what should be core and what can be thrown away. It's just in the real world whenever "throwaway" code starts getting traction and getting usage, the powers that be rarely are OK with "Great, now let's rebuild/refactor with production usage in mind" - it's more like "faster faster faster".
> Ahh, sweet summer child, if I had a nickel for every time I've heard "just hack something together quickly, that's throwaway code", that ended up being a critical lynchpin of a production system - well, I'd probably have at least like a buck or so.
Because this is the first pass on any project, any component, ever. Design is done with iterations. One can and should throw out the original rough lynchpin and replace it with a more robust solution once it becomes evident that it is essential.
If you know that ahead of time and want to make it robust early, the answer is still rarely a single diligent one-shot to perfection - you absolutely should take multiple quick rough iterations to think through the possibility space before settling on your choice. Even that is quite conducive to LLM coding - and the resulting synthesis after attacking it from multiple angles is usually the strongest of all. You should still go over it all with a fine-toothed comb at the end, and understand exactly why each choice was made, but the AI helps immensely in narrowing down the possibility space.
Not to rag on you though - you were being tongue in cheek - but we're kidding ourselves if we don't accept that like 90% of the code we write is rough throwaway code at first and only a small portion gets polished into critical form. That's just how all design works though.
I would love to work at the places you have been where you are given enough time to throw out the prototype and do it properly. In my almost 20 years of professional experience this has never happened: prototype and exploratory code has only ever been given minimal polishing time before reaching a production, in-use state.
I think you are overestimating the quality of code humans generate. I'd take LLM output over that of any junior- to mid-level developer (given the same prompt / ask).
Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
I have never heard anybody successfully using LLMs say this before. Most of what I've learned from talking to people about their workflows is counterintuitive and subtle.
It's a really weird way to open up an article concluding that LLMs make one a worse programmer: "I definitely know how to use this tool optimally, and I conclude the tool sucks". Ok then. Also: the piano is a terrible, awful instrument; what a racket it makes.
Fully agree. It takes months to learn how to use LLMs properly. There is an initial honeymoon where the LLMs blow your mind. Then you get some disappointments. But then you start realizing that there are some things that LLMs are good at and some that they are bad at. You start creating a feel for what you can expect them to do. And more importantly, you get into the habit of splitting problems into smaller problems that the LLMs are more likely to solve. You keep learning how to best describe the problem, and you keep adjusting your prompts. It takes time.
It really doesn't take that long. Maybe if you're super junior and have never coded before? In that case I'm glad it's helping you get into the field. Also, if it's taking you months, whole new models will be released in that time and you'll need to learn their quirks all over again.
I'm glad you feel like you've nailed it. I've been using models to help me code for over two years, and I still feel like I have no idea what I'm doing.
I feel like every time I have a prompt or use a new tool, I'm experimenting with how to make fire for the first time. It's not to say that I'm bad at it. I'm probably better than most people. But knowing how to use this tool is by far the largest challenge, in my opinion.
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
That's a wild statement. I'm now extremely productive with LLMs in my core codebases, but it took a lot of practice to get it right and repeatable. There's a lot of little contextual details you need to learn how to control so the LLM makes the right choices.
Whenever I start working in a new code base, it takes a non-trivial amount of time to ramp back up to full LLM productivity.
Is the non-trivial amount of time significantly less than you trying to ramp up yourself?
I am still hesitant to use AI to solve problems for me. Either it hallucinates and misleads me, or it does a great job and I worry that my ability to reason through complex problems with rigor will degenerate. Once my ability to solve complex problems has degenerated, my patience diminished, and my attention span destroyed, I will be utterly reliant on a service that other entities own to perform in my daily life. Genuine question - are people comfortable with this?
The ramp-up time with AI is absolutely lower than trying to ramp up without AI.
My comment is specifically in contrast to working in a codebase where I'm at "max AI productivity". In a new codebase, it just takes a bit of time to work out kinks and figure out tendencies of the LLMs in those codebases. It's not that I'm slower than I'd be without AI, I'm just not at my "usual" AI-driven productivity levels.
>Genuine question - are people comfortable with this?
It's a question of degree, but in general, yeah. I'm totally comfortable being reliant on other entities to solve complex problems for me.
That's how economies work [1]. I neither have nor want to acquire the lifetime of experience I would need to learn how to produce the tea leaves in my tea, or the clean potable water in it, or the mug they are contained within, or the concrete walls 50 meters up from ground level I am surrounded by, or so on and so forth. I can live a better life by outsourcing the need for this specialized knowledge to other people, and trade with them in exchange for my own increasingly-specialized knowledge. Even if I had 100 lifetimes to spend, and not the 1 I actually have, I would probably want to put most of them to things that, you know, aren't already solved-enough problems.
Everyone doing anything interesting works like this, with vanishingly few exceptions. My dad doesn't need to know how to do algebra to get his taxes done, he just has an accountant. And his accountant doesn't need to know how to rewire his turn of the century New England home. And if you look at the exceptions, like that really cute 'self sufficient' family who uploads weekly YouTube videos called "Our Homestead Life"... It often turns out that the revenue from that YouTube stream is nontrivial to keeping the whole operation running. In other words, even if they genuinely no longer go to Costco, it's kind of a gyp.
> My dad doesn't need to know how to do algebra to get his taxes done, he just has an accountant.
This is not quite the same thing. The AI is not perfect; it frequently makes mistakes or writes suboptimal code. As a software engineer, you are responsible for finding and fixing those. This means you have to review and fully understand everything the AI has written.
Quite a different situation than your dad and his accountant.
I see your point. I don't think it's different in kind, just degree. My thought process: First, is my dad's accountant infallible?
If not, then they must themselves make mistakes or do things suboptimally sometimes. Whose responsibility is that - my dad, or my dad's accountant?
If it is my dad, does that then mean my dad has an obligation to review and fully understand everything the accountant has written?
And do we have to generalize that responsibility to everything and everyone my dad has to hand off work to in order to get something done? Clearly not, that's absurd. So where do we draw the line? You draw it in the same place I do for right now, but I don't see why we expect that line to be static.
I would honestly say, it's more like autocomplete on steroids, like you know what you want so you just don't wanna type it out (e.g. scripts and such)
And so if you don't use it then someone else will... But as for the models, we already have some pretty good open source ones like Qwen and it'll only get better from here so I'm not sure why the last part would be a dealbreaker
> That's a wild statement. I'm now extremely productive with LLMs in my core codebases, but it took a lot of practice to get it right and repeatable. There's a lot of little contextual details you need to learn how to control so the LLM makes the right choices.
> Whenever I start working in a new code base, it takes a non-trivial amount of time to ramp back up to full LLM productivity.
Do you find that these details translate between models? Sounds like it doesn't translate across codebases for you?
I have mostly moved away from this sort of fine-tuning approach because of my experience a while ago with OpenAI's GPT-3.5 and GPT-4. Extra work on my end that was necessary with the older model wasn't needed with the new one, and sometimes it counterintuitively caused worse performance by pointing the model at the way I'd do it rather than the way it might have the best luck with. ESPECIALLY for the sycophantic models, which will heavily index on "if you suggested that this thing might be related, I'll figure out some way to make sure it is!"
So more recently I generally stick to the "we'll handle a lot of the prompt nitty gritty for you" IDE or CLI agent stuff, but I find they still fall apart with large complex codebases, and the tricks don't translate across codebases.
Yes and no. The broader business context translates well, but each model has its own blind spots and hyperfocuses that you need to massage out.
* Business context - these are things like code quality/robustness, expected spec coverage, expected performance needs, domain specific knowledge. These generally translate well between models, but can vary between code bases. For example, a core monolith is going to have higher standards than a one-off auxiliary service.
* Model focuses - Different models have different tendencies when searching a code base and building up their context. These are specific to each code base, but relatively obvious when they happen. For example, in one code base I work in, one model always seems to pick up our legacy notification system while another model happens to find our new one. It's not really a skill issue. It's just luck of the draw how files are named and how each of them searches. They each just find a "valid" notification pattern in a different order.
LLMs are massively helpful for orienting to a new codebase, but it just takes some time to work out those little kinks.
Getting 80% of the benefit of LLMs is trivial. You can ask it for some functions or to write a suite of unit tests and you’re done.
The last 20%, while possible to attain, is ultimately not worth it for the amount of time you spend in context hells. You can just do it yourself faster.
> The last 20%, while possible to attain, is ultimately not worth it for the amount of time you spend in context hells. You can just do it yourself faster.
I'm arguing that there's a skill that has to be learned in order to break through this. As you start in a new code base, you should be quick to jump in when you hit that 20%. But, as you spend more time in it, you learn how to avoid the same "context hell" issues and move that number down to 15%, 10%, 5% of the time.
You're still going to need to jump in, but when you can learn to get the LLM to write 95% of the code for you, that's incredibly powerful.
It's not incredibly powerful, it's incrementally powerful. Getting the first 80% via LLM is already the incredible power. A sufficiently skilled developer should be able to handle the rest with ease. It is not worth doing anything unnatural in an effort to chase down the last 20%, you are just wasting time and atrophying skills. If you can get the full 95% in some one-shot prompts, great. But don't go chasing waterfalls.
No, it actually has an exponential growth type of effect on productivity to be able to push it to the boundary more.
I’m making this a bit contrived, but I’m simplifying it to demonstrate the underlying point.
When an LLM is 80% effective, I'm limited to doing 5 things in parallel since I still need to jump in 20% of the time.
When an LLM is 90% effective, I can do 10 things at once. When it's 95%, 20 things. 99%, 100 things.
Now, obviously I can’t actually juggle 10 or 20 things at once. However, the point is there are actually massive productivity gains to be had when you can reduce your involvement in a task from 20% to, even 10%. You’re effectively 2x as productive.
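To make the toy math explicit, here's a minimal sketch (my own simplification, not a precise model of real work): if you must be hands-on for a fraction f of each task, you can keep roughly 1/f tasks in flight before your attention becomes the bottleneck.

```python
# Toy model: hands-on fraction f per task => roughly 1/f tasks in flight.
def max_parallel_tasks(hands_on_fraction: float) -> float:
    return 1.0 / hands_on_fraction

for effectiveness in (0.80, 0.90, 0.95, 0.99):
    f = 1.0 - effectiveness  # share of each task you still do yourself
    print(f"LLM handles {effectiveness:.0%} -> ~{max_parallel_tasks(f):.0f} tasks in flight")
```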
I agree with you, and I have seen this take a few times now in articles on HN. It amounts to the classic "We've tried nothing and we're all out of ideas" Simpsons joke.
I read these articles and I feel like I am taking crazy pills sometimes. The person, enticed by the hype, makes a transparently half-hearted effort for just long enough to confirm their blatantly obvious bias. They then act like they now have ultimate authority on the subject to proclaim that their preconceived notions were definitely true beyond any doubt.
Not all problems yield well to LLM coding agents. Not all people will be able or willing to use them effectively.
But I guess "I gave it a try and it is not for me" is a much less interesting article compared to "I gave it a try and I have proved it is as terrible as you fear".
I agree with your assessment of this statement. I had to reread it a few times to actually understand it.
He is actually recommending Copilot for price/performance reasons and his closing statement is "Don’t fall for the hype, but also, they are genuinely powerful tools sometimes."
So, it just seems like he never really gave a try at how to engineer better prompts that these more advanced models can use.
The OP's point seems to be: it's very quick for LLMs to become a net benefit to your skills, if they are a benefit at all. That is, he's only speaking of the very beginning of the learning curve.
The first two points directly contradict each other, too. Learning a tool should have the outcome that one is productive with it. If getting to "productive" is non-trivial, then learning the tool is non-trivial.
> I have never heard anybody successfully using LLMs say this before. Most of what I've learned from talking to people about their workflows is counterintuitive and subtle.
Because for all our posturing about being skeptical and data driven we all believe in magic.
Those "counterintuitive non-trivial workflows"? They work about as well as just prompting "implement X" with no rules, agents.md, careful lists etc.
Because 1) literally no one actually measures whether the magical incantations work and 2) it's impossible to make such measurements due to non-determinism.
The problem with your argument here is that you're effectively saying that developers (like myself) who put effort into figuring out good workflows for coding with LLMs are deceiving themselves, and are effectively wasting their time.
Either I've wasted significant chunks of the past ~3 years of my life or you're missing something here. Up to you to decide which you believe.
I agree that it's hard to take solid measurements due to non-determinism. The same goes for managing people, and yet somehow many good engineering managers can judge if their team is performing well and figure out what levers they can pull to help them perform better.
That's not a problem, that is the argument. People are bad at measuring their own productivity. Just because you feel more productive with an LLM does not mean you are. We need more studies and less anecdata
I'm afraid all you're going to get from me is anecdata, but I find a lot of it very compelling.
I talk to extremely experienced programmers whose opinions I have valued for many years before the current LLM boom who are now flying with LLMs - I trust their aggregate judgement.
Meanwhile my own https://tools.simonwillison.net/colophon collection has grown to over 120 in just a year and a half, most of which I wouldn't have built at all - and that's a relatively small portion of what I've been getting done with LLMs elsewhere.
Hard to measure productivity on a "wouldn't exist" to "does exist" scale.
It might also be the largest collection of published chat transcripts for this kind of usage from a single person - though that's not hard since most people don't publish their prompts.
Building little things like this is a really effective way of gaining experience using prompts to get useful code results out of LLMs.
Unlike my tools.simonwillison.net stuff the vast majority of those products are covered by automated tests and usually have comprehensive documentation too.
There have been so many "advances" in software development in the last decades - powerful type systems, null safety, sane error handling, Erlang-style fault tolerance, property testing, model checking, etc. - and yet people continue to write garbage code in unsafe languages with underpowered IDEs.
I think many in the industry have absolutely no clue what they're doing and are bad at evaluating productivity, often prioritising short term delivery over longterm maintenance.
LLMs can absolutely be useful but I'm very concerned that some people just use them to churn out code instead of thinking more carefully about what and how to build things. I wish we had at least the same amount of discussions about those things I mentioned above as we have about whether Opus, Sonnet, GPT5 or Gemini is the best model.
So far I've found that the people who are hating on AI are stuck maintaining highly coupled code that they've invested a significant amount of mental energy internalizing. AI is bad on that type of code, and since they've invested so much energy in understanding the code, it ends up taking longer for them to load context and guide the AI than to just do the work. Their code base is tightly coupled hot garbage, and rather than accept that the tools aren't working because of their own lack of architectural rigor, they just shit on the tools. This is part of the reason that that study of open source maintainers using Cursor didn't consistently produce improvement (also, Cursor is pretty mid).
https://www.youtube.com/watch?v=tbDDYKRFjhk&t=4s is one of the largest studies I've seen so far and it shows that when the codebase is small or engineered for AI use, >20% productivity improvements are normal.
> who put effort into figuring out good workflows for coding with LLMs are deceiving themselves, and are effectively wasting their time.
It's quite possible you do. Do you have any hard data justifying the claims of "this works better", or is it just a soft fuzzy feeling?
> The same goes for managing people, and yet somehow many good engineering managers can judge if their team is performing well
It's actually really easy to judge if a team is performing well.
What is hard is finding what actually makes the team perform well. And that is just as much magic as "if you just write the correct prompt everything will just work"
On top of this, a lot of the "learning to work with LLMs" is breaking down tasks into small pieces with clear instructions and acceptance criteria. That's just part of working efficiently, but maybe people don't want to be bothered to do it.
Even this opens up a whole field of weird subtle workflow tricks people have, because people run parallel asynchronous agents that step on each other in git. Solo developers run teams now!
Really wild to hear someone say out loud "there's no learning curve to using this stuff".
I've said it before, I feel like I'm some sort of lottery winner when it comes to LLM usage.
I've tried a few things that have mostly been positive. Starting with Copilot's in-line "predictive text on steroids", which works really well. It's definitely faster and more accurate than me typing in a traditional IntelliSense IDE. For me, this level of AI is can't-lose: it's very easy to see if a few lines of prediction are what you want.
I then did Cursor for a while, and that did what I wanted as well. Multi-file edits can be a real pain. Sometimes, it does some really odd things, but most of the time, I know what I want, I just don't want to find the files, make the edits on all of them, see if it compiles, and so on. It's a loop that you have to do as a junior dev, or you'll never understand how to code. But now I don't feel I learn anything from it, I just want the tool to magically transform the code for me, and it does that.
Now I'm on Claude. Somehow, I get a lot fewer excursions from what I wanted. I can do much more complex code edits, and I barely have to type anything. I sort of tell it what I would tell a junior dev. "Hey let's make a bunch of connections and just use whichever one receives the message first, discarding any subsequent copies". If I was talking to a real junior, I might answer a few questions during the day, but he would do this task with a fair bit of mess. It's a fiddly task, and there are assumptions to make about what the task actually is.
Somehow, Claude makes the right assumptions. Yes, indeed I do want a test that can output how often each of the incoming connections "wins". Correct, we need to send the subscriptions down all the connections. The kinds of assumptions a junior would understand and come up with himself.
I spend a lot of time with the LLM critiquing, rather than editing. "This thing could be abstracted, couldn't it?" and then it looks through the code and says "yeah I could generalize this like so..." and it means instead of spending my attention on finding things in files, I look at overall structure. This also means I don't need my highest level of attention, so I can do this sort of thing when I'm not even really able to concentrate, eg late at night or while I'm out with the kids somewhere.
So yeah, I might also say there's very little learning curve. It's not like I opened a manual or tutorial before using Claude. I just started talking to it in natural language about what it should do, and it's doing what I want. Unlike seemingly everyone else.
Pianists' results are well known to be proportional to their talent/effort. In open source, hardly anyone is even using LLMs, and the ones that do have barely any output - in many cases less output than they had before using LLMs.
> In open source, hardly anyone is even using LLMs, and the ones that do have barely any output - in many cases less output than they had before using LLMs.
LLMs are basically glorified slot machines. Some people try very hard to come up with techniques or theories about when the slot machine is hot, but it's only an illusion. Let me tell you, it's random and arbitrary - maybe today is your lucky day, maybe not. Same with AI: learning the "skill" is as difficult as learning how to google or how to check Stack Overflow, i.e. trivial. All the rest is luck and how many coins you have in your pocket.
This is not a good analogy. The parameters of slot machines can be changed to make the casino lose money. Just because something is random, doesn't mean it is useless. If you get 7 good outputs out of 10 from an LLM, you can still use it for your benefit. The frequency of good outputs and how much babysitting it requires determine whether it is worth using or not. Humans make mistakes too, although way less often.
There's plenty of evidence that good prompts (prompt engineering, tuning) can result in better outputs.
Improving LLM output through better inputs is neither an illusion nor as easy as learning how to google (entire companies are being built around improving LLM outputs and measuring that improvement).
Sure, but tricks & techniques that work with one model often don't translate or are actively harmful with others. Especially when you compare models from today and 6 or more months ago.
Keep in mind that the first reasoning model (o1) was released less than 8 months ago and Claude Code was released less than 6 months ago.
So true! About ten years ago Peter Norvig recommended the short Google online course on how to use Google Search: amazing how much one hour of structured learning permanently improved my search skills.
I have used neural networks since the 1980s, and modern LLM tech simply makes me happy, but there are strong limits to what I will use the current tech for.
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
Learning how to use LLMs in a coding workflow is trivial to start, but you can get a bad taste early if you don't learn how to adapt both your workflow and its workflow. It is easy to get a trivially good result and then be disappointed in the follow-up. It is easy to start on something it's not good at and think it's worthless.
The pure dismissal of cursor, for example, means that the author didn't learn how to work with it. Now, it's certainly limited and some people just prefer Claude code. I'm not saying that's unfair. However, it requires a process adaptation.
"There's no learning curve" just means this guy didn't get very far up, which is definitely backed up by thinking that Copilot and other tools are all basically the same.
Define "not trivial". Obviously, experience helps, as with any tool. But it's hardly rocket science.
It seems to me the biggest barrier is that the person driving the tool needs to be experienced enough to recognize and assist when it runs into issues. But that's little different from any sophisticated tool.
It seems to me a lot of the criticism comes from placing completely unrealistic expectations on an LLM. "It's not perfect, therefore it sucks."
As of about three months ago, one of the most important skills in effective LLM coding is coding agent environment design.
If you want to use a tool like Claude Code (or Gemini CLI or Cursor agent mode or Code CLI or Qwen Code) to solve complex problems you need to give them an environment they can operate in where they can solve that problem without causing too much damage if something goes wrong.
You need to think about sandboxing, and what tools to expose to them, and what secrets (if any) they should have access to, and how to control the risk of prompt injection if they might be exposed to potentially malicious sources of tokens.
The other week I wanted to experiment with some optimizations of configurations on my Fly.io hosted containers. I used Claude Code for this by:
- Creating a new Fly organization which I called Scratchpad
- Assigning that a spending limit (in case my coding agent went rogue or made dumb expensive mistakes)
- Creating a Fly API token that could only manipulate that organization - so I could be sure my coding agent couldn't touch any of my production deployments
- Putting together some examples of how to use the Fly CLI tool to deploy an app with a configuration change - just enough information that Claude Code could start running its own deploys
- Running Claude Code such that it had access to the relevant Fly command authenticated with my new Scratchpad API token
With all of the above in place I could run Claude in --dangerously-skip-permissions mode and know that the absolute worst that could happen is it might burn through the spending limit I had set.
This took a while to figure out! But now... any time I want to experiment with new Fly configuration patterns I can outsource much of that work safely to Claude.
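As a rough illustration of the shape of that setup (a sketch, not my actual script - the token file path and prompt are made-up placeholders): the key idea is that the agent's process only ever sees the scratchpad-scoped credential, never a production one.

```python
# Sketch: run a coding agent with only a narrowly-scoped Fly token in its
# environment, so even --dangerously-skip-permissions can't touch production.
import os
import subprocess

def run_sandboxed_agent(prompt: str) -> None:
    env = os.environ.copy()
    # Token created beforehand, restricted to the throwaway "Scratchpad" org
    # that has a spending limit attached. Path is illustrative.
    with open(os.path.expanduser("~/.fly/scratchpad-token")) as f:
        env["FLY_API_TOKEN"] = f.read().strip()
    subprocess.run(
        ["claude", "--dangerously-skip-permissions", "-p", prompt],
        env=env,
        check=True,
    )

run_sandboxed_agent("Experiment with alternative Fly configuration patterns and report back")
```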
For one - I’d say scoped API tokens that prevent messing with resources across logical domains (eg prod vs nonprod, distinct github repos, etc) is best practice in general. Blowing up a resource with a broadly scoped token isn’t a failure mode unique to LLMs.
edit: I don’t have personal experience around spending limits but I vaguely recall them being useful for folks who want to set up AWS resources and swing for the fences, in startups without thinking too deeply about the infra. Again this isn’t a failure mode unique to LLMs although I can appreciate it not mapping perfectly to your scenario above
edit #2: from what I can tell, the LLM-specific context of your scenario above is: providing examples and setting up API access somehow (e.g. maybe invoking a CLI?). The rest to me seems like good old software engineering.
The statement I responded to was, "creating an effective workflow is not trivial".
There are plenty of useful LLM workflows that are possible to create pretty trivially.
The example you gave is hardly the first thing a beginning LLM user would need. Yes, more sophisticated uses of an advanced tool require more experience. There's nothing different from any other tool here. You can find similar debates about programming languages.
Again, what I said in my original comment applies: people place unrealistic expectations on LLMs.
I suspect that this is at least partly a psychological game people unconsciously play to try to minimize the competence of LLMs, to reduce the level of threat they feel. A sort of variation of terror management theory.
I don’t think anyone expects perfection. Programs crash, drives die, and computers can break anytime. But we expect our tools to be reliable and not fight with it everyday to get it to work.
I don’t have to debug Emacs every day to write code. My CI workflow just runs every time a PR is created. When I type ‘make tests’, I get a report back. None of those things are perfect, but they are reliable.
I'm not a native speaker, but to me that quote doesn't necessarily imply an inability of OP to get up the curve. Maybe they just mean that the curve can look flat at the start?
No, it's sometimes just extremely easy to recognize people who have no idea what they're talking about when they make certain claims.
Just like I can recognize a clueless frontend developer when they say "React is basically just a newer jquery". Recognizing clueless engineers when they talk about AI can be pretty easy.
It's a sector that is both old and new: AI has been around forever, but even people who worked in the sector years ago are taken aback by what is suddenly possible, the workflows that are happening... hell, I've even seen cases where it's the very people who have been following GenAI forever that have a bias towards believing it's incapable of what it can do.
For context, I lead an AI R&D lab in Europe (https://ingram.tech/). I've seen some shit.
Basically, they are the same, they are all LLMs. They all have similar limitations. They all produce "hallucinations". They can also sometimes be useful. And they are all way overhyped.
The number of misconceptions in this comment is quite profound.
Copilot isn't an LLM, for a start. You _combine_ it with a selection of LLMs. And it absolutely has severe limitations compared to something like Claude Code in how it can interact with the programming environment.
"Hallucinations" are far less of a problem with software that grounds the AI to the truth in your compiler, diagnostics, static analysis, a running copy of your project, running your tests, executing dev tools in your shell, etc.
If it’s not trivial, it’s worthless, because writing things out manually yourself is usually trivial, but tedious.
With LLMs, the point is to eliminate tedious work in a trivial way. If it’s tedious to get an LLM to do tedious work, you have not accomplished anything.
If the work is not trivial enough for you to do yourself, then using an LLM will probably be a disaster, as you will not be able to judge the final output yourself without spending nearly the same amount of time it takes for you to develop the code on your own. So again, nothing is gained, only the illusion of gain.
The reason people think they are more productive using LLMs to tackle non-trivial problems is because LLMs are pretty good at producing “office theatre”. You look like you’re busy more often because you are in a tight feedback loop of prompting and reading LLM output, vs staring off into space thinking deeply about a problem and occasionally scribbling or typing something out.
So, I'd like you to talk to a fair number of emacs and vim users. They have spent hours and hours learning their tools, tweaking their configurations, and learning efficiencies. They adapt their tool to them and themselves to the tool.
We are learning that this is not going to be magic. There are some cases where it shines. If I spend the time, I can put out prototypes that are magic and I can test with users in a fraction of the time. That doesn't mean I can use that for production.
I can try three or four things during a meeting where I am generally paying attention, and look afterwards to see if any of them are worth pursuing.
I can have it work through drudgery if I provide it an example. I can have it propose a solution to a problem that is escaping me, and I can use it as a conversational partner for the best rubber duck I've ever seen.
But I'm adapting myself to the tool and I'm adapting the tool to me through learning how to prompt and how to develop guardrails.
Outside of coding, I can write chicken scratch and provide an example of what I want, and have it write a proposal for a PRD. I can have it break down a task, generate a list of proposed tickets, and after I've gone through them, have it generate them in Jira (or anything else with an API). But the more I invest into learning how to use the tool, the less I have to clean up after.
Maybe one day in the future it will be better. For now, the time invested into the tool means that 40 bucks of investment (20 into Cursor, 20 into GPT) can add a 10-15% boost in productivity. Putting 200 into Claude might get you another 10%, and it can get you 75% in greenfield and prototyping work. I bet agency work can be sped up by as much as 40% for that 200-buck investment into Claude.
That's a pretty good ROI.
And maybe some workloads can do even better. I haven't seen it yet but some people are further ahead than me.
Vim and Emacs are owned by the developer who configures them. LLMs are products whose capabilities are subject to the whims of their host. These are not the same things.
Everything you mentioned is also fairly trivial, just a couple of one shot prompts needed.
Does not mention the actual open source solution that has autocomplete, chat, a planner and agents, lets you bring your own keys, connect to any LLM provider, customize anything, and rewrite all the prompts and tools.
Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. [...]
LLMs will always suck at writing code that has not been written millions of times before. As soon as you venture slightly offroad, they falter.
That right there is your learning curve! Getting LLMs to write code that's not heavily represented in their training data takes experience and skill and isn't obvious to learn.
I'm still waiting for someone who claims prompting is such a skill to learn to explain, just once, a single technique that is not obvious. Storing checkpoints to go back to a working version (already good practice without an LLM; see: git), launching 10 tabs with slightly different prompts and choosing the best, asking the LLM to improve my prompt, adding more context... is that a skill? I remember when I was a child, my mom thought that programming the VCR to record the night show was such a feat.
If you have a big rock (a software project), there's quite a difference between pushing it uphill (LLM usage) and hauling it up with a winch (traditional tooling and methods).
People are claiming that it takes time to build the muscles and train the correct footing to push, while I'm here learning mechanical theory and drawing up levers. If someone manages to push the rock one meter, he comes clamoring - ignoring the many who were injured doing the same - saying that one day he will be able to pick the rock up and throw it at the moon.
Simon, I have mad respect for your work but I think on this your view might be skewed because your day to day work involves a codebase where a single developer can still hold the whole context in their head. I would argue that the inadequacies of LLMs become more evident the more you have to make changes to systems that evolve at the speed of 15+ concurrent developers.
One of the things I'm using LLMs for a lot right now is quickly generating answers about larger codebases I'm completely unfamiliar with.
Anything up to 250,000 tokens I pipe into GPT-5 (prior to that o3), and beyond that I'll send them to Gemini 2.5 Pro.
For even larger code than that I'll fire up Codex CLI or Claude Code and let them grep their way to an answer.
This stuff has gotten good enough now that I no longer get stuck when new tools lack decent documentation - I'll pipe in just the source code (filtered for .go or .rs or .c files or whatever) and generate comprehensive documentation for myself from scratch.
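For anyone curious what "pipe in just the source code" looks like in practice, here's a rough sketch (illustrative only, not my exact tooling - the extension list and the ~4 chars/token estimate are crude assumptions): gather files by extension, print them with path headers, and pipe the result into whatever LLM CLI or chat UI you prefer.

```python
# Dump source files (by extension) with path headers, for piping into an LLM.
import sys
from pathlib import Path

EXTENSIONS = {".go", ".rs", ".c", ".h"}

def dump_sources(root: str) -> None:
    total_chars = 0
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in EXTENSIONS:
            text = path.read_text(errors="ignore")
            total_chars += len(text)
            print(f"\n===== {path} =====\n{text}")
    # Rough heuristic (~4 chars/token) to judge whether this still fits a
    # large context window or needs an agentic/grep approach instead.
    print(f"approx tokens: {total_chars // 4}", file=sys.stderr)

if __name__ == "__main__":
    dump_sources(sys.argv[1] if len(sys.argv) > 1 else ".")
```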
Don't you see how this opens up a blindspot in your view of the code?
You don't have the luxury of having someone who is deeply familiar with the code sanity-check your perceived understanding of it, i.e. you don't see where the LLM is horribly off-track because you don't have sufficient understanding of that code to see the error. In enterprise contexts this is very common, though, so it's quite likely that a lot of the haters here have seen PRs submitted by vibecoders to their own work that were inadequate enough that they started to blame the tool. For example, I have seen someone reinvent the wheel of a client library's session handling because they were unaware that the existing session came batteries-included, and the LLM didn't hesitate to write the code again for them. The code worked, everything checked out, but because the developer didn't know what they didn't know, they submitted a janky mess.
As someone who leans more towards the side of LLM-sceptiscism, I find Sonnet 4 quite useful for generating tests, provided I describe in enough detail how I want the tests to be structured and which cases should be tested. There's a lot of boilerplate code in tests and IMO because of that many developers make the mistake of DRYing out their test code so much that you can barely understand what is being tested anymore. With LLM test generation, I feel that this is no longer necessary.
Aren't tests supposed to be premises (ensure the initial state is correct), compute (run the code), and assertions (verify the resulting state and output)? If your test code is complex, most of it should be moved into harness and helper functions. Writing more complex test code isn't particularly useful.
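For illustration, a minimal sketch of that premises / compute / assertions shape, with the messy construction pushed into a helper (the cart/discount domain and all names here are made up purely as an example):

```python
def make_cart(prices):
    # helper: hides whatever setup boilerplate the real project needs
    return {"items": [{"price": p} for p in prices]}

def apply_discount(cart, percent):
    total = sum(item["price"] for item in cart["items"])
    return round(total * (1 - percent / 100), 2)

def test_ten_percent_discount():
    cart = make_cart([10.0, 20.0])     # premises: known initial state
    total = apply_discount(cart, 10)   # compute: run the code under test
    assert total == 27.0               # assertion: verify the result
```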
I would actually disagree with the final conclusion here; despite claiming to offer the same models, Copilot seems very much nerfed — cross-comparing the Copilotified LLM and the same LLM through OpenRouter, the Copilot one seems to fail much harder. I'm not an expert in the details of LLMs but I guess there might be some extra system prompt, I also notice the context window limit is much lower, which kinda suggests it's been partially pre-consumed.
In case it matters, I was using Copilot that is for 'free' because my dayjob is open source, and the model was Claude Sonnet 3.7.
I've not yet heard anyone else saying the same as me which is kind of peculiar.
Deeply curious to know if this is an outlier opinion, a mainstream but pessimistic one, or the general consensus. My LinkedIn feed and personal network certainly suggests that it's an outlier, but I wonder if the people around me are overly optimistic or out of synch with what the HN community is experiencing more broadly.
My impression has been that in corporate settings (and I would include LinkedIn in that) AI optimism is basically used as virtue signaling, making it very hard to distinguish people who are actually excited about the tech from people wanting to be accepted.
My personal experience has been that AI has trouble keeping the scope of the change small and targeted. I have only been using Gemini 2.5 Pro though, as we don't have access to other models at my work. My friend tells me he uses Claude for coding and Gemini for documentation.
I reckon this opinion is more prevalent than the hyped blog posts and news stories suggest; I've been asking this exact question of colleagues and most share the sentiment, myself included, albeit not as pessimistic.
Most people I've seen espousing LLMs and agentic workflows as a silver bullet have limited experience with the frameworks and languages they use with these workflows.
My view currently is one of cautious optimism; that LLM workflows will get to a more stable point whereby they ARE close to what the hype suggests. For now, that quote that "LLMs raise the floor, not the ceiling" I think is very apt.
I think it’s pretty common among people whose job it is to provide working, production software.
If you go by MBA types on LinkedIn that aren’t really developers or haven’t been in a long time, now they can vibe out some react components or a python script so it’s a revolution.
Hi, my job is building working production software (these days heavily LLM assisted). The author of the article doesn't know what they're talking about.
I tend to strongly agree with the "unpopular opinion" about the IDEs mentioned versus CLI (specifically, aider.chat and Claude Code).
Assuming (this is key) you have mastery of the language and framework you're using, working with the CLI tool in 25 year old XP practices is an incredible accelerant.
Caveats:
- You absolutely must bring taste and critical thinking, as the LLM has neither.
- You absolutely must bring systems thinking, as it cannot keep deep weirdness "in mind". By this I mean the second and third order things that "gotcha" about how things ought to work but don't.
- Finally, you should package up everything new about your language or frameworks since a few months or year before the knowledge cutoff date, and include a condensed synthesis in your context (e.g., Swift 6 and 6.1 versus the 5.10 and 2024's WWDC announcements that are all GPT-5 knows).
For this last one I find it useful to (a) use OpenAI's "Deep Research" to first whitepaper the gaps, then another pass to turn that into a Markdown context prompt, and finally bring that over to your LLM tooling to include as needed when doing a spec or in architect mode. Similarly, (b) use repomap tools on dependencies if creating new code that leverages those dependencies, and have that in context for that work.
I'm confused why these two obvious steps aren't built into leading agentic tools, but maybe handling the LLM as a naive and outdated "Rain Man" type doesn't figure into mental models at most KoolAid-drinking "AI" startups, or maybe vibecoders don't care, so it's just not a priority.
Either way, context based development beats Leroy Jenkins.
> use repomap tools on dependencies if creating new code that leverages those dependencies, and have that in context for that work.
It seems to me that currently there are 2 schools of thought:
1. Use repomap and/or LSP to help the models navigate the code base
2. Let the models figure things out with grep
Personally, I am 100% a grep guy, and my editor doesn't even have LSP enabled. So, it is very interesting to see how many of these agentic tools do exactly the same thing.
And Claude Code /init is a great feature that basically writes down the current mental model after the initial round of grep.
Linkedin posts seems like an awful source. The people I see posting for themselves there are either pre-successful or just very fond of personal branding
Speaking to actual humans IRL (as in, non-management colleagues and friends in the field), people are pretty lukewarm on AI, with a decent chunk of them who find AI tooling makes them less productive. I know a handful of people who are generally very bullish on AI, but even they are nowhere near the breathless praise and hype you read about here and on LinkedIn, they're much more measured about it and approach it with what I would classify as common sense. Of course this is entirely anecdotal, and probably depends where you are and what kind of business you're in, though I will say I'm in a field where AI even makes some amount of sense (customer support software), and even then I'm definitely noticing a trend of disillusionment.
On the management side, however, we have all sorts of AI mandates, workshops, social media posts hyping our AI stuff, our whole "product vision" is some AI-hallucinated nightmare that nobody understands, you'd genuinely think we've been doing nothing but AI for the last decade the way we're contorting ourselves to shove "AI" into every single corner of the product. Every day I see our CxOs posting on LinkedIn about the random topic-of-the-hour regarding AI. When GPT-5 launched, it was like clockwork, "How We're Using GPT-5 At $COMPANY To Solve Problems We've Never Solved Before!" mere minutes after it was released (we did not have early access to it lol). Hilarious in retrospect, considering what a joke the launch was like with the hallucinated graphs and hilarious errors like in the Bernoulli's Principle slide.
Despite all the mandates and mandatory shoves coming from management, I've noticed the teams I'm close with (my team included) are starting to push back themselves a bit. They're getting rid of the spam generating PR bots that have never, not once, provided a useful PR comment. People are asking for the various subscriptions they were granted be revoked because they're not using them and it's a waste of money. Our own customers #1 piece of feedback is to focus less on stupid AI shit nobody ever asked for, and to instead improve the core product (duh). I'm even seeing our CTO who was fanboy number 1 start dialing it back a bit and relenting.
It's good to keep in mind that HN is primarily an advertisement platform for YC and their startups. If you check YC's recent batches, you would think that the 1 and only technology that exists in the world is AI, every single one of them mentions AI in one way or another. The majority of them are the lowest effort shit imaginable that just wraps some AI APIs and is calling it a product. There is a LOT of money riding on this hype wave, so there's also a lot of people with vested interests in making it seem like these systems work flawlessly. The less said about LinkedIn the better, that site is the epitome of the dead internet theory.
Have built many pipelines integrating LLMs to drive real $ results. I think this article boils it down too simply. But I always remember: if the LLM is the most interesting part of your work, something is severely wrong and you probably aren't adding much value. Context management based on some aspects of your input is where LLMs get good, but you need to do lots of experimentation to tune something. Most cases I have seen are about developing one pipeline to fit 100s of extremely different cases; the LLM does not solve this problem but basically serves as an approximator that lets you discretize previously large problems into some information subspace where you can treat the infinite set of inputs as something you know. LLMs are like a lasso (a better/worse one than traditional lassos depending on use case), but once you get your catch you still need to process it and deal with it programmatically to solve some greater problem. I hate how so many LLM-related articles/comments say "AI is useless, throw it away, don't use it" or "AI is the future, if we don't do it now we're doomed, let's integrate it everywhere, it can solve all our problems" - like, can anyone pick a happy medium? Maybe that's what being in a bubble looks like.
To the people who comment on and get defensive about this bit:
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
How much of your workflow or intuition from 6 months ago is still relevant today? How long would it take to learn the relevant bits today?
Keep in mind that Claude Code was released less than 6 months ago.
A fraction of the LLM maximalists are being defensive because they don't want to consider that they've maybe invested too much time in those tools, given what said tools are currently genuinely good at.
Pretty much all of the intuition I've picked up about getting good results from LLMs has stayed relevant.
If I was starting from fresh today I expect it would take me months of experimentation to get back to where I am now.
Working thoughtfully with LLMs has also helped me avoid a lot of the junk tips ("Always start with 'you are the greatest world expert in X', offer to tip it, ...") that are floating around out there.
All of the intuition? Definitely not my experience. I have found that optimal prompting differs significantly between models, especially when you look at models that are 6months old or older (the first reasoning model, o1, is less than 8 months old).
Speaking mostly from experience of building automated, dynamic data processing workflows that utilize LLMs:
Things that work with one model, might hurt performance or be useless with another.
Many tricks that used to be necessary in the past are no longer relevant, or only applicable for weaker models.
This isn't me dismissing anyone's experience. It's OK to do things that become obsolete fairly quickly, especially if you derive some value from them. If you try to stay on top of a fast-moving field, it's almost inevitable. I would not consider it a waste of time.
LLM-driven coding can yield awesome results, but you will be typing a lot and, as the article states, it requires an already well-structured codebase.
I recently started with a fresh project, and until I got to the desired structure I only used AI to ask questions or for suggestions. I organized and wrote most of the code.
Once it started to get into the shape that felt semi-permanent to me, I started a lot of queries like:
```
- Look at existing service X at folder services/x
- see how I deploy the service using k8s/services/x
- see how the docker file for service X looks like at services/x/Dockerfile
- now, I started service Y that does [this and that]
- create all that is needed for service Y to be skaffolded and deployed, follow the same pattern as service X
```
And it would go, read existing stuff for X, then generate all of the deployment/monitoring/readme/docker/k8s/helm/skaffold for Y
With zero to no mistakes.
Both Claude and Gemini are more than capable of such a task.
I had both of them generate 10-15 files with no errors, with the code deployable right after (of course the service will just answer and not do much more than that).
Then, I will take over again for a bit, do some business logic specific to Y, then again leverage AI to fill in missing bits, review, suggest stuff etc.
It might look slow, but it actually cuts out the most boring and most error-prone steps when developing a medium-to-large k8s-backed project.
My workflow with a medium sized iOS codebase is a bit like that. By the time everything works and is up to my standards, I‘ve usually taken longer, or almost as long, as if I‘d written everything manually. That’s with Opus-only Claude Code. It’s complicated stuff (structured concurrency and lots of custom AsyncSequence operators) which maybe CC just isn‘t suitable for.
Whipping up greenfield projects is almost magical, of course. But that’s not most of my work.
Interesting read, but strange to totally ignore the macOS ChatGPT app that optionally integrates with a terminal session, the currently open VSCode editor tab, Xcode, etc. I use this combination at least 2 or 3 times a month, and even if my monthly use is less than 40 minutes total, it is a really good tool to have in your toolbelt.
The other thing I disagree with is the coverage of gemini-cli: if you use gemini-cli for a single long work session, then you must set your Google API key as an environment variable when starting gemini-cli, otherwise you end up after a short while using Gemini 2.5 Flash, and that leads to unhappy results. So, use gemini-cli for free for short and focused 3 or 4 minute work sessions and you are good, or pay for longer work sessions, and you are good.
I do have a random off topic comment: I just don’t get it: why do people live all day in an LLM-infused coding environment? LLM based tooling is great, but I view it as something I reach for a few times a day for coding and that feels just right. Separately, for non-coding tasks, reaching for LLM chat environments for research and brainstorming is helpful, but who really needs to do that more than once or twice a day?
>I made a CLI logs viewers and querier for my job, which is very useful but would have taken me a few days to write (~3k LoC)
I recall The Mythical Man-Month stating a rough calculation that the average software developer writes about 10 net lines of new, production-ready code per day. For a tool like this going up an order of magnitude to about 100 lines of pretty good internal tooling seems reasonable.
OP sounds a few cuts above the 'average' software developer in terms of skill level. But here we also need to point out a CLI log viewer and querier is not the kind of thing you actually needed to be a top tier developer to crank out even in the pre-LLM era, unless you were going for lnav [1] levels of polish.
A lot of the Mythical Man-Month is timeless, but for a stat like that, it really is worth bearing in mind the book was written half a century ago about developers working on 1970s mainframes.
Yeah, I think that metric has grown to about 20 lines per day using 2010s-era languages and methods. So maybe we could think of LLM usage as an attempt to bring it back down to 10 per day.
Relying on LLM for any skill, especially programming, is like cutting your own healthy legs and buying crutches to walk. Plus you now have to pay $49/month for basic walking ability and $99/month for "Walk+" plan, where you can also (clumsily) jog.
There are a lot of skills which I haven't developed because I rely on external machines to handle it for me; memorization, fire-starting, navigation. On net, my life is better for it. LLMs may or may not be as effective at replacing code development as books have been at replacing memorization and GPS has been at replacing navigation, but eventually some tool will be and I don't think I'll be worse off for developing other skills.
GPS is a particularly good analogy... Lose it for any reason and suddenly you are helpless without backup navigation aids. But a compass, paper map, watch and sextant will still work!
This article makes me want to try building a token field in Flutter using an LLM chat or agent. Chat should be enough. A few iterations to get the behaviour and the tests right. A bit of styling to make it look Apple-nice. As if a regular dev would do much better/quicker for this use case - such a bad example imo, I don't buy it.
> By being particularly bad at anything outside of the most popular languages and frameworks, LLMs force you to pick a very mainstream stack if you want to be efficient.
Do they? I’ve found Clojure-MCP (https://github.com/bhauman/clojure-mcp) to be very useful. OTOH, I’m not attempting to replace myself, only augment myself.
Thanks for the link! I used to use Clojure a lot professionally, but now just for fun projects, and to occasionally update my old Clojure book. I had bookmarked Clojure-MCP a while ago but never got back to it; now I will give it a try.
I like your phrasing of “OTOH, I’m not attempting to replace myself, only augment myself.” because that is my personal philosophy also.
I think we're still in the gray zone of the "Incessant Obsolescence Postulate" (the Wait Calculation). Are you better off "skilling up" on the tech as it is today, or waiting for it to just "get better" so by the time you kick off, you benefit from the solved-problems X years from now. I also think this calculation differs by domain, skill level, and your "soft skill" abilities to communicate, explain and teach. In some domains, if you're not already on this train, you won't even get hired anymore.
The current state of LLM-driven development is already several steps down the path of an end-game where the overwhelming majority of code is written by the machine; our entire HCI for "building" is going to be so far different to how we do it now that we'll look back at the "hand-rolling code era" in a similar way to how we view programming by punch-cards today. The failure modes, the "but it SUCKS for my domain", the "it's a slot machine" etc etc are not-even-wrong. They're intermediate states except where they're not.
The exceptions to this end-game will be legion and exist only to prove the end-game rule.
So many articles should prepend “My experience with ...” to their title. Here is OP's first sentence: “I spent the past ~4 weeks trying out all the new and fancy AI tools for software development.” Dude, you have had some experiences and they are worth writing up and sharing. But your experiences are not a stand-in for "the current state." This point applies to a significant fraction of HN articles, to the point that I wish the headlines were flagged “blog”.
Clickbait gets more reach. It's an unfortunate thing. I remember Veritasium in a video even saying something along the lines of him feeling forced to do clickbaity YouTube because it works so well.
The reach is big enough to not care about our feelings. I wish it wasn't this way.
OP did miss the VSCode extension for Claude Code; it is still terminal based, but:
- it shows you the diff of the incoming changes in VSCode (like git)
- it knows the line you selected in the editor for context
Good read. I just want to point out that LLMs seem to write better React code, but as an experienced frontend developer my opinion is that they are also bad at React. Their approach is outdated as it doesn't follow the latest guidelines. They write React as I would have written it in 2020. So as usual, you need to feed the right context to get proper results.
I have not tried every IDE/CLI or models, only a few, mostly Claude and Qwen.
I work mostly in C/C++.
The most valuable improvement of using this kind of tools, for me, is to easily find help when I have to work on boring/tedious tasks or when I want to have a Socratic conversation about a design idea with a not-so-smart but extremely knowledgeable colleague.
But for anything requiring a brain, it is almost useless.
Strange post. It reads in part like an incoherent rant and in part as a well made analysis.
It’s mostly on point though. In recent years I’ve been assigned to manage and plan projects at work, and I think the skills I’ve learnt from that greatly help in getting effective results from an LLM.
> By being particularly bad at anything outside of the most popular languages and frameworks, LLMs force you to pick a very mainstream stack if you want to be efficient.
I haven't found that to be true with my most recent usage of AI. I do a lot of programming in D, which is not popular like Python or Javascript, but Copilot knows it well enough to help me with things like templates, metaprogramming, and interoperating with GCC-produced DLL's on Windows. This is true in spite of the lack of a big pile of training data for these tasks. Importantly, it gets just enough things wrong when I ask it to write code for me that I have to understand everything well enough to debug it.
I have a biased opinion since I work for a background agent startup currently - but there are more (and better!) out there than Jules and Copilot that might address some of the author's issues.
By no means are better background agents "mythical" as you claim. I didn't bother to mention them as it is easy enough to search for asynchronous/background agents yourself.
Devin is perhaps the one that is most fully featured and I believe has been around the longest. Other examples that seem to be getting some attention recently are Warp, Cursor's own background agent implementation, Charlie Labs, Codegen, Tembo, and OpenAI's Codex.
I do not work for any of the aforementioned companies.
>Ah yes. An unverifiable claim followed by "just google them yourself".
Some agent scaffolding performs better on benchmarks than others given the same underlying base model - see SWE Bench and Terminal Bench for examples.
Some may find certain background agents better than others simply because of UX. Some background agents have features that others don't - like memory systems, MCP, 3rd party integrations, etc.
I maintain it is easy to search for examples of background coding agents that are not Jules or Copilot. For me, searching "background coding agents" on google or duckduckgo returns some of the other examples that I mentioned.
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
I'm sorry, but I disagree with this claim. That is not my experience, nor that of many others. It's true that you can make them do something without learning anything. However, it takes time to learn what they are good and bad at, what information they need, and what nonsense they'll do without express guidance. It also takes time to know what to look for when reviewing results.
I also find that they work fine for languages without static types. You need tests, yes, but you need them anyway.
> By being particularly bad at anything outside of the most popular languages and frameworks, LLMs force you to pick a very mainstream stack if you want to be efficient.
Almost like hiring and scaling a team? There are also benchmarks that specifically measure this, and it's in theory a very temporary problem (the Aider Polyglot Benchmark is one such).
"LLMs won’t magically make you deliver production-ready code"
Either I'm extremely lucky, or I was just lucky to find the guy who said it must all be test-driven and guided by the usual principles like DRY. Claude Code works absolutely fantastically nine times out of ten, and when it doesn't we just roll back the three hours of nonsense it did, postpone the feature, or give it extra guidance.
I'm beginning to suspect robust automated tests may be one of the single strongest indicators for if you're going to have a good time with LLM coding agents or not.
If there's a test suite for the thing to run it's SO much less likely to break other features when it's working. Plus it can read the tests and use them to get a good idea about how everything is supposed to work already.
Telling Claude to write the test first, then execute it and watch it fail, then write the implementation has been giving me really great results.
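Concretely, the loop looks something like this - a minimal sketch with pytest and a hypothetical slugify() helper (the module and function names are placeholders, not from any real project):
```
# tests/test_slugify.py - written (or generated) first, before any implementation exists.
# Running `pytest` at this point fails with an ImportError, and that failure is exactly
# what the agent is told to observe before it is allowed to write the implementation.
from myproject.text import slugify  # hypothetical module; does not exist yet


def test_basic_slug():
    assert slugify("Hello, World!") == "hello-world"


def test_collapses_whitespace_and_punctuation():
    assert slugify("  Claude   writes  tests?? ") == "claude-writes-tests"


def test_empty_string_stays_empty():
    assert slugify("") == ""
```
Only after watching those tests fail does it get the follow-up prompt to implement slugify() until the suite passes.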
My favorite setup so far is using the Claude code extension in VScode. All the power of CC, but it opens files and diffs in VScode. Easy to read and modify as needed.
There are kind of a lot of errors in this piece. For instance, the problem the author had with Gemini CLI running out of tokens in ten minutes is what happens when you don’t set up (a free) API key in your environment.
> By being particularly bad at anything outside of the most popular languages and frameworks, LLMs force you to pick a very mainstream stack if you want to be efficient.
I use clojure for my day-to-day work, and I haven't found this to be true. Opus and GPT-5 are great friends when you start pushing limits on Clojure and the JVM.
> Or 4.1 Opus if you are a millionaire and want to pollute as much as possible
I know this was written tongue-in-cheek, but at least in my opinion it's worth it to use the best model if you can. Opus is definitely better on harder programming problems.
> GPT 4.1 and 5 are mostly bad, but are very good at following strict guidelines.
This was interesting. At least in my experience GPT-5 seemed about as good as Opus. I found it to be _less_ good at following strict guidelines though. In one test Opus avoided a bug by strictly following the rules, while GPT-5 missed it.
"Google’s enshittification has won and it looks like no competent software developers are left. I would know, many of my friends work there". Ouch ... I hope his friends are in marketing!
They missed OpenAI Codex, maybe deliberately? It's less llm-development and more vibe-coding, or maybe "being a PHB of robots". I'm enjoying it for my side project this week.
Yet another developer who is too full of themselves to admit that they have no idea how to use LLMs for development. There's an arrogance that can set in when you get to be more senior and unless you're capable of force feeding yourself a bit of humility you'll end up missing big, important changes in your field.
It becomes farcical when not only are you missing the big thing but you're also proud of your ignorance and this guy is both.
Personally, I’ve had a pretty positive experience with the coding assistants, but I had to spend some time to develop intuition for the types of tasks they’re likely to do well. I would not say that this was trivial to do.
Like if you need to crap out a UI based on a JSON payload, make a service call, add a server endpoint, LLMs will typically do this correctly in one shot. These are common operations that are easily extrapolated from their training data. Where they tend to fail are tasks like business logic which have specific requirements that aren’t easily generalized.
I’ve also found that writing the scaffolding for the code yourself really helps focus the agent. I’ll typically add stubs for the functions I want, and create overall code structure, then have the agent fill the blanks. I’ve found this is a really effective approach for preventing the agent from going off into the weeds.
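For example (all names hypothetical, just to show the shape of it), I'll hand the agent a skeleton like this and ask it to fill in only the bodies marked TODO:
```
# report.py - scaffolding I write by hand; the agent only fills in the TODO bodies.
from dataclasses import dataclass


@dataclass
class LogEntry:
    timestamp: str
    level: str
    message: str


def parse_line(raw: str) -> LogEntry:
    """Parse a single 'TIMESTAMP LEVEL message' line into a LogEntry."""
    raise NotImplementedError  # TODO: agent fills this in


def summarize(entries: list[LogEntry]) -> dict[str, int]:
    """Count entries per level, e.g. {'ERROR': 3, 'INFO': 12}."""
    raise NotImplementedError  # TODO: agent fills this in


def render_report(counts: dict[str, int]) -> str:
    """Render the counts as a small plain-text table."""
    raise NotImplementedError  # TODO: agent fills this in
```
Because the types, names, and docstrings pin down the structure, the agent has far less room to wander off into the weeds.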
I also find that if it doesn’t get things right on the first shot, the chances are it’s not going to fix the underlying problems. It tends to just add kludges on top to address the problems you tell it about. If it didn’t get it mostly right at the start, then it’s better to just do it yourself.
All that said, I find enjoyment is an important aspect as well and shouldn’t be dismissed. If you’re less productive, but you enjoy the process more, then I see that as a net positive. If all LLMs accomplish is to make development more fun, that’s a good thing.
I also find that there's use for both terminal based tools and IDEs. The terminal REPL is great for initially sketching things out, but IDE based tooling makes it much easier to apply selective changes exactly where you want.
As a side note, got curious and asked GLM-4.5 to make a token field widget with React, and it did it in one shot.
It's also strange not to mention DeepSeek and GLM as options given that they cost orders of magnitude less per token than Claude or Gemini.
I have never heard anybody successfully using LLMs say this before. Most of what I've learned from talking to people about their workflows is counterintuitive and subtle.
It's a really weird way to open up an article concluding that LLMs make one a worse programmer: "I definitely know how to use this tool optimally, and I conclude the tool sucks". Ok then. Also: the piano is a terrible, awful instrument; what a racket it makes.
I feel like every time I have a prompt or use a new tool, I'm experimenting with how to make fire for the first time. It's not to say that I'm bad at it. I'm probably better than most people. But knowing how to use this tool is by far the largest challenge, in my opinion.
That's a wild statement. I'm now extremely productive with LLMs in my core codebases, but it took a lot of practice to get it right and repeatable. There's a lot of little contextual details you need to learn how to control so the LLM makes the right choices.
Whenever I start working in a new code base, it takes a non-trivial amount of time to ramp back up to full LLM productivity.
I am still hesitant to use AI to solve problems for me. Either it hallucinates and misleads me, or it does a great job and I worry that my ability to reason through complex problems with rigor will degenerate. Once my ability to solve complex problems has degenerated, my patience diminished, and my attention span destroyed, I will be utterly reliant on a service that other entities own just to perform in my daily life. Genuine question - are people comfortable with this?
My comment is specifically in contrast to working in a codebase where I'm at "max AI productivity". In a new codebase, it just takes a bit of time to work out kinks and figure out tendencies of the LLMs in those codebases. It's not that I'm slower than I'd be without AI, I'm just not at my "usual" AI-driven productivity levels.
It's a question of degree, but in general, yeah. I'm totally comfortable being reliant on other entities to solve complex problems for me.
That's how economies work [1]. I neither have nor want to acquire the lifetime of experience I would need to learn how to produce the tea leaves in my tea, or the clean potable water in it, or the mug they are contained within, or the concrete walls 50 meters up from ground level I am surrounded by, or so on and so forth. I can live a better life by outsourcing the need for this specialized knowledge to other people, and trade with them in exchange for my own increasingly-specialized knowledge. Even if I had 100 lifetimes to spend, and not the 1 I actually have, I would probably want to put most of them to things that, you know, aren't already solved-enough problems.
Everyone doing anything interesting works like this, with vanishingly few exceptions. My dad doesn't need to know how to do algebra to get his taxes done, he just has an accountant. And his accountant doesn't need to know how to rewire his turn of the century New England home. And if you look at the exceptions, like that really cute 'self sufficient' family who uploads weekly YouTube videos called "Our Homestead Life"... It often turns out that the revenue from that YouTube stream is nontrivial to keeping the whole operation running. In other words, even if they genuinely no longer go to Costco, it's kind of a gyp.
[1]: https://www.youtube.com/watch?v=67tHtpac5ws
This is not quite the same thing. The AI is not perfect; it frequently makes mistakes or writes suboptimal code. As a software engineer, you are responsible for finding and fixing those. This means you have to review and fully understand everything that the AI has written.
Quite a different situation than your dad and his accountant.
Is the accountant perfect? If not, then they must themselves make mistakes or do things suboptimally sometimes. Whose responsibility is that - my dad, or my dad's accountant?
If it is my dad, does that then mean my dad has an obligation to review and fully understand everything the accountant has written?
And do we have to generalize that responsibility to everything and everyone my dad has to hand off work to in order to get something done? Clearly not, that's absurd. So where do we draw the line? You draw it in the same place I do for right now, but I don't see why we expect that line to be static.
Yes, and people who care and are knowledgeable do this already. I do this, for one.
And so if you don't use it, then someone else will... But as for the models, we already have some pretty good open source ones like Qwen, and it'll only get better from here, so I'm not sure why the last part would be a dealbreaker.
> Whenever I start working in a new code base, it takes a non-trivial amount of time to ramp back up to full LLM productivity.
Do you find that these details translate between models? Sounds like it doesn't translate across codebases for you?
I have mostly moved away from this sort of fine-tuning approach because of experience a while ago with OpenAI's ChatGPT 3.5 and 4. Extra work on my end that was necessary with the older model wasn't with the new one, and sometimes it counterintuitively caused worse performance by pointing the model at the way I'd do it vs the way it might have the best luck with. ESPECIALLY for the sycophantic models, which will heavily index on "if you suggested that this thing might be related, I'll figure out some way to make sure it is!"
So more recently I generally stick to the "we'll handle a lot of the prompt nitty gritty for you" IDE or CLI agent stuff, but I find they still fall apart with large complex codebases, and also that the tricks don't translate across codebases.
* Business context - these are things like code quality/robustness, expected spec coverage, expected performance needs, domain specific knowledge. These generally translate well between models, but can vary between code bases. For example, a core monolith is going to have higher standards than a one-off auxiliary service.
* Model focuses - Different models have different tendencies when searching a code base and building up their context. These are specific to each code base, but relatively obvious when they happen. For example, in one code base I work in, one model always seems to pick up our legacy notification system while another model happens to find our new one. It's not really a skill issue. It's just luck of the draw how files are named and how each of them search. They each just find a "valid" notification pattern in a different order.
LLMs are massively helpful for orienting to a new codebase, but it just takes some time to work out those little kinks.
Getting 80% of the benefit of LLMs is trivial. You can ask it for some functions or to write a suite of unit tests and you’re done.
The last 20%, while possible to attain, is ultimately not worth it for the amount of time you spend in context hells. You can just do it yourself faster.
I'm arguing that there's a skill that has to be learned in order to break through this. As you start in a new code base, you should be quick to jump in when you hit that 20%. But, as you spend more time in it, you learn how to avoid the same "context hell" issues and move that number down to 15%, 10%, 5% of the time.
You're still going to need to jump in, but when you can learn to get the LLM to write 95% of the code for you, that's incredibly powerful.
I’m making this a bit contrived, but I’m simplifying it to demonstrate the underlying point.
When an LLM is 80% effective, I'm limited to doing 5 things in parallel, since I still need to jump in 20% of the time.
When an LLM is 90% effective, I can do 10 things at once. When it's 95%, 20 things. 99%, 100 things.
Now, obviously I can’t actually juggle 10 or 20 things at once. However, the point is there are actually massive productivity gains to be had when you can reduce your involvement in a task from 20% to, even 10%. You’re effectively 2x as productive.
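The arithmetic is just the reciprocal of how often you have to step in - a throwaway sketch:
```
# If the agent needs my involvement a fraction (1 - effectiveness) of the time,
# I can keep roughly 1 / (1 - effectiveness) tasks in flight before I'm the bottleneck.
for effectiveness in (0.80, 0.90, 0.95, 0.99):
    involvement = 1 - effectiveness
    print(f"{effectiveness:.0%} effective -> ~{1 / involvement:.0f} parallel tasks")
```
Which prints 5, 10, 20, and 100 parallel tasks respectively - the same numbers as above.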
Or do you mean you are using long running agents to do tasks and then review those? I haven't seen such a workflow be productive so far.
I read these articles and I feel like I am taking crazy pills sometimes. The person, enticed by the hype, makes a transparently half-hearted effort for just long enough to confirm their blatantly obvious bias. They then act like they now have ultimate authority on the subject and proclaim their pre-conceived notions were definitely true beyond any doubt.
Not all problems yield well to LLM coding agents. Not all people will be able or willing to use them effectively.
But I guess "I gave it a try and it is not for me" is a much less interesting article compared to "I gave it a try and I have proved it is as terrible as you fear".
He is actually recommending Copilot for price/performance reasons and his closing statement is "Don’t fall for the hype, but also, they are genuinely powerful tools sometimes."
So it just seems like he never really tried to learn how to engineer better prompts that these more advanced models can use.
Because for all our posturing about being skeptical and data driven we all believe in magic.
Those "counterintuitive non-trivial workflows"? They work about as well as just prompting "implement X" with no rules, agents.md, careful lists etc.
Because 1) literally no one actually measures whether the magical incantations work and 2) it's impossible to make such measurements due to non-determinism
Either I've wasted significant chunks of the past ~3 years of my life or you're missing something here. Up to you to decide which you believe.
I agree that it's hard to take solid measurements due to non-determinism. The same goes for managing people, and yet somehow many good engineering managers can judge if their team is performing well and figure out what levers they can pull to help them perform better.
I talk to extremely experienced programmers whose opinions I have valued for many years before the current LLM boom who are now flying with LLMs - I trust their aggregate judgement.
Meanwhile my own https://tools.simonwillison.net/colophon collection has grown to over 120 in just a year and a half, most of which I wouldn't have built at all - and that's a relatively small portion of what I've been getting done with LLMs elsewhere.
Hard to measure productivity on a "wouldn't exist" to "does exist" scale.
What in the wooberjabbery is this even.
List of single-commit LLM generated stuff. Vibe coded shovelware like animated-rainbow-border [1] or unix-timestamp [2].
Calling these tools seems to be overstating it.
1: https://gist.github.com/simonw/2e56ee84e7321592f79ceaed2e81b...
2: https://gist.github.com/simonw/8c04788c5e4db11f6324ef5962127...
I wrote more about it here: https://simonwillison.net/2024/Oct/21/claude-artifacts/ - and a lot of them have explanations in posts under my tools tag: https://simonwillison.net/tags/tools/
It might also be the largest collection of published chat transcripts for this kind of usage from a single person - though that's not hard since most people don't publish their prompts.
Building little things like this is a really effective way of gaining experience using prompts to get useful code results out of LLMs.
(I know this question is futile since you will not go off-script even for a second.)
Also literally hundreds of smaller plugins and libraries and CLI tools, see https://github.com/simonw?tab=repositories (now at 880 repos, though a few dozen of those are scrapers and shouldn't count) and https://pypi.org/user/simonw/ (340 published packages).
Unlike my tools.simonwillison.net stuff the vast majority of those products are covered by automated tests and usually have comprehensive documentation too.
What do you mean by my script?
But it was already a warning before LLMs because, as you wrote, people are bad at measuring productivity (among many things).
I think many in the industry have absolutely no clue what they're doing and are bad at evaluating productivity, often prioritising short term delivery over longterm maintenance.
LLMs can absolutely be useful but I'm very concerned that some people just use them to churn out code instead of thinking more carefully about what and how to build things. I wish we had at least the same amount of discussions about those things I mentioned above as we have about whether Opus, Sonnet, GPT5 or Gemini is the best model.
https://www.youtube.com/watch?v=tbDDYKRFjhk&t=4s is one of the largest studies I've seen so far and it shows that when the codebase is small or engineered for AI use, >20% productivity improvements are normal.
It's quite possible you do. Do you have any hard data justifying the claims of "this works better", or is it just a soft fuzzy feeling?
> The same goes for managing people, and yet somehow many good engineering managers can judge if their team is performing well
It's actually really easy to judge if a team is performing well.
What is hard is finding what actually makes the team perform well. And that is just as much magic as "if you just write the correct prompt everything will just work"
---
wait. why are we fighting again? :) https://dmitriid.com/everything-around-llms-is-still-magical...
Really wild to hear someone say out loud "there's no learning curve to using this stuff".
I've tried a few things that have mostly been positive. Starting with copilot in-line "predictive text on steroids" which works really well. It's definitely faster and more accurate than me typing on a traditional intellisense IDE. For me, this level of AI is cant-lose: it's very easy to see if a few lines of prediction is what you want.
I then did Cursor for a while, and that did what I wanted as well. Multi-file edits can be a real pain. Sometimes, it does some really odd things, but most of the time, I know what I want, I just don't want to find the files, make the edits on all of them, see if it compiles, and so on. It's a loop that you have to do as a junior dev, or you'll never understand how to code. But now I don't feel I learn anything from it, I just want the tool to magically transform the code for me, and it does that.
Now I'm on Claude. Somehow, I get a lot fewer excursions from what I wanted. I can do much more complex code edits, and I barely have to type anything. I sort of tell it what I would tell a junior dev. "Hey let's make a bunch of connections and just use whichever one receives the message first, discarding any subsequent copies". If I was talking to a real junior, I might answer a few questions during the day, but he would do this task with a fair bit of mess. It's a fiddly task, and there are assumptions to make about what the task actually is.
Somehow, Claude makes the right assumptions. Yes, indeed I do want a test that can output how often each of the incoming connections "wins". Correct, we need to send the subscriptions down all the connections. The kinds of assumptions a junior would understand and come up with himself.
I spend a lot of time with the LLM critiquing, rather than editing. "This thing could be abstracted, couldn't it?" and then it looks through the code and says "yeah I could generalize this like so..." and it means instead of spending my attention on finding things in files, I look at overall structure. This also means I don't need my highest level of attention, so I can do this sort of thing when I'm not even really able to concentrate, eg late at night or while I'm out with the kids somewhere.
So yeah, I might also say there's very little learning curve. It's not like I opened a manual or tutorial before using Claude. I just started talking to it in natural language about what it should do, and it's doing what I want. Unlike seemingly everyone else.
The blogging output on the other hand ...
That is not what that paper said, lol.
Improving LLM output through better inputs is neither an illusion, nor as easy as learning how to google (entire companies are being built around improving llm outputs and measuring that improvement)
Keep in mind that the first reasoning model (o1) was released less than 8 months ago and Claude Code was released less than 6 months ago.
Slot machines on the other hand are truly random and success is luck based with no priors (the legal ones in the US anyways)
I have used neural networks since the 1980s, and modern LLM tech simply makes me happy, but there are strong limits to what I will use the current tech for.
Pseudo-random number generators remain one of the most amazing things in computing IMO. Knuth volume 2. One of my favourite books.
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
Learning how to use LLMs in a coding workflow is trivial to start, but you find you get a bad taste early if you don't learn how to adapt both your workflow and its workflow. It is easy to get a trivially good result and then be disappointed in the followup. It is easy to try to start on something it's not good at and think it's worthless.
The pure dismissal of cursor, for example, means that the author didn't learn how to work with it. Now, it's certainly limited and some people just prefer Claude code. I'm not saying that's unfair. However, it requires a process adaptation.
Not everyone with a different opinion is dumber than you.
It seems to me the biggest barrier is that the person driving the tool needs to be experienced enough to recognize and assist when it runs into issues. But that's little different from any sophisticated tool.
It seems to me a lot of the criticism comes from placing completely unrealistic expectations on an LLM. "It's not perfect, therefore it sucks."
If you want to use a tool like Claude Code (or Gemini CLI or Cursor agent mode or Code CLI or Qwen Code) to solve complex problems you need to give them an environment they can operate in where they can solve that problem without causing too much damage if something goes wrong.
You need to think about sandboxing, and what tools to expose to them, and what secrets (if any) they should have access to, and how to control the risk of prompt injection if they might be exposed to potentially malicious sources of tokens.
The other week I wanted to experiment with some optimizations of configurations on my Fly.io hosted containers. I used Claude Code for this by:
- Creating a new Fly organization which I called Scratchpad
- Assigning that a spending limit (in case my coding agent went rogue or made dumb expensive mistakes)
- Creating a Fly API token that could only manipulate that organization - so I could be sure my coding agent couldn't touch any of my production deployments
- Putting together some examples of how to use the Fly CLI tool to deploy an app with a configuration change - just enough information that Claude Code could start running its own deploys
- Running Claude Code such that it had access to the relevant Fly command authenticated with my new Scratchpad API token
With all of the above in place I could run Claude in --dangerously-skip-permissions mode and know that the absolute worst that could happen is it might burn through the spending limit I had set.
This took a while to figure out! But now... any time I want to experiment with new Fly configuration patterns I can outsource much of that work safely to Claude.
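A rough sketch of what that ends up looking like in practice (the token variable name and directory here are assumptions for illustration; the only Claude flag is the one mentioned above):
```
# run_scratchpad_agent.py - launch the coding agent with only scratchpad credentials
# in its environment, so it cannot touch production deployments.
import os
import subprocess

# Assumption: a Fly API token scoped to the throwaway "Scratchpad" org,
# stored outside the repo (password manager, separate dotfile, etc.).
scratchpad_token = os.environ["SCRATCHPAD_FLY_TOKEN"]

agent_env = {
    "PATH": os.environ["PATH"],
    "HOME": os.environ["HOME"],          # so the agent's own config is still found
    "FLY_API_TOKEN": scratchpad_token,   # flyctl reads this instead of your default login
}

# Run Claude Code inside the experiment directory with that minimal environment.
subprocess.run(
    ["claude", "--dangerously-skip-permissions"],
    cwd=os.path.expanduser("~/scratchpad-fly-experiments"),
    env=agent_env,
    check=False,
)
```
The important part isn't the script; it's that the blast radius is capped by the token scope and the spending limit, not by trusting the agent.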
Yea, there’s some grunt work involved but in terms of learned ability all of that is obvious to someone who knew only a little bit about LLMs.
edit: I don’t have personal experience around spending limits but I vaguely recall them being useful for folks who want to set up AWS resources and swing for the fences, in startups without thinking too deeply about the infra. Again this isn’t a failure mode unique to LLMs although I can appreciate it not mapping perfectly to your scenario above
edit #2: fwict the LLM specific context of your scenario above is: providing examples, setting up API access somehow (eg maybe invoking a CLI?). The rest to me seems like good old software engineering
There are plenty of useful LLM workflows that are possible to create pretty trivially.
The example you gave is not hardly the first thing a beginning LLM user would need. Yes, more sophisticated uses of an advanced tool require more experience. There's nothing different from any other tool here. You can find similar debates about programming languages.
Again, what I said in my original comment applies: people place unrealistic expectations on LLMs.
I suspect that this is at least partly is a psychological game people unconsciously play to try to minimize the competence of LLMs, to reduce the level of threat they feel. A sort of variation of terror management theory.
I don’t have to debug Emacs every day to write code. My CI workflow just runs every time a PR is created. When I type ‘make tests’, I get a report back. None of those things are perfect, but they are reliable.
Just like I can recognize a clueless frontend developer when they say "React is basically just a newer jquery". Recognizing clueless engineers when they talk about AI can be pretty easy.
It's a sector that is both old and new: AI has been around forever, but even people who worked in the sector years ago are taken aback by what is suddenly possible, the workflows that are happening... hell, I've even seen cases where it's the very people who have been following GenAI forever that have a bias towards believing it's incapable of what it can do.
For context, I lead an AI R&D lab in Europe (https://ingram.tech/). I've seen some shit.
Copilot isn't an LLM, for a start. You _combine_ it with a selection of LLMs. And it absolutely has severe limitations compared to something like Claude Code in how it can interact with the programming environment.
"Hallucinations" are far less of a problem with software that grounds the AI to the truth in your compiler, diagnostics, static analysis, a running copy of your project, runnning your tests, executing dev tools in your shell, etc.
You're being overly pedantic here and moving goalposts. Copilot (for coding) without an LLM is pretty useless.
I stand by my assertion that these tools are all basically the same fundamental tech - LLMs.
Over generalizing. The synergy between the LLM and the client (cursor, Claude code, copilot, etc) make a huge difference in results.
With LLMs, the point is to eliminate tedious work in a trivial way. If it’s tedious to get an LLM to do tedious work, you have not accomplished anything.
If the work is not trivial enough for you to do yourself, then using an LLM will probably be a disaster, as you will not be able to judge the final output yourself without spending nearly the same amount of time it takes for you to develop the code on your own. So again, nothing is gained, only the illusion of gain.
The reason people think they are more productive using LLMs to tackle non-trivial problems is because LLMs are pretty good at producing “office theatre”. You look like you’re busy more often because you are in a tight feedback loop of prompting and reading LLM output, vs staring off into space thinking deeply about a problem and occasionally scribbling or typing something out.
We are learning that this is not going to be magic. There are some cases where it shines. If I spend the time, I can put out prototypes that are magic and I can test with users in a fraction of the time. That doesn't mean I can use that for production.
I can try three or four things during a meeting where I am generally paying attention, and look afterwards to see if any of them are worth pursuing.
I can have it work through drudgery if I provide it an example. I can have it propose a solution to a problem that is escaping me, and I can use it as a conversational partner for the best rubber duck I've ever seen.
But I'm adapting myself to the tool and I'm adapting the tool to me through learning how to prompt and how to develop guardrails.
Outside of coding, I can write chicken scratch and provide an example of what I want, and have it write a proposal for a PRD. I can have it break down a task, generate a list of proposed tickets, and after I've gone through them have it generate them in Jira (or anything else with an API). But the more I invest into learning how to use the tool, the less I have to clean up after.
Maybe one day in the future it will be better. However, the time invested into the tool means that 40 bucks of investment (20 into cursor, 20 into gpt) can add 10-15% boost in productivity. Putting 200 into claude might get you another 10% and it can get you 75% in greenfield and prototyping work. I bet that agency work can be sped up as much as 40% for that 200 bucks investment into claude.
That's a pretty good ROI.
And maybe some workloads can do even better. I haven't seen it yet but some people are further ahead than me.
Everything you mentioned is also fairly trivial, just a couple of one shot prompts needed.
https://github.com/continuedev/continue
LLMs will always suck at writing code that has not been written millions of times before. As soon as you venture slightly offroad, they falter.
That right there is your learning curve! Getting LLMs to write code that's not heavily represented in their training data takes experience and skill and isn't obvious to learn.
Effective LLM usage these days is about a lot more than just the prompts.
People are claiming that it takes time to build the muscles and train the correct footing to push, while I'm here learning mechanical theory and drawing up levers. If one manages to push the rock for one meter, he comes clamoring, ignoring the many who were injured doing so, saying that one day he will be able to pick the rock up and throw it at the moon.
Anything up to 250,000 tokens I pipe into GPT-5 (prior to that o3), and beyond that I'll send them to Gemini 2.5 Pro.
For even larger code than that I'll fire up Codex CLI or Claude Code and let them grep their way to an answer.
This stuff has gotten good enough now that I no longer get stuck when new tools lack decent documentation - I'll pipe in just the source code (filtered for .go or .rs or .c files or whatever) and generate comprehensive documentation for myself from scratch.
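The "filter and pipe" step is simple enough to sketch - this is roughly what it amounts to, with a crude 4-characters-per-token estimate (the script and its names are mine, not the exact tooling described above):
```
# dump_source.py - concatenate a project's source files into one prompt-sized blob.
import sys
from pathlib import Path

EXTENSIONS = {".go", ".rs", ".c", ".h"}  # adjust per project
CHARS_PER_TOKEN = 4                      # crude rule of thumb for estimating tokens


def dump(root: str) -> str:
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in EXTENSIONS:
            parts.append(f"\n--- {path} ---\n{path.read_text(errors='replace')}")
    return "".join(parts)


if __name__ == "__main__":
    blob = dump(sys.argv[1])
    print(blob)
    print(f"~{len(blob) // CHARS_PER_TOKEN} tokens", file=sys.stderr)
```
The stdout goes to whichever model has the context window to fit it; the token estimate on stderr tells you which that is.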
You don't have the luxury of having someone who is deeply familiar with the code sanity check your perceived understanding of it, i.e. you don't see where the LLM is horribly off-track because you don't have sufficient understanding of that code to spot the error. In enterprise contexts this is very common though, so it's quite likely that a lot of the haters here have seen PRs submitted by vibecoders to their own work that were inadequate enough that they started to blame the tool. For example, I have seen someone reinvent the wheel of the session handling in a client library because they were unaware that the existing session came batteries included, and the LLM didn't hesitate to write the code again for them. The code worked and everything checked out, but because the developer didn't know what they didn't know, they submitted a janky mess.
What are those things that they are good for? And consistently so?
In case it matters, I was using Copilot, which is 'free' for me because my dayjob is open source, and the model was Claude Sonnet 3.7. I've not yet heard anyone else saying the same as me, which is kind of peculiar.
My personal experience has been that AI has trouble keeping the scope of the change small and targeted. I have only been using Gemini 2.5 Pro though, as we don’t have access to other models at my work. My friend tells me he uses Claude for coding and Gemini for documentation.
Most people I've seen espousing LLMs and agentic workflows as a silver bullet have limited experience with the frameworks and languages they use with these workflows.
My view currently is one of cautious optimism; that LLM workflows will get to a more stable point whereby they ARE close to what the hype suggests. For now, that quote that "LLMs raise the floor, not the ceiling" I think is very apt.
LinkedIn is full of BS posturing, ignore it.
If you go by MBA types on LinkedIn that aren’t really developers or haven’t been in a long time, now they can vibe out some react components or a python script so it’s a revolution.
I tend to strongly agree with the "unpopular opinion" about the IDEs mentioned versus CLI (specifically, aider.chat and Claude Code).
Assuming (this is key) you have mastery of the language and framework you're using, working with the CLI tool in 25 year old XP practices is an incredible accelerant.
Caveats:
- You absolutely must bring taste and critical thinking, as the LLM has neither.
- You absolutely must bring systems thinking, as it cannot keep deep weirdness "in mind". By this I mean the second and third order things that "gotcha" about how things ought to work but don't.
- Finally, you should package up everything new about your language or frameworks since a few months or year before the knowledge cutoff date, and include a condensed synthesis in your context (e.g., Swift 6 and 6.1 versus the 5.10 and 2024's WWDC announcements that are all GPT-5 knows).
For this last one I find it useful to (a) use OpenAI's "Deep Research" to first whitepaper the gaps, then another pass to turn that into a Markdown context prompt, and finally bring that over to your LLM tooling to include as needed when doing a spec or in architect mode. Similarly, (b) use repomap tools on dependencies if creating new code that leverages those dependencies, and have that in context for that work.
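For (b), the repomap can be as crude as dumping the public signatures out of the dependency's source and pasting that into context. A rough sketch for Python dependencies (for Swift or anything else you'd swap in the relevant parser):
```
# repomap.py - list top-level classes and function signatures from a package's .py files,
# as a cheap map of an unfamiliar dependency to include in the agent's context.
import ast
import sys
from pathlib import Path


def signatures(package_dir: str) -> list[str]:
    lines = []
    for path in sorted(Path(package_dir).rglob("*.py")):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                lines.append(f"{path.name}: def {node.name}({args})")
            elif isinstance(node, ast.ClassDef):
                lines.append(f"{path.name}: class {node.name}")
    return lines


if __name__ == "__main__":
    print("\n".join(signatures(sys.argv[1])))
```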
I'm confused why these two obvious steps aren't built into leading agentic tools, but maybe handling the LLM as a naive and outdated "Rain Man" type doesn't figure into mental models at most KoolAid-drinking "AI" startups, or maybe vibecoders don't care, so it's just not a priority.
Either way, context based development beats Leroy Jenkins.
It seems to me that currently there are 2 schools of thought:
1. Use repomap and/or LSP to help the models navigate the code base
2. Let the models figure things out with grep
Personally, I am 100% a grep guy, and my editor doesn't even have LSP enabled. So, it is very interesting to see how many of these agentic tools do exactly the same thing.
And Claude Code /init is a great feature that basically writes down the current mental model after the initial round of grep.
The strategy of one or the other leaves differing big gaps and requires context or prompt work to compensate.
They should be using 1 to maintain an overall lay of the land, and 2 before writing any code.
On the management side, however, we have all sorts of AI mandates, workshops, social media posts hyping our AI stuff, our whole "product vision" is some AI-hallucinated nightmare that nobody understands, you'd genuinely think we've been doing nothing but AI for the last decade the way we're contorting ourselves to shove "AI" into every single corner of the product. Every day I see our CxOs posting on LinkedIn about the random topic-of-the-hour regarding AI. When GPT-5 launched, it was like clockwork, "How We're Using GPT-5 At $COMPANY To Solve Problems We've Never Solved Before!" mere minutes after it was released (we did not have early access to it lol). Hilarious in retrospect, considering what a joke the launch was like with the hallucinated graphs and hilarious errors like in the Bernoulli's Principle slide.
Despite all the mandates and mandatory shoves coming from management, I've noticed the teams I'm close with (my team included) are starting to push back themselves a bit. They're getting rid of the spam-generating PR bots that have never, not once, provided a useful PR comment. People are asking for the various subscriptions they were granted to be revoked because they're not using them and it's a waste of money. Our own customers' #1 piece of feedback is to focus less on stupid AI shit nobody ever asked for, and to instead improve the core product (duh). I'm even seeing our CTO, who was fanboy number 1, start dialing it back a bit and relenting.
It's good to keep in mind that HN is primarily an advertisement platform for YC and their startups. If you check YC's recent batches, you would think that the 1 and only technology that exists in the world is AI, every single one of them mentions AI in one way or another. The majority of them are the lowest effort shit imaginable that just wraps some AI APIs and is calling it a product. There is a LOT of money riding on this hype wave, so there's also a lot of people with vested interests in making it seem like these systems work flawlessly. The less said about LinkedIn the better, that site is the epitome of the dead internet theory.
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
How much of your workflow or intuition from 6 months ago is still relevant today? How long would it take to learn the relevant bits today?
Keep in mind that Claude Code was released less than 6 months ago.
If I was starting from fresh today I expect it would take me months of experimentation to get back to where I am now.
Working thoughtfully with LLMs has also helped me avoid a lot of the junk tips ("Always start with 'you are the greatest world expert in X', offer to tip it, ...") that are floating around out there.
Speaking mostly from experience of building automated, dynamic data processing workflows that utilize LLMs:
Things that work with one model, might hurt performance or be useless with another.
Many tricks that used to be necessary in the past are no longer relevant, or only applicable for weaker models.
This isn't me dimissing anyone's experience. It's ok to do things that become obsolete fairly quickly, especially if you derive some value from it. If you try to stay on top of a fast moving field, it's almost inevitable. I would not consider it a waste of time.
I recently started a fresh project, and until I got to the desired structure I only used AI to ask questions or for suggestions. I organized and wrote most of the code.
Once it started to get into the shape that felt semi-permanent to me, I started a lot of queries like:
```
- Look at existing service X at folder services/x
- see how I deploy the service using k8s/services/x
- see how the docker file for service X looks like at services/x/Dockerfile
- now, I started service Y that does [this and that]
- create all that is needed for service Y to be skaffolded and deployed, follow the same pattern as service X
```
And it would go, read existing stuff for X, then generate all of the deployment/monitoring/readme/docker/k8s/helm/skaffold for Y
With virtually no mistakes. Both Claude and Gemini are more than capable of doing such a task. I had both of them generate 10-15 files with no errors, with code that could be deployed right after (of course the service will just answer and not do much more than that).
Then, I will take over again for a bit, do some business logic specific to Y, then again leverage AI to fill in missing bits, review, suggest stuff etc.
It might look slow, but it actually cuts out the most boring and most error-prone steps when developing a medium to large k8s-backed project.
Whipping up greenfield projects is almost magical, of course. But that’s not most of my work.
It makes your existing strength and mobility greater, but don't be surprised if you fly into space that you will suffocate,
or if you fly over an ocean and run out of gas, that you'll sink to the bottom,
or if you fly the suit in your fine glassware shop with patrons in the store, that you're going to break and burn everything/everyone in there.
* I let the AI do something
* I find bad bug or horrifying code
* I realize I gave it too much slack
* hand code for a while
* go back to narrow prompts
* get lazy, review code a bit less, add more complexity
* GOTO 1, hopefully with a better instinct for where/how to trust this model
Then over time you hone your instinct on what to delegate and what to handle yourself. And how deeply to pay attention.
And promoting your own startup is usually okay if it's phrased okay :)
Ah yes. An unverifiable claim followed by "just google them yourself".
> Devin is perhaps the one that is most fully featured and I believe has been around the longest.
And it had been hilariously bad the longest. Is it better now? Maybe? I don't really know anyone even mentioning Devin anymore
> examples that seem to be getting some attention recently
So, "some attention", but you could "easily find them by searching".
> Charlie Labs, Codegen, Tembo
Never heard of them, but will take a look.
See how easy it was to mention them?
It’s not perfect but it’s okay.
https://speculumx.at/pages/read_post.html?post=59
That was an unnecessary guilt-shaming remark.