It is incoherent to ask for a “confession” from an LLM. An LLM is fundamentally predicting the next token, repeatedly. If you ask it “Why did you do X?” it will not do the human thing and introspect about latent motives, the kind we are only now finding out about. It will respond in the statistically likely way, which isn’t useful.
All this is to say that if you don’t know what you’re doing with software you can shoot yourself in the foot, and now with AI agents you can shoot yourself in the foot with a machine gun.
Don’t ask the AI agent nicely not to delete your backup databases. That isn’t reliable. Do not give them write permission to anything you’re not comfortable with them writing to.
That’s a huge amount of messing around to get that handful of songs, then. If only 1% are good, you’re pulling the lever 100x more than you should need to.
How many iterations (arrangements and recordings) do you think a typical Billboard pop song goes through before it's ready for a final mix and mastering?
Go find a YouTube video of someone doing this work; it is kind of mind-blowing. Given how expensive studio time is, you realize why it costs so much for a popular artist to produce a polished album.
How can a person reconcile this comment with the one at the root of this thread? One person says Claude struggles to even meet the strict requirements of a spec sheet; another says Claude is doing a great job and doesn’t even need detailed specs?
I have my own anecdata but my comment is more about the dissonance here.
One aspect you have to consider is the differences in the human beings doing the evaluation. I had a coworker/report who would hand me obvious garbage-tier code with glaring issues even in its output, and it would take multiple iterations to address very specific review comments (once, in frustration, I showed a snippet of their output to my nontechnical mom, and even my mom wtf’ed and pointed out the problem unprompted). I’m sure all the AI-generated code I painstakingly spec, review and fix looks totally amazing to them and needs very little human input. Not saying it must be the case here, that was extreme, but it’s a very likely factor.
This is plausible. Assuming it’s true, we would expect to see vibe coding adopted at a faster rate among inexperienced developers. I think that’s true.
A counterpoint is Google saying the vast majority of their code is written by AI. The developers at Google are not inexperienced. They build complex critical systems.
But this contradiction still feels odd to me. Yes, there’s some skill to using AI, but that doesn’t feel like enough to explain the gap in perception. Your point would explain it wonderfully well, but it’s contradicted by pronouncements from major companies.
One thing I would add is that code quality is absolutely tanking. PG mentioned that YC companies reached Google-level adoption of AI-generated code years ago. Yesterday I was using the software of one such company and it has “Claude Code” levels of bugginess. I see it in a bunch of startups. One of the tells is that they seem to experience regressions, which is bizarre. I guess that indicates bugs in their AI-generated tests.
This is magical because you are both on exactly the right path and not right. My theory is there’s a sort of skill to teasing code out of AI (or maybe not, and it’s alchemy all over again), and this is all new enough, with no common vocabulary yet, that it’s hard for one person who is having a good experience and one who is not to meaningfully sort out what they are doing differently.
Alternatively, it could be that there’s a large swath of people out there so stupid they are proud of code that even a nontechnical mom can review and suggest improvements to.
I just read Steve Yegge's book Vibe Coding, and he says learning to use AI effectively is a skill of its own, and takes about a year of solid work to get good at it. It will sometimes do a good job and other times make a mess, and he has a lot of tips on how to get good results, but also says a lot of it is just experience and getting a good feel for when it's about to go haywire.
One is getting paid by a marketing program and the other isn't. Remember how much has been spent building LLMs, and the labs have now decided that coding is their money maker. I expect any negative comment on LLM coding to be replied to by at least 2 different puppets or bots.
Not necessarily; rising tide and all that. When a new scam like this emerges, it behooves all of the grifters to cooperate and not muddy the waters with distrust.
I’m normally very skeptical of conspiracy theories. But I saw an AI-booster bot responding to a negative AI post I made here.
Someone pointed out to me in the comments that the username had posted long replies to 3 completely different threads in the same minute. That and looking back at its post history confirmed it was a bot.
... or one person has a very strong mental model of what he expects to do, but the LLM has other ideas. FWIW I'm very happy with CC and Opus, but I don't treat it as a subordinate but as a peer; I leave it enough room to express what it thinks is best and guide later as needed. This may not work for all cases.
If you don’t have a very strong mental model of what you are working on, Claude can very easily guide you into building the wrong thing.
For example I’m working on a huge data migration right now. The data has to be migrated correctly. If there are any issues I want to fail fast and loud.
Claude hates that philosophy. No matter how many different ways I add my reasons, and instructions to stop, to the context, it will constantly push me towards removing crashes and replacing them with “graceful error handling”.
If I didn’t have a strong idea about what I wanted, I would have let it talk me into building the wrong thing.
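To make the disagreement concrete, here's a minimal hypothetical sketch (names invented, not my actual migration code) of the two philosophies:

```python
# What I want: fail fast and loud. One bad row aborts the whole migration.
def migrate_row(row: dict) -> dict:
    if "account_id" not in row:
        raise ValueError(f"row missing account_id: {row!r}")  # crash on purpose
    return {"account": row["account_id"], "balance": int(row["balance"])}

# What Claude keeps steering me toward: "graceful error handling" that
# silently turns a correctness bug into missing data downstream.
def migrate_row_graceful(row: dict) -> dict | None:
    try:
        return {"account": row["account_id"], "balance": int(row["balance"])}
    except (KeyError, ValueError):
        return None  # the bad row quietly disappears; the migration "succeeds"
```

In a migration, the second version is strictly worse: I'd rather crash on row one than discover a silent gap a month later.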
Claude has no taste and its opinions are mostly those of the most prolific bloggers. Treating Claude like a peer is a terrible idea unless you are very inexperienced. And even then I don’t know if that’s a good idea.
> Claude has no taste and its opinions are mostly those of the most prolific bloggers.
I often think that LLMs are like a reddit that can talk. The more I use them, the more I find this impression to be true - they have encyclopedic knowledge at a superficial level, the approximate judgement and maturity of a teenager, and the short-term memory of a parakeet. If I ask for something, I get the statistical average opinion of a bunch of goons, unconstrained by context or common sense or taste.
That’s amazing and incredible, and probably more knowledgeable than the median person, but would you outsource your thinking to reddit? If not, then why would you do it with an LLM?
> they have encyclopedic knowledge at a superficial level, the approximate judgement and maturity of a teenager, and the short-term memory of a parakeet. If I ask for something, I get the statistical average opinion of a bunch of goons, unconstrained by context or common sense or taste.
Love this paragraph; it's exactly how I feel about the LLMs. Unless you really know what you are doing, they will produce very sub-optimal code, architecturally speaking. I feel like strong acumen for proper software architecture is one of the main things that defines the most competent engineers, along with naming things properly. LLMs are a long, long way from having architectural taste.
I’ve tried that. I’ve experimented with a whole council of 13 personas including many famous developers. It’s definitely different. But it hasn’t performed significantly better in my tests.
That’s interesting to hear, as for me Claude has been quite good about writing code that fails fast and loud, and has specifically called it out more than once. It has also called out code that does not fail early in reviews.
If you add a single space to a prompt, you’ll get a completely different output, so it’s no surprise that feeding entirely different programs into the prompt produces radically different output.
My guess is that there must be something about the language (Go) or the domain (a data migration tool that uses Kafka) that triggers this.
Have you created a plan where the requirement is not to bother you with X and Y, and to use some predetermined approach? What you describe sometimes happens to me, but it happens less when it’s part of the spec.
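For what it's worth, the kind of standing instruction I mean looks something like this (wording entirely hypothetical, e.g. in a CLAUDE.md or plan file):

```markdown
## Error-handling policy (do not deviate)
- This is a data migration: correctness beats availability.
- On any malformed or unexpected input, raise immediately. Never log-and-continue.
- Do not introduce try/except fallbacks or "graceful degradation" unless explicitly asked.
```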
I think it matters what the project and tech stack is, and how much you try to get done before starting a fresh chat.
I've had interesting chats where it explained that its choice of Tailwind, for example, was because it had a ton of training knowledge on it.
I've also had it try to build more in one chat than it should many times.
For some reason OpenAI Codex is better at building a lot at once without failing - but that is total anecdata from my particular projects and ymmv.
I've had these things try to build big when a little nudge gets them to change direction and not build so much. Explaining which libraries to use, asking it to change the tech stack, and spelling out the steps to build at once seem to make things much better for my use cases.
Also, running extra checks and cleanup later is a thing; sure, a human might have seen an obvious issue at build time, but we have a bigger memory context than the model does, comparatively, imho.
I think it depends on both the complexity and the quality bars set by the engineer.
From my observations, generally AI-generated code is average quality.
Even with average quality it can save you a lot of time on some narrowly specialized tasks that would otherwise take you a lot of research and understanding. For example, you can code some deep DSP thingie (say, audio) without understanding much of what it does or how.
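As a hypothetical illustration (not from any real project): a textbook RBJ-cookbook low-pass biquad you could have an LLM generate and use without ever deriving the coefficients yourself.

```python
import numpy as np

def lowpass_biquad(x, fs, cutoff, q=0.707):
    """RBJ audio-cookbook low-pass biquad, applied in direct form I."""
    w0 = 2 * np.pi * cutoff / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([(1 - np.cos(w0)) / 2, 1 - np.cos(w0), (1 - np.cos(w0)) / 2])
    a = np.array([1 + alpha, -2 * np.cos(w0), 1 - alpha])
    b, a = b / a[0], a / a[0]  # normalize so a[0] == 1
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = (b[0] * x[n]
                + (b[1] * x[n - 1] - a[1] * y[n - 1] if n >= 1 else 0.0)
                + (b[2] * x[n - 2] - a[2] * y[n - 2] if n >= 2 else 0.0))
    return y

# e.g. smooth a noisy 48 kHz signal with a 1 kHz low-pass:
# y = lowpass_biquad(x, fs=48_000, cutoff=1_000)
```

The point isn't that this is hard; it's that getting the coefficient formulas right used to mean an afternoon with the cookbook, and now it's a prompt.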
For simpler things like backend or frontend code that doesn't require any special knowledge beyond basic backend or frontend skills - this is where the bars of quality come into play. Some people will be more than happy with AI-generated code; others won't be, depending on their experience and their requirements (speed of shipping vs. quality, which almost always resolves to speed), etc.
It could just be that each of the two reviewers is merely focusing on different sides of the same coin? I use Claude all the time. It saves me a lot of effort that I would have otherwise spent looking up specific components. The magically autocompleted pieces of boilerplate are a tangible relief. It also catches issues that I missed. But when it is wrong, it can be subtly or embarrassingly or spectacularly wrong, depending on the situation.
It boils down to scope. I use CC in both very specific one-language systems and broad backend-frontend-db-cache systems. You can guess where the difficulty lies. (Hint: it's the stuff with at least 3 distinct languages.)
I guess in the same way that one might presuppose a boat wants water?
If a piece of software doesn’t have users and the developers don’t care about the papercuts they are delivering, I would argue what they have created is more of an art project than a utility.
Science research without obvious practical application can still be important and valuable.
Art works without popular appeal can become highly treasured by some.
Open source software doesn't have to be ambitious to be worthwhile and useful. It can be artful, utilitarian, or an artifact of play. Commercial standards shouldn't be the only measure of good software.
It's more like building your own boat and then someone else coming along and saying it'll never compete with a cruise ship because it doesn't have a water slide and an endless buffet; sometimes, things in the same category can serve wholly different purposes.
There was a really good point in this podcast episode about the speed of LLMs. They are so slow that all of the progress messages and token streaming are necessary. But the core problem is that the technology is so darn slow.
As someone who both uses and builds this technology I think this is a core UX issue we’re going to be improving for a while. At times it really feels like a choose 2+ of: slow, bad, and expensive.
I have been trying Qwen3.5-9B-Claude-4.6 locally for a couple of days now, coming from OMLX. Either via Hermes or Continue in VS Code. It's okay-ish, even performance-wise.
This really should be the case. If AI tools are really making it easier to build stuff, we should see hordes of new startups solving all kinds of problems that were difficult or expensive to solve before.
I've been seeing this at the startup I've been at for the past year. We are 20 people, and we are solving fiscal reconciliation problems for HUGE companies in my country. Building things that were just not scalable before.
I'm waiting for all the cool startups in both b2b and b2c that solve health, time, or money problems.
In theory yes, but all the money is going into AI or AI-adjacent startups, so no one will actually build a product that solves problems unless it incorporates AI.
Using AI is fine. The key is to use it to build processes/systems that solve problems deterministically, instead of "asking them" to solve the problems non-deterministically themselves. It's way cheaper and more robust.
As an example, we had to be able to parse the most-used B2B bank statements. That means understanding the structure of around 30 different formats, some with very subtle differences within them.
The naive and expensive approach was to train an LLM to do it (after OCR).
Instead we used AI to generate a generalized Python parser that is "configurable" for different structures. (It also extracts data from PDFs without OCR, unless they are pure images.)
It covers 99% of the cases. And for that 1% we pass it through AI for an immediate solution, and to generate the additional deterministic config to cover it.
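A minimal sketch of the shape of it (all names, configs, and the detection heuristic here are invented for illustration):

```python
import csv
import io

# Each supported bank-statement format is data, not code.
FORMAT_CONFIGS = {
    "bank_a_v1": {"delimiter": ";", "skip_rows": 1, "date_col": 0, "amount_col": 3},
    "bank_b_v2": {"delimiter": ",", "skip_rows": 2, "date_col": 1, "amount_col": 4},
    # ... around 30 of these in practice
}

def detect_format(text: str) -> str | None:
    """Cheap deterministic sniffing; a real version would check headers too."""
    first_line = text.splitlines()[0]
    for name, cfg in FORMAT_CONFIGS.items():
        if first_line.count(cfg["delimiter"]) >= max(cfg["date_col"], cfg["amount_col"]):
            return name
    return None

def parse_statement(text: str) -> list[dict]:
    name = detect_format(text)
    if name is None:
        # The ~1% path: hand off to the LLM for an immediate answer AND a new
        # config entry, so this format is deterministic from then on.
        raise LookupError("unknown format: route to LLM fallback")
    cfg = FORMAT_CONFIGS[name]
    rows = list(csv.reader(io.StringIO(text), delimiter=cfg["delimiter"]))
    return [
        {"date": r[cfg["date_col"]], "amount": r[cfg["amount_col"]]}
        for r in rows[cfg["skip_rows"]:]
    ]
```

The LLM runs once per unseen format, not once per statement, which is where the cost and robustness win comes from.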
But I digress (lol). The point is, there are so many interesting problems that can now be solved. We should have teams of 3 to 5 people doing crazy stuff.
Because the path to Rule of Law is not deleting/refusing to enforce all laws.
Rule of Law means no one is above the law. In practice this is an aspiration (in the U.S. and everywhere else) but giving up on that isn’t going to make the world better.