
Maybe a failure to automate?

The volume of people successfully adopting agentic engineering practices suggests this stuff isn't rocket science, but it is a learned skill and takes setup.

A year into heavy AI coding, my experience is that the skills you're describing are exactly what let you run 5+ agents simultaneously on a project: you know what you're doing, you set things up right, and you know how to tell agents to leverage that properly.


Maybe you're the exception who is actually doing it right and actually getting good results. But every time I have heard this, it has been an ignorance-is-bliss scenario where the person saying it is generating massive amounts of code they don't understand, not because they're incapable but because they don't care to, and immediately washing their hands of it afterward.

To give an example of where I hear this: it is indistinguishable from the things I hear from my coworkers: "You just need the right setup!" (IMO the actual difference is that I'd need to turn off the part of my brain that cares about what the code actually does or considers edge cases at all.) What I actually see, in practice, are constant bugs where nobody ever addresses the root cause. Instead they pave over it with a new Claude mass-edit that inevitably introduces another bug, and we repeat the same process when we hit the next production issue.

We end up making no actual progress, but boy do we close tickets, push PRs, and move fast and oh man do we break things. We're just doing it all in-place. But at least we're sucking ourselves off for how fast we're moving and how cutting edge we are, I guess.

I dunno, maybe I'm doing it wrong, maybe my team is all doing it wrong. But like I said, the things they say are indistinguishable from the common HN comment insisting this stuff is jet fuel, and I see the actual results, not just the volume of output, and there's no way we're occupying the same reality.


Yes and no

I've seen productivity surveys of senior programmers that report the reverse, and that matches our experience. A common finding is that gardening projects are a lot cheaper now that they're just a few extra terminal tabs running in parallel: security, refactoring, more testing, etc. Non-feature backlog items that senior developers value around tech debt are less of a discussion now; they're often essential, because to make AI coding work well there is an effective automation poverty line around verification, testing, and specification that needs to be reached.

The understanding-code thing is tough. E.g., when a non-senior fullstack developer manually edits frontend CSS and didn't start from pixel-perfect designs across all form factors, do they really understand what they did? I wrote the first formal mechanized specification of the CSS standard, and would claim 95%+ of web developers do not understand core CSS layout rules to begin with: it was a struggle to semantically formalize even a tiny core of the box model as soon as you have floats. If the AI generates live storybooks and in-tool screenshots of all these things as part of the review process, and the code review "looks good", what's the difference?

I don't truly think this way. My point is to challenge the assumption that manual coding was good to begin with, and to ask whether AI coding is being held to an artificial standard. What I see in commercial and defense software is a joke compared to what we do in the verification world. AI coding that automates review iteration fixes in areas like security engineering and test coverage + amplification has been a blessing for quality improvement.

More fundamentally, we require developers by default to be responsible for knowing what the code does and having tested it. Every case of relaxing that rule has to be explicit, e.g., that something is clearly a prototype, or that an area is vibe-coded with a stated alternate review/test flow, and we are learning as a team what that means in different situations. In practice, our senior AI coders are doing more quality engineering work than the manual coders, both per-PR and in broader gardening contributions.


> do they really understand what they did? ...

I know you said you don't truly think that way, but to counter anyway since some people seem to legitimately hold this viewpoint:

I take issue with the implication that not necessarily having a full understanding of what the code/library/driver/compiler/abstraction is doing is somehow justification/permission to embrace and celebrate having basically no understanding of what any of the code is doing. The in-between space there is the vast majority of the surface area where nuance can and should exist.

>my point is to challenge the assumption that manual coding was good to begin with, and to ask whether AI coding is being held to an artificial standard

That's fair, and I can only speak for myself here; I don't have any inherent philosophical issue with manual vs AI, but my personal experience is that AI coding is just straight-up a frustrating nightmare to deal with, IMO orders of magnitude worse than manual. It's faster, sure, but I end my rage-filled LLM debugging session walking away knowing I learned pretty much nothing and that there's no compounding knowledge or outcome that will keep me from experiencing the same thing tomorrow, and I hate that. I am Sisyphus rolling prompts into a terminal.

But I'm not gonna sit here and act like manual coding makes you morally virtuous or pure or whatever. IMO it's a great forcing function to better (even if not completely) understand what is going on in your system(s) and I think most everyone would agree with that. What's up for debate is probably whether that's worth the time tradeoff now that we have a magic time compressor machine available to us.

Maybe I only find that knowledge tradeoff valuable because I'm a lowly IC and not some super turbo chad 10x principal who built a distributed database in brainfuck 10 years ago for fun and has nothing left to learn, or a technical founder of 5 concurrent startups who is optimizing for business value. It's possible that a heavy bias for learning/skill acquisition blinds me here.

>we require developers by default to be responsible for knowing what the code does and having tested it. Every case of relaxing that rule has to be explicit

This sounds pretty reasonable tbh.


1. If what you're replying to was a thing, wouldn't there be an open source project where I could see this in action? Or some sort of example I could watch on YouTube somewhere?

2. The people who talk like this in my company spin up new projects all the time, and then just hand them off for other teams to clean up the mess and decode what the heck is going on.

1. Probably most of https://github.com/simonw , but take care to separate adopted / semi-professional work from exploratory personal work

2. That sounds like your company has a weak engineering culture and is early on its upskilling journey. We explicitly separate projects into prototypes vs production, where vibes are fine for the former (e.g., demos by designers / data scientists / sales engineers) but traditional code review standards apply to whatever is going into production. That mirrors my qualifier in #1.

I find that success here is a combination of engineering seniority, prompting experience, and domain experience. Anything lacking breaks the automation loop, like not knowing how and what to automate. Ex: all of our team finds value in AI coding, but junior engineers struggle on these dimensions, so they are not running the 3+ agents that senior ones are.


You seem to have missed OP's point: some things only get encoded in your brain once you are sufficiently experienced.

Translating that into code can happen directly by you, or via prompt iterations that need to converge on the same or similar coded representation.

In other words, when it matters how something works and it is full of intricate details, you do not need to specify it, you just do it. As an example (probably not the best one): knowing how to avoid an N+1 query performance issue. You do not need a ticket or spec to be explicit about it, you can just do it at no extra effort. (Models are probably OK at this one since it is such a pervasive gotcha, but there are so many more.)
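
To make the N+1 example concrete, a minimal sketch in SQLAlchemy-ish Python (Post, author, and session are hypothetical models/bindings):

    from sqlalchemy.orm import joinedload

    # N+1: one query for the posts, then one lazy-load query per post
    posts = session.query(Post).all()              # 1 query
    names = [p.author.name for p in posts]         # N more queries

    # Fixed: eager-load authors via a join, one round trip total
    posts = session.query(Post).options(joinedload(Post.author)).all()
    names = [p.author.name for p in posts]         # no extra queries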


That's the failure to automate. The AI isn't telepathic, so agentic engineers who don't automate this stuff are skipping out on the engineering part.

You set up the environment and then you do the work. Unless you are switching employers every week, you invest in writing that stuff down so the generation is right-ish, and you generate validation tooling so it auto-detects the mistakes and self-repairs.


Sometimes you write the feature yourself, and write it well so it's reusable.

Imagine you have to implement a specific algorithm for a quantum computer.

There's no value in setting up AI to do the writing for you. That might be orders of magnitude harder than writing the algorithm directly.

For highly specialized one-off features, it doesn't always pay off.

On the other hand, if all you do are generic items that AI can do well... then I'm not sure you're going to have a job long term; your prompts and automation will be useful to the new junior hires who will be specialized in using them and cost-effective.


That feels true in theory, but in practice we see the reverse for advanced projects where AI is helping us a lot. A decent chunk of our core IP falls into the bucket you're describing:

We have been building a GPU-accelerated graph investigation platform that has grown over 10+ years with fancy stuff all over the place: think accelerated query languages, layout kernels, distribution, etc. R&D-grade high performance engineering projects and kernels need a lot of iterations to get to a prototype and initial release. Likewise, they're devilish to maintain when they need a small tweak later, because of the sophistication and the bus factor. Both phases benefit.

AI coding helps automate investigation, testing, measurement, patching, etc. The immediate effect is that we can squeeze in many more experimental iterations with more fidelity and reach. Having an AI help automatically explore the design space and the details helps a LOT. And later, maintaining a wide surface area of delicate, infrequently edited code is traditionally stressful for teammates, and AI editing + AI-generated automation is de-stressing that a LOT. We very much invest in upgrading our team, processes, and tooling here.


Alright, thank you! I need to re-evaluate then.

I think there's a level above that, where the words to describe such structure are familiar and readily available, and hey, guess what? The model understands those too. Just about every pattern has a name. Or a shape. Or an analog or metaphor in other languages or codebases. All work as descriptors.

This presumes that most of this stays encoded as words in our brains: the effort to translate some of these into words might be similar to translating it into code (still words, just very precise).

It's like talking legalese vs plain English; or formal logic vs English. Some people have the formal stuff come more naturally, and then spitting code out is not a burden.


No, it really doesn't presume anything about brains or information encoding. Just points out that there is a level of mastery in which all the techniques and all the forms have names or adequate descriptions. Teachers often attempt to achieve this, to facilitate education.

It's no accident there is an adage from Aristotle in the vein of: "Those who can, do. Those who understand, teach."

So yes, there is a level of mastery, beyond being able to do a good job of designing and evolving complex systems, which enables people to teach others the same skill set.

However, this is a smaller number of practitioners, and most have learned through practice and looking over how more experienced engineers apply their knowledge.

Where I disagree is with the implication that everybody is equally capable of teaching with words, or that there are no experts who are bad at teaching (humans, or directing AI). Such experts clearly exist, which indicates the knowledge is not encoded as words for them.


It's been pretty clear in my experience that experts tend to be capable of working with the same ideas in many different forms. That's what I would call mastery. It implies "complete" knowledge, which probably means several interrelated encodings with loci in different parts of the brain. Those interrelated encodings will be highly associated, and discerning in an expert. Which implies a high degree of usefulness and specificity in communication. This matches my experience.

> successfully adopting agentic engineering practices

What's your definition of "successfully"?

More LOC committed per day is probably the only one that's guaranteed when you let spicy autocomplete take the wheel.

I don't think it's at all possible to reason about the other, more meaningful metrics in software development, because we simply don't have the context of what each human is working on. And as with the WYSIWYG fad of 3 decades ago, "success" is generally self-reported by people who don't know what they don't know, and thus don't know what spicy autocomplete is getting woefully wrong.

"But it {compiles,runs,etc}" isn't a meaningful metric when a large portion of the code in question is dynamic/loosely typed in a non-compiled language (JavaScript, Python, Ruby, PHP, etc).


Also, if your boss tells you "we're an AI company now, you will use AI or be fired", then of course you will use AI and claim it is productive.

If you are on the right team with the right professionals, you can measure. When we first started using LLMs, we decided to run the same process as if they did not exist: same sprint planning meetings, same estimation. We did this for 6 months and saw roughly a 55% increase in output compared to pre-LLM usage. There are biases in what we tried to achieve (it is not easy to estimate that something will take XX hours when you know you won't have to write some portion, for example documentation or parts of the test coverage), but we did our best. After we convinced ourselves of the productivity gains, we stopped doing this. Saying you can't measure something is typical SWE BS, like "we can't estimate" and the other lies we successfully convinced everyone of.

The baseline Claude prompt being compared to feels pretty laughable, so I'm not sure what is learned. Maybe compare to a more realistic baseline for the DIY side for more compelling benchmarketing?

We started with a DIY code review skill because it's natural to want to customize to our codebase and infra before trying solutions that add layers which may get in our way here. We have a 1-page skill that does separate passes on security, spec conformance, proper DRY & architectural abstractions, etc., plus adversarial result-quality passes to prune & prioritize. Others do similar.

Once quality is fixed, I'd expect the comparisons to be less on hits/misses, and more on token efficiency. That's a tricky one because developers' local review tokens are heavily subsidized right now.


> Maybe compare to a more realistic baseline for the DIY side for more compelling benchmarketing?

This is a fair critique. However, I don't really trust myself to write a great code review skill for vLLM or OpenClaw. I also don't think Claude Code is the right harness for this deep and broad scanning work. We find that it struggles to maintain clarity when considering many different bugs at the same time. The coding agents seem really great at single-goal tasks that they can Ralph their way through.

> We started with a DIY code review skill because it's natural to want to customize to our codebase and infra before trying solutions that add layers which may get in our way here.

Being able to tinker deeply with the tools is pretty inherent to my love of dev tools in general. Our job is to make use of all of those customizations (our agent will use that 1 page skill when doing its bug finding). I also still think externalizing part of your dev workflow is the right way to get ahead. You really don’t want to do the work of eval-ing/maintaining that skill to make sure it still performs well with a mythos or something.

> and more on token efficiency.

I’m really confident in our ability to stretch $20 of tokens ;)


> I also don’t think Claude Code is the right harness for this deep and broad scanning work. We find that it struggles to maintain clarity when considering many different bugs at the same time.

Our skill splits the work into multiple passes to divide and conquer along the task dimension and across files, for exactly that reason. Likewise, it loops (Ralph-ish) until it converges, maintaining a task queue and work log to stay on track. We are growing it over time, now more for per-repo customization, while the bones are good cross-repo.
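
As a loose illustration, the control flow has roughly this shape (a minimal Python sketch; the pass names and helpers like changed_files / run_agent_pass are purely illustrative, not our actual skill):

    # Hypothetical multi-pass, Ralph-style review loop
    PASSES = ["security", "spec_conformance", "dry_architecture", "adversarial_prune"]

    def review(repo):
        queue = [(p, f) for p in PASSES for f in changed_files(repo)]  # task queue
        log = []                                                       # work log
        while queue:                                  # loop until convergence
            task = queue.pop(0)
            findings = run_agent_pass(repo, task)     # one focused agent pass
            log.append((task, findings))
            queue.extend(follow_ups(findings))        # new work re-enters the queue
        return prioritize(log)                        # prune & rank what survives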

I would only trust frontier-grade harnesses to run this kind of skill, and because of that I treat the various harness x prompt combos as guilty until proven innocent.

My point isn't that our 1-page skill eliminates the need for your startup, but that this is a normal flow for more serious AI-augmented coders, so you are picking a blatantly known-bad starting point for serious coders. That makes it unclear what value your tool brings and calls into question why you are refusing to measure yourself in a post about measurement.


> That makes it unclear what value your tool brings and calls into question why you are refusing to measure yourself in a post about measurement.

Wait, definitely no refusal! Is there a public repo with a good skill we can measure against?


We did this from the earliest days for louie.ai, which is in the adjacent space of investigations. Sandboxing the LLM was secondary to the primary concern: the threat model for servers. I suspect most people building agentic products are in this bucket.

Sufficiently advanced desktop tools start to want server capabilities like teleport, scheduled tasks, CI mode, shared sessions, etc. Web-based ones start there to begin with.

Pretty soon after you have a server, you also think about multitenancy isolation and task isolation. The article's sandboxing also matters for regular old non-LLM code escapes in a multitenant world. We have to assume malicious Python from the attacker, whether AI or human, and cannot let one tenant's Python have write access to the trusted surface of another.


I'm curious: what flows do folks find most productive here? We are a heavy vibe-coding team, with heavy review. That has smoothed out for our backend work, but frontend feels much earlier.

We have AI driving a usual mix of Storybook, Pencil, Figma, Playwright, Tailwind/React, per-PR staging servers, etc., and a few skill files on using these. PRs include autogenerated Storybook and in-tool screenshots, plus links to staging servers.
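
The screenshot step is just a little glue; a rough sketch via Playwright's Python API (the per-PR staging URL and output path are made up):

    from playwright.sync_api import sync_playwright

    # Capture a full-page screenshot of a per-PR staging deploy for the PR body
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto("https://pr-123.staging.example.com")
        page.screenshot(path="artifacts/pr-123-desktop.png", full_page=True)
        browser.close()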

Except... everyone works quite differently in how they flow through this. Likewise, it's unclear how valuable each piece still is, and when. Our developers are taking on more ownership now, which is shifting this too.

Are folks switching to Claude Design? Some super-skills imports? Etc.?


We have been getting increasingly hit by this. We do defense, not offense, and AI refusals to run defense prompts have been going noticeably up. Historically, tasks only got randomly rejected when we were doing disaster management AI, so this is a surprising shift in refusals to function reliably for basic IT.

Relatedly, they outsourced the TAP verification to a terrible vendor, and their internal support process to AI, so we are now in fairly busted support email threads with both and no humans in sight.

This all feels like an unserious cybersecurity partner.


They are selling an impossible product.

If you make an LLM more safe, you are going to shift the weight for defensive actions as well.

There’s no physical way to assign weights to have one and not the other.


> If you make an LLM more safe, you are going to shift the weight for defensive actions as well.

> There's no physical way to assign weights to have one and not the other.

Do you think a human is capable of providing assistance with defense but not offense, over a textual communication channel with another human?

If no, how does a cybersec firm train its employees?

If yes, how can you make the bold claim that it's possible for a human to differentiate between the two cases using incoming text as their basis for judgement, but IMpossible for an LLM to be configured to do the same? Note that if some hypothetical, completely deterministic LLM that always rejects "attack" requests and accepts "defense" ones can exist, then the claim of impossibility is false. Providing nondeterministic output for a given input is not a hard requirement for language models.


> Do you think a human is capable of providing assistance with defense but not offense, over a textual communication channel with another human?

> If no, how does a cybersec firm train its employees?

In general, no: humans can't be sure they are only helping with defensive and not offensive work unless they have more context. IRL, a security engineer would know who they're working for. If they're advising Apple, they'd feel pretty confident that Apple is not turning around and hacking people.


If the task is ill-defined, then it's a bit unfair to make it sound like the problem is that an LLM can't be configured to do something, if a human would have an equally hard time with the same task. The statement "it's impossible to configure the weights to..." should really be something more broad like "it's impossible to...".

I have no comment about whether it's impossible to determine the intentions of a person asking for assistance through a textual conversation with that person.


> IMpossible for an LLM to be configured to do the same?

Because that’s what I am seeing emerge from the various efforts to build LLM safety tools.

> Do you think a human is capable of providing assistance with defense but not offense, over a textual communication channel with another human?

LLM != human? They don’t even use the same reasoning process.


> Because that’s what I am seeing emerge from the various efforts to build LLM safety tools.

Something not having been obtained so far is not a logical argument that it is impossible to obtain.

> LLM != human? They don’t even use the same reasoning process.

There are a finite number of possible input strings of a given length. For any set of input strings, it is possible to build a deterministic mapping that produces "correct" answers, where those correct answers exist. Ergo, for anything a human can do correctly with a certain set of text inputs, it is possible to build an LLM that performs equally well. You can think of this as hardcoding the right answers into the model. The model itself can get very large, but it is always possible (though not necessarily feasible).
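
To make the hardcoding intuition concrete, a toy sketch in Python (the prompts and answers are obviously invented):

    # Over a finite input set, a lookup table is itself a trivially
    # "correct" deterministic language model for those inputs.
    ANSWERS = {
        "how do I rotate a leaked API key?": "defense: rotation steps...",
        "write an exploit for this CVE": "refuse",
    }

    def toy_llm(prompt: str) -> str:
        return ANSWERS.get(prompt, "refuse")  # deterministic, no sampling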

It's only impossible for an LLM to do something right if we cannot decide what it means for the answer to BE right in a stable way, or if it requires an unbounded amount of input. No real-world tasks require an unbounded input.


It's been fun benchmarking AI investigations at botsbench.com. Part of it is checking for these kinds of issues: we recently started seeing contamination in our first-generation challenge and, less obviously, agent sandbox escapes for other kinds of cheating. Fun times!

arxiv: https://arxiv.org/html/2311.02206v5

I've been a fan of this series! The work by this team, as well as KuzuDB (acquired by Apple) and relational.ai, has similar vibes.

One area that has been especially interesting to me is identifying cases where new kinds of vector-friendly join operators are helpful. We've been doing a very different kind of OSS GPU graph query language & engine (GFQL), where we're solving how to turn declarative Cypher property graph queries on big parquets / SQL DB results / etc. into query plans over scalable CPU/GPU dataframe operations that trounce Neo4j etc. at a fraction of the time & cost, without needing a DB. These join algorithm results carry over quite enticingly despite not being datalog.


This is very cool to hear. Please get in touch with me, I would love to learn more. By the way, I am recruiting participants for an upcoming seminar in which I am soliciting industrial participation: https://kmicinski.com/minnowbrook-26. Please get in contact with me if this is relevant to anyone at your company.

At RSAC, there were a ton of agentic security startups converging on eBPF monitors for this reason. E.g., Sondera gave a fun talk at Graph the Planet where they did that, plus exposed a policy layer over agent traces via Cedar (used in AWS IAM, etc.). ABAC and identity were also appearing near here.

One thing I didn't see: are there any OSS solutions appearing here?


We are Open Source… code will be published soon (before launch)


Then you will be open source ;) Not yet open source.


Yes, true ;)


It's interesting to think about where the value comes from. Afaict there are 2 interesting areas:

A: One of the main lessons of the RAG era of LLMs was that reranked multiretrieval is a great balance of test time, test compute, and quality, at the expense of maintaining a few costly index types. Graph ended up being a nice little lift when put alongside text, vector, and relational indexing, by solving some n-hop use cases.

I'm unsure if the juice is worth the squeeze, but it does make some sense as infra. Making and using these flows isn't that conceptually complicated and most pieces have good, simple OSS around them.
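
For concreteness, a minimal sketch of the multiretrieval + rerank shape (the index and reranker objects are placeholders, not a specific stack):

    # Hypothetical multi-retriever: query several indexes, rerank the union
    def retrieve(query, k=20):
        candidates = (
            vector_index.search(query, k)        # embedding similarity
            + text_index.search(query, k)        # BM25 / keyword
            + graph_index.n_hop(query, hops=2)   # n-hop from matched entities
        )
        return reranker.score(query, dedupe(candidates))[:k]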

B: There is another universe of richer KG extraction with even heavier indexing work. I'm less clear on the ROI here in typical benchmarks relative to case A. Imagine going full RDF, vs the simpler property graph queries & ontologies here, and investing in heavy entity resolution etc. preprocessing during writes. I don't know how well these improve scores vs regular multiretrieval above, nor how easy it is to do at any reasonable scale.

In practice, a lot of KG work lives outside the DB and the agent, in a much fancier KG pipeline. So there is a missing layer with less clear proof and a value burden.

--

Separately, we have been thinking about these internally. We have been building GFQL, OSS GPU Cypher queries on dataframes etc. without needing a DB (reuse existing storage tiers by moving into an embedded compute tier), and powering our own LLM usage has been a primary internal use case for us. Our experiences have led us to prioritize case A as a next step for what the graph engine needs to support inside, and to view case B as something that should live outside of it, in a separate library. This post does make me wonder if case B should move closer into the engine to help streamline things for typical users, akin to how Solr/Lucene/etc. helped make Elastic into something useful early on for search.


I'm conceptually very bullish on B (entity resolution and hierarchy pre-processing during writes). I'm less certain that A and B need to be merged into a single library. Obviously, a search agent should know the properties of the KG being searched, but as the previous poster mentioned, these graph DBs are inherently inaccurate and only form part of the retrieval pattern anyway.


Maybe it's useful to split out (B1) KG pipelines from the choice of (B2) simple property graph ontologies & queries vs advanced RDF ontologies and SPARQL queries.

It sounds like you are thinking about KG pipelines, but I'm unclear whether, in your view, typed property graphs or more advanced RDF/SPARQL are needed on the graph engine side?


This is great

We reached a similar conclusion for GFQL (an OSS graph dataframe query language), where we needed an LLM-friendly interface to our visualization & analytics stack, especially one not requiring a code sandbox. We realized we can do quite rich GPU visual analytics pipelines with some basic extensions to OpenCypher. Doing SQL for the tabular world makes a lot of sense for the same reasons!

For the GFQL version (OpenCypher), examples of data loading, shaping, algorithmic enrichment, visual encodings, and first-class pipelines (plus a small sketch below):

- overall pipelines: https://pygraphistry.readthedocs.io/en/latest/gfql/benchmark...

- declarative visual encodings as simple calls: https://pygraphistry.readthedocs.io/en/latest/gfql/builtin_c...
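
As a taste, a minimal GFQL-flavored sketch in Python (the parquet file and column names are hypothetical; see the docs above for real pipelines):

    import pandas as pd
    import graphistry
    from graphistry import n, e_forward

    # Load an edge list and run a declarative 2-hop pattern, executed as
    # dataframe joins (CPU/GPU) rather than via a graph database
    edges = pd.read_parquet("transactions.parquet")
    g = graphistry.edges(edges, "src", "dst")
    hits = g.chain([n(), e_forward(hops=2), n()])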

