The Unreasonable Effectiveness of External Feedback Loops

How we test and validate software built by LLMs is massively important. Lots of predictions have been made about how long folks will have jobs in software engineering. I’ve watched product folks build in Lovable (and similar tools), raise their arms to the heavens in victory, then wake up to find that a bug wiped out important data and there were no backups. A classic joke is that you’re not an experienced dev until you’ve accidentally dropped a database, so maybe these are their first steps down that shared journey.

Dark software factories and the agent-as-compiler analogy

A more recent prediction that’s been gaining steam is a new conceptual way of building software called Dark Software Factories (DSFs): long-running, mostly autonomous pipelines where agents turn natural language requirements into working systems. Over the past six months, I’ve been exploring different approaches to utilizing Dark Factories in targeted capacities - mostly related to the creation of non-traditional software like agents, rather than the execution frameworks and workflows they exist within. Following the convention so far, I’ll call these Dark Agent Factories. Those targeted tests have produced really compelling results in many areas. I’m just not sure the approach generalizes across all types of software, yet. An implicit result of the predictions regarding DSFs is that software engineering expertise will become far less necessary: Dark Software Engineers (agents, really) will operate within the factories and, through continual iteration, eventually converge on a working result, ushering us into the age of software democratization.

While I can’t predict the future, I do think there’s still plenty of space for expertise. The idea that no one will ever read code again doesn’t seem realistic. I’m also not totally sold on software democratization quite yet. So far, I’ve observed outcomes that track closer to a K-shaped curve - those who could previously ship functional, quality software can now ship at superhuman rates, and those who couldn’t ship working software still can’t. What software engineering is and how we interface with computational devices will surely change, but I see it more as a convergence, with technical aptitude rising across the board. More to come on that topic in a later post.

A lot of what is driving this is an idea that has come up from a number of people over the past few months: that agents orchestrating coding assistants (Claude Code, OpenCode, Codex) are now compilers, treating our natural language instructions the same way a compiler treats source code — as input to be transformed into working software. That analogy, while easy to talk about, unfortunately falls short. While there are experiments that apply ML techniques inside compilers (loop unrolling, optimization heuristics), compilers are intended to be a largely deterministic translation between two formalized, well-defined languages (given a fixed toolchain and flags). Agent orchestration is obviously non-deterministic, but it is further differentiated by its operation on natural language and its unpredictable transformation of input into output. It’s a lot more like a human than a compiler.

Ground-truth feedback loops

Let’s go back to my opening statement: how we test and validate software built by LLMs is massively important. This is an area that isn’t getting as much hype. I’ve found that external feedback loops (compilers, immutable test suites hardened to avoid being gamed, Chrome driven via the DevTools Protocol) are remarkably effective. They provide ground truth for whatever you’ve instrumented: either the invariant holds, or it doesn’t. (Of course, the loop is only as good as what it measures. Tests can miss bugs, browser runs can be flaky, and simulators can diverge from reality.) Sometimes DSF main loops get stuck along the way, attempting to brute force a solution and failing to meet the requirements of the external feedback loop - but that is better than a “finished” solution that we merely believe is correct. This is where I think naive approaches to DSFs fall short today, although it’s ripe for improvement.
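To make the shape of this concrete, here’s a minimal sketch of an external feedback loop as a harness. Everything here is hypothetical scaffolding (the function names and the loop are mine, not from any particular factory): the key property is that the agent only sees the verdict of a check it cannot edit.

```python
import subprocess

def external_feedback(cmd: list[str]) -> tuple[bool, str]:
    """Run an external, ground-truth check (test suite, compiler, linter).

    The agent never edits the command itself, only observes the verdict,
    so it cannot "pass" by rewriting the check.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def converge(generate_attempt, check_cmd: list[str], max_iters: int = 10) -> int:
    """Hypothetical main loop: iterate the agent until the invariant holds."""
    for i in range(max_iters):
        generate_attempt()                      # agent writes/edits code
        ok, _log = external_feedback(check_cmd)  # ground truth, not self-report
        if ok:
            return i + 1                         # iterations it took
    raise RuntimeError("stuck: brute force never satisfied the check")
```

The important design choice is that `external_feedback` returns a verdict derived from a real process exit code, not from anything the model asserts about its own output.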

Kilroy, StrongDM, and the DTU

This weekend, I spent time with Dan Shapiro’s Kilroy project. Kilroy is based on StrongDM’s dark software factory design; it implements the attractor pattern and utilizes other components built by StrongDM, such as CXDB.

Kilroy is exciting, even in its current, early state. I had used Gas Town previously and was refreshed to see Kilroy take a more formalized approach, with some standards around the schemas used to direct building. I’ll add, for anyone looking to test it: it’s alpha/beta(ish) level software in heavy, active development. Today, I consider it an exploration into what could be rather than a production-quality piece of software. Tomorrow, if model improvement continues, it could become a heavily leveraged or even dominant execution pattern. Conceptually, I like a lot of the ideas in Kilroy, such as multiple, well-defined stages in the execution loop, as well as attempts to self-heal the various stages of the pipelines. It’s a little weird when you first dig in and realize that the initial build pass may fail because dependencies aren’t available and a later stage will heal that - but it’s actually a reasonable separation of concerns. Graphviz dot notation feels like another solid choice for the core language of Kilroy: it’s inspectable by humans when necessary, but sufficiently concise for machines to process without ambiguity.

One aspect of Kilroy worth questioning is its choice of Go. The compiler is itself an external feedback loop, and stricter type systems yield richer ones. In my experience, languages with stronger type systems produce better results from LLM-generated code because there are more invariants the compiler will reject before the code ever runs. Rust’s borrow checker, lifetime errors, and exhaustive pattern matching are all ground-truth checks the agent can’t hallucinate past; Go’s compiler catches less. This makes language choice a feedback loop design decision, and it’s one reason I’d question building factory-produced software in Go when stricter alternatives exist. That said, a stricter compiler is only valuable if the agent can’t routinely escape-hatch its way to a clean build. LLMs are known to fight Rust’s borrow checker by cloning liberally or wrapping everything in Arc<Mutex<>>, which compiles but defeats the purpose. The fix, luckily, is pretty easy: layer in additional static analysis (e.g., clippy lints configured as errors) that catches the shortcuts, so the feedback loop stays honest.
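As a sketch of what “lints configured as errors” could look like from the factory side, here is a hypothetical Python gate that wraps clippy. The `-D warnings` flag is real clippy/rustc usage (deny, i.e. promote to error); the specific extra lint shown is illustrative - in practice you’d pick lints that target the agent’s favorite escape hatches.

```python
import subprocess

# Sketch: treat the compiler *and* the linter as one gate. The extra lint
# name below is illustrative; choose lints aimed at the shortcuts you see
# (gratuitous .clone(), blanket Arc<Mutex<...>> wrapping, etc.).
CLIPPY_GATE = [
    "cargo", "clippy", "--all-targets", "--",
    "-D", "warnings",                  # promote every warning to a hard error
    "-D", "clippy::redundant_clone",   # catch clone-your-way-past-the-borrowck fixes
]

def gate(workdir: str) -> bool:
    """Return True only if the build survives the stricter feedback loop."""
    return subprocess.run(CLIPPY_GATE, cwd=workdir).returncode == 0
```

Because the gate is an exit code from an external toolchain, the agent can’t argue with it - it can only produce code that satisfies it.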

Another area I found a bit sparse was external feedback loops. Kilroy relies heavily on internal feedback loops since it’s built on an iterative pipeline model, but my assertion is that including a number of external, deterministic feedback mechanisms will vastly improve software creation. It looks like there may be a way to hook them in - but they don’t seem like a first-class citizen. For example, I’d love to see “run an immutable test suite and gate the next stage on structured results” show up as a canonical node type in the pipeline language, not just something you wire in ad hoc.
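A rough sketch of what that canonical node might emit, assuming a pytest-based suite (the stage name and JSON shape are hypothetical - the point is a structured verdict the next stage can parse, rather than free-form text an agent can talk its way around):

```python
import json
import subprocess
import sys

def run_gate(pytest_args=("tests",)) -> bool:
    """Hypothetical "test gate" node: run an immutable suite, emit a
    structured verdict (pass/fail + exit code + trace) for the next stage."""
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "--tb=short", *pytest_args],
        capture_output=True, text=True,
    )
    verdict = {
        "stage": "test-gate",           # hypothetical node name
        "passed": proc.returncode == 0,
        "exit_code": proc.returncode,
        "trace": proc.stdout[-4000:],   # tail of output, kept as the trace
    }
    print(json.dumps(verdict))          # machine-readable gate output
    return verdict["passed"]
```

Downstream stages branch on `passed` and attach `trace` to the next agent prompt; the suite itself stays outside the agent’s write path.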

When looking over StrongDM’s documentation to understand why they’re building the software they’re building, I discovered it was largely to use digital twins of well-known software in testing use cases. The key insight on their side was that if they took all publicly available SDK documentation, they could build pretty good, functional systems that they could then test against without any security, throttling, or quota concerns. They also built simplified UIs atop them.

Simon Willison’s coverage of StrongDM’s Software Factory is the most detailed public writeup of the approach and goes into more detail about the Digital Twin Universe (DTU). What I think is worth saying explicitly is the underlying pattern: the DTU is an external, deterministic feedback loop. The agents generate integration code, run it against behavioral clones of Okta, Jira, and Slack, check whether the results satisfy the scenarios, and iterate - sometimes thousands of times per hour - without rate limits or quotas getting in the way. The obvious footgun is fidelity: any twin will have edge cases it doesn’t model, so at some cadence you still need reality checks against the real services. It’s the same pattern as connecting Chrome via the Chrome DevTools Protocol to validate a canvas transformation, or running an immutable test suite that the agent can’t game its way through by editing the tests to return true. The mechanism that makes all of these work isn’t the orchestrator, the model, or the spec. It’s the ground-truth check against reality that the LLM can’t hallucinate its way past.
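A toy version of the digital-twin pattern, with all names hypothetical (this is my sketch of the idea, not StrongDM’s implementation): an in-memory behavioral clone of a Jira-like service that agent-generated code can hammer freely, plus a scenario check that serves as the ground truth for the loop.

```python
class FakeTicketService:
    """In-memory stand-in for a real issue tracker (deliberately simplified)."""
    def __init__(self):
        self._tickets: dict[int, dict] = {}
        self._next_id = 1

    def create(self, title: str) -> int:
        tid, self._next_id = self._next_id, self._next_id + 1
        self._tickets[tid] = {"title": title, "status": "open"}
        return tid

    def close(self, tid: int) -> None:
        self._tickets[tid]["status"] = "closed"

    def get(self, tid: int) -> dict:
        return dict(self._tickets[tid])

def scenario_close_on_merge(integration) -> bool:
    """Ground-truth check: after the integration runs, the ticket is closed."""
    twin = FakeTicketService()
    tid = twin.create("fix login bug")
    integration(twin, tid)   # agent-generated code under test
    return twin.get(tid)["status"] == "closed"
```

Because the twin is in-process, the loop can run thousands of scenario checks per hour with no throttling; passing scenarios still need to be replayed against the real service at some cadence to catch fidelity drift.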

Make feedback loops first-class

If the agent can’t be forced to confront ground truth, you didn’t build a software factory, you built a demo generator.

We need to borrow more ideas from electrical engineering and control theory and apply more rigor. Until that feedback loop is a first-class citizen in how we build software with LLMs, we’re going to keep celebrating demo-quality software that quietly drops databases.

Some “first-class” requirements I increasingly view as non-negotiable:

  • Immutable, write-protected test and scenario suites (including holdout sets the agent can’t train on by iteration).
  • Structured evaluation outputs: pass/fail + diffs + traces.
  • Instrumented end-to-end harnesses where needed (e.g., CDP for UI/canvas/layout).
  • High-throughput simulators / DTUs for integrations (plus periodic checks against reality).

And once the code passes those loops, it still has to survive contact with production:

  • Production safety rails: backups, point-in-time restore, migrations with roll-forward/rollback paths, and invariant monitoring.
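The first requirement - a suite the agent can’t quietly rewrite - can be enforced in-band with a minimal sketch like the following (paths and function names are hypothetical): snapshot a manifest of test-file hashes before the agent runs, and refuse to accept any result if the suite changed underneath it.

```python
import hashlib
import pathlib

def manifest(test_dir: str) -> dict[str, str]:
    """SHA-256 hash of every test file, keyed by path."""
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(pathlib.Path(test_dir).rglob("*.py"))
    }

def suite_untouched(before: dict[str, str], test_dir: str) -> bool:
    """The gate an agent cannot edit its way past: same files, same bytes."""
    return manifest(test_dir) == before
```

File permissions or a separate checkout are stronger protections, but even this hash check turns “the agent edited the tests to return true” from a silent failure into a hard rejection.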