The workforce caught a bug in its own code last night.

Hours before the first customer deploy, the platform caught a real security bug in its own connector code. Overnight. Without being asked.

This is the story of how that happened and why it matters.

The setup

Twenty-four hours before shipping to the first paying customer, we stood up a new piece of the workforce called the reviewer. It is a scheduled agent that walks one of our repositories every night, reads the code through a local language model, and writes any issues it finds into the brain of the engineering director who owns code quality. Nothing clever. One repository per night, rotating through the estate so each one is reviewed roughly once a month.
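The rotation can be as plain as it sounds. A minimal sketch, assuming the schedule is keyed on the calendar date; the repository names and the function `repository_for` are hypothetical and not taken from the platform:

```python
from datetime import date

# Hypothetical repository names; the real estate is not listed in this post.
REPOSITORIES = ["connector-a", "connector-b", "platform-core"]


def repository_for(night: date, repositories: list[str]) -> str:
    """Pick one repository per night, cycling through the list in order."""
    if not repositories:
        raise ValueError("no repositories to review")
    # date.toordinal() increases by one per day, so consecutive nights
    # select consecutive repositories and wrap around at the end of the list.
    return repositories[night.toordinal() % len(repositories)]


if __name__ == "__main__":
    print(repository_for(date.today(), REPOSITORIES))
```

With around thirty repositories in the list, each one would come up roughly once a month, which matches the cadence described above.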

We built it as infrastructure-hardening before ship. A nightly second pair of eyes on every line of code we were about to give a customer.

The first real run fired at 02:00 local time the following morning, while nobody was watching.

What the workforce did while we slept

The reviewer chose the repository for that night — the connector that our customer uses to read intelligence from their engineering backlog. It produced nine findings: four classified as security, one as a bug, two as refactors, two as performance observations.

All nine were written to the engineering director's personal brain. Private to him. Not visible to anyone else yet.

A different agent — the classifier — then woke up, read the nine new signals, decided which ones looked like code-review findings worth sharing with the wider team, and proposed those nine for promotion into the shared practice brain. A tenth candidate on the same pass got correctly rerouted elsewhere because it was external threat intelligence, not internal code review.

By the time the Founder opened his first browser tab that morning, the findings were waiting in a queue. One query returned all nine, ranked, each with a category tag and a suggested remediation. No dashboard. No inbox. Just the same governance surface the platform uses for every other decision the Founder ever makes.

The bug that mattered

Of the four security findings, three turned out to be false positives or theoretical-only concerns. That is useful information in itself: it suggests the reviewer errs toward flagging too much rather than too little.

The fourth was real.

Inside one of the connector's tools, a user-provided parameter was being spliced directly into a query string without escaping single quotes. Every other query in the same file correctly applied the escape. This single site had been missed. A caller in the customer's tenant could have passed a crafted value and manipulated the generated query. It was not a catastrophic exploit, because the underlying service enforces its own authorisation, but it was a genuine query-injection issue that we should not ship to our first customer.
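To make the class of bug concrete, here is a minimal sketch. It assumes the query language escapes a single quote by doubling it, as SQL and OData do; the post does not name the query language, and the field name and function names are hypothetical:

```python
def escape_single_quotes(value: str) -> str:
    """Double every single quote so the value cannot close the string literal.

    Assumption: the target query language escapes ' as ''.
    """
    return value.replace("'", "''")


def build_query_unsafe(title: str) -> str:
    # The missed site: the raw value is spliced straight into the literal.
    return f"[Title] = '{title}'"


def build_query(title: str) -> str:
    # The fixed site: the value is escaped before it is spliced in.
    return f"[Title] = '{escape_single_quotes(title)}'"


if __name__ == "__main__":
    crafted = "x' OR '1'='1"
    print(build_query_unsafe(crafted))  # the quote closes the literal early
    print(build_query(crafted))         # the whole value stays inside the literal
```

Where the backing service supports parameterised queries, those are generally a safer choice than escaping by hand, because a single missed call site cannot reintroduce the problem.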

The fix was two lines of code. It took five minutes to write, ten minutes to build into a new container image, and twenty minutes to verify end-to-end. It shipped in the same release that went to the customer a few hours later.

The finding had been waiting in the queue for about four hours by the time we opened the laptop. Without the reviewer, that bug would have gone live.

What actually worked

A self-improving system is not built from one clever component. It is built from several modest ones that compose cleanly.

The shared tool surface. The reviewer did not need any new plumbing. It writes through the same store operation that every other agent in the workforce uses. No special “code-review findings” pipeline. The generic surface absorbed a new use case the same week it was built.

Personal brain isolation. The reviewer's nine raw findings did not drop straight into the shared knowledge base. They landed first in a private space. This is deliberate. A signal is not a fact until a human or a trusted classifier has looked at it. Without that isolation, every speculative finding would pollute the corpus everyone else reads.

A classifier that actually routes. The ten candidates on that pass went two ways: nine into the code-review queue and one into an external-intelligence queue. Anything the classifier had not proposed would simply have stayed private. This is the difference between a system that notices things and a system that decides what to do with what it noticed.
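The routing decision can be sketched as a function from a signal to one of three destinations. This is a rule-based stand-in under stated assumptions: how the real classifier decides is not described here, and the `Signal` fields, source labels, and category names are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    CODE_REVIEW = "code-review queue"
    EXTERNAL_INTEL = "external-intelligence queue"
    PRIVATE = "stays private"


@dataclass(frozen=True)
class Signal:
    source: str    # hypothetical: which agent or feed produced the signal
    category: str  # e.g. "security", "bug", "refactor", "performance"
    summary: str


# Categories treated as code-review findings in this sketch.
REVIEW_CATEGORIES = {"security", "bug", "refactor", "performance"}


def route(signal: Signal) -> Route:
    """Decide where a private signal should be proposed, if anywhere."""
    if signal.source == "reviewer" and signal.category in REVIEW_CATEGORIES:
        return Route.CODE_REVIEW
    if signal.source == "threat-feed":
        return Route.EXTERNAL_INTEL
    # Default: the signal is not proposed and remains in the private space.
    return Route.PRIVATE
```

The default branch is the important design choice in this sketch: a signal the classifier does not recognise goes nowhere, so the shared corpus only grows through an explicit proposal.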

The same governance surface for everything. When the Founder approves a code-review finding, rejects another, and defers a third, he uses the identical tool he uses to approve a client engagement handover or a knowledge proposal. One verb. One pattern. One muscle memory. That is what made reviewing nine findings feel like three minutes of work rather than an inbox to wade through.
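One way to picture "one verb" is a single function that accepts any kind of proposal. A minimal sketch, assuming three possible decisions; the names `Proposal`, `Decision`, and `decide` are illustrative and not the platform's actual API:

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    DEFER = "defer"


@dataclass
class Proposal:
    kind: str   # hypothetical: "code-review finding", "engagement handover", ...
    title: str
    status: str = "pending"
    history: list[Decision] = field(default_factory=list)


# A deferred proposal stays pending so it can be decided again later.
_STATUS_AFTER = {
    Decision.APPROVE: "approved",
    Decision.REJECT: "rejected",
    Decision.DEFER: "pending",
}


def decide(proposal: Proposal, decision: Decision) -> Proposal:
    """Apply one decision; the proposal's kind never changes the call."""
    proposal.history.append(decision)
    proposal.status = _STATUS_AFTER[decision]
    return proposal
```

In this sketch a code-review finding and a client handover go through the same call, and every decision is appended to the history, which is one simple way to keep the trail auditable.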

A tolerable false-positive rate. Three of four on the security track were not real. That is fine. The cost of a false positive is a thirty-second read and a close. The value of the one true positive was a bug caught pre-ship. Any system designed to catch only real findings will miss the ones that are subtle.

The bits that had to work together

Three separate fixes had to have landed in the previous forty-eight hours for the loop to close:

The search index had to carry the right classification fields on encrypted entries. A bug in that write path would have meant the reviewer's findings existed but were invisible to any query. That was fixed two days earlier.

The governance tool's input validation had to accept the right shape of request for invoking a classifier function. A stricter-than-it-should-be schema had been rejecting valid calls the night before. That was fixed hours earlier.

The connector registry had to list the reviewer's target module so the query filter actually narrowed to the right entries. That was added in the same session.

None of these fixes was large. None of them was obviously leading to this particular outcome. But every single one had to be in place for a sleeping Founder to wake up to a curated security queue rather than a firehose of noise.

Most platforms ship features. The ones that scale ship systems that improve themselves between releases — and hand the operator the minimum human decision at the right moment.

The commercial point

A knowledge platform that catches a real security bug in its own code overnight, classifies it, routes it through its own governance surface, and presents it to the human for the minimum necessary decision — that is not a feature. That is the product.

Any vendor can claim “self-improving.” The question is whether anything has ever improved as a result. For us, the evidence is in the brain: a bug that would have shipped to a customer, caught by the platform reviewing its own code, fixed, and documented — including this case study itself, written through the same workforce.

The platform went to a paying customer today. It will do the same thing for their code base tomorrow morning.

If you want the same loop operating over your own estate, the platform supports a version of this from day one. If you want to see it run before you commit to anything, we are happy to walk you through a live instance.

See it run

A workforce that catches bugs in its own code before the first customer sees them.

Governed, compounding, auditable. The platform reviews itself while you sleep — and hands you the minimum decision in the morning.