We pointed our chaos-QA agent at our own site. It found a shipped bug.

We build an AI QA engineer, so the fair test is the obvious one: point it at ourselves. On 15 June 2026 we ran Gremlin mode, Prufa’s chaos-testing modality, against our own marketing site, prufa.dev. It found a real, user-facing bug that our CI had passed and that had shipped that same day. Here is the whole run, including the parts where the tool was wrong about itself.

What Gremlin mode actually does

A normal Prufa flow checks a path you already know to check. Gremlin is for the paths you didn’t. An LLM-backed agent drives a real browser as a deliberately difficult user — a confused newbie, an impatient double-clicker, a fat-finger typist, a back-button masher, a hostile poker — and chooses its own next action every step. It is the part of QA that needs a model: absorbing an unfamiliar UI and deciding what a frustrated human would try next.

What the agent never does is decide whether anything broke. That is the same invariant as the rest of Prufa — the LLM navigates, plain code verifies — and it is the whole reason a finding from an LLM-driven tester can be trusted: a separate layer of deterministic detectors grades the run. A 500 response, an uncaught exception, a form that accepts invalid input, content wider than the viewport, two clickable elements overlapping — those are facts, read off the live page, not opinions.

The bug: a mobile overflow CI had just shipped

Across three personas, every run reported the same verified finding at the 390px mobile viewport: the page was 103 pixels wider than the screen, with the “Run a free audit” button in the header hanging off the right edge.
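The arithmetic behind that finding is small. As a rough illustration, not Prufa’s actual code: the Python sketch below assumes the detector compares the rendered document width with the viewport width (in a browser these would typically be read from document.documentElement.scrollWidth and window.innerWidth). The Box type, the function names, and the button coordinates are hypothetical, chosen only so the numbers line up with the 103px reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of an element in CSS pixels (hypothetical type)."""
    left: float
    width: float

    @property
    def right(self) -> float:
        return self.left + self.width


def horizontal_overflow(document_width: float, viewport_width: float) -> float:
    """Pixels by which the document is wider than the viewport (0.0 if it fits)."""
    return max(0.0, float(document_width) - float(viewport_width))


def hangs_off_right_edge(box: Box, viewport_width: float) -> bool:
    """True when any part of the element lies past the viewport's right edge."""
    return box.right > viewport_width


# A 390px viewport and a document that renders 493px wide.
overflow = horizontal_overflow(document_width=493, viewport_width=390)
# Assumed position for the header button; the real coordinates are not in this post.
cta = Box(left=340, width=153)

print(overflow)                        # 103.0
print(hangs_off_right_edge(cta, 390))  # True
```

Both checks are pure functions of measured numbers, which is what lets the result be reported as a fact rather than a judgement.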

Here is the part that makes the case for chaos QA. Earlier that same day, a commit titled “fix” had added exactly the rule meant to prevent this:

@media (max-width: 520px) { .header-cta { display: none; } }

It never applied. The button is styled by a.btn-primary { display: inline-block }, whose specificity (0,1,1) outranks the bare .header-cta (0,1,0). A media query adds no specificity of its own, so the display: none was silently overridden on every phone-width render, whichever rule came later in the stylesheet. The CSS was valid. The build passed. The linter was happy. CI was green. And the bug shipped to production, where the page sat 103px too wide until an agent that had never seen our codebase resized the viewport and measured the document.
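The specificity comparison can be checked mechanically. The sketch below is a deliberately small calculator, written for illustration and not taken from any Prufa tooling. It assumes selectors built only from type names, .classes, #ids and combinators; pseudo-classes, attribute selectors and the universal selector are out of scope and would need a real CSS parser.

```python
import re

# One simple selector inside a compound: "#id", ".class", or a type name.
_TOKEN = re.compile(r"(#[\w-]+)|(\.[\w-]+)|([a-zA-Z][\w-]*)")


def specificity(selector: str) -> tuple[int, int, int]:
    """Return (ids, classes, types) for a simple selector.

    Only type names, .classes, #ids and the combinators ' ', '>', '+', '~'
    are understood; anything else is outside this sketch.
    """
    ids = classes = types = 0
    for compound in re.split(r"[\s>+~]+", selector.strip()):
        for id_, class_, type_ in _TOKEN.findall(compound):
            if id_:
                ids += 1
            elif class_:
                classes += 1
            elif type_:
                types += 1
    return ids, classes, types


media_rule = specificity(".header-cta")    # the rule inside the media query
button_rule = specificity("a.btn-primary")  # the rule that styles the button

# Tuples compare left to right, which matches how specificity is compared.
print(media_rule, button_rule, button_rule > media_rule)
# (0, 1, 0) (0, 1, 1) True
```

A check like this could run over any candidate replacement selector before it ships, although it only models specificity, not the rest of the cascade (origin, layers, !important).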

The fix was to out-specify the button:

@media (max-width: 520px) { header a.header-cta { display: none; } }

header a.header-cta has specificity (0,1,2), which beats a.btn-primary regardless of source order. After the change, a fresh build measured 0px of horizontal overflow at 390px, with the button correctly hidden. The class of bug matters here: nothing errored. A test that asserts known selectors would have stayed green forever, because the breakage was in a layout dimension no one had written an assertion about. You catch that by measuring the rendered page, not by re-running the path you already trusted. (That same phone-viewport measurement runs on every audit; the how-to is in “test your website on mobile before launch”.) If you’d rather run an audit from code, the same overflow check ships in the JSON the API returns, so you can reproduce this exact measurement against your own URL.

The safety guarantee, demonstrated on a live site

A chaos tester loose on a real site is only acceptable if it cannot change anything. In Prufa, mutations are denied by default: the run is dry-run and a network-layer guard aborts every non-GET request before it leaves the browser. A destructive click becomes a “would have mutated” finding instead of an action.

We didn’t have to take that on faith — the run logged it. Across the three personas the agent attempted between 0 and 4 mutations each; every one was blocked, and the run recorded which control it would have submitted. Real payment instruments are never used at all. To let Gremlin submit forms for real, you explicitly authorise a domain you own — and even then, hard caps bound how many submissions it can make.
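The deny-by-default rule is easy to state as code. The sketch below is one way such a guard could be shaped, not Prufa’s implementation: MutationGuard, its field names, and the example.com URLs are all assumptions made for illustration. In a real browser harness the allow() decision would sit inside a request-interception hook.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class MutationGuard:
    """Decides whether a browser request may leave the page (hypothetical design)."""
    authorised_domains: frozenset[str] = frozenset()  # domains the owner opted in
    max_submissions: int = 0                          # hard cap on real submissions
    submitted: int = 0
    findings: list[str] = field(default_factory=list)

    def allow(self, method: str, url: str, control: str) -> bool:
        method = method.upper()
        if method == "GET":
            return True  # reads always pass
        host = urlsplit(url).hostname or ""
        if host in self.authorised_domains and self.submitted < self.max_submissions:
            self.submitted += 1
            return True  # explicitly authorised and still under the cap
        # Default path: abort the request and record what it would have done.
        self.findings.append(f"would have mutated: {method} {url} via {control}")
        return False


guard = MutationGuard()  # dry-run defaults: nothing is authorised
print(guard.allow("GET", "https://example.com/", control="nav link"))        # True
print(guard.allow("POST", "https://example.com/signup", control="Sign up"))  # False
print(guard.findings)
# ['would have mutated: POST https://example.com/signup via Sign up']
```

Keeping the default constructor as the dry-run configuration means a forgotten setting fails closed: a run can only mutate when both a domain and a non-zero cap were supplied on purpose.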

Where the tool was wrong about itself

The honest part. In an earlier run, two of the gremlin’s own detectors fired on things that were not bugs:

A “dead-end / error page” detector matched the bare string 500 in ordinary marketing copy (think “save $500”), calling a healthy page an error page.
A “bad input accepted” detector treated any navigation after a form fill as a successful submission — so clicking a normal link after typing in a field looked like the app had swallowed invalid input.

A verified finding that turns out to be noise costs more trust than a missed bug costs coverage, so we did not ship around it. We added a detector false-positive policy: the error-page check now requires a strong error phrase in the page’s prominent text (title or heading) on an error-shaped page, not a substring match anywhere in the body; the bad-input check now requires a real form submission — an actual non-GET request — before it fires. Both false positives are gone, and the genuine findings (the mobile overflow) still land.
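The difference between the old and new checks fits in a few lines. This is a simplified sketch under stated assumptions: the phrase list, the PageSnapshot type and the function names are hypothetical, and the “error-shaped page” condition is left out because this post does not define it.

```python
from dataclasses import dataclass

# Assumed phrase list; the post does not say which phrases the real check uses.
STRONG_ERROR_PHRASES = ("internal server error", "page not found", "something went wrong")


@dataclass(frozen=True)
class PageSnapshot:
    title: str
    headings: tuple[str, ...]
    body_text: str


def naive_error_page(page: PageSnapshot) -> bool:
    """Old behaviour: a bare '500' anywhere in the body counts as an error page."""
    return "500" in page.body_text


def strict_error_page(page: PageSnapshot) -> bool:
    """New behaviour: a strong error phrase must appear in the title or a heading."""
    prominent = " ".join((page.title, *page.headings)).lower()
    return any(phrase in prominent for phrase in STRONG_ERROR_PHRASES)


def bad_input_check_may_fire(filled_invalid_input: bool, request_methods: list[str]) -> bool:
    """Precondition only: invalid input must be followed by a real non-GET request."""
    return filled_invalid_input and any(m.upper() != "GET" for m in request_methods)


pricing = PageSnapshot("Pricing", ("Plans",), "Switch today and save $500 a year.")
print(naive_error_page(pricing), strict_error_page(pricing))  # True False

# Typing in a field and then clicking an ordinary link produces only GET requests.
print(bad_input_check_may_fire(True, ["GET"]))          # False
print(bad_input_check_may_fire(True, ["GET", "POST"]))  # True
```

The stricter versions trade a little recall for precision: an error page that hides its message deep in the body would now be missed, which is the cost accepted in exchange for not flagging healthy pages.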

We also measured discovery quality directly. On a seeded-bug fixture with five planted bugs, the agent’s first pass found four of five (0.80 coverage); after we gave it an exploration frontier (a running list of same-site pages it hasn’t visited yet, fed back into each decision) it found all five (1.00), because it stopped looping in one corner and started covering the whole app. That number is fixture discovery quality, not a claim about your site; the point is that “does the chaos actually find the planted bugs” is something we test, with a number, rather than assert.
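An exploration frontier of that kind can be sketched in a few lines. The version below is illustrative only: the Frontier class, its method names and the example.com URLs are assumptions, and “same site” is simplified to an exact hostname match. The coverage helper reproduces the fixture arithmetic quoted above.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


class Frontier:
    """Tracks same-site pages seen in links but not yet visited (hypothetical)."""

    def __init__(self, start_url: str) -> None:
        self.host = urlsplit(start_url).hostname
        self.visited: set[str] = set()
        self.pending: list[str] = []  # a list keeps discovery order stable
        self.visit(start_url)

    @staticmethod
    def _normalise(base: str, href: str) -> str:
        # Resolve relative links and drop "#fragment" so one page is one entry.
        absolute, _fragment = urldefrag(urljoin(base, href))
        return absolute

    def visit(self, url: str) -> None:
        url = self._normalise(url, "")
        self.visited.add(url)
        if url in self.pending:
            self.pending.remove(url)

    def add_links(self, page_url: str, hrefs: list[str]) -> None:
        for href in hrefs:
            url = self._normalise(page_url, href)
            same_site = urlsplit(url).hostname == self.host
            if same_site and url not in self.visited and url not in self.pending:
                self.pending.append(url)


def coverage(found: int, planted: int) -> float:
    """Share of planted bugs that the run discovered."""
    return found / planted


frontier = Frontier("https://example.com/")
frontier.add_links(
    "https://example.com/",
    ["/pricing", "/docs#intro", "https://other.example/", "/"],
)
print(frontier.pending)  # ['https://example.com/pricing', 'https://example.com/docs']
frontier.visit("https://example.com/pricing")
print(frontier.pending)  # ['https://example.com/docs']
print(coverage(4, 5), coverage(5, 5))  # 0.8 1.0
```

Feeding frontier.pending into each decision gives the agent a concrete list of places it has not been, which is the mechanism that breaks the loop-in-one-corner behaviour.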

Why we publish the misses

A QA product that only tells flattering stories about itself is exactly the product you shouldn’t trust to test you. The mobile bug is a good demo. The false positives are a better one: they show the failure mode that matters for an LLM-driven tester, a confident, wrong “this is broken”, and they show the line we hold against it. The model proposes; plain code disposes; and when plain code gets it wrong, we fix the plain code, in the open. (The biggest miss we’ve published since: our own agent, not the bot detector, was what kept failing a customer’s login; the numbers are in “why AI agents fail at login”. We later hand-labelled a whole run’s findings and put the score in CI: four of six were wrong, and the guards that killed each false positive are described finding by finding.)

Gremlin mode is available on any paid plan — read how it works on the chaos-testing page, or run a free audit to see the deterministic side of the same engine on your own URL first. And if your tester is itself an agent, the same run we just described is one call to the MCP server behind the gremlin.

Frequently asked questions

What is chaos testing for a web app?

Chaos testing drives an app with unscripted, adversarial behaviour instead of a fixed test path: a tester (here, an LLM-backed agent) pokes the UI like a confused, impatient, or careless user and watches for what breaks. Unlike with an end-to-end test, you don’t tell it the steps; it explores on its own, so it finds failures on routes you never thought to script.

Does an LLM decide whether the app is broken?

No. In Prufa the model only chooses where to poke. Plain-code detectors decide what counts as broken — a 500 response, an uncaught exception, content wider than the viewport, two tap targets overlapping. Those ship as verified findings with evidence. Anything the model merely suspects ships in a separate advisory tier, labelled as opinion, never graded as fact.

How do you stop a chaos tester from breaking real data?

Mutations are denied by default. The run is dry-run: a network-layer guard aborts every non-GET request, so a destructive click is recorded as a “would have mutated” finding instead of executing. Real payment instruments are never used. In our prufa.dev run the agent tried 0–4 mutations per persona; every one was blocked before it left the browser.