How Prufa verifies a signup flow (and why the LLM never grades the result)

One signup run, end to end: an LLM-backed agent drives the browser, plain code grades the outcome against a public flow-spec. Same input, same verdict.

“Signup is broken” is a serious claim. Before it wakes anyone up, you want to know who is making it — a language model’s impression of a page, or a recorded HTTP response. Prufa is built so it is always the second. Here is what actually happens when Prufa verifies a signup flow, and exactly where the line between the LLM and plain code sits.

The flow starts as a sentence, not a script

You describe the flow in plain language: “go to /signup, fill email and password, submit, expect the dashboard.” Prufa compiles that into a flow-spec — a small, deterministic document in a public, versioned format (flow-spec v1) — and you confirm it before it ever runs on a monitor:

{
  "spec_version": "1",
  "name": "signup",
  "url": "https://example.com",
  "steps": [
    { "type": "goto", "url": "https://example.com/signup" },
    { "type": "act", "action": "fill", "selector": "input[name=email]", "value": "{{EMAIL}}" },
    { "type": "act", "action": "fill", "selector": "input[name=password]", "value": "{{PASSWORD}}" },
    { "type": "act", "action": "click", "selector": "button[type=submit]" },
    { "type": "expect", "kind": "url", "contains": "/dashboard" }
  ],
  "assertions": [
    { "kind": "no_console_errors", "severity": "warning" }
  ]
}

The step vocabulary is deliberately constrained — goto, act, expect, wait, extract — not free-form natural language. A reviewable spec beats a clever one: you can pin selectors, add waits, override assertions, and mark steps optional, and what you approved is what runs, every time.
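To make the constraint concrete, here is a minimal sketch of a vocabulary check a reviewer's own tooling might run over a spec before approving it. The five step types come from the list above; the function name and the sample spec are hypothetical, and this is not Prufa's actual validator.

```python
import json

# Step types named in the article; everything else here is a hypothetical sketch.
ALLOWED_STEP_TYPES = {"goto", "act", "expect", "wait", "extract"}


def unknown_step_types(spec: dict) -> list[str]:
    """Return the step types in a spec that fall outside the allowed vocabulary."""
    return [
        step.get("type", "<missing>")
        for step in spec.get("steps", [])
        if step.get("type") not in ALLOWED_STEP_TYPES
    ]


spec = json.loads("""
{
  "spec_version": "1",
  "name": "signup",
  "steps": [
    {"type": "goto", "url": "https://example.com/signup"},
    {"type": "expect", "kind": "url", "contains": "/dashboard"},
    {"type": "improvise"}
  ]
}
""")

print(unknown_step_types(spec))  # ['improvise']
```

A closed vocabulary is what makes this kind of check possible at all: a free-form instruction cannot be rejected mechanically, but an unknown step type can.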

The agent navigates — and that is the whole job

At run time, an LLM-backed agent drives a real browser through a few primitives: act, observe, extract. Navigation is the part that genuinely needs a model. Real apps move buttons, rename labels, interpose cookie banners, and split one form across three screens — the agent absorbs that ambiguity the way a human tester would.

What the agent never does is decide whether the flow worked. It explores; it doesn’t judge.

Plain code keeps the score

Underneath the agent, a plain-code harness owns the browser session and records what actually happened: every network request and response, console output, page URLs, screenshots. After the run, deterministic checks grade that recording against the spec you approved.

The expect step above is a substring check against a recorded URL: the browser either ended up on a URL containing /dashboard or it didn’t. The form submit either returned a success response or it returned a 500, and the finding carries the actual response code as evidence. Same input, same verdict: run the checks twice over the same recording and you get the same answer, which is the property a monitor has to have before its alerts mean anything.
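As a rough illustration of what grading a recording can look like, the sketch below implements the two checks just described as pure functions. The Recording shape, the check names, and the choice to treat any 2xx status as success are assumptions made for the example, not Prufa's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recording:
    """Hypothetical slice of what the harness records during a run."""
    final_url: str
    submit_status: int  # HTTP status of the form submission


def check_url_contains(recording: Recording, expected: str) -> dict:
    """Grade an `expect` step of kind `url`: a plain substring test."""
    return {
        "check": "expect.url.contains",  # hypothetical check name
        "passed": expected in recording.final_url,
        "evidence": {"final_url": recording.final_url, "expected": expected},
    }


def check_submit_succeeded(recording: Recording) -> dict:
    """Grade the form submit by its recorded status (assumption: 2xx is success)."""
    return {
        "check": "submit.status",  # hypothetical check name
        "passed": 200 <= recording.submit_status < 300,
        "evidence": {"status": recording.submit_status},
    }


broken = Recording(final_url="https://example.com/signup", submit_status=500)
first = [check_url_contains(broken, "/dashboard"), check_submit_succeeded(broken)]
second = [check_url_contains(broken, "/dashboard"), check_submit_succeeded(broken)]

print(first[1])         # {'check': 'submit.status', 'passed': False, 'evidence': {'status': 500}}
print(first == second)  # True
```

Because each check is a pure function of the recording, grading it again cannot produce a different verdict, and the evidence in the finding is copied from the recording rather than described by a model.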

When a flow also asserts that your analytics noticed the signup, the same discipline applies: captured traffic is first normalized into BeaconEvent v1 events, and the check asserts against that schema — the verification layer never reads raw browser protocol traffic, and the LLM never sees it at all.

Why the LLM never grades the result

Language models are persuasive, and a persuasive tester that is wrong is worse than no tester. So the separation is structural, not a prompt instruction: the verification code has no model in the loop to drift, no temperature to tune, nothing to convince.

When the model does have an opinion — “this error message seems misleading” — that opinion ships in a separate advisory tier, labeled as an opinion and never phrased as broken. Verified findings carry the check name, the recorded evidence, and a timestamp. The two tiers never mix in one list.
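One way to keep the tiers from mixing is to give them different types and different lists. The sketch below assumes hypothetical VerifiedFinding, AdvisoryNote, and Report classes; only the fields on a verified finding (check name, evidence, timestamp) come from the description above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class VerifiedFinding:
    check: str           # name of the deterministic check that produced it
    evidence: dict       # recorded facts, e.g. a status code
    timestamp: datetime


@dataclass(frozen=True)
class AdvisoryNote:
    text: str            # model-authored observation
    label: str = "opinion"


@dataclass
class Report:
    verified: list = field(default_factory=list)
    advisory: list = field(default_factory=list)

    def add(self, item) -> None:
        """Route each item to its own tier by type; anything else is rejected."""
        if isinstance(item, VerifiedFinding):
            self.verified.append(item)
        elif isinstance(item, AdvisoryNote):
            self.advisory.append(item)
        else:
            raise TypeError(f"unsupported report item: {type(item).__name__}")


report = Report()
report.add(VerifiedFinding("submit.status", {"status": 500}, datetime.now(timezone.utc)))
report.add(AdvisoryNote("This error message seems misleading."))

print(len(report.verified), len(report.advisory))  # 1 1
```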

Credentials never enter the prompt

Those {{EMAIL}} and {{PASSWORD}} placeholders are resolved by the runner, outside the LLM context: the agent’s tools receive already-resolved values, and secrets never appear in the prompt or in the run’s log output. That is a conformance requirement of flow-spec v1, not a courtesy.
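A minimal sketch of that split, assuming a runner that holds the secrets in a dictionary: the model-facing text keeps the placeholder, the browser tool gets the resolved value, and log lines are masked before they are written. The resolve and redact helpers and the sample secrets are hypothetical.

```python
import re

# Matches placeholders such as {{EMAIL}}; the exact grammar is an assumption.
PLACEHOLDER = re.compile(r"\{\{([A-Z_][A-Z0-9_]*)\}\}")


def resolve(value: str, secrets: dict[str, str]) -> str:
    """Replace each {{NAME}} with its secret; raises KeyError if one is missing."""
    return PLACEHOLDER.sub(lambda m: secrets[m.group(1)], value)


def redact(text: str, secrets: dict[str, str]) -> str:
    """Mask secret values before text is written to a log."""
    for name, secret in secrets.items():
        if secret:  # skip empty values, which would match everywhere
            text = text.replace(secret, "{{" + name + "}}")
    return text


secrets = {"EMAIL": "qa@example.com", "PASSWORD": "s3cret!"}
step = {"type": "act", "action": "fill",
        "selector": "input[name=password]", "value": "{{PASSWORD}}"}

prompt_view = step["value"]                   # what a model would see
tool_value = resolve(step["value"], secrets)  # what the browser tool receives
log_line = redact(f"filled {step['selector']} with {tool_value}", secrets)

print(prompt_view)  # {{PASSWORD}}
print(log_line)     # filled input[name=password] with {{PASSWORD}}
```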

The formats are public on purpose

Both formats are published, versioned specs: additive changes ship as v1.x, breaking changes mean a v2 with a sunset on v1. Anyone — including a competitor, including your own tooling — can emit or consume them. The bet is that defensibility comes from the network of agents and integrations speaking the same format, not from format opacity.
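For a consumer, that versioning rule reduces to a check on the major version. The sketch below assumes version strings of the form "1" or "1.3" in a field like spec_version; the source only shows "1", so the minor-version syntax and the is_supported helper are assumptions.

```python
def is_supported(spec_version: str, supported_major: int = 1) -> bool:
    """Accept the supported major and any of its minors; reject everything else."""
    major, sep, minor = spec_version.partition(".")
    if not major.isdigit():
        return False
    if sep and not minor.isdigit():
        return False  # malformed, e.g. "1." or "1.x"
    return int(major) == supported_major


for version in ["1", "1.3", "2", "1.x"]:
    print(version, is_supported(version))
# 1 True
# 1.3 True
# 2 False
# 1.x False
```

Accepting every 1.x is only safe because minor versions are additive: a consumer written against v1 can ignore fields it does not recognize, while a v2 document has to be refused outright.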

If you want the longer argument for why QA needs this shape now, read Why Prufa exists. The execute-once, replay-as-code architecture described here is also the axis we compare QA tools on — see Prufa vs bug0 for how the same bet looks wrapped in a managed service. If you’d rather see your own app graded, run a free audit — paste a URL, no signup.