Status: versioned, public. Compiles from constrained-vocabulary
natural-language test cases (step 3 of the build order).
Stability: v1 is the contract. Backwards-compatible additions only;
breaking changes require a v2 the user opts into.
Plain-text test cases are written in natural language, but the
runner needs a deterministic, reviewable spec. The flow-spec
format is the bridge: it's the artifact the user confirms
*before* it runs on a monitor.
Two design choices follow from the trust invariant (PLAN § E-H3):
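- The test case compiles to a reviewable spec; the user can pin
  selectors, add waits, override assertions, and mark steps optional.
- The format is versioned: v1 takes only backwards-compatible
  additions, such as new optional fields or new step types. A v2
  requires a new format and the user opts in.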
The machine-readable contract is flow_spec.schema.json (JSON
Schema, Draft 2020-12), generated from the reference
implementation's models — the schema and the runner cannot drift.

    {
      "spec_version": "1",
      "name": "signup",
      "description": "Sign up a new user end-to-end on the marketing site.",
      "url": "https://example.com",
      "allowed_hosts": ["checkout.stripe.com"],
      "steps": [
        { "type": "goto", "url": "https://example.com" },
        { "type": "act", "action": "click", "target": "the sign-up button", "selector": "text=Sign up" },
        { "type": "act", "action": "fill", "target": "the email field", "selector": "input[name=email]", "value": "{{EMAIL}}" },
        { "type": "act", "action": "fill", "target": "the password field", "selector": "input[name=password]", "value": "{{PASSWORD}}" },
        { "type": "act", "action": "click", "target": "the submit button", "selector": "button[type=submit]" },
        { "type": "wait", "for": "[data-testid=dashboard]" },
        { "type": "expect", "kind": "url_contains", "value": "/dashboard" },
        { "type": "expect", "kind": "beacon", "vendor": "ga4", "event": "sign_up" }
      ],
      "assertions": [
        { "kind": "no_console_errors", "severity": "warning" },
        { "kind": "beacon_fires", "vendor": "ga4", "event": "sign_up", "severity": "critical" }
      ]
    }

Unknown fields are refused everywhere — at the top level and
inside every step and assertion. A spec that says something the
consumer doesn't understand must not run with the unknown part
silently dropped (the "no silent failures" invariant applied to
parsing).
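
A minimal sketch of the refusal rule for steps, assuming the per-type field sets below (transcribed by hand from the step descriptions in this document; the generated schema remains the authority). `STEP_FIELDS` and `check_step_fields` are hypothetical names, not part of the reference implementation.

```python
# Hypothetical field sets per step type, transcribed from this document.
# Every step type also accepts the shared "optional" flag.
STEP_FIELDS = {
    "goto": {"type", "url", "optional"},
    "act": {"type", "action", "selector", "target", "value", "optional"},
    "expect": {"type", "kind", "value", "selector", "vendor", "event", "optional"},
    "wait": {"type", "ms", "for", "optional"},
    "extract": {"type", "selector", "into", "optional"},
}


def check_step_fields(step: dict) -> None:
    """Raise ValueError instead of silently dropping anything unknown."""
    step_type = step.get("type")
    if step_type not in STEP_FIELDS:
        raise ValueError(f"unknown step type: {step_type!r}")
    unknown = sorted(set(step) - STEP_FIELDS[step_type])
    if unknown:
        raise ValueError(
            f"unknown field(s) on {step_type} step: {', '.join(unknown)}"
        )
```

The point of raising rather than filtering is that a spec with a misspelled or unsupported field never runs in a weakened form.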
| type | Fields | Effect |
|---|---|---|
| goto | url | Navigate. The URL must pass the public-URL guard and the host allowlist. |
| act | action, selector?, target?, value? | One browser action: click, fill, press, hover, scroll. |
| expect | kind, ... | Deterministic verification: url_contains, url_matches, text_contains, beacon. |
| wait | ms *or* for | Sleep 1–30000 ms, or wait for a selector to appear. |
| extract | selector, into | Pull the element's visible text into a run variable. |
Every step also accepts `optional` (boolean, default `false`);
see Failure semantics.
`act` carries a pinned `selector` (a CSS or `text=` selector,
never raw XPath; a deliberate DX choice, DX7) and/or a `target`:
a short human description ("the sign-up button") that the agent
loop uses to re-resolve against the live page when the pinned
selector fails. At least one of the two is required. `value` is
required for `fill` and `press` and refused for `click`, `hover`
and `scroll`.
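
The cross-field rules for an act step can be sketched as a small check. This is an illustration under stated assumptions, not the reference validator: `validate_act` is a hypothetical name, and the sketch does not attempt to detect raw XPath.

```python
ACTIONS_WITH_VALUE = {"fill", "press"}
ACTIONS_WITHOUT_VALUE = {"click", "hover", "scroll"}


def validate_act(step: dict) -> None:
    """Apply the act rules: known action, selector or target, value pairing."""
    action = step.get("action")
    if action not in ACTIONS_WITH_VALUE | ACTIONS_WITHOUT_VALUE:
        raise ValueError(f"unknown action: {action!r}")
    # At least one of selector / target must be present and non-empty.
    if not step.get("selector") and not step.get("target"):
        raise ValueError("act needs a selector, a target, or both")
    has_value = "value" in step
    if action in ACTIONS_WITH_VALUE and not has_value:
        raise ValueError(f"{action} requires a value")
    if action in ACTIONS_WITHOUT_VALUE and has_value:
        raise ValueError(f"{action} does not take a value")
```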
expect kinds:
| kind | Fields | Passes when |
|---|---|---|
| url_contains | value | The current URL contains value. |
| url_matches | value | The current URL matches value as a regular expression (validated at parse time; invalid regex is refused). |
| text_contains | value, selector? | The element at selector (default: body) contains value, case-insensitive. |
| beacon | vendor, event? | A matching BeaconEvent has been captured at any point in the flow up to this step. Event-name match is case-insensitive. |
The LLM never judges an expect — every kind is plain code over
the page or the captured beacon stream.
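
To make "plain code" concrete, here is a hedged sketch of an evaluator for the four kinds. The `BeaconEvent` shape, the `text_at` callback, and the function name are assumptions for illustration. The sketch also assumes `url_matches` uses a search rather than a full match and that `vendor` is compared exactly; the text above only specifies case-insensitivity for the event name.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class BeaconEvent:
    """Hypothetical minimal shape of a captured beacon."""
    vendor: str
    event: str


def evaluate_expect(
    step: dict,
    current_url: str,
    text_at: Callable[[str], str],  # hypothetical: selector -> element text
    beacons: Iterable[BeaconEvent],  # beacons captured up to this step
) -> bool:
    kind = step["kind"]
    if kind == "url_contains":
        return step["value"] in current_url
    if kind == "url_matches":
        # Assumption: search semantics; the regex was validated at parse time.
        return re.search(step["value"], current_url) is not None
    if kind == "text_contains":
        text = text_at(step.get("selector", "body"))
        return step["value"].lower() in text.lower()
    if kind == "beacon":
        wanted = step.get("event")
        return any(
            b.vendor == step["vendor"]
            and (wanted is None or b.event.lower() == wanted.lower())
            for b in beacons
        )
    raise ValueError(f"unknown expect kind: {kind!r}")
```

Nothing in the evaluator calls a model: every branch is a string, regex, or list operation over data the runner already holds.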
`wait` takes exactly one of `ms` (integer, 1–30000) or `for`
(a selector). Supplying both or neither is refused.
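
The "exactly one of" rule is easy to get subtly wrong, so a short sketch may help. `validate_wait` is a hypothetical name; the bounds come from the text above.

```python
def validate_wait(step: dict) -> None:
    """Accept exactly one of ms / for; check the ms range."""
    has_ms = "ms" in step
    has_for = "for" in step
    # Equal flags means both present or both absent: refuse either way.
    if has_ms == has_for:
        raise ValueError("wait takes exactly one of ms or for")
    if has_ms:
        ms = step["ms"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(ms, bool) or not isinstance(ms, int) or not 1 <= ms <= 30000:
            raise ValueError("wait.ms must be an integer from 1 to 30000")
```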
`extract` stores the element's trimmed inner text in a run
variable named by `into` (an identifier: `[A-Za-z_][A-Za-z0-9_]*`).
Later steps reference it as `{{name}}`; see Variables.
allowed_hosts (top-level, optional, max 10 entries) lists bare
hostnames the flow may navigate to *beyond* the entry URL's host
and its subdomains — e.g. a hosted checkout. Entries are bare
hostnames only: no scheme, no port, no path (refused at parse
time). Each entry also admits its subdomains.
The allowlist is enforced by the runner outside the LLM (T7), on
every goto and after every act that causes navigation. A
hostile page — or a prompt-injected instruction — can never widen
it at run time: a navigation outside the allowlist fails the step
with spec_step_unresolvable.
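
The host check itself can be sketched as follows, assuming the runner compares hostnames by exact match or dot-delimited suffix. `host_allowed` is a hypothetical helper, not the reference implementation, and the public-URL guard mentioned for goto is out of scope here.

```python
from urllib.parse import urlsplit


def host_allowed(url: str, entry_url: str, allowed_hosts: list[str]) -> bool:
    """True if url's host is the entry host, an allowlisted host,
    or a subdomain of either."""
    host = (urlsplit(url).hostname or "").lower()
    entry_host = (urlsplit(entry_url).hostname or "").lower()
    if not host or not entry_host:
        return False
    permitted = [entry_host] + [h.lower() for h in allowed_hosts]
    # The "." prefix keeps evil-example.com from matching example.com.
    return any(host == base or host.endswith("." + base) for base in permitted)
```

A caller would run this on every goto URL and on the page URL after any act that navigated, and fail the step when it returns `False`.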
A spec is a reviewable artifact, not a program. Hard ceilings:
| Field | Limit |
|---|---|
| name | 1–120 chars after trimming outer whitespace |
| description | ≤ 500 chars |
| url (top-level and goto) | 1–2048 chars |
| allowed_hosts | ≤ 10 entries |
| steps | 1–30 |
| assertions | ≤ 10 |
| selector (any step) | ≤ 512 chars |
| target | ≤ 200 chars |
| value (act / expect) | ≤ 2048 chars |
| wait.ms | 1–30000 |
| extract.into | identifier, ≤ 64 chars |
| vendor / event | ≤ 64 / ≤ 128 chars |
The assertions block is independent of step order and is
evaluated after the flow completes. v1 defines exactly two kinds:
| kind | Fields | Fails when |
|---|---|---|
| no_console_errors | — | Any console error or page error occurred during the flow. |
| beacon_fires | vendor, event? | No matching BeaconEvent was captured during the flow. |
Every assertion has a severity: critical, warning (the
default) or info. Failed critical and warning assertions
become verified findings in the report; info becomes
advisory. Assertion failures never halt the run — the flow
already completed; they grade it.
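
A sketch of post-run grading, under assumptions: beacons are represented as `(vendor, event)` tuples, the finding dictionaries are an invented shape, and `beacon_fires` is assumed to match event names case-insensitively like the beacon expect does.

```python
def grade_assertions(
    assertions: list[dict],
    console_errors: list[str],
    beacons: list[tuple[str, str]],  # hypothetical (vendor, event) pairs
) -> list[dict]:
    """Return one finding per failed assertion; never raises on failure."""
    findings = []
    for assertion in assertions:
        kind = assertion["kind"]
        severity = assertion.get("severity", "warning")  # documented default
        if kind == "no_console_errors":
            failed = len(console_errors) > 0
        elif kind == "beacon_fires":
            wanted = assertion.get("event")
            failed = not any(
                vendor == assertion["vendor"]
                and (wanted is None or event.lower() == wanted.lower())
                for vendor, event in beacons
            )
        else:
            raise ValueError(f"unknown assertion kind: {kind!r}")
        if failed:
            status = "advisory" if severity == "info" else "verified"
            findings.append({"kind": kind, "severity": severity, "status": status})
    return findings
```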
{{VARIABLE}} placeholders come from two sources with different
rules:
- Secrets resolve from the credential vault ONLY in the `value`
  position of `fill` and `press` steps, at the tool boundary,
  outside the LLM context (E-S1). Credentials never enter a prompt,
  and every string that leaves the run (summaries, details, logs,
  events) is redacted. A secret placeholder anywhere else is not
  resolved.
- Run variables set by `extract` may be used anywhere a string
  appears: `url`, `selector`, `value`. They are NOT secrets: they
  may appear in summaries and logs. Exception: `url_matches` values
  are treated as a literal regex and are never substituted.
A spec that references a variable the run cannot supply fails
*before the browser starts* with the credential_rejected
failure class — never mid-flow with a half-completed run.
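
One way to run that pre-flight check is to walk the steps in order and track which names are available. This is a sketch under assumptions: `unsupplied_variables` and `vault_names` are hypothetical, only step fields are scanned (not the top-level `url`), and the placement rule for secrets is not enforced here.

```python
import re

PLACEHOLDER = re.compile(r"\{\{([A-Za-z_][A-Za-z0-9_]*)\}\}")


def unsupplied_variables(spec: dict, vault_names: set[str]) -> list[str]:
    """List placeholders that neither the vault nor an earlier extract supplies."""
    available = set(vault_names)
    missing: list[str] = []
    for step in spec["steps"]:
        is_url_regex = (
            step.get("type") == "expect" and step.get("kind") == "url_matches"
        )
        for field in ("url", "selector", "value"):
            text = step.get(field)
            # url_matches values are literal regexes and never substituted.
            if not isinstance(text, str) or (is_url_regex and field == "value"):
                continue
            for name in PLACEHOLDER.findall(text):
                if name not in available and name not in missing:
                    missing.append(name)
        # A run variable only becomes available after its extract step.
        if step.get("type") == "extract":
            available.add(step["into"])
    return missing
```

A non-empty result would fail the run with `credential_rejected` before any browser is launched.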
Steps are required by default. The contract:
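- A failed required step halts the run: the remaining steps are
  reported as skipped, and the run fails with that step's
  failure class.
- A failed optional step records a verified warning finding and
  the run continues to the next step.
- A failed required `expect` fails the run as `assertion_failed`,
  the only class that alerts. A failed *optional* expect still
  emits a verified finding; it just doesn't halt the run.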
evidence.
- A failed assertion is recorded as a finding at its severity; the
  run still completes.
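
The halt-and-skip behavior for required and optional steps can be sketched as a loop. The `execute` callback, the result labels, and the finding shape are assumptions for illustration; `execute(step)` is taken to return `None` on success or a run-failure class name on failure.

```python
from typing import Callable, Optional


def run_steps(steps: list[dict], execute: Callable[[dict], Optional[str]]) -> dict:
    results: list[str] = []
    findings: list[dict] = []
    run_failure: Optional[str] = None
    for index, step in enumerate(steps):
        if run_failure is not None:
            # A required step already failed: report the rest as skipped.
            results.append("skipped")
            continue
        failure = execute(step)
        if failure is None:
            results.append("passed")
        elif step.get("optional", False):
            # Optional failure: record a verified warning and keep going.
            results.append("failed")
            findings.append(
                {"step": index, "severity": "warning", "status": "verified"}
            )
        else:
            # Required failure: the run takes this step's failure class.
            results.append("failed")
            run_failure = failure
    return {"results": results, "findings": findings, "run_failure": run_failure}
```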
Run-failure classes (DX3) drive distinct behavior downstream:
| Class | Meaning | Alert behavior |
|---|---|---|
| credential_rejected | A {{VAR}} the run can't supply, or a login rejected. | Ask the user; never alert. |
| spec_step_unresolvable | Selector/target can't be resolved, navigation blocked by the allowlist, step timeout. | Re-review CTA on the spec; no alert. |
| agent_uncertain | The managed model backend was unavailable while re-resolving a target. | Retry later; no alert. |
| assertion_failed | A required expect failed — the site verifiably misbehaved. | The only class that alerts. |
| quota_exceeded | The run hit its hard LLM-call cap. | Pause the monitor and tell the human (hard-cap contract); no alert. |
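
A consumer might encode the table above as an enum plus a lookup, so that alerting is decided in one place. The class values are the documented names; the `FOLLOW_UP` labels and `should_alert` are hypothetical.

```python
from enum import Enum


class RunFailure(str, Enum):
    CREDENTIAL_REJECTED = "credential_rejected"
    SPEC_STEP_UNRESOLVABLE = "spec_step_unresolvable"
    AGENT_UNCERTAIN = "agent_uncertain"
    ASSERTION_FAILED = "assertion_failed"
    QUOTA_EXCEEDED = "quota_exceeded"


# Hypothetical labels for the downstream behavior listed in the table.
FOLLOW_UP = {
    RunFailure.CREDENTIAL_REJECTED: "ask_user",
    RunFailure.SPEC_STEP_UNRESOLVABLE: "re_review_spec",
    RunFailure.AGENT_UNCERTAIN: "retry_later",
    RunFailure.ASSERTION_FAILED: "alert",
    RunFailure.QUOTA_EXCEEDED: "pause_monitor",
}


def should_alert(failure: RunFailure) -> bool:
    """Only assertion_failed maps to an alert."""
    return FOLLOW_UP[failure] == "alert"
```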
A consumer of flow-spec v1 MUST:
the error to the user verbatim.
step, report the remainder as skipped, continue past failed
optional steps with a verified warning finding.
run-failure class (DX3) — distinct alert behavior from
agent_uncertain or credential_rejected.
API receives already-resolved values; credentials never enter
the prompt and only resolve in fill/press values.
kind, new optional field → v1.x.
header on v1.
and committed; a CI contract test asserts the two never drift.
Validate against the published schema
(/docs/specs/flow_spec.schema.json):
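
    import json
    from pathlib import Path

    import jsonschema

    schema = json.loads(Path("flow_spec.schema.json").read_text())
    spec = json.loads(Path("signup.flow.json").read_text())  # hypothetical spec file
    jsonschema.validate(spec, schema)  # raises jsonschema.ValidationError if invalid
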
The schema enforces shape and limits; cross-field rules (`act`
needs `selector` or `target`, `wait` needs exactly one of `ms` /
`for`, `value` only on `fill`/`press`, regex validity) are
enforced by the reference validator and listed above.
first public consumer: steps are required by default (a failed
required step halts; optional: true continues with a verified
warning), expect kinds named url_contains / url_matches /
text_contains / beacon, allowed_hosts added (runner-
enforced navigation allowlist), no_404 expect and
within_steps assertion field dropped. flow_spec.schema.json
is now generated from the reference implementation.