QA for vibe-coded apps: what actually breaks

In a June 2026 audit of 49 fresh web launches, 38 (78%) had at least one critical verified finding on day one, most often analytics that recorded zero events; the most common warning-tier finding was cookies set without the Secure attribute. This article is about the nine failure modes behind those findings, the order to fix them, and the free 60-second audit that catches the top ones. The sample is launches generally, not exclusively vibe-coded apps; n=49 is small. Take it as a starting point, not a verdict.

The nine failure modes at a glance

The body of this article takes them one by one. The short version, numbered as in the sections below and in roughly descending order of how often each one showed up across the 49 launches:

1. No analytics events recorded: 38 of 49 had zero events firing on a real browser load. (Critical.)
2. Cookies set without the Secure attribute: 22. (Warning.)
3. No canonical link on the entry page: 24. (Info, but it hurts SEO.)
4. Broken internal links: 14. (Warning.)
5. JS console errors during page load: 10. (Warning.)
6. Missing meta description: 10. (Warning.)
7. The discoverability bundle (no h1, no robots.txt, no OG tags, missing alt text): each item appeared on 7 to 12 of the 49. (Info.)
8. Canonical pointing to a different host: 2. (Critical; the only one we don’t name-and-shame in the public post.)
9. Payment webhooks that silently drop events: not in our 49-audit set. This is the most-cited failure in the r/vibecoding and r/nocode threads, where multiple founders report the Stripe checkout.session.completed event being acknowledged with a 200 OK without the database updating. Listed for completeness because the community names it more than the data does.

The split is intentional: the first eight are Prufa-measured and in-repo. The ninth is community-sourced. Listing them together, with the ninth labeled, is honest about where the data ends.

What we audited, and what we found

The 49 audits were run on a self-hosted browser against a self-harvested list of Show HN launches from the prior 30 days with at least 10 points. Each launch got the same public-pages-only audit: load the homepage, exercise signup/login if visible, walk one inner page, watch the network and the console. The numbers below are what the deterministic checks produced; the advisory-tier LLM findings are not included (they’re opinions, not facts, and the schema is explicit about that: verified findings get a green check, advisory ones do not).

One sentence for the citation set: In June 2026, 38 of 49 freshly launched web apps (78%, n=49) had at least one critical verified finding on the first automated audit.

49 of 50 attempts succeeded; 1 was blocked by a bot wall.
38 of 49 had ≥1 critical verified finding.
49 of 49 had ≥1 finding of any tier (critical, warning, or info).
40 critical findings, 61 warning findings, total — these are floors, not ceilings, because the per-site top-findings cap was 4.
2 of 50 ran ad pixels. That is the inverse signal: these founders cared enough to instrument something. Most of the rest had instrumentation that recorded zero events.

The “no analytics events” finding (38/49) is the most striking. It is not always because the founder forgot to add analytics; often the script is loaded, the consent banner blocks the initial fire, and the founder never tests the post-consent flow. The audit catches it because the check is “did any beacon fire across the whole visit, post-consent-acceptance,” not “is the script tag present in the HTML.”

The nine failure modes, one by one

1. No analytics events recorded (38 of 49, critical)

What it looks like: the page loads, the founder opens the dashboard, sees zero visitors. The script is in the HTML. The tag manager is configured. Nothing fires.

Why vibe coding causes it: the most common cause is consent state. The LLM generates the analytics tag and the consent banner as two separate features; they don’t get tested against each other. The tag is supposed to fire only after consent = accepted, so a check that runs before consent would see zero beacons and prove nothing. The audit therefore runs after a synthetic consent acceptance, and on these sites it still sees nothing. Either the tag manager isn’t reading the consent state, or the event name doesn’t match what the dashboard is listening for, or the dataLayer.push is being swallowed by an error handler.

The Prufa check that catches it: the tracking analyzer in backend/service/checks/tracking/. It watches the browser’s actual network traffic for the BeaconEvent v1 schema (vendor, event_name, account_id, request_url). If zero beacons fire and the capture was provably trustworthy (CDP attached, network activity observed), it’s a verified critical. The capture_trustworthy invariant on CaptureMeta is what prevents the check from emitting “no analytics” when the audit itself was broken.

5-minute manual check: open the site in a real browser, accept the consent banner, click around for 30 seconds, then open the network tab and filter by your analytics vendor’s hostname. If zero requests, you have this bug.
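The pass/fail logic behind that manual check can be sketched in a few lines. This is a minimal illustration, not Prufa's implementation: the function name, the vendor hostnames, and the input shape are assumptions. Only the idea of refusing to report when the capture itself is untrustworthy comes from the description above.

```python
from urllib.parse import urlparse

# Hypothetical vendor hostnames; replace with the ones your analytics tool uses.
VENDOR_HOSTS = {"analytics.example.com", "events.example.net"}


def tracking_verdict(request_urls, cdp_attached, vendor_hosts=VENDOR_HOSTS):
    """Return 'untrusted', 'critical', or 'ok' for one post-consent visit."""
    # A capture with no debugger attached or no traffic at all proves nothing,
    # so it must not be reported as "no analytics".
    if not cdp_attached or not request_urls:
        return "untrusted"
    beacons = [u for u in request_urls if urlparse(u).hostname in vendor_hosts]
    return "ok" if beacons else "critical"
```

The input is the list of request URLs seen after accepting the consent banner, for example copied from the network tab. An empty list returns "untrusted" rather than "critical", which mirrors the point that a broken capture should not be mistaken for missing analytics.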

2. Cookies set without the Secure attribute (22 of 49, warning)

What it looks like: a session cookie is set over HTTPS but the cookie has no Secure flag. Without the flag, the browser will also attach the cookie to plain-HTTP requests to the same host, where anyone on the network path can read it.

Why vibe coding causes it: the LLM writes document.cookie = "session=..." without thinking about flag discipline. The framework default often isn’t Secure. The dev environment is HTTP, so the flag never matters locally.

The Prufa check: the cookies list on PageSnapshot is inspected by the consent analyzer and the tracking analyzer. Both surface this as a warning.

5-minute manual check: in DevTools → Application → Cookies, inspect any auth-related cookie. The Secure column should be checked.
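The same inspection can be done on a raw Set-Cookie header value, such as one copied from the network tab. The helper below is a small sketch with a hypothetical name; it does a simple attribute scan and is not a full cookie parser.

```python
def missing_cookie_flags(set_cookie_header,
                         required=("secure", "httponly", "samesite")):
    """Return the required attributes absent from one Set-Cookie header value."""
    # Attributes follow the first "name=value" pair, separated by semicolons.
    parts = set_cookie_header.split(";")[1:]
    present = {p.strip().split("=", 1)[0].lower() for p in parts if p.strip()}
    return [flag for flag in required if flag not in present]
```

An empty list means the cookie carries all three attributes. Attribute names are matched case-insensitively, since servers vary in how they capitalize them.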

3. No canonical link on the entry page (24 of 49, info)

What it looks like: the page renders fine, but <link rel="canonical"> is missing. Google picks a URL on its own — usually the one with the most inbound links, which may not be the one you want indexed.

Why vibe coding causes it: canonical links are SEO housekeeping, not user-visible behavior. The LLM is not optimized to add them.

The Prufa check: the seo analyzer.

5-minute manual check: view source on the homepage, search for rel="canonical". If missing, add it.
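If you would rather script the view-source search, a minimal sketch using only the Python standard library might look like this. The class and function names are illustrative, and it reads static HTML only, so a canonical injected by client-side JavaScript would need the rendered DOM instead.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel:
            self.hrefs.append(attr.get("href"))


def find_canonicals(html):
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.hrefs
```

An empty result is the finding described in this section. More than one entry is also worth a look, since conflicting canonicals give search engines no clear signal.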

4. Broken internal links (14 of 49, warning)

What it looks like: a link in the nav or footer 404s. Worse, an internal link in a CTA goes to a 404 and the founder doesn’t know until a user reports it.

Why vibe coding causes it: the LLM hallucinates routes that don’t exist, or copies a route from a different page and the IDs don’t match. The pages are generated as a set; the links between them aren’t validated as a graph.

The Prufa check: the link_statuses map on PageSnapshot is populated by the runner walking every <a href> on the visited pages; the ux analyzer flags non-2xx.

5-minute manual check: open DevTools → Network, click an internal link, and read the status code of the document request. Sample 10 links.
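Treating the links as a graph is mostly bookkeeping: resolve every href against the page it came from, keep the same-host ones, and flag any whose status is not 2xx. The sketch below assumes you already have a map of URL to final HTTP status, similar in spirit to the link_statuses map mentioned above. The function names are hypothetical, and fetching the statuses is left out.

```python
from urllib.parse import urljoin, urlparse


def internal_links(page_url, hrefs):
    """Resolve hrefs against the page URL and keep only same-host links."""
    host = urlparse(page_url).hostname
    resolved = (urljoin(page_url, h) for h in hrefs)
    return sorted({u for u in resolved if urlparse(u).hostname == host})


def broken_links(link_statuses):
    """Return the URLs whose HTTP status is outside the 2xx range."""
    return sorted(u for u, status in link_statuses.items()
                  if not 200 <= status < 300)
```

Statuses are assumed to be recorded after redirects are followed; otherwise every 301 would be reported as broken.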

5. JS console errors during page load (10 of 49, warning)

What it looks like: the page renders, but the console shows Uncaught TypeError: Cannot read properties of null (reading 'foo') or similar. Often the page looks fine because the error is in a non-critical script.

Why vibe coding causes it: the LLM writes code that works for the case it imagined, but the DOM doesn’t match (an element ID changed, a third-party script loaded later, an async dependency resolved in a different order). The page is technically broken; the user doesn’t notice until they hit the path that exercises the broken code.

The Prufa check: console_errors on PageSnapshot. Any non-empty list of errors is a warning.

5-minute manual check: open DevTools, load the page, check the console.

6. Missing meta description (10 of 49, warning)

What it looks like: Google generates the snippet from the page content. Usually ugly. Sometimes misleading.

Why vibe coding causes it: same as canonical — SEO housekeeping, not in the LLM’s training priority.

The Prufa check: the seo analyzer.

5-minute manual check: view source on the homepage and search for <meta name="description". Add one if missing.

7. The discoverability bundle (widespread)

A grab-bag of small SEO/a11y misses that keep showing up across the sites we audit: no <h1> on the entry page (12 of 49), no robots.txt (11 of 49), no Open Graph tags (7 of 49), images missing alt text (8 of 49). Each is “info” tier, and none is critical on its own. Together they tell you the LLM wasn’t optimizing for either SEO or accessibility in any structured way.

The Prufa checks: the seo and a11y analyzers both contribute. The a11y analyzer runs axe-core and emits the violations; the seo analyzer inspects the meta-tag surface.

5-minute manual check: in DevTools → Elements, confirm the rendered homepage has exactly one <h1>, the images have alt text, and the head contains a meta description and og: tags. Then load /robots.txt directly and confirm it exists.
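Most of the bundle can be checked from the page HTML in one pass. The following is a rough sketch, not the seo or a11y analyzer: the names are made up for illustration, robots.txt is not covered because it needs a separate request, and a real accessibility pass such as axe-core checks far more than alt attributes.

```python
from html.parser import HTMLParser


class BundleScanner(HTMLParser):
    """Count the signals behind the discoverability bundle in one HTML page."""

    def __init__(self):
        super().__init__()
        self.h1_count = 0
        self.images_without_alt = 0
        self.has_description = False
        self.has_open_graph = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "h1":
            self.h1_count += 1
        elif tag == "img" and "alt" not in attr:
            # An empty alt="" is valid for decorative images, so only a
            # missing attribute is counted.
            self.images_without_alt += 1
        elif tag == "meta":
            name = (attr.get("name") or "").lower()
            if name == "description" and attr.get("content"):
                self.has_description = True
            if (attr.get("property") or "").lower().startswith("og:"):
                self.has_open_graph = True


def scan_bundle(html):
    """Return a list of human-readable findings; empty means nothing flagged."""
    scanner = BundleScanner()
    scanner.feed(html)
    findings = []
    if scanner.h1_count != 1:
        findings.append(f"expected exactly one <h1>, found {scanner.h1_count}")
    if scanner.images_without_alt:
        findings.append(f"{scanner.images_without_alt} image(s) missing alt")
    if not scanner.has_description:
        findings.append("missing meta description")
    if not scanner.has_open_graph:
        findings.append("no Open Graph tags")
    return findings
```

Feed it the rendered HTML rather than the raw server response if the app builds its head tags client-side.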

8. Canonical pointing to a different host (2 of 49, critical)

What it looks like: the page sets <link rel="canonical" href="https://some-other-site.com/page">. Either a copy-paste bug, or the LLM reproduced a template that carried someone else’s canonical.

Why it matters: Google treats that as a strong signal that the other host is the authoritative version. Your page doesn’t rank. Worse, if the other host isn’t yours, you’ve just told Google to give them your search traffic.

The Prufa check: the seo analyzer. Critical-tier because the impact is “the page will not be indexed under your domain.”

5-minute manual check: search source for rel="canonical", confirm the host is yours.
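The host comparison itself is a single check once you have the canonical href. The sketch below uses a hypothetical function name and compares hostnames strictly, so www.example.com and example.com count as different hosts. Loosen that if you serve both on purpose.

```python
from urllib.parse import urljoin, urlparse


def canonical_host_mismatch(page_url, canonical_href):
    """True when the canonical resolves to a different host than the page."""
    # Relative canonicals resolve against the page URL, so they never mismatch.
    canonical = urljoin(page_url, canonical_href)
    return urlparse(canonical).hostname != urlparse(page_url).hostname
```

Called as canonical_host_mismatch("https://myapp.example/", "https://some-other-site.com/page"), it reports the mismatch described in this section.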

9. Payment webhooks that silently drop events (community-sourced)

This is the failure mode the r/vibecoding and r/nocode community names most often. We did not have it in our 49-audit set because the public-pages-only audit doesn’t exercise payments; but the founder reports are consistent and the failure mode is well-understood.

What it looks like: a user pays. Stripe shows the payment as succeeded. The customer’s account does not upgrade. They email support. The founder manually fixes it in the database. Then it happens again. The Stripe dashboard says everything’s fine; the checkout.session.completed webhook was sent and acknowledged with a 200.

Why vibe coding causes it: the LLM wires the post-checkout flow to update state from the client-side redirect (the “return URL” Stripe sends the user to after a successful payment). This works in the happy path. It silently fails when the user closes the tab early, the redirect is blocked by the browser, the network is slow, or the session expires. The fix is well-known: don’t update state from the redirect; update state from the webhook, which is server-to-server and retried by Stripe when delivery fails. The redirect should only redirect; the webhook should update the database.
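A minimal sketch of "the webhook updates the database" follows, with in-memory stand-ins for the real tables. Everything here is hypothetical: the handler name, the "pro" plan, and the storage. It also assumes the event's signature has already been verified. The one design point it illustrates is idempotency: because deliveries can be retried, the same event id may arrive more than once and must not be applied twice.

```python
# In-memory stand-ins for real tables (assumptions for illustration only).
processed_event_ids = set()
accounts = {"cus_123": {"plan": "free"}}


def handle_event(event):
    """Apply one already-verified webhook event; return True if state changed."""
    # Retried deliveries carry the same event id, so skip anything seen before.
    if event["id"] in processed_event_ids:
        return False
    changed = False
    if event["type"] == "checkout.session.completed":
        customer = event["data"]["object"]["customer"]
        accounts.setdefault(customer, {})["plan"] = "pro"
        changed = True
    processed_event_ids.add(event["id"])
    return changed
```

In a real handler the id check and the account update would sit in one database transaction, and the endpoint would return a 2xx only after that transaction commits.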

The Prufa check: the flows analyzer exercises the flow-spec, which can include a “verify webhook is received and processed” step. The advisory_findings list on DomSnapshot is populated by the runner; the LLM-detected flows (FlowDetection) are pre-verified in the DOM by plain code before checks run.

5-minute manual check: in the Stripe dashboard → Developers → Webhooks, send a test checkout.session.completed, a test payment_intent.payment_failed, and a test charge.refunded. Confirm your endpoint receives each, verifies the signature, and writes to your database. The last two are the ones the community reports get missed most often.
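"Verifies the signature" is the step most often skipped. Stripe's documented scheme signs the string "{timestamp}.{raw body}" with HMAC-SHA256 using the endpoint secret and sends the result in the Stripe-Signature header as t=...,v1=.... The sketch below reimplements that check with the standard library to show what is being verified. In production, Stripe's own library helper (stripe.Webhook.construct_event in the Python SDK) is the safer choice. The function name and the 300-second tolerance are choices made here.

```python
import hashlib
import hmac
import time


def verify_stripe_signature(payload: bytes, header: str, secret: str,
                            tolerance: int = 300, now=None) -> bool:
    """Check a Stripe-Signature header against the raw request body."""
    fields = [item.split("=", 1) for item in header.split(",") if "=" in item]
    timestamps = [v for k, v in fields if k.strip() == "t"]
    signatures = [v for k, v in fields if k.strip() == "v1"]
    if not timestamps or not signatures or not timestamps[0].isdigit():
        return False
    # Reject old timestamps so a captured request cannot be replayed later.
    now = time.time() if now is None else now
    if abs(now - int(timestamps[0])) > tolerance:
        return False
    signed = timestamps[0].encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return any(hmac.compare_digest(expected.encode(), s.encode())
               for s in signatures)
```

The payload must be the raw request bytes. Verifying against a re-serialized JSON body is a common reason a correct secret still fails.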

Why “just add tests” doesn’t work for vibe coders

The honest version of the HN critique (“vibe coding creates a bus factor of zero”, “AI started actively breaking working code”, “vibe coding kills open source”): yes, the LLM is the only one who understands the code. That’s the structural problem, and “tell the founder to write tests” doesn’t solve it, because:

The tests would have to test the right thing. Writing a test requires knowing what the correct behavior is across a range of conditions, not just the happy path. The LLM that wrote the code has the same blind spots as the code.
The founder may not be an engineer. “Add a Playwright spec” is not actionable for a non-technical founder who just shipped an MVP in a weekend.
The tests would have to keep up with the code. Vibe-coded codebases change fast: a single session might restructure half the application. Hand-maintained test suites don’t survive that velocity.

The structural answer is to separate the navigator from the verifier. An LLM is fine for navigation — it can figure out which button to click, which form to fill, what to look for. What the LLM cannot be trusted for is grading the result. The verifier has to be plain code, deterministic, against captured browser evidence. Then the LLM’s blind spots can’t become the verifier’s blind spots, because the verifier doesn’t share them.

The verification architecture that does work

This is exactly the architecture Prufa runs on. To be clear about the trust boundaries, since “AI tests your site” is the category and not all AI testers are the same:

Navigation is LLM-driven. The agent looks at the page the way a real user would — reads the visible text, picks the signup button, fills the form. This is where an LLM earns its keep; deterministic scripts are brittle on sites that change weekly.
Verification is plain code. Every action the LLM takes produces captured browser evidence. Plain code checks that evidence against the spec: did the network request return 200? Did the cookie get set with Secure? Did the next page render with the expected DOM? Did the analytics beacon fire? Did the payment webhook arrive?
The LLM never grades results. LLM judgments are surfaced as advisory findings and labeled as such — they are opinions, not facts. The verified findings, the ones that show up in the report with a green check, are the plain-code ones.
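The three rules above can be made concrete with a small data model. This is an illustrative sketch, not the schema in backend/service/checks/schema.py: the Finding class, its fields, and the evidence keys are assumptions. The point is structural: only plain-code checks can construct a verified finding, and model opinions are wrapped in a way that cannot set that flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    title: str
    tier: str        # "critical", "warning", "info", or "advisory"
    verified: bool   # True only for findings produced by plain code


def verify(evidence):
    """Deterministic checks over captured evidence; no model is consulted."""
    findings = []
    if evidence["capture_trustworthy"] and not evidence["beacons"]:
        findings.append(Finding("No analytics events recorded", "critical", True))
    if evidence["console_errors"]:
        findings.append(Finding("JS console errors during page load", "warning", True))
    return findings


def advisory(llm_opinions):
    """Wrap model opinions so they can never be reported as verified."""
    return [Finding(text, "advisory", False) for text in llm_opinions]
```

A report built this way can show both lists side by side while keeping the green check reserved for the output of verify.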

Because the verifier is independent of the navigator, you can safely let the coding agent that wrote the app run the audit on it — add Prufa to Codex or set up Prufa in Roo Code, and the same agent that shipped the code runs the QA, while the deterministic verifier still catches the failure modes the agent can’t see in its own output.

This is the deterministic-vs-agentic frame the SERP keeps arguing about, and the honest answer is that both are needed, in different roles. The LLM is the navigator. Plain code is the verifier. The same split is what makes the audit trustworthy on vibe-coded apps specifically: the LLM can navigate a Lovable-built app that changes every week, but the verification is the part that has to be deterministic, and that’s the part that catches the failure modes above.

For a worked example of this on a real signup flow, see How Prufa verifies a signup flow. For the category-level “why this matters” essay, see Why Prufa exists: QA built for the agentic era.

What to do before you launch

If you take exactly one thing from this article: paste your URL into the free 60-second audit on prufa.dev. It runs the critical-tier checks against your live URL and shows you which of the nine failure modes above you’ve shipped. No signup, no card.

After the audit, do these in order. They are roughly ordered by how often the failure shows up in the 49-audit set:

1. Confirm analytics events fire after consent acceptance. This was 38 of 49. Open your dashboard; check it shows real events. If it shows zero, your dashboard has been lying to you.
2. Read the code that sets session cookies. Confirm Secure, HttpOnly, SameSite=Lax (or Strict).
3. Click every link in your nav and footer from a fresh incognito window. Find the 404s.
4. Set <link rel="canonical"> on the entry page to itself, with the host you want indexed.
5. View the page in a real browser with DevTools open. The console should be empty.
6. Send a Stripe test webhook. Confirm your endpoint receives it, verifies the signature, and writes to the database. Send a test payment_intent.payment_failed and charge.refunded too.
7. Load the page in a phone-width viewport. Scroll horizontally. If you can scroll, the layout is broken.

Items 1–5 are the audit. Items 6–7 are the parts the audit doesn’t cover (payments and mobile-specific layout). Together they cover every failure mode in the list above except #6 (missing meta description) and #7 (the discoverability bundle), both of which are quick view-source checks described in their own sections.
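For the session-cookie step in the list above, it helps to know what a correct header looks like. HttpOnly cannot be set from document.cookie in the browser, so the cookie has to be issued by the server. The sketch below builds such a header value with Python's standard library. The cookie name "session" and the function name are placeholders, and most web frameworks expose the same three options on their own cookie helpers.

```python
from http.cookies import SimpleCookie


def session_cookie_header(value, samesite="Lax"):
    """Build a Set-Cookie header value with Secure, HttpOnly, and SameSite set."""
    cookie = SimpleCookie()
    cookie["session"] = value
    cookie["session"]["path"] = "/"
    cookie["session"]["secure"] = True
    cookie["session"]["httponly"] = True
    cookie["session"]["samesite"] = samesite
    # OutputString() renders only the value part, without a "Set-Cookie:" prefix.
    return cookie["session"].OutputString()
```

Because the dev environment is usually plain HTTP, a Secure cookie may not be stored locally in some browsers. That is a reason to test over HTTPS, not a reason to drop the flag.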

For a more general pre-launch checklist that includes performance, accessibility, and SEO-audit items not in the vibe-coding-specific list above, see Website QA checklist before launch, ordered by what actually breaks. For a 49-launch data view of what shipped broken, see We audited 49 Show HN launches. 38 had a critical bug on day one.

If you ship landing pages, not just product flows — the post-deploy tracking breakage pattern is the same, but the audience that owns the fix is the marketing team. See the marketer flow for the version built around “did my pixel fire.”

FAQ

Is vibe coding production ready?

It can be. The data: in a June 2026 audit of 49 fresh launches, 38 (78%) had at least one critical verified finding on day one, most often analytics that recorded zero events; the most common warning-tier finding was cookies set without the Secure attribute. Vibe coding is fine for the happy path; it is the edges that need systematic review. The tools have not caught up yet, but the practice can work if the verification layer is independent of the navigator.

What is the best testing approach for vibe-coded projects?

The cheapest first step is a free 60-second audit that exercises the public critical paths (signup and login, where visible) and verifies the most common failure modes. Combine that with one human pass on a real browser before launch, including a test payment, which the public-pages audit does not exercise. Automated first, then manual; no single tool is enough on its own. Continuous monitoring on every PR is the upgrade once the app is in production.

Does vibe coding create more security vulnerabilities?

A May 2025 audit of 1,645 apps built on Lovable found 170 with critical security vulnerabilities, roughly a 10% critical failure rate across a platform used primarily by non-engineers. The cause is usually row-level security left incomplete, exposed API keys in client code, or unverified payment webhooks. Treat the AI output as you would code from a junior engineer in review: read it, don’t trust it.
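One of those causes, secret keys shipped in client code, is easy to spot-check yourself. The sketch below searches a JavaScript bundle's text for Stripe-style secret and restricted key prefixes (sk_ and rk_); publishable pk_ keys are meant to be public and are ignored. The function name is hypothetical and the pattern is deliberately narrow. Other vendors use other formats, so a clean result here is not proof that nothing is exposed.

```python
import re

# Prefixes Stripe uses for secret (sk_) and restricted (rk_) keys.
SECRET_PATTERN = re.compile(r"\b(?:sk|rk)_(?:live|test)_[A-Za-z0-9]{8,}")


def find_exposed_keys(bundle_text):
    """Return secret-looking tokens found in client-side code."""
    return SECRET_PATTERN.findall(bundle_text)
```

Run it over the built assets the browser actually downloads, not the source tree, since bundlers can inline environment variables at build time.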

Can AI write the tests?

It can — and the same AI that wrote the code will produce tests with the same blind spots. The tests will confirm wrong behavior because both outputs share the same model. The fix is to keep the navigator (an LLM is fine here) separate from the verifier (plain code against captured browser evidence, no LLM judgment). The deterministic part has to be outside the model.

If you want the runnable version of this article’s nine failure modes — each mapped to a real Prufa check and a 5-minute manual fallback — see the vibe coding testing checklist: ten items in priority order, the first three of which the free 60-second audit runs automatically.

If you want the scored version — six questions, ten minutes, the same 49-launch distribution as the benchmark — see Is my vibe-coded app production ready? a scored assessment from 49 launches. The cluster is a layered onramp: this hub catalogs what breaks, the 6-step process to test a vibe-coded app is the ordered routine, the checklist is the runnable list, and the scored assessment is the ship-or-wait decision. Two deeper cuts sit alongside it: the security angle on exposed API keys in vibe-coded apps (how secrets leak, and what an external scan can’t see), and what happens after you ship in vibe coding support after launch (the four things that silently rot a working app).

Data and architecture source: the 49-audit harvest (outbound/README.md, June 2026) and the check surface in backend/service/checks/. The trust boundaries — verified vs advisory, capture_trustworthy, LLM detects flow entry + plain code re-verifies in the DOM — are documented in backend/service/checks/schema.py.
