Vibe Coding vs. Vibe Engineering
The difference isn't how fast you write. It's how hard you make AI think. Here's the methodology behind building Kvalty.cz.
When people hear “programming with AI,” most imagine a simple process: write a prompt, AI spits out code, done.
That might work for a simple script. But not for a project like Kvalty.cz — 237,000 lines of TypeScript, 163 database tables, 3 Next.js apps, a Hono/tRPC backend, PostGIS geographic queries. There, I apply what I call Vibe Engineering. And it looks completely different.
My process isn’t a monologue. It’s a structured adversarial review.
How It Actually Works
Phase 1: Research & Strategy
I never start with code. My first command is always something like:
“I want to achieve X. Research the current documentation and modern best practices. Create a detailed implementation plan.”
But let me be specific about what “research” actually looks like, because this isn’t just a throwaway prompt.
The prompt includes constraints. Technical constraints from the existing codebase — which packages we already use, what the database schema looks like, which patterns are established. Business constraints — what the feature needs to do for users, what data we have, what data we don’t. And explicit anti-patterns — things I’ve tried before that failed, approaches that don’t fit our architecture.
For a real example: when I built the geographic search for Kvalty.cz, the Phase 1 prompt was roughly 400 words. It specified that we use PostgreSQL with PostGIS, that driving schools have latitude/longitude columns stored as geography types (not geometry — this distinction matters for distance calculations on a curved earth), that the search needs to support both radius-based and bounding-box queries, that results need to be sorted by distance, and that the query must perform under 150ms for any region in the Czech Republic.
The output of Phase 1 isn’t code. It’s a structured plan document. For the geographic search, it was a 3-page breakdown: database indexes needed (GiST index on the geography column), the specific PostGIS functions to use (ST_DWithin for radius, ST_MakeEnvelope for bounding box), how to handle the common “search near me” case where the user shares browser geolocation, fallback behavior when geolocation is denied, and pagination strategy for large result sets.
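To make that concrete, here is roughly what the schema side of that plan translates to. This is an illustrative sketch rather than the production migration: the driving_schools table, the location column, and the node-postgres style db helper are stand-ins for whatever the codebase already has.

```typescript
import { Pool } from "pg";

const db = new Pool(); // connection settings come from the usual PG* env vars

// One-time schema step (names illustrative): a geography column, not geometry,
// so distance math is geodesic, plus a GiST index so spatial filters can use
// an index scan instead of checking every row.
export async function ensureLocationIndex() {
  await db.query(`
    ALTER TABLE driving_schools
      ADD COLUMN IF NOT EXISTS location geography(Point, 4326);
    CREATE INDEX IF NOT EXISTS driving_schools_location_gist
      ON driving_schools USING GIST (location);
  `);
}

// Bounding-box mode: "schools inside the current map viewport".
// ST_MakeEnvelope builds the box from the viewport corners; every value is
// bound as a parameter, never interpolated into the SQL string.
export async function schoolsInViewport(
  west: number, south: number, east: number, north: number
) {
  const { rows } = await db.query(
    `SELECT id, name
       FROM driving_schools
      WHERE location && ST_MakeEnvelope($1, $2, $3, $4, 4326)::geography`,
    [west, south, east, north]
  );
  return rows;
}
```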
That plan document becomes the contract for everything that follows.
Phase 2: The Roast
Here’s where it gets interesting. I don’t take that plan and start coding.
I take it and throw it at another AI agent with this instruction:
“You’re a senior architect reviewing a junior developer’s proposal. Your job is to find every flaw. Assume the plan has at least 3 serious problems. Check: security vulnerabilities, SQL injection vectors, performance issues at scale, missing edge cases, error handling gaps, and whether the approach will actually work with our existing schema. Be brutal. I’d rather fix problems in the plan than in production.”
That “be brutal” isn’t decoration. Without explicit permission to be critical, AI defaults to politeness. It’ll say “this looks great, but you might also consider…” instead of “this will break under load because…”
Here’s what The Roast caught for the geographic search feature:
Security hole. The original plan passed user-supplied coordinates directly into a raw SQL query template. The Roast flagged it immediately: “This is a SQL injection vector. User-supplied lat/lng values must be parameterized, not interpolated.” Obvious in retrospect. Easy to miss when you’re focused on the spatial logic.
Performance bottleneck. The plan used ST_Distance to sort results by proximity. The Roast pointed out that ST_Distance on geography types calculates the actual geodesic distance for every row in the result set, which is expensive. The fix: use ST_DWithin for the initial filter (which uses the spatial index), then compute ST_Distance only on the filtered results. The difference on our 1,700-school dataset: 340ms vs 85ms.
Missing edge case. What happens when a user searches with a radius of 0 km? The original plan didn’t handle it. In production, that would return zero results with no explanation. The Roast suggested: treat radius < 1 km as “exact location match” with a 1 km minimum, and show a UI message explaining the adjustment.
Three problems. All caught before a single line of implementation code existed.
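Put together, the three fixes shape the radius query roughly like this. Same caveats as before: a sketch under the same assumptions, not the production code, reusing the hypothetical db pool and table names from the earlier snippet.

```typescript
// Radius search with the three Roast fixes applied (sketch; "db" is the same
// pg Pool as in the earlier snippet):
// 1. lat/lng/radius are bound parameters, never interpolated into the SQL
// 2. ST_DWithin does the filtering and can use the GiST index; ST_Distance
//    is only computed on rows that survive the filter, purely for sorting
// 3. the radius gets a 1 km floor, so a 0 km search still returns something
export async function schoolsNear(lat: number, lng: number, radiusKm: number) {
  const radiusMeters = Math.max(radiusKm, 1) * 1000; // fix 3: clamp to 1 km

  const { rows } = await db.query(
    `SELECT id,
            name,
            ST_Distance(location, ST_MakePoint($1, $2)::geography) AS distance_m
       FROM driving_schools
      WHERE ST_DWithin(location, ST_MakePoint($1, $2)::geography, $3)
      ORDER BY distance_m
      LIMIT 50`,
    [lng, lat, radiusMeters] // PostGIS points are (longitude, latitude)
  );
  return rows;
}
```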
Phase 3: Consensus
These agents argue with each other, under my guidance. We iterate until we land on a design that holds up to the criticism and that I’m satisfied with myself.
Here’s the nuance: sometimes the Roast agent is wrong. It might flag a “performance issue” that doesn’t matter at our scale, or suggest an over-engineered solution to a problem that’s better solved simply. My job in Phase 3 is to mediate.
For the geographic search, the Roast agent also suggested implementing a caching layer with Redis for frequently searched locations. I overruled it. The query was already fast enough at 85ms after the ST_DWithin fix. Adding Redis would mean another infrastructure dependency, cache invalidation complexity (what happens when a school updates its address?), and maintenance overhead. Not worth it for a sub-100ms query.
The Roast agent also recommended implementing a spatial partitioning system — splitting the Czech Republic into grid cells and pre-computing which schools fall in each cell. Technically interesting. Completely unnecessary for 1,700 records. Maybe when we hit 50,000. Not today.
That’s the “consensus” part. It’s not about blindly accepting every suggestion. It’s about having the judgment to know which improvements matter now, which matter later, and which are just engineering vanity.
Only then does the first line of code get written.
A Feature, Start to Finish
Let me walk through the entire flow with another real example: the 200-point ranking algorithm that determines how driving schools are sorted in search results.
Phase 1 output: A scoring specification. 14 factors, each weighted. Review count and average rating (40 points max). Data completeness — does the school have pricing listed, course descriptions, photos? (30 points). Freshness — when was the data last updated? (20 points). Geographic coverage — do they serve multiple locations? (15 points). Response rate to user inquiries (15 points). And several smaller signals. The plan specified the exact SQL query structure: a materialized view that pre-computes scores nightly, with a manual refresh trigger for admin overrides.
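To give a sense of what that spec turns into, here is a heavily trimmed sketch of the materialized view with just two of the fourteen factors. The table names, column names, and point formulas below are illustrative, not the real weights.

```typescript
// Trimmed sketch of the ranking view: two of the fourteen factors, with
// illustrative names and formulas. A nightly job recomputes it.
export async function createRankingView() {
  await db.query(`
    CREATE MATERIALIZED VIEW IF NOT EXISTS school_rankings AS
    SELECT s.id AS school_id,
           -- review factor, capped at 40: average rating (1-5) weighted by
           -- how many reviews back it up
           LEAST(COALESCE(AVG(r.rating), 0) * LEAST(COUNT(r.id), 8), 40)
           -- freshness factor, max 20: recently updated data ranks higher
           + CASE
               WHEN s.updated_at > now() - interval '90 days'  THEN 20
               WHEN s.updated_at > now() - interval '365 days' THEN 10
               ELSE 0
             END
           AS score
      FROM driving_schools s
      LEFT JOIN reviews r ON r.school_id = s.id
     GROUP BY s.id, s.updated_at;
  `);
}
```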
Phase 2 (The Roast) caught: The weighting was gameable. A school could artificially inflate its score by adding empty course listings (boosting “data completeness” without providing real value). The fix: completeness scoring checks for minimum content thresholds — a course listing needs at least a title, price, and description to count. The Roast also caught that a plain materialized view refresh would block reads on the view while it recomputes. On our dataset, that lock would last 2-3 seconds — not catastrophic, but noticeable. Fix: use REFRESH MATERIALIZED VIEW CONCURRENTLY with a unique index.
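Both fixes are small in code terms. A sketch against the same hypothetical schema, with content thresholds that are mine rather than the production values:

```typescript
// Fix 1: the completeness factor only counts course listings that clear a
// minimum content bar, so empty listings can't inflate the score. This SQL
// fragment would replace the completeness term inside the view definition.
const completenessTerm = `
  LEAST((SELECT COUNT(*)
           FROM courses c
          WHERE c.school_id = s.id
            AND c.title IS NOT NULL
            AND c.price IS NOT NULL
            AND length(coalesce(c.description, '')) >= 50) * 5, 30)
`;

// Fix 2: refresh without blocking readers.
export async function refreshRankings() {
  // CONCURRENTLY requires a unique index on the view; create it once.
  await db.query(`
    CREATE UNIQUE INDEX IF NOT EXISTS school_rankings_school_id
      ON school_rankings (school_id);
  `);
  // Rebuilds the scores while readers keep querying the old contents.
  await db.query("REFRESH MATERIALIZED VIEW CONCURRENTLY school_rankings;");
}
```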
Phase 3 consensus: The Roast suggested adding a “trust score” based on how long the school has been registered with the transport authority. I liked the concept but overruled the implementation — we don’t reliably have registration dates for all schools, and scoring based on incomplete data would punish newer entrants unfairly. Parked for later, when our data coverage improves.
Implementation: 4 rounds of iteration. Round 1: core scoring logic and materialized view. Round 2: edge case handling (schools with zero reviews, newly added schools, schools that have been flagged for data issues). Round 3: admin dashboard integration for manual score adjustments. Round 4: performance testing and index optimization.
Total time from Phase 1 to production: 2 days. A feature that would normally involve a product spec, architecture review meeting, implementation sprint, and QA cycle. Same rigor, compressed timeline.
The Iteration Reality
Not every feature is 2 days. Simple CRUD endpoints or UI components might be 1-2 rounds — Phase 1 plan, quick sanity check, implement. The geographic search and ranking algorithm were medium complexity at 3-4 rounds each.
The most complex feature I’ve built this way was the automated data pipeline — scraping, parsing, normalizing, and merging driving school data from 14 different source formats. That took 8 rounds of iteration over a week. The Roast kept finding edge cases: what if two sources have conflicting prices for the same school? What if a school’s name is slightly different across sources (“Autoškola Novák” vs “Autoškola Ing. Novák s.r.o.”)? What if one source lists prices including VAT and another excludes it? Each round caught problems that would have been production bugs.
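I won’t reproduce the pipeline here, but two of the normalization steps those rounds converged on look roughly like this. Everything below is a toy version: the 21% figure is the Czech standard VAT rate, the suffix list is far from complete, and the real matching logic is fuzzier.

```typescript
// Toy versions of two normalization steps from the import pipeline.
// The real rules are more involved; names and thresholds are illustrative.

const CZ_VAT_RATE = 0.21; // Czech standard VAT rate

// Bring every price to a VAT-inclusive figure, so "12 100 Kč incl. VAT" from
// one source and "10 000 Kč excl. VAT" from another don't read as a conflict.
function normalizePriceCzk(amount: number, includesVat: boolean): number {
  return includesVat ? amount : Math.round(amount * (1 + CZ_VAT_RATE));
}

// Normalize names before matching records across sources, so that
// "Autoškola Novák" and "Autoškola Ing. Novák s.r.o." compare as the same
// school: lowercase, strip legal suffixes and academic titles, collapse spaces.
function normalizeName(name: string): string {
  return name
    .toLowerCase()
    .replace(/\b(s\.r\.o\.|a\.s\.|ing\.|mgr\.)\s*/g, "")
    .replace(/\s+/g, " ")
    .trim();
}
```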
8 rounds sounds like a lot. But each round is minutes, not days. The total time invested in adversarial review for that feature was maybe 3 hours. The bugs it prevented would have taken weeks to debug in production, because data pipeline bugs are silent — you don’t know your data is wrong until someone complains.
Tools and Setup
My daily setup is straightforward. Claude Code CLI running in the terminal, usually 2-3 sessions open simultaneously. One session for the current implementation task. One for The Roast review. Sometimes a third for documentation or test writing.
Context management is critical. Each session gets a focused context — the relevant files, the plan document, and specific instructions about its role. The implementation session doesn’t see the Roast feedback until I deliberately share it. The Roast session sees only the plan I hand it, never the implementation. This separation prevents the agents from being “nice” to each other or anchoring on previous output.
For larger features, I maintain a running plan document — a markdown file that tracks what’s been decided, what’s been implemented, and what’s still open. This acts as the single source of truth when context windows get long.
Why This Isn’t “Asking ChatGPT Twice”
I get this pushback. “You’re just asking the same AI to review its own work. That’s circular.”
It’s not. And here’s why.
When you ask a single AI to “write code and then review it,” the review is contaminated by the generation process. The AI is biased toward its own output. It wrote the code, so it “understands” the code, and that understanding masks the flaws.
Vibe Engineering uses structural separation. The research agent and the review agent operate with different prompts, different roles, and different incentive frames. The research agent is told to be creative and comprehensive. The review agent is told to be destructive and skeptical. These aren’t the same thinking mode.
More importantly, I’m the third node. I’m not passively shuttling output between two AI sessions. I’m evaluating both sides, injecting domain knowledge that neither agent has (“this won’t work because our Cloud Run instance has a 30-second request timeout”), and making judgment calls that require understanding the business context, not just the code.
Traditional code review has a human reviewing human code. The reviewer brings experience, pattern recognition, and domain knowledge — but they’re limited by time and attention. They might spend 20 minutes on a complex PR.
Vibe Engineering has AI reviewing AI plans, with a human making the final call. The AI reviewers are tireless, thorough, and check every edge case mechanically. They’ll catch the SQL injection vector and the race condition and the missing null check — all in the same pass. A human reviewer might catch one of those three, depending on what they’re paying attention to.
The tradeoff: AI reviewers miss systemic issues. They don’t know that your Cloud Run instance is memory-constrained, or that your users are mostly on mobile with slow connections, or that the Czech market has specific regulatory requirements for driving school data. That’s what the human brings.
The Result
Kvalty.cz doesn’t run on the first idea an AI hallucinated. It runs on a solution that went through several rounds of critical review — it just happened in minutes, not weeks.
So far it’s held up. 237,000 lines of TypeScript that I maintain solo, a system with 163 database tables that I can explain and modify confidently, and features that ship to production without the “please don’t break” prayer that usually accompanies solo-developer deployments.
What Actually Matters
Vibe Engineering isn’t about AI doing the work for you. It’s about directing, coordinating, and deciding who’s right when two agents disagree.
Writing syntax stopped being the hard part a while ago. The hard part is judgment — knowing what to build, knowing when the AI is wrong, and knowing when “good enough” is actually good enough versus when it’s hiding a problem you’ll find at 2 AM.
I keep finding that the 10 minutes I spend on adversarial review saves me hours of debugging. The math is obvious once you try it.