How I Used 20 AI Agents to Extract Data From 1,700 Driving School Websites
The story of the Ralph method — a fleet of parallel Claude Code sessions that autonomously crawled, extracted, and validated every driving school in the Czech Republic.
When you’re building the largest driving school comparator in a country, you need data. A lot of it. And the Czech driving school market doesn’t exactly make that easy — pricing buried in PDFs, contact info scattered across poorly built websites, and no standardized format for anything.
Here’s how I solved that problem using AI agents at a scale I haven’t seen anyone else attempt as a solo founder.
The Problem
Kvalty.cz needed comprehensive data on every driving school in the Czech Republic: names, addresses, contacts, pricing for every course type, license categories, transmission options, installment plans, opening hours, Google Maps ratings — the full picture.
There were ~1,700 schools. Each one with a different website (if they had one at all). Some had modern React sites. Some were static HTML from 2008. Some had their entire pricing in a PDF you had to download. Some had it spread across 5 different subpages.
No API. No standard format. Just chaos.
Phase 1: Discovery
First, I needed to find all the schools. I aggregated data from publicly available business directories and registries, cross-referencing with Google Maps for ratings, geocoding, and region mapping. Czech business ID numbers (IČO) served as the primary matching key, giving us a 99.6% match rate.
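The IČO cross-referencing step can be sketched as a simple keyed merge. This is an illustrative reconstruction, not the real Kvalty.cz pipeline: the field names (`ico`, `rating`, `place_id`) and the `merge_by_ico` helper are assumptions for the example.

```python
# Hypothetical sketch of cross-referencing registry records with Google Maps
# results on the Czech business ID (IČO). Field names are illustrative.
def normalize_ico(raw):
    """IČO is an 8-digit number; sources often drop leading zeros."""
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    return digits.zfill(8)

def merge_by_ico(registry_rows, maps_rows):
    """Join two record lists on normalized IČO; return matches and misses."""
    maps_by_ico = {normalize_ico(r["ico"]): r for r in maps_rows if r.get("ico")}
    merged, unmatched = [], []
    for row in registry_rows:
        hit = maps_by_ico.get(normalize_ico(row["ico"]))
        if hit:
            merged.append({**row, "rating": hit["rating"], "place_id": hit["place_id"]})
        else:
            unmatched.append(row)
    return merged, unmatched
```

Normalizing the key before matching is what makes a high match rate possible: registries and scraped pages disagree on leading zeros far more often than on the number itself.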
This got us the skeleton. But skeletons don’t have pricing data.
Phase 2: The Ralph Method
This is where it gets interesting.
I needed to visit every single school’s website, find their pricing page, extract structured data from whatever format they used, validate it against Google Maps, write descriptions, and translate everything. For 1,700 schools.
Doing this manually would take months. Doing it with a simple scraper wouldn’t work — the formats were too varied, too messy, too dependent on a human actually reading the page.
So I built what I call the Ralph method: a fleet of 20 parallel Claude Code (Anthropic’s CLI) sessions, each running as an autonomous validation agent.
Why “Ralph”?
The name comes from Ralph Wiggum — the Simpsons character who’s cheerfully oblivious but somehow gets the job done. The agents reminded me of him during testing. They’d encounter a school website that was just a Facebook page with a phone number in the bio, and they’d dutifully report: “Website content extracted. Pricing: not found. Description: this school maintains a social media presence.” Technically correct, charmingly naive. The name stuck.
The Architecture
An orchestrator script (Python + Bash) splits 1,700 schools into 20 worker batches. Each worker runs an isolated Claude Code session via GNU screen with a custom 826-line prompt (PROMPT.md) that describes exactly what the agent should do.
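The batch-splitting and session-launching logic can be sketched in a few lines. This is a minimal illustration of the idea, not the actual orchestrator: the worker script name (`run-worker.sh`), batch file naming, and session names are all assumptions.

```python
# Illustrative orchestrator sketch: split school IDs into 20 batches and
# launch one detached GNU screen session per worker. Script and file names
# are hypothetical.
import subprocess

def split_batches(items, n_workers=20):
    """Round-robin split so every worker gets a near-equal share."""
    batches = [[] for _ in range(n_workers)]
    for i, item in enumerate(items):
        batches[i % n_workers].append(item)
    return batches

def launch_workers(batches, dry_run=True):
    """Build (and optionally run) one `screen -dmS` command per batch."""
    cmds = []
    for i, batch in enumerate(batches):
        cmd = ["screen", "-dmS", f"ralph-{i}", "bash", "run-worker.sh", f"batch-{i}.json"]
        cmds.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return cmds
```

Detached `screen` sessions are a deliberately boring choice: each worker survives terminal disconnects, and you can attach to any one of them to watch an agent think.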
Each Claude instance has access to:
- `WebFetch` for crawling websites
- Filesystem tools (`Read`/`Write`/`Edit`) for managing data files
- `Bash` for running helper scripts
- Task management tools for tracking progress
Critically: git push/commit is explicitly denied. The agents do the work, but a human reviews every change before it touches the database.
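Claude Code supports deny rules in its settings file, which is one way to enforce this kind of restriction. The snippet below is a sketch of what such a deny-list could look like — check Anthropic’s settings documentation for the exact rule syntax before relying on it.

```json
{
  "permissions": {
    "deny": [
      "Bash(git push:*)",
      "Bash(git commit:*)"
    ]
  }
}
```

The point is that the guardrail lives in configuration, not in the prompt: an agent that gets confused mid-task still physically cannot ship anything without a human in the loop.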
Inside the 826-Line Prompt
The PROMPT.md is the brain of the operation. Here’s a representative slice of what the agent’s instructions look like:
```markdown
## Pricing Extraction Rules

When extracting pricing, you MUST:

- Match each found price to an existing service in the database using semantic matching
  - "Řidičské oprávnění skupiny B" → maps to category "License B - Car"
  - "Kondičné jízdy" → maps to category "Refresher Driving Lessons"
  - "Výcvik na automat" → maps to "License B - Car (Automatic)"
- NEVER create new service categories — only match to existing ones
- If a price includes VAT info, extract both with-VAT and without-VAT amounts
- If pricing is "od 15 000 Kč" (from 15,000 CZK), mark as minimum price, not fixed
- Currency is ALWAYS CZK. If you see EUR, convert at the rate in config.json
```
The semantic matching piece was critical. Czech driving schools describe the same service in dozens of ways. “Řidičské oprávnění skupiny B,” “Řidičák na auto,” “Výcvik sk. B,” “Autoškola B” — all mean the same thing: License B for cars. The prompt includes ~60 of these mapping examples so the agent can handle the variations without hallucinating new categories.
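In the real pipeline the model does this matching in-prompt against ~60 examples. A deterministic approximation of the same idea — normalize diacritics, then look for known keyword patterns — might look like this. The pattern table and `match_category` helper are illustrative, not the production mappings.

```python
# A rough, deterministic stand-in for the in-prompt semantic matching:
# strip Czech diacritics, lowercase, then scan for known keyword patterns.
import unicodedata

# Order matters: check the more specific "automatic" pattern first.
CATEGORY_PATTERNS = [
    (("automat",), "License B - Car (Automatic)"),
    (("skupiny b", "sk. b", "ridicak na auto", "autoskola b"), "License B - Car"),
    (("kondicni", "kondicne"), "Refresher Driving Lessons"),
]

def strip_diacritics(text):
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def match_category(service_name):
    name = strip_diacritics(service_name.lower())
    for keywords, category in CATEGORY_PATTERNS:
        if any(kw in name for kw in keywords):
            return category
    return None  # never invent a new category; flag for human review instead
```

Returning `None` rather than guessing mirrors the prompt’s hard rule: an unmatched service goes to a human, not into a hallucinated category.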
Other prompt sections cover: how to handle multi-page pricing (follow all links that look like pricing subpages), how to write descriptions at different quality tiers based on the school’s Google rating, how to flag suspicious data for human review, and strict rules about what constitutes “no website” versus “website exists but has no useful content.”
What Each Agent Does (Autonomously)
For every school, the agent:
- Finds the next incomplete task via `find-next-task.py`
- Reads the database snapshot (`input.json`) and working copy (`output.json`)
- Runs Google Maps lookup → gets rating, place_id, CID, verified address, opening hours
- Crawls the school’s entire website — static pages via `WebFetch`, JavaScript-heavy sites (Wix, React, Vue) via `js-site-extractor.py` (Playwright-based), PDFs via `pdf-parser.py`
- Extracts all pricing — courses, services, fees, with semantic matching to existing database records. Handles license types (A, B, C, D, T…), transmission (manual/automatic), lesson hours, installment plans
- Writes two descriptions per school — one factual, one SEO-optimized — with rating-based tone tiers: 4.5+ stars = enthusiastic, below 3.5 = neutral
- Translates everything Czech → English
- Validates against a Pydantic schema (`validate-output.py`)
- Creates a `VALIDATION-SUMMARY.md` documenting every change made
- Marks the task complete and stops — one school per invocation, no overlap risk
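The Pydantic validation step is worth a closer look, since it is the hard gate between agent output and the database. Here is a guess at what the schema behind `validate-output.py` might look like — the field names and constraints are my assumptions, not the real Kvalty.cz models.

```python
# Hypothetical sketch of the validation schema enforced by validate-output.py.
# Field names and constraints are illustrative.
from typing import List, Literal, Optional
from pydantic import BaseModel, Field

class PriceItem(BaseModel):
    category: str                      # must match an existing DB category
    amount_czk: int = Field(gt=0)      # whole CZK, must be positive
    is_minimum: bool = False           # True for "od 15 000 Kč" style prices
    transmission: Optional[Literal["manual", "automatic"]] = None

class SchoolRecord(BaseModel):
    ico: str                           # 8-digit Czech business ID
    name: str
    rating: Optional[float] = Field(default=None, ge=1, le=5)
    prices: List[PriceItem] = []
```

A schema like this catches the failure modes you would expect from LLM extraction — negative or zero prices, impossible ratings, malformed records — before a human ever reviews the diff.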
The Schools That Broke Everything
Out of 1,700 schools, most were straightforward. But a handful of edge cases nearly broke the pipeline:
The Wix infinite scroll. One school in Brno built their entire site on Wix with pricing displayed as a scrollable “services” section that loaded dynamically as you scrolled down. The WebFetch tool grabbed the initial HTML — which contained zero pricing. The Playwright-based js-site-extractor.py captured the first scroll viewport, which showed motorcycle courses only. The car pricing? Loaded after 3 scroll events. I had to add scroll-to-bottom behavior to the JS extractor specifically for Wix lazy-loading patterns. One school, one custom fix.
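The scroll-to-bottom fix can be sketched against a Playwright-style page object. The stopping heuristic below — keep scrolling until the page height stabilizes — is my assumption about a reasonable approach, not the exact production logic in `js-site-extractor.py`.

```python
# Sketch of a scroll-to-bottom loop for Wix-style lazy loading. `page` is a
# Playwright-style page object (evaluate / wait_for_timeout / content).
# The height-stabilizes stopping rule is an assumption for this sketch.
def scroll_until_stable(page, max_scrolls=10, pause_ms=800):
    last_height = -1
    for _ in range(max_scrolls):
        height = page.evaluate("document.body.scrollHeight")
        if height == last_height:
            break  # nothing new loaded since the last scroll
        last_height = height
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give lazy content time to render
    return page.content()
```

The `max_scrolls` cap matters in a fleet: one pathological page with a genuinely infinite feed must not stall a worker forever.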
The PDF with no text layer. A school in southern Moravia had their full pricing table in a scanned PDF — an image of a printed document. No selectable text. The pdf-parser.py got nothing. OCR would have worked but wasn’t in the pipeline (adding Tesseract to 20 parallel agents wasn’t worth the complexity for fewer than 10 schools). These got flagged for manual extraction. I typed those prices in by hand. Sometimes the old way is the only way.
The mid-crawl URL migration. Three weeks into the 32-day run, a school in Plzeň decided to redesign their website. New domain structure, old URLs returning 404s. The agent dutifully reported “website appears to be offline.” But the school wasn’t offline — they’d just moved /cenik to /sluzby/cenik-autoskoly. The changedetection.io monitors caught this one later, but during the batch run, this school’s data came back empty and I had to re-queue it manually.
These edge cases represent maybe 3–4% of all schools. But they consumed about 20% of my time. The long tail of web weirdness is where automation breaks down and human judgment takes over.
The Cost
Here’s the math that made this viable.
All 20 agents ran on Claude Code with Haiku — Anthropic’s fastest and cheapest model. For 1,639 schools over ~32 days, the total API cost came to roughly $0.15–0.25 per school. Call it ~$300 for the entire country.
For comparison: hiring manual data entry contractors to visit 1,700 websites, extract pricing, verify against Google Maps, write descriptions, and translate? At even $3–5 per school (which is conservative for quality bilingual work), that’s $5,000–8,500. And it would take a team of people weeks, with inevitable inconsistencies between extractors.
The Ralph method cost roughly 5% of the manual alternative and produced more consistent output because every school was processed by the same prompt with the same rules.
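The arithmetic behind these figures is easy to sanity-check:

```python
# Back-of-envelope check of the cost figures quoted above.
schools_processed = 1639
api_cost_mid = schools_processed * 0.20          # midpoint of $0.15–0.25/school
manual_low = 1700 * 3                            # contractors at $3/school
manual_high = 1700 * 5                           # contractors at $5/school
ratio = api_cost_mid / ((manual_low + manual_high) / 2)
```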
The infrastructure cost was negligible — the orchestrator ran on my local machine. The Playwright instances for JS-heavy sites were the heaviest part, but 20 headless Chrome tabs on a MacBook Pro is manageable.
The Results
Over ~32 days (January–February 2026):
- 1,639 out of 1,640 schools validated (~99.9%)
- ~80% had new pricing items added
- ~40% had address corrections via Google Maps cross-referencing
- 100% received freshly rewritten descriptions
All of this running on Claude Code Haiku for cost efficiency.
Phase 3: Keeping It Fresh
Schools change their prices. Websites get updated. You can’t just extract once and call it done.
The Ralph method handles ongoing updates too. changedetection.io with headless Chrome monitors every school’s pricing page daily. When a change is detected, Ralph agents compare the current website data against what’s already in the database, extract any differences, and create a draft update for human review.
Same autonomous workflow, same validation rigor — just running continuously instead of as a one-time batch. The agents already know the school’s existing data, so they can intelligently diff what changed rather than re-extracting everything from scratch.
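The diff-instead-of-re-extract idea reduces to comparing two keyed price maps. The record shape below (category → price in CZK) and the `diff_prices` helper are assumptions for the sketch, not the production data model.

```python
# Illustrative diff of freshly extracted prices against stored ones, keyed by
# service category. Record shape is an assumption for this sketch.
def diff_prices(stored, extracted):
    """Return added/removed/changed categories for human review."""
    changes = {"added": [], "removed": [], "changed": []}
    for category, price in extracted.items():
        if category not in stored:
            changes["added"].append(category)
        elif stored[category] != price:
            changes["changed"].append((category, stored[category], price))
    for category in stored:
        if category not in extracted:
            changes["removed"].append(category)
    return changes
```

A review draft built from a diff like this is far easier to approve than a full re-extraction: the reviewer sees three short lists instead of a whole school record.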
The bigger picture
A year ago, this data didn’t exist in any structured, comparable form. Driving school pricing in the Czech Republic was locked in PDFs, buried in website footers, and fundamentally opaque.
Now it’s all in one place: searchable, filterable, comparable. And it stays current automatically.
The Ralph method proved something I had long suspected: AI agents aren’t limited to generating code. They can do real work at a scale that would be economically impossible for a solo founder otherwise. Twenty agents running in parallel, each making intelligent decisions about messy real-world data, coordinated by an orchestrator and validated by a human.
That’s a long way from vibe coding.
Could this approach work for other data extraction problems? I think so. Any domain where information is scattered across hundreds of inconsistent websites — tradespeople, restaurants, local services, healthcare providers — the same architecture applies. An orchestrator, parallel agents, a detailed prompt with domain-specific semantic mappings, schema validation, and human review.
The whole pipeline is part of Kvalty.cz — and it’s still running, keeping 1,700+ school profiles up to date.