How I Used 20 AI Agents to Extract Data From 1,700 Driving School Websites
The story of the Ralph method — a fleet of parallel Claude Code sessions that autonomously crawled, extracted, and validated every driving school in the Czech Republic.
When you’re building the largest driving school comparator in a country, you need data. A lot of it. And the Czech driving school market doesn’t exactly make that easy — pricing buried in PDFs, contact info scattered across poorly built websites, and no standardized format for anything.
Here’s how I solved that problem using AI agents at a scale I haven’t seen anyone else attempt as a solo founder.
The Problem
Kvalty.cz needed comprehensive data on every driving school in the Czech Republic: names, addresses, contacts, pricing for every course type, license categories, transmission options, installment plans, opening hours, Google Maps ratings — the full picture.
There were ~1,700 schools. Each one with a different website (if they had one at all). Some had modern React sites. Some were static HTML from 2008. Some had their entire pricing in a PDF you had to download. Some had it spread across 5 different subpages.
No API. No standard format. Just chaos.
Phase 1: Discovery
First, I needed to find all the schools. I aggregated data from publicly available business directories and registries, cross-referencing with Google Maps for ratings, geocoding, and region mapping. Czech business ID numbers (ICO) served as the primary matching key, giving us a 99.6% match rate.
This got us the skeleton. But skeletons don’t have pricing data.
Phase 2: The Ralph Method
This is where it gets interesting.
I needed to visit every single school’s website, find their pricing page, extract structured data from whatever format they used, validate it against Google Maps, write descriptions, and translate everything. For 1,700 schools.
Doing this manually would take months. Doing it with a simple scraper wouldn’t work — the formats were too varied, too messy, too human-readable-only.
So I built what I call the Ralph method: a fleet of 20 parallel Claude Code (Anthropic’s CLI) sessions, each running as an autonomous validation agent.
The Architecture
An orchestrator script (Python + Bash) splits 1,700 schools into 20 worker batches. Each worker runs an isolated Claude Code session via GNU screen with a custom 826-line prompt (PROMPT.md) that describes exactly what the agent should do.
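The batching-and-launch logic can be sketched as follows. This is a simplified assumption of how such an orchestrator might look, not the actual script; the `claude -p` invocation, the per-worker directory layout, and the session naming are illustrative.

```python
#!/usr/bin/env python3
"""Sketch of an orchestrator: split schools into 20 batches and launch one
detached GNU screen session per Claude Code worker."""
import math
import subprocess
from pathlib import Path

NUM_WORKERS = 20

def make_batches(schools: list, n: int = NUM_WORKERS) -> list[list]:
    """Split the school list into n roughly equal contiguous batches."""
    size = math.ceil(len(schools) / n)
    return [schools[i:i + size] for i in range(0, len(schools), size)]

def launch_worker(batch_dir: Path, worker_id: int) -> None:
    """Start a detached screen session (-dmS) running one worker,
    feeding it the shared PROMPT.md as its instruction prompt."""
    subprocess.run([
        "screen", "-dmS", f"ralph-{worker_id}",
        "claude", "-p", (batch_dir / "PROMPT.md").read_text(),
    ], check=True)
```

Detached `screen` sessions mean each worker survives terminal disconnects and can be reattached for inspection with `screen -r ralph-3`.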
Each Claude instance has access to:
- `WebFetch` for crawling websites
- Filesystem tools (`Read`/`Write`/`Edit`) for managing data files
- `Bash` for running helper scripts
- Task management tools for tracking progress
Critically: git push/commit is explicitly denied. The agents do the work, but a human reviews every change before it touches the database.
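Claude Code supports declarative tool permissions in its settings file, which is one way to enforce this boundary. A sketch of such a config, assuming a `.claude/settings.json` per worker directory (the exact allow list here is illustrative):

```json
{
  "permissions": {
    "allow": ["WebFetch", "Read", "Write", "Edit", "Bash(python3:*)"],
    "deny": ["Bash(git push:*)", "Bash(git commit:*)"]
  }
}
```

With the deny rules in place, an agent that tries to commit simply gets a permission error instead of silently mutating history.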
What Each Agent Does (Autonomously)
For every school, the agent:
- Finds the next incomplete task via `find-next-task.py`
- Reads the database snapshot (`input.json`) and working copy (`output.json`)
- Runs a Google Maps lookup → gets rating, place_id, CID, verified address, opening hours
- Crawls the school’s entire website — static pages via `WebFetch`, JavaScript-heavy sites (Wix, React, Vue) via `js-site-extractor.py` (Playwright-based), PDFs via `pdf-parser.py`
- Extracts all pricing — courses, services, fees — with semantic matching to existing database records; handles license types (A, B, C, D, T…), transmission (manual/automatic), lesson hours, installment plans
- Writes two descriptions per school — one factual, one SEO-optimized — with rating-based tone tiers: 4.5+ stars = enthusiastic, below 3.5 = neutral
- Translates everything Czech → English
- Validates against the Pydantic schema (`validate-output.py`)
- Creates a `VALIDATION-SUMMARY.md` documenting every change made
- Marks the task complete and stops — one school per invocation, no overlap risk
The Results
Over ~32 days (January–February 2026):
- 1,639 out of 1,640 schools validated (~99.9%)
- ~80% had new pricing items added
- ~40% had address corrections via Google Maps cross-referencing
- 100% got freshly rewritten descriptions
All of this ran on Claude Haiku (via Claude Code) for cost efficiency.
Phase 3: Keeping It Fresh
Schools change their prices. Websites get updated. You can’t just extract once and call it done.
The Ralph method handles ongoing updates too. changedetection.io with headless Chrome monitors every school’s pricing page daily. When a change is detected, Ralph agents compare the current website data against what’s already in the database, extract any differences, and create a draft update for human review.
Same autonomous workflow, same validation rigor — just running continuously instead of as a one-time batch. The agents already know the school’s existing data, so they can intelligently diff what changed rather than re-extracting everything from scratch.
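The diffing step can be sketched as a pure function: given the database snapshot and the freshly extracted prices, emit only what changed, ready for a human-reviewable draft. The flat `{course name: price in CZK}` shape is an assumption for illustration; the real records are richer.

```python
def diff_prices(existing: dict[str, int], scraped: dict[str, int]) -> dict:
    """Compare freshly scraped prices against the database snapshot and
    return only the differences for a draft update.
    Keys are course names, values are prices in CZK (assumed shape)."""
    added = {k: v for k, v in scraped.items() if k not in existing}
    removed = {k: v for k, v in existing.items() if k not in scraped}
    changed = {k: (existing[k], v) for k, v in scraped.items()
               if k in existing and existing[k] != v}
    return {"added": added, "removed": removed, "changed": changed}
```

An empty diff means the page change was cosmetic and no human review is needed, which is what makes daily monitoring cheap.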
Why This Matters
A year ago, this data didn’t exist in any structured, comparable form. Driving school pricing in the Czech Republic was locked in PDFs, buried in website footers, and fundamentally opaque.
Now it’s all in one place: searchable, filterable, comparable. And it stays current automatically.
The Ralph method proved something I strongly believe in: AI agents aren’t just for generating code. They’re for doing work at scale that would be economically impossible for a solo founder otherwise. Twenty agents running in parallel, each making intelligent decisions about messy real-world data, coordinated by an orchestrator and validated by a human.
That’s not vibe coding. That’s systems engineering with AI as the workforce.
The whole pipeline is part of Kvalty.cz — and it’s still running, keeping 1,700+ school profiles up to date.