How we built Cloudflare's data platform and an AI agent on top of it
Key Points
- Unified SQL lakehouse with Trino + Iceberg on R2
- Default-closed governance with automated PII scanning
- Skipper: grounded NL→SQL agent built on Workers AI
Summary
Cloudflare consolidated scattered observability and product data into Town Lake, a lakehouse-backed unified SQL layer, and built Skipper, an AI agent that translates natural-language questions into auditable SQL and visualizations. The platform is designed for scale (billions of events/sec), freshness-based tiering (unsampled for billing/security, downsampled for dashboards), and strict, default-closed governance with automated PII detection and time-bound, auditable access.
Architecture
- Query engine: Apache Trino as a single SQL entrypoint that can join Postgres, ClickHouse, and Iceberg-on-R2 without intermediate materialization.
- Cold/warm storage: R2 + Apache Iceberg for schema evolution, time travel, compaction, and cost-based retention (per-minute → hourly → daily rollups).
- Metadata: DataHub as the canonical catalog for schemas, owners, lineage, and usage metadata used by the agent and planners.
- Access control: Lifeguard stores and composes access rules; Trino reads policies at query time; requests and grants are auditable.
- Ingestion & ELT: an orchestrated pipeline ingests Parquet files into Iceberg tables, and Transformer (Workflows + YAML DAGs) runs scheduled SQL transforms on Trino.
- PII handling: Skimmer continuously samples and classifies columns (two-pass classifier + human review) and integrates findings into DataHub and Lifeguard.
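The per-minute → hourly → daily retention scheme can be illustrated with a minimal sketch. This is not Cloudflare's implementation: the function names and the 7-day and 90-day cutoffs are assumptions chosen for illustration, and the data is a toy list of (epoch_seconds, count) pairs.

```python
from collections import defaultdict

MINUTE, HOUR, DAY = 60, 3600, 86400


def rollup(points, bucket_seconds):
    """Sum (epoch_seconds, count) points into fixed-width time buckets."""
    buckets = defaultdict(int)
    for ts, count in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds] += count
    return sorted(buckets.items())


def resolution_for_age(age_days, minute_days=7, hourly_days=90):
    """Pick the bucket width kept for data of a given age.

    The 7-day and 90-day cutoffs are hypothetical placeholders.
    """
    if age_days <= minute_days:
        return MINUTE
    if age_days <= hourly_days:
        return HOUR
    return DAY


# Toy per-minute counts: two in the first hour, one in the second,
# and one on the following day.
per_minute = [(0, 5), (60, 3), (3600, 2), (86400, 1)]
hourly = rollup(per_minute, HOUR)
daily = rollup(hourly, DAY)
```

Because counts are additive, rolling hourly buckets up to daily gives the same totals as rolling up from per-minute data directly. Averages and percentiles do not compose this way and would need sums, counts, or sketches carried through each tier.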
Governance and UX
- Default-closed model: new tables are hidden until Skimmer classification and reviewer approval; unreviewed columns are not shown in DESCRIBE/SELECT *.
- Session-based PII opt-in: sensitive columns are redacted by default; users flip a session bit to access raw PII subject to permission checks and logging.
- Self-serve review flow: error messages point to review requests; Skipper can suggest RBAC groups and link to approval flows.
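The default-closed and session opt-in rules above can be sketched as a row projection. This is a simplified model under stated assumptions, not how Lifeguard or Trino enforce policy: `Column`, `Session`, `project_row`, the `pii_grant_until` expiry field, and the `<redacted>` placeholder are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field

REDACTED = "<redacted>"  # hypothetical placeholder value


@dataclass(frozen=True)
class Column:
    name: str
    reviewed: bool = False  # set once classification and review finish
    pii: bool = False


@dataclass
class Session:
    user: str
    pii_grant_until: float = 0.0  # epoch seconds; 0 means no grant
    pii_opt_in: bool = False      # the per-session opt-in bit
    audit_log: list = field(default_factory=list)


def project_row(columns, row, session, now):
    """Return the row as this session is allowed to see it."""
    # Raw PII needs both the opt-in bit and an unexpired grant.
    may_read_pii = session.pii_opt_in and now < session.pii_grant_until
    visible = {}
    for col in columns:
        if not col.reviewed:
            continue  # default-closed: unreviewed columns stay hidden
        value = row[col.name]
        if col.pii:
            if may_read_pii:
                # Record every raw PII read for later audit.
                session.audit_log.append((session.user, col.name, now))
            else:
                value = REDACTED
        visible[col.name] = value
    return visible


columns = [
    Column("zone_id", reviewed=True),
    Column("email", reviewed=True, pii=True),
    Column("raw_payload"),  # not yet reviewed
]
row = {"zone_id": 42, "email": "a@example.com", "raw_payload": "..."}
```

In this sketch the opt-in bit and the grant are checked independently, so a user with a valid grant still sees redacted values until they opt in, and an opted-in session loses raw access once the grant expires.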
Skipper (AI data agent)
- Natural-language → SQL pipeline: uses DataHub to find tables, pulls schema/lineage, synthesizes SQL, submits to Trino, and returns tables/charts.
- Closed-loop reasoning: carries conversational context, retries/repairs queries that produce unexpected results, and packages charts/dashboards.
- Built on Cloudflare stack: Workers, Workers AI, Durable Objects, D1, R2, Workflows; integrates Lifeguard and Transformer tools for access checks and transforms.
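The generate, run, and repair loop described above can be sketched with injected callables. This is an illustrative outline, not Skipper's code: `answer`, `QueryError`, and the three `fake_*` stand-ins are hypothetical, and treating an empty result as "unexpected" is an assumption made to keep the example small.

```python
class QueryError(Exception):
    """Raised by the hypothetical run_query callable when a query fails."""


def answer(question, find_tables, generate_sql, run_query, max_attempts=3):
    """Generate SQL, run it, and retry with feedback until rows come back."""
    tables = find_tables(question)  # catalog lookup (DataHub's role)
    feedback = None
    attempts = []
    for _ in range(max_attempts):
        sql = generate_sql(question, tables, feedback)  # model call
        attempts.append(sql)  # keep every attempt so the answer is auditable
        try:
            rows = run_query(sql)  # engine call (Trino's role)
        except QueryError as exc:
            feedback = f"query failed: {exc}"
            continue
        if not rows:
            feedback = "query returned no rows"
            continue
        return {"sql": sql, "rows": rows, "attempts": attempts}
    raise RuntimeError(f"no usable result after {max_attempts} attempts")


# Stand-ins for the catalog, model, and engine so the loop runs offline.
def fake_find_tables(question):
    return {"requests": ["ts", "zone_id", "status"]}


def fake_generate_sql(question, tables, feedback):
    column = "status" if feedback else "status_code"  # first guess is wrong
    return f"SELECT {column}, count(*) FROM requests GROUP BY 1"


def fake_run_query(sql):
    if "status_code" in sql:
        raise QueryError("column status_code does not exist")
    return [(200, 10), (500, 1)]


result = answer(
    "errors by status?", fake_find_tables, fake_generate_sql, fake_run_query
)
```

Passing the catalog, model, and engine in as callables keeps the loop testable without network access and makes the retry budget explicit, so a persistently failing question ends in an error instead of an unbounded loop.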
Practical notes for engineers
- Use Town Lake as the single SQL surface for cross-system joins; prefer Trino for ad hoc queries and Transformer for repeatable ELT DAGs.
- Register and review new datasets promptly (Skimmer automates detection). Expect default-closed behavior, and request reviews via the self-serve links.
- For investigations requiring PII, request time-bounded access and flip the session PII bit; every query is logged for audit.
- When building user-facing tools or agents, ground models with metadata (DataHub) and live queries (Trino) to avoid hallucinated joins or misused columns.
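One way to apply the grounding advice is to check every column a model proposes against catalog metadata before any SQL is run. The sketch below assumes the (table, column) references have already been extracted from the proposed query by some upstream step; `ungrounded_references` and the toy `catalog` are hypothetical, and a real catalog lookup would go through DataHub.

```python
def ungrounded_references(catalog, references):
    """Return the (table, column) pairs that the catalog does not know."""
    missing = []
    for table, column in references:
        # An unknown table has no columns, so all its references are missing.
        if column not in catalog.get(table, set()):
            missing.append((table, column))
    return missing


# Toy catalog: table name -> set of known column names.
catalog = {
    "accounts": {"account_id", "plan"},
    "requests": {"ts", "zone_id", "account_id", "status"},
}

# References a model might propose for a join between the two tables.
proposed = [
    ("requests", "account_id"),
    ("accounts", "account_id"),
    ("accounts", "region"),  # not in the catalog
]
```

Here `ungrounded_references(catalog, proposed)` flags only `("accounts", "region")`, which can be fed back to the model as a correction instead of being discovered as a failed query.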
Why it matters
Together, Town Lake and Skipper reduce reliance on tribal knowledge, centralize lineage and governance, and give engineers across the company accurate, auditable answers from data without exposing sensitive data by default.