Browser Rendering: /crawl endpoint for single-call full-site crawling (open beta)
Key Points
- Single API call to crawl whole sites
- Outputs HTML, Markdown, and structured JSON
- Respects robots.txt and AI Crawl Control
Summary
Crawl an entire website with a single API call using Browser Rendering's new open-beta /crawl endpoint. Submit a starting URL and the endpoint will automatically discover pages (from sitemaps and links), render them in a headless browser (or fetch static HTML), and produce results in HTML, Markdown, or structured JSON. Jobs run asynchronously: you POST to start a crawl, receive a job ID, and poll for results.
Key Points
- Single-call workflow: POST to `/accounts/{account_id}/browser-rendering/crawl` with a `url` to start; GET `/.../crawl/{job_id}` to poll results.
- Output formats: HTML, Markdown, and structured JSON (Workers AI-powered conversion).
- Crawl scope controls: configure depth, page limits, and include/exclude wildcard patterns.
- Automatic discovery: finds URLs via sitemaps, page links, or both.
- Incremental crawling: use `modifiedSince` and `maxAge` to skip unchanged pages and reduce cost.
- Static mode: set `render: false` to fetch static HTML without launching a browser, for faster crawls of static sites.
- Well-behaved by default: the crawler is a signed agent that respects `robots.txt`, `crawl-delay`, and Cloudflare AI Crawl Control.
- Limitations: cannot bypass Cloudflare bot detection or captchas, and self-identifies as a bot.
- Availability: Workers Free and Paid plans.
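As a rough illustration of the options listed above, the following Python sketch assembles the endpoint URL and a request body. The parameter names `url`, `render`, `modifiedSince`, and `maxAge` come from this summary; the API base URL, the placement of every option at the top level of the body, and the value formats for `modifiedSince` and `maxAge` are assumptions that should be checked against the endpoint documentation. The helper names `crawl_endpoint` and `build_crawl_body` are hypothetical.

```python
from typing import Any, Dict, Optional

# Assumption: the endpoint lives under the standard Cloudflare API base URL.
API_BASE = "https://api.cloudflare.com/client/v4"


def crawl_endpoint(account_id: str) -> str:
    """Return the crawl endpoint URL for an account (path taken from the summary)."""
    return f"{API_BASE}/accounts/{account_id}/browser-rendering/crawl"


def build_crawl_body(
    url: str,
    *,
    render: Optional[bool] = None,
    modified_since: Any = None,  # value format not specified in this summary
    max_age: Any = None,         # value format not specified in this summary
) -> Dict[str, Any]:
    """Build a request body, including only the options that were explicitly set."""
    body: Dict[str, Any] = {"url": url}
    if render is not None:
        body["render"] = render
    if modified_since is not None:
        body["modifiedSince"] = modified_since
    if max_age is not None:
        body["maxAge"] = max_age
    return body
```

Leaving unset options out of the body lets the service apply its own defaults rather than hard-coding guesses on the client side.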
Quick usage notes for engineers
- Start a job: POST `{ "url": "https://example.com" }` to `/browser-rendering/crawl` → receive `job_id`.
- Retrieve results: poll GET `/browser-rendering/crawl/{job_id}` until pages are processed.
- Optimize costs: use `modifiedSince`, `maxAge`, page limits, and `render: false` where applicable.
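The start-then-poll flow above might be wrapped in a small loop like the one below. This is a sketch, not the official client: the HTTP calls are injected as functions (`start_job` performs the POST and returns the job ID, `get_job` performs the GET) and the completion check `is_done` is supplied by the caller, because the response schema is not described in this summary. The function name `run_crawl`, the polling interval, and the poll limit are all hypothetical choices.

```python
import time
from typing import Any, Callable, Dict


def run_crawl(
    start_job: Callable[[Dict[str, Any]], str],
    get_job: Callable[[str], Any],
    is_done: Callable[[Any], bool],
    url: str,
    *,
    interval_s: float = 5.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Start a crawl, then poll until is_done() accepts the response."""
    job_id = start_job({"url": url})
    for _ in range(max_polls):
        result = get_job(job_id)
        if is_done(result):
            return result
        sleep(interval_s)  # wait before the next poll
    raise TimeoutError(f"crawl job {job_id} not finished after {max_polls} polls")
```

Injecting the transport keeps the loop independent of any particular HTTP library and makes it easy to exercise with fakes before pointing it at the real API.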
Action items
- Review the crawl endpoint documentation before integrating.
- Check target site `robots.txt` and sitemaps to ensure compliance and expected discovery behavior.
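One way to pre-check a target site's `robots.txt` is with Python's standard-library `urllib.robotparser`, as sketched below. The helper name `inspect_robots` and the user agent string `ExampleCrawler` are hypothetical; the actual user agent the crawler announces is not given in this summary, and Python's rule matching may differ in detail from what the crawler applies, so treat the result as an approximation.

```python
from typing import Any, Dict
from urllib.robotparser import RobotFileParser


def inspect_robots(robots_txt: str, user_agent: str, url: str) -> Dict[str, Any]:
    """Parse robots.txt text and report what it says for one user agent and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "allowed": parser.can_fetch(user_agent, url),
        "crawl_delay": parser.crawl_delay(user_agent),  # None if not declared
        "sitemaps": parser.site_maps() or [],           # site_maps() needs Python 3.8+
    }


# Example input; "ExampleCrawler" is a placeholder user agent, not the real one.
sample = (
    "User-agent: *\n"
    "Disallow: /private/\n"
    "Crawl-delay: 2\n"
    "Sitemap: https://example.com/sitemap.xml\n"
)
report = inspect_robots(sample, "ExampleCrawler", "https://example.com/private/page")
```

Here the sample rules disallow `/private/`, so `report["allowed"]` is `False`, and the listed sitemap shows up in `report["sitemaps"]` as a candidate source for URL discovery.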