
Browser Rendering: /crawl endpoint for single-call full-site crawling (open beta)

browser-rendering crawl api headless-browser robots.txt incremental-crawling

Key Points

Single API call to crawl whole sites
Outputs HTML, Markdown, and structured JSON
Respects robots.txt and AI Crawl Control

Summary

Crawl an entire website with a single API call using Browser Rendering's new open-beta /crawl endpoint. Submit a starting URL and the endpoint will automatically discover pages (from sitemaps and links), render them in a headless browser (or fetch static HTML), and produce results in HTML, Markdown, or structured JSON. Jobs run asynchronously: you POST to start a crawl, receive a job ID, and poll for results.
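The start-then-poll flow described above can be sketched as a small polling helper. This is a minimal sketch, not the documented client: the source does not describe the response body, so the helper takes a caller-supplied `fetch_job` function and an `is_done` predicate instead of assuming any field names. The interval and attempt limits are illustrative defaults.

```python
import time
from typing import Any, Callable, Dict


def wait_for_crawl(
    fetch_job: Callable[[], Dict[str, Any]],
    is_done: Callable[[Dict[str, Any]], bool],
    interval_s: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_job repeatedly until is_done(result) is true.

    fetch_job is expected to perform the GET on the job URL and return the
    decoded JSON. is_done decides completion, because the response shape is
    not specified in the text above.
    """
    for attempt in range(max_attempts):
        result = fetch_job()
        if is_done(result):
            return result
        # Do not sleep after the final attempt; fail straight away instead.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"crawl job not finished after {max_attempts} polls")
```

A caller would pass something like `lambda r: r.get("status") == "done"` as `is_done`, where the `status` field is hypothetical and should be replaced with whatever the endpoint documentation specifies.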

Key Points

Single-call workflow: POST to /accounts/{account_id}/browser-rendering/crawl with a url to start; GET /.../crawl/{job_id} to poll results.
Output formats: HTML, Markdown, and structured JSON (Workers AI-powered conversion).
Crawl scope controls: configure depth, page limits, and include/exclude wildcard patterns.
Automatic discovery: finds URLs via sitemaps, page links, or both.
Incremental crawling: use modifiedSince and maxAge to skip unchanged pages and reduce cost.
Static mode: set render: false to fetch static HTML without launching a browser for faster crawls of static sites.
Well-behaved by default: the crawler runs as a signed agent and respects robots.txt (including crawl-delay) and Cloudflare AI Crawl Control.
Limitations: the endpoint cannot bypass Cloudflare bot detection or CAPTCHAs, and it self-identifies as a bot.
Availability: both the Workers Free and Workers Paid plans.
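The options named in the list above can be assembled into a request body as follows. This is a sketch under assumptions: only the field names `url`, `render`, `modifiedSince`, and `maxAge` come from the text, while their value types and units are not stated there, so the values used here are placeholders. Depth, page-limit, and include/exclude fields are left out because the text does not name them.

```python
import json
from typing import Any, Dict, Optional


def build_crawl_payload(
    url: str,
    render: Optional[bool] = None,
    modified_since: Optional[Any] = None,
    max_age: Optional[Any] = None,
) -> Dict[str, Any]:
    """Build a JSON-serializable body for a crawl request.

    Only options that were explicitly provided are included, so the
    endpoint's own defaults apply to everything else.
    """
    payload: Dict[str, Any] = {"url": url}
    if render is not None:
        payload["render"] = render  # False requests static HTML, per the text
    if modified_since is not None:
        payload["modifiedSince"] = modified_since  # value format is an assumption
    if max_age is not None:
        payload["maxAge"] = max_age  # unit is an assumption
    return payload


if __name__ == "__main__":
    # Static-mode crawl that also skips recently fetched pages.
    print(json.dumps(build_crawl_payload("https://example.com", render=False, max_age=3600)))
```

Leaving unset options out of the body, instead of sending nulls, avoids guessing how the endpoint treats explicit null values.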

Quick usage notes for engineers

Start a job: POST { "url": "https://example.com" } to /browser-rendering/crawl → receive job_id.
Retrieve results: poll GET /browser-rendering/crawl/{job_id} until pages are processed.
Optimize costs: use modifiedSince, maxAge, page limits, and render: false where applicable.

Action items

Review the crawl endpoint documentation before integrating.
Check target site robots.txt and sitemaps to ensure compliance and expected discovery behavior.
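For the robots.txt check, a local pre-flight test can be sketched with Python's standard `urllib.robotparser`. This only approximates what a compliant crawler would see; it is not how the /crawl endpoint evaluates rules internally. The user-agent token the crawler presents is not given in the text, so the sketch defaults to the wildcard agent `*`, and the sample robots.txt is made up for illustration.

```python
from typing import Optional, Tuple
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt: str, url: str, user_agent: str = "*") -> Tuple[bool, Optional[int]]:
    """Return (allowed, crawl_delay) for url under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network fetch is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


# Hypothetical robots.txt content, used only to exercise the helper.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

if __name__ == "__main__":
    print(check_robots(SAMPLE_ROBOTS, "https://example.com/blog/post"))
    print(check_robots(SAMPLE_ROBOTS, "https://example.com/private/page"))
```

Running this against a site's real robots.txt before starting a crawl gives an early indication of which paths are likely to be skipped.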


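Browser Rendering - Crawl entire websites with a single API call

Published: 2026-03-10

Edit: This post was edited to clarify crawl behavior with respect to site guidance.

With the new /crawl endpoint (open beta) in Cloudflare's Browser Rendering, you can crawl an entire website with a single API call. Submit a starting URL, and pages are automatically discovered, rendered in a headless browser, and returned in multiple formats, including HTML, Markdown, and structured JSON. The endpoint is a signed agent that respects robots.txt and AI Crawl Control by default, which makes it easier for developers to follow website rules and less likely that crawlers ignore site owners' guidance.

This is well suited to training models, building RAG pipelines, and researching or monitoring content across a site.

Crawl job flow

Crawl jobs run asynchronously. You submit a URL, receive a job ID, and check back for results as pages are processed.

Terminal example

The following are basic examples.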

Initiate a crawl:

curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' -H 'Authorization: Bearer <apiToken>' -H 'Content-Type: application/json' -d '{ "url": "https://blog.cloudflare.com/" }'

Check results:

curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}' -H 'Authorization: Bearer <apiToken>'
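The same two calls can be expressed in Python with the standard library. This is a minimal sketch that only builds the requests; the URL path and headers are taken from the curl commands above, while the helper names are illustrative. Response parsing is omitted because the response shape is not described here.

```python
import json
import urllib.request
from typing import Optional

API_BASE = "https://api.cloudflare.com/client/v4"


def crawl_url(account_id: str, job_id: Optional[str] = None) -> str:
    """Return the crawl endpoint URL, or the job URL when job_id is given."""
    base = f"{API_BASE}/accounts/{account_id}/browser-rendering/crawl"
    return f"{base}/{job_id}" if job_id else base


def start_request(account_id: str, api_token: str, start_url: str) -> urllib.request.Request:
    """Build the POST that starts a crawl from start_url."""
    body = json.dumps({"url": start_url}).encode("utf-8")
    return urllib.request.Request(
        crawl_url(account_id),
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


def poll_request(account_id: str, api_token: str, job_id: str) -> urllib.request.Request:
    """Build the GET that checks a job; it carries no request body."""
    return urllib.request.Request(
        crawl_url(account_id, job_id),
        method="GET",
        headers={"Authorization": f"Bearer {api_token}"},
    )
```

Passing either object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the requests easy to inspect or test without network access.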

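Key features

Multiple output formats
- Retrieve crawled content as HTML, Markdown, or structured JSON (processed by Workers AI).
Crawl scope controls
- Configure crawl depth, page limits, and wildcard patterns that include or exclude specific URL paths.
Automatic page discovery
- Discovers URLs from sitemaps, page links, or both.
Incremental crawling
- Use modifiedSince and maxAge to skip pages that have not changed or were fetched recently, saving time and cost on repeated crawls.
Static mode
- Set render: false to fetch static HTML without launching a browser, which suits fast crawls of static sites.
Well-behaved bot
- Respects robots.txt directives, including crawl-delay.

Available on both the Workers Free and Workers Paid plans.

Note: The /crawl endpoint does not bypass Cloudflare bot detection or CAPTCHAs, and it self-identifies as a bot.

To get started, see the crawl endpoint documentation. If you are setting up your own site to be crawled, review the robots.txt and sitemaps best practices.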
