AI Crawler Directory
AI companies crawl your site to train models and answer questions about it. This directory lists the crawlers that matter, what each one actually does, whether it respects robots.txt, and the exact rules to allow or block it. To apply any of this in one click, use the AI Crawler Manager.
Every AI crawler at a glance
Two questions decide every AI crawler policy: does it use my content for training, and does it honor robots.txt? Training crawlers (GPTBot, ClaudeBot, CCBot) are the ones most sites block. AI-search crawlers (OAI-SearchBot, PerplexityBot) can send referral traffic, so blocking them is a trade-off.
| Crawler | Company | Trains AI | AI search | Honors robots.txt | Recommended |
|---|---|---|---|---|---|
| GPTBot | OpenAI | Yes | No | Yes | Block |
| OAI-SearchBot | OpenAI | No | Yes | Yes | Depends |
| ClaudeBot | Anthropic | Yes | No | Yes | Block |
| Claude-SearchBot | Anthropic | No | Yes | Yes | Depends |
| PerplexityBot | Perplexity | Partial | Yes | Partial | Depends |
| CCBot | Common Crawl | Yes | No | Yes | Block |
| Google-Extended | Google | Yes | No | Yes | Block |
| Bytespider | ByteDance | Yes | No | Partial | Block |
| Amazonbot | Amazon | Yes | Partial | Yes | Depends |
Blocking AI crawlers does not affect Google ranking
How to allow or block any AI crawler
Each crawler is identified by its User-agent token. Add a group for that token with Disallow: / to block it everywhere, or with an empty Disallow: line to allow it everywhere. Several crawlers can share one group by listing their User-agent lines together above the same rules.
# Block the major AI training crawlers, allow everything else
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Compliance varies
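The shared-group rules above can be sanity-checked locally before deploying them. The sketch below feeds the same robots.txt text to Python's standard-library `urllib.robotparser` and asks whether a given token may fetch a URL. The helper name `is_allowed` and the sample URL are assumptions made for illustration, and Python's parser is only one implementation, so real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example above, with blank lines between groups.
ROBOTS_TXT = """\
# Block the major AI training crawlers, allow everything else
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules let user_agent fetch url.

    Hypothetical helper: parses the text in memory, no network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/post"  # illustrative URL
    print(is_allowed(ROBOTS_TXT, "GPTBot", page))     # False: in the blocked group
    print(is_allowed(ROBOTS_TXT, "Googlebot", page))  # True: falls through to *
```

A check like this only confirms that the file says what you intend. It does not show whether a crawler actually obeys it.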
GPTBot — OpenAI
Crawls the web to gather training data for OpenAI's foundation models.
| Trains AI | AI search | Honors robots.txt |
|---|---|---|
| Yes | No | Yes |
# Block GPTBot
User-agent: GPTBot
Disallow: /
# Allow GPTBot
User-agent: GPTBot
Disallow:
Recommended: Block
OAI-SearchBot — OpenAI
Fetches and links pages to surface them in ChatGPT Search results.
| Trains AI | AI search | Honors robots.txt |
|---|---|---|
| No | Yes | Yes |
# Block OAI-SearchBot
User-agent: OAI-SearchBot
Disallow: /
# Allow OAI-SearchBot
User-agent: OAI-SearchBot
Disallow:
Recommended: Depends
ClaudeBot — Anthropic
Crawls the web to gather training data for Anthropic's Claude models.
| Trains AI | AI search | Honors robots.txt |
|---|---|---|
| Yes | No | Yes |
# Block ClaudeBot
User-agent: ClaudeBot
Disallow: /
# Allow ClaudeBot
User-agent: ClaudeBot
Disallow:
Recommended: Block
Claude-SearchBot — Anthropic
Fetches pages so Claude can cite and answer with current web results.
| Trains AI | AI search | Honors robots.txt |
|---|---|---|
| No | Yes | Yes |
# Block Claude-SearchBot
User-agent: Claude-SearchBot
Disallow: /
# Allow Claude-SearchBot
User-agent: Claude-SearchBot
Disallow:
Recommended: Depends
PerplexityBot — Perplexity
Indexes pages for Perplexity's AI answer engine and citations.
| Trains AI | AI search | Honors robots.txt |
|---|---|---|
| Partial | Yes | Partial |
# Block PerplexityBot
User-agent: PerplexityBot
Disallow: /
# Allow PerplexityBot
User-agent: PerplexityBot
Disallow:
Recommended: Depends
CCBot — Common Crawl
Builds the open Common Crawl dataset that seeds many third-party AI models.
| Trains AI | AI search | Honors robots.txt |
|---|---|---|
| Yes | No | Yes |
# Block CCBot
User-agent: CCBot
Disallow: /
# Allow CCBot
User-agent: CCBot
Disallow:
Recommended: Block
Google-Extended — Google
Opt-out token controlling whether Google uses your content for Gemini training and grounding.
| Trains AI | AI search | Honors robots.txt |
|---|---|---|
| Yes | No | Yes |
# Block Google-Extended
User-agent: Google-Extended
Disallow: /
# Allow Google-Extended
User-agent: Google-Extended
Disallow:
Recommended: Block
Bytespider — ByteDance
Aggressively crawls the web for ByteDance/TikTok AI training.
| Trains AI | AI search | Honors robots.txt |
|---|---|---|
| Yes | No | Partial |
# Block Bytespider
User-agent: Bytespider
Disallow: /
# Allow Bytespider
User-agent: Bytespider
Disallow:
Recommended: Block
Amazonbot — Amazon
Crawls pages for Amazon products including Alexa answers and AI features.
| Trains AI | AI search | Honors robots.txt |
|---|---|---|
| Yes | Partial | Yes |
# Block Amazonbot
User-agent: Amazonbot
Disallow: /
# Allow Amazonbot
User-agent: Amazonbot
Disallow:
Recommended: Depends
What is an AI crawler?
An AI crawler is a bot that fetches web pages to train an AI model or to answer questions in an AI product. Examples include OpenAI's GPTBot, Anthropic's ClaudeBot, and Perplexity's PerplexityBot. They are separate from search-engine crawlers like Googlebot.
Do AI crawlers respect robots.txt?
The major operators (OpenAI, Anthropic, Google, Common Crawl) document that their crawlers honor robots.txt. Some crawlers, such as Bytespider, have been reported to ignore it. robots.txt is a voluntary standard, so use the Analyzer to confirm a crawler is actually being blocked.
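Because compliance is voluntary, server logs are the place to see whether a blocked crawler keeps requesting pages. The sketch below is one rough way to check. It assumes an access log in the common "combined" format, where the user agent is the last quoted field, and counts requests from known AI crawler tokens while ignoring fetches of /robots.txt itself. The function name, the regular expression, and the sample log lines (including their user-agent strings) are all illustrative assumptions, not real crawler output. Google-Extended is left out because it is an opt-out token rather than a crawler with its own requests.

```python
import re
from collections import Counter

# User-agent tokens taken from the directory above.
AI_CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot",
    "PerplexityBot", "CCBot", "Bytespider", "Amazonbot",
]

# Assumed combined log format:
# ... "METHOD /path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_ai_crawler_hits(log_lines):
    """Count non-robots.txt requests per AI crawler token."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # line is not in the assumed format
        if match.group("path") == "/robots.txt":
            continue  # fetching robots.txt is expected, even when blocked
        agent = match.group("agent").lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in agent:
                hits[token] += 1
                break
    return hits


# Made-up log lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleAgent/1.0 (compatible; Bytespider)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:02 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "ExampleAgent/1.0 (compatible; Bytespider)"',
    '198.51.100.7 - - [01/Jan/2025:00:00:03 +0000] "GET /blog HTTP/1.1" 200 900 "-" "ExampleBrowser/1.0"',
]

if __name__ == "__main__":
    for token, count in count_ai_crawler_hits(SAMPLE_LOG).items():
        print(token, count)
```

A nonzero count for a token you have disallowed suggests the rule is being ignored, although a user-agent string can be spoofed, so treat the result as a signal rather than proof.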
Should I block all AI crawlers?
Block training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended) if you don't want your content used to train models. Blocking them has no effect on search ranking. AI-search crawlers like OAI-SearchBot and PerplexityBot can drive referral traffic, so blocking those is a business trade-off, not an automatic win.
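A policy like this can be kept as a short list of tokens and turned into a robots.txt file. The sketch below is a minimal illustration of that idea: the function name `build_robots_txt`, its parameters, and the example sitemap URL are assumptions, and the output blocks the listed tokens everywhere while allowing all other crawlers, including the AI-search ones.

```python
# Training crawler tokens named in the answer above.
TRAINING_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]


def build_robots_txt(blocked_tokens, sitemap_url=None):
    """Build robots.txt text that blocks the given tokens and allows the rest.

    Hypothetical helper: all blocked tokens share one group with Disallow: /.
    """
    lines = []
    if blocked_tokens:
        lines += [f"User-agent: {token}" for token in blocked_tokens]
        lines += ["Disallow: /", ""]
    # Every crawler not named above falls through to this group.
    lines += ["User-agent: *", "Allow: /"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(TRAINING_CRAWLERS, "https://example.com/sitemap.xml"))
```

To block an AI-search crawler as well, add its token to the list. Keeping the policy in one place makes that trade-off easy to revisit later.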
Which AI crawler should I worry about most?
CCBot has the widest reach: Common Crawl's dataset feeds dozens of downstream model trainers, so one block stops many of them. GPTBot, ClaudeBot, and Google-Extended cover the largest commercial models directly.
Robots.txt Studio Editorial · Technical SEO & crawling
We build robots.txt tooling and parse thousands of real-world files. Guides are written by practitioners and reviewed against the Google and RFC 9309 specifications.