AI Crawler Directory

AI companies crawl your site to train models and answer questions about it. This directory lists the crawlers that matter, what each one actually does, whether it respects robots.txt, and the exact rules to allow or block it. To apply any of this in one click, use the AI Crawler Manager.

Robots.txt Studio Editorial · Updated June 8, 2026 · Reviewed against Google Search Central and RFC 9309

Every AI crawler at a glance

Two questions decide every AI crawler policy: does it use my content for training, and does it honor robots.txt? Training crawlers (GPTBot, ClaudeBot, CCBot) are the ones most sites block. AI-search crawlers (OAI-SearchBot, PerplexityBot) can send referral traffic, so blocking them is a trade-off.

Crawler | Company | Trains AI | AI search | Honors robots.txt | Recommended
GPTBot | OpenAI | Yes | No | Yes | Block
OAI-SearchBot | OpenAI | No | Yes | Yes | Depends
ClaudeBot | Anthropic | Yes | No | Yes | Block
Claude-SearchBot | Anthropic | No | Yes | Yes | Depends
PerplexityBot | Perplexity | Partial | Yes | Partial | Depends
CCBot | Common Crawl | Yes | No | Yes | Block
Google-Extended | Google | Yes | No | Yes | Block
Bytespider | ByteDance | Yes | No | Partial | Block
Amazonbot | Amazon | Yes | Partial | Yes | Depends
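
The Recommended column appears to follow a simple pattern: crawlers that train AI and offer no AI search are marked Block, and everything else is marked Depends. The sketch below encodes that reading of the table in Python. The `Crawler` type and the `suggest` function are hypothetical names introduced for illustration, and the rule is an inference from the rows, not a stated policy.

```python
from typing import NamedTuple


class Crawler(NamedTuple):
    token: str          # User-agent token used in robots.txt
    company: str
    trains_ai: str      # "Yes", "No", or "Partial"
    ai_search: str      # "Yes", "No", or "Partial"
    honors_robots: str  # "Yes" or "Partial"
    recommended: str    # "Block" or "Depends", as listed in the table


# Rows transcribed from the table above.
CRAWLERS = [
    Crawler("GPTBot", "OpenAI", "Yes", "No", "Yes", "Block"),
    Crawler("OAI-SearchBot", "OpenAI", "No", "Yes", "Yes", "Depends"),
    Crawler("ClaudeBot", "Anthropic", "Yes", "No", "Yes", "Block"),
    Crawler("Claude-SearchBot", "Anthropic", "No", "Yes", "Yes", "Depends"),
    Crawler("PerplexityBot", "Perplexity", "Partial", "Yes", "Partial", "Depends"),
    Crawler("CCBot", "Common Crawl", "Yes", "No", "Yes", "Block"),
    Crawler("Google-Extended", "Google", "Yes", "No", "Yes", "Block"),
    Crawler("Bytespider", "ByteDance", "Yes", "No", "Partial", "Block"),
    Crawler("Amazonbot", "Amazon", "Yes", "Partial", "Yes", "Depends"),
]


def suggest(crawler: Crawler) -> str:
    """Assumed rule: block pure training crawlers, otherwise it depends."""
    if crawler.trains_ai == "Yes" and crawler.ai_search == "No":
        return "Block"
    return "Depends"


# Tokens the assumed rule would block, in table order.
block_list = [c.token for c in CRAWLERS if suggest(c) == "Block"]
print(block_list)
```

Keeping the rows as data makes it easy to swap in a different rule, for example one that also blocks crawlers with only partial robots.txt compliance.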

Blocking AI crawlers does not affect Google ranking

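GPTBot and the other AI crawlers are separate from Googlebot, and Google-Extended is a control token that Google documents as having no effect on inclusion or ranking in Google Search. You can block every AI training crawler without changing how Googlebot crawls and indexes your site.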

How to allow or block any AI crawler

Each crawler is identified by its User-agent token. To block a crawler everywhere, add a group for it with Disallow: /. To allow it, leave the Disallow: value empty. Multiple crawlers can share one group: list several User-agent lines above the same rules.
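
The group syntax described here is regular enough to generate from a list of tokens. The following Python sketch shows one way to do that. `build_group` is a hypothetical helper, not part of any robots.txt tool, and it only handles the whole-site block and allow cases discussed in this section.

```python
def build_group(tokens: list[str], block: bool) -> str:
    """Return one robots.txt group shared by every token in `tokens`."""
    if not tokens:
        raise ValueError("a group needs at least one User-agent token")
    lines = [f"User-agent: {token}" for token in tokens]
    # "Disallow: /" blocks the whole site; an empty "Disallow:" allows everything.
    lines.append("Disallow: /" if block else "Disallow:")
    return "\n".join(lines) + "\n"


training_group = build_group(["GPTBot", "ClaudeBot", "CCBot"], block=True)
search_group = build_group(["OAI-SearchBot"], block=False)

# Groups are separated by a blank line in the final file.
robots_txt = training_group + "\n" + search_group
print(robots_txt)
```

Generating the file this way keeps each token in exactly one group, which avoids accidentally listing the same crawler under both a blocking and an allowing rule.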

Block AI training, keep search engines
# Block the major AI training crawlers, allow everything else
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
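
Before deploying a file like the one above, it can help to check it locally. The sketch below parses the same rules with `urllib.robotparser` from Python's standard library and asks whether each agent may fetch a page. The page URL is a made-up example. The standard-library parser's matching logic is simpler than the longest-match rule in RFC 9309, so treat it as a rough check; for whole-site rules like these the two should agree.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Block the major AI training crawlers, allow everything else
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/articles/hello"  # hypothetical page
for agent in ("GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "Googlebot"):
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent}: {verdict}")
```

This only confirms what the rules say. It does not show whether a given crawler actually obeys them.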

Compliance varies

Well-behaved crawlers (OpenAI, Anthropic, Google) honor these rules. Others, such as Bytespider, have a patchy record. robots.txt is a request, not an enforced firewall, so verify behavior in the Analyzer.
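
Server access logs are another place to check whether a blocked crawler is still requesting pages. The Python sketch below counts requests per AI crawler token, ignoring fetches of robots.txt itself. It assumes logs in the common "combined" format, the sample lines are fabricated for illustration, and `count_ai_hits` is a hypothetical helper. User-agent strings can be spoofed or omitted, so a clean result is not proof of compliance.

```python
import re
from collections import Counter

# Tokens for the crawlers in this directory. Google-Extended is omitted because
# it is a robots.txt control token, not a user-agent string sent with requests.
AI_TOKENS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot",
             "PerplexityBot", "CCBot", "Bytespider", "Amazonbot")

# Assumed combined log format: "METHOD path HTTP/x" status size "referer" "agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_ai_hits(lines):
    """Count requests per AI crawler token, excluding robots.txt fetches."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or match["path"].startswith("/robots.txt"):
            continue  # reading robots.txt is expected, even for blocked crawlers
        agent = match["agent"].lower()
        for token in AI_TOKENS:
            if token.lower() in agent:
                hits[token] += 1
    return hits


sample = [  # fabricated log lines, for illustration only
    '203.0.113.5 - - [08/Jun/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [08/Jun/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '198.51.100.7 - - [08/Jun/2026:10:00:09 +0000] "GET /blog HTTP/1.1" 200 8000 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"',
]
print(count_ai_hits(sample))
```

A non-zero count for a crawler that is blocked in robots.txt suggests it is worth enforcing the block at the server or CDN level instead.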

GPTBot — OpenAI

Crawls the web to gather training data for OpenAI's foundation models.

Trains AI | AI search | Honors robots.txt
Yes | No | Yes

GPTBot allow / block config (alternatives: use one group, not both)
# Block GPTBot
User-agent: GPTBot
Disallow: /

# Allow GPTBot
User-agent: GPTBot
Disallow:

Recommended: Block

Pure training crawler — block it if you don't want your content training ChatGPT.

OAI-SearchBot — OpenAI

Fetches and links pages to surface them in ChatGPT Search results.

Trains AI | AI search | Honors robots.txt
No | Yes | Yes

OAI-SearchBot allow / block config (alternatives: use one group, not both)
# Block OAI-SearchBot
User-agent: OAI-SearchBot
Disallow: /

# Allow OAI-SearchBot
User-agent: OAI-SearchBot
Disallow:

Recommended: Depends

Allow it for referral traffic from ChatGPT Search; block it to stay out of AI answers.

ClaudeBot — Anthropic

Crawls the web to gather training data for Anthropic's Claude models.

Trains AI | AI search | Honors robots.txt
Yes | No | Yes

ClaudeBot allow / block config (alternatives: use one group, not both)
# Block ClaudeBot
User-agent: ClaudeBot
Disallow: /

# Allow ClaudeBot
User-agent: ClaudeBot
Disallow:

Recommended: Block

Training crawler — block it to keep your content out of Claude's training set.

Claude-SearchBot — Anthropic

Fetches pages so Claude can cite and answer with current web results.

Trains AI | AI search | Honors robots.txt
No | Yes | Yes

Claude-SearchBot allow / block config (alternatives: use one group, not both)
# Block Claude-SearchBot
User-agent: Claude-SearchBot
Disallow: /

# Allow Claude-SearchBot
User-agent: Claude-SearchBot
Disallow:

Recommended: Depends

Allow it to be cited in Claude's web answers; block it to opt out of AI search.

PerplexityBot — Perplexity

Indexes pages for Perplexity's AI answer engine and citations.

Trains AI | AI search | Honors robots.txt
Partial | Yes | Partial

PerplexityBot allow / block config (alternatives: use one group, not both)
# Block PerplexityBot
User-agent: PerplexityBot
Disallow: /

# Allow PerplexityBot
User-agent: PerplexityBot
Disallow:

Recommended: Depends

Drives referral traffic, but has been reported to fetch some pages without a declared agent — verify with the Analyzer.

CCBot — Common Crawl

Builds the open Common Crawl dataset that seeds many third-party AI models.

Trains AI | AI search | Honors robots.txt
Yes | No | Yes

CCBot allow / block config (alternatives: use one group, not both)
# Block CCBot
User-agent: CCBot
Disallow: /

# Allow CCBot
User-agent: CCBot
Disallow:

Recommended: Block

One block keeps future crawls of your content out of the Common Crawl dataset, which many downstream model trainers use. It does not remove pages that were already archived.

Google-Extended — Google

Opt-out token controlling whether Google uses your content for Gemini training and grounding.

Trains AI | AI search | Honors robots.txt
Yes | No | Yes

Google-Extended allow / block config (alternatives: use one group, not both)
# Block Google-Extended
User-agent: Google-Extended
Disallow: /

# Allow Google-Extended
User-agent: Google-Extended
Disallow:

Recommended: Block

Blocking it has zero effect on Google Search ranking — Googlebot is separate — so you can opt out of AI training safely.

Bytespider — ByteDance

Aggressively crawls the web for ByteDance/TikTok AI training.

Trains AI | AI search | Honors robots.txt
Yes | No | Partial

Bytespider allow / block config (alternatives: use one group, not both)
# Block Bytespider
User-agent: Bytespider
Disallow: /

# Allow Bytespider
User-agent: Bytespider
Disallow:

Recommended: Block

High-volume training crawler with a patchy compliance record — most sites block it to save bandwidth.

Amazonbot — Amazon

Crawls pages for Amazon products including Alexa answers and AI features.

Trains AI | AI search | Honors robots.txt
Yes | Partial | Yes

Amazonbot allow / block config (alternatives: use one group, not both)
# Block Amazonbot
User-agent: Amazonbot
Disallow: /

# Allow Amazonbot
User-agent: Amazonbot
Disallow:

Recommended: Depends

Block to opt out of Amazon AI; allow if you want Alexa to answer from your content.

Frequently asked questions

What is an AI crawler?

An AI crawler is a bot that fetches web pages to train an AI model or to answer questions in an AI product. Examples include OpenAI's GPTBot, Anthropic's ClaudeBot, and Perplexity's PerplexityBot. They are separate from search-engine crawlers like Googlebot.

Do AI crawlers respect robots.txt?

The major operators (OpenAI, Anthropic, Google, Common Crawl) document that their crawlers honor robots.txt. Some crawlers, such as Bytespider, have been reported to ignore it. robots.txt is a voluntary standard, so use the Analyzer to confirm a crawler is actually being blocked.

Should I block all AI crawlers?

Block training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended) if you don't want your content used to train models — it has no effect on search ranking. AI-search crawlers like OAI-SearchBot and PerplexityBot can drive referral traffic, so blocking those is a business trade-off, not an automatic win.

Which AI crawler should I worry about most?

CCBot has the widest reach: Common Crawl's dataset feeds dozens of downstream model trainers, so one block keeps future crawls of your pages away from many of them. GPTBot, ClaudeBot, and Google-Extended cover the largest commercial models directly.


Robots.txt Studio Editorial · Technical SEO & crawling

We build robots.txt tooling and parse thousands of real-world files. Guides are written by practitioners and reviewed against the Google and RFC 9309 specifications.