Crawler Directory
The complete reference to the bots that crawl the web: search engines, AI crawlers, SEO tools, and social preview bots. Each entry covers what the bot does, whether it respects robots.txt, and the exact rules to control it. To act on any of this, run the Analyzer or open the AI Crawler Manager.
Every crawler that visits your site
A web crawler is an automated bot that fetches your pages. Some build search indexes that send you traffic (Googlebot, Bingbot), some gather data to train AI models (GPTBot, ClaudeBot, Bytespider), some power SEO tools (AhrefsBot, SemrushBot), and some build social link previews. robots.txt is how you tell each one what it may access. This directory covers the crawlers worth knowing, grouped by what they do.
Blocking AI and SEO crawlers does not affect your Google ranking
Search engines
AI training crawlers
| Crawler | Operator | What it does | Recommended |
|---|---|---|---|
| GPTBot | OpenAI | Collects content to train OpenAI's models (e.g. ChatGPT). | Block |
| ClaudeBot | Anthropic | Collects content to train Anthropic's Claude models. | Block |
| CCBot | Common Crawl | Builds the Common Crawl public dataset used to train many AI models. | Block |
| Bytespider | ByteDance | ByteDance's crawler, used for AI training. | Block |
| Meta-ExternalAgent | Meta | Meta's crawler used for AI training. | Block |
AI search crawlers
| Crawler | Operator | What it does | Recommended |
|---|---|---|---|
| PerplexityBot | Perplexity | Fetches pages for Perplexity's AI answer engine. | Block |
| Google-Extended | Google | Controls whether Google may use your content to train Gemini and other AI models. | Block |
| Applebot-Extended | Apple | Controls whether Apple may use your content for AI training. | Block |
SEO crawlers
| Crawler | Operator | What it does | Recommended |
|---|---|---|---|
| AhrefsBot | Ahrefs | Ahrefs' SEO backlink crawler. | Allow |
| SemrushBot | Semrush | Semrush's SEO analytics crawler. | Allow |
| MJ12bot | Majestic | Majestic's backlink crawler. | Allow |
Other / data crawlers
| Crawler | Operator | What it does | Recommended |
|---|---|---|---|
| Amazonbot | Amazon | Amazon's crawler (used by Alexa and AI products). | Block |
| cohere-ai | Cohere | Cohere's crawler for AI products. | Block |
| Diffbot | Diffbot | Diffbot's structured-data crawler. | Block |
| ImagesiftBot | ImageSift | ImageSift's image crawler. | Block |
How to allow or block any crawler
Every crawler is identified by a User-agent token. Add a group naming the crawler, then Disallow: / to block it everywhere or an empty Disallow: to allow it. Crawlers not named anywhere fall back to the User-agent: * group.
# Block AI training crawlers, keep search engines
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Bytespider
Disallow: /
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
robots.txt is a request, not a firewall
What is a web crawler?
A web crawler (or bot/spider) is an automated program that fetches web pages. Search engines crawl to build their index, AI companies crawl to train models or answer questions, and SEO tools crawl to map links. You control what each crawler can access using robots.txt.
How do I block a specific crawler?
Add a group to your robots.txt naming the crawler's User-agent token, followed by Disallow: / to block your whole site. For example, User-agent: AhrefsBot then Disallow: /. Use an empty Disallow: to allow it.
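Before deploying a rule set, it can help to check that each group blocks and allows what you expect. The following is a minimal sketch using Python's standard `urllib.robotparser`. The `ROBOTS_TXT` string mirrors the example rules in this guide, and `is_allowed` is a hypothetical helper name, not part of any tool mentioned here. Python's parser matches user-agent tokens loosely and is not a full RFC 9309 implementation, so treat the result as a sanity check rather than a guarantee of how a given crawler will behave.

```python
from urllib.robotparser import RobotFileParser

# Example rules: one group shared by four AI training crawlers,
# plus a catch-all group for everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Print the decision for two blocked crawlers and one unnamed crawler.
for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
    print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/page"))
```

With these rules, the named AI crawlers should be reported as disallowed, while Googlebot, which is not named in any group, falls back to the `User-agent: *` group and is allowed.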
Will blocking crawlers hurt my SEO?
Only if you block search-engine crawlers like Googlebot or Bingbot. Blocking AI crawlers (GPTBot, ClaudeBot) or SEO tools (AhrefsBot, SemrushBot) has no effect on your Google or Bing rankings.
How do I know a crawler is really who it claims to be?
User-agent strings can be spoofed. For major crawlers, verify the request IP with a reverse DNS lookup that forward-confirms to the operator's domain (e.g. googlebot.com), or match the operator's published IP ranges.
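The reverse-then-forward DNS check described above can be sketched in Python with the standard `socket` module. This is an illustration under stated assumptions, not an official verification routine: `is_verified_crawler`, `reverse_lookup`, and `forward_lookup` are hypothetical names, and the list of trusted domain suffixes must come from the operator's own documentation. The lookup functions are passed in as parameters so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Iterable, List


def reverse_lookup(ip: str) -> str:
    """Return the hostname that the IP's PTR record points to."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Return every IPv4/IPv6 address the hostname resolves to."""
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def is_verified_crawler(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS check for a request IP.

    `allowed_suffixes` should include the leading dot (for example
    ".googlebot.com") so that "evilgooglebot.com" does not match.
    """
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        # No PTR record or lookup failure: cannot verify.
        return False

    # Step 1: the reverse hostname must sit under a trusted domain.
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False

    # Step 2: the hostname must resolve back to the original IP.
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

A call such as `is_verified_crawler(request_ip, [".googlebot.com"])` performs live DNS queries, so in practice you would likely cache results per IP rather than resolve on every request.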
AI Crawler Manager
Allow or block GPTBot, ClaudeBot, PerplexityBot and more in one place.
Robots.txt Studio Editorial · Technical SEO & crawling
We build robots.txt tooling and parse thousands of real-world files. Guides are written by practitioners and reviewed against the Google and RFC 9309 specifications.
Social crawlers