Crawler Directory

The complete reference to the bots that crawl the web: search engines, AI crawlers, SEO tools, and social preview bots. For each one, you get what it does, whether it respects robots.txt, and the exact rules to control it. To act on any of this, run the Analyzer or open the AI Crawler Manager.

Robots.txt Studio Editorial · Updated June 8, 2026 · Reviewed against Google Search Central and RFC 9309

Every crawler that visits your site

A web crawler is an automated bot that fetches your pages. Some build search indexes that send you traffic (Googlebot, Bingbot), some gather data to train AI models (GPTBot, ClaudeBot, Bytespider), some power SEO tools (AhrefsBot, SemrushBot), and some build social link previews. robots.txt is how you tell each one what it may access. This directory covers the crawlers worth knowing, grouped by what they do.

Blocking AI and SEO crawlers does not affect your Google ranking

Only search-engine crawlers (Googlebot, Bingbot) control your search visibility. You can block every AI and SEO crawler and remain fully indexed in Google and Bing.

Search engines

Crawler	Operator	What it does	Recommended
Googlebot	Google	Google Search's crawler — controls your visibility in Google.	Allow
Bingbot	Microsoft	Microsoft Bing's crawler — controls visibility in Bing.	Allow
Slurp	Yahoo	Yahoo Search's crawler.	Allow

AI training crawlers

Crawler	Operator	What it does	Recommended
GPTBot	OpenAI	Collects content to train OpenAI's models (e.g. ChatGPT).	Block
ClaudeBot	Anthropic	Collects content to train Anthropic's Claude models.	Block
CCBot	Common Crawl	Builds the Common Crawl public dataset used to train many AI models.	Block
Bytespider	ByteDance	ByteDance's crawler, used for AI training.	Block
Meta-ExternalAgent	Meta	Meta's crawler used for AI training.	Block

AI search crawlers

Crawler	Operator	What it does	Recommended
PerplexityBot	Perplexity	Fetches pages for Perplexity's AI answer engine.	Block
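Google-Extended	Google	Control token, not a separate crawler: tells Google whether it may use your content to train Gemini and other AI models.	Block
Applebot-Extended	Apple	Control token, not a separate crawler: tells Apple whether it may use content fetched by Applebot for AI training.	Block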

SEO crawlers

Crawler	Operator	What it does	Recommended
AhrefsBot	Ahrefs	Ahrefs' SEO backlink crawler.	Allow
SemrushBot	Semrush	Semrush's SEO analytics crawler.	Allow
MJ12bot	Majestic	Majestic's backlink crawler.	Allow

Social crawlers

Crawler	Operator	What it does	Recommended
Twitterbot	X (Twitter)	Generates link previews for X/Twitter.	Allow
facebookexternalhit	Meta	Generates link previews for Facebook.	Allow

Other / data crawlers

Crawler	Operator	What it does	Recommended
Amazonbot	Amazon	Amazon's crawler (used by Alexa and AI products).	Block
cohere-ai	Cohere	Cohere's crawler for AI products.	Block
Diffbot	Diffbot	Diffbot's structured-data crawler.	Block
ImagesiftBot	ImageSift	ImageSift's image crawler.	Block

How to allow or block any crawler

Every crawler is identified by a User-agent token. Add a group naming the crawler, then Disallow: / to block it everywhere or an empty Disallow: to allow it. Crawlers not named anywhere fall back to the User-agent: * group.

Block AI training, allow search engines

# Block AI training crawlers, keep search engines
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
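
One way to sanity-check a file like the one above before deploying it is to parse it locally. The sketch below uses Python's standard urllib.robotparser module; the helper name crawl_permissions and the example.com URL are illustrative assumptions, and Python's parser may not match every crawler's own matching logic exactly.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this section, embedded as a string for a local check.
ROBOTS_TXT = """\
# Block AI training crawlers, keep search engines
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""


def crawl_permissions(robots_txt, agents, url="https://example.com/"):
    """Map each user-agent token to True (may fetch url) or False (blocked)."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    for agent, allowed in crawl_permissions(
        ROBOTS_TXT, ["GPTBot", "ClaudeBot", "Googlebot", "Bingbot"]
    ).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Under these rules the named AI crawlers come back as blocked, while Googlebot and Bingbot fall through to the User-agent: * group and are allowed.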

robots.txt is a request, not a firewall

Reputable crawlers obey it; some (like Bytespider) have a patchy record. Use the Analyzer to confirm that your rules actually block a crawler, and verify a bot's identity by reverse DNS before trusting its user-agent.

Frequently asked questions

What is a web crawler?

A web crawler (or bot/spider) is an automated program that fetches web pages. Search engines crawl to build their index, AI companies crawl to train models or answer questions, and SEO tools crawl to map links. You control what each crawler can access using robots.txt.

How do I block a specific crawler?

Add a group to your robots.txt naming the crawler's User-agent token, followed by Disallow: / to block your whole site. For example, User-agent: AhrefsBot then Disallow: /. Use an empty Disallow: to allow it.
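
If you maintain block lists in code, a group like this is easy to generate. The following is a minimal sketch, not part of any tool described here; the function name robots_group and its block parameter are hypothetical.

```python
def robots_group(user_agents, block=True):
    """Build one robots.txt group for the given User-agent tokens.

    block=True emits "Disallow: /" (block the whole site);
    block=False emits an empty "Disallow:" (allow everything).
    """
    lines = [f"User-agent: {token}" for token in user_agents]
    lines.append("Disallow: /" if block else "Disallow:")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Block one SEO crawler across the whole site.
    print(robots_group(["AhrefsBot"]))
    # Explicitly allow another.
    print(robots_group(["SemrushBot"], block=False))
```

Separate the generated groups with a blank line when you write them into the file, as in the example earlier on this page.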

Will blocking crawlers hurt my SEO?

Only if you block search-engine crawlers like Googlebot or Bingbot. Blocking AI crawlers (GPTBot, ClaudeBot) or SEO tools (AhrefsBot, SemrushBot) has no effect on your Google or Bing rankings.

How do I know a crawler is really who it claims to be?

User-agent strings can be spoofed. For major crawlers, verify the request IP with a reverse DNS lookup that forward-confirms to the operator's domain (e.g. googlebot.com), or match the operator's published IP ranges.
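
The forward-confirmed reverse DNS check can be sketched with Python's standard socket module. This is a rough illustration under stated assumptions: the function name verify_crawler_ip is hypothetical, the list of allowed domains must come from each operator's own documentation, and socket.gethostbyname_ex resolves IPv4 addresses only.

```python
import socket


def verify_crawler_ip(ip, allowed_domains,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return True if ip passes a forward-confirmed reverse DNS check.

    allowed_domains is an assumption supplied by the caller, e.g.
    ["googlebot.com"]. The reverse/forward parameters exist only so the
    lookups can be replaced with stubs in tests.
    """
    try:
        # Step 1: reverse lookup, IP -> hostname.
        hostname = reverse(ip)[0]
    except OSError:
        return False

    host = hostname.rstrip(".").lower()
    # Step 2: the hostname must be the domain itself or a subdomain of it.
    # Requiring the leading dot rejects look-alikes such as "fakegooglebot.com".
    if not any(host == d or host.endswith("." + d) for d in allowed_domains):
        return False

    try:
        # Step 3: forward lookup, hostname -> IPs, must include the original IP.
        addresses = forward(host)[2]
    except OSError:
        return False
    return ip in addresses


if __name__ == "__main__":
    # Performs live DNS lookups; the address here is only a placeholder.
    print(verify_crawler_ip("192.0.2.1", ["googlebot.com"]))
```

Matching against the operator's published IP ranges is the alternative when reverse DNS is unavailable or too slow for your request path.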

Robots.txt Studio Editorial · Technical SEO & crawling

We build robots.txt tooling and parse thousands of real-world files. Guides are written by practitioners and reviewed against the Google and RFC 9309 specifications.
