robots.txt Glossary

Every term you'll meet in a robots.txt file, defined in plain English. New to the format? Start with "What Is robots.txt?", then paste a real file into the Explainer to see these terms in context.

Robots.txt Studio Editorial · Updated June 8, 2026 · Reviewed against Google Search Central and RFC 9309

How to use this glossary

Terms are grouped into directives (the lines you write), matching and precedence (how crawlers decide what a rule covers), core concepts, and crawler types. Jump to a group from the table of contents, or read straight through.

Directives

Term	Definition
User-agent	The robots.txt line that names which crawler a group of rules applies to. User-agent: * matches every crawler not named more specifically; User-agent: Googlebot targets just Googlebot.
Disallow	Tells the named crawler not to fetch URLs matching a path. Disallow: / blocks the whole site; Disallow: /admin blocks anything starting with /admin; an empty Disallow: allows everything.
Allow	Explicitly permits a path, used to carve exceptions out of a broader Disallow. For example, disallow /wp-admin/ but Allow: /wp-admin/admin-ajax.php.
Sitemap	Declares the absolute URL of your XML sitemap so crawlers can find it. It is independent of any User-agent group and can appear anywhere in the file.
Crawl-delay	Requests a minimum number of seconds between requests from a crawler. Bing and several others honor it; Googlebot ignores it and adjusts its crawl rate automatically (the Search Console crawl-rate limiter was retired in January 2024).
Host	A non-standard directive once used by Yandex to indicate a preferred domain. It is not part of the official standard and is ignored by Google and Bing.
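
To see the directives above working together, here is a minimal sketch that parses a small file with Python's standard-library urllib.robotparser. The file contents, the example.com URLs, and the ExampleBot name are illustrative assumptions. Note that this particular parser applies rules in file order and does not understand wildcards, so the Allow exception is listed before the broader Disallow; site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is not named in the file, so the * group applies to it.
admin_ok = parser.can_fetch("ExampleBot", "https://example.com/wp-admin/settings.php")
ajax_ok = parser.can_fetch("ExampleBot", "https://example.com/wp-admin/admin-ajax.php")
delay = parser.crawl_delay("ExampleBot")   # seconds requested by Crawl-delay
sitemaps = parser.site_maps()              # list of Sitemap URLs, or None

print(admin_ok, ajax_ok, delay, sitemaps)
```

Whether a given crawler honors Crawl-delay is up to that crawler; the parser only reports the value the file requests.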

Matching & precedence

Term	Definition
Group	A block of one or more User-agent lines followed by the Allow/Disallow rules that apply to them. A crawler obeys only the group that most specifically matches its name; if several groups name the same crawler, their rules are combined.
Wildcard (*)	Inside a path, * matches any sequence of characters. Disallow: /*.pdf blocks every URL whose path contains .pdf after any prefix; add $ (Disallow: /*.pdf$) to block only URLs that end in .pdf. (As a User-agent value, * means 'all crawlers'.)
End anchor ($)	The $ character anchors a path match to the end of the URL. Disallow: /*.php$ blocks URLs ending in .php but not /file.php?id=1.
Longest-match precedence	When both an Allow and a Disallow rule match a URL, the more specific (longer path) rule wins. If they are the same length, Allow wins. This is how exceptions work.
User-agent specificity	A crawler uses the group whose User-agent most specifically names it, not the * group, if a specific group exists — even if the * group has more rules. Only one group ever applies.
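
The matching terms above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a full RFC 9309 implementation: the function names, the GROUPS data, and the exact-name group lookup are hypothetical, and percent-encoding is ignored.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then let every * match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def select_group(groups, product_token):
    """Return the rules of the group naming this crawler, else the * group."""
    wanted = product_token.lower()
    for name, rules in groups.items():
        if name.lower() == wanted:
            return rules
    return groups.get("*", [])

def is_allowed(rules, path):
    """Longest matching rule wins; Allow wins a tie; no match means allowed."""
    best_length, allowed = -1, True
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty Disallow: restricts nothing
        if pattern_to_regex(pattern).match(path):  # match() anchors at the start
            length = len(pattern)
            if length > best_length or (length == best_length and kind == "allow"):
                best_length, allowed = length, kind == "allow"
    return allowed

# Hypothetical groups, keyed by User-agent value.
GROUPS = {
    "*": [("disallow", "/")],
    "Googlebot": [
        ("disallow", "/wp-admin/"),
        ("allow", "/wp-admin/admin-ajax.php"),
        ("disallow", "/*.php$"),
    ],
}

googlebot_rules = select_group(GROUPS, "googlebot")
other_rules = select_group(GROUPS, "OtherBot")  # falls back to the * group
```

Here Googlebot reads only its own group, so the blanket Disallow: / in the * group never applies to it, while the longer Allow path overrides both shorter Disallow rules for admin-ajax.php.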

Core concepts

Term	Definition
robots.txt	A plain-text file at the root of a domain (/robots.txt) that tells crawlers which parts of the site they may request. It controls crawling, not indexing, and is a voluntary standard.
Robots Exclusion Protocol (REP)	The convention, dating to 1994, by which sites use robots.txt to communicate crawl preferences to bots. It was formalised as RFC 9309 in 2022.
RFC 9309	The 2022 IETF standard that formally specifies robots.txt parsing and matching. Google, Bing, and other major crawlers implement it.
Crawl budget	The number of URLs a search engine will crawl on your site in a given period. Disallowing low-value paths can steer that budget toward pages you want indexed.
noindex	A meta tag or X-Robots-Tag header that tells search engines not to index a page. Unlike robots.txt, it removes a page from results — but the crawler must be allowed to fetch the page to see it.
X-Robots-Tag	An HTTP response header that applies indexing directives (like noindex or nofollow) to any file type, including PDFs and images, where a meta tag isn't possible.
Indexing vs crawling	Crawling is fetching a page; indexing is storing it for search results. robots.txt governs crawling. A page blocked by robots.txt can still be indexed (without a snippet) if other sites link to it.
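
Because noindex lives in the page or its response headers rather than in robots.txt, checking for it means inspecting a fetched response. The sketch below is one way to do that with Python's standard library; the function and class names are hypothetical, and it deliberately ignores crawler-specific forms such as a user-agent prefix in X-Robots-Tag or a meta tag named after a single bot.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives.update(split_directives(attr.get("content", "")))

def split_directives(value):
    """Split a comma-separated directive list into lowercase tokens."""
    return {token.strip().lower() for token in value.split(",") if token.strip()}

def is_noindex(headers, html=""):
    """True if the X-Robots-Tag header or a robots meta tag says noindex."""
    directives = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            directives |= split_directives(value)
    meta = RobotsMetaParser()
    meta.feed(html)
    return "noindex" in (directives | meta.directives)
```

A crawler can only run a check like this on a URL it was allowed to fetch, which is why a robots.txt block on the same URL prevents the noindex from being seen.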

Crawler types

Term	Definition
Crawler (spider/bot)	An automated program that fetches web pages. Each identifies itself with a User-agent string, which robots.txt rules target. Because that string can be spoofed, verify major bots by IP address, using a reverse DNS lookup or the operator's published IP ranges.
AI crawler	A bot that fetches pages to train an AI model (GPTBot, ClaudeBot, CCBot) or to answer questions in an AI product (PerplexityBot, OAI-SearchBot). Separate from search-engine crawlers.
Search-engine crawler	A bot that builds a search index, such as Googlebot or Bingbot. These are the only crawlers whose access affects your search rankings.
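
Since User-agent strings can be spoofed, a common check is a reverse DNS lookup followed by a forward lookup that must return the original IP. The sketch below assumes the hostname suffixes shown (confirm them against each operator's current documentation) and uses hypothetical names throughout; the lookup functions are injectable so the logic can be exercised without network access.

```python
import socket

# Assumed hostname suffixes; check each operator's documentation before relying on them.
TRUSTED_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def verify_crawler(ip, bot, reverse_lookup=None, forward_lookup=None):
    """Return True if ip reverse-resolves to a trusted host that resolves back to ip."""
    # Defaults use real DNS; gethostbyname_ex covers IPv4 addresses only.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])

    suffixes = TRUSTED_SUFFIXES.get(bot.lower())
    if not suffixes:
        return False  # no verification data for this bot
    try:
        host = reverse_lookup(ip)
        if not host.lower().endswith(suffixes):
            return False
        # The forward lookup must include the original IP, or the PTR record is untrusted.
        return ip in forward_lookup(host)
    except OSError:
        return False  # DNS failures are treated as "not verified"
```

For example, verify_crawler("66.249.66.1", "Googlebot") would perform live DNS lookups; in tests, pass stub functions for reverse_lookup and forward_lookup instead.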

Frequently asked questions

What is the difference between Disallow and noindex?

Disallow (in robots.txt) stops a crawler from fetching a URL. noindex (a meta tag or HTTP header) tells a search engine not to index a page it has fetched. To deindex a page you must allow crawling and use noindex — a robots.txt block alone can leave a URL indexed without a snippet.

What does the * mean in robots.txt?

It depends on where it appears. As a User-agent value (User-agent: *) it means 'all crawlers'. Inside a path (Disallow: /*.pdf) it is a wildcard matching any sequence of characters.

Which rule wins when Allow and Disallow conflict?

The more specific rule — the one with the longer matching path — wins. If both match the same number of characters, Allow takes precedence over Disallow. This is how exceptions to a broad Disallow work.

Robots.txt Studio Editorial · Technical SEO & crawling

We build robots.txt tooling and parse thousands of real-world files. Guides are written by practitioners and reviewed against the Google and RFC 9309 specifications.