robots.txt Glossary

Every term you'll meet in a robots.txt file, defined in plain English. New to the format? Start with "What Is robots.txt?", then paste a real file into the Explainer to see these terms in context.

Robots.txt Studio Editorial · Updated June 8, 2026 · Reviewed against Google Search Central and RFC 9309

How to use this glossary

Terms are grouped into directives (the lines you write), matching and precedence (how crawlers decide what a rule covers), core concepts, and crawler types. Jump to a group from the table of contents, or read straight through.

Directives

| Term | Definition |
| --- | --- |
| User-agent | The robots.txt line that names which crawler a group of rules applies to. User-agent: * matches every crawler not named more specifically; User-agent: Googlebot targets just Googlebot. |
| Disallow | Tells the named crawler not to fetch URLs matching a path. Disallow: / blocks the whole site; Disallow: /admin blocks anything starting with /admin (including /administrator); an empty Disallow: allows everything. |
| Allow | Explicitly permits a path, used to carve exceptions out of a broader Disallow. For example, Disallow: /wp-admin/ combined with Allow: /wp-admin/admin-ajax.php blocks the admin area but keeps that one file crawlable. |
| Sitemap | Declares the absolute URL of your XML sitemap so crawlers can find it. It is independent of any User-agent group and can appear anywhere in the file. |
| Crawl-delay | Requests a minimum number of seconds between requests from a crawler. Bing and several others honor it; Googlebot ignores it and sets its own crawl rate (the Search Console crawl rate limiter was retired in early 2024). |
| Host | A non-standard directive once used by Yandex to indicate a preferred domain. It is not part of the official standard and is ignored by Google and Bing. |
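
To see how the directives above fit together in one file, here is a minimal Python sketch that splits a robots.txt into groups and sitemap declarations. The sample file, the example.com URL, and the names Group and parse are illustrative assumptions, not part of any standard library or of the Explainer tool.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sample file; example.com and the paths are placeholders.
SAMPLE = """\
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Crawl-delay: 10

User-agent: Googlebot
Disallow:

Sitemap: https://example.com/sitemap.xml
"""


@dataclass
class Group:
    user_agents: list = field(default_factory=list)
    rules: list = field(default_factory=list)  # (directive, path) pairs
    crawl_delay: Optional[float] = None


def parse(text):
    """Return (groups, sitemaps) for a robots.txt body. A simplified sketch."""
    groups, sitemaps = [], []
    current = None
    in_agent_run = False  # True while reading consecutive User-agent lines
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if not in_agent_run:  # a new run of User-agent lines opens a group
                current = Group()
                groups.append(current)
            current.user_agents.append(value.lower())
            in_agent_run = True
            continue
        if key == "sitemap":  # independent of any group
            sitemaps.append(value)
            continue
        in_agent_run = False
        if current is None:  # rule lines before any User-agent are skipped
            continue
        if key in ("allow", "disallow"):
            current.rules.append((key, value))
        elif key == "crawl-delay":
            try:
                current.crawl_delay = float(value)
            except ValueError:
                pass  # ignore malformed values
    return groups, sitemaps


groups, sitemaps = parse(SAMPLE)
for group in groups:
    print(group.user_agents, group.rules, group.crawl_delay)
print(sitemaps)
```

The empty Disallow: in the Googlebot group is stored as a rule with an empty path, which matches nothing and therefore allows everything for that crawler.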

Matching & precedence

| Term | Definition |
| --- | --- |
| Group | A block of one or more User-agent lines followed by the Allow/Disallow rules that apply to them. A crawler obeys exactly one group: the most specific one matching its name. |
| Wildcard (*) | Inside a path, * matches any sequence of characters. Disallow: /*.pdf blocks every URL whose path contains .pdf after any prefix, including /file.pdf?page=2; add the end anchor (Disallow: /*.pdf$) to block only URLs that end in .pdf. (As a User-agent value, * means 'all crawlers'.) |
| End anchor ($) | The $ character anchors a path match to the end of the URL. Disallow: /*.php$ blocks URLs ending in .php but not /file.php?id=1. |
| Longest-match precedence | When both an Allow and a Disallow rule match a URL, the more specific (longer path) rule wins. If they are the same length, Allow wins. This is how exceptions work. |
| User-agent specificity | When a group names a crawler specifically, the crawler uses that group instead of the * group, even if the * group has more rules. Only one group ever applies. |
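
The matching rules in this table can be sketched in a few lines of Python. This is a simplified illustration, not a full RFC 9309 implementation: the function names are hypothetical, rule length is measured as the length of the pattern text, and crawler names are matched by case-insensitive substring, all of which are assumptions of this sketch.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(rules, path):
    """rules: list of (directive, pattern). True if the path may be fetched."""
    best_len, allowed = -1, True  # no matching rule means allowed
    for directive, pattern in rules:
        if not pattern:  # an empty pattern matches nothing
            continue
        if pattern_to_regex(pattern).match(path):  # match() anchors at the start
            is_allow = directive == "allow"
            length = len(pattern)
            # Longest pattern wins; on a tie, Allow beats Disallow.
            if length > best_len or (length == best_len and is_allow):
                best_len, allowed = length, is_allow
    return allowed


def select_group(groups, crawler_name):
    """groups: dict of user-agent token -> rules. Pick the most specific group."""
    name = crawler_name.lower()
    named = [token for token in groups if token != "*" and token.lower() in name]
    if named:
        return groups[max(named, key=len)]
    return groups.get("*", [])


wp_rules = [("disallow", "/wp-admin/"), ("allow", "/wp-admin/admin-ajax.php")]
print(is_allowed(wp_rules, "/wp-admin/admin-ajax.php"))  # True: longer Allow wins
print(is_allowed(wp_rules, "/wp-admin/options.php"))     # False

php_rules = [("disallow", "/*.php$")]
print(is_allowed(php_rules, "/file.php"))       # False: URL ends in .php
print(is_allowed(php_rules, "/file.php?id=1"))  # True: the anchor no longer matches
```

Without the $ anchor, a pattern such as /*.pdf also matches /file.pdf?page=2, because the rule only has to match from the start of the path, not to its end.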

Core concepts

| Term | Definition |
| --- | --- |
| robots.txt | A plain-text file at the root of a domain (/robots.txt) that tells crawlers which parts of the site they may request. It controls crawling, not indexing, and is a voluntary standard. |
| Robots Exclusion Protocol (REP) | The convention, dating to 1994, by which sites use robots.txt to communicate crawl preferences to bots. It was formalised as RFC 9309 in 2022. |
| RFC 9309 | The 2022 IETF standard that formally specifies robots.txt parsing and matching. Google, Bing, and other major crawlers implement it. |
| Crawl budget | The number of URLs a search engine will crawl on your site in a given period. Disallowing low-value paths can steer that budget toward pages you want indexed. |
| noindex | A meta tag or X-Robots-Tag header that tells search engines not to index a page. Unlike robots.txt, it removes a page from results, but the crawler must be allowed to fetch the page to see it. |
| X-Robots-Tag | An HTTP response header that applies indexing directives (like noindex or nofollow) to any file type, including PDFs and images, where a meta tag isn't possible. |
| Indexing vs crawling | Crawling is fetching a page; indexing is storing it for search results. robots.txt governs crawling. A page blocked by robots.txt can still be indexed (without a snippet) if other sites link to it. |
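
The interaction between crawl rules and noindex described in this table can be written out as a small decision sketch in Python. The function names and the outcome strings are this example's own wording, and real search engines weigh many more signals, so treat it as a summary of the table rather than a model of any engine.

```python
def noindex_delivery(content_type):
    """Pick a way to deliver noindex for a given Content-Type value.

    HTML pages can carry a meta tag (the header would work too); other
    file types such as PDFs and images need the X-Robots-Tag header.
    """
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "text/html":
        return '<meta name="robots" content="noindex">'
    return "X-Robots-Tag: noindex"


def index_outcome(crawl_allowed, has_noindex):
    """Summarise what the glossary says happens to a URL."""
    if not crawl_allowed:
        # The crawler never fetches the page, so it never sees a noindex.
        return "may be indexed without a snippet if other sites link to it"
    if has_noindex:
        return "kept out of search results"
    return "eligible for indexing"


print(noindex_delivery("application/pdf"))
print(index_outcome(crawl_allowed=False, has_noindex=True))
print(index_outcome(crawl_allowed=True, has_noindex=True))
```

The second call shows the common mistake: a noindex on a page that robots.txt blocks has no effect, because the directive is never read.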

Crawler types

| Term | Definition |
| --- | --- |
| Crawler (spider/bot) | An automated program that fetches web pages. Each identifies itself with a User-agent string, which robots.txt rules target. Strings can be spoofed, so verify major bots by IP address. |
| AI crawler | A bot that fetches pages to train an AI model (GPTBot, ClaudeBot, CCBot) or to answer questions in an AI product (PerplexityBot, OAI-SearchBot). These are separate from search-engine crawlers. |
| Search-engine crawler | A bot that builds a search index, such as Googlebot or Bingbot. These are the only crawlers whose access affects your search rankings. |
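
As an illustration of how these categories map onto robots.txt groups, the Python sketch below classifies a User-agent string by the product tokens named in this table and builds a group that blocks one category. The category labels and function names are assumptions of this example, the token lists cover only the bots mentioned here, and string matching does not prove identity, since User-agent strings can be spoofed.

```python
# Product tokens taken from the glossary entries; labels are illustrative.
CATEGORIES = {
    "ai-training": ["GPTBot", "ClaudeBot", "CCBot"],
    "ai-answers": ["PerplexityBot", "OAI-SearchBot"],
    "search": ["Googlebot", "Bingbot"],
}


def classify(user_agent):
    """Return the category whose token appears in the User-agent string."""
    ua = user_agent.lower()
    for category, tokens in CATEGORIES.items():
        if any(token.lower() in ua for token in tokens):
            return category
    return "unknown"


def blocking_group(category):
    """Build one robots.txt group that disallows every crawler in a category."""
    lines = [f"User-agent: {token}" for token in CATEGORIES[category]]
    lines.append("Disallow: /")
    return "\n".join(lines)


print(classify("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(blocking_group("ai-training"))
```

Listing several User-agent lines before one Disallow: / keeps the block in a single group, so each named crawler obeys the same rule.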

Frequently asked questions

What is the difference between Disallow and noindex?

Disallow (in robots.txt) stops a crawler from fetching a URL. noindex (a meta tag or HTTP header) tells a search engine not to index a page it has fetched. To deindex a page you must allow crawling and use noindex — a robots.txt block alone can leave a URL indexed without a snippet.

What does the * mean in robots.txt?

It depends on where it appears. As a User-agent value (User-agent: *) it means 'all crawlers'. Inside a path (Disallow: /*.pdf) it is a wildcard matching any sequence of characters.

Which rule wins when Allow and Disallow conflict?

The more specific rule — the one with the longer matching path — wins. If both match the same number of characters, Allow takes precedence over Disallow. This is how exceptions to a broad Disallow work.


Robots.txt Studio Editorial · Technical SEO & crawling

We build robots.txt tooling and parse thousands of real-world files. Guides are written by practitioners and reviewed against the Google and RFC 9309 specifications.