How Does robots.txt Work?

When a crawler visits your site, it fetches /robots.txt first, picks the group of rules that applies to its user-agent, then decides per URL whether it is allowed. Understanding that sequence is what separates a file that works from one that silently blocks the wrong things.

Robots.txt Studio Editorial · Updated June 8, 2026 · Reviewed against Google Search Central and RFC 9309

The crawl lifecycle

  1. Fetch: the crawler requests /robots.txt. A 2xx text response is parsed; a 404 means “crawl everything”; repeated 5xx errors usually mean “crawl nothing for now.”
  2. Cache: crawlers cache the file (Google up to ~24 hours), so changes aren't instant.
  3. Group selection: the crawler finds the most specific User-agent group that matches its name.
  4. Matching: for each URL, it applies that group's Allow/Disallow rules using a precedence algorithm.
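
The fetch step above can be sketched as a small Python function. This is only an illustration of the outcomes listed in step 1, not any crawler's actual code: the names Policy and policy_for_status are hypothetical, and treating every 4xx response like a 404 is a simplifying assumption that individual crawlers may not share.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical labels for what a crawler does after fetching /robots.txt."""
    PARSE_RULES = "parse the returned rules"
    ALLOW_ALL = "crawl everything"
    DISALLOW_ALL = "crawl nothing for now"


def policy_for_status(status: int) -> Policy:
    """Map the HTTP status of a /robots.txt fetch to a crawl policy."""
    if 200 <= status < 300:
        return Policy.PARSE_RULES
    if 400 <= status < 500:
        # Assumption: all 4xx codes behave like the 404 case in step 1.
        return Policy.ALLOW_ALL
    if 500 <= status < 600:
        # Server errors: back off and treat the site as disallowed for now.
        return Policy.DISALLOW_ALL
    # Redirects and other codes are outside the scope of this sketch.
    raise ValueError(f"status {status} is not handled by this sketch")


if __name__ == "__main__":
    for code in (200, 404, 503):
        print(code, "->", policy_for_status(code).value)
```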

How a crawler chooses its group

A crawler follows the rules for exactly one User-agent token: the one that is the longest case-insensitive match for its name. Groups that repeat that same token are treated as a single group (RFC 9309). If no named group matches, the crawler falls back to the wildcard group (User-agent: *). If there's no wildcard group either, everything is allowed.

Googlebot uses only the Googlebot group
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /no-google/

A common surprise

Because Googlebot matches its own group, it ignores the wildcard group entirely — so /private/ is NOT blocked for Googlebot here. Rules are not merged across groups.
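
Group selection can be sketched in a few lines of Python. This is a minimal illustration, not a real parser: select_group and the dictionary layout are hypothetical, and "match" is simplified to mean that the crawler's name starts with the group's token, ignoring case.

```python
from __future__ import annotations


def select_group(groups: dict[str, list[str]], crawler_name: str) -> list[str]:
    """Return the rule lines of the one group that applies to crawler_name.

    groups maps each User-agent token to that group's rule lines.
    """
    name = crawler_name.lower()
    best_token = None
    for token in groups:
        if token == "*":
            continue  # the wildcard group is only a fallback
        # Assumption: a token matches when the crawler name starts with it.
        if name.startswith(token.lower()):
            if best_token is None or len(token) > len(best_token):
                best_token = token
    if best_token is not None:
        return groups[best_token]
    # No named group matched: use the wildcard group, or no rules at all.
    return groups.get("*", [])


if __name__ == "__main__":
    groups = {
        "*": ["Disallow: /private/"],
        "Googlebot": ["Disallow: /no-google/"],
    }
    print(select_group(groups, "Googlebot"))
    print(select_group(groups, "Bingbot"))
```

With the example file above, Googlebot receives only the /no-google/ rule, while a crawler without its own group falls back to the wildcard rules.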

Allow vs Disallow precedence

Within the selected group, the rule with the longest matching path wins. If an Allow and a Disallow match with the same length, the least restrictive rule (Allow) wins.

/admin/public/page is allowed (longer match)
User-agent: *
Disallow: /admin/
Allow: /admin/public/

You don't have to reason about this by hand. Paste your file into the URL Tester and it shows the matched rule and a step-by-step trace for any URL and crawler.
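
For readers who want to see the precedence rule spelled out, here is a minimal Python sketch. It assumes the group has already been selected and handles plain prefix paths only (no * or $); the function name is_allowed and the (directive, path) tuple layout are hypothetical.

```python
from __future__ import annotations


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Decide whether path may be crawled under one group's rules.

    rules holds (directive, path_prefix) pairs, where directive is
    "Allow" or "Disallow". Wildcards are not handled in this sketch.
    """
    best_len = -1
    allowed = True  # no matching rule means the URL is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty path or a non-matching prefix does not apply
        is_allow = directive.lower() == "allow"
        # The longest match wins; on a tie, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


if __name__ == "__main__":
    rules = [("Disallow", "/admin/"), ("Allow", "/admin/public/")]
    for url_path in ("/admin/public/page", "/admin/settings", "/blog"):
        print(url_path, is_allowed(rules, url_path))
```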

Wildcards and the $ anchor

Paths are prefix matches by default. Two special characters extend them: * matches any sequence of characters (including none), and $ anchors the match to the end of the URL.

Pattern    Matches                           Doesn't match
/admin     /admin, /admin/x, /administrator  /x/admin
/*.pdf     /files/report.pdf                 /files/report.txt
/page$     /page                             /page/sub
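
One way to check patterns like those in the table is to translate each one into a regular expression. The sketch below is an illustration under stated assumptions: pattern_to_regex and path_matches are hypothetical helpers, and percent-encoding normalization is ignored.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape every literal piece, then join the pieces with ".*" for each *.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    # A trailing $ pins the match to the end of the path; otherwise the
    # pattern is a prefix match and anything may follow.
    return re.compile(regex + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """True if the pattern matches the path, starting at its first character."""
    return pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    checks = [
        ("/admin", "/administrator"),
        ("/admin", "/x/admin"),
        ("/*.pdf", "/files/report.pdf"),
        ("/page$", "/page/sub"),
    ]
    for pattern, path in checks:
        print(pattern, path, path_matches(pattern, path))
```

Because unanchored patterns stay prefix matches, /*.pdf in this sketch also matches a path such as /files/report.pdf?download=1; append $ to the pattern if that is not intended.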

The full directive reference lives on the syntax page.

Where it goes wrong

  • Expecting instant changes

    Crawlers cache robots.txt, so an edit can take hours (up to about a day for Google) to take effect.

  • Assuming rules merge across groups

    A crawler that matches a named group ignores the * group, so copy any shared rules into each named group you care about.

  • Trusting case for paths

    User-agent matching is case-insensitive, but URL paths are case-sensitive: /Admin and /admin are different.

Frequently asked questions

Can robots.txt block Google?

It can stop Googlebot from crawling URLs, yes. But a blocked URL can still appear in search results (without a description) if other sites link to it. To remove a page from Google, add a noindex tag and leave the URL crawlable, because Google has to fetch the page at least once to see the tag.

Do crawlers have to obey robots.txt?

It's voluntary. Reputable crawlers (Google, Bing, OpenAI, Anthropic) honor it. Malicious scrapers can and do ignore it, which is why robots.txt is not a security measure.

How often do crawlers re-read robots.txt?

Typically every several hours to a day. Google caches it for up to 24 hours, so don't expect immediate effect after an edit.
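
To make the delay concrete, here is a minimal Python sketch of a time-based cache. It is not how any particular crawler is implemented: RobotsCache, its fetch callback, and the fixed 24-hour window are assumptions chosen to mirror the figure quoted above.

```python
from __future__ import annotations

import time
from typing import Callable

MAX_AGE_SECONDS = 24 * 60 * 60  # assumed refresh window of 24 hours


class RobotsCache:
    """Keep one robots.txt body per host and refetch only when it is stale."""

    def __init__(self, fetch: Callable[[str], str],
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch  # hypothetical callback: host -> robots.txt body
        self._clock = clock  # injectable so the sketch can be tested
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, host: str) -> str:
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and now - entry[0] < MAX_AGE_SECONDS:
            return entry[1]  # still fresh: edits on the site are not seen yet
        body = self._fetch(host)
        self._entries[host] = (now, body)
        return body
```

Until the cached copy ages out, the crawler keeps applying the old rules, which is why an edit does not take effect immediately.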

Does robots.txt remove pages from Google?

No. Disallowing a page prevents crawling, not indexing. Removing a page requires noindex, a removal request, or returning a 404/410.


Robots.txt Studio Editorial · Technical SEO & crawling

We build robots.txt tooling and parse thousands of real-world files. Guides are written by practitioners and reviewed against the Google and RFC 9309 specifications.