How Does robots.txt Work?
When a crawler visits your site, it fetches /robots.txt first, picks the group of rules that applies to its user-agent, then decides per URL whether it is allowed. Understanding that sequence is what separates a file that works from one that silently blocks the wrong things.
The crawl lifecycle
- Fetch: the crawler requests /robots.txt. A 2xx text response is parsed; a 404 (like most 4xx responses) means “crawl everything”; repeated 5xx errors usually mean “crawl nothing for now.”
- Cache: crawlers cache the file (Google up to ~24 hours), so changes aren't instant.
- Group selection: the crawler finds the most specific User-agent group that matches its name.
- Matching: for each URL, it applies that group's Allow/Disallow rules using a precedence algorithm.
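The fetch step of the lifecycle above can be sketched as a small status-to-policy mapping. This is an illustration only: the names `FetchOutcome` and `outcome_for_status` are hypothetical, and real crawlers add retries, redirect handling, and caching that are left out here.

```python
from enum import Enum


class FetchOutcome(Enum):
    """Hypothetical labels for what a crawler does after fetching /robots.txt."""
    PARSE_RULES = "parse the file and apply its rules"
    ALLOW_ALL = "treat the site as having no restrictions"
    DISALLOW_ALL = "pause crawling until the file is reachable"


def outcome_for_status(status: int) -> FetchOutcome:
    """Map the HTTP status of a /robots.txt fetch to a crawl policy."""
    if 200 <= status < 300:
        return FetchOutcome.PARSE_RULES
    if 400 <= status < 500:
        # The text names 404; other 4xx codes are handled the same way here,
        # following the "unavailable" case in RFC 9309.
        return FetchOutcome.ALLOW_ALL
    if 500 <= status < 600:
        # Server errors: assume everything is disallowed for now.
        return FetchOutcome.DISALLOW_ALL
    # Redirects and other codes are out of scope for this sketch.
    raise ValueError(f"unhandled status: {status}")
```

Calling `outcome_for_status(404)` returns `FetchOutcome.ALLOW_ALL`, while `outcome_for_status(503)` returns `FetchOutcome.DISALLOW_ALL`.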
How a crawler chooses its group
A crawler uses exactly one group — the one whose User-agent token is the longest case-insensitive match for its name. If no named group matches, it falls back to the wildcard group (User-agent: *). If there's no wildcard group either, everything is allowed.
User-agent: *
Disallow: /private/
User-agent: Googlebot
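Disallow: /no-google/
A common surprise: with this file, Googlebot may still crawl /private/, because it follows only its own group and ignores the * group.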
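Group selection can be approximated in a few lines of Python. This is a minimal sketch under one assumption: a token "matches" when it is a case-insensitive prefix of the crawler's name, which is only one reading of "longest match". The function name `select_group` and the dictionary layout are hypothetical.

```python
def select_group(groups: dict[str, list[str]], crawler_name: str) -> list[str]:
    """Return the rules of the single group a crawler would follow.

    `groups` maps a User-agent token to that group's rule lines.
    """
    name = crawler_name.lower()
    best_token = None
    for token in groups:
        if token == "*":
            continue  # the wildcard is only a fallback, handled below
        lowered = token.lower()
        # Assumption: a token matches if the crawler name starts with it.
        if name.startswith(lowered) and (
            best_token is None or len(lowered) > len(best_token)
        ):
            best_token = token
    if best_token is not None:
        return groups[best_token]
    # No named group matched: use the wildcard group, or no rules at all.
    return groups.get("*", [])


# The two-group file from the example above.
EXAMPLE_GROUPS = {
    "*": ["Disallow: /private/"],
    "Googlebot": ["Disallow: /no-google/"],
}
```

With `EXAMPLE_GROUPS`, `select_group(EXAMPLE_GROUPS, "googlebot")` returns only the `/no-google/` rule, and a crawler with no named group, such as `"Bingbot"`, gets the wildcard rule.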
Allow vs Disallow precedence
Within the selected group, the rule with the longest matching path wins. If an Allow and a Disallow match with the same length, the least restrictive rule (Allow) wins.
User-agent: *
Disallow: /admin/
Allow: /admin/public/
You don't have to reason about this by hand. Paste your file into the URL Tester and it shows the matched rule and a step-by-step trace for any URL and crawler.
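For readers who want to see the precedence rule spelled out, here is a rough Python sketch. It assumes plain prefix matching (no wildcards or $ anchor) and a hypothetical `is_allowed` helper that takes rules as `(directive, path)` pairs; it is not taken from any particular crawler.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match precedence; on a tie, Allow wins."""
    best_len = -1
    allowed = True  # a URL matched by no rule is allowed
    for directive, prefix in rules:
        # An empty path (e.g. "Disallow:") matches nothing in this sketch.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_len
        tie_broken_by_allow = len(prefix) == best_len and is_allow
        if longer or tie_broken_by_allow:
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The group from the example above.
ADMIN_RULES = [("Disallow", "/admin/"), ("Allow", "/admin/public/")]
```

With `ADMIN_RULES`, `/admin/settings` is blocked because only the Disallow rule matches, while `/admin/public/logo.png` is allowed because the Allow path is the longer match.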
Wildcards and the $ anchor
Paths are prefix matches by default. Two special characters extend them: * matches any sequence of characters, and $ anchors the match to the end of the URL.
| Pattern | Matches | Doesn't match |
|---|---|---|
| /admin | /admin, /admin/x, /administrator | /x/admin |
| /*.pdf | /files/report.pdf | /files/report.txt |
| /page$ | /page | /page/sub |
The full directive reference lives on the syntax page.
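One way to experiment with these patterns is to translate them into regular expressions. The sketch below uses hypothetical helpers `pattern_to_regex` and `matches`; it treats a trailing `$` as an end anchor and every `*` as "any sequence of characters", and it skips details such as percent-encoding that a real parser would need.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern[str]:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape each literal piece, then join the pieces with ".*" for each "*".
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    if anchored:
        regex += r"\Z"  # "$" pins the match to the end of the URL
    return re.compile(regex)


def matches(pattern: str, path: str) -> bool:
    """True if `path` is matched by `pattern`, starting at the first character."""
    # re.match anchors at the start, which gives prefix matching by default.
    return pattern_to_regex(pattern).match(path) is not None
```

Checking the table rows: `matches("/admin", "/administrator")` and `matches("/*.pdf", "/files/report.pdf")` are true, while `matches("/page$", "/page/sub")` is false.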
Where it goes wrong
Expecting instant changes
Crawlers cache robots.txt. Edits can take hours to take effect.
Assuming rules merge across groups
A named crawler ignores the * group; duplicate shared rules into each group you care about.
Trusting case for paths
User-agent matching is case-insensitive, but URL paths are case-sensitive: /Admin and /admin are different.
Can robots.txt block Google?
It can stop Googlebot from crawling URLs, yes. But a blocked URL can still appear in search results (without a description) if other sites link to it. To remove a page from Google, add a noindex tag and leave the URL crawlable, because Google has to fetch the page at least once to see the tag.
Do crawlers have to obey robots.txt?
It's voluntary. Reputable crawlers (Google, Bing, OpenAI, Anthropic) honor it. Malicious scrapers can and do ignore it, which is why robots.txt is not a security measure.
How often do crawlers re-read robots.txt?
Typically every few hours to once a day. Google caches it for up to 24 hours, so don't expect an edit to take effect immediately.
Does robots.txt remove pages from Google?
No. Disallowing a page prevents crawling, not indexing. Removing a page requires noindex, a removal request, or returning a 404/410.
Robots.txt Studio Editorial · Technical SEO & crawling
We build robots.txt tooling and parse thousands of real-world files. Guides are written by practitioners and reviewed against the Google and RFC 9309 specifications.