Our Methodology
Every verdict Robots.txt Studio produces is deterministic and traceable to a rule. This page documents the standards we follow, how each part of the analysis works, how we classify crawlers, and the limits of what robots.txt can do.
Last updated June 10, 2026
Standards we follow
Our analysis implements the Robots Exclusion Protocol as formalized in RFC 9309 (2022), aligned with Google Search Central's documented behavior. We wrote our own parser and matching engine rather than depending on a third-party library, so the logic is transparent and we control conformance. No large language model is involved in the core parsing, validation, testing, or scoring — identical input always yields identical output.
Deterministic by design
How we parse robots.txt
We parse the raw file into a canonical syntax tree that preserves fidelity: original line order, duplicate directives, and unknown directives are all retained rather than silently dropped. The parser is tolerant — malformed lines are recovered and surfaced as issues instead of crashing — so even a broken file produces a useful, line-referenced analysis.
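- Recognized directives: User-agent, Allow, Disallow, Sitemap, Crawl-delay, and Host. Comments are recognized as well. Crawl-delay and Host are non-standard extensions that RFC 9309 does not define.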
- Groups are reconstructed from consecutive User-agent lines and the rules that follow them.
- Every directive keeps its source line number, so issues and explanations link back to exact lines.
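As a rough illustration of the parsing behavior listed above, here is a minimal Python sketch. It is not our production parser: the names `Line`, `Group`, `parse_lines`, and `build_groups` are hypothetical, and the handling of inline comments is a simplifying assumption. It shows the three properties that matter: nothing is dropped, malformed lines become data instead of exceptions, and every line keeps its source number.

```python
from dataclasses import dataclass, field

# Directive names taken from the list above; anything else is kept as "unknown".
KNOWN = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay", "host"}

@dataclass
class Line:
    number: int      # 1-based source line number
    kind: str        # "directive", "unknown", "malformed", "comment", or "blank"
    name: str = ""   # lowercased directive name, if any
    value: str = ""

@dataclass
class Group:
    agents: list = field(default_factory=list)  # User-agent values
    rules: list = field(default_factory=list)   # Line objects that follow them

def parse_lines(text):
    """Classify every line; never raise on bad input."""
    lines = []
    for number, raw in enumerate(text.splitlines(), start=1):
        body = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not body:
            lines.append(Line(number, "comment" if "#" in raw else "blank"))
        elif ":" not in body:
            lines.append(Line(number, "malformed", value=body))
        else:
            name, value = body.split(":", 1)
            name = name.strip().lower()
            kind = "directive" if name in KNOWN else "unknown"
            lines.append(Line(number, kind, name, value.strip()))
    return lines

def build_groups(lines):
    """Rebuild groups from consecutive User-agent lines and the rules after them."""
    groups, current, collecting_agents = [], None, False
    for line in lines:
        if line.kind != "directive":
            continue
        if line.name == "user-agent":
            if not collecting_agents:  # a new run of User-agent lines opens a group
                current = Group()
                groups.append(current)
                collecting_agents = True
            current.agents.append(line.value)
        elif line.name in ("allow", "disallow", "crawl-delay"):
            collecting_agents = False
            if current is not None:
                current.rules.append(line)
    return groups
```

Because unknown and malformed lines stay in the returned list with their line numbers, a later validation pass can report them without re-reading the file.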
How URL matching works
To decide whether a URL is allowed or blocked for a crawler, we follow the RFC 9309 matching algorithm exactly:
- Group selection — the crawler obeys the single group whose User-agent most specifically names it; if none matches, it falls back to the User-agent: * group.
- Path evaluation — among the matching group's rules, the longest matching path wins (most specific rule).
- Tie-breaking — when an Allow and a Disallow match the same length, Allow takes precedence.
- Wildcards — * matches any sequence within a path, and $ anchors a match to the end of the URL.
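The four steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than our engine: `pattern_to_regex`, `select_group`, and `is_allowed` are hypothetical names, the `groups` dictionary shape is invented for the example, and treating "most specific" as the longest case-insensitive prefix of the crawler's token is an assumption.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern: '*' is any run of characters,
    a trailing '$' anchors the match to the end of the URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def select_group(groups, crawler_token):
    """groups maps a User-agent value to a list of (kind, pattern) rules.
    Assumption: the most specific group is the longest agent name that
    prefixes the crawler's token; otherwise fall back to '*'."""
    token = crawler_token.lower()
    named = [a for a in groups if a and a != "*" and token.startswith(a.lower())]
    if named:
        return groups[max(named, key=len)]
    return groups.get("*", [])

def is_allowed(groups, crawler_token, path):
    best_len, allowed = -1, True  # no matching rule means the URL is allowed
    for kind, pattern in select_group(groups, crawler_token):
        if not pattern:
            continue  # an empty rule value matches nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = kind == "allow"
            # Longest pattern wins; on equal length, Allow beats Disallow.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed
```

For example, with a `*` group containing `Disallow: /private/` and `Allow: /private/public/`, the path `/private/public/a` is allowed because the Allow pattern is longer, while `/private/x` is blocked.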
Always explained
How validation and health scoring work
The Validator runs a rule set across three dimensions: syntax correctness, AI crawler configuration, and best practices. Each finding carries a severity (error, warning, or recommendation), a line reference, and a plain-English description. The overall health score breaks down into one sub-score per dimension, and every score is read against a single documented banding (80+ good, 60–79 moderate, below 60 poor) that is shared with the Analyzer's visibility score.
| Dimension | Examples of what we check |
|---|---|
| Syntax | Missing colons, invalid or unsupported directives, empty values, duplicates. |
| AI crawler config | Whether major AI crawlers are addressed; conflicting allow/disallow. |
| Best practices | Sitemap declared over HTTPS, crawl-delay sanity, conflicting rules. |
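The shape of a finding and the shared banding can be sketched as follows. The `Finding` class, the `band` and `summarize` helpers, and the dimension keys are hypothetical names chosen for this example; only the severities and the band thresholds come from the description above. How findings turn into a numeric sub-score is deliberately left out.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    severity: str   # "error", "warning", or "recommendation"
    line: int       # source line the finding refers to
    dimension: str  # hypothetical keys: "syntax", "ai_crawlers", "best_practices"
    message: str    # plain-English description

def band(score):
    """Map a 0-100 score onto the documented bands."""
    if score >= 80:
        return "good"
    if score >= 60:
        return "moderate"
    return "poor"

def summarize(sub_scores):
    """Apply the same banding to each dimension's sub-score."""
    return {dimension: band(score) for dimension, score in sub_scores.items()}
```

Calling `summarize({"syntax": 95, "ai_crawlers": 70, "best_practices": 40})` labels the three dimensions good, moderate, and poor respectively.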
How live site analysis works
When you analyze a domain, our server fetches that domain's public robots.txt through a controlled route handler — not your browser — so we can manage redirects, timeouts, response size, and content type safely. We fetch only the robots.txt file at the host you enter.
- An 8-second timeout and a 512 KiB size cap protect against slow or oversized responses.
- A missing robots.txt (HTTP 404) is treated as 'all crawlers allowed' and reported as such — not as an error.
- Redirects, timeouts, network failures, and non-text content types each produce a distinct, explained result.
- The visibility score combines crawler access, restrictions, and syntax health into one explained number.
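One way to picture the distinct outcomes listed above is a small classifier over the facts a fetch produces. This is a hedged sketch, not our route handler: `classify_fetch` and the outcome labels are hypothetical, and treating every other non-200 status as a generic `http_error` is an assumption. Only the 8-second timeout, the 512 KiB cap, and the 404 rule come from the list above.

```python
TIMEOUT_SECONDS = 8      # passed to whatever HTTP client performs the fetch
MAX_BYTES = 512 * 1024   # 512 KiB response size cap

def classify_fetch(status, content_type, body_size, timed_out=False):
    """Return one label describing what happened to a robots.txt fetch."""
    if timed_out:
        return "timeout"
    if 300 <= status < 400:
        return "redirect"
    if status == 404:
        return "missing_all_allowed"  # no file: all crawlers allowed, not an error
    if status != 200:
        return "http_error"           # assumption: other statuses share one label
    if body_size > MAX_BYTES:
        return "too_large"
    if not (content_type or "").lower().startswith("text/"):
        return "non_text"
    return "ok"
```

Keeping the classification separate from the network call means each outcome can be explained to the user with its own message.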
How we classify crawlers
Our crawler database is a transparent config layer. Each crawler entry records its identity (name, operator, User-agent token), its category, a plain-English description of what it does, and a default recommendation. Classifications and compliance notes are compiled from each operator's published documentation and updated as that documentation changes.
| Category | What it means |
|---|---|
| Search engine | Builds a search index that can send you traffic (Googlebot, Bingbot). |
| AI training | Collects content to train AI models (GPTBot, ClaudeBot, CCBot). |
| AI search | Fetches pages to answer or cite in AI products (PerplexityBot, Google-Extended). |
| SEO | Powers third-party SEO tools (AhrefsBot, SemrushBot). |
| Social | Builds link previews for social platforms (facebookexternalhit, Twitterbot). |
| Data | Other data-collection crawlers (Amazonbot, Diffbot). |
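A crawler entry of the kind described above might look like the following Python sketch. The `CrawlerEntry` class, the category keys, and the example recommendation value are hypothetical; the fields and the six categories mirror the description and table above.

```python
from dataclasses import dataclass

CATEGORIES = {"search_engine", "ai_training", "ai_search", "seo", "social", "data"}
RECOMMENDATIONS = {"allow", "block", "depends"}

@dataclass(frozen=True)
class CrawlerEntry:
    name: str
    operator: str
    token: str           # User-agent token matched in robots.txt
    category: str
    description: str
    recommendation: str  # default recommendation, informational only

    def __post_init__(self):
        # Reject typos in the config early instead of misclassifying a crawler.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if self.recommendation not in RECOMMENDATIONS:
            raise ValueError(f"unknown recommendation: {self.recommendation}")

# Example entry; the recommendation shown here is a placeholder, not our verdict.
gptbot = CrawlerEntry(
    name="GPTBot",
    operator="OpenAI",
    token="GPTBot",
    category="ai_training",
    description="Collects content to train AI models.",
    recommendation="depends",
)
```

Validating entries at construction time keeps the database a plain, reviewable config layer.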
A default recommendation (allow, block, or depends) reflects the typical site-owner choice for that crawler; it is informational, not advice tailored to your situation. Because User-agent strings can be spoofed, each major crawler's directory page documents how to verify it by reverse DNS or published IP ranges.
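Reverse DNS verification is usually done as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and confirm it returns the same IP. The sketch below is a minimal illustration using Python's standard `socket` module. `verify_crawler_ip` is a hypothetical helper, the allowed suffixes must come from the operator's own documentation, and `socket.gethostbyname_ex` handles IPv4 only.

```python
import socket

def verify_crawler_ip(ip, allowed_suffixes,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a claimed crawler IP.
    The resolver functions are injectable so the logic can be tested offline."""
    try:
        hostname = reverse(ip)[0]  # (hostname, aliases, addresses)
    except OSError:
        return False
    # The hostname must sit under one of the operator's documented domains.
    if not hostname.lower().rstrip(".").endswith(tuple(allowed_suffixes)):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # The forward lookup must point back at the IP we started with.
    return ip in addresses
```

A call such as `verify_crawler_ip(request_ip, [".googlebot.com", ".google.com"])` would perform live DNS lookups, so results should be cached in practice.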
Limitations and honesty
robots.txt is a request, not a firewall
- Compliance varies by operator; we note known exceptions (for example, aggressive crawlers with a patchy record).
- robots.txt controls crawling, not indexing — a blocked URL can still be indexed if linked elsewhere.
- We analyze the file you provide or the public file we fetch; we cannot see server-level blocks or firewall rules.
Review and corrections
We build robots.txt tooling and parse thousands of real-world files. Guides are written by practitioners and reviewed against Google Search Central's documentation and RFC 9309.
If a crawler's behavior or documentation has changed, or you spot an error, please contact us — we keep classifications current.