Our Methodology

Every verdict Robots.txt Studio produces is deterministic and traceable to a rule. This page documents the standards we follow, how each part of the analysis works, how we classify crawlers, and the limits of what robots.txt can do.

Last updated June 10, 2026

Standards we follow

Our analysis implements the Robots Exclusion Protocol as formalized in RFC 9309 (2022), aligned with Google Search Central's documented behavior. We wrote our own parser and matching engine rather than depending on a third-party library, so the logic is transparent and we control conformance. No large language model is involved in the core parsing, validation, testing, or scoring — identical input always yields identical output.

Deterministic by design

Because results are computed from the parsed file (not generated by a model), you can reproduce and audit every outcome. Explanations are derived from the same data the verdict uses.

How we parse robots.txt

We parse the raw file into a canonical syntax tree that preserves fidelity: original line order, duplicate directives, and unknown directives are all retained rather than silently dropped. The parser is tolerant — malformed lines are recovered and surfaced as issues instead of crashing — so even a broken file produces a useful, line-referenced analysis.

Recognized line types: the User-agent, Allow, Disallow, Sitemap, Crawl-delay, and Host directives, plus comments.
Groups are reconstructed from consecutive User-agent lines and the rules that follow them.
Every directive keeps its source line number, so issues and explanations link back to exact lines.
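
The parsing behavior described above can be sketched roughly as follows. This is a minimal illustration, not our production parser: the names `Line`, `Group`, and `parse` are hypothetical, and the recovery rules shown (such as flagging a line with no colon) are simplified assumptions.

```python
from dataclasses import dataclass, field

KNOWN = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay", "host"}

@dataclass
class Line:
    number: int       # 1-based source line number
    kind: str         # "directive", "unknown", "comment", "blank", or "malformed"
    name: str = ""
    value: str = ""

@dataclass
class Group:
    agents: list = field(default_factory=list)  # consecutive User-agent lines
    rules: list = field(default_factory=list)   # rules that follow them

def parse(text):
    """Return (lines, groups, issues); every input line is retained in `lines`."""
    lines, groups, issues = [], [], []
    current, collecting_agents = None, False
    for number, raw in enumerate(text.splitlines(), start=1):
        body = raw.split("#", 1)[0].strip()  # drop any trailing comment
        if not body:
            lines.append(Line(number, "comment" if "#" in raw else "blank"))
            continue
        if ":" not in body:
            # Recover instead of crashing: keep the line and record an issue.
            lines.append(Line(number, "malformed", value=body))
            issues.append((number, "missing colon"))
            continue
        name, value = (part.strip() for part in body.split(":", 1))
        key = name.lower()
        line = Line(number, "directive" if key in KNOWN else "unknown", key, value)
        lines.append(line)  # unknown directives are kept, not dropped
        if key == "user-agent":
            if not collecting_agents:
                current = Group()
                groups.append(current)
                collecting_agents = True
            current.agents.append(line)
        elif key in ("allow", "disallow", "crawl-delay"):
            collecting_agents = False
            if current is None:
                issues.append((number, "rule before any User-agent line"))
            else:
                current.rules.append(line)
    return lines, groups, issues
```

Because each `Line` carries its own number, an issue such as `(5, "missing colon")` can point straight back to the offending source line.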

How URL matching works

To decide whether a URL is allowed or blocked for a crawler, we follow the RFC 9309 matching algorithm exactly:

Group selection — the crawler obeys the single group whose User-agent most specifically names it; if none matches, it falls back to the User-agent: * group.
Path evaluation — among the matching group's rules, the longest matching path wins (most specific rule).
Tie-breaking — when an Allow and a Disallow match the same length, Allow takes precedence.
Wildcards — * matches any sequence within a path, and $ anchors a match to the end of the URL.
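
The four steps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not our engine: the function names are hypothetical, groups are represented as a plain dict from User-agent token to `(kind, pattern)` pairs, group selection is simplified to a case-insensitive exact token match, and pattern length (in characters) stands in for specificity.

```python
import re

def pattern_to_regex(pattern):
    # "*" matches any sequence; a trailing "$" anchors the match to the end.
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = ".*".join(re.escape(part) for part in core.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def select_group(groups, token):
    # Prefer the group naming this crawler; otherwise fall back to "*".
    for name, rules in groups.items():
        if name != "*" and name.lower() == token.lower():
            return rules
    return groups.get("*", [])

def is_allowed(groups, token, path):
    """Return (allowed, matched_rule); no matching rule means allowed."""
    best_len, allowed, matched = -1, True, None
    for kind, pattern in select_group(groups, token):
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = kind == "allow"
            # Longest pattern wins; on equal length, Allow takes precedence.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed, matched = len(pattern), is_allow, (kind, pattern)
    return allowed, matched
```

Returning the matched rule alongside the verdict is what makes a trace possible: the caller can show which rule decided the outcome rather than only the outcome itself.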

Always explained

The URL Tester shows the matched rule and a four-step trace for every verdict, so you can see exactly why a URL is blocked or allowed.

How validation and health scoring work

The Validator runs a rule set across three dimensions: syntax correctness, AI crawler configuration, and best practices. Each finding carries a severity (error, warning, or recommendation), a line reference, and a plain-English description. The overall health score breaks down into three sub-scores, one per dimension, and every score uses a single documented banding (80+ good, 60–79 moderate, below 60 poor) that is shared with the Analyzer's visibility score.

Dimension	Examples of what we check
Syntax	Missing colons, invalid or unsupported directives, empty values, duplicates.
AI crawler config	Whether major AI crawlers are addressed; conflicting allow/disallow.
Best practices	Sitemap declared over HTTPS, crawl-delay sanity, conflicting rules.
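
The shared banding can be expressed as a small function. This sketch covers only the thresholds stated above; how the sub-scores are weighted into the overall score is not shown here, and the names `band` and `summarize` are hypothetical.

```python
def band(score):
    """Map a 0-100 score onto the documented bands."""
    if score >= 80:
        return "good"
    if score >= 60:
        return "moderate"
    return "poor"

def summarize(sub_scores):
    """Band each sub-score, keyed by dimension name."""
    return {dimension: band(score) for dimension, score in sub_scores.items()}
```

For example, `summarize({"syntax": 92, "ai_crawler_config": 61, "best_practices": 40})` labels the three dimensions good, moderate, and poor respectively.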

How live site analysis works

When you analyze a domain, our server fetches that domain's public robots.txt through a controlled route handler — not your browser — so we can manage redirects, timeouts, response size, and content type safely. We fetch only the robots.txt file at the host you enter.

An 8-second timeout and a 512 KiB size cap protect against slow or oversized responses.
A missing robots.txt (HTTP 404) is treated as 'all crawlers allowed' and reported as such — not as an error.
Redirects, timeouts, network failures, and non-text content types each produce a distinct, explained result.
The visibility score combines crawler access, restrictions, and syntax health into one explained number.
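
The outcome handling listed above can be sketched as a pure classification step that runs after the fetch. The function name and outcome labels are hypothetical, and the catch-all `http_error` result for other status codes is an assumption that this page does not describe.

```python
TIMEOUT_SECONDS = 8      # applied by the fetching code, not by this function
MAX_BYTES = 512 * 1024   # 512 KiB size cap

def classify_fetch(status, content_type="", body=b"",
                   timed_out=False, network_error=False):
    """Map a fetch attempt onto one distinct, explainable outcome."""
    if timed_out:
        return "timeout"
    if network_error:
        return "network_error"
    if 300 <= status < 400:
        return "redirect"
    if status == 404:
        # A missing robots.txt means all crawlers are allowed; not an error.
        return "missing_all_allowed"
    if status != 200:
        return "http_error"  # assumption: catch-all for other statuses
    if len(body) > MAX_BYTES:
        return "too_large"
    if not content_type.lower().startswith("text/"):
        return "non_text"
    return "ok"
```

Keeping classification separate from the network call makes each outcome easy to test without touching a live site.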


How we classify crawlers

Our crawler database is a transparent config layer. Each crawler entry records its identity (name, operator, User-agent token), its category, a plain-English description of what it does, and a default recommendation. Classifications and compliance notes are compiled from each operator's published documentation and updated as that documentation changes.

Category	What it means
Search engine	Builds a search index that can send you traffic (Googlebot, Bingbot).
AI training	Collects content to train AI models (GPTBot, ClaudeBot, CCBot).
AI search	Fetches pages to answer or cite in AI products (PerplexityBot, Google-Extended; note that Google-Extended is a robots.txt control token, not a separate fetcher).
SEO	Powers third-party SEO tools (AhrefsBot, SemrushBot).
Social	Builds link previews for social platforms (facebookexternalhit, Twitterbot).
Data	Other data-collection crawlers (Amazonbot, Diffbot).
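
A crawler entry with the fields described above might look like the following. This is a hedged sketch: the class name, field names, and category identifiers are hypothetical, and the recommendation value in the usage note is illustrative rather than our actual database content.

```python
from dataclasses import dataclass

CATEGORIES = {"search_engine", "ai_training", "ai_search", "seo", "social", "data"}
RECOMMENDATIONS = {"allow", "block", "depends"}

@dataclass(frozen=True)
class CrawlerEntry:
    name: str
    operator: str
    user_agent_token: str
    category: str
    description: str
    default_recommendation: str

    def __post_init__(self):
        # Reject entries that fall outside the documented vocabularies.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if self.default_recommendation not in RECOMMENDATIONS:
            raise ValueError(f"unknown recommendation: {self.default_recommendation}")
```

An entry could then be declared as `CrawlerEntry("GPTBot", "OpenAI", "GPTBot", "ai_training", "Collects content to train AI models.", "depends")`, where `"depends"` is a placeholder value.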

A default recommendation (allow, block, or depends) reflects the typical site-owner choice for that crawler — it is informational, not advice tailored to your situation. Because User-agent strings can be spoofed, we document how to verify major crawlers by reverse DNS or published IP ranges on each crawler's directory page.
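
Reverse DNS verification generally follows a forward-confirmed pattern: look up the hostname for the requesting IP, check that it belongs to the operator's domain, then resolve that hostname and confirm it maps back to the same IP. The sketch below illustrates the idea; the function name and the injectable lookup parameters are hypothetical, the default lookups handle IPv4 only, and the accepted hostname suffixes must come from each operator's own documentation.

```python
import socket

def verify_crawler_ip(ip, allowed_suffixes, reverse_lookup=None, forward_lookup=None):
    """Return True only if reverse and forward DNS agree for an allowed domain."""
    # Lookups are injectable so the logic can be tested without network access.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False
    # Suffixes should start with "." so "evilgooglebot.com" cannot pass.
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

The forward confirmation matters because anyone who controls the reverse DNS zone for their own IP range can publish a hostname that merely looks like the operator's.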

Limitations and honesty

robots.txt is a request, not a firewall

It is a voluntary standard. Reputable crawlers obey it; some ignore it. Our tools report what a compliant crawler would do and help you verify behavior — they cannot enforce access. For pages that must be private, use authentication, not robots.txt.

Compliance varies by operator; we note known exceptions (for example, aggressive crawlers with a patchy record).
robots.txt controls crawling, not indexing — a blocked URL can still be indexed if linked elsewhere.
We analyze the file you provide or the public file we fetch; we cannot see server-level blocks or firewall rules.

Review and corrections

We build robots.txt tooling and parse thousands of real-world files. Guides are written by practitioners and reviewed against Google Search Central documentation and RFC 9309.

If a crawler's behavior or documentation has changed, or you spot an error, please contact us — we keep classifications current.
