robots.txt for AI: partial-access patterns

Crawlmind Engineering·June 30, 2026·5 min read

Partial access in robots.txt means writing per-crawler rules so that a given AI bot is allowed into some parts of your site and kept out of others, instead of the all-or-nothing block most sites reach for. The mechanics are the same ones search engines have used for years, but the AI crawler landscape adds new user-agent names and one important exception that breaks the usual mental model.

The Robots Exclusion Protocol was formalized in 2022 as RFC 9309, an IETF standards-track document. It pins down the syntax every compliant crawler reads: a file at /robots.txt, grouped by User-agent, with Allow and Disallow rules underneath. It also makes the protocol's voluntary nature explicit. The rules express a site owner's wishes; they are advisory, and nothing in the protocol enforces compliance by any given crawler. So the first thing to understand about partial access is that it works only for bots that choose to honor it.

#How a crawler picks which rules apply

The single most common robots.txt mistake is assuming rules stack. They do not. A crawler reads the file, finds the one group whose User-agent line best matches its own name, and obeys that group alone. Google states this directly: its crawlers "determine the correct group of rules by finding in the robots.txt file the group with the most specific user agent that matches the crawler's user agent" (Google robots.txt spec).

That has a practical consequence. If you write a User-agent: * catch-all and also a specific User-agent: GPTBot block, GPTBot ignores the catch-all entirely and follows only its own block. Anything you wanted to apply to GPTBot has to be repeated inside the GPTBot group. People get burned when they put a broad Disallow under *, add a narrow rule for one bot, and assume the bot still inherits the broad one. It does not.
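The no-inheritance behavior is easy to check locally. The following is a minimal sketch using Python's standard-library urllib.robotparser; the example.com URLs, the /private/ and /drafts/ paths, and the ExampleBot name are placeholders invented for illustration. The standard-library parser is used here only to show group selection, and its rule-matching details are not guaranteed to match any particular crawler's.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: a catch-all group plus a narrow GPTBot group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot has its own group, so the catch-all Disallow does not apply to it.
gptbot_private = parser.can_fetch("GPTBot", "https://example.com/private/report.html")
gptbot_drafts = parser.can_fetch("GPTBot", "https://example.com/drafts/post.html")

# A bot with no group of its own falls back to the catch-all.
other_private = parser.can_fetch("ExampleBot", "https://example.com/private/report.html")

print(gptbot_private, gptbot_drafts, other_private)  # True False False
```

To keep GPTBot out of /private/ as well, the Disallow: /private/ line would have to be repeated inside the GPTBot group.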

#Allow, disallow, and the longest-match rule

Within the matching group, partial access comes from mixing Allow and Disallow on different paths. The interesting case is when two rules both match the same URL. The standard resolves this by specificity: the longer rule path wins. RFC 9309 requires that the most specific match, the one with the most octets, be used, and Google's implementation follows it. The remaining case is a tie between an Allow and a Disallow of equal specificity. There Google "uses the least restrictive rule," which means the Allow wins (Google robots.txt spec).

So this pattern blocks a section but carves out one allowed subfolder:

User-agent: GPTBot
Disallow: /members/
Allow: /members/public/

/members/private/ stays blocked. /members/public/ is reachable because its Allow path is longer and more specific than the Disallow that would otherwise cover it. This is the core of partial access: a broad deny plus a narrow allow.
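The longest-match decision can be sketched in a few lines of Python. This is an illustrative reimplementation, not any crawler's actual code: the function name is_allowed and the (directive, path) tuple format are assumptions made for the example, and it handles plain prefix paths only, without wildcards.

```python
def is_allowed(rules, url_path):
    """Decide access for url_path given one group's rules.

    rules is a list of (directive, rule_path) tuples, for example
    ("Disallow", "/members/"). The longest matching rule path wins;
    on a tie between Allow and Disallow, Allow wins.
    """
    best_len = -1
    allowed = True  # no matching rule means the URL is allowed
    for directive, rule_path in rules:
        if not rule_path or not url_path.startswith(rule_path):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(rule_path) > best_len
        tie_for_allow = len(rule_path) == best_len and is_allow
        if longer or tie_for_allow:
            best_len = len(rule_path)
            allowed = is_allow
    return allowed


gptbot_rules = [("Disallow", "/members/"), ("Allow", "/members/public/")]

print(is_allowed(gptbot_rules, "/members/private/notes.html"))  # False
print(is_allowed(gptbot_rules, "/members/public/faq.html"))     # True
print(is_allowed(gptbot_rules, "/blog/post.html"))              # True
```

The order of the rules in the list does not matter here, which mirrors the point above: specificity, not position in the file, decides the outcome.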

Wildcards extend the same idea. Google, Bing, and other major engines support two special characters in path values: the asterisk * matches zero or more of any character, and the dollar sign $ anchors the end of the URL (Google robots.txt spec). That lets you express things like "block every URL with a query string" or "block all PDFs":

User-agent: GPTBot
Disallow: /*?
Disallow: /*.pdf$

Wildcards are powerful enough to be dangerous. A stray Disallow: /* or a careless pattern can shut a crawler out of the whole site while looking surgical. Test patterns before shipping them.
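One way to test a pattern before shipping it is to translate it into a regular expression and run it against sample paths. The sketch below does that in Python under the two rules described above (* as "zero or more characters", a trailing $ as an end anchor). The helper names rule_to_regex and rule_matches and the sample paths are made up for illustration, and a real crawler's matcher may differ in edge cases such as percent-encoding.

```python
import re


def rule_to_regex(rule_path):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    # Escape literal pieces, then join them with ".*" where "*" appeared.
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(pattern + (r"\Z" if anchored else ""))


def rule_matches(rule_path, url_path):
    """Return True if the rule pattern matches from the start of url_path."""
    return rule_to_regex(rule_path).match(url_path) is not None


print(rule_matches("/*?", "/search?q=1"))                 # True
print(rule_matches("/*?", "/about"))                      # False
print(rule_matches("/*.pdf$", "/docs/a.pdf"))             # True
print(rule_matches("/*.pdf$", "/docs/a.pdf?download=1"))  # False
print(rule_matches("/*", "/anything/at/all"))             # True: blocks everything
```

The last line shows why Disallow: /* is dangerous: it matches every path on the site.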

#The training versus search split

The reason partial access matters more for AI than it ever did for search is that the major providers now run separate crawlers for separate jobs, and you can treat them differently.

OpenAI documents four user agents. GPTBot crawls content to train foundation models. OAI-SearchBot powers ChatGPT's search feature, and blocking it means your pages will not surface there. OAI-AdsBot checks ad landing pages. ChatGPT-User handles user-initiated fetches when someone asks ChatGPT to visit a page (OpenAI bots documentation). Anthropic runs a parallel set: ClaudeBot for training, Claude-SearchBot for improving search results, and Claude-User for user-initiated requests. Anthropic confirms its bots "respect 'do not crawl' signals by honoring industry standard directives in robots.txt" (Anthropic crawler documentation).

That separation is the whole opportunity. A site that wants to appear in AI answers but does not want its content used for model training can allow the search agents and disallow the training ones:

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

Each group is independent, so you are free to set different path rules inside each one. You might let a search bot read your whole site while restricting a training bot to nothing, or open only your documentation to both.
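Because every group has to be written out in full, some teams prefer to generate the file from a small policy table rather than edit it by hand. Here is a minimal Python sketch of that idea. The render_robots function, the policy layout, and the /docs/ path are hypothetical choices for this example; the policy shown lets the search agents read everything and limits the training agents to documentation.

```python
def render_robots(policy):
    """Render {user_agent: [(directive, path), ...]} as robots.txt text."""
    groups = []
    for agent, rules in policy.items():
        lines = [f"User-agent: {agent}"]
        lines += [f"{directive}: {path}" for directive, path in rules]
        groups.append("\n".join(lines))
    # One blank line between groups, trailing newline at the end of the file.
    return "\n\n".join(groups) + "\n"


# Hypothetical policy: training bots may read /docs/ and nothing else.
DOCS_ONLY = [("Disallow", "/"), ("Allow", "/docs/")]

policy = {
    "OAI-SearchBot": [("Allow", "/")],
    "Claude-SearchBot": [("Allow", "/")],
    "GPTBot": DOCS_ONLY,
    "ClaudeBot": DOCS_ONLY,
}

robots_txt = render_robots(policy)
print(robots_txt)
```

Generating the file this way makes it harder to forget a rule in one group, since shared rule sets like DOCS_ONLY are defined once and repeated mechanically.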

#The exception that breaks the model

There is one case where robots.txt does not save you, and it is easy to miss. User-initiated fetches behave differently from automated crawls. OpenAI notes that because ChatGPT-User represents a person asking for a specific page rather than a background crawl, "robots.txt rules may not apply" to it (OpenAI bots documentation). A user who pastes your URL into ChatGPT may still be able to pull the page even if you disallowed the crawler.

The takeaway: robots.txt is the right tool for controlling systematic crawling, training collection, and search indexing. It is not a security boundary and not a guarantee against any individual fetch. If a page must never be read by an outside system, robots.txt is the wrong layer for that, because the protocol is advisory and user-initiated agents may bypass it.

#A working partial-access checklist

Write a separate group for every bot you care about. Do not rely on a single bot inheriting your User-agent: * rules.
Use a broad Disallow plus a narrow Allow to open one section inside a blocked area, and remember the longer path wins.
Reserve wildcards for clear patterns like query strings and file extensions, and test them before publishing.
Decide training versus search per provider, then name the exact agents: GPTBot and ClaudeBot for training, OAI-SearchBot and Claude-SearchBot for search.
Treat user-initiated agents as outside your control, and keep truly private content behind authentication rather than a Disallow line.
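The first checklist item can be automated with a small lint step. The Python sketch below scans a robots.txt file for User-agent lines and reports which of the bots you care about have no group of their own. The function names and the sample file are assumptions for illustration, and the check is deliberately shallow: it confirms a group exists, not that its rules are what you intended.

```python
def agents_with_own_group(robots_txt):
    """Return the lowercased user-agent names that appear in the file."""
    names = set()
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            names.add(value.strip().lower())
    return names


def missing_groups(robots_txt, expected_agents):
    """List the expected agents that have no explicit group."""
    present = agents_with_own_group(robots_txt)
    return [agent for agent in expected_agents if agent.lower() not in present]


# Hypothetical file that names only one of the bots we care about.
sample = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

wanted = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "Claude-SearchBot"]
print(missing_groups(sample, wanted))
# ['ClaudeBot', 'OAI-SearchBot', 'Claude-SearchBot']
```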

Partial access is not exotic. It is the same allow and disallow grammar the web has used for decades, applied with the precision the AI crawler landscape now rewards. The sites that get the most out of it are the ones that stop thinking in terms of one global switch and start writing one deliberate group per bot.
