Crawlmind Research
Who blocks GPTBot, ClaudeBot, PerplexityBot — top-10K
Published 2026-05-12 · by the Crawlmind research team
As of May 2026, 23% of the top 10,000 websites block at least one of GPTBot, ClaudeBot, PerplexityBot, or Google-Extended via robots.txt — up from 11% one year ago. 6% block all four. The pattern is sharply category-driven: news + publishing blocks at a 71% rate, mainstream SaaS blocks at 4%, and developer tooling blocks at less than 1%. The most-blocked bot is GPTBot (18.4% of sites have an explicit Disallow), followed by CCBot (15.1%), PerplexityBot (8.2%), and ClaudeBot (6.9%).
23%: block ≥1 major AI bot
6%: block all four
71%: news + publishing block rate
<1%: developer-tooling block rate
What we measured
We fetched /robots.txt from each of the 10,000 hostnames in the Tranco top-10K list. For each file we parsed the explicit User-agent groups for the 14 named AI crawlers Crawlmind tracks (GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, anthropic-ai, PerplexityBot, Perplexity-User, Google-Extended, Applebot-Extended, CCBot, Meta-ExternalAgent, Bytespider, cohere-ai, Diffbot) and classified each crawler's policy as Allow, Disallow, or Implicit (no named group, so the crawler falls through to User-agent: *).
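The three-way classification above can be sketched in a few lines of Python. This is a minimal illustration, not the parser used for this study: the function names `parse_groups` and `classify` are hypothetical, and the sketch assumes that only a literal `Disallow: /` counts as blocking the homepage (wildcard patterns and longest-match precedence are ignored).

```python
from collections import defaultdict


def parse_groups(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased User-agent token to its (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = defaultdict(list)
    current_agents: list[str] = []
    in_rules = False  # True once the current group has started listing rules
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents, in_rules = [], False
            current_agents.append(value.lower())
            groups[value.lower()]  # register the agent even if it has no rules
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return dict(groups)


def classify(robots_txt: str, bot: str) -> str:
    """Return 'Allow', 'Disallow', or 'Implicit' for one named bot."""
    rules = parse_groups(robots_txt).get(bot.lower())
    if rules is None:
        return "Implicit"  # no named group: the bot falls through to *
    # Simplifying assumption: only a literal "/" rule concerns the homepage.
    blocks_home = any(d == "disallow" and p == "/" for d, p in rules)
    allows_home = any(d == "allow" and p == "/" for d, p in rules)
    return "Disallow" if blocks_home and not allows_home else "Allow"
```

For example, given a file that disallows `/` for GPTBot, disallows only `/private/` for ClaudeBot, and names no other bot, `classify` would label GPTBot as Disallow, ClaudeBot as Allow, and PerplexityBot as Implicit.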
Block rates by category
| Category | Blocks ≥1 AI bot | Blocks all four major |
|---|---|---|
| News + publishing | 71.2% | 38.4% |
| Stock photo + media | 64.8% | 41.0% |
| Government + .edu | 12.1% | 2.3% |
| Mainstream SaaS | 4.1% | 0.7% |
| AI-native startups | 1.9% | 0.0% |
| Developer tooling | 0.8% | 0.0% |
The news + publishing rate is consistent with the wave of NYT/Reuters/AP-style policy decisions in 2024–2025. The near-zero developer-tooling rate is consistent with the GEO (generative engine optimization) incentive: those sites want to be in the training set.
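The two columns in the category table are simple per-category proportions. The sketch below shows one way to compute them, assuming each site has already been reduced to a category label and the set of bots it disallows. The function name `category_block_rates`, the input shape, and the toy records in the usage note are hypothetical, and the sketch assumes "≥1 AI bot" means at least one of the four major bots named in the summary.

```python
from collections import defaultdict

# The four "major" bots named in the summary of this report.
MAJOR_FOUR = {"GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"}


def category_block_rates(sites):
    """sites: iterable of (category, set of bot names the site disallows).

    Returns {category: (pct blocking >=1 major bot, pct blocking all four)}.
    """
    totals = defaultdict(int)
    any_blocked = defaultdict(int)
    all_blocked = defaultdict(int)
    for category, blocked in sites:
        totals[category] += 1
        hit = MAJOR_FOUR & set(blocked)  # ignore bots outside the major four
        if hit:
            any_blocked[category] += 1
        if hit == MAJOR_FOUR:
            all_blocked[category] += 1
    return {
        c: (round(100 * any_blocked[c] / n, 1), round(100 * all_blocked[c] / n, 1))
        for c, n in totals.items()
    }
```

With four made-up "news" sites, of which one blocks all four major bots, one blocks only GPTBot, one blocks only CCBot, and one blocks nothing, the function returns `(50.0, 25.0)` for that category: the CCBot-only site does not count toward either column under the assumption above.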
Who blocks what
Share of sites across the entire 10K list with an explicit Disallow for each bot:
| Bot | Disallow rate |
|---|---|
| GPTBot | 18.4% |
| CCBot | 15.1% |
| PerplexityBot | 8.2% |
| ClaudeBot | 6.9% |
| Google-Extended | 5.8% |
| Bytespider | 4.1% |
| Applebot-Extended | 2.0% |
GPTBot is consistently the most-blocked bot, likely because it was the first AI crawler most operators learned about; policy decisions tend to anchor on whichever bot made the news first.
Implicit vs explicit policy
77% of sites have no explicit AI-bot policy at all; they fall through to the default User-agent: * group. Most of those default groups allow / and disallow nothing AI-relevant, so the bots get in by default. The interesting finding: among sites that do write explicit rules, 87% write them as a *blanket disallow* for each bot rather than a nuanced policy (e.g., allowing OAI-SearchBot for search while disallowing GPTBot for training). The nuance is available in robots.txt, but operators rarely use it.
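To make the "nuanced policy" concrete, here is a small sketch that defines such a robots.txt and checks it with Python's standard-library `urllib.robotparser`. The policy text and the helper name `can_fetch_home` are illustrative assumptions, not a file observed in the dataset.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: block the training crawler, allow the search crawler,
# and leave every other bot on the default group.
NUANCED_POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""


def can_fetch_home(robots_txt: str, bot: str) -> bool:
    """Return True if `bot` may fetch the homepage under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, "/")
```

Under this policy `can_fetch_home(NUANCED_POLICY, "GPTBot")` is false while the same call for `"OAI-SearchBot"` is true, and any unnamed bot such as `"PerplexityBot"` falls through to the `*` group and is allowed.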
What this means for you
If you are competing for AI citation traffic, check your robots.txt against the Crawlmind AI crawler checker. Implicit fall-through to * is fine, but an explicit group (User-agent: GPTBot followed by Allow: /) is a stronger signal both to the bot operator and to scoring tools. If you are a news publisher and you block today, expect a 30–60% drop in inbound AI-engine citations within 90 days; that may be the trade-off you want, but make it consciously.
Methodology
Tranco top-10K list as of 2026-04-15. Each robots.txt was fetched with User-Agent: CrawlmindResearchBot/1.0, max 5 redirects, 10s timeout. The parser is the same one used in our free AI crawler checker. For each named bot we record the most specific matching User-agent group; an explicit Disallow counts as "blocked" only if it targets / or another path pattern that matches the homepage. Raw data: contact [email protected].
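The fetch settings above can be approximated with Python's standard library alone. The following is a sketch under stated assumptions rather than the crawler used for this study: the names `LimitedRedirectHandler`, `build_request`, and `fetch_robots_txt` are hypothetical, the sketch assumes HTTPS, and it treats any network or HTTP error as "no file" (returning `None`), which the methodology does not specify.

```python
import urllib.request
from typing import Optional

USER_AGENT = "CrawlmindResearchBot/1.0"  # value stated in the methodology
TIMEOUT_SECONDS = 10
MAX_REDIRECTS = 5


class LimitedRedirectHandler(urllib.request.HTTPRedirectHandler):
    # The stdlib default cap is 10; lower it to match the methodology.
    max_redirections = MAX_REDIRECTS


def build_request(hostname: str) -> urllib.request.Request:
    """Build the robots.txt request (HTTPS is an assumption of this sketch)."""
    return urllib.request.Request(
        f"https://{hostname}/robots.txt",
        headers={"User-Agent": USER_AGENT},
    )


def fetch_robots_txt(hostname: str) -> Optional[str]:
    """Return the robots.txt body, or None on any network or HTTP error."""
    opener = urllib.request.build_opener(LimitedRedirectHandler)
    try:
        with opener.open(build_request(hostname), timeout=TIMEOUT_SECONDS) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:  # URLError, HTTPError, and socket timeouts all derive from OSError
        return None
```

Subclassing `HTTPRedirectHandler` keeps the rest of the default opener intact; a production crawler would likely also distinguish a 404 (no policy) from a timeout (unknown policy) instead of collapsing both to `None`.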