Who blocks GPTBot, ClaudeBot, and PerplexityBot: the Tranco top-10K

What we measured

We fetched /robots.txt from each of the 10,000 hostnames in the Tranco top-10K list. For each file we parsed the explicit User-agent blocks for the 14 named AI crawlers Crawlmind tracks (GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, anthropic-ai, PerplexityBot, Perplexity-User, Google-Extended, Applebot-Extended, CCBot, Meta-ExternalAgent, Bytespider, cohere-ai, Diffbot) and classified the policy as Allow, Disallow, or Implicit (fall-through to *).
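The three-way classification above can be sketched roughly as follows. This is a minimal illustration, not the Crawlmind parser: the names parse_groups, classify, HOMEPAGE_PATTERNS, and SAMPLE are hypothetical, and the homepage check is simplified to the patterns "/" and "/*" rather than full longest-match semantics.

```python
from typing import Dict, List, Tuple

# Assumption: only these patterns are treated as covering the homepage.
HOMEPAGE_PATTERNS = {"/", "/*"}


def parse_groups(robots_txt: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each lowercased User-agent token to its (directive, path) rules."""
    groups: Dict[str, List[Tuple[str, str]]] = {}
    current_agents: List[str] = []
    in_rules = False  # True once the current group has seen an Allow/Disallow
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def classify(robots_txt: str, bot: str) -> str:
    """Return 'Allow', 'Disallow', or 'Implicit' for one named bot."""
    rules = parse_groups(robots_txt).get(bot.lower())
    if rules is None:
        return "Implicit"  # no named block: the bot falls through to *
    blocked = any(d == "disallow" and p in HOMEPAGE_PATTERNS for d, p in rules)
    allowed = any(d == "allow" and p in HOMEPAGE_PATTERNS for d, p in rules)
    return "Disallow" if blocked and not allowed else "Allow"


SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Allow: /
"""
```

With SAMPLE, classify(SAMPLE, "GPTBot") and classify(SAMPLE, "CCBot") come out as Disallow, ClaudeBot as Allow, and PerplexityBot as Implicit because it has no named block.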

Block rates by category

Category	Blocks ≥1 AI bot	Blocks all four major AI bots
News + publishing	71.2%	38.4%
Stock photo + media	64.8%	41.0%
Government + .edu	12.1%	2.3%
Mainstream SaaS	4.1%	0.7%
AI-native startups	1.9%	0.0%
Developer tooling	0.8%	0.0%

The news + publishing rate is consistent with the wave of NYT/Reuters/AP-style policy decisions in 2024–2025. The near-zero rate for developer tooling is consistent with the GEO (generative engine optimization) incentive: those sites want to be in the training set.

Who blocks what

Share of the entire 10K list that explicitly disallows each bot:

Bot	Disallow rate
GPTBot	18.4%
CCBot	15.1%
PerplexityBot	8.2%
ClaudeBot	6.9%
Google-Extended	5.8%
Bytespider	4.1%
Applebot-Extended	2.0%

GPTBot is consistently the most-blocked bot: it was the first one most operators learned about, and policy decisions tend to anchor on whichever bot made the news first.

Implicit vs explicit policy

77% of sites have no explicit AI-bot policy at all: they fall through to the default User-agent: * block. Most of those default blocks allow / and disallow nothing AI-relevant, so the bots get in by default. The interesting finding: among sites that do write explicit blocks, 87% write a blanket disallow for each bot rather than a nuanced policy (e.g., "allow OAI-SearchBot for search but disallow GPTBot for training"). The nuance is available in robots.txt, but operators rarely use it.
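To make the nuanced option concrete, here is a small sketch of a per-bot policy checked with Python's standard-library urllib.robotparser. It is an illustration only: NUANCED_POLICY, allowed_on_homepage, and the example.com URL are hypothetical, and the standard-library parser is a stand-in, not the parser used for this study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: let the search crawler in, keep the training crawler
# out, and leave every other bot on the default * block.
NUANCED_POLICY = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_on_homepage(policy: str, bot: str) -> bool:
    """Check whether `bot` may fetch the homepage under `policy`."""
    parser = RobotFileParser()
    parser.parse(policy.splitlines())
    return parser.can_fetch(bot, "https://example.com/")
```

Under this policy OAI-SearchBot is allowed, GPTBot is refused, and a bot with no named block (ClaudeBot, say) is allowed through the * fallback.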

What this means for you

If you are competing for AI citation traffic, check your robots.txt against the Crawlmind AI crawler checker. Implicit fall-through to * is fine, but an explicit block (a "User-agent: GPTBot" line followed by "Allow: /") is a stronger signal both to the bot operator and to scoring tools. If you are a news publisher and you block today, expect a 30–60% drop in inbound AI-engine citations within 90 days; that may be the trade-off you want, but make it consciously.

Methodology

Tranco top-10K list as of 2026-04-15. Robots.txt fetched with User-Agent: CrawlmindResearchBot/1.0, max 5 redirects, 10s timeout. Parser is the same one in our free AI crawler checker. For each named bot we record the most specific matching User-agent block; explicit Disallow counts as "blocked" only if the block is on / or a path that includes the homepage. Raw data: contact [email protected].
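The fetch settings above could be reproduced along these lines with Python's standard library. This is a sketch under assumptions, not the actual crawler: the names FiveRedirects, build_request, and fetch_robots are hypothetical, and the choice of https and of returning None on any failure is not stated in the methodology.

```python
import urllib.request
from typing import Optional

USER_AGENT = "CrawlmindResearchBot/1.0"
TIMEOUT_SECONDS = 10


class FiveRedirects(urllib.request.HTTPRedirectHandler):
    """Redirect handler capped at 5 hops instead of the library default."""
    max_redirections = 5


def build_request(hostname: str) -> urllib.request.Request:
    # Assumption: robots.txt is requested over https.
    return urllib.request.Request(
        f"https://{hostname}/robots.txt",
        headers={"User-Agent": USER_AGENT},
    )


def fetch_robots(hostname: str) -> Optional[str]:
    """Return the robots.txt body, or None on any network or HTTP error."""
    opener = urllib.request.build_opener(FiveRedirects)
    try:
        with opener.open(build_request(hostname), timeout=TIMEOUT_SECONDS) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return None
```

How a failed fetch is counted (excluded, or treated as having no policy) is a separate decision that this sketch leaves open.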
