
Crawlmind Research

Who blocks GPTBot, ClaudeBot, PerplexityBot — top-10K

Published 2026-05-12 · by the Crawlmind research team

As of May 2026, 23% of the top 10,000 websites block at least one of GPTBot, ClaudeBot, PerplexityBot, or Google-Extended via robots.txt — up from 11% one year ago. 6% block all four. The pattern is sharply category-driven: news + publishing blocks at a 71% rate, mainstream SaaS blocks at 4%, and developer tooling blocks at less than 1%. The most-blocked bot is GPTBot (18.4% of sites have an explicit Disallow), followed by CCBot (15.1%), PerplexityBot (8.2%), and ClaudeBot (6.9%).

23% block ≥1 major AI bot
6% block all four
71% news + publishing block rate
<1% developer-tooling block rate

What we measured

We fetched /robots.txt from each of the 10,000 hostnames in the Tranco top-10K list. For each file we parsed the explicit User-agent blocks for the 14 named AI crawlers Crawlmind tracks (GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, anthropic-ai, PerplexityBot, Perplexity-User, Google-Extended, Applebot-Extended, CCBot, Meta-ExternalAgent, Bytespider, cohere-ai, Diffbot) and classified the policy as Allow, Disallow, or Implicit (fall-through to *).
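The three-way classification above can be sketched roughly as follows. This is a minimal illustration, not the parser Crawlmind runs: the function names `parse_groups` and `classify` are hypothetical, and the homepage test is deliberately simplified (it only recognizes `Disallow: /` or `Disallow: /*`, and does not implement full longest-match semantics).

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # (directive, path), e.g. ("disallow", "/")


def parse_groups(robots_txt: str) -> Dict[str, List[Rule]]:
    """Map each lowercased User-agent token to the rules of its group."""
    groups: Dict[str, List[Rule]] = {}
    current_agents: List[str] = []
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents, in_rules = [], False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def classify(robots_txt: str, bot: str) -> str:
    """Return "Allow", "Disallow", or "Implicit" for one named bot."""
    rules = parse_groups(robots_txt).get(bot.lower())
    if rules is None:
        return "Implicit"  # no named group: the bot falls through to *
    # Simplifying assumption: only a root-level Disallow blocks the homepage.
    blocks_home = any(d == "disallow" and p in ("/", "/*") for d, p in rules)
    allows_home = any(d == "allow" and p in ("/", "/$") for d, p in rules)
    return "Disallow" if blocks_home and not allows_home else "Allow"
```

With a file that names GPTBot under `Disallow: /` and has no PerplexityBot group, `classify(text, "GPTBot")` returns "Disallow" and `classify(text, "PerplexityBot")` returns "Implicit".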

Block rates by category

| Category | Blocks ≥1 AI bot | Blocks all four major |
| --- | --- | --- |
| News + publishing | 71.2% | 38.4% |
| Stock photo + media | 64.8% | 41.0% |
| Government + .edu | 12.1% | 2.3% |
| Mainstream SaaS | 4.1% | 0.7% |
| AI-native startups | 1.9% | 0.0% |
| Developer tooling | 0.8% | 0.0% |

The news + publishing rate is consistent with the wave of NYT/Reuters/AP-style policy decisions in 2024–2025. The near-zero rate for developer tooling is consistent with the generative engine optimization (GEO) incentive: those sites want to be in the training set.
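For readers who want to reproduce the two columns on their own crawl, the aggregation can be sketched as below. This assumes each site has already been reduced to a category label and a set of blocked bot names, and that "≥1" and "all four" refer to GPTBot, ClaudeBot, PerplexityBot, and Google-Extended; the function name `block_rates` and the input shape are illustrative, not Crawlmind's pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Assumption: the "four major" bots are the ones named in the summary.
MAJOR_FOUR = frozenset({"GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"})


def block_rates(sites: Iterable[Tuple[str, Set[str]]]) -> Dict[str, Tuple[float, float]]:
    """Return {category: (pct blocking >=1 major bot, pct blocking all four)}."""
    totals: Dict[str, int] = defaultdict(int)
    any_blocked: Dict[str, int] = defaultdict(int)
    all_blocked: Dict[str, int] = defaultdict(int)
    for category, blocked in sites:
        totals[category] += 1
        hit = MAJOR_FOUR & blocked  # major bots this site blocks
        any_blocked[category] += bool(hit)
        all_blocked[category] += hit == MAJOR_FOUR
    return {
        category: (100.0 * any_blocked[category] / n, 100.0 * all_blocked[category] / n)
        for category, n in totals.items()
    }
```

Note that a site blocking only CCBot would not count toward either column under this assumption, since CCBot is not one of the four.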

Who blocks what

Share of sites across the entire 10K list with an explicit Disallow for each bot:

| Bot | Disallow rate |
| --- | --- |
| GPTBot | 18.4% |
| CCBot | 15.1% |
| PerplexityBot | 8.2% |
| ClaudeBot | 6.9% |
| Google-Extended | 5.8% |
| Bytespider | 4.1% |
| Applebot-Extended | 2.0% |

GPTBot is consistently the most-blocked bot. It was the first bot most operators learned about, and policy decisions tend to anchor on whichever bot was the first to make the news.

Implicit vs explicit policy

77% of sites have no explicit AI-bot policy at all; they fall through to the default User-agent: * group. Most of those default groups allow / and disallow nothing AI-relevant, so the bots get in by default. The interesting finding: among sites that do write explicit per-bot rules, 87% write a blanket disallow for each bot rather than a nuanced policy (e.g., "allow OAI-SearchBot but disallow GPTBot training"). The nuance is available in robots.txt, but operators rarely use it.
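As an illustration of the nuanced policy mentioned above, the sketch below embeds a small robots.txt that disallows GPTBot while allowing OAI-SearchBot, then checks it with Python's standard-library `urllib.robotparser`. The policy text, the helper name `can_fetch_home`, and the example.com URL are assumptions for demonstration; the standard-library parser's matching may differ in edge cases from what each bot operator implements.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block the training crawler, allow the search crawler.
NUANCED_POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""


def can_fetch_home(robots_txt: str, bot: str, url: str = "https://example.com/") -> bool:
    """Report whether `bot` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)
```

Under this policy GPTBot is refused the homepage, OAI-SearchBot is admitted by its own group, and any bot without a named group (ClaudeBot, for example) is admitted through the * group.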

What this means for you

If you are competing for AI citation traffic, check your robots.txt against the Crawlmind AI crawler checker. Implicit fall-through to * is fine, but an explicit group (a "User-agent: GPTBot" line followed by "Allow: /") is a stronger signal both to the bot operator and to scoring tools. If you are a news publisher and you block today, expect a 30–60% drop in inbound AI-engine citations within 90 days; that may be the trade-off you want, but make it consciously.

Methodology

Tranco top-10K list as of 2026-04-15. Robots.txt fetched with User-Agent: CrawlmindResearchBot/1.0, max 5 redirects, 10s timeout. Parser is the same one in our free AI crawler checker. For each named bot we record the most specific matching User-agent block; explicit Disallow counts as "blocked" only if the block is on / or a path that includes the homepage. Raw data: contact [email protected].
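The fetch settings above (user agent, redirect cap, timeout) could be approximated with the Python standard library as follows. This is a sketch only: the https scheme, the UTF-8 decoding, and the names `CappedRedirectHandler`, `build_request`, and `fetch_robots` are assumptions, and error handling for timeouts or missing files is left out.

```python
import urllib.request

USER_AGENT = "CrawlmindResearchBot/1.0"
TIMEOUT_SECONDS = 10


class CappedRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that gives up after 5 hops (urllib's default is 10)."""

    max_redirections = 5


def build_request(hostname: str) -> urllib.request.Request:
    """Build the robots.txt request for a hostname (https is an assumption)."""
    return urllib.request.Request(
        f"https://{hostname}/robots.txt",
        headers={"User-Agent": USER_AGENT},
    )


def fetch_robots(hostname: str) -> str:
    """Fetch robots.txt with the capped redirect handler and a 10s timeout."""
    opener = urllib.request.build_opener(CappedRedirectHandler())
    with opener.open(build_request(hostname), timeout=TIMEOUT_SECONDS) as response:
        return response.read().decode("utf-8", errors="replace")
```

A call such as `fetch_robots("example.com")` returns the file body as text; callers would need to decide how to treat HTTP errors and timeouts, which the methodology does not specify.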
