The complete list of AI crawlers in 2026
Updated 2026-05-17 · by the Crawlmind team
There are 14 named AI crawlers worth knowing about as of May 2026, operated by 9 vendors (OpenAI, Anthropic, Google, Apple, Perplexity, Meta, Common Crawl, ByteDance, Cohere). This is the canonical reference: for each bot, the user-agent string, the operator, what it fetches pages for, whether it honors robots.txt, and the policy Crawlmind recommends. Use it as a checklist when you write your robots.txt.
The list
| User-agent | Operator | Purpose | Honors robots.txt | Recommended policy |
|---|---|---|---|---|
| GPTBot | OpenAI | Training data for GPT models | Yes | Allow if you want training inclusion |
| OAI-SearchBot | OpenAI | ChatGPT search retrieval | Yes | Allow (this is the citation bot) |
| ChatGPT-User | OpenAI | User-triggered URL fetch from ChatGPT | Yes | Allow (high-intent users) |
| ClaudeBot | Anthropic | Training + retrieval for Claude | Yes | Allow if you want citation in Claude |
| anthropic-ai | Anthropic | Legacy training crawler | Yes | Match your ClaudeBot policy |
| Claude-Web | Anthropic | Legacy user-triggered fetch | Yes | Match your ClaudeBot policy |
| PerplexityBot | Perplexity | Index for Perplexity answers | Yes | Allow (primary citation bot) |
| Perplexity-User | Perplexity | User-triggered URL fetch | Yes | Allow |
| Google-Extended | Google | Opt-out for Gemini training + AI Overviews | Yes | Allow if you want Gemini inclusion |
| Applebot-Extended | Apple | Opt-out for Apple Intelligence training | Yes | Allow if you want inclusion |
| CCBot | Common Crawl | Public training-data corpus | Yes | Allow for broad LLM training inclusion |
| Bytespider | ByteDance | Training data for ByteDance LLMs | Mixed (reports of violations) | Block unless you target China |
| Meta-ExternalAgent | Meta | Training data for Llama | Yes | Allow if you want Llama inclusion |
| cohere-ai | Cohere | Training data for Cohere models | Yes | Allow for broad inclusion |
Allow everything (recommended for SaaS + content sites)
Most sites want maximum AI-engine visibility. Use this robots.txt skeleton:
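
    User-agent: *
    Allow: /
    Disallow: /admin/
    Disallow: /api/

    # A bot that matches a named group ignores the * group,
    # so the shared exclusions are repeated in each group.
    User-agent: GPTBot
    Allow: /
    Disallow: /admin/
    Disallow: /api/

    User-agent: OAI-SearchBot
    Allow: /
    Disallow: /admin/
    Disallow: /api/

    User-agent: ChatGPT-User
    Allow: /
    Disallow: /admin/
    Disallow: /api/

    User-agent: ClaudeBot
    Allow: /
    Disallow: /admin/
    Disallow: /api/

    User-agent: PerplexityBot
    Allow: /
    Disallow: /admin/
    Disallow: /api/

    User-agent: Google-Extended
    Allow: /
    Disallow: /admin/
    Disallow: /api/

    User-agent: Applebot-Extended
    Allow: /
    Disallow: /admin/
    Disallow: /api/

    Sitemap: https://example.com/sitemap.xml

Why the explicit allow blocks even though * already allows? Signal. Naming each bot tells the operator (and tools like Crawlmind) that you made a deliberate decision; it also lets you add per-bot path exclusions later without rewriting everything. One caveat: a crawler that matches a named group follows only that group and ignores the * group, so any Disallow rules you still want (here /admin/ and /api/) must be repeated in every named group.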
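How a crawler picks its group can be checked with a small matcher. The sketch below is a simplified reading of the Robots Exclusion Protocol (RFC 9309), not a full parser: it assumes the caller passes the bot's product token (for example `GPTBot`), it ignores path wildcards (`*`, `$`), and the helper names `parse_groups` and `is_allowed` are hypothetical.

```python
def parse_groups(robots_txt):
    """Parse robots.txt into {lowercased user-agent token: [(directive, path), ...]}."""
    groups = {}
    current_agents = []
    reading_agents = False  # True while consecutive User-agent lines stack up
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current_agents = []  # a rule line was seen, so a new group starts
            reading_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def is_allowed(groups, bot_token, path):
    """Use the bot's own group if present, else *; the longest matching rule wins."""
    rules = groups.get(bot_token.lower(), groups.get("*", []))
    best = ("allow", "")  # no matching rule means the path is allowed
    for directive, rule_path in rules:
        if rule_path and path.startswith(rule_path):
            longer = len(rule_path) > len(best[1])
            tie_for_allow = len(rule_path) == len(best[1]) and directive == "allow"
            if longer or tie_for_allow:
                best = (directive, rule_path)
    return best[0] == "allow"


ROBOTS_TXT = """\
User-agent: *
Allow: /
Disallow: /admin/

User-agent: GPTBot
Allow: /
"""

groups = parse_groups(ROBOTS_TXT)
print(is_allowed(groups, "ClaudeBot", "/admin/users"))  # False: falls through to *
print(is_allowed(groups, "GPTBot", "/admin/users"))     # True: the GPTBot group has no Disallow
```

The second result is the reason to repeat shared exclusions inside each named group: a bare `Allow: /` group reopens paths that the `*` group had closed.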
Block training, allow retrieval
If you want to be cited in answer engines but not in training corpora:

    User-agent: GPTBot
    Disallow: /

    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

The trade-off: blocking ClaudeBot today blocks both training *and* retrieval, because Anthropic has not yet split them. Watch for that to change.
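Because the split is likely to shift as vendors add or separate bots, it can help to generate the file from a small policy table instead of editing it by hand. A minimal sketch, assuming a hypothetical `POLICY` mapping that mirrors the rules above and a hypothetical `render_robots` helper:

```python
# Per-bot decision for the "block training, allow retrieval" stance.
POLICY = {
    "GPTBot": "block",
    "OAI-SearchBot": "allow",
    "ChatGPT-User": "allow",
    "ClaudeBot": "block",
    "PerplexityBot": "allow",
    "Google-Extended": "block",
    "CCBot": "block",
}


def render_robots(policy):
    """Render one robots.txt group per bot from an allow/block mapping."""
    blocks = []
    for agent, decision in policy.items():
        if decision not in ("allow", "block"):
            raise ValueError(f"unknown decision for {agent}: {decision!r}")
        directive = "Allow" if decision == "allow" else "Disallow"
        blocks.append(f"User-agent: {agent}\n{directive}: /")
    return "\n\n".join(blocks) + "\n"


print(render_robots(POLICY))
```

Flipping a single entry, for instance `ClaudeBot` once training and retrieval are split, then becomes a one-line change.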
Block everything AI
Some publishers (NYT, Reuters, Sony) have chosen to block all AI crawlers pending licensing deals. Use this if that's your stance:

    User-agent: GPTBot
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Bytespider
    Disallow: /

    User-agent: Meta-ExternalAgent
    Disallow: /

Expect a 30–60% drop in AI-engine inbound citations within 90 days. That may be the trade-off you want; make it consciously.
Bots that ignore robots.txt
Most named bots above honor robots.txt. The ones that have been observed violating their stated policy in 2024–2026:
- Bytespider (ByteDance): multiple reports of crawls from IPs claiming the Bytespider user-agent that did not honor Disallow. Block at the firewall, not just in robots.txt.
- Unidentified / spoofed AI bots: increasingly common. If you see suspicious crawl patterns from IPs not in any vendor's published range, treat them as bot abuse and rate-limit.
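The "IPs not in any vendor's published range" test can be sketched with Python's standard `ipaddress` module. The CIDR blocks below are placeholders taken from the reserved documentation ranges, not real vendor ranges; in practice you would load each vendor's published list. `PUBLISHED_RANGES`, `claimed_bot`, and `looks_spoofed` are hypothetical names.

```python
import ipaddress

# Placeholder CIDR blocks (documentation ranges). NOT real vendor ranges.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def claimed_bot(user_agent):
    """Return the known bot name that appears in the user-agent string, if any."""
    ua = user_agent.lower()
    for bot in PUBLISHED_RANGES:
        if bot.lower() in ua:
            return bot
    return None


def looks_spoofed(user_agent, remote_ip):
    """True when a request claims a known bot but comes from outside its ranges."""
    bot = claimed_bot(user_agent)
    if bot is None:
        return False  # no bot identity claimed, nothing to verify
    address = ipaddress.ip_address(remote_ip)
    networks = (ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES[bot])
    return not any(address in network for network in networks)


print(looks_spoofed("Mozilla/5.0 (compatible; GPTBot/1.0)", "192.0.2.10"))   # False
print(looks_spoofed("Mozilla/5.0 (compatible; GPTBot/1.0)", "203.0.113.5"))  # True
```

A request flagged this way is a candidate for rate-limiting or blocking at the edge; the check says nothing about traffic that does not claim a bot identity at all.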
The right defense against bots that ignore policy is at the edge, not in robots.txt: Cloudflare's AI-bot block, Fastly's similar feature, or a WAF rule keyed off the user-agent.
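As an application-level stand-in for a rule keyed off the user-agent, the sketch below wraps a WSGI app and rejects requests whose User-Agent contains a blocked token. It is not Cloudflare or Fastly configuration, and it assumes the bot sends its real user-agent; `BLOCKED_TOKENS`, `block_ai_bots`, and `hello_app` are hypothetical names.

```python
# Lowercase substrings to reject; extend with any other tokens you want to refuse.
BLOCKED_TOKENS = ("bytespider",)


def block_ai_bots(app, blocked_tokens=BLOCKED_TOKENS):
    """WSGI middleware: answer 403 when the User-Agent contains a blocked token."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Trivial downstream app used to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


application = block_ai_bots(hello_app)
```

Unlike a robots.txt rule, this refuses the request whether or not the bot chooses to cooperate, which is the point of moving the defense to the edge.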
Check yours
Use the free AI crawler access checker: paste your URL and see exactly what each of the 14 bots above sees when reading your robots.txt. It catches implicit fall-through to * (which is fine) and missing explicit blocks (which is a signal-strength issue, not a policy bug).
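The difference between an explicit group and implicit fall-through can be illustrated locally. This is a rough sketch of the idea, not how the Crawlmind checker is implemented; `AI_BOTS` copies the user-agent column from the table above, and `named_agents` and `coverage_report` are hypothetical helpers.

```python
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "Claude-Web", "PerplexityBot", "Perplexity-User", "Google-Extended",
    "Applebot-Extended", "CCBot", "Bytespider", "Meta-ExternalAgent", "cohere-ai",
]


def named_agents(robots_txt):
    """Collect every User-agent token declared in the file, lowercased."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def coverage_report(robots_txt, bots=AI_BOTS):
    """Say, per bot, whether it has its own group or relies on the * group."""
    named = named_agents(robots_txt)
    report = {}
    for bot in bots:
        if bot.lower() in named:
            report[bot] = "explicit group"
        elif "*" in named:
            report[bot] = "falls through to *"
        else:
            report[bot] = "no matching group (allowed by default)"
    return report


sample = "User-agent: *\nAllow: /\n\nUser-agent: GPTBot\nDisallow: /\n"
for bot, status in coverage_report(sample).items():
    print(f"{bot}: {status}")
```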