The complete list of AI crawlers in 2026

The list

User-agent	Operator	Purpose	Honors robots.txt	Recommended policy
`GPTBot`	OpenAI	Training data for GPT models	Yes	Allow if you want training inclusion
`OAI-SearchBot`	OpenAI	ChatGPT search retrieval	Yes	Allow: this is the citation bot
`ChatGPT-User`	OpenAI	User-triggered URL fetch from ChatGPT	Yes	Allow: high-intent users
`ClaudeBot`	Anthropic	Training + retrieval for Claude	Yes	Allow if you want citation in Claude
`anthropic-ai`	Anthropic	Legacy training crawler	Yes	Match your ClaudeBot policy
`Claude-Web`	Anthropic	Legacy user-triggered fetch	Yes	Match your ClaudeBot policy
`PerplexityBot`	Perplexity	Index for Perplexity answers	Yes	Allow: primary citation bot
`Perplexity-User`	Perplexity	User-triggered URL fetch	Yes	Allow
`Google-Extended`	Google	Opt-out token for Gemini training and grounding (not a separate crawler; does not affect Google Search or AI Overviews)	Yes	Allow if you want Gemini inclusion
`Applebot-Extended`	Apple	Opt-out for Apple Intelligence training	Yes	Allow if you want inclusion
`CCBot`	Common Crawl	Public training-data corpus	Yes	Allow for broad LLM training inclusion
`Bytespider`	ByteDance	Training data for ByteDance LLMs	Mixed (reports of violations)	Block unless you target China
`Meta-ExternalAgent`	Meta	Training data for Llama	Yes	Allow if you want Llama inclusion
`cohere-ai`	Cohere	Training data for Cohere models	Yes	Allow for broad inclusion

Allow everything (recommended for SaaS + content sites)

Most sites want maximum AI-engine visibility. Use this robots.txt skeleton:

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

Sitemap: https://example.com/sitemap.xml

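Why add explicit allow blocks when * already allows everything? Signal. Naming each bot tells the operator (and tools like Crawlmind) that you made a deliberate decision, and it lets you add per-bot path exclusions later without rewriting everything. One caveat: a crawler that finds a group naming it follows only that group and ignores *, so the named bots above do not inherit the /admin/ and /api/ exclusions. Repeat those Disallow lines in each named group if they should apply to every bot.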
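To see how a named group interacts with the * group, the sketch below runs two reduced robots.txt files through Python's standard `urllib.robotparser`. Both files are illustrative stand-ins rather than the skeleton above, and `SomeOtherBot` is a made-up user-agent. The rules are ordered so that first-match parsers (like this one) and longest-match parsers (as described in RFC 9309) agree on the outcome.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Named group with only "Allow: /": GPTBot never sees the * exclusions.
NAIVE = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Allow: /
"""

# Same file, with the exclusion repeated inside the named group.
REPEATED = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /admin/
Allow: /
"""

URL = "https://example.com/admin/settings"  # hypothetical URL
naive = parse_robots(NAIVE)
repeated = parse_robots(REPEATED)

gptbot_naive = naive.can_fetch("GPTBot", URL)
other_naive = naive.can_fetch("SomeOtherBot", URL)
gptbot_repeated = repeated.can_fetch("GPTBot", URL)

print("naive, GPTBot:      ", gptbot_naive)
print("naive, SomeOtherBot:", other_naive)
print("repeated, GPTBot:   ", gptbot_repeated)
```

In the first file only the unnamed bot is kept out of /admin/; repeating the Disallow line in the named group closes that gap.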

Block training, allow retrieval

If you want to be cited in answer engines but not in training corpora:

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

The trade-off: blocking ClaudeBot today blocks both training *and* retrieval, because Anthropic has not yet split them. Watch for that to change.
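If you maintain this split across several sites, generating the file from a small policy table can keep the training and retrieval groups from drifting apart. This is a minimal sketch: `POLICY` and `render_robots` are hypothetical names, and the allow/block assignments simply mirror the block above.

```python
# Hypothetical policy table: True allows the whole site, False blocks it.
# The assignments mirror the "block training, allow retrieval" example.
POLICY = {
    "GPTBot": False,         # training
    "OAI-SearchBot": True,   # retrieval
    "ChatGPT-User": True,    # user-triggered fetch
    "ClaudeBot": False,      # training + retrieval (not split)
    "PerplexityBot": True,   # retrieval
    "Google-Extended": False,
    "CCBot": False,
}


def render_robots(policy: dict[str, bool]) -> str:
    """Render one robots.txt group per user-agent, separated by blank lines."""
    groups = []
    for agent, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}")
    return "\n\n".join(groups) + "\n"


robots_txt = render_robots(POLICY)
print(robots_txt)
```

Keeping the purpose as a comment next to each entry makes it easier to revisit a decision when an operator changes what a bot does.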

Block everything AI

Some publishers (NYT, Reuters, Sony) have chosen to block all AI crawlers pending licensing deals. Use this if that's your stance:

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

Expect a 30–60% drop in AI-engine inbound citations within 90 days. That may be the trade-off you want; make it consciously.

Bots that ignore robots.txt

Most named bots above honor robots.txt. The ones that have been observed violating their stated policy in 2024–2026:

Bytespider (ByteDance): multiple reports of crawls from IPs claiming Bytespider user-agent that did not honor Disallow. Block at the firewall, not just in robots.txt.
Unidentified / spoofed AI bots: increasingly common. If you see suspicious crawl patterns from IPs not in any vendor's published range, treat as bot abuse and rate-limit.

The right defense against bots that ignore policy is at the edge, not in robots.txt: Cloudflare's AI-bot block, Fastly's similar feature, or a WAF rule keyed off the user-agent.
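Where no edge product is available, the same user-agent rule can be approximated inside the application. The sketch below is a plain WSGI middleware, not Cloudflare's or Fastly's feature; `BLOCKED_AGENTS`, `is_blocked`, and `block_ai_bots` are hypothetical names, and the blocklist is an assumption you would replace with your own.

```python
import re

# Assumption: user-agent substrings to refuse, matched case-insensitively.
BLOCKED_AGENTS = ("Bytespider",)
_BLOCKED_RE = re.compile(
    "|".join(re.escape(agent) for agent in BLOCKED_AGENTS), re.IGNORECASE
)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a blocked token."""
    return bool(_BLOCKED_RE.search(user_agent or ""))


def block_ai_bots(app):
    """Wrap a WSGI app so blocked user-agents get 403 before the app runs."""

    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)

    return middleware
```

A rule like this only stops bots that announce themselves honestly. Spoofed crawlers need IP-based checks or rate limiting, which this sketch does not attempt.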

Check yours

Use the free AI crawler access checker: paste your URL and see exactly what each bot in the list above sees when reading your robots.txt. It catches implicit fall-through to * (which is fine) and missing explicit blocks (which is a signal-strength issue, not a policy bug).
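For a rough local approximation of this kind of check (not the checker's actual implementation), the sketch below reports, for each bot, whether the site root is allowed and whether the bot has its own group or falls through. `named_agents`, `report`, and the sample file are hypothetical; parsing relies on Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser


def named_agents(robots_txt: str) -> set[str]:
    """Collect the lowercase user-agent tokens that have their own group."""
    agents = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent" and value.strip():
            agents.add(value.strip().lower())
    return agents


def report(robots_txt: str, bots: list[str], url: str = "https://example.com/"):
    """Map each bot to (access, source) for the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    explicit = named_agents(robots_txt)
    result = {}
    for bot in bots:
        access = "allowed" if parser.can_fetch(bot, url) else "blocked"
        source = "explicit group" if bot.lower() in explicit else "fall-through"
        result[bot] = (access, source)
    return result


# Hypothetical robots.txt used only for illustration.
SAMPLE = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

for bot, (access, source) in report(SAMPLE, ["GPTBot", "ClaudeBot"]).items():
    print(f"{bot}: {access} ({source})")
```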
