The complete list of AI crawlers in 2026

Updated 2026-05-17 · by the Crawlmind team

There are 14 named AI crawlers worth knowing about as of May 2026, operated by 9 vendors (OpenAI, Anthropic, Google, Apple, Perplexity, Meta, Common Crawl, ByteDance, Cohere). This is the canonical reference: for each bot, the user-agent string, the operator, what it fetches pages for, whether it honors robots.txt, and the policy Crawlmind recommends. Use it as a checklist when you write your robots.txt.

The list

| User-agent | Operator | Purpose | Honors robots.txt | Recommended policy |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Training data for GPT models | Yes | Allow if you want training inclusion |
| OAI-SearchBot | OpenAI | ChatGPT search retrieval | Yes | Allow: this is the citation bot |
| ChatGPT-User | OpenAI | User-triggered URL fetch from ChatGPT | Yes | Allow: high-intent users |
| ClaudeBot | Anthropic | Training + retrieval for Claude | Yes | Allow if you want citation in Claude |
| anthropic-ai | Anthropic | Legacy training crawler | Yes | Match your ClaudeBot policy |
| Claude-Web | Anthropic | Legacy user-triggered fetch | Yes | Match your ClaudeBot policy |
| PerplexityBot | Perplexity | Index for Perplexity answers | Yes | Allow: primary citation bot |
| Perplexity-User | Perplexity | User-triggered URL fetch | Yes | Allow |
| Google-Extended | Google | Opt-out token for Gemini training and grounding | Yes | Allow if you want Gemini inclusion |
| Applebot-Extended | Apple | Opt-out for Apple Intelligence training | Yes | Allow if you want inclusion |
| CCBot | Common Crawl | Public training-data corpus | Yes | Allow for broad LLM training inclusion |
| Bytespider | ByteDance | Training data for ByteDance LLMs | Mixed (reports of violations) | Block unless you target China |
| Meta-ExternalAgent | Meta | Training data for Llama | Yes | Allow if you want Llama inclusion |
| cohere-ai | Cohere | Training data for Cohere models | Yes | Allow for broad inclusion |
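If you want to use the list programmatically, for example to tag requests in your access logs, the tokens can be matched against the User-Agent header. The following is a minimal sketch, not a vendor-supplied detector: the `AI_CRAWLERS` mapping and the `identify_ai_crawler` helper are hypothetical names, the sample header is illustrative rather than a real crawler string, and a substring match cannot tell a genuine crawler from a spoofed one.

```python
from typing import Optional

# The 14 user-agent tokens from the list, mapped to their operators.
# Note: some entries (such as Google-Extended and Applebot-Extended) are
# robots.txt control tokens and may never appear in a request header.
AI_CRAWLERS = {
    "GPTBot": "OpenAI",
    "OAI-SearchBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "ClaudeBot": "Anthropic",
    "anthropic-ai": "Anthropic",
    "Claude-Web": "Anthropic",
    "PerplexityBot": "Perplexity",
    "Perplexity-User": "Perplexity",
    "Google-Extended": "Google",
    "Applebot-Extended": "Apple",
    "CCBot": "Common Crawl",
    "Bytespider": "ByteDance",
    "Meta-ExternalAgent": "Meta",
    "cohere-ai": "Cohere",
}


def identify_ai_crawler(user_agent_header: str) -> Optional[str]:
    """Return the first listed token found in the header, or None."""
    header = user_agent_header.lower()
    # Try longer tokens first so a more specific token is preferred.
    for token in sorted(AI_CRAWLERS, key=len, reverse=True):
        if token.lower() in header:
            return token
    return None


# Illustrative header, not copied from any vendor's documentation.
sample = "Mozilla/5.0 (compatible; GPTBot/1.0)"
token = identify_ai_crawler(sample)
print(token, AI_CRAWLERS.get(token))
```

Matching is case-insensitive because robots.txt product tokens are compared without regard to case; treat a hit as a hint about the sender, not as proof of identity.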

Allow everything (recommended for SaaS + content sites)

Most sites want maximum AI-engine visibility. Use this robots.txt skeleton:

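User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/

User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /api/

User-agent: OAI-SearchBot
Allow: /
Disallow: /admin/
Disallow: /api/

User-agent: ChatGPT-User
Allow: /
Disallow: /admin/
Disallow: /api/

User-agent: ClaudeBot
Allow: /
Disallow: /admin/
Disallow: /api/

User-agent: PerplexityBot
Allow: /
Disallow: /admin/
Disallow: /api/

User-agent: Google-Extended
Allow: /
Disallow: /admin/
Disallow: /api/

User-agent: Applebot-Extended
Allow: /
Disallow: /admin/
Disallow: /api/

Sitemap: https://example.com/sitemap.xml

Why the explicit allow blocks even though * already allows? Signal. Naming each bot tells the operator (and tools like Crawlmind) that you made a deliberate decision; it also lets you add per-bot path exclusions later without rewriting everything. One caveat: a crawler obeys only the most specific group that matches its user-agent, so a named group replaces the * group for that bot rather than adding to it. That is why each named group repeats the /admin/ and /api/ exclusions.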
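One thing worth verifying when you add named groups is how they interact with the * group: a crawler follows only the most specific group that matches it, so rules under * do not carry over to a bot that has its own group. The sketch below checks this with Python's standard `urllib.robotparser` on a deliberately tiny robots.txt. The `can_fetch` wrapper, the `SomeOtherBot` name, and the example.com URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# A deliberately small file: one exclusion under *, plus a named group
# for GPTBot that does NOT repeat the exclusion.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Allow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A bot with no named group falls through to the * group.
print(can_fetch(ROBOTS_TXT, "SomeOtherBot", "https://example.com/admin/"))  # False

# GPTBot matches its own group, so the * exclusion no longer applies to it.
print(can_fetch(ROBOTS_TXT, "GPTBot", "https://example.com/admin/"))  # True
```

If a path must stay off-limits to every bot, repeat its Disallow line inside each named group.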

Block training, allow retrieval

If you want to be cited in answer engines but not in training corpora:

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Disallow: /

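User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: cohere-ai
Disallow: /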

The trade-off: blocking ClaudeBot today blocks both training and retrieval, because Anthropic has not yet split them across separate user-agents. Watch for that to change.
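Because a split policy like this tends to change as vendors add or rename bots, it can help to keep the decisions in one mapping and generate the file from it. The following is a rough sketch under stated assumptions: `POLICY` and `render_robots_txt` are hypothetical names, the mapping covers only some of the bots in the list, and each group gets a single blanket Allow or Disallow with no per-path rules.

```python
from typing import Dict, Optional

# True = allow the whole site, False = disallow the whole site.
# Illustrative subset; extend it with any other tokens you care about.
POLICY = {
    "GPTBot": False,          # training
    "OAI-SearchBot": True,    # retrieval
    "ChatGPT-User": True,     # user-triggered fetch
    "ClaudeBot": False,       # training + retrieval
    "PerplexityBot": True,    # retrieval
    "Google-Extended": False, # training opt-out token
    "CCBot": False,           # training corpus
}


def render_robots_txt(policy: Dict[str, bool], sitemap: Optional[str] = None) -> str:
    """Render one robots.txt group per user-agent, separated by blank lines."""
    groups = []
    for user_agent, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {user_agent}\n{rule}")
    if sitemap:
        groups.append(f"Sitemap: {sitemap}")
    return "\n\n".join(groups) + "\n"


print(render_robots_txt(POLICY, sitemap="https://example.com/sitemap.xml"))
```

Flipping one value in the mapping and regenerating keeps the file and the stated policy from drifting apart.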

Block everything AI

Some publishers (NYT, Reuters, Sony) have chosen to block all AI crawlers pending licensing deals. Use this if that's your stance:

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

Expect a 30–60% drop in AI-engine inbound citations within 90 days. That may be the trade-off you want; make it consciously.

Bots that ignore robots.txt

Most named bots above honor robots.txt. The ones that have been observed violating their stated policy in 2024–2026:

  • Bytespider (ByteDance) — multiple reports of crawls from IPs claiming Bytespider user-agent that did not honor Disallow. Block at the firewall, not just in robots.txt.
  • Unidentified / spoofed AI bots — increasingly common. If you see suspicious crawl patterns from IPs not in any vendor's published range, treat as bot abuse and rate-limit.

The right defense against bots that ignore policy is at the edge, not in robots.txt: Cloudflare's AI-bot block, Fastly's similar feature, or a WAF rule keyed off the user-agent.
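To make the idea of a rule keyed off the user-agent concrete, here is a minimal sketch written as Python WSGI middleware. It is an application-level stand-in for illustration, not Cloudflare, Fastly, or WAF configuration; `BLOCKED_TOKENS`, `block_ai_bots`, and `hello_app` are hypothetical names. Because the User-Agent header is trivially spoofed, a rule like this stops only bots that identify themselves; IP-based checks and rate limits are still needed for the rest.

```python
# Hypothetical block list: lowercase tokens matched as substrings
# of the User-Agent request header.
BLOCKED_TOKENS = ("bytespider",)


def block_ai_bots(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app; answer 403 when the User-Agent contains a blocked token."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            body = b"Forbidden\n"
            start_response(
                "403 Forbidden",
                [
                    ("Content-Type", "text/plain; charset=utf-8"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Everything else passes through to the wrapped application.
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """Stand-in application used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"hello\n"]


application = block_ai_bots(hello_app)
```

The same condition (header contains token, respond 403) is what an edge rule would express; doing it at the edge keeps the blocked traffic off your origin entirely.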

Check yours

Use the free AI crawler access checker: paste your URL and see exactly what each of the 14 bots above sees when reading your robots.txt. It catches implicit fall-through to * (which is fine) and missing explicit blocks (which is a signal-strength issue, not a policy bug).
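For a rough local approximation of this kind of check, the sketch below reports, for each bot, whether a given robots.txt allows it to fetch a URL and whether the file names it explicitly or leaves it to fall through to *. This is not the Crawlmind checker's implementation; `AI_BOTS`, `named_user_agents`, and `crawler_report` are hypothetical names. One caveat: Python's `urllib.robotparser` applies the first matching rule in file order rather than the longest match, so groups that mix Allow and Disallow lines may be evaluated differently than a given crawler would evaluate them.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "Claude-Web", "PerplexityBot", "Perplexity-User", "Google-Extended",
    "Applebot-Extended", "CCBot", "Bytespider", "Meta-ExternalAgent", "cohere-ai",
]


def named_user_agents(robots_txt: str) -> set:
    """Collect every User-agent value in the file, lowercased."""
    names = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            names.add(value.strip().lower())
    return names


def crawler_report(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Map each bot to whether it may fetch url and whether it is named explicitly."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    named = named_user_agents(robots_txt)
    return {
        bot: {
            "allowed": parser.can_fetch(bot, url),
            "explicit": bot.lower() in named,
        }
        for bot in AI_BOTS
    }


SAMPLE = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

for bot, result in crawler_report(SAMPLE).items():
    print(bot, result)
```

To check a live site instead of a string, `RobotFileParser` also offers `set_url()` and `read()` to download the file before calling `can_fetch()`.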
