AI crawlers in 2026 and what each one fetches
Crawlmind Engineering · 5 min read
AI crawlers are the automated agents that AI companies send to fetch your pages, and in 2026 they do three different jobs: train a model, build a search index, or fetch one page on demand for a user. Treating them as a single swarm of "bots" is the mistake that pushes sites to one of two bad extremes. Block everything and you vanish from AI answers. Allow everything and you feed model training that may never send a visitor back. The labels matter because each crawler is controllable on its own, and the right policy is rarely all-or-nothing.
#Three jobs, not one bot
Start with the function, not the brand. Every major AI crawler falls into one of three categories, and the category tells you what blocking it actually costs you.
Training crawlers fetch your content to include it in a model's training data. Once your page is in the training set, the model "knows" it without fetching again, and you have no ongoing visibility into whether it gets used. GPTBot and ClaudeBot are the headline examples.
Retrieval and search crawlers build the index that answer engines query at response time. These are the ones that produce citations. If your page is not in the retrieval index, it cannot be quoted or linked when a user asks a relevant question. OAI-SearchBot, Claude-SearchBot, and PerplexityBot live here.
On-demand user fetchers retrieve a single page when a person inside a chat asks the assistant to read a specific URL. They respond to a direct user action rather than crawling on a schedule. ChatGPT-User, Claude-User, and Perplexity-User do this.
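To make the three categories concrete, here is a minimal Python sketch that maps the tokens named above to their job and classifies a raw User-Agent header by substring match. The `CRAWLER_CATEGORIES` table and the `classify` helper are illustrative names, not part of any vendor's tooling, and the sample header strings are hypothetical. User-agent strings can be spoofed, so treat the result as a label for analysis rather than proof of identity.

```python
# Token-to-category table built from the crawlers named in this article.
CRAWLER_CATEGORIES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    "ChatGPT-User": "user-fetch",
    "Claude-User": "user-fetch",
    "Perplexity-User": "user-fetch",
}


def classify(user_agent: str) -> str:
    """Return the category of the first known token found in the header."""
    ua = user_agent.lower()
    for token, category in CRAWLER_CATEGORIES.items():
        # Case-insensitive substring match on the published token.
        if token.lower() in ua:
            return category
    return "unknown"


# Hypothetical header values, for illustration only.
samples = [
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; Perplexity-User/1.0)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0",
]
labels = [classify(s) for s in samples]
```

Running this over an access log gives a first view of which job is responsible for most of the AI traffic a site receives.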
#OpenAI runs four agents, each separately controllable
OpenAI documents its crawlers publicly, and the descriptions are precise. GPTBot "is used to crawl content that may be used in training our generative AI foundation models." OAI-SearchBot "is used to surface websites in search results in ChatGPT's search features." ChatGPT-User handles "certain user actions in ChatGPT and Custom GPTs," visiting a page when a user asks a question that needs it. A fourth agent, OAI-AdsBot, validates the safety of pages submitted as ads, and its data is not used for training (OpenAI).
The practical consequence: you can let OpenAI's search index include you while keeping your content out of training. Those are separate user-agent tokens in robots.txt, so allowing OAI-SearchBot and disallowing GPTBot is a supported, intentional configuration. One caveat worth reading carefully: OpenAI notes that ChatGPT-User is user-initiated, and robots.txt rules may not apply to it the way they apply to automated crawling (OpenAI).
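A minimal sketch of that configuration, checked with Python's standard `urllib.robotparser`. The robots.txt body and the `example.com` URL are assumptions for illustration, and the standard-library parser only approximates how any given vendor interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: allow the search crawler, block the training crawler.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""


def allowed(robots_txt: str, token: str, url: str = "https://example.com/post") -> bool:
    """Parse robots_txt and report whether `token` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


search_ok = allowed(ROBOTS_TXT, "OAI-SearchBot")
training_ok = allowed(ROBOTS_TXT, "GPTBot")
```

Here `search_ok` comes out true and `training_ok` false. ChatGPT-User is deliberately left out of the check, since the caveat above means a robots.txt result may not predict its behavior.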
#Anthropic runs three, and robots.txt is the only lever
Anthropic splits its crawling the same way. ClaudeBot collects content for training, Claude-SearchBot powers Claude's search and retrieval, and Claude-User fetches a page when a person asks Claude to read a link. All three respect robots.txt, and Anthropic also honors the non-standard Crawl-delay directive (Anthropic).
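As a sketch of how per-agent rules and Crawl-delay combine, the snippet below parses an illustrative robots.txt with Python's `urllib.robotparser`. The five-second delay and the choice to block ClaudeBot are assumptions made for the example, not recommendations from Anthropic.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: no training crawl, throttled search crawl, open user fetches.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Crawl-delay: 5
Allow: /

User-agent: Claude-User
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/post"
training_ok = parser.can_fetch("ClaudeBot", url)
search_ok = parser.can_fetch("Claude-SearchBot", url)
user_ok = parser.can_fetch("Claude-User", url)

# crawl_delay returns the delay in seconds for the matching group,
# or None when that group sets no Crawl-delay.
search_delay = parser.crawl_delay("Claude-SearchBot")
training_delay = parser.crawl_delay("ClaudeBot")
```

The parser reports what the file says. Whether a crawler actually waits between requests is up to the crawler.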
One detail changes how you enforce a policy against Claude bots. Anthropic does not publish fixed IP ranges for these agents, so you cannot reliably allow-list or block them by IP. Robots.txt is the official control mechanism (Anthropic). If your plan was to verify Claude traffic by source IP, it will not hold up. Match on the user-agent token and trust the published directives instead.
#Perplexity and Google round out the list
Perplexity uses PerplexityBot to index pages for citation and Perplexity-User to fetch a page live when answering a specific question. The same allow-search, decide-on-training logic applies.
Google is the odd one out because it does not run a separate training crawler at all. Instead, Google-Extended is a robots.txt token that controls whether content Google already crawls can be used to train Gemini and the Vertex AI generative APIs, and whether it is used for grounding. It is not a crawler with its own fetch behavior, and Google confirmed in April 2025 that Google-Extended is not a ranking signal and does not affect whether your site appears in Search (Google). Blocking it opts your content out of Gemini training and grounding without touching your Search rankings.
#The asymmetry that should drive your policy
Here is why the training-versus-retrieval distinction is not academic. Cloudflare tracked how many times each AI platform crawled sites for every visitor it referred back, and the gap is enormous. Anthropic's crawl-to-referral ratio ran at 286,930 to 1 in January 2025, falling to 38,065 to 1 by July, still by far the most crawl-heavy platform. OpenAI sat near 1,200 to 1 across the same window. Perplexity moved the other way, from 54 to 1 up to 194 to 1 (Cloudflare).
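To put the movement in those figures on one scale, the short sketch below computes the fold change for the two platforms that have both a January and a July value. The numbers are the ones quoted above. The `fold_change` helper is an illustrative name.

```python
# (January 2025, July 2025) crawls per referred visitor, as quoted above.
RATIOS = {
    "Anthropic": (286_930, 38_065),
    "Perplexity": (54, 194),
}


def fold_change(start: float, end: float) -> float:
    """Ratio of the later value to the earlier one (1.0 means no change)."""
    return end / start


changes = {name: fold_change(jan, jul) for name, (jan, jul) in RATIOS.items()}

for name, change in changes.items():
    # Values below 1 mean fewer crawls per referral than in January.
    print(f"{name}: {change:.2f}x the January ratio")
```

Anthropic's ratio fell to roughly 0.13 times its January level, while Perplexity's rose to roughly 3.6 times. Even after that drop, Anthropic's July ratio is still about 196 times Perplexity's.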
Read those numbers as a cost-benefit signal. The ratios are measured per platform, not per crawler, so they do not isolate training traffic exactly. The working inference is that the heaviest fetching relative to traffic returned comes from the training agents, while the crawlers that actually correlate with citations and clicks are the search and retrieval ones. So the defensible default for most publishers has two parts. Allow the retrieval and on-demand agents (OAI-SearchBot, Claude-SearchBot, PerplexityBot, and the user-fetch agents), because those are what put you in answers. Then make a deliberate decision about the training controls (GPTBot, ClaudeBot, and the Google-Extended token) based on whether feeding model training fits your business.
#What to do with this
Open your robots.txt and treat each token by name, not as a category. Allow the search and retrieval bots if you want to be cited. Decide on training bots on their own merits. Remember that user-fetch agents may not follow robots.txt the same way automated crawlers do, so do not assume a disallow line fully stops them. And do not lean on IP allow-listing for vendors that publish no IP ranges.
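One way to keep that per-token policy explicit is to generate robots.txt from a small table and then check the output. The sketch below does this in Python. The `POLICY` values are one possible choice (search and user fetches allowed, training blocked), not a recommendation, and `render_robots_txt` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# True = allow the whole site, False = disallow it. Example choices only.
POLICY = {
    "OAI-SearchBot": True,
    "Claude-SearchBot": True,
    "PerplexityBot": True,
    "ChatGPT-User": True,
    "Claude-User": True,
    "Perplexity-User": True,
    "GPTBot": False,
    "ClaudeBot": False,
    "Google-Extended": False,
}


def render_robots_txt(policy: dict[str, bool]) -> str:
    """Emit one User-agent group per token, separated by blank lines."""
    groups = []
    for token, allow in policy.items():
        rule = "Allow: /" if allow else "Disallow: /"
        groups.append(f"User-agent: {token}\n{rule}")
    return "\n\n".join(groups) + "\n"


robots_txt = render_robots_txt(POLICY)

# Round-trip check: parse the generated file and confirm each token's result.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
decisions = {
    token: parser.can_fetch(token, "https://example.com/") for token in POLICY
}
```

If `decisions` matches `POLICY`, the file says what you intended. It still cannot guarantee the behavior of user-fetch agents, for the reason given above.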
The reason this granularity exists is that the AI vendors built it on purpose. They expose separate tokens so you can say yes to citations and no to training, or the reverse. The sites that win in AI answers are the ones that make that choice deliberately, crawler by crawler, instead of reaching for a single allow-all or block-all line and hoping.