Block or allow AI crawlers: a decision guide
Crawlmind Engineering · 5 min read
Whether to block or allow an AI crawler is a question with two answers, not one, because training crawlers and search crawlers do different jobs and carry different trade-offs. Block the bots that copy your content into a model's weights with no link back, and allow the bots that fetch your page to answer a live question and cite you. Get that split right and the rest is a matter of business model.
#The two kinds of bot
OpenAI documents three user agents, and the distinction is the whole game. GPTBot "is used to crawl content that may be used in training our generative AI foundation models." OAI-SearchBot "is used to surface websites in search results in ChatGPT's search features," and "sites that are opted out of OAI-SearchBot will not be shown in ChatGPT search answers." ChatGPT-User fires only when a person asks ChatGPT to visit a specific page.
OpenAI is explicit that each setting is independent. You can disallow GPTBot to keep your content out of training while allowing OAI-SearchBot so you still appear in ChatGPT's answers with a citation and a link. Anthropic, Google, and Perplexity draw the same training-versus-retrieval line with separate agents.
So a blanket rule is almost always the wrong rule. "Block everything" deletes you from AI search results. "Allow everything" hands your archive to model training with no attribution and no referral traffic. The useful position is in between, and where you land depends on how your business makes money.
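The independence of the per-agent settings can be checked locally. The sketch below feeds a two-group robots.txt to Python's standard-library `urllib.robotparser` and asks which agent may fetch a page. The file contents, the `allowed` helper, and the example.com URL are illustrative assumptions, and a vendor's own parser may treat edge cases differently from the standard library.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: keep content out of training, stay visible in search.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/some-post"  # hypothetical URL
    print(allowed(ROBOTS_TXT, "GPTBot", page))         # False
    print(allowed(ROBOTS_TXT, "OAI-SearchBot", page))  # True
```

Each `User-agent` group is evaluated on its own, which is why the training bot and the search bot from the same vendor can get opposite answers.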
#What publishers are actually doing
The blocking trend is real and it is accelerating. By December 2025, about 5.6 million websites were blocking GPTBot in robots.txt, up from roughly 3.3 million at the start of July 2025, a near-70 percent jump in five months, according to BuiltWith data cited by The Register. ClaudeBot was blocked at about 5.8 million sites over the same period.
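As a quick check on the "near-70 percent" figure, the growth rate follows directly from the two rounded site counts quoted above. The `percent_change` helper is a hypothetical name used only for this illustration.

```python
def percent_change(start: float, end: float) -> float:
    """Relative change from `start` to `end`, expressed in percent."""
    return (end - start) / start * 100


# Rounded counts quoted above: ~3.3M sites (start of July 2025), ~5.6M (December 2025).
growth = percent_change(3.3e6, 5.6e6)

if __name__ == "__main__":
    print(f"{growth:.1f}%")  # 69.7%
```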
News publishers are the sharpest case. A study of 100 top US and UK news sites, last updated April 8, 2026, found that 79 percent block at least one AI training bot in robots.txt, while 71 percent also block at least one live search or retrieval bot. Common Crawl's CCBot was blocked by 75 percent of those sites and GPTBot by 62 percent. Only 14 percent blocked every AI bot, and 18 percent blocked none.
That spread matters. Even among the publishers most worried about AI, most are not slamming the door entirely. They are sorting bots by purpose. The visibility stakes keep climbing, too: AI search visits grew 42.8 percent year over year, to more than 27.4 billion visits in Q1 2026, per Wix Studio's AI Search Lab. Blocking the search bots opts you out of that traffic.
#A framework by business model
Start from one question: do you sell content, or do you sell something content helps you sell? That answer sets your default.
Publishers and media (content is the product). Your articles are the inventory an AI model would otherwise reproduce for free. Default to blocking training crawlers (GPTBot, CCBot, ClaudeBot, Google-Extended) and to allowing search crawlers (OAI-SearchBot, Claude's search agent, PerplexityBot) so you keep the cited link and the click. If you have a licensing deal or a pay-per-crawl arrangement, follow its terms instead. The whole point is to be paid or credited, not scraped silently.
SaaS and B2B (content is demand generation). Your blog, docs, and comparison pages exist to get found and trusted, not to be sold by the word. Allow both training and search crawlers for marketing content. When an AI assistant answers "best tool for X" and cites you, that is a qualified buyer at the top of your funnel. The asymmetry favors openness: the marginal value of one more training pass on your feature page is low, and the cost of being absent from the answer is a lost lead. Keep the block list for things that should never be in a model anyway: gated assets, customer data, internal portals.
E-commerce and marketplaces (content drives transactions). Product and category pages want to be in AI shopping answers, so allow the search crawlers. Be selective about training crawlers on high-effort assets like original buying guides or proprietary specs, where you would rather be cited than absorbed. User-generated reviews are a judgment call: they are a moat, and a moat is worth protecting from training even while you expose it to retrieval.
Knowledge bases, communities, and proprietary research. Here the content is the asset and the differentiator. Lean toward blocking training crawlers to keep your hard-won material from training a competitor's model, while still allowing search crawlers if discovery brings you sign-ups or members. If your research is the product you sell, treat the training crawlers the way a publisher does.
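The four defaults above can be written down as a small lookup table, which makes the starting position explicit before any exceptions are layered on. This is a minimal sketch: the `CrawlerDefaults` class, the dictionary keys, and the `"selective"` label are hypothetical names chosen for illustration, and the values only paraphrase the framework rather than capture its caveats (licensing deals, gated assets, whether discovery actually brings sign-ups).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerDefaults:
    training: str  # "block", "allow", or "selective"
    search: str    # "block" or "allow"


# Starting defaults paraphrased from the framework; the keys are hypothetical labels.
DEFAULTS = {
    "publisher": CrawlerDefaults(training="block", search="allow"),
    "saas_b2b": CrawlerDefaults(training="allow", search="allow"),
    "ecommerce": CrawlerDefaults(training="selective", search="allow"),
    "knowledge_base": CrawlerDefaults(training="block", search="allow"),
}


def default_policy(business_model: str) -> CrawlerDefaults:
    """Look up the starting policy for a business model, or raise ValueError."""
    try:
        return DEFAULTS[business_model]
    except KeyError:
        raise ValueError(f"unknown business model: {business_model!r}") from None


if __name__ == "__main__":
    print(default_policy("publisher"))
```

Every row allows the search crawlers; the business model mostly changes what happens to the training crawlers.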
#How to actually set it
robots.txt is the front door, and reputable AI companies honor it. List each agent by name and decide per bot rather than per company, because one vendor runs several. A search bot you allow and a training bot you block can belong to the same firm.
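One way to keep the per-bot decisions reviewable is to hold them in a single mapping and render the file from it. The sketch below assumes a publisher-style policy and whole-site rules. The `DECISIONS` mapping and `render_robots_txt` helper are hypothetical, and only agent names that appear in this article are listed, so any other agent you care about would need its documented token added.

```python
# Per-bot decisions for a publisher-style policy: True = allow, False = block.
DECISIONS = {
    "GPTBot": False,           # training
    "CCBot": False,            # training
    "ClaudeBot": False,        # training
    "Google-Extended": False,  # training
    "OAI-SearchBot": True,     # search
    "PerplexityBot": True,     # search
}


def render_robots_txt(decisions: dict[str, bool]) -> str:
    """Render one User-agent group per bot, allowing or disallowing the whole site."""
    groups = []
    for agent, allow in decisions.items():
        rule = "Allow: /" if allow else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}")
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(render_robots_txt(DECISIONS), end="")
```

Generating the file from one mapping means a policy review is a diff of a few lines rather than a reread of the whole robots.txt.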
Two cautions. First, robots.txt is a request, not a wall. The BuzzStream data shows the major commercial crawlers respecting it, but enforcement for bad actors needs server-level or edge controls. Second, ChatGPT-User and similar user-initiated agents act on a person's explicit request, so OpenAI notes that standard crawler rules may not govern them the way they govern autonomous bots. Blocking those can break a feature your own prospects are trying to use.
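For the first caution, a server-level control might look like the WSGI middleware sketched below, which refuses requests whose User-Agent header contains a blocked token. The token list, the `is_blocked` and `block_crawlers` names, and the 403 response are assumptions for illustration. A client that sends a false User-Agent header would pass straight through, so this is only a first layer, and user-initiated agents such as ChatGPT-User are deliberately left off the list.

```python
# Hypothetical block list; matched case-insensitively as substrings of the header.
BLOCKED_AGENT_TOKENS = ("gptbot", "ccbot")


def is_blocked(user_agent: str, tokens=BLOCKED_AGENT_TOKENS) -> bool:
    """Return True if the User-Agent header contains any blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in tokens)


def block_crawlers(app):
    """Wrap a WSGI app so that blocked agents get a 403 instead of the page."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else, including requests with no User-Agent, passes through.
        return app(environ, start_response)
    return middleware
```

Wrapping an existing application is one line, for example `application = block_crawlers(application)`. The same check could equally live in a reverse proxy or CDN rule.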
Whatever you choose, decide it deliberately and review it as the agents change. The wrong move is the unexamined default: allowing everything because you never looked, or blocking everything because one headline scared you. Sort the bots by what they do, match the policy to how you make money, and you will keep the visibility that helps you without giving away the content that defines you.