Glossary

Every AI-discoverability term, defined.

GPTBot. PerplexityBot. llms.txt. GEO. AI citation. Entity clarity. Concise definitions with examples, and a one-click link to a longer guide where one exists.

AI crawlers

Amazonbot
Amazonbot is Amazon's web crawler. It gathers content used to answer questions in Alexa and in Amazon's newer Rufus shopping assistant, so allowing it is a prerequisite for appearing in those answer surfaces.
anthropic-ai
2 guides · 1 report
anthropic-ai is the legacy user-agent name for Anthropic's web crawler. Anthropic has since standardised on ClaudeBot, but anthropic-ai directives in robots.txt are still respected for backwards compatibility. To avoid ambiguity, give both names the same rule: allow both or block both.
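One way to confirm that both Anthropic tokens receive the same treatment is to parse a robots.txt with Python's standard-library `urllib.robotparser`. This is a minimal sketch: the `SAMPLE` rules, the `example.com` URL, and the `anthropic_posture` helper are illustrative assumptions, and the standard-library matcher only approximates how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

ANTHROPIC_TOKENS = ("ClaudeBot", "anthropic-ai")

# Hypothetical robots.txt that blocks both tokens in one group.
SAMPLE = """\
User-agent: ClaudeBot
User-agent: anthropic-ai
Disallow: /
"""


def anthropic_posture(robots_txt: str, url: str = "https://example.com/") -> str:
    """Return "allow", "block", or "mixed" for the two Anthropic tokens."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    verdicts = {parser.can_fetch(token, url) for token in ANTHROPIC_TOKENS}
    if len(verdicts) > 1:
        return "mixed"  # one token is allowed while the other is blocked
    return "allow" if verdicts.pop() else "block"


print(anthropic_posture(SAMPLE))  # → block
```

A "mixed" result means only one of the two names has a rule, which is the situation the entry above advises against.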
Applebot-Extended
1 guide · 2 reports
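Applebot-Extended is Apple's robots.txt token for controlling whether content crawled by Applebot may be used to train Apple's foundation models, including those behind Apple Intelligence. It does not crawl pages itself: Applebot does the crawling for Siri, Spotlight, and web search. Disallowing Applebot-Extended opts you out of that training use without affecting Siri or Spotlight indexing.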
Bytespider
1 guide · 1 report
Bytespider is ByteDance's crawler, used to feed training data and live retrieval for Doubao and TikTok Search. It's known for aggressive crawl rates, and many sites block it by default for bandwidth reasons.
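Before blocking a crawler for bandwidth reasons, it can help to measure how often it actually visits. The sketch below counts requests per crawler token from a list of user-agent strings. The token list, the `count_crawler_hits` helper, and the sample strings are illustrative assumptions, not real log data or exact user-agent formats.

```python
from collections import Counter

# Crawler tokens taken from this glossary. Matching is a case-insensitive
# substring test, which is a simplification.
CRAWLER_TOKENS = ("Bytespider", "GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")


def count_crawler_hits(user_agents):
    """Count requests per known crawler token in an iterable of UA strings."""
    hits = Counter()
    for ua in user_agents:
        lowered = ua.lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # attribute each request to at most one crawler
    return hits


# Made-up user-agent strings standing in for parsed access-log lines.
sample = [
    "Mozilla/5.0 (compatible; Bytespider; example)",
    "Mozilla/5.0 (compatible; Bytespider; example)",
    "Mozilla/5.0 (compatible; GPTBot/1.0; example)",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
]
print(count_crawler_hits(sample).most_common())
# → [('Bytespider', 2), ('GPTBot', 1)]
```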
CCBot
1 guide · 1 report
CCBot is the crawler for Common Crawl, a nonprofit that publishes an open archive of web crawl data. That archive feeds the training corpora of many AI models, including academic models and smaller AI engines. Blocking CCBot reduces (but doesn't eliminate) inclusion in derivative AI training datasets.
ChatGPT-User
3 guides · 1 report
ChatGPT-User is the user-agent OpenAI uses when a ChatGPT user explicitly asks the assistant to fetch a URL. Unlike GPTBot (training) and OAI-SearchBot (search index), ChatGPT-User does not crawl automatically: it sends a request only in response to a user's action in a conversation.
ClaudeBot
2 guides · 2 reports
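ClaudeBot is Anthropic's web crawler, which collects public web content that may be used to train Claude. Anthropic's crawler documentation lists separate agents, Claude-User and Claude-SearchBot, for user-triggered fetches and search. ClaudeBot replaced the older `anthropic-ai` user-agent, which is still respected for backwards compatibility.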
Google-Extended
1 guide · 2 reports
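Google-Extended is a robots.txt token that controls whether content Google crawls may be used to train Gemini models and to ground Gemini answers. It is separate from Googlebot, which still controls classic search indexing, and it has no user-agent string of its own: Google's existing crawlers fetch the pages. Blocking Google-Extended opts you out of those Gemini uses without affecting your standard Google ranking. It does not remove you from AI Overviews, which are part of Search and follow Googlebot rules.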
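The separation between the two tokens can be checked with Python's standard-library `urllib.robotparser`. This is a sketch under assumptions: the rules and the URL below are hypothetical, and the standard-library parser only approximates Google's own robots.txt handling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out via Google-Extended, leave Googlebot alone.
RULES = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

url = "https://example.com/pricing"
extended_ok = parser.can_fetch("Google-Extended", url)
googlebot_ok = parser.can_fetch("Googlebot", url)

print(f"Google-Extended allowed: {extended_ok}")  # → Google-Extended allowed: False
print(f"Googlebot allowed: {googlebot_ok}")       # → Googlebot allowed: True
```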
GPTBot
3 guides · 1 report
GPTBot is OpenAI's primary web crawler. It fetches publicly available pages to gather data for training OpenAI's models. Search indexing and user-triggered fetches are handled by OAI-SearchBot and ChatGPT-User respectively. Sites can allow or disallow GPTBot in robots.txt with a `User-agent: GPTBot` directive.
Meta-ExternalAgent
1 guide · 1 report
Meta-ExternalAgent is Meta's newer web crawler for Llama training and Meta AI features. Distinct from Meta's older `facebookexternalhit` (which fetches link previews for Facebook/Instagram), Meta-ExternalAgent is specifically for AI corpus collection.
OAI-SearchBot
3 guides · 2 reports
OAI-SearchBot is OpenAI's crawler for live document retrieval in ChatGPT search. Distinct from GPTBot (training) and ChatGPT-User (user-triggered fetches), OAI-SearchBot is what indexes content for inclusion in ChatGPT's search-answer surface.
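Because the three OpenAI agents serve different purposes, a site can set a separate rule for each. The sketch below renders a robots.txt fragment from a small policy table. The `POLICY` values (block training, allow search and user-triggered fetches) and the `render_robots` helper are illustrative assumptions, not a recommended configuration.

```python
# Hypothetical per-agent policy: True means allow, False means block.
POLICY = {
    "GPTBot": False,        # training crawler
    "OAI-SearchBot": True,  # ChatGPT search index
    "ChatGPT-User": True,   # user-triggered fetches
}


def render_robots(policy: dict[str, bool]) -> str:
    """Render one robots.txt group per agent, separated by blank lines."""
    groups = []
    for agent, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}")
    return "\n\n".join(groups) + "\n"


print(render_robots(POLICY))
```

Keeping the policy in one table makes it easy to see at a glance which purposes a site has opted into.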
PerplexityBot
3 guides · 2 reports
PerplexityBot is Perplexity's web crawler. Perplexity shows source citations alongside its answers, so allowing PerplexityBot is a prerequisite for being cited and for receiving referral traffic from Perplexity. According to Perplexity, PerplexityBot is not used to collect training data: it indexes pages for retrieval only.

