Glossary
Every AI-discoverability term, defined.
GPTBot. PerplexityBot. llms.txt. GEO. AI citation. Entity clarity. Concise definitions with examples — and a one-click link to a longer guide where one exists.
AI crawlers
Amazonbot
Amazonbot is Amazon's crawler. It feeds Alexa answers and Amazon's newer Rufus shopping assistant. Allowing Amazonbot is the path to inclusion in Alexa and Rufus answer surfaces.
anthropic-ai
2 guides · 1 report
anthropic-ai is the legacy user-agent name for Anthropic's web crawler. Anthropic has since standardised on ClaudeBot, but anthropic-ai directives in robots.txt are still respected for backwards compatibility. Apply the same allow or block rule to both names so your policy is unambiguous.
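A minimal Python sketch of applying one rule to both names: it renders a single robots.txt group that lists both tokens, then checks the result with the standard library's `urllib.robotparser`. The `render_group` helper and the example.com URL are illustrative assumptions, not part of any Anthropic tooling.

```python
from urllib.robotparser import RobotFileParser


def render_group(agents: list[str], disallow: str) -> str:
    """Render one robots.txt group that applies a single rule to several agents."""
    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append(f"Disallow: {disallow}")
    return "\n".join(lines) + "\n"


# Hypothetical policy: block both the current and the legacy Anthropic token.
robots_txt = render_group(["ClaudeBot", "anthropic-ai"], "/")

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# True means the agent is blocked from the sample URL.
blocked = {
    agent: not parser.can_fetch(agent, "https://example.com/guide")
    for agent in ("ClaudeBot", "anthropic-ai", "Googlebot")
}
print(robots_txt)
print(blocked)
```

Because both tokens share one group, they cannot drift apart when the rule is edited later; agents that are not named, such as Googlebot here, are unaffected.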
Applebot-Extended
1 guide · 2 reports
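Applebot-Extended is Apple's robots.txt token for opting out of Apple Intelligence training. It is separate from Applebot, which crawls for Siri, Spotlight, and web search; per Apple's documentation, Applebot-Extended does not crawl pages itself but controls how content fetched by Applebot may be used. Disallowing Applebot-Extended opts you out of Apple Intelligence training without affecting Siri or Spotlight indexing.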
Bytespider
1 guide · 1 report
Bytespider is ByteDance's crawler, used to feed training data and live retrieval for Doubao and TikTok Search. It's known for aggressive crawl rates, and many sites block it by default for bandwidth reasons.
CCBot
1 guide · 1 report
CCBot is the crawler for Common Crawl, a nonprofit that publishes an open web archive used in the training corpora of many AI models, including smaller engines and academic models. Blocking CCBot reduces (but doesn't eliminate) inclusion in derivative AI training datasets.
ChatGPT-User
3 guides · 1 report
ChatGPT-User is the user-agent OpenAI uses when a ChatGPT user explicitly asks the assistant to fetch a URL. Unlike GPTBot (training) and OAI-SearchBot (search index), ChatGPT-User fires only on per-conversation user action.
ClaudeBot
2 guides · 2 reports
ClaudeBot is Anthropic's web crawler, used for training Claude and for live retrieval in Claude's web-search feature. It replaced the older `anthropic-ai` user-agent, which is still respected for backwards compatibility.
Google-Extended
1 guide · 2 reports
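Google-Extended is Google's robots.txt control token for AI training. It is separate from Googlebot, which still controls classic search indexing, and it has no crawler user-agent string of its own: Google's existing crawlers fetch the page, and Google-Extended governs whether that content may be used for Gemini training and grounding. Blocking Google-Extended does not affect your standard Google ranking. Per Google's documentation it also does not remove pages from AI Overviews, which are a Search feature governed by Googlebot and snippet controls.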
GPTBot
3 guides · 1 report
GPTBot is OpenAI's primary web crawler. It fetches publicly available pages to gather data for training future GPT models and, in some configurations, to ground answers in ChatGPT. Sites can allow or disallow it via robots.txt with a `User-agent: GPTBot` directive.
Meta-ExternalAgent
1 guide · 1 report
Meta-ExternalAgent is Meta's newer web crawler for Llama training and Meta AI features. Distinct from Meta's older `facebookexternalhit` (which fetches link previews for Facebook/Instagram), Meta-ExternalAgent is specifically for AI corpus collection.
OAI-SearchBot
3 guides · 2 reports
OAI-SearchBot is OpenAI's crawler for live document retrieval in ChatGPT search. Distinct from GPTBot (training) and ChatGPT-User (user-triggered fetches), OAI-SearchBot is what indexes content for inclusion in ChatGPT's search-answer surface.
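The three OpenAI agents can be told apart in server logs by the token in the `User-Agent` header. The sketch below maps each token to the purpose given in this glossary; the `classify_openai_agent` helper and the sample header strings are simplified, made-up illustrations rather than OpenAI's exact published values.

```python
from typing import Optional

# Purposes as described in this glossary; the lookup table is an illustrative assumption.
OPENAI_AGENT_PURPOSE = {
    "GPTBot": "training",
    "OAI-SearchBot": "search index",
    "ChatGPT-User": "user-triggered fetch",
}


def classify_openai_agent(user_agent_header: str) -> Optional[str]:
    """Return the purpose of an OpenAI agent, or None for any other client."""
    header = user_agent_header.lower()
    for token, purpose in OPENAI_AGENT_PURPOSE.items():
        if token.lower() in header:
            return purpose
    return None


# Simplified sample headers; real headers carry more detail.
samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "Mozilla/5.0 (compatible; Googlebot/2.1)",
]
purposes = [classify_openai_agent(sample) for sample in samples]
print(purposes)
```

Header matching is only a first pass, since any client can send an arbitrary `User-Agent` value.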
PerplexityBot
3 guides · 2 reports
PerplexityBot is Perplexity's web crawler. Perplexity shows citations next to every answer, so allowing PerplexityBot is the prerequisite for being cited there and for receiving referral traffic from Perplexity. Perplexity states that it does not train on crawled content: PerplexityBot is for retrieval only.
Standards
BreadcrumbList
1 guide · 1 report
`BreadcrumbList` is a Schema.org type, usually embedded as JSON-LD, that declares the navigational path from the site root to the current page. Google uses it for the breadcrumb display in SERPs; AI engines use it to understand site hierarchy and topic clustering.
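As a rough illustration, the Python sketch below builds `BreadcrumbList` markup from an ordered list of (name, URL) pairs. The `breadcrumb_jsonld` helper and the example.com paths are hypothetical; the property names (`itemListElement`, `ListItem`, `position`, `name`, `item`) come from the Schema.org vocabulary.

```python
import json


def breadcrumb_jsonld(trail: list[tuple[str, str]]) -> dict:
    """Build a BreadcrumbList from (name, url) pairs ordered root-first."""
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            # Positions are 1-based and follow the order of the trail.
            {"@type": "ListItem", "position": position, "name": name, "item": url}
            for position, (name, url) in enumerate(trail, start=1)
        ],
    }


# Hypothetical site path: Home > Glossary > GPTBot
markup = breadcrumb_jsonld([
    ("Home", "https://example.com/"),
    ("Glossary", "https://example.com/glossary"),
    ("GPTBot", "https://example.com/glossary/gptbot"),
])
print(json.dumps(markup, indent=2))
```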
FAQPage
4 guides · 1 report
`FAQPage` is a Schema.org type, usually embedded as JSON-LD, that marks a page's question-and-answer pairs as structured Q&A. AI engines can lift `acceptedAnswer.text` almost verbatim into answer snippets, which makes FAQPage the highest-leverage schema type to ship after `Organization`.
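To show where `acceptedAnswer.text` sits in the markup, here is a small Python sketch that turns question-and-answer pairs into `FAQPage` markup. The `faq_jsonld` helper is hypothetical, and the sample question reuses a definition from this glossary.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> dict:
    """Build FAQPage markup from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                # The answer text is what an engine would quote.
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


faq = faq_jsonld([
    ("What is GPTBot?", "GPTBot is OpenAI's primary web crawler."),
])
print(json.dumps(faq, indent=2))
```

Keeping each answer self-contained matters here, because the `text` value may be quoted without the surrounding page.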
llms-full.txt
1 guide · 1 report
`/llms-full.txt` is the optional companion to `/llms.txt`, the file proposed at llmstxt.org. Where `llms.txt` is a curated *index* (links plus one-line summaries), `llms-full.txt` contains the *full content* of those linked pages, concatenated into one Markdown file. Sites shipping both may see higher AI citation rates.
llms.txt
3 guides · 1 report
llms.txt is a plain-text Markdown file served at the root of a website (`/llms.txt`) that gives AI engines a curated, machine-readable index of the most important pages. Like robots.txt it lives at a well-known root path, but it is written in Markdown and is a discovery hint rather than an access policy.
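A minimal sketch of the file's shape, assuming the layout described at llmstxt.org: an H1 title, a blockquote summary, then H2 sections containing Markdown link lists. The `render_llms_txt` helper, the site name, and the URLs are all hypothetical.

```python
def render_llms_txt(
    title: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render an llms.txt index: H1 title, blockquote summary, H2 link lists."""
    lines = [f"# {title}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}", ""]
        # Each entry is a Markdown link followed by a one-line note.
        lines += [f"- [{name}]({url}): {note}" for name, url, note in links]
    return "\n".join(lines) + "\n"


llms_txt = render_llms_txt(
    title="Example Site",
    summary="Guides and a glossary on AI discoverability.",
    sections={
        "Guides": [
            ("robots.txt guide", "https://example.com/guides/robots-txt",
             "Allowing or blocking AI crawlers"),
            ("llms.txt guide", "https://example.com/guides/llms-txt",
             "Writing a curated index"),
        ],
    },
)
print(llms_txt)
```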
robots.txt
6 guides · 2 reports
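`robots.txt` is a plain-text file at the site root that tells web crawlers which paths they may or may not fetch. It is the canonical place to allow or disallow specific AI crawlers like GPTBot, ClaudeBot, and PerplexityBot. Crawlers honor it on a per-User-agent basis: a crawler follows the group that names it, falling back to the `*` group otherwise, and within a group the most specific matching path rule wins under RFC 9309. Some older parsers apply rules in file order instead, so list specific rules first.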
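The per-User-agent behaviour can be checked locally with Python's standard `urllib.robotparser`. The sketch below uses a hypothetical policy for example.com: GPTBot may fetch only `/docs/`, while every other crawler is kept out of `/admin/` only. The standard-library parser applies a group's rules in file order, so the narrower `Allow` is listed before the broad `Disallow`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy used only for illustration.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /docs/
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    ("GPTBot", "/docs/intro"),          # matches the GPTBot group's Allow rule
    ("GPTBot", "/pricing"),             # falls through to the GPTBot Disallow
    ("PerplexityBot", "/pricing"),      # no named group, so the * group applies
    ("PerplexityBot", "/admin/users"),  # blocked by the * group
]
results = {
    (agent, path): parser.can_fetch(agent, f"https://example.com{path}")
    for agent, path in checks
}
for (agent, path), allowed in results.items():
    print(f"{agent:15} {path:15} {'allow' if allowed else 'block'}")
```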
GEO concepts
AI citation
2 guides · 2 reports
An AI citation is a named reference an AI answer engine attaches to a sentence or paragraph in its response, identifying the source it used to ground that claim. Perplexity, ChatGPT search, Claude, and Microsoft Copilot (formerly Bing Chat) all show citations; being cited is the AI-search equivalent of ranking on Google.
AI Overviews
3 guides · 1 report
AI Overviews is Google Search's generative-answer surface — a multi-paragraph answer composed by Gemini at the top of search results, with citations to the underlying source pages. To appear as a cited source, pages need clean structured data, an unambiguous canonical URL, and atomic-answer paragraphs near the top of the body.
Atomic answer
An atomic answer is a single self-contained paragraph that directly answers the implicit question of a page. AI engines extract atomic answers as citation snippets — placing one in the first paragraph of every guide is the single highest-leverage GEO tactic.
Citation readiness
1 guide
Citation readiness is the property of a page that makes AI answer engines cite it when grounding answers. It combines retrievability (can the engine fetch and chunk the page?), authority (is the domain trusted?), and specificity (does the page contain the atomic fact the answer needs?).
Entity clarity
2 guides
Entity clarity is the degree to which a website unambiguously identifies the single organization, product, or person it represents. Strong entity clarity (consistent Organization schema, named publisher, single canonical About page) helps AI engines bind a query to the right entity in their knowledge graph.
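One rough, local proxy for "consistent Organization schema" is to check that every page declaring an `Organization` uses the same name and URL. The sketch below is an illustrative assumption about how a site owner might audit this, not a description of how any AI engine scores entity clarity; the helper names and sample data are hypothetical.

```python
def organization_identities(pages: dict[str, dict]) -> set[tuple[str, str]]:
    """Collect the distinct (name, url) pairs declared by Organization markup."""
    return {
        (data.get("name", ""), data.get("url", ""))
        for data in pages.values()
        if data.get("@type") == "Organization"
    }


def is_consistent(pages: dict[str, dict]) -> bool:
    """True when all Organization markup on the site names one identity."""
    return len(organization_identities(pages)) == 1


# Hypothetical JSON-LD already parsed from two pages of the same site.
consistent_site = {
    "/": {"@type": "Organization", "name": "Example Co", "url": "https://example.com"},
    "/about": {"@type": "Organization", "name": "Example Co", "url": "https://example.com"},
}
drifting_site = {
    "/": {"@type": "Organization", "name": "Example Co", "url": "https://example.com"},
    "/about": {"@type": "Organization", "name": "Example Company Inc.", "url": "https://example.com"},
}
print(is_consistent(consistent_site), is_consistent(drifting_site))
```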
GEO
1 guide · 2 reports
Generative Engine Optimization
GEO (Generative Engine Optimization) is the practice of structuring a website so that AI answer engines — ChatGPT search, Perplexity, Claude, Google AI Overviews — retrieve and cite its content when answering user questions. GEO is to AI answer engines what SEO is to Google's blue-link results.
SEO concepts
JSON-LD
1 guide · 1 report
JSON-LD (JavaScript Object Notation for Linked Data) is the machine-readable format used to embed structured data (schema.org markup) in HTML pages. It's the format Google, AI engines, and crawlers all prefer over the older Microdata and RDFa syntaxes.
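In HTML, JSON-LD lives inside a `<script type="application/ld+json">` element. The sketch below uses only Python's standard library to pull those blocks out of a page, roughly as a crawler might; the `JsonLdExtractor` class and the sample HTML are hypothetical, and the sketch assumes each block holds valid JSON.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


HTML = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization", "name": "Example Co"}
</script>
</head><body><p>Hello</p></body></html>"""

extractor = JsonLdExtractor()
extractor.feed(HTML)
extractor.close()
print(extractor.blocks)
```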
Schema markup
2 guides
Schema markup is structured data added to a webpage to describe its content in a machine-readable way, using the vocabulary defined at schema.org. AI engines and Google use schema to extract clean facts, populate rich results, and ground citations.