How Crawlmind scores your site

The 7 sub-scores

Dimension	What it measures
Technical SEO	Title, meta, H1, canonical, redirects, sitemap, robots, broken links, mobile viewport, lang
Structured data	JSON-LD detection, type coverage, validation against Schema.org + Google Rich Results requirements
AI crawler access	Explicit allow/disallow policy for 12+ AI bots (GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended, Applebot-Extended, CCBot, …)
LLM readability	Atomic-answer presence, jargon density, grade level, paragraph length, sentence variety
Entity clarity	Whether the homepage + About page describe a single unambiguous entity (LLM-scored)
Citation readiness	Per-page likelihood of being cited by an AI answer engine (facts + sources + freshness)
Performance (basic)	TTFB, response time, server compression, basic Core Web Vitals proxies

The overall score is a weighted blend, with Technical SEO and AI crawler access carrying the most weight by default. To override the weights, see /docs/scoring/custom-weights (Agency+).
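As a rough illustration of what a weighted blend means here, the sketch below averages the seven sub-scores by weight. The weight values and the snake_case keys are assumptions made up for the example; this page does not publish the real defaults, only that Technical SEO and AI crawler access weigh the most.

```python
# Hypothetical weights for illustration only; the real defaults are not
# documented on this page. The two heaviest dimensions match the text above.
DEFAULT_WEIGHTS = {
    "technical_seo": 0.25,
    "ai_crawler_access": 0.25,
    "structured_data": 0.10,
    "llm_readability": 0.10,
    "entity_clarity": 0.10,
    "citation_readiness": 0.10,
    "performance": 0.10,
}


def overall_score(sub_scores, weights=DEFAULT_WEIGHTS):
    """Weighted average of 0-100 sub-scores, normalised by the total weight."""
    total_weight = sum(weights[name] for name in sub_scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    blended = sum(sub_scores[name] * weights[name] for name in sub_scores)
    return round(blended / total_weight, 1)
```

Normalising by the total weight means a custom weight set does not have to sum to exactly 1.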

Indexability integrity (Phase 7a)

Crawlmind verifies that the signals search engines and AI engines use to decide *what to index* are mutually consistent. Each rule fires per-page (or as a cross-page aggregate) when it detects a misconfiguration.

page.indexability.x-robots-noindex: X-Robots-Tag: noindex header on the response (easy to miss because it's not in the HTML)
page.robots-meta.noindex: meta robots noindex inside the page
site.canonical.target-noindex: a page canonicalises *to* another page that has noindex (contradictory signals)
site.canonical.target-non200: canonical points at a 4xx / 5xx URL (broken consolidation)
site.canonical.chain: A→B canonical and B→C canonical (Google picks the first hop and may pick wrong)
site.soft-404: HTTP 200 + 'page not found' body markers (the worst kind of stranded crawl budget)
page.links.internal-nofollow: rel=nofollow on internal links (silent PageRank leak, usually a plugin misconfig)
page.links.excessive: > 100 links per page (PageRank dilution)
site.orphan-pages: page exists in the crawl set but no other page links to it
site.thin-internal-links: page has only 1–2 inbound internal links

All of these are TECHNICAL_SEO. Severities range from CRITICAL (noindex on the root URL) down to LOW (excessive link count).
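A minimal sketch of how the three cross-page canonical rules could be evaluated over a crawl set. It assumes a hypothetical in-memory map of URL to page record; the `status`, `noindex`, and `canonical` field names are illustrative, not Crawlmind's actual schema.

```python
def canonical_issues(pages):
    """pages maps URL -> {"status": int, "noindex": bool, "canonical": str or None}.

    Returns a list of (url, rule_id) tuples. Field names are assumptions.
    """
    issues = []
    for url, page in pages.items():
        target_url = page.get("canonical")
        if not target_url or target_url == url:
            continue  # no canonical, or self-referencing: nothing to check
        target = pages.get(target_url)
        if target is None:
            continue  # target was not crawled, so it cannot be judged here
        if target["status"] >= 400:
            issues.append((url, "site.canonical.target-non200"))
        elif target["noindex"]:
            issues.append((url, "site.canonical.target-noindex"))
        next_hop = target.get("canonical")
        if next_hop and next_hop != target_url:
            # A -> B and B -> C: the canonical does not settle in one hop
            issues.append((url, "site.canonical.chain"))
    return issues
```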

Duplicate content + page-level uniqueness

Distinct pages should send distinct signals. Crawlmind catches the three common duplicate-signal classes that demote sites at scale:

site.duplicate-titles: same <title> on multiple pages
site.duplicate-meta-descriptions: same meta description across pages
site.duplicate-h1s: same primary H1 across pages
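
All three rules come down to the same grouping step, which can be sketched as follows. The page-record shape and the normalisation (collapse whitespace, lowercase) are assumptions for the example; the page does not say how Crawlmind compares values.

```python
from collections import defaultdict


def find_duplicates(pages, field):
    """Group URLs that share the same normalised value for `field`.

    pages maps URL -> dict of extracted signals, e.g. {"title": "..."}.
    Returns {normalised value: [urls]} for values used by 2+ pages.
    """
    groups = defaultdict(list)
    for url, signals in pages.items():
        # Assumed normalisation: collapse whitespace and ignore case.
        value = " ".join((signals.get(field) or "").split()).lower()
        if value:
            groups[value].append(url)
    return {value: urls for value, urls in groups.items() if len(urls) > 1}
```

Calling it with `"title"`, a meta-description key, or an H1 key would cover each of the three rules in turn.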

Near-duplicate body-content detection (Jaccard / shingle-based) is on the Tier-2 roadmap.

AI-discoverability depth (Phase 7b)

These rules don't exist in any competing SEO crawler: they're the GEO-specific checks that decide whether AI answer engines (ChatGPT, Perplexity, Gemini, Claude) cite your pages.

page.indexability.noai-meta: <meta name="robots" content="noai"> (or noimageai) opts the page out of AI use. Often set accidentally by privacy tooling.
page.content.atomic-answer-missing: opening paragraph doesn't define the topic in one sentence (the linguistic shape AI engines extract verbatim).
page.content.statistic-density-low: long-form page with too few concrete numbers. Pages with 3+ stats per 1000 chars get cited ~4× as often.
page.content.authoritative-links-missing: no outbound links to .gov / .edu / Wikipedia / DOI sources (E-E-A-T trust signal).
page.content.eeat-author-missing: Article-like JSON-LD with no author field.
page.content.eeat-dates-missing: Article-like JSON-LD missing datePublished and/or dateModified (kills freshness signal for time-sensitive queries).
page.structured-data.id-broken-reference: JSON-LD @id reference doesn't resolve to a defined node in the same graph. Breaks Google's Knowledge Graph entity disambiguation.
page.structured-data.sameas-invalid-url: sameAs URLs that are malformed or use http://. Silently breaks the bridge to canonical LinkedIn / Wikipedia / Twitter profiles.
site.llms-txt.off-host-links: your llms.txt indexes URLs outside your registered domain (spec violation).
site.llms-txt.optional-section-missing: no ## Optional section in the file. AI engines use it to deprioritise terms/privacy/archive URLs.

Every rule reads HTML + JSON-LD we already extract: no extra fetches against the customer's site.
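
To make one of these checks concrete, here is a small sketch of the sameAs validation using only the standard library. Treating every non-https scheme as invalid is an assumption drawn from the rule description ("malformed or use http://"); the real rule may be more lenient.

```python
from urllib.parse import urlparse


def invalid_same_as(urls):
    """Return the sameAs values that are malformed or not served over https."""
    bad = []
    for url in urls:
        parsed = urlparse(url)
        # A usable profile URL needs an https scheme and a host.
        if parsed.scheme != "https" or not parsed.netloc:
            bad.append(url)
    return bad
```

Because the input is the sameAs array already present in the extracted JSON-LD, a check like this needs no network access.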

Performance & accessibility quick wins

Phase 7a also added:

page.image.dimensions-missing: <img> without explicit width/height (Cumulative Layout Shift)
page.image.lazy-load-missing: below-fold images without loading="lazy" (LCP regression)
page.security.mixed-content: HTTPS page that references http:// resources (trust + CSP failure)
page.heading.hierarchy-skip: H1 → H3 with no H2 (screen reader + LLM outline break)

Real Core Web Vitals measurement (LCP / CLS / INP) lands in Phase 7d.

Real Core Web Vitals (Phase 7d, rendered crawls only)

On rendered (Playwright) crawls Crawlmind injects a PerformanceObserver into every page before any of the site's own scripts. After the page hits networkidle we wait a 2-second settle window, then read:

LCP: Largest Contentful Paint. Threshold: ≤ 2.5s good, > 4s poor.
CLS: Cumulative Layout Shift. Threshold: ≤ 0.1 good, > 0.25 poor.
FCP: First Contentful Paint. Threshold: ≤ 1.8s good, > 3s poor.
TBT: Total Blocking Time. Threshold: ≤ 200ms good, > 600ms poor. Used as a lab proxy for INP, because INP needs real user interactions that the crawler doesn't simulate.
Transfer size: sum of transferSize over the first 500 resource-timing entries. Threshold: < 3MB good, > 5MB poor.
DOM nodes: element count at finalize. Threshold: < 1500 good, > 3000 poor.

Each metric drives a dedicated rule (page.performance.lcp-slow, page.performance.cls-high, page.performance.fcp-slow, page.performance.tbt-high, page.performance.transfer-bloated, page.performance.dom-size-excessive) with severity bands matching Google's thresholds.
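
The banding for the four timing and layout metrics can be sketched as below, using the thresholds listed above. The dictionary keys and the "needs-improvement" label for the middle band are assumptions; transfer size and DOM nodes are left out because their good band uses a strict `<` rather than `≤`.

```python
# (good_max, poor_min) per metric, taken from the thresholds listed above.
# Times are in milliseconds; CLS is unitless.
THRESHOLDS = {
    "lcp_ms": (2500, 4000),
    "cls": (0.1, 0.25),
    "fcp_ms": (1800, 3000),
    "tbt_ms": (200, 600),
}


def classify(metric, value):
    """Map a lab measurement to 'good', 'needs-improvement', or 'poor'."""
    good_max, poor_min = THRESHOLDS[metric]
    if value <= good_max:
        return "good"
    if value > poor_min:
        return "poor"
    return "needs-improvement"  # assumed name for the middle band
```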

The PERFORMANCE sub-score is re-blended when CWV data is present: 30% weight to LCP, 25% to CLS, 20% to TBT, 15% to FCP, and 10% to transfer size, roughly mirroring PageSpeed Insights' split. HTTP-only crawls see no regression: when zero pages have webVitals data, the CWV penalty resolves to 0 and the score is identical to the pre-7d TTFB-based formula.
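
One way to picture the re-blend is sketched below. Only the five weights and the "no data means zero penalty" behaviour come from the text above. The per-metric penalty scale (0 to 1), the averaging across pages, and the `max_deduction` parameter are all assumptions invented for the example, not the actual formula.

```python
# Weights from the paragraph above.
CWV_WEIGHTS = {"lcp": 0.30, "cls": 0.25, "tbt": 0.20, "fcp": 0.15, "transfer": 0.10}


def cwv_penalty(pages):
    """pages: list of dicts mapping metric name -> penalty in [0, 1].

    An empty dict stands for a page with no webVitals data (assumption).
    Returns the weighted penalty averaged over measured pages, or 0.0.
    """
    measured = [page for page in pages if page]
    if not measured:
        return 0.0  # HTTP-only crawl: score falls back to the TTFB-based base
    per_page = [
        sum(weight * page.get(metric, 0.0) for metric, weight in CWV_WEIGHTS.items())
        for page in measured
    ]
    return sum(per_page) / len(per_page)


def performance_score(base_score, pages, max_deduction=40.0):
    """Subtract the CWV penalty from the base score (max_deduction is hypothetical)."""
    return max(0.0, base_score - max_deduction * cwv_penalty(pages))
```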

Honesty surface: these are LAB measurements from headless Chromium at the crawler's CPU/network profile. They're directional, not equivalent to Google's CrUX real-user data. Use them to spot regressions and diagnose problems; cross-reference with CrUX (or Google Search Console) for definitive customer-facing numbers.

International SEO: hreflang (Phase 7c)

If you publish content in more than one language, hreflang is the canonical signal Google + AI engines use to pick the right variant per user locale. Phase 7c adds five cross-page hreflang checks:

site.hreflang.invalid-lang-code: non-BCP-47 codes (the classic en_US underscore typo, language names like english, made-up region codes)
site.hreflang.x-default-missing: multi-language site with no x-default fallback for unmatched locales
site.hreflang.return-tag-missing: A links to B with hreflang but B doesn't link back to A. Per Google's docs, each reference must be reciprocal.
site.hreflang.self-reference-missing: a page in an hreflang set should include itself; missing self-references break Google's set-detection
site.hreflang.canonical-conflict: page declares hreflang alternates but its canonical points outside the set, blocking locale-correct indexing

All five run from extracted.hreflang[]: no new fetches.
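
Two of the five checks are sketched below. The regular expression is a simplified shape check (language, optional script, optional region, or x-default), not a full BCP-47 validator, and it does not verify that a region code actually exists. The `hreflang_map` structure is a hypothetical stand-in for whatever extracted.hreflang[] holds per page.

```python
import re

# Simplified shape check, not a full BCP-47 parser (assumption).
HREFLANG_PATTERN = re.compile(
    r"x-default|[a-z]{2,3}(-[a-z]{4})?(-([a-z]{2}|\d{3}))?", re.IGNORECASE
)


def invalid_codes(codes):
    """Return hreflang values that do not have a valid language-tag shape."""
    return [code for code in codes if not HREFLANG_PATTERN.fullmatch(code)]


def missing_return_tags(hreflang_map):
    """hreflang_map maps page URL -> {hreflang code: alternate URL}.

    Returns (page, target) pairs where target does not link back to page.
    """
    missing = []
    for page, alternates in hreflang_map.items():
        for target in alternates.values():
            if target == page or target not in hreflang_map:
                continue  # self-reference, or target outside the crawl set
            if page not in hreflang_map[target].values():
                missing.append((page, target))
    return missing
```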

Static accessibility + UX (Phase 7c)

WCAG-aligned spot checks that run from already-extracted HTML, with no external accessibility-engine dependency:

page.accessibility.link-text-non-descriptive: anchors with text like "click here" / "read more" / "here". WCAG 2.4.4 violation; screen readers + AI engines can't infer destination from context.
page.accessibility.link-text-empty: icon-only links with no accessible name. WCAG 4.1.2.
page.accessibility.viewport-zoom-blocked: viewport meta with user-scalable=no or maximum-scale=1. WCAG 1.4.4 (Resize Text). Cargo-culted into many SPA boilerplates.
page.title.generic: title is suspiciously generic ("Home", "Untitled", "Document", "Page"). The title is the highest-weight SERP-CTR signal, and a generic one wastes the slot.
page.canonical.fragment: canonical URL contains #fragment. Silently dropped by Google but signals a buggy canonical generator.

*Coming in Phase 7d*: real Core Web Vitals measurement (LCP / CLS / INP / TBT / Speed Index) via Playwright performance-timeline instrumentation, plus full WCAG via axe-core in the rendered crawl pass. Both need crawler-core upgrades and ship together.

Rule severity

Every rule emits issues at one of five severities:

CRITICAL: page is uncrawlable, unindexable, or returns 5xx
HIGH: measurable ranking impact (missing canonical, blocked AI bot)
MEDIUM: best-practice violation (no schema, weak title)
LOW: minor (lang attribute missing, sub-optimal H1)
OPPORTUNITY: not a defect, just an upgrade ("ship FAQPage schema here")

Severity directly drives the deduction. A CRITICAL on a high-traffic page can take 8-15 points off the overall score; an OPPORTUNITY takes 0-1.

Issues vs Recommendations vs Action plan

Issue: something we found that's wrong (or worth a fix). Has a ruleId, severity, evidence, optional fix snippet.
Recommendation: same as an issue but more advisory ("consider adding HowTo schema to /tutorial pages").
Action plan: a rolling, prioritised list per org. Combines issues + recommendations and ranks by leverage = impact / effort × confidence. Get to it via /orgs/<id>/action-plan.
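
The leverage ranking can be sketched as below, reading the formula left to right as (impact / effort) × confidence. The item fields and their scales are hypothetical; the page does not say how impact, effort, or confidence are measured.

```python
def leverage(item):
    """leverage = impact / effort * confidence, evaluated left to right."""
    if item["effort"] <= 0:
        raise ValueError("effort must be positive")
    return item["impact"] / item["effort"] * item["confidence"]


def rank_action_plan(items):
    """Sort issues and recommendations by leverage, highest first."""
    return sorted(items, key=leverage, reverse=True)
```

Under this reading, a cheap fix with moderate impact can outrank an expensive high-impact one, which is usually what a prioritised list is meant to surface.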

How scores change over time

Every crawl creates a fresh score row attached to the CrawlJob. The dashboard plots score history per website. Issue rows carry a firstSeenAt and a lastSeenAt so the same finding across multiple crawls is treated as one open issue, not 50 dupes. Resolving an issue (or it disappearing organically) closes the row.
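
The open/close behaviour described above can be sketched as a reconcile step run after each crawl. Keying issues by (ruleId, URL), the `status` field, and reopening a closed row when the finding returns are assumptions made for the example; only firstSeenAt and lastSeenAt come from the text.

```python
def reconcile(issue_rows, findings, crawl_time):
    """Update issue rows in place after a crawl.

    issue_rows: dict keyed by (ruleId, url) -> row dict.
    findings:   iterable of (ruleId, url) pairs seen in this crawl.
    """
    seen = set(findings)
    for key in seen:
        row = issue_rows.get(key)
        if row is None:
            issue_rows[key] = {
                "firstSeenAt": crawl_time,
                "lastSeenAt": crawl_time,
                "status": "open",
            }
        else:
            row["lastSeenAt"] = crawl_time
            row["status"] = "open"  # assumed: a returning finding reopens the row
    for key, row in issue_rows.items():
        if key not in seen and row["status"] == "open":
            row["status"] = "closed"  # the finding disappeared in this crawl
    return issue_rows
```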