Concepts

How Crawlmind scores your site

Updated 2026-05-18

Each Crawlmind audit produces an overall score plus seven sub-scores, each on a 0–100 scale. Scores are not opinions: every point gained or lost ties back to a named rule (e.g. page.title.missing, site.canonical.target-noindex, page.content.atomic-answer-missing, site.hreflang.return-tag-missing). As of Phase 7c (May 2026) the rule library covers 65+ checks spanning classical SEO, indexability integrity, structured data, AI crawler access, content quality, performance, accessibility, and international SEO (hreflang) — plus 10 AI-discoverability rules competitors do not yet ship. This page lists the dimensions, the high-impact rules in each, and how the weighting works.

The 7 sub-scores

Dimension | What it measures
Technical SEO | Title, meta, H1, canonical, redirects, sitemap, robots, broken links, mobile viewport, lang
Structured data | JSON-LD detection, type coverage, validation against Schema.org + Google Rich Results requirements
AI crawler access | Explicit allow/disallow policy for 12+ AI bots (GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended, Applebot-Extended, CCBot, …)
LLM readability | Atomic-answer presence, jargon density, grade level, paragraph length, sentence variety
Entity clarity | Whether the homepage + About page describe a single unambiguous entity (LLM-scored)
Citation readiness | Per-page likelihood of being cited by an AI answer engine (facts + sources + freshness)
Performance (basic) | TTFB, response time, server compression, basic Core Web Vitals proxies

The overall score is a weighted blend of the seven sub-scores, with Technical SEO and AI crawler access carrying the most weight by default. You can override the weights (Agency+); see /docs/scoring/custom-weights.
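
As a rough sketch of what a weighted blend looks like, the snippet below averages sub-scores by weight. The weight values and the names `DEFAULT_WEIGHTS` and `overall_score` are illustrative assumptions, not Crawlmind's actual defaults; they only reflect that two dimensions weigh more than the rest.

```python
# Hypothetical integer weights (percent). Only the relative ordering is taken
# from the text: technical SEO and AI crawler access weigh the most.
DEFAULT_WEIGHTS = {
    "technical_seo": 25,
    "ai_crawler_access": 25,
    "structured_data": 10,
    "llm_readability": 10,
    "entity_clarity": 10,
    "citation_readiness": 10,
    "performance": 10,
}

def overall_score(sub_scores, weights=DEFAULT_WEIGHTS):
    """Weighted average of 0-100 sub-scores, normalised by the total weight used."""
    total_weight = sum(weights[name] for name in sub_scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    blended = sum(sub_scores[name] * weights[name] for name in sub_scores)
    return round(blended / total_weight, 1)
```

Normalising by the total weight means a custom weight set does not have to sum to 100 for the result to stay on the 0–100 scale.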

Indexability integrity (Phase 7a)

Crawlmind verifies that the signals search engines and AI engines use to decide *what to index* are mutually consistent. Each rule fires per-page (or as a cross-page aggregate) when it detects a misconfiguration.

  • page.indexability.x-robots-noindex — X-Robots-Tag: noindex header on the response (easy to miss because it's not in the HTML)
  • page.robots-meta.noindex — meta robots noindex inside the page
  • site.canonical.target-noindex — a page canonicalises *to* another page that has noindex (contradictory signals)
  • site.canonical.target-non200 — canonical points at a 4xx / 5xx URL (broken consolidation)
  • site.canonical.chain — A→B canonical and B→C canonical (Google picks the first hop and may pick wrong)
  • site.soft-404 — HTTP 200 + 'page not found' body markers (the worst kind of stranded crawl budget)
  • page.links.internal-nofollow — rel=nofollow on internal links (silent PageRank leak, usually a plugin misconfig)
  • page.links.excessive — > 100 links per page (PageRank dilution)
  • site.orphan-pages — page exists in the crawl set but no other page links to it
  • site.thin-internal-links — page has only 1–2 inbound internal links

All of these rules fall under the TECHNICAL_SEO dimension. Severities range from CRITICAL (noindex on the root URL) down to LOW (excessive link count).
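
The cross-page canonical checks can be pictured as a lookup over the crawl set. This is a minimal sketch, assuming a hypothetical `pages` mapping of URL to status, noindex flag and canonical target; the data shape and function name are not Crawlmind's internals.

```python
def canonical_issues(pages):
    """pages maps URL -> {"status": int, "noindex": bool, "canonical": str or None}."""
    issues = []
    for url, page in pages.items():
        target_url = page.get("canonical")
        if not target_url or target_url == url:
            continue  # no canonical, or a self-referencing one: nothing to check
        target = pages.get(target_url)
        if target is None:
            continue  # target was not crawled, so it cannot be judged here
        if target["status"] != 200:
            issues.append((url, "site.canonical.target-non200"))
        elif target["noindex"]:
            issues.append((url, "site.canonical.target-noindex"))
        next_hop = target.get("canonical")
        if next_hop and next_hop != target_url:
            # A -> B and B -> C: the canonical does not settle in one hop
            issues.append((url, "site.canonical.chain"))
    return issues
```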

Duplicate content + page-level uniqueness

Distinct pages should send distinct signals. Crawlmind catches the three common duplicate-signal classes that demote sites at scale:

  • site.duplicate-titles — same <title> on multiple pages
  • site.duplicate-meta-descriptions — same meta description across pages
  • site.duplicate-h1s — same primary H1 across pages

Near-duplicate body-content detection (Jaccard / shingle-based) is on the Tier-2 roadmap.
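
The three duplicate-signal rules above share one shape: group pages by a field and report groups with more than one URL. The sketch below assumes a hypothetical `pages` mapping and normalises whitespace and case before comparing; that normalisation is an assumption, since the exact matching rule is not stated here.

```python
from collections import defaultdict

def duplicate_groups(pages, field):
    """Group URLs that share the same normalised value for `field` (e.g. "title")."""
    groups = defaultdict(list)
    for url, page in pages.items():
        # Collapse runs of whitespace and ignore case (assumed normalisation)
        value = " ".join((page.get(field) or "").split()).casefold()
        if value:  # pages without the field belong to a "missing" rule, not this one
            groups[value].append(url)
    return {value: urls for value, urls in groups.items() if len(urls) > 1}
```

Calling it with `"title"`, `"meta_description"` or `"h1"` would cover the three rules, provided those (hypothetical) keys exist in the extracted page data.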

AI-discoverability depth (Phase 7b)

These rules don't exist in any competing SEO crawler — they're the GEO-specific checks that decide whether AI answer engines (ChatGPT, Perplexity, Gemini, Claude) cite your pages.

  • page.indexability.noai-meta — <meta name="robots" content="noai"> or noimageai opts the page out of AI use. Often set accidentally by privacy tooling.
  • page.content.atomic-answer-missing — opening paragraph doesn't define the topic in one sentence (the linguistic shape AI engines extract verbatim).
  • page.content.statistic-density-low — long-form page with too few concrete numbers. Pages with 3+ stats per 1000 chars get cited ~4× as often.
  • page.content.authoritative-links-missing — no outbound links to .gov / .edu / Wikipedia / DOI sources (E-E-A-T trust signal).
  • page.content.eeat-author-missing — Article-like JSON-LD with no author field.
  • page.content.eeat-dates-missing — Article-like JSON-LD missing datePublished and/or dateModified (kills freshness signal for time-sensitive queries).
  • page.structured-data.id-broken-reference — JSON-LD @id reference doesn't resolve to a defined node in the same graph. Breaks Google's Knowledge Graph entity disambiguation.
  • page.structured-data.sameas-invalid-url — sameAs URLs that are malformed or use http://. Silently breaks the bridge to canonical LinkedIn / Wikipedia / Twitter profiles.
  • site.llms-txt.off-host-links — your llms.txt indexes URLs outside your registered domain (spec violation).
  • site.llms-txt.optional-section-missing — no ## Optional section in the file. AI engines use it to deprioritise terms/privacy/archive URLs.

Every rule reads HTML + JSON-LD we already extract — no extra fetches against the customer's site.
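
To illustrate the kind of check behind page.structured-data.id-broken-reference, here is a minimal sketch that walks a parsed JSON-LD document and reports @id values that are referenced but never defined. It assumes a node containing only "@id" is a reference and any other node carrying "@id" is a definition; that heuristic and the function name are illustrative, not the shipped rule.

```python
def broken_id_references(graph):
    """Return @id values that are referenced but never defined in the document."""
    defined, referenced = set(), set()

    def walk(node):
        if isinstance(node, dict):
            node_id = node.get("@id")
            if node_id is not None:
                # {"@id": "..."} alone is a pointer; anything richer defines the node
                (referenced if set(node) == {"@id"} else defined).add(node_id)
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(graph)
    return sorted(referenced - defined)
```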

Performance & accessibility quick wins

Phase 7a also added:

  • page.image.dimensions-missing — <img> without explicit width/height (Cumulative Layout Shift)
  • page.image.lazy-load-missing — below-fold images without loading="lazy" (LCP regression)
  • page.security.mixed-content — HTTPS page that references http:// resources (trust + CSP failure)
  • page.heading.hierarchy-skip — H1 → H3 with no H2 (screen reader + LLM outline break)

Real Core Web Vitals measurement (LCP / CLS, plus TBT as a lab proxy for INP) lands in Phase 7d.

Real Core Web Vitals (Phase 7d, rendered crawls only)

On rendered (Playwright) crawls Crawlmind injects a PerformanceObserver into every page before any of the site's own scripts. After the page hits networkidle we wait a 2-second settle window, then read:

  • LCP — Largest Contentful Paint. Threshold: ≤ 2.5s good, > 4s poor.
  • CLS — Cumulative Layout Shift. Threshold: ≤ 0.1 good, > 0.25 poor.
  • FCP — First Contentful Paint. Threshold: ≤ 1.8s good, > 3s poor.
  • TBT — Total Blocking Time. Threshold: ≤ 200ms good, > 600ms poor. Used as a lab proxy for INP — INP needs real user interactions we don't simulate.
  • Transfer size — sum of transferSize over the first 500 resource-timing entries. Threshold: < 3MB good, > 5MB poor.
  • DOM nodes — element count at finalize. Threshold: < 1500 good, > 3000 poor.

Each metric drives a dedicated rule (page.performance.lcp-slow, page.performance.cls-high, page.performance.fcp-slow, page.performance.tbt-high, page.performance.transfer-bloated, page.performance.dom-size-excessive) with severity bands that follow the thresholds listed above. For LCP, CLS, FCP and TBT these are Google's published thresholds.

The PERFORMANCE sub-score is re-blended when CWV data is present: 30% weight to LCP, 25% to CLS, 20% to TBT, 15% to FCP, 10% to transfer size — mirroring PageSpeed Insights' rough split. No regression for HTTP-only crawls: when zero pages have webVitals data, the CWV penalty resolves to 0 and the score is identical to the pre-7d TTFB-based formula.
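
A sketch of how the thresholds and weights above could combine into a single penalty follows. The thresholds and the 30/25/20/15/10 split come from this page; the per-band cost (`BAND_COST`), the metric key names and the way the penalty would be subtracted from the sub-score are assumptions made for illustration.

```python
# (good_max, poor_min, weight) per metric, using the thresholds and weights listed above.
CWV_RULES = {
    "lcp_s": (2.5, 4.0, 30),
    "cls": (0.1, 0.25, 25),
    "tbt_ms": (200, 600, 20),
    "fcp_s": (1.8, 3.0, 15),
    "transfer_mb": (3.0, 5.0, 10),
}

# Assumption: the share of a metric's weight that each band costs.
BAND_COST = {"good": 0.0, "needs-improvement": 0.5, "poor": 1.0}

def band(value, good_max, poor_min):
    """Classify a metric value against its good/poor thresholds."""
    if value <= good_max:
        return "good"
    return "poor" if value > poor_min else "needs-improvement"

def cwv_penalty(web_vitals):
    """Return a 0-100 penalty; 0 when no web-vitals data was collected."""
    if not web_vitals:
        return 0.0  # HTTP-only crawl: the pre-7d formula is left untouched
    penalty = 0.0
    for metric, (good_max, poor_min, weight) in CWV_RULES.items():
        if metric in web_vitals:
            penalty += weight * BAND_COST[band(web_vitals[metric], good_max, poor_min)]
    return penalty
```

The early return mirrors the no-regression guarantee: with no webVitals data the penalty is zero, so the score depends only on the older TTFB-based inputs.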

Caveat: these are lab measurements from headless Chromium at the crawler's CPU/network profile. They are directional and not the same as Google's CrUX real-user data. Use them to spot regressions and diagnose problems; cross-reference with CrUX (or Google Search Console) for definitive customer-facing numbers.

International SEO — hreflang (Phase 7c)

If you publish content in more than one language, hreflang is the canonical signal Google + AI engines use to pick the right variant per user locale. Phase 7c adds five cross-page hreflang checks:

  • site.hreflang.invalid-lang-code — non-BCP-47 codes (the classic en_US underscore typo, language names like english, made-up region codes)
  • site.hreflang.x-default-missing — multi-language site with no x-default fallback for unmatched locales
  • site.hreflang.return-tag-missing — A links to B with hreflang but B doesn't link back to A. Per Google's docs, each reference must be reciprocal.
  • site.hreflang.self-reference-missing — a page in an hreflang set should include itself; missing self-references break Google's set-detection
  • site.hreflang.canonical-conflict — page declares hreflang alternates but its canonical points outside the set, blocking locale-correct indexing

All five run from extracted.hreflang[] — no new fetches.
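
Three of these checks can be sketched over a simple mapping of page URL to its declared alternates. The input shape, the function name and the simplified language-code pattern are assumptions for illustration; the pattern catches underscore typos and language names but not every invalid BCP 47 tag (a made-up two-letter region would still pass).

```python
import re

# Simplified: 2-3 letter language, optional 4-letter script, optional region.
# Full BCP 47 allows more subtags than this.
LANG_CODE = re.compile(
    r"^(x-default|[a-z]{2,3}(-[a-z]{4})?(-([a-z]{2}|\d{3}))?)$", re.IGNORECASE
)

def hreflang_issues(hreflang_by_page):
    """hreflang_by_page maps page URL -> {lang_code: alternate_url}."""
    issues = []
    for url, alternates in hreflang_by_page.items():
        for code, alt_url in alternates.items():
            if not LANG_CODE.match(code):
                issues.append((url, "site.hreflang.invalid-lang-code", code))
            # Only judge reciprocity for alternates that were actually crawled
            if alt_url != url and alt_url in hreflang_by_page:
                if url not in hreflang_by_page[alt_url].values():
                    issues.append((url, "site.hreflang.return-tag-missing", alt_url))
        if url not in alternates.values():
            issues.append((url, "site.hreflang.self-reference-missing", url))
    return issues
```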

Static accessibility + UX (Phase 7c)

WCAG-aligned spot checks that run from already-extracted HTML — no external accessibility-engine dep:

  • page.accessibility.link-text-non-descriptive — anchors with text like "click here" / "read more" / "here". WCAG 2.4.4 violation; screen readers + AI engines can't infer destination from context.
  • page.accessibility.link-text-empty — icon-only links with no accessible name. WCAG 4.1.2.
  • page.accessibility.viewport-zoom-blocked — viewport meta with user-scalable=no or maximum-scale=1. WCAG 1.4.4 (Resize Text). Cargo-culted into many SPA boilerplates.
  • page.title.generic — title is suspiciously generic ("Home", "Untitled", "Document", "Page"). The title is the highest-weight SERP-CTR signal — generic wastes the slot.
  • page.canonical.fragment — canonical URL contains #fragment. Silently dropped by Google but signals a buggy canonical generator.
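
Two of these spot checks are simple enough to sketch with Python's standard-library HTML parser. The class and function names and the exact phrase list are illustrative assumptions; the real rules may match more phrases and handle more markup variants.

```python
from html.parser import HTMLParser

GENERIC_LINK_TEXT = {"click here", "read more", "here"}

class SpotCheck(HTMLParser):
    def __init__(self):
        super().__init__()
        self.issues = []
        self._link_text = None  # list of text chunks while inside an <a>, else None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._link_text = []
        elif tag == "meta" and (attrs.get("name") or "").lower() == "viewport":
            content = (attrs.get("content") or "").replace(" ", "").lower()
            settings = dict(p.split("=", 1) for p in content.split(",") if "=" in p)
            if (settings.get("user-scalable") in {"no", "0"}
                    or settings.get("maximum-scale") in {"1", "1.0"}):
                self.issues.append("page.accessibility.viewport-zoom-blocked")

    def handle_data(self, data):
        if self._link_text is not None:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._link_text is not None:
            text = " ".join("".join(self._link_text).split()).lower()
            if text in GENERIC_LINK_TEXT:
                self.issues.append("page.accessibility.link-text-non-descriptive")
            self._link_text = None

def spot_check(html):
    """Return the rule IDs triggered by the given HTML string."""
    checker = SpotCheck()
    checker.feed(html)
    checker.close()
    return checker.issues
```

Parsing the viewport content into key/value pairs avoids a substring match that would wrongly flag values such as maximum-scale=10.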

*Coming in Phase 7d*: real Core Web Vitals measurement (LCP / CLS / INP / TBT / Speed Index) via Playwright performance-timeline instrumentation, plus full WCAG via axe-core in the rendered crawl pass. Both need crawler-core upgrades and ship together.

Rule severity

Every rule emits issues at one of five severities:

  • CRITICAL — page is uncrawlable, unindexable, or returns 5xx
  • HIGH — measurable ranking impact (missing canonical, blocked AI bot)
  • MEDIUM — best-practice violation (no schema, weak title)
  • LOW — minor (lang attribute missing, sub-optimal H1)
  • OPPORTUNITY — not a defect, just an upgrade ("ship FAQPage schema here")

Severity directly drives the deduction. A CRITICAL on a high-traffic page can take 8–15 points off the overall score; an OPPORTUNITY takes 0–1.

Issues vs Recommendations vs Action plan

  • Issue — something we found that's wrong (or worth a fix). Has a ruleId, severity, evidence, optional fix snippet.
  • Recommendation — same as an issue but more advisory ("consider adding HowTo schema to /tutorial pages").
  • Action plan — a rolling, prioritised list per org. Combines issues + recommendations and ranks by leverage = impact / effort × confidence. Get to it via /orgs/<id>/action-plan.
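
The leverage formula can be sketched as a sort key. The field names and the scales in the usage example (impact and effort as positive numbers, confidence between 0 and 1) are assumptions; only the formula itself comes from the definition above.

```python
def rank_action_plan(items):
    """Sort items by leverage = impact / effort * confidence, highest first."""
    def leverage(item):
        return item["impact"] / item["effort"] * item["confidence"]
    return sorted(items, key=leverage, reverse=True)

plan = rank_action_plan([
    {"ruleId": "page.title.missing", "impact": 8, "effort": 2, "confidence": 0.9},
    {"ruleId": "site.orphan-pages", "impact": 5, "effort": 5, "confidence": 0.8},
    {"ruleId": "page.robots-meta.noindex", "impact": 10, "effort": 1, "confidence": 1.0},
])
```

With these made-up inputs the noindex fix ranks first because it pairs the highest impact with the lowest effort.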

How scores change over time

Every crawl creates a fresh score row attached to the CrawlJob. The dashboard plots score history per website. Issue rows carry a firstSeenAt and a lastSeenAt so the same finding across multiple crawls is treated as one open issue, not 50 dupes. Resolving an issue (or it disappearing organically) closes the row.
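
The firstSeenAt / lastSeenAt behaviour can be pictured as an upsert per crawl. In this sketch issues are keyed by (ruleId, URL) and closed rows get a `closedAt` timestamp; the key choice, the `closedAt` field and the function name are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timezone

def record_crawl(open_issues, findings, crawled_at):
    """Update open issue rows in place from one crawl's findings.

    open_issues maps (ruleId, url) -> row dict; findings is a set of (ruleId, url).
    Returns the rows closed because the finding no longer appears.
    """
    for key in findings:
        # A repeat finding reuses its row, so firstSeenAt is only set once
        row = open_issues.setdefault(key, {"firstSeenAt": crawled_at})
        row["lastSeenAt"] = crawled_at
    closed = []
    for key in list(open_issues):
        if key not in findings:
            row = open_issues.pop(key)
            row["closedAt"] = crawled_at
            closed.append((key, row))
    return closed
```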
