How Crawlmind scores your site
Updated 2026-05-18
Each Crawlmind audit produces an overall score plus seven sub-scores, each on a 0–100 scale. Scores are not opinions: every point gained or lost ties back to a named rule (e.g. page.title.missing, site.canonical.target-noindex, page.content.atomic-answer-missing, site.hreflang.return-tag-missing). As of Phase 7c (May 2026) the rule library covers 65+ checks spanning classical SEO, indexability integrity, structured data, AI crawler access, content quality, performance, accessibility, and international SEO (hreflang) — plus 10 AI-discoverability rules competitors do not yet ship. This page lists the dimensions, the high-impact rules in each, and how the weighting works.
The 7 sub-scores
| Dimension | What it measures |
|---|---|
| Technical SEO | Title, meta, H1, canonical, redirects, sitemap, robots, broken links, mobile viewport, lang |
| Structured data | JSON-LD detection, type coverage, validation against Schema.org + Google Rich Results requirements |
| AI crawler access | Explicit allow/disallow policy for 12+ AI bots (GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended, Applebot-Extended, CCBot, …) |
| LLM readability | Atomic-answer presence, jargon density, grade level, paragraph length, sentence variety |
| Entity clarity | Whether the homepage + About page describe a single unambiguous entity (LLM-scored) |
| Citation readiness | Per-page likelihood of being cited by an AI answer engine (facts + sources + freshness) |
| Performance (basic) | TTFB, response time, server compression, basic Core Web Vitals proxies |
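As an illustration of the AI crawler access row, the sketch below classifies each bot's policy from a robots.txt body using Python's standard `urllib.robotparser`. The status labels (`explicit-allow`, `inherits-allow`, and so on), the helper names, and the sample file are assumptions made for this example; they are not Crawlmind's internal representation.

```python
from urllib.robotparser import RobotFileParser

# Subset of the bots named in the table above.
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot",
           "Google-Extended", "Applebot-Extended", "CCBot"]

SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /private/
"""


def named_agents(robots_txt: str) -> set[str]:
    """Collect every user-agent token that has its own group (lowercased)."""
    agents = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("user-agent:"):
            agents.add(line.split(":", 1)[1].strip().lower())
    return agents


def ai_bot_policy(robots_txt: str, url: str = "https://example.com/") -> dict[str, str]:
    """Report whether each AI bot is allowed, and whether that is explicit."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    named = named_agents(robots_txt)
    report = {}
    for bot in AI_BOTS:
        verdict = "allow" if parser.can_fetch(bot, url) else "disallow"
        # "explicit" means the bot has its own User-agent group;
        # otherwise it inherits whatever the wildcard group says.
        source = "explicit" if bot.lower() in named else "inherits"
        report[bot] = f"{source}-{verdict}"
    return report


policy = ai_bot_policy(SAMPLE_ROBOTS)
```

A bot that only inherits the wildcard rule is technically allowed, but the policy is implicit, which is the distinction the dimension description draws.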
The overall score is a weighted blend with technical-SEO + AI-crawler-access carrying the most weight by default. You can override the weights in /docs/scoring/custom-weights (Agency+).
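A weighted blend of this kind can be sketched as follows. The weight values are hypothetical: the page only states that Technical SEO and AI crawler access carry the most weight by default, so the numbers below are placeholders chosen to respect that ordering.

```python
# Hypothetical default weights (integers that sum to 100).
# Only the ordering (Technical SEO and AI crawler access highest) comes from the docs.
DEFAULT_WEIGHTS = {
    "technical_seo": 25,
    "ai_crawler_access": 20,
    "structured_data": 15,
    "llm_readability": 10,
    "entity_clarity": 10,
    "citation_readiness": 10,
    "performance": 10,
}


def overall_score(sub_scores: dict[str, float],
                  weights: dict[str, int] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of the sub-scores, each on a 0-100 scale."""
    total_weight = sum(weights[name] for name in sub_scores)
    if total_weight == 0:
        raise ValueError("at least one sub-score needs a non-zero weight")
    blended = sum(score * weights[name] for name, score in sub_scores.items())
    return round(blended / total_weight, 1)


example = overall_score({
    "technical_seo": 90, "ai_crawler_access": 60, "structured_data": 70,
    "llm_readability": 80, "entity_clarity": 80, "citation_readiness": 50,
    "performance": 100,
})
```

Dividing by the summed weights keeps the result on the 0–100 scale even when custom weights do not add up to 100.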
Indexability integrity (Phase 7a)
Crawlmind verifies that the signals search engines and AI engines use to decide *what to index* are mutually consistent. Each rule fires per-page (or as a cross-page aggregate) when it detects a misconfiguration.
- `page.indexability.x-robots-noindex`: `X-Robots-Tag: noindex` header on the response (easy to miss because it's not in the HTML)
- `page.robots-meta.noindex`: meta robots noindex inside the page
- `site.canonical.target-noindex`: a page canonicalises *to* another page that has noindex (contradictory signals)
- `site.canonical.target-non200`: canonical points at a 4xx / 5xx URL (broken consolidation)
- `site.canonical.chain`: A→B canonical and B→C canonical (Google picks the first hop and may pick wrong)
- `site.soft-404`: HTTP 200 + 'page not found' body markers (the worst kind of stranded crawl budget)
- `page.links.internal-nofollow`: rel=nofollow on internal links (silent PageRank leak, usually a plugin misconfig)
- `page.links.excessive`: > 100 links per page (PageRank dilution)
- `site.orphan-pages`: page exists in the crawl set but no other page links to it
- `site.thin-internal-links`: page has only 1–2 inbound internal links
All of these are TECHNICAL_SEO. Severities range from CRITICAL (noindex on the root URL) down to LOW (excessive link count).
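To make the cross-page nature of these checks concrete, here is a minimal sketch of how a canonical chain (A→B, B→C) could be detected from a map of page URL to declared canonical. The input shape and function name are assumptions for illustration, not the crawler's actual data model.

```python
def canonical_chains(canonicals: dict[str, str]) -> list[list[str]]:
    """Return every chain of length > 2, e.g. ["/a", "/b", "/c"].

    `canonicals` maps a page URL to the canonical URL it declares
    (hypothetical input shape).
    """
    chains = []
    for page, target in canonicals.items():
        if target == page:
            continue  # a self-referencing canonical is fine
        chain = [page, target]
        seen = {page, target}
        nxt = canonicals.get(target)
        # Follow further hops until we hit a self-canonical, an unknown
        # page, or a URL already visited (which would be a loop).
        while nxt is not None and nxt != chain[-1] and nxt not in seen:
            chain.append(nxt)
            seen.add(nxt)
            nxt = canonicals.get(nxt)
        if len(chain) > 2:
            chains.append(chain)
    return chains


found = canonical_chains({"/a": "/b", "/b": "/c", "/c": "/c"})
```

Here `/a` is reported because its canonical target `/b` itself canonicalises to `/c`; `/b` alone is a normal single-hop canonical and is not flagged.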
Duplicate content + page-level uniqueness
Distinct pages should send distinct signals. Crawlmind catches the three common duplicate-signal classes that demote sites at scale:
- `site.duplicate-titles`: same `<title>` on multiple pages
- `site.duplicate-meta-descriptions`: same meta description across pages
- `site.duplicate-h1s`: same primary H1 across pages
Near-duplicate body-content detection (Jaccard / shingle-based) is on the Tier-2 roadmap.
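The three exact-duplicate checks share one shape: group pages by a normalised signal and report groups with more than one member. A minimal sketch, assuming whitespace collapsing and case folding as the normalisation (the real rules may normalise differently):

```python
from collections import defaultdict


def duplicate_groups(signal_by_url: dict[str, str]) -> dict[str, list[str]]:
    """Group URLs that share the same title, meta description, or H1.

    Normalisation (collapse whitespace, ignore case) is an assumption.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for url, value in signal_by_url.items():
        key = " ".join(value.split()).casefold()
        if key:  # empty values belong to the "missing" rules, not duplicates
            groups[key].append(url)
    return {key: sorted(urls) for key, urls in groups.items() if len(urls) > 1}


titles = {
    "/": "Acme Widgets",
    "/pricing": "Acme  widgets",
    "/blog": "Acme blog",
    "/draft": "",
}
dupes = duplicate_groups(titles)
```

The same function works for all three rules; only the extracted field passed in changes.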
AI-discoverability depth (Phase 7b)
These rules don't exist in any competing SEO crawler — they're the GEO-specific checks that decide whether AI answer engines (ChatGPT, Perplexity, Gemini, Claude) cite your pages.
- `page.indexability.noai-meta`: `<meta name="robots" content="noai">` (or `noimageai`) opts the page out of AI use. Often set accidentally by privacy tooling.
- `page.content.atomic-answer-missing`: opening paragraph doesn't define the topic in one sentence (the linguistic shape AI engines extract verbatim).
- `page.content.statistic-density-low`: long-form page with too few concrete numbers. Pages with 3+ stats per 1000 chars get cited ~4× as often.
- `page.content.authoritative-links-missing`: no outbound links to .gov / .edu / Wikipedia / DOI sources (E-E-A-T trust signal).
- `page.content.eeat-author-missing`: Article-like JSON-LD with no `author` field.
- `page.content.eeat-dates-missing`: Article-like JSON-LD missing `datePublished` and/or `dateModified` (kills freshness signal for time-sensitive queries).
- `page.structured-data.id-broken-reference`: JSON-LD `@id` reference doesn't resolve to a defined node in the same graph. Breaks Google's Knowledge Graph entity disambiguation.
- `page.structured-data.sameas-invalid-url`: `sameAs` URLs that are malformed or use http://. Silently breaks the bridge to canonical LinkedIn / Wikipedia / Twitter profiles.
- `site.llms-txt.off-host-links`: your llms.txt indexes URLs outside your registered domain (spec violation).
- `site.llms-txt.optional-section-missing`: no `## Optional` section in the file. AI engines use it to deprioritise terms/privacy/archive URLs.
Every rule reads HTML + JSON-LD we already extract — no extra fetches against the customer's site.
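As one example of a check that needs nothing beyond the extracted JSON-LD, the sketch below finds `@id` references that no node in the same document defines. It treats an object containing only `@id` as a reference and any richer object with an `@id` as a definition; that heuristic and the function name are assumptions, not a description of the shipped rule.

```python
import json


def broken_id_references(jsonld_text: str) -> set[str]:
    """Return @id values that are referenced but never defined."""
    defined: set[str] = set()
    referenced: set[str] = set()

    def walk(node) -> None:
        if isinstance(node, dict):
            node_id = node.get("@id")
            if isinstance(node_id, str):
                if set(node) == {"@id"}:
                    referenced.add(node_id)  # pure pointer to another node
                else:
                    defined.add(node_id)     # node with its own properties
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(jsonld_text))
    return referenced - defined


doc = json.dumps({
    "@context": "https://schema.org",
    "@graph": [
        {"@type": "Organization", "@id": "https://example.com/#org", "name": "Acme"},
        {"@type": "Article", "@id": "https://example.com/post#article",
         "publisher": {"@id": "https://example.com/#org"},
         "author": {"@id": "https://example.com/#jane"}},
    ],
})
broken = broken_id_references(doc)
```

In the sample graph the publisher reference resolves, while the author reference points at a node that is never defined.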
Performance & accessibility quick wins
Phase 7a also added:
- `page.image.dimensions-missing`: `<img>` without explicit width/height (Cumulative Layout Shift)
- `page.image.lazy-load-missing`: below-fold images without `loading="lazy"` (LCP regression)
- `page.security.mixed-content`: HTTPS page that references http:// resources (trust + CSP failure)
- `page.heading.hierarchy-skip`: H1 → H3 with no H2 (screen reader + LLM outline break)
Real Core Web Vitals measurement (LCP, CLS, and TBT as a lab proxy for INP) is part of Phase 7d, described in the next section.
Real Core Web Vitals (Phase 7d, rendered crawls only)
On rendered (Playwright) crawls Crawlmind injects a PerformanceObserver into every page before any of the site's own scripts. After the page hits networkidle we wait a 2-second settle window, then read:
- LCP: Largest Contentful Paint. Threshold: ≤ 2.5s good, > 4s poor.
- CLS: Cumulative Layout Shift. Threshold: ≤ 0.1 good, > 0.25 poor.
- FCP: First Contentful Paint. Threshold: ≤ 1.8s good, > 3s poor.
- TBT: Total Blocking Time. Threshold: ≤ 200ms good, > 600ms poor. Used as a lab proxy for INP, because INP needs real user interactions we don't simulate.
- Transfer size: sum of `transferSize` over the first 500 resource-timing entries. Threshold: < 3MB good, > 5MB poor.
- DOM nodes: element count at finalize. Threshold: < 1500 good, > 3000 poor.
Each metric drives a dedicated rule (page.performance.lcp-slow, page.performance.cls-high, page.performance.fcp-slow, page.performance.tbt-high, page.performance.transfer-bloated, page.performance.dom-size-excessive) with severity bands matching Google's thresholds.
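The threshold bands above map naturally onto a small lookup table. The sketch below is illustrative only: the metric keys, the `needs-improvement` label for the middle band, the decision to fire a rule for anything that is not `good`, and the reading of 1 MB as 1,000,000 bytes are all assumptions. It also uses `<=` for every lower bound, whereas the list states a strict `<` for transfer size and DOM nodes.

```python
# (good_upper_bound, poor_lower_bound) per metric, from the thresholds listed above.
THRESHOLDS = {
    "lcp_ms": (2500, 4000),
    "cls": (0.1, 0.25),
    "fcp_ms": (1800, 3000),
    "tbt_ms": (200, 600),
    "transfer_bytes": (3_000_000, 5_000_000),  # assumes 1 MB = 1,000,000 bytes
    "dom_nodes": (1500, 3000),
}

RULE_IDS = {
    "lcp_ms": "page.performance.lcp-slow",
    "cls": "page.performance.cls-high",
    "fcp_ms": "page.performance.fcp-slow",
    "tbt_ms": "page.performance.tbt-high",
    "transfer_bytes": "page.performance.transfer-bloated",
    "dom_nodes": "page.performance.dom-size-excessive",
}


def rate(metric: str, value: float) -> str:
    """Classify one measurement into good / needs-improvement / poor."""
    good, poor = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value > poor:
        return "poor"
    return "needs-improvement"


def failing_rules(vitals: dict[str, float]) -> list[str]:
    """Rule IDs for every measured metric that is not rated good."""
    return sorted(
        RULE_IDS[metric]
        for metric, value in vitals.items()
        if metric in THRESHOLDS and rate(metric, value) != "good"
    )


fired = failing_rules({"lcp_ms": 4200, "cls": 0.05, "dom_nodes": 3500})
```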
The PERFORMANCE sub-score is re-blended when CWV data is present: 30% weight to LCP, 25% to CLS, 20% to TBT, 15% to FCP, 10% to transfer size — mirroring PageSpeed Insights' rough split. No regression for HTTP-only crawls: when zero pages have webVitals data, the CWV penalty resolves to 0 and the score is identical to the pre-7d TTFB-based formula.
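One way to picture the re-blend is as a weighted penalty that is zero whenever no page carries Web Vitals data. The 30/25/20/15/10 weights come from the paragraph above; everything else in this sketch is an assumption, including the linear ramp between the good and poor thresholds, the per-page averaging, and the hypothetical `max_penalty` cap.

```python
from typing import Optional

# Weights from the docs (percent of the CWV blend).
WEIGHTS = {"lcp_ms": 30, "cls": 25, "tbt_ms": 20, "fcp_ms": 15, "transfer_bytes": 10}

# (good, poor) thresholds for the five weighted metrics.
BANDS = {
    "lcp_ms": (2500, 4000),
    "cls": (0.1, 0.25),
    "tbt_ms": (200, 600),
    "fcp_ms": (1800, 3000),
    "transfer_bytes": (3_000_000, 5_000_000),
}


def severity(metric: str, value: float) -> float:
    """0.0 at or below 'good', 1.0 at or above 'poor', linear in between (assumed)."""
    good, poor = BANDS[metric]
    if value <= good:
        return 0.0
    if value >= poor:
        return 1.0
    return (value - good) / (poor - good)


def cwv_penalty(pages: list[Optional[dict[str, float]]], max_penalty: float = 40.0) -> float:
    """Average weighted severity across measured pages, scaled to max_penalty.

    Returns 0.0 when no page has Web Vitals data, so an HTTP-only crawl
    keeps its TTFB-based score unchanged.
    """
    per_page = []
    for vitals in pages:
        present = {m: v for m, v in (vitals or {}).items() if m in WEIGHTS}
        weight_sum = sum(WEIGHTS[m] for m in present)
        if weight_sum == 0:
            continue  # page was not rendered, or carried no weighted metric
        weighted = sum(WEIGHTS[m] * severity(m, v) for m, v in present.items())
        per_page.append(weighted / weight_sum)
    if not per_page:
        return 0.0
    return max_penalty * sum(per_page) / len(per_page)
```

Under these assumptions the final sub-score would be the TTFB-based score minus `cwv_penalty(...)`, which collapses to the older formula for crawls without rendered pages.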
Honesty surface: these are LAB measurements from headless Chromium at the crawler's CPU/network profile. They're directional — not the same as Google's CrUX real-user data. Use them to spot regressions + diagnose problems; cross-reference with CrUX (or Google Search Console) for definitive customer-facing numbers.
International SEO — hreflang (Phase 7c)
If you publish content in more than one language, hreflang is the canonical signal Google + AI engines use to pick the right variant per user locale. Phase 7c adds five cross-page hreflang checks:
- `site.hreflang.invalid-lang-code`: non-BCP-47 codes (the classic `en_US` underscore typo, language names like `english`, made-up region codes)
- `site.hreflang.x-default-missing`: multi-language site with no `x-default` fallback for unmatched locales
- `site.hreflang.return-tag-missing`: A links to B with hreflang but B doesn't link back to A. Per Google's docs, each reference must be reciprocal.
- `site.hreflang.self-reference-missing`: a page in an hreflang set should include itself; missing self-references break Google's set-detection
- `site.hreflang.canonical-conflict`: page declares hreflang alternates but its canonical points outside the set, blocking locale-correct indexing
All five run from extracted.hreflang[] — no new fetches.
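The reciprocity requirement behind `site.hreflang.return-tag-missing` can be sketched as a simple graph check. The input shape below (page URL mapped to `{lang code: alternate URL}`) is a hypothetical stand-in for whatever `extracted.hreflang[]` actually contains.

```python
def missing_return_tags(hreflang: dict[str, dict[str, str]]) -> list[tuple[str, str]]:
    """Return (page, target) pairs where target never links back to page."""
    missing = []
    for page, alternates in hreflang.items():
        for target in alternates.values():
            if target == page:
                continue  # self-reference, covered by a separate rule
            back_links = hreflang.get(target, {}).values()
            if page not in back_links:
                missing.append((page, target))
    return sorted(missing)


site = {
    "https://example.com/en/": {"en": "https://example.com/en/",
                                "de": "https://example.com/de/"},
    "https://example.com/de/": {"de": "https://example.com/de/"},
}
gaps = missing_return_tags(site)
```

In the sample, the English page points at the German one, but the German page lists only itself, so one missing return tag is reported.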
Static accessibility + UX (Phase 7c)
WCAG-aligned spot checks that run from already-extracted HTML — no external accessibility-engine dep:
- `page.accessibility.link-text-non-descriptive`: anchors with text like "click here" / "read more" / "here". WCAG 2.4.4 violation; screen readers + AI engines can't infer destination from context.
- `page.accessibility.link-text-empty`: icon-only links with no accessible name. WCAG 4.1.2.
- `page.accessibility.viewport-zoom-blocked`: viewport meta with `user-scalable=no` or `maximum-scale=1`. WCAG 1.4.4 (Resize Text). Cargo-culted into many SPA boilerplates.
- `page.title.generic`: title is suspiciously generic ("Home", "Untitled", "Document", "Page"). The title is the highest-weight SERP-CTR signal, so a generic one wastes the slot.
- `page.canonical.fragment`: canonical URL contains `#fragment`. Silently dropped by Google but signals a buggy canonical generator.
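A static check like the two link-text rules needs only an HTML parser. The sketch below uses Python's standard `html.parser`; it treats `aria-label` as the only alternative accessible name and ignores `<img alt>` and `title`, which is a simplification, and the phrase list is limited to the three examples given above.

```python
from html.parser import HTMLParser

GENERIC_TEXT = {"click here", "read more", "here"}


class LinkCollector(HTMLParser):
    """Collect (href, accessible name) for every <a> element."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, str]] = []
        self._href = None
        self._label = ""
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            self._href = attr_map.get("href") or ""
            self._label = attr_map.get("aria-label") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            visible = " ".join("".join(self._text).split())
            self.links.append((self._href, visible or self._label.strip()))
            self._href = None


def audit_links(html: str) -> list[tuple[str, str]]:
    """Return (href, rule id) for each link that trips one of the two rules."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    issues = []
    for href, name in collector.links:
        if not name:
            issues.append((href, "page.accessibility.link-text-empty"))
        elif name.casefold() in GENERIC_TEXT:
            issues.append((href, "page.accessibility.link-text-non-descriptive"))
    return issues


sample = (
    '<a href="/pricing">See pricing plans</a>'
    '<a href="/blog">Read more</a>'
    '<a href="/x"><svg></svg></a>'
    '<a href="/search" aria-label="Search"><svg></svg></a>'
)
link_issues = audit_links(sample)
```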
*Coming in Phase 7d*: real Core Web Vitals measurement (LCP / CLS / INP / TBT / Speed Index) via Playwright performance-timeline instrumentation, plus full WCAG via axe-core in the rendered crawl pass. Both need crawler-core upgrades and ship together.
Rule severity
Every rule emits issues at one of five severities:
- CRITICAL — page is uncrawlable, unindexable, or returns 5xx
- HIGH — measurable ranking impact (missing canonical, blocked AI bot)
- MEDIUM — best-practice violation (no schema, weak title)
- LOW — minor (lang attribute missing, sub-optimal H1)
- OPPORTUNITY — not a defect, just an upgrade ("ship FAQPage schema here")
Severity directly drives the deduction. A CRITICAL on a high-traffic page can take 8-15 points off the overall score; an OPPORTUNITY takes 0-1.
Issues vs Recommendations vs Action plan
- Issue: something we found that's wrong (or worth a fix). Has a `ruleId`, severity, evidence, and an optional fix snippet.
- Recommendation: same as an issue but more advisory ("consider adding HowTo schema to /tutorial pages").
- Action plan: a rolling, prioritised list per org. Combines issues + recommendations and ranks by leverage = (impact / effort) × confidence. Get to it via `/orgs/<id>/action-plan`.
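The leverage ranking is simple enough to show directly. The field scales here (impact and effort on an arbitrary 1–10 scale, confidence between 0 and 1) and the sample items are assumptions for illustration; only the formula itself comes from the definition above.

```python
from dataclasses import dataclass


@dataclass
class PlanItem:
    title: str
    impact: float       # assumed 1-10 scale
    effort: float       # assumed 1-10 scale, must be > 0
    confidence: float   # assumed 0-1

    @property
    def leverage(self) -> float:
        # leverage = (impact / effort) x confidence
        return self.impact / self.effort * self.confidence


def rank(items: list[PlanItem]) -> list[PlanItem]:
    """Highest-leverage items first."""
    return sorted(items, key=lambda item: item.leverage, reverse=True)


plan = rank([
    PlanItem("Fix canonical chain", impact=8, effort=2, confidence=0.5),
    PlanItem("Unblock GPTBot in robots.txt", impact=6, effort=1, confidence=0.9),
    PlanItem("Rewrite all product copy", impact=9, effort=9, confidence=1.0),
])
```

A cheap, fairly certain fix outranks a high-impact but expensive one, which is the point of dividing by effort before weighting by confidence.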
How scores change over time
Every crawl creates a fresh score row attached to the CrawlJob. The dashboard plots score history per website. Issue rows carry a firstSeenAt and a lastSeenAt so the same finding across multiple crawls is treated as one open issue, not 50 dupes. Resolving an issue (or it disappearing organically) closes the row.
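The open/update/close lifecycle described above can be sketched as an upsert keyed on the finding's identity. Using `(ruleId, url)` as that key, the `closed_at` field, and opening a fresh row when a closed issue reappears are all assumptions; the page only specifies `firstSeenAt` and `lastSeenAt`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

IssueKey = tuple[str, str]  # (ruleId, url): assumed identity of a finding


@dataclass
class IssueRow:
    rule_id: str
    url: str
    first_seen_at: datetime
    last_seen_at: datetime
    closed_at: Optional[datetime] = None  # hypothetical field


def apply_crawl(issues: dict[IssueKey, IssueRow],
                findings: set[IssueKey],
                crawled_at: datetime) -> None:
    """Merge one crawl's findings into the per-site issue table, in place."""
    for key in findings:
        row = issues.get(key)
        if row is None or row.closed_at is not None:
            # New finding (or a previously closed one reappearing): open a row.
            issues[key] = IssueRow(key[0], key[1], crawled_at, crawled_at)
        else:
            # Same finding seen again: extend the existing open issue.
            row.last_seen_at = crawled_at
    for key, row in issues.items():
        if key not in findings and row.closed_at is None:
            row.closed_at = crawled_at  # finding disappeared, so close it
```

Running three crawls where the same finding appears twice and then disappears leaves a single row whose `first_seen_at` is the first crawl, `last_seen_at` the second, and `closed_at` the third.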