Site crawler — Docs — AuthoritySignal

The crawler is a deterministic, pure-function audit engine. Given an HTML payload and a URL, it produces a structured PageAudit: a numeric score, an indexability flag, schema presence, and an ordered list of issues with fix-effort estimates.
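To make the output shape concrete, here is a minimal sketch of what a PageAudit and a pure audit function could look like. The field names, the `auditPage` function, the three regex checks, and the penalty values are all assumptions made for illustration; they are not the crawler's actual schema or rule set.

```typescript
// Hypothetical shapes: field names are illustrative, not the product's real schema.
type FixEffort = "low" | "medium" | "high";

interface Issue {
  rule: string;      // stable rule identifier, e.g. "missing-title"
  message: string;
  effort: FixEffort; // fix-effort estimate
  penalty: number;   // points deducted from the score (assumed scoring model)
}

interface PageAudit {
  url: string;
  score: number;      // 0..100
  indexable: boolean; // indexability flag
  hasSchema: boolean; // schema presence
  issues: Issue[];    // ordered, highest penalty first
}

// Pure function: the same (html, url) always yields the same PageAudit.
// It uses no network, clock, or randomness. The regexes are deliberately
// simplified (for example, they assume name= comes before content=).
function auditPage(html: string, url: string): PageAudit {
  const issues: Issue[] = [];

  const noindex =
    /<meta[^>]+name=["']robots["'][^>]*content=["'][^"']*noindex/i.test(html);
  if (noindex) {
    issues.push({ rule: "noindex", message: "Page is marked noindex", effort: "low", penalty: 40 });
  }

  // A <title> counts as present only if it starts with some visible text.
  if (!/<title[^>]*>\s*[^<\s]/i.test(html)) {
    issues.push({ rule: "missing-title", message: "Missing or empty title", effort: "low", penalty: 20 });
  }

  const hasSchema = /<script[^>]+type=["']application\/ld\+json["']/i.test(html);
  if (!hasSchema) {
    issues.push({ rule: "missing-schema", message: "No JSON-LD found", effort: "medium", penalty: 10 });
  }

  // Deterministic ordering: penalty descending, rule name as tie-breaker.
  issues.sort((a, b) => b.penalty - a.penalty || a.rule.localeCompare(b.rule));

  const deducted = issues.reduce((sum, issue) => sum + issue.penalty, 0);
  return { url, score: Math.max(0, 100 - deducted), indexable: !noindex, hasSchema, issues };
}
```

Because the function depends only on its arguments, running it twice on the same payload gives identical results, which is what makes audits reproducible and easy to test.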

Rule families

Indexability — robots, canonical, noindex, hreflang, redirect chains.
On-page semantics — title, meta description, H1, internal link density, image alt coverage.
Structured data — JSON-LD presence, type validity, required fields per type, parse errors.
Performance proxies — DOM weight, render-blocking script count, image weight. The crawler does not run a real browser, so Core Web Vitals (CWV) come from the CrUX integration.
AEO signals — answer-shaped headings (Q/A pairs), citable facts (numbers + dates + named entities), methodology-link presence.
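One way to picture the rule families is as a list of small pure functions that each return findings for a page. The sketch below is an assumption about how such rules might be organized, not the crawler's real code: the `Rule` and `Finding` types, the rule identifiers, and the `REQUIRED` field table are hypothetical, and the table covers only two schema types for illustration.

```typescript
// Hypothetical rule shape; family names mirror the list above.
type Family = "indexability" | "on-page" | "structured-data" | "performance" | "aeo";

interface Finding {
  family: Family;
  rule: string;
  message: string;
}

type Rule = (html: string, url: string) => Finding[];

// Illustrative subset of required fields per type, not a full schema.org mapping.
const REQUIRED: Record<string, string[]> = {
  Article: ["headline", "datePublished"],
  FAQPage: ["mainEntity"],
};

// Structured data: JSON-LD presence, parse errors, required fields per type.
const jsonLdRule: Rule = (html) => {
  const findings: Finding[] = [];
  const re = /<script[^>]+type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let match: RegExpExecArray | null;
  let blocks = 0;
  while ((match = re.exec(html)) !== null) {
    blocks++;
    let data: unknown;
    try {
      data = JSON.parse(match[1]);
    } catch {
      findings.push({ family: "structured-data", rule: "jsonld-parse-error", message: "JSON-LD block is not valid JSON" });
      continue;
    }
    if (typeof data !== "object" || data === null) continue;
    const obj = data as Record<string, unknown>;
    const type = typeof obj["@type"] === "string" ? (obj["@type"] as string) : "";
    const required = Object.prototype.hasOwnProperty.call(REQUIRED, type) ? REQUIRED[type] : [];
    for (const field of required) {
      if (!(field in obj)) {
        findings.push({ family: "structured-data", rule: "jsonld-missing-field", message: type + " is missing " + field });
      }
    }
  }
  if (blocks === 0) {
    findings.push({ family: "structured-data", rule: "jsonld-missing", message: "No JSON-LD found" });
  }
  return findings;
};

// AEO: flag pages with no question-shaped heading (heading text ending in "?").
const questionHeadingRule: Rule = (html) => {
  const re = /<h[1-6][^>]*>([\s\S]*?)<\/h[1-6]>/gi;
  let match: RegExpExecArray | null;
  let questions = 0;
  while ((match = re.exec(html)) !== null) {
    const text = match[1].replace(/<[^>]+>/g, ""); // strip inline tags
    if (/\?\s*$/.test(text)) questions++;
  }
  return questions > 0
    ? []
    : [{ family: "aeo", rule: "no-question-headings", message: "No answer-shaped headings found" }];
};

// Run rules in a fixed order so the findings list is deterministic.
function runRules(rules: Rule[], html: string, url: string): Finding[] {
  return rules.reduce<Finding[]>((all, rule) => all.concat(rule(html, url)), []);
}
```

Keeping each rule independent means a family can grow without touching the others, and the fixed rule order keeps the output stable between runs.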

Multi-page orchestration

For multi-page audits, orchestrateCrawl takes a sitemap URL list, rate-limits requests (250 ms between requests by default), and enforces robots.txt. It then aggregates the per-page audits into a workspace-level summary (aggregateAudits) that feeds the dashboard.
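The orchestration flow can be sketched as a sequential loop with injected dependencies. Only the names orchestrateCrawl and aggregateAudits and the 250 ms default come from the text above; the signatures, the `CrawlDeps` interface, the `WorkspaceSummary` fields, and the trimmed-down `PageAudit` type are assumptions made so the example is self-contained.

```typescript
// Trimmed, hypothetical PageAudit shape: just enough for aggregation.
interface PageAudit {
  url: string;
  score: number;
  indexable: boolean;
  issues: { rule: string }[];
}

// Hypothetical summary fields for the dashboard.
interface WorkspaceSummary {
  pages: number;
  averageScore: number;
  indexablePages: number;
  issueCounts: Record<string, number>;
}

// Pure aggregation of per-page audits into a workspace-level summary.
function aggregateAudits(audits: PageAudit[]): WorkspaceSummary {
  const issueCounts: Record<string, number> = {};
  let total = 0;
  let indexablePages = 0;
  for (const audit of audits) {
    total += audit.score;
    if (audit.indexable) indexablePages++;
    for (const issue of audit.issues) {
      issueCounts[issue.rule] = (issueCounts[issue.rule] ?? 0) + 1;
    }
  }
  const averageScore = audits.length > 0 ? total / audits.length : 0;
  return { pages: audits.length, averageScore, indexablePages, issueCounts };
}

// Side effects are injected so the loop itself stays easy to test.
interface CrawlDeps {
  fetchHtml: (url: string) => Promise<string>;
  isAllowed: (url: string) => boolean;            // robots.txt decision
  audit: (html: string, url: string) => PageAudit;
  sleep?: (ms: number) => Promise<void>;          // overridable for tests
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function orchestrateCrawl(
  urls: string[],
  deps: CrawlDeps,
  delayMs: number = 250, // default gap between requests
): Promise<WorkspaceSummary> {
  const sleep = deps.sleep ?? defaultSleep;
  const audits: PageAudit[] = [];
  let fetched = 0;
  for (const url of urls) {
    if (!deps.isAllowed(url)) continue;    // skip URLs disallowed by robots.txt
    if (fetched > 0) await sleep(delayMs); // wait between requests, not before the first
    const html = await deps.fetchHtml(url);
    fetched++;
    audits.push(deps.audit(html, url));
  }
  return aggregateAudits(audits);
}
```

Injecting `fetchHtml`, `isAllowed`, and `sleep` keeps the network and the clock at the edges, so the per-page audit can remain a pure function while the crawl loop is exercised with fakes.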