gapr
SEO and GEO in one ledger.
A self-hostable, AGPL-licensed alternative to Ahrefs and SEMrush, with first-class GEO (Generative Engine Optimization) scoring across the major LLMs.
gapr is built around the premise that ranking inside an LLM's generated answer matters as much as ranking on a Google SERP. Beyond traditional SEO (crawls, on-page audits, SERP tracking, backlinks, content briefs, keyword difficulty) it fans a tracked prompt out to ChatGPT, Claude, Perplexity, and Gemini, captures the answer + citations, and computes a transparent GEO Presence Score. Every score in the system exposes its factor breakdown via the Score.breakdown JSON column.
I. Tech scope
- Three-process runtime, deployed independently. apps/api is a Fastify server with JWT auth and an auto-generated Swagger surface at /docs. apps/web is the Next.js 15 dashboard (App Router, React 19 RC, server components for the heavy data screens). apps/worker is the BullMQ consumer that owns the only network-heavy work in the system: Playwright sessions, LLM provider calls, and SerpAPI fetches.
- The API never blocks on long work. HTTP routes enqueue jobs and return a job id; only apps/worker talks to the outside world. Concurrency is per-queue and tuned to the upstream rate limits: serp and rank at 8, crawl and audit at 2 (Playwright sessions are memory-heavy), and geo at 4 because the LLM providers’ per-account quotas saturate faster than CPU does.
- Multi-tenant via Workspace → WorkspaceMember → Project. Every domain record (Keyword, Crawl, Audit, AiQuery, Backlink, ContentBrief, Report) hangs off a Project; row-level scoping lives in packages/db as a generic scoped(projectId) helper rather than as ad-hoc WHERE clauses, which keeps the boundary auditable in one place.
- The generic Score model is the transparent-scoring ledger. Any new metric writes its factor breakdown into Score.breakdown as JSON, so a customer asking "why did my GEO score drop 12 points" gets a deterministic, structured answer rather than an opinion.
- Raw HTML is content-addressed. Page.hash and SerpSnapshot.htmlHash are SHA-256 over the canonicalized response; the bytes themselves live in MinIO/S3, while Postgres only holds the references. A re-crawl that hits the same content costs nothing in object storage.
- Postgres extensions are pinned and assumed: pgcrypto for hashes, pg_trgm for fuzzy keyword match, btree_gin for composite indexes on big tables, citext for case-insensitive domain comparisons. Migration tooling fails closed if an extension is missing, rather than silently falling back to a slower path.
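The content-addressing bullet above can be sketched roughly as follows. This is an illustration under assumptions, not gapr's code: the canonicalize rules, the contentHash helper, and the in-memory ObjectStore (standing in for MinIO/S3) are all hypothetical. Only the SHA-256-over-canonicalized-response idea comes from the text.

```typescript
import { createHash } from "node:crypto";

// Hypothetical canonicalization: normalize line endings and strip trailing
// whitespace. The real rules are not specified in the source.
function canonicalize(html: string): string {
  return html.replace(/\r\n/g, "\n").replace(/[ \t]+$/gm, "").trim();
}

// SHA-256 hex digest of the canonicalized body. This is the kind of value
// that would be stored in Page.hash or SerpSnapshot.htmlHash.
function contentHash(html: string): string {
  return createHash("sha256").update(canonicalize(html), "utf8").digest("hex");
}

// In-memory stand-in for the object store bucket, keyed by content hash.
class ObjectStore {
  private blobs = new Map<string, string>();

  // Writes the bytes only if this hash has not been seen before.
  put(html: string): { hash: string; written: boolean } {
    const hash = contentHash(html);
    if (this.blobs.has(hash)) {
      return { hash, written: false };
    }
    this.blobs.set(hash, html);
    return { hash, written: true };
  }

  get size(): number {
    return this.blobs.size;
  }
}

// A crawl and a re-crawl that differ only in line endings and trailing spaces.
const store = new ObjectStore();
const firstCrawl = store.put("<html>\r\n<body>hi</body>  \r\n</html>");
const reCrawl = store.put("<html>\n<body>hi</body>\n</html>");
```

Here both crawls resolve to the same hash, so the second put writes nothing and the store still holds a single blob. Postgres would only need to record the shared hash on each row.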
II. GEO subsystem
Generative Engine Optimization is a first-class concern, not a bolt-on. The premise of gapr is that ranking inside an LLM-generated answer is now as load-bearing as ranking on a Google SERP; the schema reflects that:
- AiQuery: a tracked prompt for a project. Carries the prompt text, the project context, and the cadence (one-shot, daily, weekly).
- AiAnswer: one row per engine per fetch. The full raw response is retained (in object storage, referenced by hash) so a score can be re-derived months later if the rubric changes.
- AiBrandMention: per-domain mention extracted from the answer. Each row carries kind = CITATION / PROSE / BOTH, a position rank, and a sentiment label, so a citation that praises and a citation that warns are not collapsed into the same number.
packages/geo and packages/llm route the prompt across OPENAI_CHATGPT, ANTHROPIC_CLAUDE, PERPLEXITY, and GOOGLE_GEMINI in parallel. The GEO Presence Score is citation share + prose share + position + engine coverage + sentiment; the weights are configurable per workspace, the inputs are surfaced verbatim in Score.breakdown, and the engine version ("claude-sonnet-4-6", etc.) is pinned per fetch so a score is reproducible against a specific model snapshot.
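As a sketch of how such a score and its breakdown might fit together, the example below treats the GEO Presence Score as a weighted sum of the five named factors. The [0, 1] normalization of each input, the 0 to 100 scale, the weight values, and every type and function name here are assumptions for illustration. The source only names the factors, the per-workspace weights, and the Score.breakdown column.

```typescript
type GeoFactor =
  | "citationShare"
  | "proseShare"
  | "position"
  | "engineCoverage"
  | "sentiment";

type GeoInputs = Record<GeoFactor, number>; // assumed normalized to [0, 1]
type GeoWeights = Record<GeoFactor, number>; // per-workspace, any positive scale

interface GeoBreakdown {
  inputs: GeoInputs; // surfaced verbatim
  weights: GeoWeights;
  contributions: GeoInputs; // points each factor adds to the final value
  engineVersions: Record<string, string>; // model snapshot pinned per fetch
}

const FACTORS: GeoFactor[] = [
  "citationShare",
  "proseShare",
  "position",
  "engineCoverage",
  "sentiment",
];

function geoPresenceScore(
  inputs: GeoInputs,
  weights: GeoWeights,
  engineVersions: Record<string, string>,
): { value: number; breakdown: GeoBreakdown } {
  const totalWeight = FACTORS.reduce((sum, f) => sum + weights[f], 0);
  if (totalWeight <= 0) {
    throw new Error("weights must sum to a positive number");
  }
  const contributions = {} as GeoInputs;
  let value = 0;
  for (const f of FACTORS) {
    // Clamp defensively so one bad input cannot push the score out of range.
    const clamped = Math.min(1, Math.max(0, inputs[f]));
    contributions[f] = (100 * weights[f] * clamped) / totalWeight;
    value += contributions[f];
  }
  return { value, breakdown: { inputs, weights, contributions, engineVersions } };
}

// Illustrative numbers only.
const result = geoPresenceScore(
  { citationShare: 0.5, proseShare: 0.25, position: 1, engineCoverage: 0.75, sentiment: 0.5 },
  { citationShare: 30, proseShare: 20, position: 20, engineCoverage: 20, sentiment: 10 },
  { ANTHROPIC_CLAUDE: "claude-sonnet-4-6" },
);
// result.value is 60: contributions of 15 + 5 + 20 + 15 + 5.
```

Because the breakdown records inputs, weights, and per-factor contributions side by side, a drop between two runs can be attributed to a specific factor by diffing two breakdown objects.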
III. Workspace map
14 packages under apps/* and packages/*, pnpm 9 + Turborepo. The split is by responsibility, not by technology — everything that talks Postgres lives in packages/db, everything that talks an LLM provider lives in packages/llm, and the apps are thin compositions over those.
- apps/web: Next.js 15 dashboard. Server components for the keyword and GEO grids, client components only for the live charts.
- apps/api: Fastify HTTP API. Every write enqueues to BullMQ; reads hit Postgres + the object store directly.
- apps/worker: BullMQ consumer of 8 queues. The only process that holds Playwright contexts and LLM provider clients.
- packages/db: Prisma schema + generated client + the row-level scoped() helper.
- packages/types: shared Zod schemas for HTTP payloads, BullMQ job payloads, and the central QUEUES registry. Adding a queue means editing one file, which is what stops the queue list from drifting.
- packages/llm: provider router with circuit-breaker per provider; a 5xx storm at one upstream does not poison the others.
- packages/crawler: Playwright crawler. Honors robots.txt, respects Crawl-Delay, exposes a polite-default per-host rate limit so an operator never accidentally hammers an upstream.
- packages/serp: SERP fetcher with SerpAPI fallback, Google + Bing parsers, intent classifier (informational / navigational / commercial / transactional).
- packages/geo: AI-engine fan-out, brand-mention detection, GEO Presence scoring math.
- packages/analyzer + packages/entities + packages/scoring: on-page rules, entity extraction (people, products, organizations), and the writers that land scores in the Score ledger.
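The per-provider circuit breaker mentioned for packages/llm could look something like the sketch below. It is a generic breaker pattern written for illustration, not the package's actual code: the class names, the failure threshold of 3, and the 30-second cooldown are all assumptions. Only the four provider identifiers come from the text.

```typescript
type Provider =
  | "OPENAI_CHATGPT"
  | "ANTHROPIC_CLAUDE"
  | "PERPLEXITY"
  | "GOOGLE_GEMINI";

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;

  constructor(threshold = 3, cooldownMs = 30_000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // True when closed, or when open long enough to allow one trial call.
  allows(now: number): boolean {
    return this.openedAt === null || now - this.openedAt >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // Opens (or re-opens) the circuit once consecutive failures hit the threshold.
  recordFailure(now: number): void {
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.openedAt = now;
    }
  }
}

class ProviderRouter {
  private breakers = new Map<Provider, CircuitBreaker>();

  // One breaker per provider, created lazily, so failures never cross over.
  breakerFor(provider: Provider): CircuitBreaker {
    let breaker = this.breakers.get(provider);
    if (!breaker) {
      breaker = new CircuitBreaker();
      this.breakers.set(provider, breaker);
    }
    return breaker;
  }

  isOpen(provider: Provider, now: number = Date.now()): boolean {
    return !this.breakerFor(provider).allows(now);
  }

  // Wraps a provider call: rejects fast while open, records the outcome otherwise.
  async call<T>(provider: Provider, fn: () => Promise<T>, now: number = Date.now()): Promise<T> {
    const breaker = this.breakerFor(provider);
    if (!breaker.allows(now)) {
      throw new Error(`${provider} circuit is open`);
    }
    try {
      const result = await fn();
      breaker.recordSuccess();
      return result;
    } catch (err) {
      breaker.recordFailure(now);
      throw err;
    }
  }
}

const router = new ProviderRouter();
```

Keeping breaker state in a map keyed by provider is what isolates a failing upstream: three consecutive failures on one provider open only that provider's circuit, and calls to the other three proceed as normal.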
IV. Why AGPL self-host
The license choice is operational, not ideological. A self-hosted single-tenant index avoids the per-tenant boundary problems that hosted SaaS competitors keep running into — cross-tenant data leakage from a shared crawler, throttle-budget collisions between customers, and the audit-trail tax of storing every customer’s queries in one database. AGPL is the contract that keeps fork-and-host viable: an operator can run gapr on their own infra and modify it freely, but a hosted derivative must publish its modifications. That trade keeps the project oriented toward operators rather than toward a hosted offering it doesn’t plan to build.
V. Surface
Self-hostable means an operator gets everything needed to bring a single-tenant index up in one repo: the Playwright crawler, the on-page extractor, the GEO scorer (which runs tracked prompts across OPENAI_CHATGPT, ANTHROPIC_CLAUDE, PERPLEXITY, and GOOGLE_GEMINI and scores presence in their generated answers), the index, the API, and the operator console. Single-tenant is a deliberate constraint, not a limitation. Per-tenant boundaries are a known source of cross-tenant data leakage and noisy-neighbor throttle collisions in hosted SEO/GEO platforms; the AGPL self-host alternative makes each operator own their boundary, which is structurally simpler and auditable by the operator alone.
VI. Roadmap
The shape ahead: more LLM targets pinned to specific snapshot dates (so a GEO score is reproducible against an exact model version, not a moving target like ‘the latest Claude’), expansion of the crawl schedulers to honor robots.txt + per-host Crawl-Delay caps without operator hand-holding, and a small import path for existing Ahrefs / SEMrush keyword exports so operators can backfill rather than re-crawl from scratch. None of this is shippable yet — the scoring rubric is the first thing that needs to be settled, because the rubric determines what every other piece in the pipeline is being asked to optimize for. Ship a wrong rubric and every score in the system carries the same wrong shape.