How it works

Hybrid RAG with a curated Neon pgvector corpus and live Google Maps docs lookup when the corpus is insufficient. Two-stage retrieval (bi-encoder → cross-encoder rerank), LLM-judge eval loop, and a committed iteration history in git.

Read the full case study on ramoncarrillo.dev

Two-path retrieval

                            user query
                                │
                  voyage-code-3 embedding
                                │
                                ▼
               pgvector match_documents(top 20, cos ≥ 0.3)
                                │
                                ▼
              voyage rerank-2 → keep top 5, score ≥ 0.5
                                │
                                ▼
              claude-sonnet-4-6 answers from corpus …
                                │
          ┌─────────────────────┴──────────────────────┐
          │ corpus covers it?                          │
          │   YES → cite corpus, done                  │
          │   NO, and in-domain → call web_search      │
          │     ↳ developers.google.com and            │
          │       cloud.google.com only, max 3 uses    │
          │   NO, out-of-scope → refuse cleanly        │
          └────────────────────────────────────────────┘

Request pipeline

    1. Ingestion

    12 Markdown docs in /documents are chunked (~800 tokens with 200-token overlap), embedded with voyage-code-3 (input_type=document, 1024-dim), and stored in Neon pgvector. Currently 41 chunks across Maps JS API, Places, Routes, Route Optimization, Geocoding, Address Validation, Advanced Markers, API Key Restrictions, React/Next.js integration, billing, deprecations, and troubleshooting. Re-ingestion is idempotent — it deletes rows by source_url before re-inserting.
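
The chunking step above can be sketched as a sliding window. This is a minimal illustration, not the project's ingestion code: it assumes tokens can be approximated by whitespace-separated words (a real pipeline would count tokens with the embedding model's tokenizer), and `chunkTokens` and `chunkText` are hypothetical names.

```typescript
// Sliding-window chunker: a sketch under the assumptions stated above.
function chunkTokens(tokens: string[], size = 800, overlap = 200): string[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap; // each window starts `step` tokens after the previous one
  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // this window already reached the end
  }
  return chunks;
}

// Assumption: whitespace-separated words stand in for model tokens.
function chunkText(text: string, size = 800, overlap = 200): string[] {
  const tokens = text.split(/\s+/).filter(Boolean);
  return chunkTokens(tokens, size, overlap).map((c) => c.join(" "));
}
```

With the defaults, each window starts 600 tokens after the previous one, so consecutive chunks share 200 tokens. That shared region is what keeps a code example attached to the paragraph that explains it.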

    2. Query arrives

    POST /api/chat validates the UIMessage payload with Zod (max 4K chars per text part, max 20 messages), applies a 15-req/min per-IP rate limit, and extracts the latest user message.
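
The per-IP limit can be illustrated with a sliding-window counter. This is a sketch only: the class name `SlidingWindowLimiter` is hypothetical, and it keeps state in an in-memory `Map`, which works per process. A deployment with several serverless instances would need a shared store, and how the real route stores its counters is not stated here.

```typescript
// Sliding-window rate limiter: illustrative, in-memory, per process.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit = 15, windowMs = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if the key is under the limit,
  // false if the key already has `limit` requests inside the window.
  allow(key: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false;
    }
    recent.push(now);
    this.hits.set(key, recent);
    return true;
  }
}
```

A route handler would call `limiter.allow(ip)` before doing any embedding or LLM work and return HTTP 429 when it yields `false`.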

    3. Stage-1 retrieval (bi-encoder)

    The query is embedded with voyage-code-3 (input_type=query). Neon's match_documents() returns the top 20 candidate chunks above a 0.3 cosine threshold. That threshold is deliberately looser than what we'd show the LLM: this is a wide-net recall step whose output feeds the reranker.
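
To make the stage-1 rule concrete, here is an in-memory stand-in for what match_documents() computes. It is an assumption-laden sketch: the real work happens inside pgvector, and the names `cosineSimilarity` and `matchDocumentsInMemory` are hypothetical.

```typescript
type Chunk = { id: string; embedding: number[] };
type Scored = { id: string; similarity: number };

// Cosine similarity of two equal-length vectors; 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Keep chunks with similarity >= minSimilarity, best first, at most topK.
function matchDocumentsInMemory(
  query: number[],
  corpus: Chunk[],
  topK = 20,
  minSimilarity = 0.3,
): Scored[] {
  return corpus
    .map((c) => ({ id: c.id, similarity: cosineSimilarity(query, c.embedding) }))
    .filter((s) => s.similarity >= minSimilarity)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, topK);
}
```

The threshold and the cap are independent: a query with few related chunks may return fewer than 20 candidates, or none.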

    4. Stage-2 rerank (cross-encoder)

    Voyage rerank-2 scores each stage-1 candidate (up to 20) against the query and returns at most 5 with a relevance score of 0.5 or higher. Cross-encoders see query + candidate together and are measurably more precise than bi-encoders at the relevance boundary. If the rerank call is rate-limited (the Voyage free tier is 3 RPM, shared), the system falls back to the cosine-ordered top 5, so chat stays functional.
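
The selection and fallback logic can be separated from the network call. In this sketch (hypothetical names throughout), the caller passes the rerank scores it got back, or `null` if the rerank request failed or was rate-limited. The function then applies either the rerank threshold or the cosine fallback.

```typescript
type Candidate = { id: string; cosine: number };

// rerankScores maps candidate id → relevance score, or is null when the
// rerank call was unavailable (for example, rate-limited).
function selectContext(
  candidates: Candidate[],
  rerankScores: Map<string, number> | null,
  topK = 5,
  minRelevance = 0.5,
): string[] {
  if (rerankScores === null) {
    // Fallback path: keep the cosine-ordered top-k from stage 1.
    return [...candidates]
      .sort((a, b) => b.cosine - a.cosine)
      .slice(0, topK)
      .map((c) => c.id);
  }
  // Normal path: order by rerank score and drop anything under the threshold.
  return candidates
    .map((c) => ({ id: c.id, score: rerankScores.get(c.id) ?? 0 }))
    .filter((c) => c.score >= minRelevance)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((c) => c.id);
}
```

Note that the normal path can return an empty list when nothing clears the threshold, while the fallback path returns up to `topK` chunks regardless of quality. That asymmetry is a property of this sketch and is worth keeping in mind when reading eval results produced under rate limiting.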

    5. Generation with agentic fallback

    claude-sonnet-4-6 receives system prompt (v4) + retrieved chunks + web_search_20260209 tool restricted to developers.google.com and cloud.google.com (max 3 uses per message). If the corpus covers the question, Claude answers from corpus citations. If the corpus is weak but the question is in-domain, Claude calls web_search against Google's live docs and answers with web citations. If the question is out-of-scope entirely, Claude refuses without searching. Response streams back via createUIMessageStream with a data-sources part rendered as the 'Sources consulted' panel.

    6. Eval loop

    evals/run-evals.ts runs 15 golden questions on every prompt, corpus, or retrieval change. Scoring combines deterministic citation-accuracy checks with Haiku-based LLM judges for mention coverage and refusal quality. Results land in evals/results/<timestamp>.md and are committed so the git history documents every iteration.
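
The deterministic half of the scorer might look like the following. This is a guess at the shape, not the contents of evals/run-evals.ts: the `GoldenCase` fields, the rule that one matching expected source counts as a pass, and the rule that a refusal should cite nothing are all assumptions made for illustration.

```typescript
// Hypothetical golden-question shape.
type GoldenCase = {
  question: string;
  expectedSources: string[]; // source URLs a correct answer should cite
  shouldRefuse: boolean; // true for out-of-scope or hallucination-bait questions
};

// Deterministic citation check (assumed rule: refusals cite nothing,
// answers must cite at least one expected source).
function citationPass(golden: GoldenCase, citedSources: string[]): boolean {
  if (golden.shouldRefuse) return citedSources.length === 0;
  const cited = new Set(citedSources);
  return golden.expectedSources.some((s) => cited.has(s));
}

// Formats per-question results as "passed/total", e.g. "11/15".
function passRate(results: boolean[]): string {
  const passed = results.filter(Boolean).length;
  return `${passed}/${results.length}`;
}
```

Checks like this are cheap and reproducible, which is why they sit alongside the LLM judges rather than being replaced by them. The judges cover what string matching cannot, such as whether a refusal is phrased well.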

Stack at a glance

Framework: Next.js 16 (App Router)
LLM: claude-sonnet-4-6 via @ai-sdk/anthropic
Embeddings: voyage-code-3 (1024-dim)
Reranker: voyage rerank-2 (cross-encoder, top 20 → top 5)
Vector DB: Neon Postgres + pgvector (IVFFlat cosine)
Web search: Anthropic managed web_search_20260209 (domain-restricted)
Streaming: Vercel AI SDK (createUIMessageStream + tool use)
Eval judge: claude-haiku-4-5 via generateObject
UI: shadcn/ui + Tailwind + Framer Motion
Deploy: Vercel (Edge + Neon serverless driver)

Design decisions

Why hybrid RAG with agentic web search instead of just growing the corpus?

Google Maps Platform docs are 500+ pages across 30+ APIs, updated continuously. No static corpus survives contact with user questions for long — the long tail is vast. Instead of chasing completeness, the corpus covers the common 80% and the web_search tool handles the other 20% by fetching live from developers.google.com. Adversarial queries (non-existent products, off-topic questions) are still refused without searching. This is how Perplexity and ChatGPT's browsing mode work — curated-for-speed + live-for-coverage.

Why a two-stage retriever instead of just cosine + top-5?

As the corpus grew from 6 → 12 files (13 → 41 chunks), cosine-only retrieval started surfacing 'kinda relevant' chunks on adversarial queries, which caused the model to hedge instead of refusing cleanly. Adding a cross-encoder rerank stage between pgvector and the LLM sharpens the relevance boundary: the wider stage-1 net improves recall, the stage-2 rerank improves precision, and when no chunk clears the rerank threshold the context is empty, which fires the system prompt's refusal rule.

Why Voyage, not OpenAI embeddings?

Voyage is Anthropic's officially recommended embedding provider, and voyage-code-3 is optimized for code-heavy content — the Google Maps docs are ~40% code snippets. Using Voyage for both embedding and rerank means one API key, one rate-limit bucket, one mental model.

Why Neon over Supabase?

The stack needed plain Postgres with pgvector, nothing more. Neon's serverless driver works natively in the Vercel Edge runtime, autosuspend keeps the free tier lean, and branching would let me run ingestion experiments without risking prod data (not used in practice yet, but available).

Why 800-token chunks with 200-token overlap?

Short enough that the 5 chunks returned by the stage-2 rerank (roughly 4,000 tokens at ~800 each) fit comfortably in Claude's context after the system prompt. Overlap prevents chunk boundaries from splitting a code example away from the paragraph explaining it, which was the single biggest quality win during early tuning.

Why Anthropic's managed web_search tool instead of a custom fetcher?

Three reasons. (1) No new API key or search provider to maintain — the tool is managed by Anthropic and billed at $10/1K searches, negligible at portfolio traffic. (2) allowedDomains restricts searches to Google-owned docs so the model can't wander off to Stack Overflow or blogs. (3) maxUses caps the loop so a single user message can't burn unlimited searches. Implementing this with a custom fetcher would take days; the managed tool was ~10 lines.
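
As a rough picture of the two knobs named above, the settings can be written down as plain data. The `WebSearchConfig` type is hypothetical and only mirrors the allowedDomains and maxUses options mentioned in the text. It is not a verified SDK signature, and the domain restriction itself is enforced by the managed tool. The `isAllowedSource` helper is likewise an illustrative local check, for instance to sanity-check citations before rendering them.

```typescript
// Hypothetical config shape; field names follow the options named in the text.
type WebSearchConfig = {
  toolVersion: string;
  allowedDomains: string[];
  maxUses: number;
};

const webSearchConfig: WebSearchConfig = {
  toolVersion: "web_search_20260209",
  allowedDomains: ["developers.google.com", "cloud.google.com"],
  maxUses: 3,
};

// True if the URL's host is an allowed domain or a subdomain of one.
function isAllowedSource(url: string, config: WebSearchConfig = webSearchConfig): boolean {
  let host: string;
  try {
    host = new URL(url).hostname;
  } catch {
    return false; // not a parseable absolute URL
  }
  return config.allowedDomains.some((d) => host === d || host.endsWith(`.${d}`));
}
```

Matching on the parsed hostname rather than on a substring avoids accepting look-alike hosts such as `developers.google.com.example.com`.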

What guards against hallucination?

Four layers. (1) The system prompt forbids answering from model memory alone: factual claims must come from retrieved chunks or web_search results. (2) The eval suite includes adversarial questions (the non-existent 'Holographic API', out-of-scope weather/AWS queries) to catch regressions. (3) The 'Sources consulted' panel in the UI shows exactly which chunks the model saw, so claims with no supporting source are easy to spot. (4) The cross-encoder reranker drops marginal chunks that would otherwise encourage the LLM to speculate.

Why return 5 chunks from stage 2, not 10?

Measured on the eval suite: going from 5 to 10 didn't improve citation accuracy but diluted the model's attention and increased hallucination risk. More context is not always better, and the rerank threshold keeps the 5 that survive genuinely relevant.

Iteration history

Every change to the eval scorer, system prompt, or retrieval layer was evaluated against the golden-question set and the result committed. The git log is an audit trail — nothing here is retroactive narrative.

Run | What changed | Outcome
v1 | Regex-based eval scorer | 7/12 baseline. 5 'failures' were false negatives where correct refusals used phrasing the regex didn't recognize.
v2 | Replaced regex with Haiku LLM-judge | 11/12 on the same answers. The scorer stopped mis-classifying.
v3 | Prompt v2 → v3 (made refusal a floor, not a ceiling) | 9/12. The 'also list adjacent APIs' rule leaked into out-of-scope refusals. Reverted to v2; documented in git.
v4 | Two-stage retrieval + Anthropic web_search tool | 11/15 with all 3 newly added long-tail questions passing. Remaining 4 failures are LLM-judge variance, not hallucinations.

Quality is measured

The 15-question golden set covers happy-path retrieval, out-of-scope refusal, hallucination bait, and long-tail queries that specifically exercise the web_search fallback. Every eval run writes a timestamped Markdown + JSON report to evals/results/, committed so the trend is visible.

Long-tail questions now answered via live Google docs search