How it works

The architecture, the design decisions, and the eval loop behind the Maps RAG Assistant.

Request pipeline

  1. Ingestion

    Markdown docs in /documents are chunked (~800 tokens, 200 overlap), embedded with voyage-code-3 (input_type=document), and stored in Neon pgvector. Re-ingestion deletes rows by source_url so the store stays consistent.
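The ingestion script itself isn't shown here, but the sliding-window chunking it describes can be sketched in TypeScript. This is a hypothetical helper, not the repo's code, and it uses whitespace "tokens" as a stand-in for the embedding model's real tokenizer:

```typescript
// Sketch of sliding-window chunking (hypothetical helper, not the repo's
// code). Whitespace-separated words stand in for tokenizer tokens; the
// production pipeline would count tokens with the embedding model's tokenizer.
function chunkTokens(
  text: string,
  chunkSize = 800, // ~800 tokens per chunk, as described above
  overlap = 200,   // 200-token overlap between consecutive chunks
): string[] {
  const tokens = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = chunkSize - overlap; // each chunk starts 600 tokens after the last
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= tokens.length) break; // final chunk reached the end
  }
  return chunks;
}
```

The overlap means every chunk boundary is covered twice, which is what keeps a code example and its explanatory paragraph retrievable together.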

  2. Query arrives

    POST /api/chat validates the payload with Zod (max 2000 chars per message, max 20 messages), applies a 15-req/min per-IP rate limit, and extracts the most recent user message.
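The Zod schema lives in the repo; the per-IP limit can be sketched as a sliding-window counter. This is a hypothetical in-memory version — a real Edge deployment would need a shared store, since instances don't share memory:

```typescript
// Sketch of a per-IP sliding-window rate limiter (hypothetical; Edge
// instances don't share memory, so the deployed version would need a
// shared store such as Redis or a database).
const WINDOW_MS = 60_000; // 1-minute window
const MAX_REQUESTS = 15;  // 15 req/min per IP, as described above

const hits = new Map<string, number[]>();

function isRateLimited(ip: string, now = Date.now()): boolean {
  // Keep only timestamps still inside the window.
  const recent = (hits.get(ip) ?? []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= MAX_REQUESTS) {
    hits.set(ip, recent);
    return true; // over the limit: respond 429
  }
  recent.push(now);
  hits.set(ip, recent);
  return false;
}
```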

  3. Retrieval

    The query is embedded with voyage-code-3 (input_type=query). Neon's match_documents() returns the top 5 chunks above a 0.65 cosine-similarity threshold. Results stream to the client as a data-sources message part before the LLM speaks.
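The thresholding and top-k selection happen inside match_documents() in SQL; the equivalent logic, sketched in TypeScript with illustrative names:

```typescript
// Sketch of the retrieval filter: keep chunks at or above the similarity
// threshold, best-first, at most `limit` of them. In the real pipeline
// match_documents() applies this inside Postgres via pgvector.
interface ScoredChunk {
  content: string;
  sourceUrl: string;
  similarity: number; // cosine similarity in [-1, 1]
}

function topChunks(
  rows: ScoredChunk[],
  threshold = 0.65, // the 0.65 cutoff from above
  limit = 5,        // top 5 chunks
): ScoredChunk[] {
  return rows
    .filter((r) => r.similarity >= threshold)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, limit);
}
```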

  4. Generation

    claude-sonnet-4-6 receives the system prompt + citation instructions + retrieved chunks. Streamed back to the browser via the Vercel AI SDK. The prompt forbids answering from model memory — if retrieval returned nothing relevant, the model is instructed to refuse.
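The actual system prompt lives in the repo; assembling the grounded context can be sketched as numbering the retrieved chunks for citation, with an explicit refusal instruction for the empty-retrieval case. Names and wording here are illustrative, not the repo's prompt:

```typescript
// Sketch of grounding-context assembly (illustrative wording; the repo's
// actual prompt differs). Each chunk gets a numbered [n] label the model
// can cite; the empty case carries the refusal instruction.
interface RetrievedChunk {
  content: string;
  sourceUrl: string;
}

function buildContext(chunks: RetrievedChunk[]): string {
  if (chunks.length === 0) {
    return "No relevant documentation was retrieved. Refuse to answer from memory.";
  }
  return chunks
    .map((c, i) => `[${i + 1}] (${c.sourceUrl})\n${c.content}`)
    .join("\n\n");
}
```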

  5. Eval loop

    evals/run-evals.ts runs 12 golden questions on every prompt or chunking change. Citation accuracy, keyword coverage, and refusal handling are tracked in committed result files so prompt iterations are auditable.
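Of the three metrics, keyword coverage is the simplest to sketch: the fraction of expected keywords that appear in the model's answer. This is a hypothetical scoring function — evals/run-evals.ts may compute it differently:

```typescript
// Sketch of a keyword-coverage metric (hypothetical; the repo's eval
// script may score differently): fraction of expected keywords that
// appear, case-insensitively, in the model's answer.
function keywordCoverage(answer: string, keywords: string[]): number {
  if (keywords.length === 0) return 1;
  const haystack = answer.toLowerCase();
  const hit = keywords.filter((k) => haystack.includes(k.toLowerCase()));
  return hit.length / keywords.length;
}
```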

Stack at a glance

Framework: Next.js 15 (App Router)
LLM: claude-sonnet-4-6 via @ai-sdk/anthropic
Embeddings: voyage-code-3 (1024-dim)
Vector DB: Neon Postgres + pgvector (IVFFlat cosine)
Streaming: Vercel AI SDK (createUIMessageStream)
UI: shadcn/ui + Tailwind + Framer Motion
Deploy: Vercel (Edge + Neon serverless driver)

Design decisions

Why Voyage, not OpenAI embeddings?

Voyage is Anthropic's officially recommended embedding provider, and voyage-code-3 is optimized for code-heavy content — the Google Maps docs are ~40% code snippets. Free tier covers 200M tokens, more than enough for a portfolio project.

Why Neon over Supabase?

The stack needed plain Postgres with pgvector, nothing more. Neon's serverless driver works natively in the Vercel Edge runtime, autosuspend keeps the free tier lean, and branching lets me run ingestion experiments without risking prod data.

Why 800-token chunks with 200-token overlap?

Short enough that top-5 retrieval fits comfortably in Claude's context after the system prompt. Overlap prevents chunk boundaries from splitting a code example away from the paragraph explaining it — which was the single biggest quality win during tuning.

Why return 5 chunks, not 10?

Measured on the eval suite: going from 5 to 10 didn't improve citation accuracy but did dilute the model's attention and increased hallucination risk. More context is not always better.

Why cosine similarity, not dot product?

Voyage embeddings are normalized, so cosine is stable across document lengths. The IVFFlat index is built with vector_cosine_ops to match.
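For unit-length vectors the two measures coincide (cos(a, b) = a·b / (|a||b|), and both norms are 1), so the choice is mostly about keeping the index operator consistent with the query. A quick check:

```typescript
// For normalized (unit-length) vectors, cosine similarity equals the dot
// product, since cos(a, b) = a·b / (|a||b|) and |a| = |b| = 1.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function norm(a: number[]): number {
  return Math.sqrt(dot(a, a));
}

function cosine(a: number[], b: number[]): number {
  return dot(a, b) / (norm(a) * norm(b));
}

function normalize(a: number[]): number[] {
  const n = norm(a);
  return a.map((x) => x / n);
}
```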

What guards against hallucination?

Three layers: (1) the system prompt explicitly instructs refusal when retrieval is empty, (2) the eval suite includes adversarial questions like 'Google Maps Holographic API' to catch regressions, and (3) the Sources Used panel in the UI makes it immediately clear to the user which chunks the model consulted.

Quality is measured

Every change to the system prompt, chunking, or retrieval parameters is evaluated against 12 golden questions — covering happy-path retrieval, out-of-scope refusals, and hallucination bait. See /evals in the repo.