
Building a RAG Pipeline from Scratch with TypeScript

11 min read
ai · rag · embeddings · typescript · vector-db

LLMs are confident liars. Ask one about your company's internal docs, your product changelog, or last quarter's metrics, and it'll generate something plausible that's completely wrong. The model doesn't know your data — it was never trained on it.

Retrieval-Augmented Generation (RAG) fixes this by grounding LLM responses in actual documents. Instead of hoping the model knows the answer, you retrieve relevant chunks from your own data and feed them as context. Most RAG tutorials are Python-first, but TypeScript is a perfectly good choice here — especially if your application is already a Node.js stack. I built a complete RAG pipeline in TypeScript recently, and here's how every piece fits together.

Architecture Overview

Here's the full pipeline, end to end:

                        INGESTION
┌──────────┐    ┌──────────┐    ┌────────────┐    ┌───────────┐
│ Documents│───>│ Chunking │───>│ Embeddings │───>│ Vector DB │
│ (md/pdf) │    │          │    │ (OpenAI)   │    │ (pgvector)│
└──────────┘    └──────────┘    └────────────┘    └───────────┘
                                                        │
                        QUERY                           │
┌──────────┐    ┌──────────┐    ┌────────────┐          │
│  User    │───>│  Embed   │───>│ Similarity │──────────┘
│  Query   │    │  Query   │    │  Search    │
└──────────┘    └──────────┘    └────────────┘
                                      │
                               ┌──────────────┐    ┌──────────┐
                               │   Context    │───>│   LLM    │───> Response
                               │   Assembly   │    │ (Claude) │
                               └──────────────┘    └──────────┘

Two phases: ingestion (process documents once, store embeddings) and query (embed the question, find similar chunks, generate an answer). Let's build each step.

Step 1: Document Loading

Before you can chunk anything, you need to get text out of various formats. Here are loaders for the three most common sources:

// loaders.ts
import { readFile } from "fs/promises";
import pdf from "pdf-parse";
 
interface Document {
  content: string;
  metadata: {
    source: string;
    type: "markdown" | "pdf" | "web";
    title?: string;
  };
}
 
async function loadMarkdown(filePath: string): Promise<Document> {
  const content = await readFile(filePath, "utf-8");
  // Strip frontmatter if present
  const cleaned = content.replace(/^---[\s\S]*?---\n/, "");
  return {
    content: cleaned,
    metadata: { source: filePath, type: "markdown" },
  };
}
 
async function loadPDF(filePath: string): Promise<Document> {
  const buffer = await readFile(filePath);
  const data = await pdf(buffer);
  return {
    content: data.text,
    metadata: {
      source: filePath,
      type: "pdf",
      title: data.info?.Title,
    },
  };
}
 
async function loadWebPage(url: string): Promise<Document> {
  const res = await fetch(url);
  const html = await res.text();
 
  // Simple extraction — strip tags, collapse whitespace
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, "\n")
    .replace(/\s+/g, " ")
    .trim();
 
  return {
    content: text,
    metadata: { source: url, type: "web" },
  };
}

For production, you'd want something more robust for web scraping — similar to the approach in my website summarizer post. But this gets the job done for ingestion.

Step 2: Chunking Strategies

Chunking is where most RAG pipelines go wrong. Too large and your context is diluted. Too small and you lose meaning. Three strategies, each with different tradeoffs: fixed-size (simple but splits mid-sentence), sentence-based (respects boundaries but varies in size), and recursive (splits by paragraph, then sentence, then character). Recursive is what you usually want.

// chunker.ts
interface Chunk {
  text: string;
  index: number;
  metadata: { source: string; startChar: number; endChar: number };
}
 
function chunkDocument(
  doc: Document,
  options: { chunkSize?: number; overlap?: number } = {}
): Chunk[] {
  const { chunkSize = 512, overlap = 50 } = options;
  const rawChunks = recursiveChunk(doc.content, chunkSize, overlap);
 
  return rawChunks.map((text, index) => {
    // Compute the offset once; indexOf scans the whole document each call
    const startChar = doc.content.indexOf(text);
    return {
      text,
      index,
      metadata: {
        source: doc.metadata.source,
        startChar,
        endChar: startChar + text.length,
      },
    };
  });
}
 
function recursiveChunk(text: string, maxSize: number, overlap: number): string[] {
  const separators = ["\n\n", "\n", ". ", " "];
 
  for (const sep of separators) {
    const parts = text.split(sep);
    if (parts.length <= 1) continue;
 
    const chunks: string[] = [];
    let current = "";
 
    for (const part of parts) {
      const candidate = current ? current + sep + part : part;
      if (candidate.length > maxSize && current) {
        chunks.push(current.trim());
        // Overlap: keep the tail of the previous chunk
        current = current.slice(-overlap) + sep + part;
      } else {
        current = candidate;
      }
    }
    if (current.trim()) chunks.push(current.trim());
 
    // If any chunk is still too large, recurse with a finer separator
    return chunks.flatMap((c) =>
      c.length > maxSize * 1.5 ? recursiveChunk(c, maxSize, overlap) : [c]
    );
  }
 
  // Fallback: fixed-size split
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + maxSize));
    start += maxSize - overlap;
  }
  return chunks;
}

512 characters with 50-character overlap works well as a starting point. Technical docs often need larger chunks (1024), FAQ-style content works better smaller (256).
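For comparison, the sentence-based strategy mentioned above can be sketched in a few lines. This is a minimal sketch, not production code: the naive regex split will trip over abbreviations like "e.g." and decimal numbers.

```typescript
// Sentence-based chunking: split on sentence boundaries, then pack
// whole sentences into chunks up to maxSize characters. No chunk ever
// breaks mid-sentence, but chunk sizes vary more than with fixed-size.
function sentenceChunk(text: string, maxSize: number = 512): string[] {
  // Split after ., !, or ? followed by whitespace (naive; mishandles "e.g.")
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    const candidate = current ? current + " " + sentence : sentence;
    if (candidate.length > maxSize && current) {
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```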

Step 3: Generating Embeddings

Embeddings turn text into vectors that capture semantic meaning. Similar text produces similar vectors, which is what makes retrieval work.
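"Similar vectors" here means high cosine similarity: the dot product of two vectors, normalized by their magnitudes. The metric itself is simple enough to write out:

```typescript
// Cosine similarity: dot product divided by the product of magnitudes.
// Returns 1 for identical directions, 0 for orthogonal vectors,
// -1 for opposite directions.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

pgvector's <=> operator in Step 5 computes cosine distance, which is one minus this value.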

// embeddings.ts
import OpenAI from "openai";
 
const openai = new OpenAI();
 
async function embedChunks(
  chunks: Chunk[],
  batchSize: number = 100
): Promise<{ chunk: Chunk; embedding: number[] }[]> {
  const results: { chunk: Chunk; embedding: number[] }[] = [];
 
  // OpenAI allows up to 2048 inputs per request,
  // but batching at 100 keeps memory manageable
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const texts = batch.map((c) => c.text);
 
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: texts,
    });
 
    for (let j = 0; j < batch.length; j++) {
      results.push({
        chunk: batch[j],
        embedding: response.data[j].embedding,
      });
    }
 
    // Respect rate limits on large document sets
    if (i + batchSize < chunks.length) {
      await new Promise((r) => setTimeout(r, 200));
    }
  }
 
  return results;
}
 
async function embedQuery(query: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  return response.data[0].embedding;
}

text-embedding-3-small produces 1536-dimensional vectors and costs $0.02 per million tokens. For most use cases, it's the sweet spot. The large variant gives marginally better retrieval at roughly 6.5x the cost ($0.13 per million tokens) — only worth it if you've optimized everything else first.
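That pricing arithmetic is worth encoding so you can sanity-check spend before a large ingestion run. A small helper (prices hardcoded as of this writing, so treat them as assumptions that will drift):

```typescript
// Estimate embedding cost in dollars. Prices are per million tokens,
// taken from OpenAI's published rates at the time of writing.
const PRICE_PER_MILLION = {
  "text-embedding-3-small": 0.02,
  "text-embedding-3-large": 0.13,
} as const;

function estimateEmbeddingCost(
  tokens: number,
  model: keyof typeof PRICE_PER_MILLION = "text-embedding-3-small"
): number {
  return (tokens / 1_000_000) * PRICE_PER_MILLION[model];
}
```

A 100-page doc at roughly 50,000 tokens comes out to a tenth of a cent on the small model.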

Step 4: Vector Storage

You need somewhere to store embeddings and query them by similarity. Here's a quick comparison:

Solution     Best for                           Managed?                 Cost
pgvector     Teams already on Postgres          Self-hosted or managed   Free / DB cost
Pinecone     Zero-ops, scales automatically     Fully managed            Free tier, then $/usage
Qdrant       High performance, rich filtering   Both                     Free (self-hosted)
In-memory    Development and prototyping        N/A                      Free
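The in-memory row is worth taking seriously for prototyping: a brute-force store is just a linear scan plus cosine similarity, and it removes every external dependency while you tune chunking. A sketch:

```typescript
// A brute-force in-memory vector store: linear scan with cosine
// similarity. Fine for thousands of vectors; beyond that you want a
// real index (pgvector, Qdrant, etc.).
interface StoredChunk {
  text: string;
  source: string;
  embedding: number[];
}

class InMemoryVectorStore {
  private rows: StoredChunk[] = [];

  insert(row: StoredChunk) {
    this.rows.push(row);
  }

  search(queryEmbedding: number[], topK: number = 5) {
    return this.rows
      .map((row) => ({ ...row, score: cosine(queryEmbedding, row.embedding) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```

Swapping this out for pgvector later is a matter of keeping the same insert/search interface.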

pgvector is the most accessible — if you already have Postgres, you just add an extension:

// vector-store.ts
import pg from "pg";
 
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
 
async function initVectorStore() {
  await pool.query(`CREATE EXTENSION IF NOT EXISTS vector`);
  await pool.query(`
    CREATE TABLE IF NOT EXISTS document_chunks (
      id SERIAL PRIMARY KEY,
      content TEXT NOT NULL,
      source TEXT NOT NULL,
      chunk_index INTEGER NOT NULL,
      embedding vector(1536),
      metadata JSONB DEFAULT '{}'
    )
  `);
  await pool.query(`
    CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON document_chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100)
  `);
}
 
async function insertChunks(embeddedChunks: { chunk: Chunk; embedding: number[] }[]) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    for (const { chunk, embedding } of embeddedChunks) {
      await client.query(
        `INSERT INTO document_chunks (content, source, chunk_index, embedding, metadata)
         VALUES ($1, $2, $3, $4, $5)`,
        [chunk.text, chunk.metadata.source, chunk.index,
         `[${embedding.join(",")}]`, JSON.stringify(chunk.metadata)]
      );
    }
    await client.query("COMMIT");
  } catch (err) {
    // Roll back the partial batch so a failed ingestion leaves no orphan rows
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}

Step 5: Retrieval

With chunks stored, retrieval is a similarity search — embed the user's query, find the closest vectors, return the top results.

// retrieval.ts
async function retrieveChunks(
  query: string,
  topK: number = 5,
  similarityThreshold: number = 0.7
): Promise<{ content: string; source: string; score: number }[]> {
  const queryEmbedding = await embedQuery(query);
 
  const result = await pool.query(
    `SELECT content, source, metadata,
            1 - (embedding <=> $1::vector) AS score
     FROM document_chunks
     WHERE 1 - (embedding <=> $1::vector) > $2
     ORDER BY embedding <=> $1::vector
     LIMIT $3`,
    [`[${queryEmbedding.join(",")}]`, similarityThreshold, topK]
  );
 
  return result.rows;
}

The <=> operator is pgvector's cosine distance. We convert to similarity with 1 - distance and filter out anything below our threshold. This prevents the LLM from getting irrelevant chunks when the user asks something outside your document set.

For better results, you can add a re-ranking step — retrieve a larger set (say top 20), then use a cross-encoder or LLM call to re-score and pick the best 5. It's slower but noticeably improves answer quality.
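The re-ranking step slots in as a pure function: retrieve a wide candidate set, re-score each chunk with whatever scorer you plug in, and keep the best few. A sketch with the scorer abstracted out, since the scoring model (a cross-encoder, an LLM relevance call) is your choice:

```typescript
// Re-rank a wide retrieval set with an external scorer, keep the topK.
// scoreFn is pluggable: a cross-encoder, an LLM judging relevance, etc.
async function rerank<T extends { content: string }>(
  query: string,
  candidates: T[],
  scoreFn: (query: string, content: string) => Promise<number>,
  topK: number = 5
): Promise<T[]> {
  const scored = await Promise.all(
    candidates.map(async (c) => ({ c, score: await scoreFn(query, c.content) }))
  );
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((s) => s.c);
}
```

With this shape, `retrieveChunks(query, 20)` followed by `rerank(query, chunks, scoreFn, 5)` is the whole change.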

Step 6: Context Assembly

This is the bridge between retrieval and generation. You take the retrieved chunks and build a prompt that gives the LLM everything it needs — including source attribution.

// context.ts
function assembleContext(
  query: string,
  chunks: { content: string; source: string; score: number }[]
): string {
  const contextBlock = chunks
    .map(
      (chunk, i) =>
        `[Source ${i + 1}: ${chunk.source}]\n${chunk.content}`
    )
    .join("\n\n---\n\n");
 
  return `You are a helpful assistant that answers questions based on the provided context.
Use ONLY the information from the context below to answer. If the context doesn't
contain enough information, say so — do not make things up.
 
When you use information from a source, cite it as [Source N].
 
CONTEXT:
${contextBlock}
 
USER QUESTION:
${query}`;
}

The explicit instruction to cite sources and not hallucinate is critical. Without it, the LLM will happily blend retrieved context with its own training data, which defeats the purpose of RAG.
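Since the prompt asks for [Source N] markers, you can also parse them back out of the answer to report exactly which chunks were used, rather than listing every retrieved source. A sketch:

```typescript
// Extract citation markers like "[Source 2]" from a generated answer
// and map them back to the retrieved chunks' sources.
function extractCitedSources(
  answer: string,
  chunks: { source: string }[]
): string[] {
  const cited = new Set<number>();
  for (const match of answer.matchAll(/\[Source (\d+)\]/g)) {
    cited.add(parseInt(match[1], 10) - 1); // prompt numbers sources from 1
  }
  return [...cited]
    .filter((i) => i >= 0 && i < chunks.length)
    .map((i) => chunks[i].source);
}
```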

Step 7: Generation

Now we call the LLM with our assembled context and stream the response. This is similar to the pattern I used in the LLM agents post — structured input, streamed output.

// generate.ts
import Anthropic from "@anthropic-ai/sdk";
 
const anthropic = new Anthropic();
 
async function* generateAnswer(
  query: string,
  chunks: { content: string; source: string; score: number }[]
): AsyncGenerator<string> {
  const prompt = assembleContext(query, chunks);
 
  const stream = anthropic.messages.stream({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
 
  for await (const event of stream) {
    if (
      event.type === "content_block_delta" &&
      event.delta.type === "text_delta"
    ) {
      yield event.delta.text;
    }
  }
}
 
// Full RAG query — ties everything together
async function ragQuery(query: string): Promise<string> {
  const chunks = await retrieveChunks(query);
 
  if (chunks.length === 0) {
    return "I couldn't find any relevant information in the documents to answer that question.";
  }
 
  let answer = "";
  for await (const token of generateAnswer(query, chunks)) {
    answer += token;
    process.stdout.write(token); // Stream to console
  }
 
  // Append sources
  const sources = [...new Set(chunks.map((c) => c.source))];
  answer += "\n\nSources:\n" + sources.map((s) => `- ${s}`).join("\n");
 
  return answer;
}

Evaluation

A RAG pipeline is only as good as your ability to measure it. Three dimensions matter: retrieval relevance (are the right chunks found?), faithfulness (does the answer stick to retrieved context?), and answer correctness (is it actually right?).

// eval.ts
interface EvalCase {
  query: string;
  expectedSources: string[];
}
 
async function evaluate(cases: EvalCase[]) {
  for (const { query, expectedSources } of cases) {
    const chunks = await retrieveChunks(query);
    const found = chunks.map((c) => c.source);
    const recall = expectedSources.filter((s) => found.includes(s)).length
      / expectedSources.length;
 
    let answer = "";
    for await (const token of generateAnswer(query, chunks)) answer += token;
 
    console.log({ query, recall, chunksFound: chunks.length, answerLength: answer.length });
  }
}

Start with 20-30 eval cases covering your key queries. Increase the set as you tune chunk size, overlap, and retrieval parameters.
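Recall alone hides ranking quality. Two standard retrieval metrics, precision@k and reciprocal rank, are pure functions and easy to add to the loop above:

```typescript
// Precision@k: fraction of the top-k retrieved sources that are relevant.
function precisionAtK(retrieved: string[], relevant: string[], k: number): number {
  const topK = retrieved.slice(0, k);
  if (topK.length === 0) return 0;
  const hits = topK.filter((s) => relevant.includes(s)).length;
  return hits / topK.length;
}

// Reciprocal rank: 1 / position of the first relevant result (0 if none).
// Averaged over all eval cases, this gives MRR.
function reciprocalRank(retrieved: string[], relevant: string[]): number {
  const idx = retrieved.findIndex((s) => relevant.includes(s));
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```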

Production Considerations

A few things you'll need before this is production-ready:

  • Cache embeddings — Store a hash of each chunk's text alongside its embedding. On re-ingestion, skip chunks that haven't changed. This saves significant cost on large document sets.
  • Incremental updates — Don't re-embed everything when one document changes. Delete chunks for the modified document and re-insert only those.
  • Chunk size tuning — Start at 512 characters, measure retrieval quality, then adjust. Technical documentation often works better at 1024. FAQ-style content works better at 256.
  • Cost estimates — For text-embedding-3-small: 1 million tokens costs $0.02. A 100-page technical doc is roughly 50,000 tokens to embed. That's $0.001. Queries are even cheaper — a single question is a few dozen tokens. The LLM generation call is where the real cost is.
  • Metadata filtering — Add filters to your retrieval query so users can scope searches to specific documents, date ranges, or categories. pgvector supports this natively through standard SQL WHERE clauses alongside vector search.
  • Hybrid search — Combine vector similarity with keyword search (BM25) for better results. pgvector plus Postgres full-text search gives you both in one database.
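The first bullet (caching by content hash) is mostly one helper. A sketch using Node's crypto module, with a Map standing in for whatever persistent store you'd actually use:

```typescript
import { createHash } from "crypto";

// Cache embeddings keyed by a hash of the chunk text. On re-ingestion,
// unchanged chunks hit the cache and skip the API call entirely.
// The Map stands in for a persistent store (a Postgres table, Redis, etc.).
const cache = new Map<string, number[]>();

function chunkHash(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

async function embedWithCache(
  text: string,
  embedFn: (text: string) => Promise<number[]>
): Promise<number[]> {
  const key = chunkHash(text);
  const hit = cache.get(key);
  if (hit) return hit;
  const embedding = await embedFn(text);
  cache.set(key, embedding);
  return embedding;
}
```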

Wrapping Up

The full pipeline is seven steps: load, chunk, embed, store, retrieve, assemble, generate. Each step has knobs to tune, but the core pattern is straightforward. Start with the recursive chunker, text-embedding-3-small, pgvector, and a simple prompt template. Measure with a small eval set. Then iterate.

The TypeScript ecosystem has everything you need — pg for Postgres, the OpenAI and Anthropic SDKs for embeddings and generation, and standard Node.js APIs for document loading. No Python required.
