System Design — From URL Shortener to Distributed Cache
Every system design interview recycles the same ten or so patterns. Different question, same toolkit. A URL shortener and a message queue have almost nothing in common on the surface, but underneath they both need partitioning, replication, caching, and async processing. Once you see the patterns, every question becomes a remix — just with different knobs turned up or down.
This article teaches those patterns through two progressively complex systems. We start with a URL shortener — the "hello world" of system design — and then level up to a distributed cache, where the patterns repeat but the stakes are higher. By the end, you'll have a reusable framework that works for any design question. If you've read my Design Patterns: The Complete Map, think of this as the system-level equivalent — a catalog of building blocks you can name, recognize, and combine.
Part 1: URL Shortener
Gathering Requirements
Before drawing a single box, pin down what you're actually building. This is where most candidates stumble — they jump straight to architecture and design for the wrong problem.
The questions that matter:
| Question | Typical Answer | Why It Matters |
|---|---|---|
| Read:write ratio? | 100:1 | Reads dominate — optimize for redirect speed |
| Scale? | 100M URLs, 10B redirects/month | Drives partitioning and caching decisions |
| URL length? | As short as possible | Constrains encoding strategy |
| TTL? | Optional, default 5 years | Affects storage and eviction |
| Analytics? | Click counts, referrer, geo | Adds async processing requirements |
| Custom aliases? | Nice to have | Adds collision handling complexity |
From these answers, two things become clear: this is a read-heavy system, and the hot path (redirect) must be fast — sub-10ms ideally.
The Basic Design
Start simple. Three boxes. You can always add complexity later.
┌──────────┐ ┌──────────────────┐ ┌──────────┐
│ Client │─────▶│ API Server │─────▶│ Database │
│ │◀─────│ (Stateless) │◀─────│ (KV Store)│
└──────────┘ └──────────────────┘ └──────────┘
POST /shorten { url: "https://example.com/very/long/path" }
→ Returns: { shortUrl: "https://sho.rt/a3Bf9k" }
GET /a3Bf9k
→ 301 Redirect to original URL
The API layer is stateless — any server can handle any request. The database is a key-value store (DynamoDB, Cassandra, or even PostgreSQL with the right schema). The key is the short code, the value is the original URL plus metadata.
URL Encoding Strategy
We need to generate short, unique codes. Base62 (a-z, A-Z, 0-9) gives us 62 characters to work with. A 7-character code gives us 62^7 = ~3.5 trillion unique URLs. That's plenty.
const BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
function encodeBase62(num: bigint): string {
  let result = "";
  while (num > 0n) {
    result = BASE62[Number(num % 62n)] + result;
    num = num / 62n; // BigInt division truncates, no Math.floor needed
  }
  // Pad to a fixed 7 characters so every short code has the same length.
  // This also covers num === 0n, which leaves result empty.
  return result.padStart(7, "0");
}
// Two strategies for generating the input number:
// 1. Counter-based (simple, requires coordination)
// Use a distributed counter (ZooKeeper, Redis INCR)
// Each server gets a range: Server A gets 1-1000, Server B gets 1001-2000
const shortCode = encodeBase62(BigInt(nextId));
// 2. Hash-based (no coordination, risk of collisions)
// MD5/SHA256 the URL, take the first 40 bits (fits in 7 base62 chars), encode as base62
// On collision: append a salt and rehash

I prefer the counter-based approach. It's simpler to reason about, and you can pre-allocate ranges to avoid coordination on every write. The tradeoff: if a server crashes mid-range, you lose those IDs. That's fine — we have 3.5 trillion.
Scaling Reads: Add a Cache
With a 100:1 read-write ratio, caching is the single highest-impact optimization. Most shortened URLs follow a Pareto distribution — 20% of URLs get 80% of traffic.
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Client │─────▶│ Load │─────▶│ App │─────▶│ DB │
│ │◀─────│ Balancer │◀─────│ Server │◀─────│ │
└──────────┘ └──────────┘ └────┬─────┘ └──────────┘
│ ▲
▼ │
┌──────────┐
│ Cache │
│ (Redis) │
└──────────┘
Read path:
1. Check cache → hit? Return immediately (< 1ms)
2. Cache miss → query DB → populate cache → return
Cache invalidation is straightforward here because URLs rarely change. A simple TTL of 24-48 hours works. When a URL is deleted or expires, either invalidate eagerly (delete from cache on write) or let the TTL handle it. For a URL shortener, eventual consistency is perfectly acceptable — a deleted short link working for another hour won't hurt anyone.
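That read path is the classic cache-aside pattern. A minimal sketch, with in-memory Maps standing in for Redis and the KV store (names are illustrative):

```typescript
// Cache-aside lookup: check the cache first, fall back to the database
// on a miss, and populate the cache for the next reader.
type Entry = { url: string; cachedAt: number };

const cache = new Map<string, Entry>(); // stand-in for Redis
const db = new Map<string, string>();   // stand-in for the KV store
const TTL_MS = 24 * 3600 * 1000;        // 24h TTL, per the suggestion above

function resolve(code: string): string | undefined {
  const hit = cache.get(code);
  if (hit && Date.now() - hit.cachedAt < TTL_MS) {
    return hit.url; // cache hit: no DB round-trip
  }
  const url = db.get(code); // cache miss: query the database
  if (url !== undefined) {
    cache.set(code, { url, cachedAt: Date.now() }); // populate for next time
  }
  return url;
}
```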
Scaling Writes: Async Processing
At 100M URLs, writes aren't the bottleneck. But at 1B+, you need to think about write throughput. Two techniques help:
Write-ahead log (WAL): Buffer writes to an append-only log (Kafka, Kinesis) and flush to the database asynchronously. The short code is generated synchronously (from the counter), so the client gets an immediate response. The actual database write happens in the background.
Batch writes: Group multiple inserts into a single database transaction. Most KV stores handle batch writes much more efficiently than individual ones.
┌──────────┐ ┌───────────┐ ┌──────────┐ ┌──────────┐
│ Client │─────▶│ App Server │─────▶│ Kafka │─────▶│ DB │
│ │◀─────│ (respond │ │ (WAL) │ │ │
│ short │ │ immediately)│ └──────────┘ └──────────┘
│ code │ └───────────┘
└──────────┘
The tradeoff: there's a brief window where a newly created short URL isn't in the database yet. If someone clicks it before the write flushes, they get a 404. In practice this window is milliseconds, and you can mitigate it by writing to the cache synchronously.
Analytics Without Slowing Redirects
You want click tracking — timestamps, referrers, geo, device. But the redirect must stay fast. Never do analytics synchronously in the redirect path. One subtlety: prefer a 302 over a 301 for tracked links, because browsers cache a 301 permanently and repeat clicks never reach your server to be counted.
Instead, fire and forget: on every redirect, push an event to a message queue (Kafka, SQS). A separate analytics pipeline consumes these events, aggregates them, and writes to an analytics store (ClickHouse, BigQuery). The redirect response returns in under 5ms; the analytics data arrives seconds later.
GET /a3Bf9k
→ Cache hit → return 301
→ Async: push { code, timestamp, ip, userAgent, referrer } to Kafka
This is a pattern you'll see everywhere: separate the hot path from the data path. The redirect is the hot path — it must be fast. The analytics is the data path — it can be eventually consistent.
The analytics pipeline itself is a whole sub-system: Kafka consumers aggregate events in time windows (1-minute buckets), then write rollups to the analytics store. You can then query "how many clicks did this URL get in the last hour?" without scanning raw events. This is the same pattern behind every real-time dashboard you've ever seen.
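That windowed rollup can be sketched as a plain reducer (standing in for a Kafka consumer; the event fields are illustrative):

```typescript
// Tumbling-window aggregation: bucket click events into 1-minute windows
// per short code, the way a stream consumer would before writing rollups
// to the analytics store.
type ClickEvent = { code: string; timestamp: number }; // epoch millis

function rollup(events: ClickEvent[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    const minute = Math.floor(e.timestamp / 60_000); // 1-minute bucket
    const key = `${e.code}:${minute}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

Querying "clicks in the last hour" then means summing at most 60 bucket rows instead of scanning raw events.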
Part 2: Distributed Cache
Now let's build something harder. We're designing a distributed cache — think Redis, but one you'd build when a single Redis instance isn't enough. Maybe you need 500GB of cache across 50 nodes, or you need custom eviction logic, or you're in an environment where managed Redis isn't available.
Why Build Your Own?
Usually, you wouldn't. Redis, Memcached, and managed services cover 95% of cases. But there are legitimate reasons: you need more memory than a single instance supports, you need custom eviction logic tied to your business rules, or your organization has compliance requirements that rule out managed services.
More importantly, a system design interview asks you to build one because it forces you to confront:
- Data partitioning: How do you decide which node holds which key?
- Replication: What happens when a node dies?
- Consistency: Can two clients see different values for the same key?
- Eviction: What do you throw away when memory is full?
These are the same questions every distributed system must answer.
Consistent Hashing
The core problem: given N cache nodes, which node should store key K? Naive hashing (hash(key) % N) breaks catastrophically when you add or remove a node — every key remaps.
Consistent hashing fixes this. Imagine a ring of integers from 0 to 2^32 - 1 (drawn below as a 0-360 circle for readability). Each node is placed on the ring at a position determined by hashing its ID. Each key is also hashed onto the ring, and it's stored on the first node clockwise from its position.
0
│
┌───────┼───────┐
╱ │ ╲
╱ │ ╲
Node A │ Node B
(pos 45) │ (pos 120)
╲ │ ╱
╲ │ ╱
╲ │ ╱
Node C│
(pos 240)
│
360
key "user:42" → hash = 80 → lands between A(45) and B(120) → stored on B
key "post:99" → hash = 200 → lands between B(120) and C(240) → stored on C
key "sess:7" → hash = 300 → lands between C(240) and A(45) → stored on A
If Node B dies:
key "user:42" now maps to C (next clockwise) — only B's keys remap
A and C's keys stay put
The beauty: when a node joins or leaves, only keys between the affected node and its predecessor need to move — roughly 1/N of total keys instead of all of them.
In practice, we use virtual nodes — each physical node gets 100-200 positions on the ring for even distribution.
import { createHash } from "crypto";
class ConsistentHashRing<T> {
private ring = new Map<number, T>();
private sortedKeys: number[] = [];
private virtualNodes: number;
constructor(virtualNodes = 150) {
this.virtualNodes = virtualNodes;
}
private hash(key: string): number {
const h = createHash("md5").update(key).digest();
return h.readUInt32BE(0);
}
addNode(node: T, id: string): void {
for (let i = 0; i < this.virtualNodes; i++) {
const virtualKey = this.hash(`${id}:${i}`);
this.ring.set(virtualKey, node);
this.sortedKeys.push(virtualKey);
}
this.sortedKeys.sort((a, b) => a - b);
}
removeNode(id: string): void {
for (let i = 0; i < this.virtualNodes; i++) {
const virtualKey = this.hash(`${id}:${i}`);
this.ring.delete(virtualKey);
this.sortedKeys = this.sortedKeys.filter((k) => k !== virtualKey);
}
}
getNode(key: string): T | undefined {
if (this.sortedKeys.length === 0) return undefined;
const hash = this.hash(key);
// Find first node position >= hash (binary search in production)
for (const pos of this.sortedKeys) {
if (pos >= hash) return this.ring.get(pos);
}
// Wrap around to first node
return this.ring.get(this.sortedKeys[0]);
}
}

Data Partitioning and Replication
Consistent hashing gives us partitioning for free — each node owns a segment of the ring. For replication, store each key on N consecutive nodes (typically N=3). The first node is the primary, the next two are replicas.
Write "user:42" (replication factor = 3):
1. Hash key → position 80 → primary = Node B
2. Replicate clockwise → Node C, then Node A
3. Return success after W nodes acknowledge (W = 2 for quorum)
Read "user:42":
1. Hash key → primary = Node B
2. Read from R nodes (R = 2 for quorum)
3. If values disagree → return the one with highest timestamp
The quorum formula: W + R > N ensures consistency. With N=3, W=2, R=2, every read will see at least one node that has the latest write. You can tune this — W=1 for faster writes (at the cost of durability), R=1 for faster reads (at the cost of consistency).
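The reconciliation step ("return the one with the highest timestamp") is last-write-wins. A toy sketch — real systems typically use vector clocks or hybrid logical clocks instead of wall-clock timestamps:

```typescript
// Quorum read reconciliation: gather R replica responses and return the
// freshest value by timestamp (last-write-wins).
type Versioned = { value: string; timestamp: number };

function reconcile(responses: Versioned[]): Versioned {
  if (responses.length === 0) throw new Error("no replica responded");
  return responses.reduce((best, r) => (r.timestamp > best.timestamp ? r : best));
}

// The overlap guarantee: W + R > N means every read set intersects
// every write set in at least one node.
function isConsistentQuorum(n: number, w: number, r: number): boolean {
  return w + r > n;
}
```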
Cache Eviction Policies
When memory is full, something has to go. The right policy depends on your access patterns.
| Policy | How It Works | Best For | Weakness |
|---|---|---|---|
| LRU (Least Recently Used) | Evict the key not accessed for the longest time | General purpose, temporal locality | Scan pollution — a one-time full scan evicts hot keys |
| LFU (Least Frequently Used) | Evict the key with the lowest access count | Stable popularity distributions | Slow to adapt — previously hot keys linger |
| TTL (Time To Live) | Evict keys after a fixed duration | Data with known freshness windows | Doesn't consider access patterns at all |
| LRU + LFU hybrid | Frequency-weighted recency (like Redis's allkeys-lfu) | Mixed workloads | More complex to implement and tune |
| Random | Evict a random key | Surprisingly effective in many cases | No guarantees — simple but unpredictable |
For most caches, LRU is the default choice. It's simple, well-understood, and handles the common case (temporal locality) well. LFU is better when you have a stable set of hot keys that might not be accessed recently but should stay cached (think: a product catalog where top items are always queried).
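LRU is also compact to sketch in TypeScript, because `Map` preserves insertion order: deleting and re-inserting a key moves it to the back, so the front of the Map is always the least recently used entry.

```typescript
// Minimal LRU cache built on Map's insertion-order guarantee.
class LRUCache<K, V> {
  private map = new Map<K, V>();
  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    const value = this.map.get(key);
    if (value === undefined) return undefined;
    this.map.delete(key); // move the key to the back (most recently used)
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      // Evict the front of the Map: the least recently used key.
      const lru = this.map.keys().next().value as K;
      this.map.delete(lru);
    }
  }

  has(key: K): boolean {
    return this.map.has(key);
  }
}
```

Both operations are O(1); a classic doubly-linked-list-plus-hash-map implementation gives the same bounds in languages without ordered maps.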
Handling Node Failures
Nodes will fail. Your design needs to handle this gracefully.
Health checks: Every node pings its neighbors on a schedule (every 2-5 seconds). If a node misses 3 consecutive heartbeats, it's marked as suspected-down.
Gossip protocol: Instead of a central monitor, nodes gossip failure information to each other. Node A tells Node B that Node C might be down. If B agrees (it also can't reach C), they propagate this. Once a majority of nodes agree, C is officially dead and its keys are redistributed.
Gossip protocol flow:
Node A ──(C is down)──▶ Node B
Node B ──(C is down)──▶ Node D (B confirms, spreads the word)
Node D ──(C is down)──▶ Node E
Once majority agrees → C's key range is reassigned
→ Node D (next on ring) takes over C's primary keys
→ Replicas are rebuilt on remaining healthy nodes
The advantage over a central monitor: no single point of failure. The tradeoff: convergence takes longer — it might take 10-30 seconds for the entire cluster to agree a node is down. For a cache, that's acceptable. For a database, you'd want faster detection.
Write Policies
How you handle writes relative to the backing store determines your consistency and performance characteristics.
| Policy | How It Works | Latency | Consistency | Data Loss Risk |
|---|---|---|---|---|
| Write-through | Write to cache AND backing store synchronously | High (2 writes) | Strong | None |
| Write-back | Write to cache, async flush to backing store | Low (1 write) | Eventual | Data loss if node dies before flush |
| Write-around | Write directly to backing store, skip cache | Medium | Strong | Cache miss on next read |
Write-through is the safe default. Write-back is what you reach for when write latency matters more than durability (session stores, analytics counters). Write-around makes sense when you write data that's rarely read immediately (log entries, audit trails).
In our distributed cache, write-back with replication is a solid middle ground: you write to 2 of 3 replicas synchronously (fast, durable), and the replicas flush to the backing store asynchronously.
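The three policies differ only in where the synchronous write goes. A compressed sketch, with in-memory Maps standing in for the cache and backing store and the async flush modeled as an explicit `flush()` call:

```typescript
// Write policies side by side. `dirty` tracks write-back entries that
// have not yet been flushed to the backing store.
const cache = new Map<string, string>();
const store = new Map<string, string>();
const dirty = new Set<string>();

function writeThrough(key: string, value: string): void {
  cache.set(key, value);
  store.set(key, value); // both writes on the hot path: slower but safe
}

function writeBack(key: string, value: string): void {
  cache.set(key, value);
  dirty.add(key); // store updated later; lost if the node dies first
}

function writeAround(key: string, value: string): void {
  store.set(key, value); // skip the cache; the next read will miss
}

function flush(): void {
  for (const key of dirty) store.set(key, cache.get(key)!);
  dirty.clear();
}
```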
The Recurring Patterns
Here's the payoff. Look at the two systems we just designed and notice how the same patterns keep appearing:
| Pattern | URL Shortener | Distributed Cache | Shows Up In... |
|---|---|---|---|
| Horizontal scaling | Stateless app servers behind LB | Add/remove cache nodes | Everything |
| Caching | Redis layer for redirect lookups | The system is a cache | Read-heavy systems |
| Load balancing | Distribute requests across app servers | Consistent hashing distributes keys | Everything |
| Data partitioning | Shard DB by key range | Consistent hashing ring | Large datasets |
| Replication | DB replicas for read scaling | N=3 copies per key | Anything that can't lose data |
| Async processing | Analytics events via Kafka | Write-back flushing | Anywhere latency matters |
| Monitoring | Redirect latency, cache hit rate | Node health, eviction rate | Everything (seriously) |
These seven patterns cover roughly 80% of what you need in any system design interview. The other 20% is domain-specific knowledge — rate limiting for API design, conflict resolution for collaborative editing, feed ranking for social media. But the infrastructure patterns are always the same.
The Interview Framework
When you sit down for a system design interview, follow these five steps. Every time. The structure itself signals competence.
Step 1: Requirements (3-5 minutes). Ask clarifying questions. Functional requirements (what does it do?), non-functional requirements (scale, latency, availability), and constraints (existing tech stack, budget). Write them down visibly.
Step 2: API Design (3-5 minutes). Define the core endpoints or interfaces. This forces you to think about the system from the user's perspective. Keep it to 3-5 endpoints — if you have more, you're overcomplicating it.
Step 3: High-Level Design (10-15 minutes). Draw the boxes and arrows. Client, load balancer, application servers, database, cache, message queue. Start simple — three boxes — and add components only when you have a reason to. Every box should earn its place.
Step 4: Deep Dive (10-15 minutes). Pick 2-3 areas to go deep on. The interviewer will often guide you, but if they don't, choose the most interesting or challenging aspects. This is where you discuss partitioning strategies, cache policies, consistency models, and failure handling.
Step 5: Bottlenecks and Tradeoffs (5 minutes). Identify what could break. Single points of failure? Hot partitions? What happens at 10x scale? Always discuss tradeoffs — there are no perfect designs, only tradeoffs you've chosen deliberately.
A few more tips that I've found make a real difference:
- Narrate your thinking. Silence is your enemy. Even when you're unsure, talk through your reasoning — interviewers give credit for structured thinking, not just correct answers.
- Use numbers. Back-of-the-envelope math ("100M URLs at 100 bytes each = 10GB, fits in memory") shows you can reason about scale concretely, not just abstractly.
- Draw before you talk. A quick diagram anchors the conversation and gives the interviewer something to point at when they want to redirect you.
The biggest mistake I see: spending 20 minutes on Step 3 and running out of time before the deep dive. The high-level design should be a sketch, not a masterpiece. The depth is where you demonstrate real understanding.
Takeaway
System design isn't about memorizing architectures — it's about recognizing patterns and knowing which knobs to turn. A URL shortener and a distributed cache look nothing alike, but they share the same DNA: partitioning, replication, caching, async processing, failure handling. Master those patterns once, and every new problem is just a different configuration of the same building blocks. Start with three boxes, add complexity only when you can articulate why, and always talk in tradeoffs — not absolutes.