System Design — From URL Shortener to Distributed Cache
Every system design interview recycles the same ten or so patterns. Different question, same toolkit. A URL shortener and a message queue have almost nothing in common on the surface, but underneath they both need partitioning, replication, caching, and async processing. Once you see the patterns, every question becomes a remix — just with different knobs turned up or down.
This article teaches those patterns through two progressively complex systems. We start with a URL shortener — the "hello world" of system design — and then level up to a distributed cache, where the patterns repeat but the stakes are higher. By the end, you'll have a reusable framework that works for any design question. If you've read my Design Patterns: The Complete Map, think of this as the system-level equivalent — a catalog of building blocks you can name, recognize, and combine.
Part 1: URL Shortener
Gathering Requirements
Before drawing a single box, pin down what you're actually building. This is where most candidates stumble — they jump straight to architecture and design for the wrong problem.
The questions that matter:
| Question | Typical Answer | Why It Matters |
|---|---|---|
| Read:write ratio? | 100:1 | Reads dominate — optimize for redirect speed |
| Scale? | 100M URLs, 10B redirects/month | Drives partitioning and caching decisions |
| URL length? | As short as possible | Constrains encoding strategy |
| TTL? | Optional, default 5 years | Affects storage and eviction |
| Analytics? | Click counts, referrer, geo | Adds async processing requirements |
| Custom aliases? | Nice to have | Adds collision handling complexity |
From these answers, two things become clear: this is a read-heavy system, and the hot path (redirect) must be fast — sub-10ms ideally.
The Basic Design
Start simple. Three boxes. You can always add complexity later.
┌──────────┐ ┌──────────────────┐ ┌──────────┐
│ Client │─────▶│ API Server │─────▶│ Database │
│ │◀─────│ (Stateless) │◀─────│ (KV Store)│
└──────────┘ └──────────────────┘ └──────────┘
POST /shorten { url: "https://example.com/very/long/path" }
→ Returns: { shortUrl: "https://sho.rt/a3Bf9k" }
GET /a3Bf9k
→ 301 Redirect to original URL
The API layer is stateless — any server can handle any request. The database is a key-value store (DynamoDB, Cassandra, or even PostgreSQL with the right schema). The key is the short code, the value is the original URL plus metadata.
URL Encoding Strategy
We need to generate short, unique codes. Base62 (a-z, A-Z, 0-9) gives us 62 characters to work with. A 7-character code gives us 62^7 = ~3.5 trillion unique URLs. That's plenty.
const BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
function encodeBase62(num: bigint): string {
  let result = "";
  while (num > 0n) {
    result = BASE62[Number(num % 62n)] + result;
    num = num / 62n; // BigInt division truncates, no Math.floor needed
  }
  // Pad to a fixed 7 characters so every short code has the same length.
  // This also covers num === 0n, which leaves result empty.
  return result.padStart(7, "0");
}
// Two strategies for generating the input number:
// 1. Counter-based (simple, requires coordination)
// Use a distributed counter (ZooKeeper, Redis INCR)
// Each server gets a range: Server A gets 1-1000, Server B gets 1001-2000
const shortCode = encodeBase62(BigInt(nextId));
// 2. Hash-based (no coordination, risk of collisions)
// MD5/SHA256 the URL, take the first 40 bits (fits in 7 base62 chars), encode as base62
// On collision: append a salt and rehash

I prefer the counter-based approach. It's simpler to reason about, and you can pre-allocate ranges to avoid coordination on every write. The tradeoff: if a server crashes mid-range, you lose those IDs. That's fine — we have 3.5 trillion.
Scaling Reads: Add a Cache
With a 100:1 read-write ratio, caching is the single highest-impact optimization. Most shortened URLs follow a Pareto distribution — 20% of URLs get 80% of traffic.
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Client │─────▶│ Load │─────▶│ App │─────▶│ DB │
│ │◀─────│ Balancer │◀─────│ Server │◀─────│ │
└──────────┘ └──────────┘ └────┬─────┘ └──────────┘
│ ▲
▼ │
┌──────────┐
│ Cache │
│ (Redis) │
└──────────┘
Read path:
1. Check cache → hit? Return immediately (< 1ms)
2. Cache miss → query DB → populate cache → return
Cache invalidation is straightforward here because URLs rarely change. A simple TTL of 24-48 hours works. When a URL is deleted or expires, either invalidate eagerly (delete from cache on write) or let the TTL handle it. For a URL shortener, eventual consistency is perfectly acceptable — a deleted short link working for another hour won't hurt anyone.
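That read path is the classic cache-aside pattern. A minimal sketch, with in-memory Maps standing in for Redis and the KV store (names are illustrative):

```typescript
// Cache-aside lookup: check the cache first, fall back to the database
// on a miss, and populate the cache for the next reader.
type Entry = { url: string; cachedAt: number };

const cache = new Map<string, Entry>(); // stand-in for Redis
const db = new Map<string, string>();   // stand-in for the KV store
const TTL_MS = 24 * 3600 * 1000;        // 24h TTL, per the suggestion above

function resolve(code: string): string | undefined {
  const hit = cache.get(code);
  if (hit && Date.now() - hit.cachedAt < TTL_MS) {
    return hit.url; // cache hit: no DB round-trip
  }
  const url = db.get(code); // cache miss: query the database
  if (url !== undefined) {
    cache.set(code, { url, cachedAt: Date.now() }); // populate for next time
  }
  return url;
}
```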
Scaling Writes: Async Processing
At 100M URLs, writes aren't the bottleneck. But at 1B+, you need to think about write throughput. Two techniques help:
Write-ahead log (WAL): Buffer writes to an append-only log (Kafka, Kinesis) and flush to the database asynchronously. The short code is generated synchronously (from the counter), so the client gets an immediate response. The actual database write happens in the background.
Batch writes: Group multiple inserts into a single database transaction. Most KV stores handle batch writes much more efficiently than individual ones.
┌──────────┐ ┌───────────┐ ┌──────────┐ ┌──────────┐
│ Client │─────▶│ App Server │─────▶│ Kafka │─────▶│ DB │
│ │◀─────│ (respond │ │ (WAL) │ │ │
│ short │ │ immediately)│ └──────────┘ └──────────┘
│ code │ └───────────┘
└──────────┘
The tradeoff: there's a brief window where a newly created short URL isn't in the database yet. If someone clicks it before the write flushes, they get a 404. In practice this window is milliseconds, and you can mitigate it by writing to the cache synchronously.
Analytics Without Slowing Redirects
You want click tracking — timestamps, referrers, geo, device. But the redirect must stay fast. Never do analytics synchronously in the redirect path. One subtlety: prefer a 302 over a 301 for tracked links, because browsers cache a 301 permanently and repeat clicks never reach your server to be counted.
Instead, fire and forget: on every redirect, push an event to a message queue (Kafka, SQS). A separate analytics pipeline consumes these events, aggregates them, and writes to an analytics store (ClickHouse, BigQuery). The redirect response returns in under 5ms; the analytics data arrives seconds later.
GET /a3Bf9k
→ Cache hit → return 301
→ Async: push { code, timestamp, ip, userAgent, referrer } to Kafka
This is a pattern you'll see everywhere: separate the hot path from the data path. The redirect is the hot path — it must be fast. The analytics is the data path — it can be eventually consistent.
The analytics pipeline itself is a whole sub-system: Kafka consumers aggregate events in time windows (1-minute buckets), then write rollups to the analytics store. You can then query "how many clicks did this URL get in the last hour?" without scanning raw events. This is the same pattern behind every real-time dashboard you've ever seen.
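That windowed rollup can be sketched as a plain reducer (standing in for a Kafka consumer; the event fields are illustrative):

```typescript
// Tumbling-window aggregation: bucket click events into 1-minute windows
// per short code, the way a stream consumer would before writing rollups
// to the analytics store.
type ClickEvent = { code: string; timestamp: number }; // epoch millis

function rollup(events: ClickEvent[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    const minute = Math.floor(e.timestamp / 60_000); // 1-minute bucket
    const key = `${e.code}:${minute}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

Querying "clicks in the last hour" then means summing at most 60 bucket rows instead of scanning raw events.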
Part 2: Distributed Cache
Now let's build something harder. We're designing a distributed cache — think Redis, but one you'd build when a single Redis instance isn't enough. Maybe you need 500GB of cache across 50 nodes, or you need custom eviction logic, or you're in an environment where managed Redis isn't available.
Why Build Your Own?
Usually, you wouldn't. Redis, Memcached, and managed services cover 95% of cases. But there are legitimate reasons: you need more memory than a single instance supports, you need custom eviction logic tied to your business rules, or your organization has compliance requirements that rule out managed services.
More importantly, a system design interview asks you to build one because it forces you to confront:
- Data partitioning: How do you decide which node holds which key?
- Replication: What happens when a node dies?
- Consistency: Can two clients see different values for the same key?
- Eviction: What do you throw away when memory is full?
These are the same questions every distributed system must answer.
Consistent Hashing
The core problem: given N cache nodes, which node should store key K? Naive hashing (hash(key) % N) breaks catastrophically when you add or remove a node — every key remaps.
Consistent hashing fixes this. Imagine a ring of integers from 0 to 2^32 - 1 (drawn below as a 0-360 circle for readability). Each node is placed on the ring at a position determined by hashing its ID. Each key is also hashed onto the ring, and it's stored on the first node clockwise from its position.
0
│
┌───────┼───────┐
╱ │ ╲
╱ │ ╲
Node A │ Node B
(pos 45) │ (pos 120)
╲ │ ╱
╲ │ ╱
╲ │ ╱
Node C│
(pos 240)
│
360
key "user:42" → hash = 80 → lands between A(45) and B(120) → stored on B
key "post:99" → hash = 200 → lands between B(120) and C(240) → stored on C
key "sess:7" → hash = 300 → lands between C(240) and A(45) → stored on A
If Node B dies:
key "user:42" now maps to C (next clockwise) — only B's keys remap
A and C's keys stay put
The beauty: when a node joins or leaves, only keys between the affected node and its predecessor need to move — roughly 1/N of total keys instead of all of them.
In practice, we use virtual nodes — each physical node gets 100-200 positions on the ring for even distribution.
import { createHash } from "crypto";
class ConsistentHashRing<T> {
private ring = new Map<number, T>();
private sortedKeys: number[] = [];
private virtualNodes: number;
constructor(virtualNodes = 150) {
this.virtualNodes = virtualNodes;
}
private hash(key: string): number {
const h = createHash("md5").update(key).digest();
return h.readUInt32BE(0);
}
addNode(node: T, id: string): void {
for (let i = 0; i < this.virtualNodes; i++) {
const virtualKey = this.hash(`${id}:${i}`);
this.ring.set(virtualKey, node);
this.sortedKeys.push(virtualKey);
}
this.sortedKeys.sort((a, b) => a - b);
}
removeNode(id: string): void {
for (let i = 0; i < this.virtualNodes; i++) {
const virtualKey = this.hash(`${id}:${i}`);
this.ring.delete(virtualKey);
this.sortedKeys = this.sortedKeys.filter((k) => k !== virtualKey);
}
}
getNode(key: string): T | undefined {
if (this.sortedKeys.length === 0) return undefined;
const hash = this.hash(key);
// Find first node position >= hash (binary search in production)
for (const pos of this.sortedKeys) {
if (pos >= hash) return this.ring.get(pos);
}
// Wrap around to first node
return this.ring.get(this.sortedKeys[0]);
}
}

Data Partitioning and Replication
Consistent hashing gives us partitioning for free — each node owns a segment of the ring. For replication, store each key on N consecutive nodes (typically N=3). The first node is the primary, the next two are replicas.
Write "user:42" (replication factor = 3):
1. Hash key → position 80 → primary = Node B
2. Replicate clockwise → Node C, then Node A
3. Return success after W nodes acknowledge (W = 2 for quorum)
Read "user:42":
1. Hash key → primary = Node B
2. Read from R nodes (R = 2 for quorum)
3. If values disagree → return the one with highest timestamp
The quorum formula: W + R > N ensures consistency. With N=3, W=2, R=2, every read will see at least one node that has the latest write. You can tune this — W=1 for faster writes (at the cost of durability), R=1 for faster reads (at the cost of consistency).
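The reconciliation step ("return the one with the highest timestamp") is last-write-wins. A toy sketch — real systems typically use vector clocks or hybrid logical clocks instead of wall-clock timestamps:

```typescript
// Quorum read reconciliation: gather R replica responses and return the
// freshest value by timestamp (last-write-wins).
type Versioned = { value: string; timestamp: number };

function reconcile(responses: Versioned[]): Versioned {
  if (responses.length === 0) throw new Error("no replica responded");
  return responses.reduce((best, r) => (r.timestamp > best.timestamp ? r : best));
}

// The overlap guarantee: W + R > N means every read set intersects
// every write set in at least one node.
function isConsistentQuorum(n: number, w: number, r: number): boolean {
  return w + r > n;
}
```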
Cache Eviction Policies
When memory is full, something has to go. The right policy depends on your access patterns.
| Policy | How It Works | Best For | Weakness |
|---|---|---|---|
| LRU (Least Recently Used) | Evict the key not accessed for the longest time | General purpose, temporal locality | Scan pollution — a one-time full scan evicts hot keys |
| LFU (Least Frequently Used) | Evict the key with the lowest access count | Stable popularity distributions | Slow to adapt — previously hot keys linger |
| TTL (Time To Live) | Evict keys after a fixed duration | Data with known freshness windows | Doesn't consider access patterns at all |
| LRU + LFU hybrid | Frequency-weighted recency (like Redis's allkeys-lfu) | Mixed workloads | More complex to implement and tune |
| Random | Evict a random key | Surprisingly effective in many cases | No guarantees — simple but unpredictable |
For most caches, LRU is the default choice. It's simple, well-understood, and handles the common case (temporal locality) well. LFU is better when you have a stable set of hot keys that might not be accessed recently but should stay cached (think: a product catalog where top items are always queried).
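LRU is also compact to sketch in TypeScript, because `Map` preserves insertion order: deleting and re-inserting a key moves it to the back, so the front of the Map is always the least recently used entry.

```typescript
// Minimal LRU cache built on Map's insertion-order guarantee.
class LRUCache<K, V> {
  private map = new Map<K, V>();
  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    const value = this.map.get(key);
    if (value === undefined) return undefined;
    this.map.delete(key); // move the key to the back (most recently used)
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      // Evict the front of the Map: the least recently used key.
      const lru = this.map.keys().next().value as K;
      this.map.delete(lru);
    }
  }

  has(key: K): boolean {
    return this.map.has(key);
  }
}
```

Both operations are O(1); a classic doubly-linked-list-plus-hash-map implementation gives the same bounds in languages without ordered maps.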
Handling Node Failures
Nodes will fail. Your design needs to handle this gracefully.
Health checks: Every node pings its neighbors on a schedule (every 2-5 seconds). If a node misses 3 consecutive heartbeats, it's marked as suspected-down.
Gossip protocol: Instead of a central monitor, nodes gossip failure information to each other. Node A tells Node B that Node C might be down. If B agrees (it also can't reach C), they propagate this. Once a majority of nodes agree, C is officially dead and its keys are redistributed.
Gossip protocol flow:
Node A ──(C is down)──▶ Node B
Node B ──(C is down)──▶ Node D (B confirms, spreads the word)
Node D ──(C is down)──▶ Node E
Once majority agrees → C's key range is reassigned
→ Node D (next on ring) takes over C's primary keys
→ Replicas are rebuilt on remaining healthy nodes
The advantage over a central monitor: no single point of failure. The tradeoff: convergence takes longer — it might take 10-30 seconds for the entire cluster to agree a node is down. For a cache, that's acceptable. For a database, you'd want faster detection.
Write Policies
How you handle writes relative to the backing store determines your consistency and performance characteristics.
| Policy | How It Works | Latency | Consistency | Data Loss Risk |
|---|---|---|---|---|
| Write-through | Write to cache AND backing store synchronously | High (2 writes) | Strong | None |
| Write-back | Write to cache, async flush to backing store | Low (1 write) | Eventual | Data loss if node dies before flush |
| Write-around | Write directly to backing store, skip cache | Medium | Strong | Cache miss on next read |
Write-through is the safe default. Write-back is what you reach for when write latency matters more than durability (session stores, analytics counters). Write-around makes sense when you write data that's rarely read immediately (log entries, audit trails).
In our distributed cache, write-back with replication is a solid middle ground: you write to 2 of 3 replicas synchronously (fast, durable), and the replicas flush to the backing store asynchronously.
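The three policies differ only in where the synchronous write goes. A compressed sketch, with in-memory Maps standing in for the cache and backing store and the async flush modeled as an explicit `flush()` call:

```typescript
// Write policies side by side. `dirty` tracks write-back entries that
// have not yet been flushed to the backing store.
const cache = new Map<string, string>();
const store = new Map<string, string>();
const dirty = new Set<string>();

function writeThrough(key: string, value: string): void {
  cache.set(key, value);
  store.set(key, value); // both writes on the hot path: slower but safe
}

function writeBack(key: string, value: string): void {
  cache.set(key, value);
  dirty.add(key); // store updated later; lost if the node dies first
}

function writeAround(key: string, value: string): void {
  store.set(key, value); // skip the cache; the next read will miss
}

function flush(): void {
  for (const key of dirty) store.set(key, cache.get(key)!);
  dirty.clear();
}
```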
The Recurring Patterns
Here's the payoff. Look at the two systems we just designed and notice how the same patterns keep appearing:
| Pattern | URL Shortener | Distributed Cache | Shows Up In... |
|---|---|---|---|
| Horizontal scaling | Stateless app servers behind LB | Add/remove cache nodes | Everything |
| Caching | Redis layer for redirect lookups | The system is a cache | Read-heavy systems |
| Load balancing | Distribute requests across app servers | Consistent hashing distributes keys | Everything |
| Data partitioning | Shard DB by key range | Consistent hashing ring | Large datasets |
| Replication | DB replicas for read scaling | N=3 copies per key | Anything that can't lose data |
| Async processing | Analytics events via Kafka | Write-back flushing | Anywhere latency matters |
| Monitoring | Redirect latency, cache hit rate | Node health, eviction rate | Everything (seriously) |
These seven patterns cover roughly 80% of what you need in any system design interview. The other 20% is domain-specific knowledge — rate limiting for API design, conflict resolution for collaborative editing, feed ranking for social media. But the infrastructure patterns are always the same.
The Interview Framework
When you sit down for a system design interview, follow these five steps. Every time. The structure itself signals competence.
Step 1: Requirements (3-5 minutes). Ask clarifying questions. Functional requirements (what does it do?), non-functional requirements (scale, latency, availability), and constraints (existing tech stack, budget). Write them down visibly.
Step 2: API Design (3-5 minutes). Define the core endpoints or interfaces. This forces you to think about the system from the user's perspective. Keep it to 3-5 endpoints — if you have more, you're overcomplicating it.
Step 3: High-Level Design (10-15 minutes). Draw the boxes and arrows. Client, load balancer, application servers, database, cache, message queue. Start simple — three boxes — and add components only when you have a reason to. Every box should earn its place.
Step 4: Deep Dive (10-15 minutes). Pick 2-3 areas to go deep on. The interviewer will often guide you, but if they don't, choose the most interesting or challenging aspects. This is where you discuss partitioning strategies, cache policies, consistency models, and failure handling.
Step 5: Bottlenecks and Tradeoffs (5 minutes). Identify what could break. Single points of failure? Hot partitions? What happens at 10x scale? Always discuss tradeoffs — there are no perfect designs, only tradeoffs you've chosen deliberately.
A few more tips that I've found make a real difference:
- Narrate your thinking. Silence is your enemy. Even when you're unsure, talk through your reasoning — interviewers give credit for structured thinking, not just correct answers.
- Use numbers. Back-of-the-envelope math ("100M URLs at 100 bytes each = 10GB, fits in memory") shows you can reason about scale concretely, not just abstractly.
- Draw before you talk. A quick diagram anchors the conversation and gives the interviewer something to point at when they want to redirect you.
The biggest mistake I see: spending 20 minutes on Step 3 and running out of time before the deep dive. The high-level design should be a sketch, not a masterpiece. The depth is where you demonstrate real understanding.
Takeaway
System design isn't about memorizing architectures — it's about recognizing patterns and knowing which knobs to turn. A URL shortener and a distributed cache look nothing alike, but they share the same DNA: partitioning, replication, caching, async processing, failure handling. Master those patterns once, and every new problem is just a different configuration of the same building blocks. Start with three boxes, add complexity only when you can articulate why, and always talk in tradeoffs — not absolutes.