I Designed Netflix in 45 Minutes Flat - Here's the Exact Blueprint
There is a specific kind of silence that only exists in interview rooms. It is the silence between the interviewer finishing a question and your brain catching up to the fact that yes, they really did just ask you to design one of the most complex distributed systems on the planet.
"Design Netflix."
Two words. Said casually, like they were asking me to grab coffee. The interviewer leaned back, uncapped a dry-erase marker, and slid it across the table toward me. Behind him, a whiteboard the size of a small country waited.
My palms were sweating. My mind was blank. And the clock had already started.
This is the story of how I went from frozen silence to a structured, confident answer. Every architectural decision I walked through that day is a plot point in this guide. By the end, you will be able to stand in front of any whiteboard, pick up a marker, and design Netflix in 45 minutes flat.
The Panic Moment
Let me be honest about what happened in the first 30 seconds. My brain did what every unprepared brain does: it tried to think about everything at once. Video encoding. Databases. CDNs. Recommendation algorithms. Microservices. I could feel my thoughts scattering like marbles on a hardwood floor.
But here is the thing I had drilled into myself during weeks of preparation: the first 30 seconds are not for solving. They are for organizing.
I took a breath, picked up the marker, and said the most important sentence of the entire interview:
"Before I start drawing, can I take two minutes to clarify the requirements and scope?"
The interviewer nodded. And just like that, I had bought myself the most valuable currency in a system design interview: structured thinking time.
If you take nothing else from this post, take this: never start drawing immediately. Candidates who start sketching boxes right away paint themselves into architectural corners fifteen minutes later. Candidates who ask clarifying questions first build systems that actually make sense.
Breaking Down the Beast: Functional vs Non-Functional Requirements
I divided the whiteboard into two columns. On the left, I wrote "Functional Requirements." On the right, "Non-Functional Requirements." This simple act told the interviewer three things: I think in structured ways, I understand that systems serve both users and infrastructure, and I am about to scope this problem down to something solvable in 45 minutes.
Functional Requirements
Here is what I wrote:
- Users can browse and search a catalog of videos. This means we need a content metadata store, a search index, and a way to serve personalized home pages.
- Users can stream videos on demand. This is the core of the entire system. A user clicks play and expects video to start within two seconds, regardless of whether they are in Mumbai, Manhattan, or Melbourne.
- Users can upload videos (content creators/admins). Netflix is not just a player - it is a pipeline. Content goes in raw and comes out encoded in dozens of formats.
- Users get personalized recommendations. The home screen is not the same for any two users. The recommendation engine is what keeps people subscribing month after month.
- Users can pause, resume, and track watch history. This sounds trivial, but it means we need per-user state that persists across devices and sessions.
Non-Functional Requirements
This is where I really got the interviewer's attention, because most candidates skip this entirely.
- High availability (99.99% uptime). Netflix cannot go down. A minute of downtime costs millions in subscriber trust.
- Low latency for video playback. Time-to-first-byte for video content needs to be under 200 milliseconds for a good experience.
- Massive scale. We are designing for 200+ million subscribers, with peak concurrent streams in the tens of millions.
- Global reach. Content must be served with low latency from every continent.
- Fault tolerance. Individual service failures should not cascade. If the recommendation engine goes down, users should still be able to search and stream.
The interviewer gave a slight nod when I mentioned fault tolerance. That was the signal that I was on the right track. In system design interviews, non-functional requirements are where you separate yourself from the pack. Anyone can draw boxes. Not everyone can articulate why those boxes need to be resilient.
The Video Pipeline Nobody Talks About
Here is a dirty secret about system design interviews: most candidates spend all their time on the streaming architecture and completely ignore the upload and processing pipeline. But the pipeline is where Netflix's real engineering brilliance lives - and talking about it shows depth that interviewers love.
The Upload Path
When a studio delivers a master file to Netflix, that file might be a 4K HDR master weighing in at several hundred gigabytes. That raw file is useless for streaming. It needs to be transformed, and the transformation process is one of the most computationally expensive operations in all of tech.
Here is the pipeline I drew on the whiteboard:
Raw Upload → Object Storage (S3) → Transcoding Service → Encoded Segments → CDN Distribution
Each step deserves its own discussion.
Transcoding: The Hidden Monster
Transcoding is the process of converting a video from one format and resolution to many. Netflix does not serve you one version of a movie. They serve you one of potentially thousands of versions, optimized for your specific device, network speed, and display capabilities.
Here is what the transcoding service needs to produce from a single source file:
- Multiple resolutions: 240p, 360p, 480p, 720p, 1080p, 4K
- Multiple bitrates per resolution: A 1080p stream at 3 Mbps looks different from 1080p at 8 Mbps
- Multiple codecs: H.264 for broad compatibility, H.265/HEVC for better compression, AV1 for next-gen efficiency
- Multiple audio tracks: Different languages, different quality levels, Dolby Atmos vs stereo
I explained to the interviewer that Netflix uses a concept called per-title encoding. Instead of applying the same bitrate ladder to every piece of content, they analyze each title individually. An animated movie compresses much more efficiently than a live-action thriller with lots of fast camera movement. By customizing the encoding profile per title, Netflix saves bandwidth without sacrificing quality.
The transcoding itself happens on a distributed cluster of workers. Each worker picks up a chunk of the video, encodes it, and writes the result back to object storage. The chunks are coordinated by a job orchestrator - think of it as a conductor directing an orchestra where each musician is encoding a different five-second segment of the film.
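To make the worker side concrete, here is a minimal Python sketch of what one of those orchestrated workers might do. The job shape and `storage` interface are hypothetical placeholders, and the ffmpeg invocation is a simplified stand-in for a real encoding profile - but note the key property: output lands at a deterministic path derived from the job ID, so a retried job that finds its output already present is a no-op.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class ChunkJob:
    job_id: str        # unique per (title, chunk, encoding profile)
    source_path: str   # local copy of the raw chunk pulled from object storage
    profile: dict      # target codec / resolution / bitrate

def output_key(job: ChunkJob) -> str:
    # Deterministic output path derived from the job ID: the foundation
    # of idempotent reprocessing after a worker crash.
    p = job.profile
    return f"encoded/{job.job_id}/{p['codec']}_{p['height']}p_{p['bitrate_kbps']}k.mp4"

def process(job: ChunkJob, storage) -> None:
    key = output_key(job)
    if storage.exists(key):        # already encoded by an earlier attempt: no-op
        return
    out_file = f"/tmp/{job.job_id}.mp4"
    subprocess.run(
        ["ffmpeg", "-y", "-i", job.source_path,
         "-c:v", job.profile["codec"],                 # e.g. libx264
         "-b:v", f"{job.profile['bitrate_kbps']}k",
         "-vf", f"scale=-2:{job.profile['height']}",   # scale to target height
         out_file],
        check=True,
    )
    storage.upload(out_file, key)  # write the encoded segment back to object storage
```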
Adaptive Bitrate Streaming: HLS and DASH
Once the video is encoded into multiple quality levels, we need a way for the client to dynamically switch between them based on network conditions. This is where adaptive bitrate streaming (ABR) comes in, and it is one of the most elegant pieces of the Netflix architecture.
The two dominant protocols are:
HLS (HTTP Live Streaming): Developed by Apple. The video is split into small segments (typically 2-10 seconds each), and a manifest file (.m3u8) lists all available quality levels and their segment URLs. The client reads the manifest, picks a quality level, and starts downloading segments. If the network slows down, the client seamlessly drops to a lower quality. If bandwidth improves, it steps back up.
DASH (Dynamic Adaptive Streaming over HTTP): An open standard that works similarly to HLS but uses an MPD (Media Presentation Description) XML manifest instead. DASH is codec-agnostic and supports more flexible segment durations.
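To make the manifest concept concrete, here is a hand-written example of what a simplified HLS master playlist might look like (illustrative values, not a real Netflix manifest). Each EXT-X-STREAM-INF entry advertises one rung of the bitrate ladder, and the client picks among them:

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=1500000,RESOLUTION=640x360,CODECS="avc1.4d401e,mp4a.40.2"
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=1280x720,CODECS="avc1.4d401f,mp4a.40.2"
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=8000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
1080p/playlist.m3u8
```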
I told the interviewer that Netflix uses DASH on most devices but falls back to HLS on Apple devices that require it. The key insight is that both protocols work over plain HTTP, which means they play nicely with CDN caching. Every segment is just an HTTP resource that can be cached at the edge.
The client-side ABR algorithm is surprisingly sophisticated. It considers buffer levels, historical throughput, device capabilities, and even the complexity of upcoming video segments. Netflix has published papers on their buffer-based approach (BBA), which prioritizes keeping the playback buffer full over chasing the highest possible quality. This prevents rebuffering events - the single biggest predictor of user churn.
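As a rough illustration of the buffer-based idea - my own simplified sketch, not Netflix's published BBA algorithm - a client-side rate selector might map buffer occupancy to a rung of the bitrate ladder and only chase quality when the buffer is healthy:

```python
BITRATE_LADDER_KBPS = [235, 750, 1750, 3000, 8000]   # illustrative rungs, low to high

def pick_bitrate(buffer_seconds: float, throughput_kbps: float) -> int:
    """Protect against rebuffering first; step up toward measured throughput second."""
    if buffer_seconds < 5:
        return BITRATE_LADDER_KBPS[0]                 # danger zone: lowest rung
    # Rungs the network can sustain with ~25% headroom.
    sustainable = [b for b in BITRATE_LADDER_KBPS if b * 1.25 <= throughput_kbps]
    if not sustainable:
        return BITRATE_LADDER_KBPS[0]
    if buffer_seconds < 15:
        return sustainable[:2][-1]                    # cautious: at most the second rung
    return sustainable[-1]                            # healthy buffer: highest sustainable
```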
CDN: Your Secret Weapon
This is the section where I watched the interviewer lean forward. CDN architecture is the backbone of any video streaming system, and explaining it well demonstrates that you understand how the internet actually works - not just how software works.
Why CDNs Matter
Imagine a user in Tokyo trying to stream a movie. If that movie is stored on a server in Virginia, every single video segment has to travel across the Pacific Ocean. The round-trip time alone could be 150-200 milliseconds, and that is before we account for packet loss, congestion, and TCP slow-start. The result: buffering, stuttering, and a user who cancels their subscription.
A CDN solves this by caching content at edge servers located close to users. Netflix operates its own CDN called Open Connect, which is one of the largest CDNs in the world. They place Open Connect Appliances (OCAs) directly inside ISP networks. When a user in Tokyo hits play, the video segments come from a server that is literally inside their ISP's data center, not from across an ocean.
The CDN Architecture I Drew
Here is the hierarchy I put on the whiteboard:
- Origin Servers: The source of truth. All encoded video segments live in object storage (like S3) in a few central regions. The origin is the fallback when edge servers do not have the content.
- Regional Edge Caches: Intermediate caching layer. These sit in major metro areas and cache the most popular content for their region. A regional cache in Singapore might hold the top 5,000 titles popular in Southeast Asia.
- ISP-Embedded OCAs: The final mile. These are physical servers installed inside ISP facilities, holding the most popular content for that specific ISP's subscriber base. This is where most Netflix traffic is actually served from.
Cache Warming and Invalidation
Content does not magically appear on edge servers. Netflix uses predictive cache warming - when a new title is released, the system pre-populates edge caches in regions where that title is predicted to be popular. For a new season of a Korean drama, the caches in South Korea, Southeast Asia, and other high-interest regions are warmed hours before the release.
Cache invalidation in a CDN is simpler than in a traditional cache because video content is immutable. Once a segment is encoded, it never changes. If Netflix needs to update content (say, to fix a subtitle error), they encode a new version with a new segment URL. The old segments expire naturally via TTL, and the new ones are fetched on demand.
I mentioned this immutability property to the interviewer. He asked about live streaming, and I explained that live streaming changes the game - segments are generated in real-time, TTLs are measured in seconds, and cache coherence becomes much harder. But for a VOD system like Netflix, immutable content is a massive architectural advantage.
The Database Layer That Makes or Breaks You
This is where many candidates stumble, because the instinct is to pick one database and move on. But Netflix has at least four distinct data domains, each with different access patterns and consistency requirements. Picking the right storage for each domain is what separates a good answer from a great one.
User Data (SQL - PostgreSQL/MySQL)
User accounts, subscription status, payment history, device registrations. This data is highly relational (a user has subscriptions, subscriptions have payment methods, payment methods have transaction histories) and requires strong consistency. You cannot have a race condition where a user's subscription shows as active in one service and expired in another.
I recommended a relational database like PostgreSQL for this domain, with read replicas for scaling read traffic and a primary-secondary setup for write availability. The dataset is large but not enormous - 200 million user records is well within what a properly sharded PostgreSQL cluster can handle.
Content Metadata (Document Store - MongoDB or DynamoDB)
Movie titles, descriptions, cast information, genre tags, thumbnails, trailer URLs. This data is read-heavy (millions of reads per second as users browse the catalog) and has a flexible schema (different content types have different metadata fields - a movie has a runtime, a series has seasons and episodes, a documentary might have interview subjects).
A document store like MongoDB or DynamoDB is ideal here. The data is denormalized for fast reads, and eventual consistency is acceptable - if a new movie's description takes a few seconds to propagate, nobody notices.
Viewing History and Watch Progress (Wide-Column Store - Cassandra)
Every time a user watches, pauses, rewinds, or finishes a title, that event is recorded. This data is write-heavy (hundreds of millions of events per day), time-series in nature, and needs to be partitioned by user for fast lookups.
Apache Cassandra is a natural fit. Its partition key model maps perfectly to user-based access patterns. You partition by user ID, and each partition contains that user's chronological viewing history. Writes are fast because Cassandra is optimized for sequential writes (LSM tree storage). Reads for a single user's history are fast because all their data lives in one partition.
I explained the partition key strategy to the interviewer: (user_id) as the partition key, (timestamp) as the clustering key, sorted in descending order so the most recent viewing activity comes first. This means the query "show me what User X watched recently" is a single-partition read - the fastest possible query in Cassandra.
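In CQL, that table might look like the following sketch (table and column names are mine, purely illustrative):

```
CREATE TABLE viewing_history (
    user_id     uuid,
    event_time  timestamp,
    title_id    uuid,
    event_type  text,   -- play, pause, resume, complete
    position_s  int,    -- playback position in seconds
    PRIMARY KEY ((user_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- "What did User X watch recently?" is a single-partition read:
-- SELECT * FROM viewing_history WHERE user_id = ? LIMIT 20;
```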
Search Index (Elasticsearch)
Users need to search for titles by name, actor, director, genre, and natural language queries like "funny movies with robots." A dedicated search engine like Elasticsearch provides full-text search, fuzzy matching, and faceted filtering.
The search index is populated asynchronously from the content metadata store via a change data capture (CDC) pipeline. When a new title is added or metadata is updated, an event is published to a message queue (Kafka), and a consumer updates the Elasticsearch index.
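A minimal consumer for that pipeline might look like this sketch, assuming the kafka-python and official Elasticsearch client libraries; the topic name, index name, and message shape are illustrative assumptions:

```python
import json
from kafka import KafkaConsumer            # kafka-python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "catalog-metadata-changes",            # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    change = message.value
    # Upsert by title_id so redelivered events are harmless (idempotent indexing).
    es.index(index="titles", id=change["title_id"], document=change["metadata"])
```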
The interviewer asked me about the consistency model between the metadata store and the search index. I explained that it is eventually consistent with a typical lag of a few seconds. For a content catalog that changes at most a few hundred times per day, this is perfectly acceptable.
Recommendation Engine - The Bonus Round
By the time I reached this section, I was 30 minutes into the interview and feeling good. The architecture was solid, the whiteboard was filling up with clearly connected components, and the interviewer's body language was positive. The recommendation engine is where I went for extra credit.
Why It Matters Architecturally
Netflix has said publicly that their recommendation system saves them over $1 billion per year by reducing churn. If a user opens the app and does not find something interesting within 60-90 seconds, they close it. Do that a few times, and they cancel. The recommendation engine's job is to make sure every user sees content they will love, immediately.
From a system design perspective, the recommendation engine is interesting because it combines offline batch processing with real-time inference.
Collaborative Filtering
The classic approach. "Users who are similar to you watched these titles." The algorithm builds a matrix of users vs. titles, where each cell represents a rating or implicit signal (watch time, completion rate, thumbs up/down). It then finds users with similar preference vectors and recommends titles that similar users enjoyed but the target user has not seen.
The key challenge is scale. A 200-million-user by 50,000-title matrix is enormous. Netflix uses matrix factorization techniques (like ALS - Alternating Least Squares) to decompose this into smaller latent factor matrices. These are computed offline on a Spark cluster and stored in a feature store for real-time retrieval.
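Spark ships an ALS implementation, so the offline job can be surprisingly compact. Here is a sketch under assumed column names and an illustrative input path:

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("offline-recs").getOrCreate()

# Implicit-feedback signals: (user_id, title_id, strength), where strength might
# be derived from watch time and completion rate. ALS requires integer IDs,
# so user and title identifiers are assumed to be pre-mapped.
signals = spark.read.parquet("s3://my-bucket/viewing_signals/")   # illustrative path

als = ALS(
    userCol="user_id", itemCol="title_id", ratingCol="strength",
    rank=64,                    # dimensionality of the latent factor space
    implicitPrefs=True,         # treat signals as confidence, not explicit ratings
    coldStartStrategy="drop",
)
model = als.fit(signals)

# The decomposed latent factor matrices, headed for the feature store.
user_factors = model.userFactors    # DataFrame of (id, features)
item_factors = model.itemFactors
```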
Content-Based Filtering
"Because you watched titles with these attributes, you might like other titles with similar attributes." This approach uses metadata features - genre, director, cast, keywords, even visual style (Netflix has actually trained models on video frames to extract visual similarity features).
Content-based filtering works well for new titles that do not yet have enough viewing data for collaborative filtering (the cold-start problem). It also provides explainability: "Because you watched Inception" is a recommendation reason users can understand.
The Hybrid Approach
In practice, Netflix uses a hybrid system that combines collaborative filtering, content-based filtering, and contextual signals (time of day, device type, recent viewing patterns). The final ranking is produced by a machine learning model that takes features from all three approaches and outputs a ranked list of titles.
I drew this as a three-layer architecture:
- Offline Layer (Batch): Runs nightly or hourly on Spark. Produces user embeddings, item embeddings, and candidate sets. These are stored in a feature store.
- Nearline Layer (Streaming): Consumes real-time events (user just watched a horror movie, user searched for "comedy") via Kafka and updates lightweight features in a real-time feature store.
- Online Layer (Request-time): When the user opens the app, a ranking service pulls precomputed candidates from the offline layer, enriches them with real-time features from the nearline layer, and runs a final ranking model to produce the personalized home page (see the sketch after this list).
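Here is a minimal sketch of that request-time flow. The three dependencies - offline candidate store, real-time feature store, and ranking model - are hypothetical interfaces standing in for real infrastructure:

```python
def homepage(user_id: str, offline_store, realtime_store, ranker) -> list[str]:
    """One request-time pass: precomputed candidates + fresh context + final ranking."""
    candidates = offline_store.get_candidates(user_id)    # from the nightly batch job
    context = realtime_store.get_features(user_id)        # e.g. tonight's searches
    scored = [
        (title_id, ranker.score(user_id, title_id, context))
        for title_id in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [title_id for title_id, _ in scored[:40]]      # one home page of titles
```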
The interviewer asked how I would handle a brand-new user with no viewing history. I explained the cold-start strategy: use demographic data from sign-up (age, location), trending content in their region, and content they explicitly select during the onboarding flow ("Pick 3 titles you like") to bootstrap their profile. After a few viewing sessions, the collaborative and content-based signals take over.
The Interviewer's Follow-Up Traps
The core design was done. But here is where the interview shifted from "can you design a system" to "can you think about what happens when things go wrong." This is the gauntlet that separates Senior from Staff-level answers.
Trap 1: "How Do You Scale the Transcoding Pipeline?"
A single popular title needs to be transcoded into hundreds of variants, and on a big release day, dozens of titles might drop simultaneously. A fixed-size transcoding cluster would either be over-provisioned (expensive) or under-provisioned (slow).
My answer: autoscaling worker pools on cloud infrastructure. The transcoding service uses a job queue (SQS or Kafka). Workers pull jobs from the queue, encode the segment, and write the result to object storage. If the queue depth grows beyond a threshold, the autoscaler spins up more workers. When the queue drains, workers are terminated. This is a classic elastic compute pattern.
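The control logic itself is small. A sketch, with the throughput constant and bounds as illustrative assumptions:

```python
import math

JOBS_PER_WORKER = 4          # assumed steady-state throughput per worker
MIN_WORKERS, MAX_WORKERS = 2, 500

def desired_workers(queue_depth: int) -> int:
    """Scale the transcoding pool to the backlog, clamped to sane bounds."""
    wanted = math.ceil(queue_depth / JOBS_PER_WORKER)
    return max(MIN_WORKERS, min(MAX_WORKERS, wanted))

# A reconciliation loop would periodically read the queue depth (for SQS, the
# ApproximateNumberOfMessages attribute) and adjust the pool toward this target.
```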
I also mentioned that the transcoding pipeline should be idempotent. If a worker crashes mid-encode, another worker can pick up the same job and reprocess it without producing corrupt or duplicate output. Each job has a unique ID, and the output is written to a deterministic path based on that ID. If the file already exists, the job is a no-op.
Trap 2: "What Happens When a Region Goes Down?"
This is the fault tolerance question, and Netflix is famous for their answer: Chaos Engineering. They built Chaos Monkey, which randomly kills instances in production to ensure the system can tolerate failures. They even have Chaos Kong, which simulates an entire region going offline.
I explained the multi-region architecture: Netflix runs in at least three AWS regions. Each region is a full deployment of all services. User traffic is routed to the nearest healthy region via DNS-based global load balancing (Route 53). If one region goes down, DNS is updated to route traffic to the next nearest region.
The critical consideration is stateful failover. Stateless services fail over trivially - just route requests elsewhere. But stateful services (like viewing history) need cross-region replication. Cassandra shines here: it supports multi-datacenter replication natively, with configurable consistency levels. Use LOCAL_QUORUM for low-latency writes that replicate to other regions asynchronously. For writes that must be durable everywhere before they are acknowledged, step up to EACH_QUORUM, which waits for a quorum in every datacenter; reads that must see the latest data across regions can use a global QUORUM.
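With the DataStax Python driver, that per-statement consistency choice looks like this sketch (contact point, keyspace, and the second table are illustrative):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["10.0.0.1"]).connect("streaming")   # illustrative contact point

# Hot path: quorum within the local datacenter only; other regions catch up async.
record_event = SimpleStatement(
    "INSERT INTO viewing_history (user_id, event_time, title_id, event_type, position_s) "
    "VALUES (%s, %s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)

# Critical path: require a quorum in every datacenter before acknowledging.
critical_write = SimpleStatement(
    "INSERT INTO account_flags (user_id, flag, value) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.EACH_QUORUM,
)
# session.execute(record_event, (user_id, now, title_id, "pause", 1042))
```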
Trap 3: "How Do You Handle a Viral Premiere - Say, a New Season of Stranger Things?"
This is the thundering herd problem. Millions of users hit play at the exact same moment. The CDN edge caches might not have all segments pre-warmed. The metadata service gets hammered with concurrent requests. The recommendation engine is irrelevant - everyone wants the same title.
My approach:
- Aggressive cache warming. Days before the premiere, warm all edge caches with the first several episodes, prioritizing ISP-embedded OCAs in regions with high pre-save counts.
- Request coalescing at the CDN layer. If a thousand users request the same segment simultaneously and it is not cached, the CDN should make one request to the origin and serve all thousand from the response. This prevents origin overload (sketched in code after this list).
- Graceful degradation. If the recommendation engine is overwhelmed, serve a static "trending now" page. If the search service lags, increase its cache TTL. Identify which features are critical (video playback) and which can degrade (personalized thumbnails, social features).
- Queue-based admission control. For extreme cases, implement a virtual waiting room. Users see "You're in line, estimated wait: 2 minutes" rather than error pages.
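Request coalescing is easy to describe and fiddly to implement. Here is a minimal single-process sketch of the idea (a real CDN does this across connections and nodes, with error handling this version elides):

```python
import threading

class Coalescer:
    """Collapse concurrent cache misses for one key into a single origin fetch."""
    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._inflight: dict[str, threading.Event] = {}
        self._results: dict[str, bytes] = {}

    def get(self, key: str) -> bytes:
        with self._lock:
            if key in self._results:               # already cached locally
                return self._results[key]
            event = self._inflight.get(key)
            leader = event is None
            if leader:                             # first requester does the fetch
                event = threading.Event()
                self._inflight[key] = event
        if leader:
            self._results[key] = self._fetch(key)  # the one origin request
            event.set()                            # wake every waiting follower
            with self._lock:
                del self._inflight[key]
        else:
            event.wait()                           # followers reuse the result
        return self._results[key]
```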
The interviewer seemed particularly pleased with the request coalescing point. It is a detail that shows you understand CDN internals beyond the textbook level.
Trap 4: "How Do You Monitor All of This?"
Observability. I briefly described the three-pillar approach: metrics (Prometheus/Grafana for throughput, latency percentiles, CDN cache hit ratios), logs (structured logging via ELK for debugging specific failures), and distributed traces (Jaeger/Zipkin for following a single request through the service mesh). Tie it all together with SLOs - video start time under 2 seconds at p99, rebuffer ratio below 0.5%, API error rate under 0.01%. SLOs drive alerting and on-call priorities.
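Instrumentation for those SLOs is mostly boilerplate. A sketch using the prometheus_client library, with metric names and bucket boundaries as my own choices:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Video start time, with buckets straddling the 2-second SLO boundary.
video_start_seconds = Histogram(
    "video_start_seconds", "Time from play click to first frame",
    buckets=[0.25, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0],
)
rebuffer_events = Counter("rebuffer_events_total", "Playback stalls mid-stream")
playback_sessions = Counter("playback_sessions_total", "Playback sessions started")

start_http_server(9100)   # expose /metrics for Prometheus to scrape

def on_playback_start(elapsed_seconds: float) -> None:
    playback_sessions.inc()
    video_start_seconds.observe(elapsed_seconds)
```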
The Final Whiteboard
By the time 45 minutes were up, my whiteboard told a complete story. At the top, users and their devices. Below that, a global load balancer routing to regional API gateways. The gateway fanning out to microservices: catalog service, user service, streaming service, recommendation service, search service. Below the services, the data layer: PostgreSQL for users, DynamoDB for metadata, Cassandra for viewing history, Elasticsearch for search, a Redis cache layer for hot data. Off to the side, the video pipeline: upload, transcoding workers, object storage, CDN distribution. And at the bottom, the CDN hierarchy: origin, regional edges, ISP-embedded OCAs.
Every box had a label. Every arrow had a protocol (HTTP, gRPC, async via Kafka). Every data store had a justification. And most importantly, every design decision had a why.
The interviewer shook my hand and said, "That was thorough."
Two weeks later, I got the offer.
What Made It Work
Looking back, the success came down to four principles:
- Start with requirements, not architecture. Those two minutes of clarification saved me from building the wrong system.
- Go deep on one or two areas. I went deep on the video pipeline and CDN, which showed technical depth without trying to boil the ocean.
- Address non-functional requirements explicitly. Availability, latency, scale, fault tolerance - they are the reason the architecture exists.
- Anticipate follow-ups. Every design decision I made, I had a "what if it breaks" answer ready.
For a related deep dive into database strategies, check out our post on database sharding with an Instagram case study. If the follow-up traps section resonated with you, our guide on rate limiting and circuit breaker patterns covers defensive architecture in detail. And for a framework on making confident scaling decisions without falling into buzzword traps, see how to identify bottlenecks in any system design interview.
Now go find a whiteboard and practice. The next time someone says "Design Netflix," you will not panic. You will pick up the marker and smile.
FAQ
How long should a system design answer take?
Most system design interviews are 45 to 60 minutes, but you will not have the full time for your answer. Expect to spend the first 3 to 5 minutes on requirement clarification, 25 to 30 minutes on your core design, and the remaining 10 to 15 minutes fielding follow-up questions. The worst thing you can do is spend 40 minutes on a perfect design and have no time for follow-ups - interviewers use those questions to gauge your depth. Pace yourself by setting mental checkpoints: requirements done by minute 5, high-level architecture by minute 15, deep dives by minute 30.
What is the difference between HLS and DASH?
HLS (HTTP Live Streaming) was developed by Apple and uses .m3u8 playlist files as manifests and .ts (MPEG-2 Transport Stream) or .fmp4 segments. It is the dominant protocol on Apple devices and has broad browser support. DASH (Dynamic Adaptive Streaming over HTTP) is an international open standard (ISO/IEC 23009-1) that uses XML-based MPD (Media Presentation Description) manifests and supports any codec, making it more flexible. The key practical difference is ecosystem: HLS is mandatory for iOS/Safari, while DASH is preferred on Android and smart TVs. Netflix uses DASH as their primary protocol but falls back to HLS where required. Both protocols achieve the same goal - adaptive bitrate streaming over HTTP - so in an interview, mentioning either one (and explaining why) is sufficient.
Should I draw the CDN layer first?
Not first, but do not leave it for last either. Start with your high-level architecture showing clients, load balancers, application services, and data stores. Once that skeleton is in place, add the CDN as a layer between clients and your origin servers. Drawing the CDN early shows the interviewer you understand that the majority of Netflix traffic (video segments) never touches your application servers - it is served entirely from edge caches. A common mistake is treating the CDN as an afterthought or a single box labeled "CDN." Instead, show the hierarchy: origin, regional edges, ISP-embedded caches. That level of detail demonstrates real understanding of content delivery.
How does Netflix handle 200M concurrent streams?
The short answer is that most of the heavy lifting is done by the CDN, not by centralized servers. Netflix's Open Connect CDN serves over 95% of video traffic from edge servers embedded inside ISP networks. This means the central infrastructure only handles metadata requests (catalog browsing, search, recommendations) and the video pipeline (upload, transcoding, CDN distribution). Even those centralized services are distributed across multiple AWS regions with autoscaling. For the metadata path, Netflix uses a microservices architecture with over 1,000 services, each independently scalable. The streaming path is almost entirely offloaded to edge infrastructure. So while "200 million concurrent streams" sounds terrifying, no single server or cluster is handling all of them - the load is distributed across thousands of edge locations worldwide, each serving their local subscribers.