System Design Scaling: Stop Using Buzzwords, Start Measuring Bottlenecks
Here is a scene that plays out in thousands of software engineering interviews every week.
The interviewer says: "Design a URL shortener."
The candidate says: "Sure. So we'll use microservices, Kubernetes for orchestration, a distributed cache with Redis, Cassandra for the database because it scales horizontally, and we'll put a CDN in front."
The interviewer says: "Why Cassandra?"
The candidate pauses. "Because... it scales?"
The interview goes sideways from there.
This is the buzzword trap. It happens because candidates have watched enough system design videos that they've memorized the vocabulary of scalable systems without developing the judgment to apply it. They know what the components are. They don't know when to use them. And interviewers -- especially at FAANG companies -- are specifically testing for the when.
This guide gives you the mental model that separates candidates who get offers from candidates who walk away confused about what went wrong.
The Core Principle: Every Scaling Decision Answers a Specific Question
Before you add any component to your system design -- a cache, a CDN, a message queue, a separate microservice -- you need to answer this question:
"What measured or estimated constraint is this component solving?"
If you can't answer that question with a specific number or ratio, you're adding the component because it sounds impressive, not because the system needs it. Interviewers can tell the difference immediately.
The good news: there are only three fundamental constraints that force scaling decisions. Once you can identify which one your system will hit first, every subsequent architectural decision falls into place logically.
The Three Bottlenecks That Drive Every Scaling Decision
Bottleneck 1: Database I/O
This is the most common bottleneck in system design interviews, and for good reason -- databases are almost always the first thing to buckle under load.
The key question to ask yourself: What is this system's read-to-write ratio?
This single number tells you more about what your architecture needs than any other metric.
A URL shortener has an extremely high read-to-write ratio -- probably 100:1 or higher. For every person who creates a short URL, hundreds or thousands of people click it. This means your database is being hammered with reads. Under load, your primary database starts queuing read requests, latency climbs, and eventually the whole system degrades.
The solution? A read cache. You add Redis between your application layer and your database, store the most frequently accessed URL mappings in memory, and serve the vast majority of read requests without touching the database at all. Cache hit rates of 90%+ are realistic for a URL shortener because popular URLs get clicked disproportionately often -- a natural power law distribution.
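Here is what that looks like in practice -- a minimal sketch of the cache-aside read path, assuming a locally running Redis with the redis-py client. `FAKE_DB` stands in for the real backing store:

```python
import redis

# Minimal cache-aside sketch (assumes a local Redis and redis-py).
# FAKE_DB is an illustrative stand-in for the real database, e.g. Postgres.
FAKE_DB = {"abc123": "https://example.com/some/very/long/path"}
CACHE_TTL_SECONDS = 3600

r = redis.Redis(host="localhost", port=6379)

def resolve_url(short_code: str) -> str | None:
    # 1. Try the cache first. With a 100:1 read-to-write ratio and a
    #    power-law access pattern, roughly 90% of requests stop here.
    cached = r.get(short_code)
    if cached is not None:
        return cached.decode()

    # 2. Cache miss: fall through to the database (the remaining ~10%).
    long_url = FAKE_DB.get(short_code)
    if long_url is not None:
        # 3. Populate the cache; the TTL plus an LRU eviction policy
        #    keeps memory use bounded.
        r.set(short_code, long_url, ex=CACHE_TTL_SECONDS)
    return long_url
```

At 10,000 reads per second and a 90% hit rate, the database fallback in step 2 runs about 1,000 times per second -- the number the interview answer below is built on.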
The right way to say this in an interview:
"Based on my capacity estimate, we're handling approximately 10,000 reads per second. A typical Postgres instance can handle around 5,000-10,000 simple queries per second under optimal conditions, but at 10K reads with variable latency, we'll start seeing degradation. Given that our read-to-write ratio is approximately 100:1, a Redis cache with an LRU eviction policy will serve most reads from memory. If we assume a 90% cache hit rate, we're down to 1,000 database reads per second -- well within comfortable limits for a single primary with a read replica."
That answer gets heads nodding. "We should add a cache" gets a follow-up question you don't want to answer.
When database I/O is NOT your bottleneck:
A real-time chat application has a very different read-to-write ratio. Users are constantly sending and receiving messages -- the ratio might be 1:1 or even skewed toward writes. A cache doesn't help much when the data is constantly changing. Here, you're more likely to hit bottlenecks around write throughput, and the solution looks different: write-optimized databases, message queues to absorb write spikes, and potentially database sharding.
Common database scaling patterns and when to apply each:
Read replicas -- when read traffic is high and data doesn't need to be real-time consistent. Cost: replication lag, increased operational complexity.
Caching layer (Redis/Memcached) -- when read-to-write ratio is high (greater than 10:1) and cache invalidation is manageable. Cost: cache invalidation complexity, eventual consistency.
Database sharding -- when a single database instance can't handle total data volume or write throughput, even with replicas. Cost: significant operational complexity, cross-shard queries become painful. Use this later, not first (see the routing sketch after this list).
Write-ahead log / event sourcing -- when you need durability guarantees for write-heavy workloads and are willing to trade query flexibility for write performance. Cost: query complexity, operational overhead.
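To make the sharding item concrete, here is a minimal routing sketch -- the shard count and the stable-hash choice are illustrative, not prescriptive:

```python
import hashlib

NUM_SHARDS = 8  # illustrative; chosen up front and hard to change later

def shard_for(key: str) -> int:
    """Route a key to a shard with a stable hash.

    Python's built-in hash() is randomized per process, so a digest
    is used instead to keep routing deterministic across machines.
    """
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# In a real system each index maps to its own database connection.
print(shard_for("abc123"))  # the same key always routes to the same shard
```

The modulo also shows why the "use this later" warning exists: change `NUM_SHARDS` and nearly every key remaps to a different shard, which is exactly the resharding pain that consistent hashing schemes were designed to reduce.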
Bottleneck 2: Network Bandwidth
This bottleneck surprises candidates who haven't thought about payload sizes.
The question to ask: How large is the average payload this system transfers, and how many are transferred per second?
Most CRUD APIs deal with JSON responses that are a few kilobytes. Network is not a bottleneck here -- even a modestly provisioned server handles tens of thousands of small JSON responses per second without approaching bandwidth limits.
But when your system transfers media -- images, audio, video -- the math changes dramatically.
Consider an image hosting service. If your average image is 500KB and you're serving 10,000 images per second, that's 5GB/s of outbound bandwidth. A typical server uplink is 1Gbps (125MB/s). You'd need 40 servers just to handle bandwidth, even if the CPUs were completely idle. This is a network bottleneck, and the only practical solution is distributing your content.
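The back-of-the-envelope math, written out (decimal units, as bandwidth is usually quoted):

```python
# Back-of-the-envelope bandwidth check (decimal units: 1 GB = 10**9 bytes).
avg_image_bytes = 500 * 10**3           # 500 KB per image
requests_per_sec = 10_000

egress = avg_image_bytes * requests_per_sec
print(egress / 10**9)                   # 5.0 GB/s outbound from origin

uplink_bytes_per_sec = 10**9 / 8        # 1 Gbps uplink = 125 MB/s

print(egress / uplink_bytes_per_sec)    # 40.0 servers for bandwidth alone
```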
This is where a CDN earns its existence -- not because "CDNs are good" but because you've done the math and discovered that no reasonable number of origin servers can deliver your content to geographically distributed users fast enough without one.
The right way to say this in an interview:
"Let me estimate the bandwidth requirements. We're targeting 10,000 image requests per second with an average image size of 500KB. That's 5GB/s of outbound bandwidth from our origin servers. A 10Gbps uplink gives us 1.25GB/s of raw capacity -- we'd need at least four origin servers just to saturate our bandwidth, with no headroom for traffic spikes. More importantly, latency for users in Southeast Asia hitting servers in US-East would be 200-300ms just from network round-trip time, before we even serve the first byte. A CDN with edge nodes solves both problems: bandwidth is distributed across hundreds of edge locations, and users are served from a node geographically close to them."
When network is NOT your bottleneck:
An internal analytics dashboard querying aggregated data returns responses measured in kilobytes. Network is nowhere near the constraint. Mentioning a CDN here signals that you don't understand what CDNs actually solve.
Network scaling patterns:
CDN -- for static assets and media that don't change per-user. Moves content physically closer to users. Not useful for dynamic, personalized content.
Compression -- gzip or Brotli for text-based responses, WebP instead of PNG for images. Reduces payload size before it hits the network (see the quick demonstration after this list).
Chunked/streaming delivery -- for large files, stream in chunks rather than buffering the whole response. Reduces time-to-first-byte and allows progressive rendering.
Edge computing -- for situations where you need both low latency AND dynamic content. Run lightweight computation at CDN edge nodes. Handles personalized content that a pure CDN can't serve.
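To show how much compression buys you on text payloads, here is a quick, self-contained demonstration using Python's standard gzip module -- the payload is invented for illustration:

```python
import gzip
import json

# Illustrative payload: repetitive JSON compresses extremely well.
payload = json.dumps([
    {"id": i, "status": "active", "region": "us-east-1"}
    for i in range(500)
]).encode()

compressed = gzip.compress(payload, compresslevel=6)

print(f"raw:  {len(payload):,} bytes")
print(f"gzip: {len(compressed):,} bytes "
      f"({len(compressed) / len(payload):.0%} of original)")
```

Highly repetitive JSON like this shrinks by more than an order of magnitude; real-world API responses typically compress to a fraction of their raw size, which is effectively free bandwidth.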
Bottleneck 3: CPU Compute
The third bottleneck is often the most overlooked, and it shows up in the most interesting places.
The question to ask: Does this system need to do meaningful computation per request, or is it mostly moving data from one place to another?
Most web applications are I/O bound, not CPU bound. A typical API endpoint reads from a database, does some transformation, and returns a response. The CPU work here is trivial.
But some systems genuinely need to crunch numbers:
- Search ranking: Scoring and ranking 10 million documents for a search query involves non-trivial computation per query.
- Video transcoding: Converting a video from one codec and resolution to multiple output formats is extremely CPU intensive.
- Machine learning inference: Running a recommendation model or fraud detection model on each request is compute-heavy.
- Image processing: Resizing, cropping, watermarking, and format conversion at scale eat CPU.
When CPU is your bottleneck, caches and replicas don't help much -- you need to either do less computation per request or distribute the computation.
Common patterns for CPU-bound workloads:
Pre-computation -- calculate expensive results ahead of time and store them. A news feed doesn't need to rank 10,000 posts in real time when a user loads their feed. You can pre-compute and cache the ranked feed for each user. Trade-off: staleness. Your feed might be 30 seconds out of date. For most applications, this is completely acceptable.
Asynchronous job queues -- offload expensive computation to background workers. When a user uploads a video, you don't transcode it synchronously during the HTTP request (that would time out). You put the job in a queue (SQS, RabbitMQ, Kafka), return immediately with "your video is processing," and have dedicated worker instances pull jobs from the queue and process them asynchronously (a minimal sketch of this pattern follows this list).
Dedicated compute tiers -- separate your CPU-intensive services from your lightweight API services. Don't run ML inference on the same instances as your API -- the resource profiles are completely different and they'll interfere with each other under load.
Caching inference results -- if your ML model is deterministic and inputs repeat, cache the results. A fraud detection model that's seen a particular user before can serve cached risk scores instead of re-running inference.
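Here is the job-queue pattern from above as a minimal in-process sketch -- `queue.Queue` stands in for SQS, RabbitMQ, or Kafka, and the sleep stands in for the actual transcoding work:

```python
import queue
import threading
import time

# In-process stand-in for a real message broker (SQS/RabbitMQ/Kafka).
# The shape is the same: enqueue fast, process slowly on dedicated workers.
jobs: queue.Queue = queue.Queue()

def handle_upload(video_id: str) -> dict:
    """The HTTP handler: enqueue the job and return immediately."""
    jobs.put(video_id)
    return {"video_id": video_id, "status": "processing"}

def transcode_worker() -> None:
    """Dedicated worker loop pulling jobs off the queue."""
    while True:
        video_id = jobs.get()
        time.sleep(0.1)  # stand-in for CPU-heavy transcoding
        print(f"transcoded {video_id}")
        jobs.task_done()

threading.Thread(target=transcode_worker, daemon=True).start()

print(handle_upload("vid-42"))  # returns instantly, work happens later
jobs.join()                     # block until the background job finishes
```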
The Four-Step Framework for Any Interview
Here is the process that works for every system design problem:
Step 1: Write Down the Functional Requirements
What must the system actually do? Be specific. Not "handle users" but "allow users to create short URLs, redirect on access, and track click counts per URL."
This step matters because it defines what data flows through your system and what operations you need to support. You can't estimate bottlenecks without knowing what the system does.
Step 2: Estimate the Numbers
For every core operation, estimate:
- Requests per second (read and write separately)
- Data size per request or record
- Total data volume over time
- Read-to-write ratio
You don't need precise numbers. Orders of magnitude are fine. "Roughly 10K reads/second" is a useful estimate. "A lot" is not.
If the interviewer hasn't given you specifics, make reasonable assumptions and state them explicitly. "I'll assume this is a Twitter-scale system -- around 100 million daily active users, with roughly 500 million reads and 5 million writes per day." Interviewers appreciate candidates who make their assumptions explicit.
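Those stated assumptions convert directly into the numbers you design against. Here is the arithmetic -- the peak multiplier is a common rule of thumb, not a fixed constant:

```python
SECONDS_PER_DAY = 86_400

reads_per_day = 500_000_000   # from the stated assumption above
writes_per_day = 5_000_000

reads_per_sec = reads_per_day / SECONDS_PER_DAY     # ~5,800 reads/s average
writes_per_sec = writes_per_day / SECONDS_PER_DAY   # ~58 writes/s average
ratio = reads_per_day / writes_per_day              # 100:1 read-to-write

# Traffic is not uniform; peaks of 2-3x the daily average are typical.
peak_reads_per_sec = reads_per_sec * 3              # ~17,400 reads/s

print(f"{reads_per_sec:,.0f} avg reads/s, ~{peak_reads_per_sec:,.0f} at peak, "
      f"{ratio:.0f}:1 read-to-write")
```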
Step 3: Identify the First Bottleneck You'll Hit
With your estimates in hand, work through each layer:
- Database: Can a single instance handle this read/write throughput? What's the read-to-write ratio telling you about caching potential?
- Network: What's the bandwidth requirement? Is this in the range where a single server becomes a network bottleneck?
- CPU: Is there meaningful computation per request, or is this mostly data movement?
In most interview systems, you'll identify a clear primary bottleneck within two minutes of running the numbers. Design for that first.
Step 4: Apply the Right Solution for That Bottleneck
Now your architectural decisions are justified by the numbers.
"I'm adding Redis because our read-to-write ratio is 100:1 and the estimated read throughput exceeds what a single Postgres instance handles comfortably" is a statement that earns trust.
"I'm adding Redis because caching is good" is a statement that invites a follow-up you may not want to answer.
What FAANG Interviewers Are Actually Evaluating
When a senior engineer at Google, Meta, or Amazon interviews you on system design, they are not checking whether you know what Kafka is. They are evaluating three things:
1. Can you estimate? Can you take an ambiguous problem and turn it into numbers that inform decisions? Engineers who work at scale think in orders of magnitude constantly. This is a muscle they want to see you using.
2. Can you justify? When you make an architectural decision, can you explain why that decision is correct given the constraints? A candidate who says "we should shard the database" and can explain the specific write throughput that justifies it is far more valuable than one who knows sharding exists.
3. Can you trade off? Every architectural decision comes with costs. Caching adds staleness risk. Microservices add operational complexity. Sharding makes cross-shard queries painful. Can you articulate these trade-offs and explain why the benefit outweighs the cost in this specific context?
Estimation, justification, trade-offs -- these are what separate "good enough" answers from "strong hire" answers. The buzzwords are just vocabulary. The judgment is what gets you the offer.
Classic Problems Mapped to Bottlenecks
Design a URL Shortener -- Primary bottleneck: Database I/O (high read-to-write ratio, simple key-value lookups). Solution: Read cache (Redis), standard relational DB or key-value store.
Design Netflix or YouTube -- Primary bottleneck: Network bandwidth (large video payloads, global user base). Solution: CDN for video delivery, edge caching of popular content. Secondary: CPU compute for video transcoding (async job queues, dedicated transcode workers).
Design Twitter's News Feed -- Primary bottleneck: CPU compute (ranking personalized feeds for 300M users). Solution: Pre-computed feeds via fan-out on write, cached results. Secondary: Database I/O (massive read traffic on pre-computed feeds).
Design a Real-Time Chat App -- Primary bottleneck: Database I/O (write-heavy with high message volume) plus unique challenge of persistent connection management. Solution: Message queues, write-optimized storage, WebSockets for persistent connections.
Design a Search Engine -- Primary bottleneck: CPU compute (indexing, ranking, query processing at planetary scale). Solution: Distributed inverted index, pre-computed rankings, massive horizontal scaling.
None of these answers start with "microservices" or "Kubernetes." Those are implementation details that become relevant once you've identified the bottleneck and chosen a solution pattern. They're not where you start.
How Aurora Teaches This Framework
Aurora, Levelop's System Design AI, is built around exactly this process. Before you touch the canvas -- before you draw a single component -- Aurora walks you through:
- Functional requirements clarification: What specific operations does this system need to support?
- Capacity estimation: How many users, how many requests, how much data?
- Bottleneck identification: Based on your estimates, where will this system break first?
The canvas only unlocks after you've worked through these three phases with Aurora's guidance. Not to gatekeep, but because the architecture you draw should be a direct consequence of the analysis you did before. Candidates who skip this and jump straight to the diagram are practicing the interview wrong -- and Aurora won't let you do that.
After you complete the canvas, Aurora evaluates your architectural decisions against your stated requirements. If you added a CDN but your system handles only internal API traffic, Aurora flags it. If your system has a 100:1 read-to-write ratio and you didn't add a cache, Aurora asks you why. The feedback is specific to your design, not a generic rubric.
This is how you close the gap between "I've watched 40 hours of system design videos" and "I can walk into an interview and make defensible architectural decisions under pressure."
Practice Problems: One for Each Bottleneck
Try designing each of these using the four-step framework -- estimate first, identify the bottleneck, then design for it:
- Design a URL Shortener -- Your bottleneck is Database I/O. What's the read-to-write ratio? What does that tell you about caching?
- Design an Image Hosting Service -- Your bottleneck is Network Bandwidth. What's the bandwidth requirement? What does that tell you about distribution?
- Design a News Feed Ranking System -- Your bottleneck is CPU Compute. What's the cost of ranking per request? What does that tell you about pre-computation?
Start your free sprint on Levelop to practice these with Aurora walking you through the requirements and estimation phases before you touch the canvas.
Frequently Asked Questions
Is it ever right to use microservices from the start? Rarely. Microservices solve specific problems: independent deployability of services with very different scaling requirements, team autonomy at large organizational scale, isolation of failure domains. These problems don't exist in most interview scenarios unless explicitly stated. If you propose microservices, be ready to explain which specific problem they're solving and at what scale that trade-off becomes worth it.
When should I use SQL vs. NoSQL? SQL is the default. It gives you ACID transactions, rich query capabilities, and a mature ecosystem. Switch to NoSQL when you have a specific reason: you need horizontal write scaling that SQL can't provide (Cassandra), you need flexible schema for evolving document structure (MongoDB), or you need sub-millisecond key-value lookups at massive scale (DynamoDB). "NoSQL is more scalable" is not a reason -- plenty of systems handle billions of records on relational databases.
How precise do my capacity estimates need to be? Not very. The point is to identify which order of magnitude you're operating in, and whether that puts you in a regime where standard solutions work or where you need something more sophisticated. Being off by 2x in your estimates almost never changes the architectural decision. State your assumptions, be directionally correct, and don't get paralyzed trying to be precise.
Should I always start with a monolith? It's a reasonable starting point for most interview systems -- easier to reason about, fewer failure modes, appropriate for early-stage scale. The interviewer usually wants to see you evolve the design as scale increases. "Here's the simple version that works at low scale, and here's how I'd modify it as we scale to 10 million users" is a great way to structure the conversation.