Load Balancer vs API Gateway - The Debate That Nearly Cost Me a Google Offer
It was a Tuesday afternoon phone screen with Google, and I was feeling confident. I had spent three weeks grinding system design problems, and the interviewer had just asked me to walk through a high-level architecture for a URL shortening service. Easy. Bread and butter. I pictured the architecture in my head and started talking.
"So traffic comes in through the API gateway, which distributes requests across our service instances using round-robin..."
There was a pause. The kind of pause that makes your stomach drop.
"You just described a load balancer," the interviewer said, with the gentle patience of someone who has heard this mistake a hundred times. "What does the API gateway actually do in your architecture?"
I froze. I knew both terms. I had read about both. But in that moment, under pressure, I realized I had been treating them as interchangeable. I had a vague cloud of "they both sit in front of your services" in my head, and that cloud had just evaporated under the heat of a direct question.
I stumbled through an answer that I knew was not good enough. The rest of the interview went fine, but the damage was done. When the recruiter called with feedback, the notes included: "Confusion around infrastructure components and their responsibilities."
I did not get that offer.
But that failure became the best thing that happened to my system design preparation. Because over the next two weeks, I went on a deep dive into load balancers, API gateways, reverse proxies, and every related concept until I could explain the differences in my sleep. This post is the result of that deep dive - the guide I wish I had before that phone screen.
What a Load Balancer Actually Does
Let me start with the concept that seems simpler but is deceptively deep.
The Restaurant Analogy
Imagine a popular restaurant on a Friday night. There is a host standing at the entrance. Guests arrive, and the host's only job is to seat them at available tables so that no single server gets overwhelmed while others stand idle.
The host does not check reservations. The host does not take drink orders. The host does not translate the menu into different languages. The host does one thing: distribute incoming guests to available tables as efficiently as possible.
That is a load balancer. It sits in front of a pool of servers (the tables) and distributes incoming network requests (the guests) so that no single server is overwhelmed. That is its primary purpose, and it does it extremely well.
How It Actually Works
A load balancer receives incoming TCP connections or HTTP requests and forwards them to one of several backend servers based on a distribution algorithm. The most common algorithms are:
Round Robin: Requests are distributed sequentially. Server 1 gets request 1, server 2 gets request 2, server 3 gets request 3, then back to server 1. Simple, effective, and works well when all servers have similar capacity.
Weighted Round Robin: Same as round robin, but servers have weights. A server with weight 3 gets three times as many requests as a server with weight 1. Useful when your fleet has mixed instance sizes.
Least Connections: Each new request goes to the server currently handling the fewest active connections. This is better than round robin when requests have variable processing times - a server handling a complex query should not receive the next request just because it is "next in line."
IP Hash: The client's IP address is hashed, and the hash determines which server receives the request. This provides session affinity (also called sticky sessions) - the same client always hits the same server. Useful when server-side session state exists, though generally you want to avoid server-side sessions in modern architectures.
Least Response Time: Combines least connections with response time monitoring. Requests go to the server that currently has the fewest connections and the fastest average response time. This is the most adaptive algorithm but requires continuous health monitoring.
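To make these algorithms concrete, here is a minimal Python sketch of round robin and least connections. The class names and backend addresses are illustrative, not taken from any real load balancer.

```python
import itertools

class RoundRobinBalancer:
    """Cycles through backends in order, one request at a time."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Sends each request to the backend with the fewest active connections."""

    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def pick(self):
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Call this when the request completes.
        self.active[backend] -= 1


balancer = LeastConnectionsBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
server = balancer.pick()   # routes to the least-loaded backend
balancer.release(server)   # free the slot once the response is sent
```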
The Layer 4 vs Layer 7 Distinction
This is where load balancers get interesting, and where the interview follow-up questions live.
Layer 4 (Transport Layer) Load Balancing operates at the TCP/UDP level. The load balancer sees source IP, destination IP, source port, and destination port - but it does not inspect the contents of the request. It simply forwards TCP connections to backend servers. This is extremely fast - routing decisions take microseconds at most - because there is no payload parsing. AWS Network Load Balancer (NLB) operates at Layer 4.
Layer 7 (Application Layer) Load Balancing operates at the HTTP/HTTPS level. The load balancer can inspect HTTP headers, URLs, cookies, and even request bodies. This enables content-based routing: send all /api/users requests to the user service cluster, send all /api/orders requests to the order service cluster. AWS Application Load Balancer (ALB) operates at Layer 7.
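As a rough illustration of what an L7 content-based routing decision looks like (the route table and cluster names below are hypothetical):

```python
# Hypothetical L7 routing table: longest-prefix match on the URL path.
ROUTES = [
    ("/api/users", "user-service-cluster"),
    ("/api/orders", "order-service-cluster"),
    ("/", "default-cluster"),
]

def route(path: str) -> str:
    """Return the backend cluster for an incoming HTTP path."""
    for prefix, cluster in sorted(ROUTES, key=lambda r: -len(r[0])):
        if path.startswith(prefix):
            return cluster
    return "default-cluster"

assert route("/api/orders/42") == "order-service-cluster"
```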
The trade-off is performance vs. intelligence. Layer 4 is faster but dumber. Layer 7 is slower but can make routing decisions based on the actual content of requests. In a system design interview, the key insight is: Layer 7 load balancing starts to overlap with API gateway functionality. This is exactly where my confusion came from in that Google interview, and it is where most candidates get tripped up.
Health Checks: The Unsung Hero
A load balancer is only useful if it knows which backends are healthy. Load balancers perform health checks - periodic requests to each backend server to verify it is alive and responsive.
Active health checks: The load balancer sends a probe (HTTP GET to /health, TCP connection attempt, etc.) at regular intervals. If a server fails a configurable number of consecutive checks, it is removed from the pool.
Passive health checks: The load balancer monitors real traffic responses. If a server starts returning 5xx errors or timing out, it is marked unhealthy without needing a separate probe.
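A simplified active health check loop might look like the sketch below, assuming each backend exposes a /health endpoint; the thresholds and intervals are illustrative.

```python
import time
import urllib.request

UNHEALTHY_THRESHOLD = 3   # consecutive failures before removal from the pool
CHECK_INTERVAL = 5        # seconds between probes

failures = {}

def check(backend: str) -> bool:
    """Probe the backend's /health endpoint; any error counts as a failure."""
    try:
        with urllib.request.urlopen(f"http://{backend}/health", timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False

def health_check_loop(pool: set) -> None:
    while True:
        for backend in list(pool):
            if check(backend):
                failures[backend] = 0
            else:
                failures[backend] = failures.get(backend, 0) + 1
                if failures[backend] >= UNHEALTHY_THRESHOLD:
                    pool.discard(backend)   # stop routing traffic to it
        time.sleep(CHECK_INTERVAL)
```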
I learned to always mention health checks in interviews because it shows you think about failure modes. A load balancer without health checks is a traffic distributor that happily sends requests to dead servers.
API Gateway: The Bouncer, Translator, and Traffic Cop
If a load balancer is the restaurant host, an API gateway is the entire front-of-house operation combined: the bouncer checking IDs at the door, the maître d' confirming reservations, the hostess translating the menu for foreign guests, and the manager enforcing the "two drink minimum" policy.
An API gateway is a single entry point for all client requests, and it handles a wide range of cross-cutting concerns that individual microservices should not have to implement themselves.
Authentication and Authorization
The most fundamental API gateway function. When a request arrives, the gateway validates the authentication token (JWT, OAuth2 bearer token, API key) before the request ever reaches a backend service. If the token is invalid, the gateway returns a 401 immediately. If the token is valid but the user lacks permission for the requested resource, the gateway returns a 403.
This is critical in a microservices architecture. Without a gateway, every single microservice would need to implement token validation logic. That means every service needs to know about your auth provider, handle token refresh, manage signing keys, and deal with edge cases like expired tokens and revoked sessions. By centralizing this in the gateway, your microservices can trust that any request they receive has already been authenticated.
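As a sketch of what this looks like at the gateway, assuming the PyJWT library (pip install pyjwt); PUBLIC_KEY, forward_to_backend, and the request object are placeholders for whatever your gateway framework provides:

```python
import jwt  # PyJWT

def handle_request(request):
    token = request.headers.get("Authorization", "").removeprefix("Bearer ").strip()
    try:
        # PUBLIC_KEY is a placeholder for the auth provider's signing key.
        claims = jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
    except jwt.InvalidTokenError:
        return 401, "invalid or expired token"      # rejected at the edge

    if "orders:read" not in claims.get("scopes", []):
        return 403, "insufficient permissions"      # authenticated but not authorized

    # Backend services can now trust that the request is authenticated.
    return forward_to_backend(request, user_id=claims["sub"])
```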
In the Google interview, this is the answer I should have given. The API gateway handles who you are and what you are allowed to do. The load balancer handles which server should process your request. Different questions entirely.
Rate Limiting
An API gateway enforces rate limits to protect backend services from abuse, whether malicious (DDoS attacks) or accidental (a buggy client in a retry loop). Rate limiting typically works on a per-client basis: "This API key can make 1,000 requests per minute."
Common algorithms include:
Token Bucket: Each client has a bucket that fills with tokens at a fixed rate. Each request consumes a token. If the bucket is empty, the request is rejected (429 Too Many Requests). The bucket has a maximum capacity, allowing short bursts.
Sliding Window: Track the number of requests in a rolling time window. More precise than fixed windows (which can allow double the limit at window boundaries) but more memory-intensive.
Leaky Bucket: Requests enter a queue (bucket) and are processed at a fixed rate. If the queue is full, new requests are rejected. This smooths out traffic spikes, ensuring backends receive a steady flow.
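Of the three, token bucket is the one you will most often be asked to sketch. Here is a minimal version; the capacity and refill rate are illustrative:

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should return 429 Too Many Requests

# One bucket per API key: roughly 1,000 requests/minute with bursts up to 100.
buckets = {}
def check_rate_limit(api_key: str) -> bool:
    bucket = buckets.setdefault(api_key, TokenBucket(rate=1000 / 60, capacity=100))
    return bucket.allow()
```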
A load balancer does not do this. It distributes traffic but does not limit it. If a million requests per second arrive, the load balancer will happily distribute a million requests per second across your backends until they collapse. The API gateway is the one that says "no, you have exceeded your limit." For a deeper look at these defensive patterns, check out our post on rate limiting and circuit breaker patterns.
Request Transformation and Protocol Translation
This is where API gateways earn their keep in complex architectures.
Request/Response Transformation: The gateway can modify requests before forwarding them and modify responses before returning them. Add headers, remove sensitive fields, rename JSON keys, merge responses from multiple services into a single response. This is especially useful for Backend for Frontend (BFF) patterns where mobile clients need different response shapes than web clients.
Protocol Translation: External clients speak HTTP/REST. Internal services might use gRPC, Thrift, or GraphQL. The gateway translates between them. A client sends a REST POST to /api/orders, and the gateway converts it to a gRPC call to the OrderService. The response comes back as a Protocol Buffer, and the gateway serializes it to JSON for the client.
Request Aggregation: A single client request might need data from multiple backend services. Instead of making the client call three services separately (chatty API), the gateway can fan out to all three in parallel, aggregate the responses, and return a single combined response. This reduces client round trips and simplifies client logic.
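Here is a rough sketch of the fan-out-and-aggregate pattern using only the Python standard library; the internal service URLs are hypothetical:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch_json(url: str) -> dict:
    with urllib.request.urlopen(url, timeout=2) as resp:
        return json.load(resp)

def get_order_page(order_id: str) -> dict:
    # Hypothetical internal service endpoints.
    urls = {
        "order":   f"http://order-service.internal/orders/{order_id}",
        "user":    f"http://user-service.internal/users/for-order/{order_id}",
        "payment": f"http://payment-service.internal/payments/for-order/{order_id}",
    }
    # Fan out to all three services in parallel, then merge into one response.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {name: pool.submit(fetch_json, url) for name, url in urls.items()}
        return {name: future.result() for name, future in futures.items()}
```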
API Versioning and Routing
The gateway manages API versions, routing /v1/users to the legacy user service and /v2/users to the rewritten user service. This allows backend teams to deploy new service versions without coordinating client migrations. The gateway handles the routing, and clients can migrate to the new version at their own pace.
Logging, Metrics, and Tracing
The gateway is the perfect place to instrument cross-cutting observability because every request passes through it. Log request/response metadata, record latency histograms, inject distributed tracing headers (like OpenTelemetry trace IDs). This gives you a centralized view of all API traffic without requiring each microservice to implement its own logging.
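A gateway-side observability middleware can be surprisingly small. The sketch below logs latency and propagates a trace ID; forward_to_backend and the request/response objects are placeholders for your gateway framework:

```python
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)

def observability_middleware(request):
    # Reuse an incoming trace ID if present, otherwise start a new trace.
    trace_id = request.headers.get("X-Trace-Id") or uuid.uuid4().hex
    request.headers["X-Trace-Id"] = trace_id      # propagate downstream

    start = time.monotonic()
    response = forward_to_backend(request)        # placeholder routing step
    latency_ms = (time.monotonic() - start) * 1000

    logging.info("method=%s path=%s status=%s latency_ms=%.1f trace_id=%s",
                 request.method, request.path, response.status, latency_ms, trace_id)
    return response
```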
When They Overlap - and When They Don't
Here is the section that would have saved me in that Google interview. The confusion between load balancers and API gateways exists because Layer 7 load balancers can do some of what API gateways do. Let me draw a clear line.
The Comparison Table
| Capability | Load Balancer | API Gateway |
|---|---|---|
| Distribute traffic across servers | Yes (primary purpose) | Sometimes (basic round-robin) |
| Health checks on backends | Yes | Sometimes |
| SSL/TLS termination | Yes | Yes |
| Layer 4 (TCP/UDP) routing | Yes | No |
| Layer 7 (HTTP) routing | Yes (L7 LBs only) | Yes |
| Authentication/Authorization | No | Yes (primary purpose) |
| Rate limiting | No | Yes |
| Request/response transformation | No | Yes |
| Protocol translation (REST to gRPC) | No | Yes |
| API versioning | No | Yes |
| Request aggregation | No | Yes |
| Circuit breaking | Some (basic) | Yes |
| Caching | Some (basic) | Yes |
| Developer portal / API documentation | No | Yes (some gateways) |
The pattern is clear: a load balancer distributes traffic. An API gateway manages traffic. Distribution is about where a request goes. Management is about whether a request is allowed, how it is transformed, and what policies apply to it.
Where the Lines Blur
Modern Layer 7 load balancers like AWS ALB, Envoy, and HAProxy can do content-based routing, SSL termination, and basic header manipulation. This overlaps with some gateway features.
Conversely, many API gateways (Kong, AWS API Gateway, Apigee) include built-in load balancing to distribute requests across service instances.
So in practice, the tools overlap. But their architectural intent is different:
- You add a load balancer because you need to scale horizontally - you have multiple instances of the same service and need to distribute traffic among them.
- You add an API gateway because you need to manage your API surface - you have multiple different services and need a single entry point that handles cross-cutting concerns.
This distinction is what the Google interviewer was looking for. Not a memorized definition, but an understanding of why each component exists and what architectural problem it solves.
Drawing Them on a Whiteboard
When I eventually aced a system design interview at another top company, I drew the infrastructure layer like this, and the interviewer nodded approvingly:
The Architecture Flow
Start from the top with the client (browser, mobile app). The request path flows downward through distinct layers, each with a specific responsibility:
Layer 1: DNS and Global Load Balancing. The client resolves the API domain name, and DNS-based load balancing (like AWS Route 53 or Cloudflare) directs the client to the nearest healthy region. This is geographic load balancing - before the request even reaches your infrastructure, it is routed to the correct continent.
Layer 2: CDN / Edge Layer. For cacheable content (static assets, images, potentially API responses with cache headers), the CDN serves the response directly from the edge. Non-cacheable requests pass through to the next layer.
Layer 3: API Gateway. The single entry point for all API traffic. Here the request is authenticated (JWT validation), rate-limited (check against per-client quotas), and routed to the appropriate backend service based on the URL path and HTTP method. If the request requires data from multiple services, the gateway may perform request aggregation. If the client speaks REST but the backend speaks gRPC, the gateway translates.
Layer 4: Internal Load Balancer. Behind the API gateway, each microservice runs as multiple instances for redundancy and scale. An internal load balancer sits in front of each service cluster, distributing requests using least-connections or round-robin. This is where the load balancer lives - not at the edge, but between the gateway and the services. Think of it as the traffic distribution layer within your private network.
Layer 5: Microservices. The actual business logic. User Service, Order Service, Payment Service, Notification Service, etc. Each service is stateless and horizontally scalable behind its load balancer.
Layer 6: Data Layer. Databases, caches, message queues. Each service owns its data store. Services communicate asynchronously via message queues (Kafka, SQS) for eventual consistency.
The Key Insight for Interviews
When you draw this architecture, the API gateway and the load balancer are at different layers. The gateway is the front door of your entire API - it faces the public internet. The load balancer is an internal component - it distributes traffic within your private network, behind the gateway.
Some architects place an external load balancer in front of the API gateway itself (for high availability of the gateway layer). In that case, the flow is: Client → External LB → API Gateway → Internal LB → Service. This is common in production but can confuse candidates who then say "the load balancer does everything the gateway does." No - the external LB is just ensuring the gateway itself does not become a single point of failure. It distributes connections to multiple gateway instances. It does not authenticate, rate-limit, or transform.
If you have been studying system design through the lens of practical interviews, our post on designing Netflix walks through how these components work together in a real-world streaming architecture.
The Follow-Up Questions Google Loves
After my embarrassing phone screen, I compiled every follow-up question I could find about load balancers and API gateways. Here are the ones that come up most frequently, especially at companies like Google, Meta, and Amazon.
L4 vs L7 Load Balancing: When Do You Choose Which?
Choose Layer 4 when:
- You need maximum performance (millions of connections per second)
- You are load balancing non-HTTP protocols (database connections, custom TCP protocols, gaming servers)
- You do not need content-based routing
- You want the simplest possible configuration
Choose Layer 7 when:
- You need to route based on URL path, hostname, headers, or cookies
- You need SSL/TLS termination at the load balancer
- You want to inspect and modify HTTP traffic
- You need WebSocket support with intelligent routing
In a Google interview, the sophisticated answer is: "Use L4 for internal service-to-service traffic where performance matters and routing is simple. Use L7 at the edge where you need intelligent routing and TLS termination." This shows you understand that different layers of the architecture have different requirements.
Sticky Sessions: When and Why
Sticky sessions (session affinity) ensure that all requests from the same client go to the same backend server. This is sometimes necessary when servers maintain local state - for example, an in-memory shopping cart or a WebSocket connection.
The problem with sticky sessions: They undermine the core benefit of load balancing. If one server accumulates many sticky clients, it becomes overloaded while others are idle. If that server dies, all its sticky clients lose their sessions.
The better approach: Make your services stateless. Store session data in a shared store (Redis, DynamoDB). Then any server can handle any request, and sticky sessions become unnecessary. If an interviewer asks about sticky sessions, explain the trade-offs and recommend externalized session state as the preferred pattern.
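The externalized-session pattern is only a few lines once you have a shared store. The sketch below assumes the redis-py client (pip install redis) and a hypothetical internal Redis host:

```python
import json
import redis

# Hypothetical shared session store reachable by every service instance.
r = redis.Redis(host="sessions.internal", port=6379)

SESSION_TTL = 30 * 60  # 30 minutes

def save_session(session_id: str, data: dict) -> None:
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps(data))

def load_session(session_id: str):
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```

With this in place, any instance can handle any request, and the load balancer is free to use whatever distribution algorithm it likes.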
Health Check Strategies
Shallow health check (liveness): "Is the process running and accepting connections?" A simple TCP connect or HTTP 200 from /health. This catches crashed processes and network issues.
Deep health check (readiness): "Can the service actually handle requests?" The health endpoint verifies database connectivity, checks cache availability, validates that critical dependencies are reachable. This catches scenarios where the process is running but functionally broken - a database connection pool is exhausted, a downstream service is unreachable.
Trade-off: Deep health checks can cause cascading failures. If the database is slow, all services fail their deep health checks simultaneously, and the load balancer removes all backends from the pool. Suddenly you have zero healthy servers. The mitigation is to have the health check degrade gracefully: report unhealthy only if the service itself is broken, not if a dependency is slow.
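One way to express that mitigation, with check_local, check_db_pool, and check_cache standing in for real probes of this instance and its dependencies:

```python
def readiness():
    status = {
        "service": "ok" if check_local() else "failing",      # this instance itself
        "db": "ok" if check_db_pool() else "degraded",         # shared dependency
        "cache": "ok" if check_cache() else "degraded",        # shared dependency
    }

    # Fail the health check only for problems local to this instance. If every
    # instance reported unhealthy whenever a shared dependency slowed down, the
    # load balancer would drain the whole pool at once.
    http_code = 200 if status["service"] == "ok" else 503
    return http_code, status
```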
Circuit Breaking
A circuit breaker sits between a service and its downstream dependency. It monitors failure rates, and when failures exceed a threshold, it "opens" the circuit - subsequent requests immediately fail without actually calling the downstream service. After a timeout, the circuit moves to "half-open" and lets a few test requests through. If they succeed, the circuit closes and normal traffic resumes.
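A minimal sketch of that three-state machine, with illustrative threshold and timeout values:

```python
import time

class CircuitBreaker:
    """Closed -> Open after `threshold` consecutive failures; probes again after `reset_after` seconds."""

    def __init__(self, threshold: int = 5, reset_after: float = 30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")   # skip the downstream call
            # Half-open: the timeout has elapsed, let this request through as a probe.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()   # trip (or re-trip) the breaker
            raise
        else:
            self.failures = 0
            self.opened_at = None                   # success: close the circuit
            return result
```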
Where does the circuit breaker live? In some architectures, it is in the API gateway (centralized). In others, it is in a service mesh sidecar (like Envoy in an Istio mesh). In simpler architectures, it is a library within each service (like Netflix Hystrix, now superseded by Resilience4j).
This is a topic where the API gateway and service mesh worlds collide, and interviewers love to explore it. The key point is that circuit breaking is a resilience pattern, not a load balancing pattern. It prevents cascade failures, not traffic distribution.
The Service Mesh Question
Advanced interviewers might ask: "If you have a service mesh like Istio, do you still need an API gateway?"
Yes. They serve different purposes. The service mesh handles east-west traffic (service-to-service communication within your cluster) - mutual TLS, retries, circuit breaking, observability between internal services. The API gateway handles north-south traffic (client-to-service communication from outside your cluster) - authentication, rate limiting, API versioning, protocol translation.
Think of it as: the API gateway is the front door of your building. The service mesh is the hallway system inside the building. You need both.
The Redemption
Six months after that failed Google phone screen, I interviewed at another major tech company. The system design question was about designing a real-time messaging platform. When I drew the architecture, I placed the API gateway and load balancers at their correct layers, explained the responsibility of each, and anticipated the follow-up questions before they were asked.
When the interviewer said, "What's the difference between the load balancer and the API gateway in your diagram?" I smiled. I had been waiting for that question for six months.
I explained the restaurant analogy. I walked through the comparison table from memory. I discussed L4 vs L7 trade-offs. I talked about how the external load balancer protects the gateway from being a single point of failure while the gateway protects the services from unauthenticated and unthrottled traffic.
The interviewer said, "Clear. Let's move on." And I knew that "clear" was the best word I could possibly hear.
I got that offer.
The lesson is not that I memorized the right definitions. The lesson is that I understood the why behind each component. A load balancer exists because horizontal scaling requires traffic distribution. An API gateway exists because microservices require a unified entry point that handles cross-cutting concerns. Once you understand the why, the what follows naturally, and no interviewer can trip you up.
The confusion between these two components is one of the most common stumbling blocks in system design interviews. But it is also one of the easiest to fix. You now have the mental model, the comparison table, and the follow-up answers. The next time someone asks you the difference, you will not freeze. You will explain it clearly, draw it correctly, and move on to designing the rest of the system with confidence.
FAQ
Can you use both a load balancer and API gateway?
Absolutely, and in most production architectures, you will. They are complementary, not competing. The typical setup has an external load balancer (like AWS NLB or ALB) in front of multiple API gateway instances for high availability of the gateway layer itself. Behind the gateway, internal load balancers distribute traffic across instances of each microservice. The external LB ensures no single gateway instance is a bottleneck or single point of failure. The gateway handles authentication, rate limiting, and routing. The internal LBs handle service-level traffic distribution. In a system design interview, drawing both components in their correct positions shows architectural maturity. Just make sure you can explain why each one is there and what would break if you removed it.
What is the difference between L4 and L7 load balancing?
Layer 4 load balancing operates at the transport layer (TCP/UDP). The load balancer makes routing decisions based on source IP, destination IP, and port numbers - without inspecting the content of the packets. It simply forwards raw TCP connections to backend servers. This makes it extremely fast and efficient, capable of handling millions of connections per second with minimal latency overhead. Layer 7 load balancing operates at the application layer (HTTP/HTTPS). The load balancer terminates the TCP connection, inspects the HTTP request (URL path, headers, cookies, body), and makes routing decisions based on that content. For example, it can route /api/users to the user service and /api/orders to the order service. The trade-off is straightforward: L4 is faster but can only route based on network-level information. L7 is more flexible but adds latency because it must parse the application protocol. In interviews, the nuanced answer is that most modern architectures use L7 at the edge (where intelligent routing and TLS termination matter) and L4 internally (where raw performance between known services matters).
Is NGINX a load balancer or a reverse proxy?
NGINX is both, and this is something of a trick question that tests whether you understand the relationship between these concepts. A reverse proxy is a server that sits in front of backend servers and forwards client requests to them. A load balancer is a specific use case of a reverse proxy that distributes traffic across multiple backend instances. So load balancing is a function that a reverse proxy can perform, not a separate category of software. NGINX is fundamentally a reverse proxy that can be configured for load balancing, SSL termination, static file serving, caching, rate limiting, and more. In the same way, HAProxy is a reverse proxy specialized for load balancing, and Envoy is a reverse proxy designed for service mesh use cases. When an interviewer asks this question, the best answer is: "NGINX is a reverse proxy that supports load balancing as one of its core features. The terms are not mutually exclusive - load balancing is a function, and reverse proxy is an architectural role. NGINX fills the role and provides the function."
When should I use an API gateway vs a service mesh?
Use an API gateway for north-south traffic - requests coming from external clients (browsers, mobile apps, third-party integrations) into your system. The gateway handles concerns that matter at the boundary between the outside world and your services: authentication, rate limiting, API versioning, protocol translation, and request aggregation. Use a service mesh for east-west traffic - communication between your internal microservices. The mesh handles concerns that matter between trusted services: mutual TLS (mTLS) for encryption, retries with backoff, circuit breaking, load balancing between service instances, and fine-grained observability. You need both in a large microservices architecture because the concerns are different. External traffic needs authentication and throttling. Internal traffic needs encryption and resilience. The API gateway does not know (or care) how services talk to each other. The service mesh does not know (or care) how external clients authenticate. They operate at different boundaries with different responsibilities. That said, smaller systems with only a few services may not need a service mesh - the operational overhead of running Istio or Linkerd can outweigh the benefits. Start with an API gateway, and add a service mesh when your internal service-to-service communication becomes complex enough to warrant it.