Rate Limiting, Circuit Breakers, and the Beautiful Art of Failing Gracefully
The 3 AM PagerDuty Call That Changed Everything
It was a Tuesday at 3:14 AM when the first alert fired. Then the second. Then thirty more in the next ninety seconds.
Our payment processing service - one of twelve microservices in a mid-sized e-commerce platform - had started responding slowly. Not failing, not crashing, just... slow. Response times climbed from 50ms to 2 seconds, then 8 seconds, then 30 seconds. The root cause was mundane: a third-party fraud detection API we depended on was having a bad night. Their P99 latency had jumped from 200ms to 15 seconds.
Here is where things got ugly.
Our payment service had no timeout configured for that upstream call. So every incoming request to our payment service now held a thread open for 15+ seconds waiting on the fraud API. Our thread pool, sized at 200 threads, filled up in under a minute. New requests started queuing. The queue grew. Our payment service stopped responding entirely - not because it was broken, but because every thread was stuck waiting on a slow dependency.
The order service called the payment service to validate payment status before confirming orders. Those calls started timing out. The order service's thread pool began filling up. The product catalog service called the order service to check inventory reservations. It too started backing up. The API gateway, which routed to all of these services, ran out of connections. The entire platform went dark.
One slow third-party API - not even down, just slow - took out every single service in our architecture. Total downtime: 47 minutes. Revenue lost: north of $200,000. Customer trust lost: immeasurable.
When we did our post-mortem the next morning, the timeline was painful to read. The actual problem was a single slow dependency. But we had built a system where failure in one corner could cascade everywhere because we had no resilience patterns in place. No rate limiting. No circuit breakers. No bulkheads. No meaningful timeouts. We had built a beautiful distributed system that was, under the hood, as fragile as a house of cards.
This is the story of how we rebuilt it - and every pattern we are about to cover is something that comes up regularly in system design interviews. If you can talk about cascade failures from first-hand understanding rather than textbook definitions, interviewers notice.
Rate Limiting: Protecting Yourself From Your Own Users
Let us start with the most fundamental resilience pattern: rate limiting. Before we even talk about protecting services from each other, we need to protect our system from the outside world.
Rate limiting is the practice of controlling how many requests a client can make to your system within a given time window. It sounds simple, but the reasons for doing it are more nuanced than most candidates realize.
Why rate limit at all?
The obvious answer is "to prevent abuse" - stopping a malicious actor from DDoSing your API. That is true, but it is the least interesting reason. Here are the reasons that matter more in practice:
Protecting shared resources. Your database has a finite number of connections. Your server has finite memory. When one client sends 10,000 requests per second, they are consuming resources that other clients need. Rate limiting ensures fair access. This is especially critical in multi-tenant SaaS platforms where one noisy customer can degrade the experience for everyone.
Cost control. If your service calls downstream paid APIs (think Twilio, OpenAI, or any cloud provider), an uncontrolled spike in traffic translates directly to an uncontrolled spike in your bill. Rate limiting caps your exposure.
Maintaining SLAs. You promised your clients P99 latency under 200ms. Without rate limiting, a sudden traffic spike can push all your response times above that threshold. Rate limiting lets you shed excess load to protect the quality of service for accepted requests.
Preventing cascade failures. This connects directly to our 3 AM story. If an upstream client can bombard your service with unlimited requests, and your service calls downstream dependencies for each request, you are amplifying load at every hop. Rate limiting at the entry point contains the blast radius.
Where to implement rate limiting is an important design decision. You have three main options:
- At the API gateway / load balancer level. This is the first line of defense. It is infrastructure-level, language-agnostic, and handles the simplest cases - per-IP rate limiting, per-API-key limits. Tools like NGINX, Kong, AWS API Gateway, and Envoy all support this natively. This is where you stop brute-force attacks and misbehaving clients before they even reach your application servers.
- At the application level. For more nuanced rate limiting - different limits per user tier, per-endpoint limits, limits based on business logic (e.g., "free users get 100 AI queries per day, premium get 5,000") - you need application-level rate limiting. This typically lives in middleware.
- At the service-to-service level. In a microservices architecture, internal services should also rate-limit calls from other internal services. This prevents one runaway service from overwhelming another. This is the layer most teams forget, and it is exactly what could have saved us that night.
In practice, you want rate limiting at multiple layers. The gateway handles the blunt-force protection. The application handles business-logic limits. Internal services protect themselves from each other.
Token Bucket vs Sliding Window
When an interviewer asks about rate limiting, they often want to hear about the algorithms. Let us walk through the two most important ones in detail.
Token Bucket Algorithm
The token bucket is probably the most widely used rate limiting algorithm in production systems. It is used by AWS, Stripe, and most API gateways. The mental model is intuitive: imagine a bucket that holds tokens.
- The bucket has a maximum capacity (say, 10 tokens).
- Tokens are added to the bucket at a fixed rate (say, 1 token per second).
- Each request consumes one token from the bucket.
- If the bucket is empty, the request is rejected (or queued).
- If the bucket is full, new tokens are discarded (the bucket does not overflow).
Here is the key insight: because the bucket can accumulate tokens over idle periods, it naturally allows bursts. If a client has not made any requests for 10 seconds, their bucket is full at 10 tokens. They can then fire off 10 requests in rapid succession - a burst - and all 10 will be accepted. After that burst, they are back to 1 request per second until tokens accumulate again.
```python
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # max tokens (controls burst size)
        self.tokens = capacity          # start full
        self.refill_rate = refill_rate  # tokens added per second
        self.last_refill = time.monotonic()

    def allow_request(self):
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
```
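To make the burst behavior concrete, here is a quick usage sketch with illustrative numbers (capacity 10, refill 1 token per second):

```python
bucket = TokenBucket(capacity=10, refill_rate=1)

# A full bucket absorbs a burst of 10 back-to-back requests...
assert all(bucket.allow_request() for _ in range(10))
# ...but the 11th is rejected until tokens accumulate again.
assert bucket.allow_request() is False
```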
Tradeoffs of token bucket:
- Allows bursts up to the bucket capacity. This is usually desirable - real traffic is bursty, and you want to accommodate natural usage patterns.
- Simple to implement with O(1) time and space per client.
- The two parameters (capacity and refill rate) give you intuitive control: capacity controls burst size, refill rate controls sustained throughput.
- In a distributed system, you need a centralized store (Redis is the go-to) to maintain bucket state across multiple server instances. This adds latency to every request for the Redis lookup, though it is typically sub-millisecond.
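Since the bucket state must be shared across instances, the refill-and-consume step is usually pushed into Redis itself so it executes atomically. Here is a minimal sketch using redis-py and a Lua script - the key names, parameters, and client setup are illustrative assumptions, not a standard API:

```python
import time
import redis  # assumes the redis-py client is installed

# The Lua script runs atomically on the Redis server, so concurrent
# app instances cannot race on the same bucket.
TOKEN_BUCKET_LUA = """
local tokens_key, ts_key = KEYS[1], KEYS[2]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local tokens = tonumber(redis.call('GET', tokens_key) or capacity)
local last = tonumber(redis.call('GET', ts_key) or now)
tokens = math.min(capacity, tokens + (now - last) * refill_rate)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
-- production code would also set TTLs on these keys
redis.call('SET', tokens_key, tokens)
redis.call('SET', ts_key, now)
return allowed
"""

r = redis.Redis()
allow = r.register_script(TOKEN_BUCKET_LUA)

def allow_request(client_id, capacity=10, refill_rate=1.0):
    keys = [f"rl:{client_id}:tokens", f"rl:{client_id}:ts"]
    return allow(keys=keys, args=[capacity, refill_rate, time.time()]) == 1
```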
Sliding Window Algorithm
The sliding window algorithm takes a different approach. Instead of tokens, it tracks the actual timestamps of recent requests and counts how many fall within the current window.
Fixed window is the simpler variant: divide time into fixed intervals (e.g., 1-minute windows) and count requests per window. If the count exceeds the limit, reject. The problem with fixed windows is the boundary problem - a client can make 100 requests at 11:59:59 and another 100 at 12:00:01, effectively getting 200 requests in a 2-second span despite a 100-per-minute limit.
Sliding window log fixes this by storing the timestamp of every request and counting how many timestamps fall within the trailing window. It is precise but memory-intensive - you are storing every timestamp.
Sliding window counter is the practical middle ground used in production. It works by combining the counts from the current and previous fixed windows, weighted by how far into the current window you are:
```python
def sliding_window_count(prev_count, curr_count, window_size, elapsed):
    # Weight the previous window by the fraction of it that still falls
    # inside the trailing window, then add the current window's count.
    weight = (window_size - elapsed) / window_size
    return prev_count * weight + curr_count
```
If you are 30 seconds into a 60-second window, and the previous window had 80 requests and the current window has 20 so far, the effective count is: 80 * 0.5 + 20 = 60. If your limit is 100, this request is allowed.
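Putting that weighting into a complete limiter, a minimal single-process sketch (names and defaults are illustrative) might look like this:

```python
import time

class SlidingWindowCounter:
    def __init__(self, limit, window_size=60):
        self.limit = limit
        self.window_size = window_size
        self.curr_window = 0.0  # start time of the current fixed window
        self.curr_count = 0
        self.prev_count = 0

    def allow_request(self):
        now = time.monotonic()
        window = now - (now % self.window_size)
        if window != self.curr_window:
            # Roll forward; a window older than one interval drops to zero.
            advanced_one = (window - self.curr_window == self.window_size)
            self.prev_count = self.curr_count if advanced_one else 0
            self.curr_count = 0
            self.curr_window = window
        elapsed = now - window
        weight = (self.window_size - elapsed) / self.window_size
        if self.prev_count * weight + self.curr_count < self.limit:
            self.curr_count += 1
            return True
        return False
```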
Tradeoffs of sliding window counter:
- Smooths out the boundary problem of fixed windows.
- Only requires storing two counters per window per client - much less memory than the sliding window log.
- Does not naturally allow bursts the way token bucket does. The rate is more strictly enforced over the window.
- Slightly more complex to reason about than token bucket, but still O(1) per request.
When to use which? Token bucket is generally the better default for API rate limiting because it accommodates natural burst patterns. Sliding window counter is better when you need strict rate enforcement without bursts - for example, rate limiting outgoing calls to a paid third-party API where you are paying per call and have a hard quota.
In an interview, mentioning both and articulating the tradeoff between burst tolerance and strict enforcement is exactly what interviewers want to hear.
Circuit Breakers: The Electrical Metaphor That's Actually Perfect
Now we get to the pattern that would have directly prevented our cascade failure. The circuit breaker pattern, popularized by Michael Nygard in "Release It!" and implemented in libraries like Netflix's Hystrix (now in maintenance mode, with Resilience4j as its recommended successor), is borrowed directly from electrical engineering - and the metaphor holds up remarkably well.
In your home's electrical panel, a circuit breaker monitors current flowing through a circuit. If the current exceeds a safe threshold (indicating a short circuit or overload), the breaker trips open, physically breaking the circuit. This prevents the excess current from causing a fire. Once you fix the underlying problem, you manually reset the breaker to restore the circuit.
A software circuit breaker does exactly the same thing, but for network calls between services.
The Three States
A circuit breaker has three states:
Closed (normal operation). Requests flow through normally. The circuit breaker monitors every call and tracks failures. Think of this as the normal state of a light switch - current flows, lights are on. In the closed state, the breaker maintains a failure counter. Every failed request (timeout, 5xx error, connection refused) increments the counter. Every successful request may decrement it or reset it, depending on your implementation. If the failure count exceeds a configured threshold within a configured time window, the circuit breaker transitions to the open state.
Open (failing fast). All requests are immediately rejected without even attempting the downstream call. The breaker returns a predefined fallback response or throws a fast-fail exception. This is the critical state. Instead of sending requests to a service that is clearly struggling (making its problems worse), you fail fast. Your thread is freed in microseconds instead of blocking for 15 seconds waiting for a timeout. This is exactly what we needed at 3 AM - if our payment service had a circuit breaker on the fraud API call, it would have tripped open after the first handful of slow responses and stopped sending traffic to the struggling dependency.
The circuit breaker stays open for a configurable reset timeout (e.g., 30 seconds, 60 seconds). During this time, the failing service gets breathing room to recover.
Half-Open (testing the waters). After the reset timeout expires, the circuit breaker transitions to half-open. In this state, it allows a limited number of probe requests through to the downstream service. If these probes succeed, the breaker transitions back to closed - the service has recovered, resume normal traffic. If the probes fail, the breaker transitions back to open and resets the timeout.
```python
import time

class CircuitOpenException(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30,
                 half_open_max_calls=3):
        self.state = "CLOSED"
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.half_open_max_calls = half_open_max_calls
        self.last_failure_time = None
        self.half_open_successes = 0

    def call(self, func):
        if self.state == "OPEN":
            if time.monotonic() - self.last_failure_time > self.reset_timeout:
                self.state = "HALF_OPEN"
                self.half_open_successes = 0
            else:
                raise CircuitOpenException("Circuit is open")
        try:
            result = func()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed probe in HALF_OPEN, or too many failures in CLOSED,
            # opens (or re-opens) the circuit.
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise
        if self.state == "HALF_OPEN":
            self.half_open_successes += 1
            if self.half_open_successes >= self.half_open_max_calls:
                self.state = "CLOSED"
                self.failure_count = 0
        else:
            self.failure_count = max(0, self.failure_count - 1)
        return result
```
Configuring Thresholds
The art of circuit breakers is in the configuration. Set the failure threshold too low, and the breaker trips on normal transient errors (a single network blip opens the circuit). Set it too high, and the breaker does not trip fast enough to prevent cascade failures.
Here are practical guidelines:
- Failure threshold: 50% failure rate over the last 10-20 calls is a common starting point. Some implementations use a rolling window of the last N calls rather than a simple counter. Resilience4j, for example, uses a ring buffer of configurable size and trips when the failure percentage exceeds the threshold.
- Reset timeout: Start with 30 seconds. Too short and you hammer the recovering service with probes before it is ready. Too long and you keep the circuit open unnecessarily, degrading user experience.
- Half-open probe count: 3-5 probes is typical. You want enough data points to be confident the service has recovered, but not so many that you recreate the overload condition.
- What counts as a failure: This is crucial. A 4xx error (client error) should generally NOT count as a failure - the downstream service handled the request correctly, the client just sent bad input. A 5xx error, a timeout, and a connection refused should all count. Slow responses (above a configured latency threshold) should also count, because slow responses are often worse than errors - they hold threads open.
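That classification is easy to encode as a predicate the breaker consults after each call. A hedged sketch - the function name and the 1-second slow-call threshold are illustrative:

```python
def counts_as_failure(status=None, exc=None, latency_s=None,
                      slow_call_threshold_s=1.0):
    # Timeouts and connection errors always count as failures.
    if exc is not None:
        return True
    # Slow responses hold threads open, so treat them as failures too.
    if latency_s is not None and latency_s > slow_call_threshold_s:
        return True
    # 5xx means the dependency is unhealthy; 4xx means the caller erred.
    return status is not None and 500 <= status < 600
```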
Fallback Strategies
When the circuit is open, you need a fallback. The right fallback depends on your use case:
- Cached data: Return the last known good response. For our product catalog service, if the inventory service is down, return the last cached inventory counts (even if they are slightly stale). A slightly stale price is better than no page at all.
- Degraded functionality: Disable the feature that depends on the failing service. If the recommendation engine is down, show trending products instead of personalized recommendations. The page still loads; it is just less personalized.
- Default values: Return a sensible default. If the fraud detection API is down, you might accept payments below a certain dollar threshold (accepting slightly higher fraud risk to maintain revenue) while rejecting high-value transactions.
- Queue for retry: Accept the request, store it in a queue, and process it later when the downstream service recovers. This works well for operations that do not need a synchronous response.
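Wiring a cached fallback onto the CircuitBreaker sketch from earlier takes only a few lines; `fetch_live` and the cache here are hypothetical stand-ins:

```python
def get_inventory(product_id, breaker, cache, fetch_live):
    try:
        counts = breaker.call(lambda: fetch_live(product_id))
        cache[product_id] = counts  # refresh the last-known-good value
        return counts
    except CircuitOpenException:
        # Circuit is open: serve slightly stale data rather than an error.
        return cache.get(product_id)
```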
In an interview, articulating the fallback strategy is just as important as explaining the circuit breaker mechanism. It shows you think about the user experience, not just the plumbing.
Bulkheads, Timeouts, and Retries with Backoff
Circuit breakers are powerful, but they are not the complete picture. Let us cover three more patterns that work in concert with circuit breakers to create a truly resilient system.
Bulkhead Pattern
The bulkhead pattern is named after the compartments in a ship's hull. If the hull is breached, water floods one compartment but the watertight bulkheads prevent it from flooding the entire ship. The ship stays afloat.
In software, the bulkhead pattern isolates different parts of your system so that a failure in one does not consume all shared resources. There are two main flavors:
Thread pool bulkhead. Assign each downstream dependency its own dedicated thread pool. In our 3 AM scenario, the payment service had a single shared thread pool of 200 threads. When the fraud API got slow, it consumed all 200 threads, starving every other operation. With thread pool bulkheads, the fraud API calls get their own pool of, say, 30 threads. When those 30 fill up, only fraud-related requests are affected. The remaining 170 threads continue serving other payment operations normally.
```yaml
# Thread pool bulkhead configuration
bulkheads:
  fraud_api:
    max_concurrent: 30
    max_wait: 500ms
  payment_gateway:
    max_concurrent: 50
    max_wait: 1000ms
  order_service:
    max_concurrent: 40
    max_wait: 750ms
```
Semaphore bulkhead. A lighter-weight alternative that limits concurrency using a semaphore (counter) rather than a dedicated thread pool. Each call acquires a permit before proceeding; if no permits are available, the call is immediately rejected. Lower overhead than thread pool isolation but no queuing capability.
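A minimal semaphore bulkhead sketch using only the standard library (the class and parameter names are illustrative):

```python
import threading

class SemaphoreBulkhead:
    def __init__(self, max_concurrent, max_wait_s=0.0):
        self._permits = threading.BoundedSemaphore(max_concurrent)
        self._max_wait_s = max_wait_s  # 0.0 means fail fast, no queuing

    def call(self, func):
        # Reject immediately if all permits stay taken past the wait budget.
        if not self._permits.acquire(timeout=self._max_wait_s):
            raise RuntimeError("Bulkhead full: rejecting call")
        try:
            return func()
        finally:
            self._permits.release()
```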
The bulkhead pattern is the unsung hero of resilience. Most candidates in system design interviews mention circuit breakers but forget bulkheads. Yet bulkheads are what prevent the cascade in the first place - the circuit breaker trips after failures are detected, but the bulkhead prevents resource exhaustion from ever reaching other parts of the system.
Timeouts: The Simplest Pattern That Everyone Gets Wrong
Every network call should have a timeout. This is not controversial. Yet in our production incident, we had calls with no timeout configured, defaulting to the HTTP client's default (which in some libraries is infinite).
There are two types of timeouts you need to configure:
Connection timeout. How long to wait to establish a TCP connection. This should be short - 1 to 5 seconds. If you cannot establish a connection in 5 seconds, the remote service is likely unreachable. Waiting longer just wastes a thread.
Read/response timeout. How long to wait for the response after the connection is established. This depends on the expected response time of the service. For a database query, 2-5 seconds. For a complex report generation, maybe 30 seconds. But always configure it explicitly.
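In Python's requests library, for instance, both timeouts can be passed as a (connect, read) tuple; the endpoint and values below are illustrative:

```python
import requests

# (connect timeout, read timeout) in seconds - always set both explicitly.
resp = requests.get(
    "https://fraud-api.example.com/check",  # hypothetical endpoint
    timeout=(3.0, 1.0),
)
```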
A critical rule of thumb: your timeout should be slightly above the P99 latency of the downstream service. If the service normally responds in 100ms with a P99 of 500ms, set your timeout to 1 second. This allows normal variance but catches pathological slowness. If you set your timeout to 30 seconds for a service that normally responds in 100ms, you have effectively no protection - a thread can be held for 30 seconds doing nothing useful.
Request-level timeouts vs hedged requests. For critical-path requests where latency matters, you can also use hedged requests - if the primary request has not responded within the P50 latency, fire a second request to a different server. Take whichever responds first. Google uses this technique extensively. The tradeoff is increased load on your downstream service (you are sending extra requests), so it should only be used for latency-critical paths.
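A minimal hedged-request sketch using only the standard library - the hedge delay stands in for the observed P50 and would be tuned per service:

```python
import concurrent.futures as cf

def hedged_call(primary, backup, hedge_delay_s=0.05):
    """Fire `backup` if `primary` has not answered within the hedge delay.

    `primary` and `backup` are zero-arg callables hitting different
    replicas (illustrative stand-ins).
    """
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(primary)
        try:
            return first.result(timeout=hedge_delay_s)
        except cf.TimeoutError:
            second = pool.submit(backup)
            done, _ = cf.wait({first, second}, return_when=cf.FIRST_COMPLETED)
            # Note: exiting the `with` block waits for the slower call to
            # finish; a production version would cancel or detach it.
            return done.pop().result()
```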
Retries with Exponential Backoff and Jitter
Transient failures happen. A network blip, a momentary overload, a brief garbage collection pause - these produce errors that would succeed if tried again. Retries handle this. But naive retries are dangerous.
The retry storm problem. Imagine 1,000 clients all get an error at the same moment (say, when your service restarts). If all 1,000 immediately retry, you have just doubled the load on a service that was already struggling. If they retry again a second later, you have tripled it. Naive retries amplify the exact problem they are trying to solve.
Exponential backoff mitigates this retry-storm (or "thundering herd") problem by increasing the wait time between retries exponentially:
```python
import time

class TransientError(Exception):
    """Stand-in for errors worth retrying (timeouts, 503s, etc.)."""

def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            # Attempt 0: 1s, attempt 1: 2s, attempt 2: 4s
            time.sleep(delay)
```
But exponential backoff alone is not enough. If 1,000 clients all start retrying at the same time with the same backoff schedule, they will all retry at the same moments - 1 second later, then 2 seconds later, then 4 seconds later. You have just created synchronized waves of retries.
Jitter solves this by adding randomness to the backoff:
```python
import random
import time

def retry_with_backoff_and_jitter(func, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            jittered_delay = random.uniform(0, delay)  # full jitter
            time.sleep(jittered_delay)
```
AWS's architecture blog identifies three jitter strategies:
- Full jitter: `delay = random(0, base * 2^attempt)` - maximum spread, best at distributing retries.
- Equal jitter: `delay = base * 2^attempt / 2 + random(0, base * 2^attempt / 2)` - guarantees a minimum wait while still spreading.
- Decorrelated jitter: `delay = random(base, prev_delay * 3)` - each retry's range depends on the previous delay.
Full jitter gives the best distribution and is the recommended default in most cases.
Retry budgets. One more advanced concept: instead of configuring retries per-request, some systems use a retry budget - a cap on the total percentage of requests that can be retries. For example, "retries should not exceed 10% of total requests." If you made 1,000 requests and 100 of them were retries, no more retries are allowed until the ratio drops. This prevents retry storms at the system level regardless of individual retry configurations. Google's gRPC framework implements this pattern.
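A retry budget can be approximated with two counters. A minimal sketch - the 10% ratio and the method names are illustrative, not gRPC's actual API:

```python
class RetryBudget:
    def __init__(self, ratio=0.1, min_requests=10):
        self.ratio = ratio            # retries capped at 10% of requests
        self.min_requests = min_requests
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        # Allow a few retries at low volume, then enforce the ratio.
        # Production versions decay these counters over a sliding window.
        if self.requests < self.min_requests:
            return True
        return self.retries < self.ratio * self.requests

    def record_retry(self):
        self.retries += 1
```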
How These Patterns Compose
Here is the critical insight: these patterns are not alternatives; they are layers that compose together. Here is how a well-designed service handles a call to a downstream dependency:
- Rate limiter at the edge controls inbound traffic.
- Bulkhead ensures this dependency's calls have isolated resources.
- Timeout ensures no single call blocks longer than acceptable.
- Circuit breaker monitors the failure rate and trips if the dependency is unhealthy.
- Retry with backoff and jitter handles transient failures (only when the circuit is closed).
- Fallback provides a degraded response when the circuit is open.
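Tying the layers together with the sketches from earlier sections (CircuitBreaker, SemaphoreBulkhead, and the jittered retry helper); `fetch` and `fallback` are hypothetical stand-ins:

```python
def call_fraud_api(request, bulkhead, breaker, fetch, fallback):
    # Order matters: the bulkhead caps concurrency, the breaker fails fast,
    # and retries wrap only transient errors while the circuit is closed.
    def guarded():
        # `fetch` is assumed to enforce its own connect/read timeouts.
        return breaker.call(lambda: fetch(request))

    try:
        return bulkhead.call(lambda: retry_with_backoff_and_jitter(guarded))
    except CircuitOpenException:
        # Circuit is open: return the degraded response instead of an error.
        return fallback(request)
```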
This layered approach is what we implemented after our incident. Each pattern addresses a different failure mode, and together they create a system that degrades gracefully rather than cascading catastrophically.
Designing for Failure in Your Interview
Now let us talk about how to actually use this knowledge in a system design interview. You cannot spend 20 minutes explaining circuit breaker internals - but you can weave resilience patterns into your design in a way that signals deep understanding.
When to bring up resilience patterns. The right moment is when you are discussing service-to-service communication in your architecture. After you have drawn the high-level boxes and arrows, you should be talking about "what happens when service B is slow or down?" This is where you naturally introduce circuit breakers and timeouts. If the interviewer asks about handling traffic spikes or protecting the system, that is where rate limiting comes in.
What interviewers actually want to hear. Interviewers are not looking for textbook definitions. They want to hear you reason about failure modes specific to the system you are designing. For example:
"Our recommendation service calls the user-preferences service and the product-catalog service. I would put a circuit breaker on both of those calls. If the preferences service goes down, we fall back to showing popular items - degraded but functional. If the product catalog goes down, that is more critical, so we should have a cached copy of the catalog with a 5-minute TTL that we serve when the circuit is open."
That shows you understand not just what a circuit breaker is, but how to configure the fallback differently based on the criticality of each dependency. That is senior-level thinking.
Do not forget the client side. Rate limiting is often discussed as a server-side concern, but good candidates also mention client-side considerations: how do you communicate rate limits to clients (HTTP 429 with Retry-After header), how do API clients implement backoff, and how do you design your rate limit tiers (free vs paid users).
Monitoring and observability. Mention that circuit breaker state transitions should be logged and alerted on. If a circuit trips open, that should fire an alert. The half-open state transitions tell you whether the service is recovering. Rate limiter rejections should be tracked as a metric - a sudden spike in 429 responses might indicate a client bug or an attack. Dashboard visibility into these patterns is what makes them operational, not just architectural.
The Resilience Checklist
Here is a concrete checklist you can use when reviewing the system you are designing in an interview. Run through this mentally before you say "I think that covers the design":
Rate Limiting:
- Have you defined rate limits at the API gateway level?
- Are there per-user and per-tier rate limits at the application level?
- Do internal services rate-limit calls from other internal services?
- Have you specified how rate limit information is communicated to clients (429 + Retry-After)?
- Have you chosen an appropriate algorithm (token bucket for burst tolerance, sliding window for strict limits)?
Circuit Breakers:
- Does every outbound service call have a circuit breaker?
- Have you defined what constitutes a failure (5xx, timeout, latency threshold)?
- Have you defined a fallback strategy for each dependency?
- Are fallbacks appropriate for the criticality of each dependency?
- Is circuit breaker state monitored and alerted on?
Bulkheads:
- Are downstream dependencies isolated into separate thread pools or semaphores?
- Is the sizing of each bulkhead proportional to the importance and expected load of that dependency?
- What happens when a bulkhead is full - does the request queue or fail fast?
Timeouts:
- Does every network call have an explicit connection timeout AND read timeout?
- Are timeouts set based on the downstream service's observed latency (slightly above P99)?
- For critical-path calls, have you considered hedged requests?
Retries:
- Are retries configured with exponential backoff AND jitter?
- Is there a maximum retry count (typically 2-3 for synchronous calls)?
- Are only transient errors retried (not 4xx client errors)?
- Is there a retry budget to prevent retry storms at the system level?
- Are retries disabled when the circuit breaker is open?
Graceful Degradation:
- If any single dependency fails, does the overall system still function (possibly in a degraded mode)?
- Are there cached fallbacks for read-heavy paths?
- Is there a clear prioritization of which features can be degraded vs which are critical?
- Have you tested the degraded states to ensure they actually work?
After our 3 AM incident, we implemented every item on this checklist. The system still had incidents after that - distributed systems always will - but we never again had a cascade failure where one slow service took down the entire platform. Failures became contained, predictable, and manageable. That is the goal: not to prevent all failures, but to fail gracefully.
The engineers who understand this deeply - who can talk about these patterns not as textbook concepts but as battle-tested solutions to real problems - those are the engineers who get the offers. Because resilience is not a feature you add at the end. It is a design philosophy that shapes every architectural decision you make.
Preparing for system design interviews? Check out our guides on designing Netflix for a system design interview and understanding load balancers vs API gateways for more patterns you will encounter.
Frequently Asked Questions
What's the difference between rate limiting and throttling?
The terms are often used interchangeably, but there is a subtle difference. Rate limiting is the broader concept of controlling the rate of requests. When a rate limit is exceeded, the request is typically rejected outright with a 429 (Too Many Requests) response. Throttling is a specific response to exceeding a rate limit where requests are slowed down rather than rejected - they might be queued and processed at a reduced rate, or the client might be asked to wait (via a Retry-After header) before sending more. In practice, most production systems use rate limiting with rejection because throttling (queuing excess requests) requires server resources to hold those requests, which defeats the purpose of protecting the server from overload. When discussing these in an interview, it is fine to use them interchangeably, but if asked to distinguish, the key difference is reject (rate limiting) vs delay (throttling).
How does the token bucket algorithm work?
The token bucket algorithm maintains a virtual "bucket" that holds tokens up to a maximum capacity. Tokens are added to the bucket at a constant refill rate (e.g., 10 tokens per second). When a request arrives, it must consume one token from the bucket to proceed. If the bucket has tokens, the request is allowed and one token is removed. If the bucket is empty, the request is rejected. Because tokens accumulate during idle periods (up to the maximum capacity), the algorithm naturally allows bursts - a client that has been idle can send a burst of requests equal to the bucket's capacity. After the burst, they are limited to the refill rate. This burst tolerance is why token bucket is the most popular rate limiting algorithm for public APIs. The two tunable parameters - capacity (controls burst size) and refill rate (controls sustained throughput) - make it flexible for different use cases. In distributed systems, the bucket state is typically stored in Redis so all server instances share the same counter for each client.
When should a circuit breaker trip?
A circuit breaker should trip (transition from closed to open) when the downstream service is exhibiting sustained failure that is unlikely to resolve within the timeframe of a single request. The specific configuration depends on your system, but general guidelines are: trip when the failure rate exceeds 50% of requests over a recent window (e.g., the last 20 calls), or when consecutive failures exceed a threshold (e.g., 5 consecutive failures). Importantly, "failure" should include not just errors (5xx responses, connection refused) but also slow responses that exceed a latency threshold - in our cascade failure story, the downstream service was not returning errors, it was just slow, and that was enough to bring everything down. A 4xx response (client error) should generally not count as a failure because it indicates the downstream service is functioning correctly. The reset timeout (how long the circuit stays open before testing with half-open probes) should be long enough for the downstream service to recover - typically 30 to 60 seconds - but not so long that you keep the circuit open unnecessarily when the service has already recovered.
Should rate limiting happen at the API gateway or service level?
Both, and they serve different purposes. API gateway rate limiting is your first line of defense and handles broad, infrastructure-level concerns: per-IP limits to prevent DDoS attacks, per-API-key limits to enforce subscription tiers, and global rate limits to protect your overall system capacity. This is coarse-grained and applies before requests even reach your application servers. Service-level rate limiting handles more nuanced, business-logic-driven limits: per-user quotas that vary by subscription plan, per-endpoint limits (your search endpoint might need stricter limits than your profile endpoint because it is more expensive), and per-operation limits (e.g., "a user can only reset their password 3 times per hour"). In microservices architectures, there is a third layer that is equally important: service-to-service rate limiting, where each internal service protects itself from excessive calls by other internal services. This prevents a bug in one service (like an infinite retry loop) from overwhelming its dependencies. The recommended approach is defense in depth - rate limiting at every layer, with each layer handling the concerns appropriate to its level of abstraction.