Back to blog
Editorial illustration of a single large monumental monolith block standing confidently in the center, glowing with warm light to represent one unified application, with streams of glowing data flowing into it from many directions to suggest enormous traffic at scale, and a few small supporting glowing server racks, a cache layer, and a queue around its base quietly absorbing the load, in a calm teal and deep blue palette with warm amber accents.
System Design

Scaling a Monolithic Architecture to Billions: PostgreSQL, Django, No Kubernetes

Jun 30, 2026 10 min read Avinash Tyagi
monolithic architecture monolith architecture what is monolithic architecture scaling a monolith scale without kubernetes postgresql at scale django at scale monolith vs microservices database sharding system design

Every few years the industry rediscovers the same lesson. A small team ships a product on a boring stack, the product takes off, and suddenly the engineering conversation turns to microservices, service meshes, and container orchestration. The assumption is that real scale demands a distributed architecture with dozens of services. Yet some of the largest consumer products ever built reached enormous scale on a monolithic architecture: a single application written in Python, backed by PostgreSQL, fronted by Memcached, and coordinated by a task queue. No Kubernetes. No microservices. Just a well-run monolith.

This post is a case study in how to scale a monolith without reaching for the heavy machinery. It draws on publicly documented architectures, most notably the early engineering write-ups from Instagram, whose team scaled a Django and PostgreSQL application to hundreds of millions of users with a famously small engineering staff. The point is not that a monolithic architecture is always right or that microservices are wrong. The point is that the bottleneck you actually hit at scale is rarely the one the conference talks warn you about.

The Monolithic Architecture Stack That Refuses to Die

The components in the title are not exotic. They are the same pieces a junior engineer would reach for to build a side project, which is exactly why they are underrated. This is what a monolithic architecture actually looks like in production once you strip away the marketing around distributed systems.

PostgreSQL is the system of record. Django is the application layer, handling routing, ORM access, templating, and business logic in one deployable unit. Memcached sits in front of the database to absorb read traffic. RabbitMQ (or a comparable broker) decouples slow work from the request path. That is the entire backbone, and variations of it run some of the highest-traffic sites on the internet.

What makes this stack durable is that each layer has a single, well-understood job. When something gets slow, you can reason about which layer is responsible without tracing a request across a dozen network hops. Observability in a monolith is local. That is a real operational advantage when you are debugging at three in the morning.

Why Teams Reach for Microservices Too Early

Before looking at how to scale the monolith, it is worth being honest about why teams abandon it. The reasons are usually organizational rather than technical.

Microservices solve a people problem first and a scaling problem second. When you have fifty teams that all want to deploy independently, splitting the codebase along team boundaries reduces coordination cost. That is a legitimate reason. But a ten-person team does not have a coordination problem. It has a focus problem, and splitting into services multiplies the operational surface area without buying anything. You trade in-process function calls for network calls, local transactions for distributed sagas, and a single deploy for an orchestration platform.

The cost of that trade is real. Distributed systems introduce partial failure, eventual consistency, and debugging that spans process boundaries. As Martin Fowler has argued in his writing on the "monolith first" approach, most systems should start as a monolith and only decompose once the boundaries are well understood. Premature decomposition locks in the wrong boundaries and is far harder to undo than to avoid.

For a deeper treatment of when the split actually makes sense, our modular monolith vs microservices decision framework walks through the specific signals that justify the move.

A horizontal flow diagram titled four levers for scaling a monolith, in order. Four boxes connected by arrows read, in sequence: one, cache, using Memcached and a cache-aside pattern; two, replicas plus pool, using read replicas and PgBouncer; three, async queue, using RabbitMQ and Celery workers; and four, shard, splitting writes across multiple databases. Captions explain that caching absorbs repeated reads, replicas scale reads, the queue moves slow work off the request path, and you only shard when a single primary can no longer keep up with writes, all without Kubernetes or microservices.
The four levers for scaling a monolithic architecture, applied in order: cache, then read replicas and connection pooling, then an asynchronous queue, and only then sharding.

Lever One: Cache Aggressively With Memcached

The first thing that breaks under load is the database, and the first fix is to stop asking it the same question repeatedly. Memcached is an in-memory key-value store that holds the results of expensive reads so the database never sees them twice.

The pattern is cache-aside. The application checks the cache first. On a miss, it reads the database, writes the result back to the cache, and returns it. On a hit, the database is never touched.

cache.pypython
import hashlib
from django.core.cache import cache
from myapp.models import Post

def get_post(post_id):
    key = f"post:{post_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached

    post = Post.objects.get(pk=post_id)
    cache.set(key, post, timeout=3600)  # one hour
    return post

def invalidate_post(post_id):
    cache.delete(f"post:{post_id}")

The hard part of caching is not the read. It is invalidation. When the underlying row changes, the cache entry must be removed or updated, or users will see stale data. A disciplined approach is to invalidate on write inside the same code path that mutates the row, so the cache and the database never drift for long.

At very high scale, Memcached itself becomes a fleet. You run many instances and shard keys across them with consistent hashing so that adding or removing a node only remaps a small fraction of keys. This is the same idea behind distributing any large dataset, covered in more depth in our writing on consistent hashing and data partitioning.

Lever Two: Read Replicas and Connection Pooling

Caching removes repeated reads, but the database still handles every write and every cache miss. The next lever is to scale reads horizontally with replication.

PostgreSQL supports streaming replication, where one primary handles writes and one or more replicas receive a continuous stream of changes. The application sends writes to the primary and spreads reads across the replicas. Django supports this directly through database routers.

routers.pypython
import random

class PrimaryReplicaRouter:
    replicas = ["replica_1", "replica_2"]

    def db_for_read(self, model, **hints):
        return random.choice(self.replicas)

    def db_for_write(self, model, **hints):
        return "default"  # the primary

    def allow_relation(self, obj1, obj2, **hints):
        return True

Replication introduces replication lag. A write committed on the primary takes a short time to appear on the replicas. If a user updates their profile and then immediately reads it from a replica, they may see the old value. The usual fix is read-your-writes routing: for a short window after a write, route that user's reads to the primary.

The second database problem at scale is connection exhaustion. Each PostgreSQL connection costs memory, and a fleet of application servers can easily open more connections than the database can hold. A connection pooler such as PgBouncer sits between the application and the database, multiplexing thousands of client connections onto a small pool of real ones.

Lever Three: Move Slow Work Off the Request Path

A web request should do the minimum work required to return a response. Anything that can happen later should happen later. This is where RabbitMQ and a task worker such as Celery earn their place.

When a user uploads a photo, the request should save the file and return immediately. Generating thumbnails, updating followers' feeds, running content checks, and sending notifications all happen asynchronously. The web server publishes a message to the broker and moves on. A separate pool of worker processes consumes those messages and does the heavy lifting.

tasks.pypython
from celery import shared_task

@shared_task(bind=True, max_retries=3)
def process_upload(self, photo_id):
    try:
        generate_thumbnails(photo_id)
        fan_out_to_followers(photo_id)
        enqueue_safety_scan(photo_id)
    except TransientError as exc:
        # retry with exponential backoff
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)

This pattern keeps the request path fast and predictable. It also gives you a natural place to apply backpressure. If a downstream system slows down, messages queue up in the broker instead of timing out user requests. You can scale workers independently of web servers, which is a far cheaper and simpler form of independent scaling than splitting the application into services.

The fan-out problem, where one action must update many other users' data, is one of the classic challenges of social applications. Whether you fan out on write or compute on read is a genuine design decision with real tradeoffs, and the right answer depends on your read-to-write ratio.

Lever Four: Shard the Database When a Replica Is Not Enough

Read replicas scale reads, but a single primary still has to absorb every write. When write volume outgrows one machine, you shard: you split the data across multiple independent PostgreSQL servers, each owning a slice of the keyspace.

The Instagram team documented an elegant approach using logical shards mapped onto a smaller number of physical databases, with IDs generated to encode their shard. Generating unique IDs that sort roughly by time and carry their shard location lets the application route any row to the correct database without a central lookup. The structure of such an ID typically combines a timestamp, a shard identifier, and a per-shard sequence number into a single 64-bit integer.

sharded-id.txttext
 41 bits: milliseconds since a custom epoch
 13 bits: logical shard id
 10 bits: per-shard auto-increment sequence
 ----------------------------------------
 64 bits total, time-sortable, shard-routable
A diagram titled anatomy of a shard-routable 64-bit ID. A horizontal bar is split into three colored segments: a wide 41 bit segment labeled milliseconds since a custom epoch, a 13 bit segment labeled logical shard id, and a 10 bit segment labeled per-shard sequence. The caption explains that a timestamp, a shard identifier, and a per-shard sequence number are packed into a single 64-bit integer that is time-sortable and carries its own shard location, so any row routes to the right database without a central lookup, the approach Instagram documented for sharding PostgreSQL behind a Django monolith.
A shard-routable 64-bit ID packs a timestamp, a logical shard id, and a per-shard sequence number into one integer, so any row routes to the right database without a central lookup.

Sharding is the most invasive lever, and you should delay it as long as caching, replicas, and pooling are still buying you headroom. Once you shard, cross-shard queries and transactions become expensive, so the schema and access patterns have to be designed around the shard key. This is precisely why you do not shard early: you want to understand your access patterns deeply before you bake them into the physical layout.

Why No Kubernetes Was Fine

Container orchestration is excellent for managing many heterogeneous services with complex scheduling needs. A monolith does not have that shape. It is a handful of identical stateless application servers behind a load balancer, plus a few stateful systems (PostgreSQL, Memcached, the broker) that you generally do not want an orchestrator rescheduling out from under you.

For a uniform fleet of stateless app servers, an autoscaling group and a load balancer do the job with a fraction of the operational complexity. You add servers when CPU rises and remove them when it falls. There is no control plane to operate, no networking abstraction to debug, no new failure mode to learn. The simplicity compounds: fewer moving parts means fewer incidents and a smaller on-call burden, which matters enormously for a small team.

This is the throughline of the entire architecture. Every choice optimizes for a small team's ability to understand, operate, and debug the system. That is a feature, not a limitation. Teams routinely underestimate how much velocity they lose to operational complexity they adopted before they needed it.

When You Genuinely Should Decompose

None of this is an argument that the monolith scales forever or fits every situation. There are real signals that justify splitting it up.

You should consider decomposing when deploy contention becomes a tax, meaning teams are blocked waiting on each other's releases. You should consider it when one part of the system has wildly different scaling or reliability requirements than the rest, such as a real-time messaging subsystem living inside a content app. You should consider it when distinct parts genuinely need different technology stacks. And you should consider it when the codebase has grown large enough that the cognitive load of the whole exceeds any one engineer's grasp.

The mistake is treating those signals as a reason to start with microservices rather than a reason to evolve toward them. As covered in our piece on why teams are leaving microservices in 2026, a surprising number of organizations are consolidating back into modular monoliths after discovering the distributed version cost more than it returned. When the time does come, the path from a well-structured monolith is itself a project worth planning, which you can read more about on the Levelop blog.

Frequently Asked Questions

Can a monolith really handle billions of users?

A single application process cannot, but a horizontally scaled fleet of identical monolith instances, backed by a cached and replicated database, demonstrably can. Publicly documented systems such as Instagram reached hundreds of millions of users on a Django and PostgreSQL architecture. The monolith refers to the codebase being one deployable unit, not to running on one machine. You still run many copies behind a load balancer.

What is the first thing to optimize when a monolith gets slow?

Almost always the database. Start by finding the slow queries and the most frequent reads, then put a cache such as Memcached in front of those reads. Caching the hottest objects typically removes the majority of database load with very little code change, which makes it the highest-return first move before you consider replicas or sharding.

Do I need Kubernetes to scale a web application?

No. Kubernetes is valuable when you operate many different services with complex scheduling needs. A monolith is a uniform fleet of stateless servers, which an autoscaling group and a load balancer handle with far less operational complexity. Adopt orchestration when the shape of your system actually calls for it, not by default.

When should I move from read replicas to sharding?

Move to sharding only when a single primary can no longer absorb your write volume, since replicas scale reads but not writes. Exhaust caching, read replicas, and connection pooling first. Sharding is the most invasive change because it constrains your schema and access patterns around the shard key, so delay it until you deeply understand those patterns.

Is a monolithic architecture outdated in 2026?

Not at all. The industry trend has actually shifted back toward the modular monolith for many use cases, as teams account for the operational cost of distributed systems. A monolithic architecture remains the correct default for most products, with decomposition reserved for specific organizational and technical signals rather than adopted preemptively.

Closing Thoughts

The most important lesson from systems that scaled a monolith to enormous size is that scaling is a sequence of targeted, reversible moves against well-understood bottlenecks, not a wholesale architectural rewrite. Cache the hot reads. Replicate to scale reads. Pool your connections. Push slow work to a queue. Shard only when writes truly demand it. Each of these is something a small team can implement, reason about, and operate without a platform engineering org standing behind it.

If you are building toward scale, the question to keep asking is not "how do the big companies do it" but "what is my actual bottleneck right now, and what is the smallest change that removes it." For more system design deep dives, explore the Levelop blog or start practicing architecture decisions hands-on at Levelop.

References

  1. Instagram Engineering, "What Powers Instagram: Hundreds of Instances, Dozens of Technologies," instagram-engineering.com.
  2. Instagram Engineering, "Sharding & IDs at Instagram," instagram-engineering.com.
  3. Martin Fowler, "MonolithFirst," martinfowler.com/bliki/MonolithFirst.html.
  4. PostgreSQL Documentation, "High Availability, Load Balancing, and Replication," postgresql.org.
  5. PgBouncer Documentation, "Connection Pooling for PostgreSQL," pgbouncer.org.

Keep reading

System Design

Modular Monolith: Why Teams Are Leaving Microservices

42% of organizations are consolidating microservices back into modular monoliths. Here is why the economics shifted, what a modular monolith looks like in practice, and how to decide which architecture fits your team.

Read article
System Design

Bloom Filters Explained: When "Probably Yes" Is Good Enough and Saves 99% of Memory

Bloom filters trade a tiny false positive rate for 95% memory savings. Here is how they work and where they show up in Cassandra, Redis, and Chrome.

Read article
System Design

Consistent Hashing Explained: Virtual Nodes and Why They Matter for Distributed Systems

Learn how consistent hashing with virtual nodes solves load balancing and data distribution in distributed systems like Cassandra, DynamoDB, and Redis Cluster.

Read article