Category
system design
83 articles across 26 sub-topics
Stream Processing Pipeline Pattern: Stateful Real-Time Data Products
TLDR: Stream pipelines succeed when event-time semantics, state management, and replay strategy are designed together — and Kafka Streams lets you build all three directly inside your Spring Boot service. Stripe's real-time fraud detection processes...
Service Mesh Pattern: Control Plane, Data Plane, and Zero-Trust Traffic
TLDR: A service mesh intercepts all service-to-service traffic via injected Envoy sidecar proxies, letting a platform team enforce mTLS, retries, timeouts, and circuit breaking centrally — without changing application code. Reach for it when cross-te...
Serverless Architecture Pattern: Event-Driven Scale with Operational Guardrails
TLDR: Serverless is strongest for spiky asynchronous workloads when cold-start, observability, and state boundaries are intentionally designed. TLDR: Serverless works best for spiky, event-driven workloads when you design for idempotency, observabili...
Saga Pattern: Coordinating Distributed Transactions with Compensation
TLDR: A Saga replaces fragile distributed 2PC with a sequence of local transactions, each backed by an explicit compensating transaction. Use orchestration when workflow control needs a single brain; use choreography when services must stay loosely c...
Modernization Architecture Patterns: Strangler Fig, Anti-Corruption Layers, and Modular Monoliths
TLDR: Large-scale modernization usually fails when teams try to replace an entire legacy platform in one synchronized rewrite. The safer approach is to create seams, translate old contracts into stable new ones, and move traffic gradually with measur...
Microservices Data Patterns: Saga, Transactional Outbox, CQRS, and Event Sourcing
TLDR: Microservices get risky when teams distribute writes without defining how business invariants survive network delays, retries, and partial failures. Patterns like transactional outbox, saga, CQRS, and event sourcing exist to make those rules ex...
Lambda Architecture Pattern: Balancing Batch Accuracy with Streaming Freshness
TLDR: Lambda architecture is justified when replay correctness and sub-minute freshness are both non-negotiable despite dual-path complexity. TLDR: Lambda architecture is a fit only when you need both low-latency views and deterministic recompute fro...
Integration Architecture Patterns: Orchestration, Choreography, Schema Contracts, and Idempotent Receivers
TLDR: Integration failures usually come from weak contracts, unsafe retries, and missing ownership rather than from choosing the wrong transport. Orchestration, choreography, schema contracts, and idempotent receivers are patterns for making cross-bo...
Infrastructure as Code Pattern: GitOps, Reusable Modules, and Policy Guardrails
TLDR: Infrastructure as code is useful because it makes infrastructure changes reviewable, repeatable, and testable. It becomes production-grade only when module boundaries, state locking, GitOps flow, and policy checks are treated as operational con...
Feature Flags Pattern: Decouple Deployments from User Exposure
TLDR: Feature flags separate deploy from exposure. They are operationally valuable when you need cohort rollout, instant kill switches, or entitlement control without rebuilding or redeploying the service. TLDR: Flags help only when they are treated ...
Event Sourcing Pattern: Auditability, Replay, and Evolution of Domain State
TLDR: Event sourcing pays off when regulatory audit history and replay are first-class requirements — but it demands strict schema evolution, a snapshot strategy, and a framework that owns aggregate lifecycle. Spring Boot + Axon Framework is the fast...
Dimensional Modeling and SCD Patterns: Building Stable Analytics Warehouses
TLDR: Dimensional modeling with explicit SCD policy is the foundation for reproducible metrics and trustworthy historical analytics. TLDR: Dimensional models stay trustworthy only when teams define grain, history rules, and reload procedures before d...
Deployment Architecture Patterns: Blue-Green, Canary, Shadow Traffic, Feature Flags, and GitOps
TLDR: Release safety is an architecture capability, not just a CI/CD convenience. Blue-green, canary, shadow traffic, feature flags, and GitOps patterns exist to control blast radius, measure regressions early, and make rollback fast enough to matter...
Dead Letter Queue Pattern: Isolating Poison Messages and Recovering Safely
TLDR: A dead letter queue protects throughput by moving repeatedly failing messages out of the hot path. It only works if retries are bounded, triage has an owner, and replay is a deliberate workflow instead of a panic button. TLDR: The main SRE ques...
Data Pipeline Orchestration Pattern: DAG Scheduling, Retries, and Recovery
TLDR: Pipeline orchestration is an operational control plane problem that requires explicit dependency, retry, and backfill contracts. TLDR: Pipeline orchestration is less about drawing DAGs and more about controlling freshness, replay, and recovery ...
CQRS Pattern: Separating Write Models from Query Models at Scale
TLDR: CQRS works when read and write workloads diverge, but only with explicit freshness budgets and projection reliability. The hard part is not separating models — it is operating lag, replay, and rollback safely. An e-commerce platform's order se...
Cloud Architecture Patterns: Cells, Control Planes, Sidecars, and Queue-Based Load Leveling
TLDR: Cloud scale is not created by sprinkling managed services around a diagram. It comes from isolating failure domains, separating coordination from request serving, and smoothing bursty work before it overloads synchronous paths. TLDR: Cloud patt...
Circuit Breaker Pattern: Prevent Cascading Failures in Service Calls
TLDR: Circuit breakers protect callers from repeatedly hitting a failing dependency. They turn slow failure into fast failure, giving the rest of the system room to recover. TLDR: A circuit breaker is useful only if it is paired with good timeouts, l...
Change Data Capture Pattern: Log-Based Data Movement Without Full Reloads
TLDR: Change data capture moves committed database changes into downstream systems without full reloads. It is most useful when freshness matters, replay matters, and the source database must remain the system of record. TLDR: CDC becomes production-...
Canary Deployment Pattern: Progressive Delivery Guarded by SLOs
TLDR: Canary deployment is useful only when the rollout gates are defined before the rollout starts. Sending 1% of traffic to a bad build is still a bad release if you do not know what metric forces rollback. TLDR: Canary is the practical choice when...
Bulkhead Pattern: Isolating Capacity to Protect Critical Workloads
TLDR: Bulkheads isolate capacity so one overloaded dependency or workload class cannot consume every thread, queue slot, or connection in the service. TLDR: Use bulkheads when different workloads do not deserve equal blast radius. The practical goal ...
Blue-Green Deployment Pattern: Safe Cutovers with Instant Rollback
TLDR: Blue-green deployment reduces release risk by preparing the new environment completely before traffic moves. It is most effective when rollback is a routing change, not a rebuild. TLDR: Blue-green is practical for SRE teams when three things ar...
Big Data Architecture Patterns: Lambda, Kappa, CDC, Medallion, and Data Mesh
TLDR: A serious data platform is defined less by where files are stored and more by how changes enter the system, how serving layers are materialized, and who owns quality over time. Lambda, Kappa, CDC, Medallion, and Data Mesh are patterns for makin...
System Design HLD Example: URL Shortener (TinyURL and Bitly)
TLDR: Design a URL shortener like TinyURL or Bitly. This article follows the system design interview template flow: use cases, requirements, estimations, design goals, HLD, and design deep dive. TLDR: A URL shortener converts long links into com...
System Design HLD Example: Search Autocomplete
TLDR: Design search autocomplete for a large-scale product. This article follows the system design interview template flow: use cases, requirements, estimations, design goals, HLD, and design deep dive. TLDR: Sub-10ms prefix lookups via Redis sorted ...
System Design HLD Example: Rate Limiter (Global API Protection)
TLDR: Design a distributed rate limiter for public APIs. This article covers the full solution: per-user/per-key policies, endpoint quotas, burst handling with token bucket, atomic Redis Lua enforcement, and retry metadata. TLDR: A distributed rate l...
System Design HLD Example: Payment Processing Platform
TLDR: Design a payment processing system for online checkout. This article covers idempotent authorization, two-phase authorize-capture, double-entry ledger writes, webhook delivery with retry, and nightly reconciliation — with concrete schema, Redis...
System Design HLD Example: Notification Service (Email, SMS, Push)
TLDR: Design a notification service for email, SMS, and push channels. This article covers the full solution: data model, per-channel queue isolation, deduplication, retry with dead-letter queue, provider failover, and delivery tracking. TLDR: A noti...
System Design HLD Example: News Feed (Home Timeline)
TLDR: Design a news feed for a social platform. This article follows the system design interview template flow: use cases, requirements, estimations, design goals, HLD, and design deep dive. TLDR: A news feed system builds personalized timelines...
System Design HLD Example: File Storage and Sync (Google Drive and Dropbox)
TLDR: Design a cloud file storage and sync system like Dropbox. This article follows the system design interview template flow: use cases, requirements, estimations, design goals, HLD, and design deep dive. TLDR: File storage and sync systems se...
System Design HLD Example: Distributed Cache Platform
TLDR: A distributed cache reduces read latency and source-of-truth load while introducing consistency trade-offs. This walkthrough solves the full problem: consistent hashing for online scaling, per-namespace eviction policies, synchronous invalidati...
System Design HLD Example: Chat and Messaging Platform
TLDR: Design a chat application like WhatsApp or Slack direct messages. This article covers the full solution: data model, per-feature deep dives, WebSocket fan-out, delivery receipts, presence, reconnect sync, and multi-device behavior. TLDR: A chat...
System Design HLD Example: API Gateway for Microservices
TLDR: Design an API gateway for a microservice platform. This article follows the system design interview template flow: use cases, requirements, estimations, design goals, HLD, and design deep dive. TLDR: An API gateway centralizes ingress conc...
System Design Roadmap: A Complete Learning Path from Basics to Advanced Architecture
TLDR: This roadmap organizes every system-design-tagged post in this repository into learning groups and a recommended order. It is designed for interview prep and practical architecture thinking, from fundamentals to scaling, reliability, and implem...
System Design Requirements and Constraints: Ask Better Questions Before You Draw
TLDR: In system design interviews, weak answers fail early because requirements are fuzzy. Strong answers start by turning vague prompts into explicit functional scope, measurable non-functional targets, and clear trade-off boundaries before any arch...
Understanding Consistency Patterns: An In-Depth Analysis
TLDR: Consistency is about whether all nodes in a distributed system show the same data at the same time. Strong consistency gives correctness but costs latency. Eventual consistency gives speed but requires tolerance for briefly stale reads. C...
Little's Law: The Secret Formula for System Performance
TLDR: Little's Law ($L = \lambda W$) connects three metrics every system designer measures: $L$ = concurrent requests in flight, $\lambda$ = throughput (RPS), $W$ = average response time. If latency spikes, your concurrency requirement explodes with ...
How Transformer Architecture Works: A Deep Dive
TLDR: The Transformer is the architecture behind every major LLM (GPT, BERT, Claude, Gemini). Its core innovation is Self-Attention — a mechanism that lets the model weigh relationships between all tokens in a sequence simultaneously, regardless of d...
The 8 Fallacies of Distributed Systems
TLDR: The 8 fallacies are assumptions that developers make about distributed systems, all of which are false. L. Peter Deutsch at Sun Microsystems compiled most of the list around 1994, and James Gosling added the eighth in 1997. Believing them leads to hard-to-reproduce bugs, timeout cascades, and security holes. Knowing them...
Data Warehouse vs Data Lake vs Data Lakehouse: Which One to Choose?
TLDR: Warehouse = structured, clean data for BI and SQL dashboards (Snowflake, BigQuery). Lake = raw, messy data for ML and data science (S3, HDFS). Lakehouse = open table formats (Delta Lake, Iceberg) that bring SQL performance to raw storage — the ...

Strategy Design Pattern: Simplifying Software Design
TLDR: The Strategy Pattern replaces giant if-else or switch blocks with a family of interchangeable algorithm classes. Each strategy is a self-contained unit that can be swapped at runtime without touching the client code. The result: Open/Closed Pri...
System Design Service Discovery and Health Checks: Routing Traffic to Healthy Instances
TLDR: Service discovery is how clients find the right service instance at runtime, and health checks are how systems decide whether an instance should receive traffic. Together, they turn dynamic infrastructure from guesswork into deterministic routi...
System Design Observability, SLOs, and Incident Response: Operating Systems You Can Trust
TLDR: Observability is how you understand system behavior from telemetry, SLOs are explicit reliability targets, and incident response is the execution model when those targets are at risk. Together, they convert operational chaos into measurable, re...
System Design Message Queues and Event-Driven Architecture: Building Reliable Asynchronous Systems
TLDR: Message queues and event-driven architecture let services communicate asynchronously, absorb bursty traffic, and isolate failures. The core design challenge is not adding a queue — it is defining delivery semantics, retry behavior, and idempote...
System Design Multi-Region Deployment: Latency, Failover, and Consistency Across Regions
TLDR: Multi-region deployment means running the same system across more than one geographic region so users get lower latency and the business can survive a regional outage. The design challenge is no longer just scaling compute. It is coordinating r...
System Design Interview Basics: A Beginner-Friendly Framework for Clear Answers
TLDR: System design interviews are not about inventing a perfect architecture on the spot. They are about showing a calm, repeatable process: clarify requirements, estimate scale, sketch a simple design, explain trade-offs, and improve it when constr...
How Kafka Works: The Log That Never Forgets
TLDR: Kafka is a distributed event store. Unlike a traditional queue (RabbitMQ) where messages disappear after reading, Kafka stores them in a persistent Log. This allows multiple consumers to read the same data at their own pace, replay history, and...
Consistent Hashing: Scaling Without Chaos
TLDR: Standard hashing (key % N) breaks when $N$ changes — adding or removing a server reshuffles almost all keys. Consistent Hashing maps both servers and keys onto a ring (0–360°). When a server is added, only its immediate neighbors' keys move, mi...

System Design Databases: SQL vs NoSQL and Scaling
TLDR: SQL gives you ACID guarantees and powerful relational queries; NoSQL gives you horizontal scale and flexible schemas. The real decision is not "which is better" — it is "which trade-offs align with your workload." Understanding replication, sha...

System Design Protocols: REST, RPC, and TCP/UDP
TLDR: 🎯 Use REST (HTTP + JSON) for public, browser-facing APIs where interoperability matters. Choose gRPC (HTTP/2 + Protobuf) for internal microservice communication when latency counts. Under the hood, TCP guarantees reliable ordered delivery; UDP...

System Design Networking: DNS, CDNs, and Load Balancers
TLDR: When you hit a URL, DNS translates the name to an IP, CDNs serve static assets from the edge nearest to you, and Load Balancers spread traffic across many servers so no single machine becomes a bottleneck. These three layers are the traffic con...

System Design Core Concepts: Scalability, CAP, and Consistency
TLDR: 🚀 Scalability, the CAP Theorem, and consistency models are the three concepts that determine whether a distributed system can grow, stay reliable, and deliver correct results. Get these three right and you can reason about any system design qu...

The Ultimate Guide to Acing the System Design Interview
TLDR: System Design interviews are collaborative whiteboard sessions, not trick-question coding tests. Follow the framework — Requirements → Estimations → API → Data Model → High-Level Architecture → Deep-Dive — and you turn vague product ideas into ...
System Design Sharding Strategy: Choosing Keys, Avoiding Hot Spots, and Resharding Safely
TLDR: Sharding means splitting one logical dataset across multiple physical databases so no single node carries all the data and traffic. The hard part is not adding more nodes. The hard part is choosing a shard key that keeps data balanced and queri...
System Design Replication and Failover: Keep Services Alive When a Primary Dies
TLDR: Replication means keeping multiple copies of your data so the system can survive machine, process, or availability-zone failures. Failover is the coordinated act of promoting a healthy replica, rerouting traffic, and recovering without corrupti...
Elasticsearch vs Time-Series DB: Key Differences Explained
TLDR: Elasticsearch is built for search — full-text log queries, fuzzy matching, and relevance ranking via an inverted index. InfluxDB and Prometheus are built for metrics — numeric time series with aggressive compression. Picking the wrong one waste...
What are Hash Tables? Basics Explained
TLDR: A hash table gives you near-O(1) lookups, inserts, and deletes by using a hash function to map keys to array indices. The tradeoff: collisions (when two keys hash to the same slot) must be handled, and a full hash table must be resized. 📖 Th...
Understanding Inverted Index and Its Benefits in Software Development
TLDR: An Inverted Index maps every word to the list of documents containing it, the same structure as a back-of-the-book index. It is the core data structure behind every full-text search engine, including Elasticsearch, Lucene, and PostgreS...
How Bloom Filters Work: The Probabilistic Set
TLDR: A Bloom Filter is a bit array + multiple hash functions that answers "Is X in the set?" in $O(1)$ time using a fixed amount of space. It can return false positives (say "yes" when the answer is "no") but never false negatives (never says "no" when the answer...
System Design API Design for Interviews: Contracts, Idempotency, and Pagination
TLDR: In system design interviews, API design is not a list of HTTP verbs. It is a contract strategy: clear resource boundaries, stable request and response shapes, pagination, idempotency, error semantics, and versioning decisions that survive scale...
Backend for Frontend (BFF): Tailoring APIs for UI
TLDR: A "one-size-fits-all" API causes bloated mobile payloads and underpowered desktop dashboards. The Backend for Frontend (BFF) pattern solves this by creating a dedicated API server for each client type — the mobile BFF reshapes data for small sc...
LLD for Movie Booking System: Designing BookMyShow
TLDR: A Movie Booking System (like BookMyShow) is an inventory management problem with an expiry: seats expire when the show starts. The core engineering challenge is preventing double-booking under concurrent user load with a 3-state seat mode...

Types of Locks Explained: Optimistic vs. Pessimistic Locking
TLDR: Pessimistic locking locks the record before editing — safe but slower under low contention. Optimistic locking checks for changes before saving using a version number — fast but can fail and require retry under high contention. Choosing correct...
AI Architecture Patterns: Routers, Planner-Worker Loops, Memory Layers, and Evaluation Guardrails
TLDR: A single agent loop is enough for a demo, but production AI systems need explicit layers for routing, execution, memory, and evaluation. Those layers determine safety, latency, cost, and traceability far more than model choice alone. TLDR: Prod...
System Design: Caching and Asynchronism
TLDR: Caching stores hot data in fast RAM so you skip slow database round-trips. Asynchronism moves slow tasks (email, video processing) off the critical path via message queues. Together, they turn a blocking, slow system into a responsive, scalable...
LLD for URL Shortener: Designing TinyURL
TLDR: A URL Shortener maps long URLs to short IDs. The core challenge is generating a globally unique, short, collision-free ID at scale. We use Base62 encoding on auto-incrementing database IDs for deterministic, collision-free short codes. ...
X.509 Certificates: A Deep Dive into How They Work
TLDR: An X.509 Certificate is a digital document that binds a Public Key to an Identity (e.g., google.com). It is digitally signed by a trusted Certificate Authority (CA). It prevents attackers from impersonating websites via man-in-the-middle attack...
How SSL/TLS Works: The Handshake Explained
TLDR: SSL (now TLS) secures data between your browser and a server. It uses Asymmetric Encryption (Public/Private keys) once — to safely exchange a fast Symmetric Session Key. Everything after the handshake is encrypted with the session key. 📖 The...
How Kubernetes Works: The Container Orchestrator
TLDR: Kubernetes (K8s) is an operating system for the cloud. It manages clusters of computers (Nodes) and schedules applications (Pods) onto them via a continuous declarative control loop: you describe what you want, and Kubernetes continuousl...
How Fluentd Works: The Unified Logging Layer
TLDR: Fluentd is an open-source data collector that decouples log sources from destinations. It ingests logs from 100+ sources (Nginx, Docker, syslog), normalizes them to JSON, applies filters and transformations, and routes them to 100+ outputs (Ela...
BASE Theorem Explained: How it Stands Against ACID
TLDR: ACID (Atomicity, Consistency, Isolation, Durability) is the gold standard for banking. BASE (Basically Available, Soft state, Eventual consistency) is the standard for social media. BASE intentionally sacrifices instant accuracy in exchan...
A Guide to Raft, Paxos, and Consensus Algorithms
TLDR: Consensus algorithms allow a cluster of computers to agree on a single value (e.g., "Who is the leader?"). Paxos is the academic standard: correct but notoriously hard to understand. Raft is the practical standard, designed for understa...

API Gateway vs. Load Balancer vs. Reverse Proxy: What's the Difference?
TLDR: A Reverse Proxy hides your servers and handles caching/SSL. A Load Balancer spreads traffic across server instances. An API Gateway manages API concerns — auth, rate limiting, routing, and protocol translation. Modern tools (Nginx, AWS ALB, Kon...

Java Memory Model Demystified: Stack vs. Heap
TLDR: Java memory is split into two main areas: the Stack for method execution frames and primitives, and the Heap for all objects. Understanding their differences is essential for avoiding stack overflow errors, memory leaks, and garbage collection ...


