
The 8 Fallacies of Distributed Systems

The Network is Reliable. Latency is Zero. Bandwidth is Infinite. If you believe these, your system will fail. We debunk the 8 fallacies.

Abstract Algorithms · 14 min read

TLDR

In 1994, L. Peter Deutsch at Sun Microsystems codified a list of assumptions that developers make about distributed systems (James Gosling added the eighth a few years later). All of them are false. Believing them leads to hard-to-reproduce bugs, timeout cascades, and security holes. Knowing them is a prerequisite for designing systems that actually work at scale.


📖 The Eight Assumptions That Will Break Your System

The first time I heard "the network is reliable", I nodded. The second time, our payment service was retrying duplicate charges because the network wasn't reliable and we hadn't handled it. Orders were double-billed, customers called support, and the root cause traced back to a single missing idempotency check, one that would have been obvious if we had taken this fallacy seriously from day one. These 8 fallacies are the production bugs waiting to happen.

These are not theoretical warnings. They are a field guide to the most common production bugs in distributed software.

The eight fallacies:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.
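The payment incident above came down to one missing idempotency check. Here is a hedged, minimal in-memory sketch of that check; the class and method names are illustrative, and a real service would persist keys in a durable store shared across instances:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Minimal idempotency sketch: the client sends a unique key per logical charge,
// so a network-level retry with the same key returns the original result
// instead of billing twice. In production the key store would be durable.
class IdempotentCharges {
    private final Map<String, Long> processed = new ConcurrentHashMap<>();
    private final AtomicLong txnIds = new AtomicLong();

    /** Returns the transaction id; a retry with the same key is a no-op. */
    long charge(String idempotencyKey, long amountCents) {
        return processed.computeIfAbsent(idempotencyKey, key -> {
            // The real payment-gateway call happens exactly once per key.
            return txnIds.incrementAndGet();
        });
    }
}
```

With this in place, a duplicate network retry of `charge("order-42", 500)` returns the same transaction id instead of creating a second charge.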

πŸ” What Makes a System "Distributed" in the First Place

Before the fallacies can bite you, you need to know what kind of system exposes you to them.

A distributed system is any application where components run on separate machines and communicate over a network. That sounds obvious β€” until you realise it covers microservices, mobile apps talking to a backend, two data centre regions sharing a database, and even a monolith that makes third-party API calls.

The key insight: a network call is fundamentally different from a function call. A local function call either returns or raises an exception. A network call can silently time out, partially complete, or succeed on the server while the response is lost on the way back. All eight fallacies are rooted in forgetting that distinction.

| Property | Local function call | Network call |
|---|---|---|
| Speed | Nanoseconds | Milliseconds to seconds |
| Failure mode | Exception is deterministic | Silent timeout, partial success |
| Encryption | N/A (same process) | Must be explicit (mTLS, TLS) |
| Topology | Fixed (same host) | Changes: pods restart, IPs change |
| Cost | CPU only | CPU + bandwidth + egress charges |

Every row in that table is one of the eight fallacies in disguise.


🔒 Network Fallacies 1–4: Reliability, Latency, Bandwidth, and Security

Fallacy 1: The network is reliable.

Packets are dropped. Connections are reset. Load balancers time out. A call that succeeds 99.9% of the time still fails once every 1,000 requests, and distributed systems make thousands of calls per second.

Design response: Retry with exponential backoff. Use circuit breakers (Hystrix, Resilience4j). Design for idempotency.
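A hand-rolled sketch of that design response, assuming nothing beyond the JDK (in production you would reach for a library like Resilience4j, covered later in this post); all names here are illustrative:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Retry with exponential backoff plus jitter: the minimal defense for Fallacy 1.
class RetryWithBackoff {

    /** Delay before retry n (0-based): base * 2^n, randomized by up to ±30%. */
    static long backoffMillis(int attempt, long baseMillis) {
        long exponential = baseMillis * (1L << attempt);   // 200 -> 400 -> 800 ...
        double jitter = 1.0 + ThreadLocalRandom.current().nextDouble(-0.3, 0.3);
        return (long) (exponential * jitter);              // spreads out retry storms
    }

    static <T> T call(Supplier<T> remoteCall, int maxAttempts, long baseMillis) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return remoteCall.get();                   // the network call
            } catch (RuntimeException e) {
                last = e;                                  // transient failure: back off
                try {
                    Thread.sleep(backoffMillis(attempt, baseMillis));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted during backoff", ie);
                }
            }
        }
        throw last;                                        // retry budget exhausted
    }
}
```

The jitter matters as much as the backoff: without it, every client that saw the same blip retries at the same instant and re-creates the overload.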

Fallacy 2: Latency is zero.

A function call on the same machine takes nanoseconds. A call to a service in the same data center takes ~0.5 ms. A call across regions takes 50–200 ms. At 100 chained calls, that is 20+ seconds.

Design response: Avoid deep synchronous call chains. Use async messaging. Parallelize independent calls.
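Parallelizing independent calls can be sketched with `CompletableFuture`: three sequential 50 ms cross-zone lookups cost roughly 150 ms, but fanned out they cost about one. The three fetch methods below are hypothetical stand-ins for remote calls:

```java
import java.util.concurrent.CompletableFuture;

// Fan out independent lookups instead of chaining them; total latency is
// roughly the slowest single call, not the sum of all three.
class ParallelFanOut {
    static String fetchUser()   { return "user"; }    // stand-in for a remote call
    static String fetchOrders() { return "orders"; }  // stand-in for a remote call
    static String fetchPrefs()  { return "prefs"; }   // stand-in for a remote call

    static String loadDashboard() {
        CompletableFuture<String> user   = CompletableFuture.supplyAsync(ParallelFanOut::fetchUser);
        CompletableFuture<String> orders = CompletableFuture.supplyAsync(ParallelFanOut::fetchOrders);
        CompletableFuture<String> prefs  = CompletableFuture.supplyAsync(ParallelFanOut::fetchPrefs);
        // join() blocks until each completes; all three run concurrently
        return user.join() + "|" + orders.join() + "|" + prefs.join();
    }
}
```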

Fallacy 3: Bandwidth is infinite.

Sending large JSON payloads is cheap in development where teams work on fast LANs. In production, AWS cross-AZ bandwidth costs money; serializing large object graphs creates GC pressure.

Design response: Use binary serialization (Protobuf, Avro). Filter fields at the API boundary. Use compression for large payloads.
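The compression part of that design response is a few lines with the JDK's gzip support. Verbose JSON repeats every field name in every element, so it compresses heavily; this sketch only shows the sending side:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Compress a large, repetitive JSON payload before it crosses the AZ boundary;
// less bandwidth used means a smaller egress bill.
class PayloadCompression {
    static byte[] gzip(String payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(payload.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e); // in-memory streams do not fail in practice
        }
        return out.toByteArray();
    }
}
```

Compression costs CPU on both ends, so it is usually worth enabling only above some payload-size threshold rather than on every response.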

Fallacy 4: The network is secure.

Traffic between services inside your VPC is not automatically encrypted or authenticated. An attacker who gains access to your network can intercept or inject requests.

Design response: mTLS between services. Zero-trust network model. Never pass secrets in logs.


βš™οΈ Infrastructure Fallacies 5–8: Topology, Administration, Cost, and Compatibility

Fallacy 5: Topology doesn't change.

Servers fail. Auto-scaling adds and removes instances. Kubernetes restarts pods. Hardcoding IP addresses breaks as soon as a node is replaced.

Design response: Use service discovery (Consul, Kubernetes DNS, AWS Cloud Map). Never hardcode IPs.
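The consumer side of service discovery can be sketched as: resolve a logical service name against whatever the registry currently knows, and round-robin over live instances, so replaced nodes disappear without a code change. The in-memory map here is an illustrative stand-in for Consul, Kubernetes DNS, or AWS Cloud Map:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Resolve a service name to a current instance on every call; never cache a
// hardcoded IP. The registry map stands in for a real discovery backend.
class DiscoveryClient {
    private final Map<String, List<String>> registry;
    private final AtomicInteger cursor = new AtomicInteger();

    DiscoveryClient(Map<String, List<String>> registry) {
        this.registry = registry;
    }

    /** Next live endpoint for a service, round-robin across instances. */
    String resolve(String serviceName) {
        List<String> instances = registry.getOrDefault(serviceName, List.of());
        if (instances.isEmpty()) {
            throw new IllegalStateException("no live instances for " + serviceName);
        }
        return instances.get(Math.floorMod(cursor.getAndIncrement(), instances.size()));
    }
}
```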

Fallacy 6: There is one administrator.

Real systems involve the platform team, the security team, the application team, and the database team. A schema migration "owned" by the app team may be blocked by the DBA team for a week.

Design response: Design for backward and forward compatibility. Feature flags for deployments. Self-service infra via IaC.

Fallacy 7: Transport cost is zero.

Serializing a Java object to JSON, compressing it, encrypting it, sending it over a socket, deserializing it on the other side β€” all of this costs CPU cycles, memory, and money (cloud egress charges).

Design response: Right-size payloads. Batch small messages. Cache at the boundary.
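Batching small messages can be sketched as a buffer that flushes when full: each flush pays the per-request transport overhead (headers, TLS records, serialization setup) once for N events instead of N times. `sendBatch` is a hypothetical stand-in for the real wire call:

```java
import java.util.ArrayList;
import java.util.List;

// Buffer events and send them as one batch; 100 events with batchSize 10
// means 10 round trips instead of 100.
class EventBatcher {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private int flushes = 0;

    EventBatcher(int batchSize) { this.batchSize = batchSize; }

    void add(String event) {
        buffer.add(event);
        if (buffer.size() >= batchSize) flush();
    }

    private void flush() {
        sendBatch(List.copyOf(buffer));   // one round trip for the whole batch
        buffer.clear();
        flushes++;
    }

    private void sendBatch(List<String> batch) {
        // the network call would go here
    }

    int flushCount() { return flushes; }
}
```

A production batcher would also flush on a timer so a half-full buffer never sits indefinitely.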

Fallacy 8: The network is homogeneous.

Mobile clients, desktop browsers, IoT devices, and internal services all speak different protocols, have different MTUs, and fail in different ways. Expecting all consumers to behave like your tested Java client will lead to interoperability bugs.

Design response: Use standard protocols (HTTP/1.1, HTTP/2, gRPC). Handle content negotiation. Test with diverse client types.
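Content negotiation at its simplest is a server-side choice driven by the `Accept` header rather than an assumption about the client. A minimal sketch, with illustrative media types:

```java
// Pick the response format from the Accept header instead of assuming every
// consumer speaks your tested client's dialect.
class ContentNegotiator {
    static String selectContentType(String acceptHeader) {
        if (acceptHeader == null || acceptHeader.isBlank()) {
            return "application/json";                       // safe default
        }
        if (acceptHeader.contains("application/x-protobuf")) {
            return "application/x-protobuf";                 // binary for capable clients
        }
        return "application/json";   // degrade gracefully rather than reject the client
    }
}
```

Real frameworks parse quality values (`q=`) and wildcards; the point is only that the format decision belongs to the negotiation step, not to a hardcoded assumption.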


📊 From False Assumption to Production Outage

Every fallacy follows the same lifecycle: the assumption goes unchallenged in local development, test environments mask the problem, and then production traffic amplifies the gap until something fails.

flowchart TD
    F[False Assumption baked into code] --> D[No defensive code written]
    D --> T[Tests pass on localhost / staging]
    T --> P[Hits real production traffic]
    P --> O[Outage or silent data corruption]
    O --> M[Team adds retry / circuit breaker / mTLS]
    M --> R[Resilient system]

The painful part: teams almost always reach R, but only after the outage forces their hand. The goal of knowing these fallacies is to reach R before the first incident, not after.


🌍 Real-World Applications: Where Each Fallacy Bites in Production

Abstract warnings become concrete when mapped to documented production failures.

| Fallacy | Real incident | Failure symptom | Defense |
|---|---|---|---|
| Network reliable | GitHub 2012 MySQL failover: one failing DB cascaded to all API endpoints with no circuit breaker | Entire platform degraded for 2+ hours | Retry + circuit breaker |
| Latency zero | Amazon (Vogels, 2006): 100 ms of added latency correlated with a measurable revenue drop across all storefronts | Long sync call chains destroyed user p99 | Async messaging, parallel fan-out |
| Bandwidth infinite | Object serialisation over REST | High GC pressure, egress cost spike | Protobuf / Avro, field filtering |
| Network secure | Capital One 2019: SSRF via misconfigured WAF reached the AWS metadata endpoint; internal traffic assumed safe | Credentials exfiltrated from "trusted" VPC | mTLS, zero-trust model |
| Topology doesn't change | Hardcoded IP on auto-scaled node | Deployment breaks after replacement | Service discovery (Consul, k8s DNS) |
| One administrator | Schema migrations blocked by cross-team ownership | Migration blocked for a week | Backward-compatible deploys, IaC |
| Transport cost is zero | Chatty microservices | Unexpected cloud egress bill | Batching, caching at the boundary |
| Network is homogeneous | Same payload to all client types | Mobile app crashes on older protocol | Content negotiation, version headers |

A new service failing in all eight ways simultaneously is rare. A new service failing in two or three of them in the first quarter after launch is extremely common.


🧪 A Pre-Flight Checklist for Every New Service Call

Use this at design-review time to catch fallacy violations before they reach production.

Reliability and latency (Fallacies 1 & 2)

  • [ ] Timeout configured explicitly (never rely on OS defaults)
  • [ ] Retry with exponential backoff and random jitter
  • [ ] Circuit breaker so a failing downstream cannot exhaust upstream threads
  • [ ] Bulkhead: limit concurrent calls per downstream
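The bulkhead item above can be sketched with a plain semaphore: cap concurrent calls to one downstream so a slow dependency can tie up at most a fixed number of threads, never the whole pool. All names here are illustrative:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Bulkhead: calls over the concurrency cap are shed to a fallback immediately
// instead of queueing behind a slow downstream and exhausting the thread pool.
class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrent) { this.permits = new Semaphore(maxConcurrent); }

    <T> T call(Supplier<T> downstream, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();          // shed load: fail fast, stay responsive
        }
        try {
            return downstream.get();
        } finally {
            permits.release();              // always free the slot
        }
    }
}
```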

Bandwidth and security (Fallacies 3 & 4)

  • [ ] Payload size validated at p99
  • [ ] Binary serialisation considered for high-frequency paths (Protobuf saves 50–80% over JSON)
  • [ ] mTLS enabled; secrets never in query strings or logs

Topology and administration (Fallacies 5 & 6)

  • [ ] No hardcoded IPs; service discovery only
  • [ ] API versioned for backward-compatible rollout
  • [ ] Feature flag on the new dependency for zero-redeploy rollback

Cost and compatibility (Fallacies 7 & 8)

  • [ ] Cloud egress cost estimated; cross-AZ and cross-region calls tagged for billing
  • [ ] Response tested against at least two different client types

🧠 Deep Dive: The Practical Antidote for Distributed Failures

flowchart LR
    Call[Service A calls B] --> Retry{Retry logic?}
    Retry -- No --> Crash[Hard failure\nno retry = cascading outage]
    Retry -- Yes --> CB{Circuit breaker?}
    CB -- No --> Flood[B is down\nA floods with retries]
    CB -- Yes --> Timeout[Fail fast\nReturn fallback]

The minimal production checklist for every service call:

  • Timeout set (never rely on OS default)
  • Retry with exponential backoff + jitter
  • Circuit breaker to stop cascade
  • Bulkhead (limit concurrent calls per downstream)

βš–οΈ Trade-offs & Failure Modes: Why These Fallacies Persist

These fallacies are taught in university, yet still violated in code every week, because:

  • Local development masks network problems (everything runs on localhost)
  • Unit tests don't simulate network partitions or latency spikes
  • Monolith-to-microservices migrations often copy in-process assumptions to network calls

The most common production outage pattern: a service that worked fine in staging fails under load in production because no retry logic handles the 1-in-1,000 packet drop rate.


📚 Hard-Won Lessons from Teams Who Ignored These Fallacies

Lesson 1: Localhost is a lie. Local development masks every fallacy. Latency is sub-millisecond, the network is reliable, and there is one admin. Every assumption is validated, until it isn't.

Lesson 2: The first microservice extraction almost always violates Fallacy 2. Teams migrating a monolith to microservices copy in-process synchronous call patterns directly onto network calls. A chain of five synchronous service calls that took 2 ms in-process now takes 200 ms over the network, and nobody budgeted for that latency at the design stage.

Lesson 3: Security is the fallacy most often deferred until after the breach. Internal network traffic is left unencrypted because it feels like an internal problem. Zero-trust requires mTLS everywhere; a compromised internal node can otherwise intercept all unencrypted service-to-service traffic.

Lesson 4: Defensive patterns are not optional extras; they are load-bearing. Retries, circuit breakers, and bulkheads are not polish. Remove them and a single degraded service takes down every service that depends on it.


🧭 Decision Guide: Choosing Your Defenses Against the 8 Fallacies

Map each fallacy to a specific architectural defense. Prioritize based on the failure modes most likely to affect your system first.

| Fallacy | Primary defense | When to implement |
|---|---|---|
| Network unreliable | Retry + circuit breaker | Every synchronous RPC |
| Latency non-zero | Async messaging / timeout budget | Call chains > 2 hops |
| Bandwidth finite | Protobuf / field filtering | Payload > 10 KB |
| Network insecure | mTLS + zero-trust | Any sensitive data |
| Topology changes | Service discovery | Before first deployment |
| Multiple admins | Backward-compatible APIs | Any shared interface |
| Transport costs | Batch + cache at boundary | High-frequency paths |
| Heterogeneous clients | Content negotiation | Multi-client APIs |

📊 Fallacy Decision Tree: Which Applies to Your Bug?

flowchart TD
    Bug[Production bug or outage] --> Q1{Is the failure\ntransient / intermittent?}
    Q1 -->|Yes| Q2{Does retrying fix it?}
    Q1 -->|No| Q3{Is it a latency spike\nor timeout?}
    Q2 -->|Yes| F1[Fallacy 1\nNetwork is Reliable\nFix: retry + circuit breaker]
    Q2 -->|No| Q4{Is a node IP\nchanged / missing?}
    Q4 -->|Yes| F5[Fallacy 5\nTopology Constant\nFix: service discovery]
    Q4 -->|No| F4[Fallacy 4\nNetwork is Secure\nFix: mTLS + zero-trust]
    Q3 -->|Latency spike| Q5{Sync call chain\n> 2 hops?}
    Q3 -->|No: large bills| F7[Fallacy 7\nTransport is Free\nFix: Protobuf + batching]
    Q5 -->|Yes| F2[Fallacy 2\nLatency is Zero\nFix: async messaging]
    Q5 -->|No: payload size| F3[Fallacy 3\nBandwidth Infinite\nFix: compression]

📌 TLDR: Summary & Key Takeaways

  • All 8 fallacies are false assumptions developers make about networks that cause production bugs.
  • The four network fallacies: reliability, latency, bandwidth, security.
  • The four infrastructure fallacies: topology, administration, transport cost, homogeneity.
  • Every service call needs: timeout, retry with backoff, circuit breaker, and idempotency.
  • These bugs appear in production β€” not in local development β€” because localhost masks all of them.

πŸ“ Practice Quiz

  1. Your payment service calls an inventory service with no retry logic and no timeout. A brief network blip causes 0.1% of requests to drop silently. Which fallacy is your design violating?

    • A) The network is homogeneous
    • B) The network is reliable
    • C) Latency is zero
    • D) There is one administrator

    Correct Answer: B (Fallacy 1). No retry or timeout means you are treating the network as perfectly reliable. A 0.1% drop rate across millions of daily calls produces thousands of silent failures.
  2. A developer hardcodes 192.168.1.45 as the database host because the IP "never changes." A routine auto-scaling event replaces the database node and the app goes down immediately. Which fallacy caused the outage?

    • A) The network is secure
    • B) Transport cost is zero
    • C) Topology doesn't change
    • D) The network is homogeneous

    Correct Answer: C (Fallacy 5). Hardcoding IPs assumes the topology is static. Service discovery (Kubernetes DNS, Consul) eliminates this class of failure by resolving hostnames dynamically.
  3. Your ops team receives an unexpected $4,000 AWS bill from cross-AZ data transfer charges. The culprit is a new microservice that serialises a full 5 MB user-activity object on every request. Which fallacy best describes the root cause?

    • A) Latency is zero
    • B) The network is reliable
    • C) Transport cost is zero
    • D) The network is secure

    Correct Answer: C (Fallacy 7). Serialisation, encryption, and network egress all carry real CPU and financial costs. Right-sizing payloads and using binary serialisation (Protobuf) would have prevented the overspend.
  4. You add mTLS between your API gateway and all backend services. Which two fallacies are you directly addressing with that single change?

    • A) Fallacy 4 (network is secure) and Fallacy 8 (network is homogeneous)
    • B) Fallacy 4 (network is secure) and Fallacy 7 (transport cost is zero)
    • C) Fallacy 1 (network is reliable) and Fallacy 4 (network is secure)
    • D) Fallacy 2 (latency is zero) and Fallacy 4 (network is secure)

    Correct Answer: B. mTLS directly addresses Fallacy 4 by encrypting and authenticating traffic. It also touches Fallacy 7 because TLS handshakes and encryption add measurable CPU cost and slight latency, making transport non-free.

πŸ› οΈ Resilience4j: Defending Against Fallacies 1 & 2 in Spring Boot

Resilience4j is a lightweight fault-tolerance library for Java and Spring Boot. It provides circuit breakers, retry policies, rate limiters, bulkheads, and time limiters: a single dependency that directly addresses Fallacies 1, 2, and 7 (network unreliability, non-zero latency, and transport cost) from this post.

The snippet below wires together the three most critical defenses: exponential backoff with jitter (survives the transient packet drops of Fallacy 1), a circuit breaker (stops cascading failures when a downstream service is down), and a time limiter (enforces the Fallacy 2 lesson that latency is not zero, so callers must have a deadline).

// build.gradle: io.github.resilience4j:resilience4j-spring-boot3:2.2.0

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import org.springframework.stereotype.Service;
import java.util.concurrent.CompletableFuture;

@Service
public class InventoryClient {

    // ExternalInventoryApi is a stand-in for your HTTP client to the remote service
    private final ExternalInventoryApi externalInventoryApi;

    public InventoryClient(ExternalInventoryApi externalInventoryApi) {
        this.externalInventoryApi = externalInventoryApi;
    }

    // @Retry: up to 3 attempts with exponential backoff + jitter (Fallacy 1)
    @Retry(name = "inventory", fallbackMethod = "inventoryFallback")
    // @CircuitBreaker: opens after 50% failures in a 10-call window (Fallacy 1)
    @CircuitBreaker(name = "inventory")
    // @TimeLimiter: fail fast if call takes > 2 s β€” latency is NOT zero (Fallacy 2)
    @TimeLimiter(name = "inventory")
    public CompletableFuture<Integer> getStock(String productId) {
        return CompletableFuture.supplyAsync(() ->
            externalInventoryApi.fetchStock(productId)
        );
    }

    // Fallback: return a safe cached default instead of propagating failure
    public CompletableFuture<Integer> inventoryFallback(
            String productId, Throwable ex) {
        return CompletableFuture.completedFuture(-1);
    }
}

# application.yml: Resilience4j configuration for the inventory circuit
resilience4j:
  retry:
    instances:
      inventory:
        max-attempts: 3
        wait-duration: 200ms
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2     # 200 ms → 400 ms → 800 ms
        randomized-wait-factor: 0.3           # ±30% jitter prevents thundering herd
  circuitbreaker:
    instances:
      inventory:
        sliding-window-size: 10
        failure-rate-threshold: 50            # Open after 5 / 10 calls fail
        wait-duration-in-open-state: 10s      # Stay open 10 s before half-open probe
  timelimiter:
    instances:
      inventory:
        timeout-duration: 2s

Each annotation maps to a specific pre-flight checklist item from this post: @Retry → Fallacy 1 (network reliability); @TimeLimiter → Fallacy 2 (latency is non-zero); @CircuitBreaker → prevents the cascade-failure pattern that originates from believing both Fallacies 1 and 2 simultaneously.

For a full deep-dive on Resilience4j circuit breakers, bulkheads, and rate limiters in Spring Boot, a dedicated follow-up post is planned.


Written by Abstract Algorithms (@abstractalgorithms)