
What are Hash Tables? Basics Explained

Hash Tables are the magic behind O(1) lookups. We explain Hash Functions, Collisions, and why the...

Abstract Algorithms · 12 min read

TLDR: A hash table gives you near-O(1) lookups, inserts, and deletes by using a hash function to map keys to array indices. The tradeoff: collisions (when two keys hash to the same slot) must be handled, and a full hash table must be resized.


📖 The Magic Mailroom Stamp

Imagine a mailroom with 100 drawers. A magic stamp converts any employee name to a number 1–100. Want John's mail? Stamp "John" → you get 42 → open Drawer 42. No search needed.

That stamp is a hash function. The row of drawers is the array (the backing store). The combination is a hash table.


🔍 Hash Table Fundamentals

A hash table combines two simple ideas into a data structure with near-constant-time performance:

  1. An array provides O(1) access by index: array[42] is instant.
  2. A hash function converts any key (string, integer, object) to an index: h("john") → 42.

Together: given any key, compute its index in one step, then access the array slot directly. No iteration. No comparison loop.
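The two-step idea can be sketched in a few lines of Python (toy_hash, slots, and the stored value are illustrative names, not library APIs):

```python
# Minimal sketch of the two-step lookup: hash the key, then index the array.

def toy_hash(key: str, capacity: int) -> int:
    """Map any key to an index in [0, capacity)."""
    return hash(key) % capacity

capacity = 8
slots = [None] * capacity          # the backing array

# Insert: one hash computation + one array write
idx = toy_hash("john", capacity)
slots[idx] = ("john", "mail for John")

# Lookup: same hash computation + one array read, no scanning
entry = slots[toy_hash("john", capacity)]
print(entry[1])  # mail for John
```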

The three fundamental operations and their complexity:

| Operation | Average case | Worst case | Why worst case occurs |
| --- | --- | --- | --- |
| Lookup | O(1) | O(N) | All keys hash to the same slot |
| Insert | O(1) amortized | O(N) | Rehash triggered on a full table |
| Delete | O(1) | O(N) | All keys collide into the same bucket |

Worst case requires a pathologically bad hash function (or deliberate hash flooding). In practice, well-designed hash functions achieve average O(1) for all three operations.

What makes a good hash function?

  • Deterministic: Same key always maps to the same slot.
  • Uniform: Keys distribute evenly across all slots to minimize collisions.
  • Fast: Must compute in O(1) — the hash computation cannot become the bottleneck.
  • Avalanche effect: Similar keys produce very different hash values (prevents clustering).
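The deterministic and uniform properties can be sanity-checked empirically: hash a batch of keys and count how many land in each bucket. A minimal sketch, using the same letter-sum toy hash that appears later in this post (keys and table size are illustrative):

```python
# Empirical check of the "deterministic" and "uniform" properties,
# using the letter-sum toy hash shown later in this post.
from collections import Counter

def simple_hash(key: str, size: int) -> int:
    return sum(ord(c) for c in key) % size

size = 10
keys = [f"user{i}" for i in range(1000)]

# Deterministic: hashing the same key twice gives the same slot.
assert simple_hash("user42", size) == simple_hash("user42", size)

# Uniform (roughly): count how many keys land in each of the 10 buckets.
spread = Counter(simple_hash(k, size) for k in keys)
print(sorted(spread.items()))
# A perfectly uniform hash would put ~100 keys in each bucket; a heavily
# skewed spread means clustering and therefore more collisions.
```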

🔢 Deep Dive: Array + Hash Function

flowchart LR
    Key["Key: 'cat'"] --> HashFn["Hash Function\nh('cat') = 2"]
    HashFn --> Array["Array\n[0][1][2: 'cat→meow'][3]..."]
    Array --> Value["Value: 'meow'"]

Simple hash function (NOT production-quality):

def simple_hash(key: str, size: int) -> int:
    return sum(ord(c) for c in key) % size

Python's built-in hash() uses SipHash (a keyed hash function) for str and bytes keys to resist hash flooding attacks.
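One concrete failure of the letter-sum hash above: it has no avalanche effect, so any two anagrams collide no matter the table size, and that is exactly the kind of structure a hash flooding attacker exploits:

```python
# The letter-sum hash has no avalanche effect: reordering a key's characters
# never changes the sum, so all anagrams of a key collide at the same slot.
def simple_hash(key: str, size: int) -> int:
    return sum(ord(c) for c in key) % size

print(simple_hash("cat", 16), simple_hash("act", 16))        # same slot, guaranteed
print(simple_hash("listen", 16), simple_hash("silent", 16))  # same slot, guaranteed
# A keyed hash like SipHash makes slot positions unpredictable to an
# attacker, which is why CPython uses it for str and bytes keys.
```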


📊 How a Hash Table Handles a Lookup

The lookup flow for key "alice" in a hash table with separate chaining:

flowchart TD
    K["Key: 'alice'"] --> HF["Hash Function: h('alice') = 7"]
    HF --> IDX["Array index 7"]
    IDX --> BK["Bucket 7: [('alice', 25), ('charlie', 30)]"]
    BK --> SCAN{"Scan bucket: key == 'alice'?"}
    SCAN -->|Match found| RET["Return value: 25"]
    SCAN -->|No match| MISS["Return None / KeyError"]
    style RET fill:#90EE90
    style MISS fill:#FFB6C1

Three scenarios during lookup:

  1. Direct hit: array[index] contains exactly the target key → O(1).
  2. Collision chain: array[index] has multiple entries → scan the short chain → O(k) where k = bucket length (typically 1–2 with good load factor).
  3. Miss: The slot is empty → key not in table → O(1).

For a table with load factor 0.75 and a good hash function, average bucket length is < 2, making even chained lookups extremely fast in practice.
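The lookup flow above can be sketched as a function over a chained bucket array (buckets and chained_get are illustrative names; assumes separate chaining as in the diagram):

```python
# Sketch of the lookup flow, assuming separate chaining: buckets is a
# list of lists of (key, value) pairs.
def chained_get(buckets, key):
    idx = hash(key) % len(buckets)        # step 1: key -> index
    for k, v in buckets[idx]:             # step 2: scan the (short) chain
        if k == key:
            return v                      # direct hit or collision-chain hit
    raise KeyError(key)                   # miss: key not in table

buckets = [[] for _ in range(8)]
for k, v in [("alice", 25), ("charlie", 30)]:
    buckets[hash(k) % 8].append((k, v))

print(chained_get(buckets, "alice"))      # 25
```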


⚙️ Collisions: When Two Keys Land in the Same Slot

No hash function is perfect. Given a finite array, two different keys will eventually map to the same slot.

Example:

  • hash('cat') = 2 → Stored at index 2.
  • hash('bird') = 2 → Collision at index 2!

Two collision resolution strategies:

Separate Chaining

Each slot holds a linked list of key-value pairs. On collision, append to the list.

index 2: [('cat', 'meow') → ('bird', 'tweet')]

O(1) average, O(n) worst case if all keys hash to the same slot.
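A minimal insert for separate chaining, matching the example above; a Python list stands in for each slot's linked list (names and the fixed 4-slot table are illustrative):

```python
# Minimal separate-chaining insert: on collision, append to the slot's chain;
# if the key already exists, update it in place.
def chained_put(buckets, key, value):
    chain = buckets[hash(key) % len(buckets)]
    for i, (k, _) in enumerate(chain):
        if k == key:                 # key already present: update in place
            chain[i] = (key, value)
            return
    chain.append((key, value))       # collision or empty slot: append

buckets = [[] for _ in range(4)]
chained_put(buckets, "cat", "meow")
chained_put(buckets, "bird", "tweet")
print(buckets)
```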

Open Addressing (Linear Probing)

No linked lists. On collision, step forward until you find an empty slot.

hash('cat') = 2 → store at 2
hash('bird') = 2 (full) → probe 3 → store at 3

Pros: Better cache performance (all data in one array). Cons: Clustering — many consecutive full slots slow probing.
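Linear probing can be sketched as follows; this toy version assumes the table never fills (no resize logic) and ignores deletion, which real open-addressing tables handle with tombstones:

```python
# Linear probing: on collision, step forward to the next slot, wrapping
# around. Assumes the table never fills; no deletion handling.
def probe_insert(slots, key, value):
    idx = hash(key) % len(slots)
    while slots[idx] is not None and slots[idx][0] != key:
        idx = (idx + 1) % len(slots)          # probe the next slot
    slots[idx] = (key, value)
    return idx

def probe_get(slots, key):
    idx = hash(key) % len(slots)
    while slots[idx] is not None:
        if slots[idx][0] == key:
            return slots[idx][1]
        idx = (idx + 1) % len(slots)          # keep probing past collisions
    raise KeyError(key)

slots = [None] * 8
probe_insert(slots, "cat", "meow")
probe_insert(slots, "bird", "tweet")
print(probe_get(slots, "bird"))   # tweet
```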

| Strategy | Space | Lookup (avg) | Cache friendly | Used in |
| --- | --- | --- | --- | --- |
| Separate chaining | Extra memory for lists | O(1) | No | Java HashMap, C++ std::unordered_map |
| Linear probing | Compact array | O(1) | Yes | Python dict (open addressing with perturbed probing) |
| Quadratic probing | Compact array | O(1) | Partial | Open-addressing tables |
| Robin Hood hashing | Compact array | O(1) | Yes | Rust HashMap (pre-1.36; since replaced by SwissTable) |

📊 Open Addressing: Collision Probe Sequence

sequenceDiagram
    participant K as Key "bird"
    participant HF as Hash Function
    participant AR as Array

    K->>HF: h("bird") = 2
    HF->>AR: check slot 2
    AR-->>HF: slot 2 occupied by "cat"
    HF->>AR: probe slot 3 (linear)
    AR-->>HF: slot 3 empty
    HF-->>K: insert "bird" at slot 3
    Note over AR: No linked list needed

🧠 Deep Dive: Load Factor and Rehashing

Load factor = (number of entries) / (array capacity).

As the table fills up (load factor → 1.0), collisions increase and performance degrades.

Standard practice: rehash when load factor exceeds 0.75.

Rehashing:

  1. Create a new array 2× the size.
  2. Re-insert every existing key into the new array (re-hash all keys).
  3. Old array is garbage collected.

Cost: O(N) for a single rehash. But because the capacity doubles each time, rehashes happen at geometrically spaced intervals, so the amortized cost per insert is O(1).
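The three rehash steps can be sketched for a chained table (rehash is an illustrative helper, not a library function):

```python
# Sketch of the three rehash steps for a chained table: allocate a 2x
# bucket array, re-hash every entry into it, then drop the old array.
def rehash(buckets):
    new_buckets = [[] for _ in range(2 * len(buckets))]   # step 1: 2x array
    for chain in buckets:
        for key, value in chain:                          # step 2: re-insert
            new_buckets[hash(key) % len(new_buckets)].append((key, value))
    return new_buckets       # step 3: caller drops the old array (GC'd)

buckets = [[("a", 1), ("b", 2)], [("c", 3)]]
buckets = rehash(buckets)
print(len(buckets))   # 4
```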

📊 Load Factor Decision: Keep or Resize

flowchart TD
    A["Insert new key"] --> B{"Load factor > 0.75?"}
    B -- No --> C["Insert into current array"]
    C --> D["O(1) average"]
    B -- Yes --> E["Allocate 2x array"]
    E --> F["Re-hash all existing keys"]
    F --> G["Insert new key"]
    G --> H["Old array GC'd"]
    H --> I["Amortized O(1) per insert"]

⚙️ Why Python Dicts Are Fast

Python's dict uses open addressing with a compact layout (since CPython 3.6). Entries are stored in a dense array in insertion order, with a separate sparse index table mapping hash slots to positions in that array. This gives:

  • Fast iteration (contiguous memory, no pointer chasing).
  • Lower memory use (no per-entry linked-list overhead).
  • Insertion-order iteration: an implementation detail in 3.6, a language guarantee since Python 3.7.
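A quick illustration of the insertion-order guarantee (CPython 3.7+):

```python
# Iteration replays insertion order, not hash order (guaranteed since 3.7).
d = {}
d["banana"] = 2
d["apple"] = 1
d["cherry"] = 3
print(list(d))   # ['banana', 'apple', 'cherry']
```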

⚖️ Trade-offs, Failure Modes & Decision Guide: Hash Tables vs Others

| Operation | Hash table | BST (e.g., TreeMap) | Sorted array |
| --- | --- | --- | --- |
| Lookup by key | O(1) avg | O(log n) | O(log n) binary search |
| Insert | O(1) avg | O(log n) | O(n) shift |
| Delete | O(1) avg | O(log n) | O(n) shift |
| Range query | ❌ Not supported | ✅ O(log n + k) | ✅ O(log n + k) |
| Ordered iteration | ❌ No order | ✅ In-order | ✅ Sorted |

Choose hash table when: You need fast key-value lookup and don't need ordering or range queries. Choose BST when: You need sorted iteration or range queries (e.g., "find all keys between 10 and 50").


🌍 Real-World Application: Hash Tables in Production

Hash tables are one of the most widely used data structures in existence:

| Application | Hash table used | Why |
| --- | --- | --- |
| Programming language runtimes | Python dict, JavaScript object, Java HashMap | Core key-value primitive for all dynamic lookups |
| Database query execution | Hash join, hash aggregate | Join two tables by hashing the join key for O(1) matching |
| Caches (Redis, Memcached) | Hash table over key-value pairs | O(1) cache-hit lookup by key |
| DNS resolution | Cached DNS records | Map hostname → IP in O(1) after first resolution |
| Compiler symbol tables | Variable name → type/location | O(1) lookup during type-checking and code generation |
| Counting frequencies | Word count, histogram | count[word] += 1 pattern: O(1) per word |
| Deduplication | Seen-set during ETL | O(1) check whether a record was already processed |

In coding interviews, hash tables solve a huge class of problems in O(N):

  • Two Sum (O(N) with a hash lookup vs O(N²) brute force; the same idea extends to k-Sum variants)
  • Anagram detection (frequency map comparison)
  • Longest subarray with constraint (sliding window + hash)
  • Graph adjacency representation (adjacency list as hash map)
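Two Sum is the canonical example of trading O(N²) nested loops for a single pass with a hash map; a sketch of the standard approach:

```python
# Two Sum: one pass, storing each number's index and checking whether the
# complement (target - n) was already seen.
def two_sum(nums, target):
    seen = {}                         # value -> index
    for i, n in enumerate(nums):
        if target - n in seen:        # O(1) lookup replaces an inner O(N) loop
            return [seen[target - n], i]
        seen[n] = i
    return []                         # no pair sums to target

print(two_sum([2, 7, 11, 15], 9))    # [0, 1]
```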

🧪 Practical: Building a Frequency Counter

The most common hash table interview pattern: count occurrences and use them to answer queries.

Problem: Given two strings, determine if one is an anagram of the other in O(N) time.

def is_anagram(s: str, t: str) -> bool:
    if len(s) != len(t):
        return False

    freq = {}

    # Count characters in s
    for char in s:
        freq[char] = freq.get(char, 0) + 1

    # Subtract characters in t
    for char in t:
        if char not in freq or freq[char] == 0:
            return False
        freq[char] -= 1

    return True

# Examples
print(is_anagram("anagram", "nagaram"))  # True
print(is_anagram("rat", "car"))          # False

Why this is O(N): Building the frequency map is one pass over s (O(N)). Validating against t is one pass over t (O(N)). Total: O(2N) = O(N). No sorting required (sorting would be O(N log N)).

Python shortcut:

from collections import Counter

def is_anagram_v2(s: str, t: str) -> bool:
    return Counter(s) == Counter(t)

collections.Counter is Python's built-in frequency hash table — a dict subclass with the same O(1) insert and lookup.


📚 What Hash Tables Teach You About Trade-offs

  • O(1) average is not O(1) guaranteed. Every hash table has a worst case. Security-sensitive systems (web servers) must use keyed hash functions (SipHash, etc.) to prevent deliberate hash flooding attacks that force O(N) collisions.
  • Space and time have a direct relationship. A larger backing array → fewer collisions → faster lookups. Load factor is the knob that controls this trade-off.
  • Ordering is sacrificed for speed. A hash table scrambles keys by design. If you need to iterate keys in sorted order, you need a BST or a sorted array instead.
  • Rehashing is amortized, not free. Individual rehash events cost O(N). Latency-sensitive systems sometimes pre-size hash tables to avoid mid-operation rehashing and jitter.

📌 TLDR: Summary & Key Takeaways

  • Hash function maps key → array index in O(1). Must be deterministic and uniform.
  • Collisions are inevitable; Separate Chaining uses linked lists; Linear Probing uses the array itself.
  • Rehash at load factor 0.75 to keep average O(1); amortized cost is O(1) per insert.
  • Hash tables have no order — use a BST when you need sorted iteration or range queries.

📝 Practice Quiz

  1. What is the primary purpose of a hash function in a hash table?

    • A) To encrypt the key for security and privacy.
    • B) To convert a key to an integer array index in O(1) time, enabling direct-access storage.
    • C) To sort the keys in the table for binary search.
    • D) To compress values to save memory.

    Correct Answer: B — The hash function maps an arbitrary key to an array index. This enables direct array access (O(1)) rather than a sequential search (O(N)).
  2. A hash table has 100 slots and 90 entries. What is the load factor, and what should happen next?

    • A) Load factor = 0.9; rehash now (exceeds 0.75 threshold) — create a 200-slot table and re-insert all entries.
    • B) Load factor = 90; nothing — the table still has 10 empty slots.
    • C) Load factor = 0.9; delete the 15 least-recently-used entries to stay under capacity.
    • D) Load factor = 0.09; the table is mostly empty and performs optimally.

    Correct Answer: A — Load factor = entries / capacity = 90/100 = 0.9. This exceeds the 0.75 threshold. Standard action: allocate a 2× array (200 slots) and re-insert all existing entries with fresh hash computations.
  3. When would you choose a Binary Search Tree (BST) over a Hash Table?

    • A) When you need O(1) average-case lookup performance.
    • B) When you need sorted iteration or range queries (e.g., "all keys between 50 and 100").
    • C) When your dataset is too small to justify a hash table's overhead.
    • D) When you want to avoid collisions entirely.

    Correct Answer: B — Hash tables have no inherent ordering. For sorted iteration or range queries (critical in databases), a BST provides O(log N) ordered traversal and O(log N + k) range queries that hash tables cannot support.

🛠️ Java HashMap & ConcurrentHashMap: Hash Table Mechanics Under the Hood

Java's HashMap is a production-quality hash table. It uses separate chaining internally (a linked list that converts to a red-black tree once a bucket's length reaches 8 and the table has at least 64 slots) and automatically rehashes at a 0.75 load factor: exactly the mechanics described in the collision-resolution and rehashing sections above.

ConcurrentHashMap extends the same design for multithreaded environments: instead of locking the entire map, it installs the first node of an empty bucket with a CAS and locks only the head node of a non-empty bucket, so writers touching different buckets do not contend.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HashTableDemo {

    // ── Custom hash table with separate chaining (illustrative) ──────────────
    static class ChainedHashTable<K, V> {
        private static final int SIZE = 16;
        @SuppressWarnings("unchecked")
        private final java.util.LinkedList<Map.Entry<K,V>>[] buckets =
            new java.util.LinkedList[SIZE];

        public void put(K key, V value) {
            int idx = (key.hashCode() & 0x7fffffff) % SIZE;  // mask sign bit; Math.abs(Integer.MIN_VALUE) stays negative
            if (buckets[idx] == null) buckets[idx] = new java.util.LinkedList<>();
            buckets[idx].removeIf(e -> e.getKey().equals(key)); // update existing
            buckets[idx].add(Map.entry(key, value));
        }

        public V get(K key) {
            int idx = (key.hashCode() & 0x7fffffff) % SIZE;  // same masking as put()
            if (buckets[idx] == null) return null;
            return buckets[idx].stream()
                               .filter(e -> e.getKey().equals(key))
                               .map(Map.Entry::getValue)
                               .findFirst().orElse(null);
        }
    }

    public static void main(String[] args) {
        // ── HashMap: O(1) average — word frequency counter (mirrors 🧪 section)
        Map<String, Integer> freq = new HashMap<>();
        for (String word : new String[]{"apple","banana","apple","cherry","apple"}) {
            freq.merge(word, 1, Integer::sum);  // merge is atomic put-or-update
        }
        System.out.println(freq);  // {apple=3, banana=1, cherry=1}

        // ── ConcurrentHashMap: thread-safe, no full-map lock ──────────────────
        ConcurrentHashMap<String, Integer> concurrent = new ConcurrentHashMap<>();
        concurrent.merge("apple", 1, Integer::sum);  // CAS-based atomic increment
        System.out.println("ConcurrentHashMap: " + concurrent);
    }
}
| Class | Thread-safe | Null keys/values | Ordered | Use when |
| --- | --- | --- | --- | --- |
| HashMap | ❌ | ✅ / ✅ | ❌ | Single-threaded or externally synchronized |
| LinkedHashMap | ❌ | ✅ / ✅ | Insertion order | Need insertion-order iteration (e.g., LRU cache) |
| TreeMap | ❌ | ❌ / ✅ | Sorted by key | Need sorted keys or range queries (red-black tree internally) |
| ConcurrentHashMap | ✅ | ❌ / ❌ | ❌ | High-concurrency, write-heavy, multi-threaded |

For a full deep-dive on Java HashMap internals (treeification, resize strategy) and ConcurrentHashMap's segment locking, a dedicated follow-up post is planned.


Written by Abstract Algorithms (@abstractalgorithms)