What are Hash Tables? Basics Explained
TLDR: A hash table gives you near-O(1) lookups, inserts, and deletes by using a hash function to map keys to array indices. The trade-off: collisions (two keys hashing to the same slot) must be handled, and a table that grows too full must be resized.
📖 The Magic Mailroom Stamp
Imagine a mailroom with 100 drawers. A magic stamp converts any employee name to a number 1–100. Want John's mail? Stamp "John" → you get 42 → open Drawer 42. No search needed.
That stamp is a hash function. The row of drawers is the array (the backing store). The combination is a hash table.
🔍 Hash Table Fundamentals
A hash table combines two simple ideas into a data structure with near-constant-time performance:
- An array provides O(1) access by index: `array[42]` is instant.
- A hash function converts any key (string, integer, object) to an index: `h("john") → 42`.
Together: given any key, compute its index in one step, then access the array slot directly. No iteration. No comparison loop.
The three fundamental operations and their complexity:
| Operation | Average case | Worst case | Why worst case occurs |
| --- | --- | --- | --- |
| Lookup | O(1) | O(N) | All keys hash to the same slot |
| Insert | O(1) amortized | O(N) | Rehash triggered when the table fills |
| Delete | O(1) | O(N) | All keys collide into the same bucket |
Worst case requires a pathologically bad hash function (or deliberate hash flooding). In practice, well-designed hash functions achieve average O(1) for all three operations.
What makes a good hash function?
- Deterministic: Same key always maps to the same slot.
- Uniform: Keys distribute evenly across all slots to minimize collisions.
- Fast: Must compute in O(1) — the hash computation cannot become the bottleneck.
- Avalanche effect: Similar keys produce very different hash values (prevents clustering).
🔢 Deep Dive: Array + Hash Function
```mermaid
flowchart LR
    Key["Key: 'cat'"] --> HashFn["Hash Function\nh('cat') = 2"]
    HashFn --> Array["Array\n[0][1][2: 'cat→meow'][3]..."]
    Array --> Value["Value: 'meow'"]
```
Simple hash function (NOT production-quality):

```python
def simple_hash(key: str, size: int) -> int:
    return sum(ord(c) for c in key) % size
```
Python's built-in `hash()` uses SipHash (a keyed, flood-resistant hash) for strings and bytes, so an attacker cannot predict which keys collide and mount a hash-flooding attack.
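Uniformity can be checked empirically: bucket many keys and compare how evenly the character-sum hash and Python's built-in hash spread them (a rough sketch; the key set and table size are arbitrary choices for this demo):

```python
from collections import Counter

def simple_hash(key: str, size: int) -> int:
    # Character-sum hash: deterministic and fast, but similar keys cluster.
    return sum(ord(c) for c in key) % size

SIZE = 16
keys = [f"user{i}" for i in range(1000)]

simple_counts = Counter(simple_hash(k, SIZE) for k in keys)
builtin_counts = Counter(hash(k) % SIZE for k in keys)

# A perfectly uniform spread puts 1000/16 = 62.5 keys in each slot;
# the built-in hash should come much closer to that than the char-sum hash.
print("char-sum max bucket:", max(simple_counts.values()))
print("built-in max bucket:", max(builtin_counts.values()))
```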
📊 How a Hash Table Handles a Lookup
The lookup flow for key "alice" in a hash table with separate chaining:
```mermaid
flowchart TD
    K["Key: 'alice'"] --> HF["Hash Function: h('alice') = 7"]
    HF --> IDX["Array index 7"]
    IDX --> BK["Bucket 7: [('alice', 25), ('charlie', 30)]"]
    BK --> SCAN{"Scan bucket: key == 'alice'?"}
    SCAN -->|Match found| RET["Return value: 25"]
    SCAN -->|No match| MISS["Return None / KeyError"]
    style RET fill:#90EE90
    style MISS fill:#FFB6C1
```
Three scenarios during lookup:
- Direct hit: `array[index]` contains exactly the target key → O(1).
- Collision chain: `array[index]` has multiple entries → scan the short chain → O(k) where k = bucket length (typically 1–2 with a good load factor).
- Miss: The slot is empty → key not in table → O(1).
For a table with load factor 0.75 and a good hash function, average bucket length is < 2, making even chained lookups extremely fast in practice.
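The lookup flow above can be sketched with a toy chained table (a minimal illustration; `TABLE_SIZE`, `buckets`, and the `h` helper are invented for this example):

```python
TABLE_SIZE = 8

def h(key: str) -> int:
    # Toy hash: sum of character codes modulo the table size.
    return sum(ord(c) for c in key) % TABLE_SIZE

# Each slot holds a list (bucket) of (key, value) pairs.
buckets = [[] for _ in range(TABLE_SIZE)]

def put(key: str, value) -> None:
    bucket = buckets[h(key)]
    for i, (k, _) in enumerate(bucket):
        if k == key:                 # key already present: update in place
            bucket[i] = (key, value)
            return
    bucket.append((key, value))      # new key: append to the chain

def get(key: str):
    for k, v in buckets[h(key)]:     # scan only this one short bucket
        if k == key:
            return v
    return None                      # miss: key not in table

put("alice", 25)
put("charlie", 30)
print(get("alice"))   # 25
print(get("dave"))    # None
```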
⚙️ Collisions: When Two Keys Land in the Same Slot
No hash function is perfect. Given a finite array, two different keys will eventually map to the same slot.
Example:
- `hash('cat') = 2` → stored at index 2.
- `hash('bird') = 2` → collision at index 2!
Two collision resolution strategies:
Separate Chaining
Each slot holds a linked list of key-value pairs. On collision, append to the list.
```
index 2: [('cat', 'meow') → ('bird', 'tweet')]
```
O(1) average, O(n) worst case if all keys hash to the same slot.
Open Addressing (Linear Probing)
No linked lists. On collision, step forward until you find an empty slot.
```
hash('cat') = 2  → store at 2
hash('bird') = 2 (occupied) → probe 3 → store at 3
```
Pros: Better cache performance (all data in one array). Cons: Clustering — many consecutive full slots slow probing.
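A minimal open-addressing sketch of that probe sequence (illustrative only: fixed-size table, no resizing or deletion, and the `h` helper is invented for the example):

```python
SIZE = 8
slots = [None] * SIZE   # each slot holds a (key, value) pair or None

def h(key: str) -> int:
    return sum(ord(c) for c in key) % SIZE

def put(key: str, value) -> None:
    idx = h(key)
    for step in range(SIZE):            # probe at most SIZE slots
        probe = (idx + step) % SIZE     # linear probing: step forward by 1
        if slots[probe] is None or slots[probe][0] == key:
            slots[probe] = (key, value)
            return
    raise RuntimeError("table full")

def get(key: str):
    idx = h(key)
    for step in range(SIZE):
        probe = (idx + step) % SIZE
        if slots[probe] is None:        # empty slot ends the probe chain
            return None
        if slots[probe][0] == key:
            return slots[probe][1]
    return None

put("cat", "meow")
put("act", "verb")   # 'act' has the same character sum as 'cat': collision
print(get("act"))    # 'verb', found by probing past 'cat'
```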
| Strategy | Space | Lookup (avg) | Cache Friendly | Used In |
| --- | --- | --- | --- | --- |
| Separate Chaining | Extra memory for lists | O(1) | No | Java HashMap (buckets treeify in JDK 8+) |
| Linear Probing | Compact array | O(1) | Yes | Python dict (open addressing with perturbed probing) |
| Quadratic Probing | Compact array | O(1) | Partial | Some open-addressing tables |
| Robin Hood Hashing | Compact array | O(1) | Yes | Rust HashMap (before the hashbrown rewrite) |
📊 Open Addressing: Collision Probe Sequence
```mermaid
sequenceDiagram
    participant K as Key "bird"
    participant HF as Hash Function
    participant AR as Array
    K->>HF: h("bird") = 2
    HF->>AR: check slot 2
    AR-->>HF: slot 2 occupied by "cat"
    HF->>AR: probe slot 3 (linear)
    AR-->>HF: slot 3 empty
    HF-->>K: insert "bird" at slot 3
    Note over AR: No linked list needed
```
🧠 Deep Dive: Load Factor and Rehashing
Load factor = (number of entries) / (array capacity).
As the table fills up (load factor → 1.0), collisions increase and performance degrades.
Standard practice: rehash when load factor exceeds 0.75.
Rehashing:
- Create a new array 2× the size.
- Re-insert every existing key into the new array (re-hash all keys).
- The old array is garbage collected.

Cost: O(N) for one rehash. But because capacity doubles each time, rehashes occur at geometrically growing intervals, so the amortized cost per insert is O(1).
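A quick way to see the amortized O(1) claim: simulate doubling and count the total re-insertion work across many inserts (a toy simulation; `total_rehash_work` and its parameters are invented for this illustration):

```python
def total_rehash_work(n_inserts: int, initial_capacity: int = 8,
                      max_load: float = 0.75) -> int:
    """Count how many re-insertions doubling-based rehashes cost in total."""
    capacity, entries, work = initial_capacity, 0, 0
    for _ in range(n_inserts):
        if (entries + 1) / capacity > max_load:
            work += entries        # rehash: re-insert every existing entry
            capacity *= 2
        entries += 1
    return work

n = 1_000_000
w = total_rehash_work(n)
# Total rehash work stays a small constant multiple of n,
# so the per-insert overhead is O(1) amortized.
print(w, w / n)
```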
📊 Load Factor Decision: Keep or Resize
```mermaid
flowchart TD
    A[Insert new key] --> B{Load factor > 0.75?}
    B -- No --> C[Insert into current array]
    C --> D["O(1) average"]
    B -- Yes --> E[Allocate 2x array]
    E --> F[Re-hash all existing keys]
    F --> G[Insert new key]
    G --> H[Old array GC'd]
    H --> I["Amortized O(1) per insert"]
```
⚙️ Why Python Dicts Are Fast
Python dict uses open addressing with compact arrays (since Python 3.6). Entries are stored in an incrementally-growing compact array indexed by insertion order, with a separate hash table of indices pointing into it. This gives:
- Fast iteration (compact memory, no pointer chasing).
- Memory-efficient (no linked list overhead).
- Dict guaranteed ordered by insertion order.
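Insertion-order preservation is easy to observe (an implementation detail in CPython 3.6, guaranteed by the language since Python 3.7):

```python
d = {}
for key in ["zebra", "apple", "mango"]:
    d[key] = len(key)

# Iteration follows insertion order, not alphabetical or hash order.
print(list(d))   # ['zebra', 'apple', 'mango']
```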
⚖️ Trade-offs, Failure Modes & Decision Guide: Hash Tables vs Others
| Operation | Hash Table | BST (e.g., TreeMap) | Array (sorted) |
| --- | --- | --- | --- |
| Lookup by key | O(1) avg | O(log n) | O(log n) binary search |
| Insert | O(1) avg | O(log n) | O(n) shift |
| Delete | O(1) avg | O(log n) | O(n) shift |
| Range query | ❌ Not supported | ✅ O(log n + k) | ✅ O(log n + k) |
| Ordered iteration | ❌ No order | ✅ In-order | ✅ Sorted |
Choose hash table when: You need fast key-value lookup and don't need ordering or range queries. Choose BST when: You need sorted iteration or range queries (e.g., "find all keys between 10 and 50").
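The range-query gap is easy to demonstrate: a dict offers no efficient "all keys between 10 and 50", while a sorted array supports it with two binary searches via the stdlib `bisect` module (a sketch, not a full TreeMap replacement):

```python
import bisect

keys = sorted([3, 17, 25, 42, 58, 71])   # sorted array

def range_query(sorted_keys, lo, hi):
    """Return all keys in [lo, hi]: two binary searches plus a slice."""
    left = bisect.bisect_left(sorted_keys, lo)
    right = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[left:right]        # O(log n + k)

print(range_query(keys, 10, 50))   # [17, 25, 42]

# With a plain dict, the only option is a full O(n) scan of every key:
d = {k: None for k in keys}
print(sorted(k for k in d if 10 <= k <= 50))   # [17, 25, 42]
```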
🌍 Real-World Application: Hash Tables in Production
Hash tables are one of the most widely used data structures in existence:
| Application | Hash table used | Why |
| --- | --- | --- |
| Programming language runtimes | Python dict, JavaScript object, Java HashMap | Core key-value primitive for all dynamic lookups |
| Database query execution | Hash join, hash aggregate | Join two tables by hashing the join key for O(1) match |
| Caches (Redis, Memcached) | Hash table over key-value pairs | O(1) cache hit lookup by key |
| DNS resolution | Cached DNS records | Map hostname → IP in O(1) after first resolution |
| Compiler symbol tables | Variable name → type/location | O(1) lookup during type-checking and code generation |
| Counting frequencies | Word count, histogram | count[word] += 1 pattern — O(1) per word |
| Deduplication | Seen-set during ETL | O(1) check if record was already processed |
In coding interviews, hash tables solve a huge class of problems in O(N):
- Two Sum / Four Sum (O(N) with hash lookup vs O(N²) brute force)
- Anagram detection (frequency map comparison)
- Longest subarray with constraint (sliding window + hash)
- Graph adjacency representation (adjacency list as hash map)
🧪 Practical: Building a Frequency Counter
The most common hash table interview pattern: count occurrences and use them to answer queries.
Problem: Given two strings, determine if one is an anagram of the other in O(N) time.
```python
def is_anagram(s: str, t: str) -> bool:
    if len(s) != len(t):
        return False
    freq = {}
    # Count characters in s
    for char in s:
        freq[char] = freq.get(char, 0) + 1
    # Subtract characters in t
    for char in t:
        if char not in freq or freq[char] == 0:
            return False
        freq[char] -= 1
    return True

# Examples
print(is_anagram("anagram", "nagaram"))  # True
print(is_anagram("rat", "car"))          # False
```
Why this is O(N): Building the frequency map is one pass over s (O(N)). Validating against t is one pass over t (O(N)). Total: O(2N) = O(N). No sorting required (sorting would be O(N log N)).
Python shortcut:
```python
from collections import Counter

def is_anagram_v2(s: str, t: str) -> bool:
    return Counter(s) == Counter(t)
```
collections.Counter is Python's built-in frequency hash table — a dict subclass with the same O(1) insert and lookup.
📚 What Hash Tables Teach You About Trade-offs
- O(1) average is not O(1) guaranteed. Every hash table has a worst case. Security-sensitive systems (e.g., web servers hashing attacker-controlled keys) should use keyed, flood-resistant hash functions such as SipHash to prevent deliberate hash-flooding attacks that force O(N) collision chains.
- Space and time have a direct relationship. A larger backing array → fewer collisions → faster lookups. Load factor is the knob that controls this trade-off.
- Ordering is sacrificed for speed. A hash table scrambles keys by design. If you need to iterate keys in sorted order, you need a BST or a sorted array instead.
- Rehashing is amortized, not free. Individual rehash events cost O(N). Latency-sensitive systems sometimes pre-size hash tables to avoid mid-operation rehashing and jitter.
📌 TLDR: Summary & Key Takeaways
- Hash function maps key → array index in O(1). Must be deterministic and uniform.
- Collisions are inevitable; Separate Chaining uses linked lists; Linear Probing uses the array itself.
- Rehash at load factor 0.75 to keep average O(1); amortized cost is O(1) per insert.
- Hash tables have no order — use a BST when you need sorted iteration or range queries.
📝 Practice Quiz
What is the primary purpose of a hash function in a hash table?
- A) To encrypt the key for security and privacy.
- B) To convert a key to an integer array index in O(1) time, enabling direct-access storage.
- C) To sort the keys in the table for binary search.
- D) To compress values to save memory.

Correct Answer: B — The hash function maps an arbitrary key to an array index. This enables direct array access (O(1)) rather than a sequential search (O(N)).
A hash table has 100 slots and 90 entries. What is the load factor, and what should happen next?
- A) Load factor = 0.9; rehash now (exceeds 0.75 threshold) — create a 200-slot table and re-insert all entries.
- B) Load factor = 90; nothing — the table still has 10 empty slots.
- C) Load factor = 0.9; delete the 15 least-recently-used entries to stay under capacity.
- D) Load factor = 0.09; the table is mostly empty and performs optimally.

Correct Answer: A — Load factor = entries / capacity = 90/100 = 0.9. This exceeds the 0.75 threshold. Standard action: allocate a 2× array (200 slots) and re-insert all existing entries with fresh hash computations.
When would you choose a Binary Search Tree (BST) over a Hash Table?
- A) When you need O(1) average-case lookup performance.
- B) When you need sorted iteration or range queries (e.g., "all keys between 50 and 100").
- C) When your dataset is too small to justify a hash table's overhead.
- D) When you want to avoid collisions entirely.

Correct Answer: B — Hash tables have no inherent ordering. For sorted iteration or range queries (critical in databases), a BST provides O(log N) ordered traversal and O(log N + k) range queries that hash tables cannot support.
🛠️ Java HashMap & ConcurrentHashMap: Hash Table Mechanics Under the Hood
Java's HashMap is a production-quality hash table that uses separate chaining internally (a linked list that upgrades to a red-black tree once a bucket reaches length ≥ 8 and the table has at least 64 slots) and automatically rehashes at a 0.75 load factor — exactly the mechanics described in the collision resolution and rehashing sections above.
ConcurrentHashMap extends the same design for multithreaded environments: instead of locking the entire map, it uses CAS operations on empty bucket heads and fine-grained locking on individual buckets, so writers to different buckets rarely contend.
```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HashTableDemo {

    // ── Custom hash table with separate chaining (illustrative) ──────────────
    static class ChainedHashTable<K, V> {
        private static final int SIZE = 16;

        @SuppressWarnings("unchecked")
        private final java.util.LinkedList<Map.Entry<K, V>>[] buckets =
                new java.util.LinkedList[SIZE];

        public void put(K key, V value) {
            int idx = Math.abs(key.hashCode() % SIZE);
            if (buckets[idx] == null) buckets[idx] = new java.util.LinkedList<>();
            buckets[idx].removeIf(e -> e.getKey().equals(key)); // update existing
            buckets[idx].add(Map.entry(key, value));
        }

        public V get(K key) {
            int idx = Math.abs(key.hashCode() % SIZE);
            if (buckets[idx] == null) return null;
            return buckets[idx].stream()
                    .filter(e -> e.getKey().equals(key))
                    .map(Map.Entry::getValue)
                    .findFirst().orElse(null);
        }
    }

    public static void main(String[] args) {
        // ── HashMap: O(1) average — word frequency counter (mirrors 🧪 section)
        Map<String, Integer> freq = new HashMap<>();
        for (String word : new String[]{"apple", "banana", "apple", "cherry", "apple"}) {
            freq.merge(word, 1, Integer::sum); // put-or-update in one call
        }
        System.out.println(freq); // {apple=3, banana=1, cherry=1}

        // ── ConcurrentHashMap: thread-safe, no full-map lock ──────────────────
        ConcurrentHashMap<String, Integer> concurrent = new ConcurrentHashMap<>();
        concurrent.merge("apple", 1, Integer::sum); // atomic increment
        System.out.println("ConcurrentHashMap: " + concurrent);
    }
}
```
| Class | Thread-safe | Null keys/values | Ordered | Use when |
| --- | --- | --- | --- | --- |
| HashMap | ❌ | ✅ / ✅ | ❌ | Single-threaded or externally synchronized |
| LinkedHashMap | ❌ | ✅ / ✅ | Insertion order | Need insertion-order iteration (e.g., LRU cache) |
| TreeMap | ❌ | ❌ / ✅ | Sorted by key | Need sorted keys or range queries (red-black tree internally) |
| ConcurrentHashMap | ✅ | ❌ / ❌ | ❌ | High-concurrency, write-heavy, multi-threaded |
For a full deep-dive on Java HashMap internals (treeification, resize strategy) and ConcurrentHashMap's segment locking, a dedicated follow-up post is planned.
Written by
Abstract Algorithms
@abstractalgorithms