1. The Core Idea
Imagine you're building a web crawler and want to avoid visiting the same URL twice. You could store all visited URLs in a HashSet — but with 10 billion URLs, that's ~640GB of memory. Alternatively, use a Bloom filter at ~12 bits per URL = ~15GB. The trade-off: 1% false positives (occasionally skip a URL you haven't visited), but zero false negatives (never revisit a URL you already crawled).
2. How a Bloom Filter Works
A Bloom filter consists of two components: a bit array of m bits (all initialized to 0) and k independent hash functions, each of which maps an element to one of the m positions in the array.
Bloom Filter: m=18 bits, k=3 hash functions (h1, h2, h3)
Initial state:
[0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0]
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
INSERT "apple":
h1("apple") = 1, h2("apple") = 5, h3("apple") = 13
Set bits 1, 5, 13 → 1
[0][1][0][0][0][1][0][0][0][0][0][0][0][1][0][0][0][0]
INSERT "mango":
h1("mango") = 4, h2("mango") = 11, h3("mango") = 16
Set bits 4, 11, 16 → 1
[0][1][0][0][1][1][0][0][0][0][0][1][0][1][0][0][1][0]
QUERY "apple": check bits 1, 5, 13 → all 1 → MAYBE PRESENT ✓
QUERY "grape": check bits 2, 7, 14 → bit 2 = 0 → DEFINITELY NOT PRESENT ✓
QUERY "peach": check bits 1, 11, 16 → all 1 → MAYBE PRESENT (FALSE POSITIVE!)
"peach" was never inserted, but all 3 bits happen to be 1
3. False Positive Rate and Sizing
The false positive probability increases as more elements are inserted. Two key formulas:
Memory Comparison: HashSet vs Bloom Filter
Storing 1 billion URLs in a HashSet: each URL ~50 bytes + 8-byte pointer = ~58GB. Storing the same in a Bloom filter at 1% FPR: 9.6 bits per URL = ~1.2GB — a 48x reduction. The trade-off is 1 in 100 queries may incorrectly say "seen before" for a URL you haven't visited. For a web crawler, this means occasionally missing 1% of URLs — acceptable for most use cases.
4. Implementation
5. Real-World Use Cases
| System | Use Case | What It Avoids | Consequence of False Positive |
|---|---|---|---|
| Chrome Safe Browsing | Check if URL is potentially malicious | Server API call for safe URLs | Extra server round-trip (minor latency) |
| Apache Cassandra | Check if SSTable might contain a key before reading it | Disk I/O on SSTables that don't have the key | Read an SSTable that doesn't have the row (minor overhead) |
| HBase | Skip disk lookups for non-existent rows | Multiple disk seeks per missing key | Unnecessary disk read |
| Web Crawlers | URL deduplication — don't crawl same URL twice | HashSet storage of billions of URLs | Occasionally skip an uncrawled URL |
| Password Breach Detection | Is this password in a known breach database? | Sending raw password to server API | Extra k-anonymity API check (privacy safe) |
| CDN / Proxy Cache | Skip DB lookup for content definitely not cached | Cache miss lookup latency for uncached content | Occasional unnecessary cache miss check |
6. The Counting Bloom Filter — Supporting Deletes
Standard Bloom filters don't support deletion. The Counting Bloom Filter extends each bit to a small counter (typically 4 bits, allowing counts 0–15). Insertion increments k counters; deletion decrements k counters. A position's effective "bit" is whether its counter is >0.
Standard Bloom Filter bit array: [0][1][0][1][1][0][1][0] ← only set/unset Counting Bloom Filter counter array: [0][2][0][1][1][0][3][0] ← counts: can decrement Insert "apple" → adds to counters at positions 1, 4, 6: counters: [0][3][0][1][2][0][4][0] Delete "apple" → decrements counters at positions 1, 4, 6: counters: [0][2][0][1][1][0][3][0] ← back to pre-insert state Counter overflow (>15 for 4-bit) is a rare edge case — use saturation (stop at 15, don't wrap) to prevent false negatives from overflow.
Watch Out — Deleting Items Never Added
If you attempt to delete an item from a Counting Bloom Filter that was never added, you will decrement counters that were set by other items — causing false negatives. Always ensure you only delete items you previously added. In practice, this means maintaining a separate authoritative set of what's in the filter (defeating some of the memory savings) or accepting that the application guarantees delete-only-if-inserted semantics.
7. Bloom Filter Variants
Several variants address specific limitations of the standard Bloom filter:
- Scalable Bloom Filter: Grows dynamically by adding new standard filters when capacity is reached. Maintains a target FPR as n grows. Useful when the total number of elements is unknown upfront.
- Partitioned Bloom Filter: Splits the m bits into k partitions, one per hash function. Improves cache locality during lookups. Used in hardware implementations.
- Blocked Bloom Filter: Divides the bit array into blocks that fit in a cache line. A lookup reads only one cache line — significantly better cache performance than standard Bloom filter.
- Cuckoo Filter: Supports deletion natively, has better space efficiency at low FPR (<3%), and faster lookups than Bloom filter. Uses cuckoo hashing instead of a bit array. Preferred over Counting Bloom Filter in most new designs.
How We Research and Update This Guide
We test the underlying formula or workflow, compare outputs with reliable references, and revise examples whenever the page content changes.
- The workflow or formula is tested directly in the tool and compared against independent reference examples.
- Examples are kept practical so readers can verify the result without hidden assumptions.
- Pages are revised whenever the interface, calculation flow, or surrounding guidance materially changes.
Frequently Asked Questions — Bloom Filter
A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It uses a bit array of m bits (all initially 0) and k independent hash functions. To add an element: compute all k hash functions and set those k bit positions to 1. To query membership: compute all k hash positions — if ANY bit is 0, the element is definitely NOT in the set; if ALL bits are 1, the element is PROBABLY in the set (could be a false positive). The guarantee: false negatives are impossible. False positives are possible.
False negatives are impossible because adding an element always sets k bits to 1 — those bits stay 1 forever (you can only set bits, never clear them). So if you query for an element you previously added, all k bits will be 1 and the filter correctly says "possibly present." False positives occur when all k bit positions for an element you never added happen to be 1 — set to 1 by other elements that were added. As the filter fills up, more bits are 1, increasing the false positive rate. This is why you must size the filter appropriately for your expected number of elements.
The false positive probability (FPR) is approximately: FPR = (1 - e^(-kn/m))^k, where k = number of hash functions, n = number of elements inserted, m = number of bits. For optimal k = (m/n) × ln(2), this simplifies to FPR ≈ (0.6185)^(m/n). To achieve 1% FPR with n elements, you need m ≈ 9.6n bits. For 0.1% FPR, m ≈ 14.4n bits. The optimal number of hash functions k = (m/n) × ln(2) ≈ 0.693 × (m/n).
No — you cannot delete from a standard Bloom filter. Clearing a bit set by one element might also clear a bit shared by another element, causing false negatives (which violates the core guarantee). The Counting Bloom Filter solves this by replacing each bit with a small counter (e.g. 4-bit counter). Insert increments the k counters; delete decrements them. A counter reaching 0 is cleared. The downside: 4x more memory than a standard Bloom filter. Counting Bloom Filters are used in network routers for flow counting and cache eviction tracking.
Chrome's Safe Browsing feature maintains a local Bloom filter of malicious URLs. When you visit a URL, Chrome checks it against the local filter first — this takes microseconds and requires no network call. If the filter says "definitely not malicious" (no false negatives!), Chrome proceeds without any network check. Only if the filter returns "possibly malicious" does Chrome make a server call to verify — which actually touches the network for a small fraction of URLs. Without the Bloom filter, every URL visit would require a network round-trip to Google's Safe Browsing API.
Cassandra uses a Bloom filter per SSTable (immutable on-disk data file) to avoid unnecessary disk reads. SSTables are never modified after writing; Cassandra may have dozens of SSTables per table. When reading a row, Cassandra must check which SSTables might contain it. Without Bloom filters, it would have to read the index of every SSTable — many disk I/Os. With Bloom filters, each SSTable's filter is kept in memory. Cassandra checks all Bloom filters; only SSTables whose filter says "possibly contains this row" are actually read from disk. This reduces disk reads by 50–80% for most workloads.