What is a Bloom Filter — Probabilistic Data Structures Explained [2026]

Q: What is a Bloom filter and how does it work?

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It uses a bit array of m bits (all initially 0) and k independent hash functions. To add an element: compute all k hash functions and set those k bit positions to 1. To query membership: compute all k hash positions — if ANY bit is 0, the element is definitely NOT in the set; if ALL bits are 1, the element is PROBABLY in the set (could be a false positive). The guarantee: false negatives are impossible. False positives are possible.

Q: Why can a Bloom filter have false positives but not false negatives?

False negatives are impossible because adding an element always sets k bits to 1 — those bits stay 1 forever (you can only set bits, never clear them). So if you query for an element you previously added, all k bits will be 1 and the filter correctly says "possibly present." False positives occur when all k bit positions for an element you never added happen to be 1 — set to 1 by other elements that were added. As the filter fills up, more bits are 1, increasing the false positive rate. This is why you must size the filter appropriately for your expected number of elements.

Q: How do you calculate the false positive rate of a Bloom filter?

The false positive probability (FPR) is approximately: FPR = (1 - e^(-kn/m))^k, where k = number of hash functions, n = number of elements inserted, m = number of bits. For optimal k = (m/n) × ln(2), this simplifies to FPR ≈ (0.6185)^(m/n). To achieve 1% FPR with n elements, you need m ≈ 9.6n bits. For 0.1% FPR, m ≈ 14.4n bits. The optimal number of hash functions k = (m/n) × ln(2) ≈ 0.693 × (m/n).

Q: Can you delete elements from a Bloom filter?

No — you cannot delete from a standard Bloom filter. Clearing a bit set by one element might also clear a bit shared by another element, causing false negatives (which violates the core guarantee). The Counting Bloom Filter solves this by replacing each bit with a small counter (e.g. 4-bit counter). Insert increments the k counters; delete decrements them. A counter reaching 0 is cleared. The downside: 4x more memory than a standard Bloom filter. Counting Bloom Filters are used in network routers for flow counting and cache eviction tracking.

Q: How does Google Chrome use a Bloom filter?

Chrome's Safe Browsing feature maintains a local Bloom filter of malicious URLs. When you visit a URL, Chrome checks it against the local filter first — this takes microseconds and requires no network call. If the filter says "definitely not malicious" (no false negatives!), Chrome proceeds without any network check. Only if the filter returns "possibly malicious" does Chrome make a server call to verify — which actually touches the network for a small fraction of URLs. Without the Bloom filter, every URL visit would require a network round-trip to Google's Safe Browsing API.

Q: How does Cassandra use Bloom filters?

Cassandra uses a Bloom filter per SSTable (immutable on-disk data file) to avoid unnecessary disk reads. SSTables are never modified after writing; Cassandra may have dozens of SSTables per table. When reading a row, Cassandra must check which SSTables might contain it. Without Bloom filters, it would have to read the index of every SSTable — many disk I/Os. With Bloom filters, each SSTable's filter is kept in memory. Cassandra checks all Bloom filters; only SSTables whose filter says "possibly contains this row" are actually read from disk. This reduces disk reads by 50–80% for most workloads.

1. The Core Idea

Imagine you're building a web crawler and want to avoid visiting the same URL twice. You could store all visited URLs in a HashSet — but with 10 billion URLs, that's ~640GB of memory. Alternatively, use a Bloom filter at ~12 bits per URL = ~15GB. The trade-off: 1% false positives (occasionally skip a URL you haven't visited), but zero false negatives (never revisit a URL you already crawled).

2. How a Bloom Filter Works

A Bloom filter consists of two components: a bit array of m bits (all initialized to 0) and k independent hash functions, each of which maps an element to one of the m positions in the array.

Bloom Filter: m=18 bits, k=3 hash functions (h1, h2, h3)

Initial state:
  [0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0]
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

INSERT "apple":
  h1("apple") = 1, h2("apple") = 5, h3("apple") = 13
  Set bits 1, 5, 13 → 1
  [0][1][0][0][0][1][0][0][0][0][0][0][0][1][0][0][0][0]

INSERT "mango":
  h1("mango") = 4, h2("mango") = 11, h3("mango") = 16
  Set bits 4, 11, 16 → 1
  [0][1][0][0][1][1][0][0][0][0][0][1][0][1][0][0][1][0]

QUERY "apple": check bits 1, 5, 13 → all 1 → MAYBE PRESENT ✓
QUERY "grape": check bits 2, 7, 14 → bit 2 = 0 → DEFINITELY NOT PRESENT ✓
QUERY "peach": check bits 1, 11, 16 → all 1 → MAYBE PRESENT (FALSE POSITIVE!)
               "peach" was never inserted, but all 3 bits happen to be 1

3. False Positive Rate and Sizing

The false positive probability increases as more elements are inserted. Two key formulas:

Bloom filter sizing formulas # Given: # n = expected number of elements to insert # p = desired false positive rate (e.g. 0.01 for 1%) # Optimal number of bits (m): m = -n × ln(p) / (ln(2))^2 m = -n × ln(p) / 0.4805 # simplified # Optimal number of hash functions (k): k = (m / n) × ln(2) k = (m / n) × 0.6931 # Example: 1 million URLs at 1% false positive rate # m = -1,000,000 × ln(0.01) / 0.4805 # m = -1,000,000 × (-4.605) / 0.4805 # m ≈ 9,585,058 bits ≈ 1.14 MB (vs ~64 MB for a HashSet!) # k = (9,585,058 / 1,000,000) × 0.693 ≈ 6.6 → round to 7 # Actual false positive rate with m bits, k functions, n elements: # FPR ≈ (1 - e^(-k×n/m))^k # Reference table: # FPR m/n (bits per element) k (hash functions) # 1% 9.6 bits 6-7 # 0.1% 14.4 bits 10 # 0.01% 19.2 bits 13

Memory Comparison: HashSet vs Bloom Filter

Storing 1 billion URLs in a HashSet: each URL ~50 bytes + 8-byte pointer = ~58GB. Storing the same in a Bloom filter at 1% FPR: 9.6 bits per URL = ~1.2GB — a 48x reduction. The trade-off is 1 in 100 queries may incorrectly say "seen before" for a URL you haven't visited. For a web crawler, this means occasionally missing 1% of URLs — acceptable for most use cases.

4. Implementation

Python — Bloom filter implementation import math import hashlib class BloomFilter: def __init__(self, n: int, p: float = 0.01): """ n: expected number of elements p: desired false positive rate """ self.m = math.ceil(-n * math.log(p) / (math.log(2) ** 2)) self.k = math.ceil((self.m / n) * math.log(2)) self.bits = bytearray(math.ceil(self.m / 8)) print(f"Bloom filter: {self.m} bits ({self.m/8/1024:.1f} KB), {self.k} hash functions") def _hashes(self, item: str): """Generate k hash positions using double hashing""" h1 = int(hashlib.md5(item.encode()).hexdigest(), 16) h2 = int(hashlib.sha1(item.encode()).hexdigest(), 16) for i in range(self.k): yield (h1 + i * h2) % self.m def add(self, item: str): for pos in self._hashes(item): byte_idx, bit_idx = divmod(pos, 8) self.bits[byte_idx] |= (1 << bit_idx) def contains(self, item: str) -> bool: """Returns False = DEFINITELY not present; True = PROBABLY present""" for pos in self._hashes(item): byte_idx, bit_idx = divmod(pos, 8) if not (self.bits[byte_idx] & (1 << bit_idx)): return False return True # Example: URL deduplication for a web crawler bloom = BloomFilter(n=1_000_000, p=0.01) # → 9,585,058 bits (1.14 MB), 7 hash functions bloom.add("https://example.com") bloom.add("https://google.com") bloom.contains("https://example.com") # True (definitely or probably seen) bloom.contains("https://bing.com") # False (DEFINITELY not seen)

5. Real-World Use Cases

System	Use Case	What It Avoids	Consequence of False Positive
Chrome Safe Browsing	Check if URL is potentially malicious	Server API call for safe URLs	Extra server round-trip (minor latency)
Apache Cassandra	Check if SSTable might contain a key before reading it	Disk I/O on SSTables that don't have the key	Read an SSTable that doesn't have the row (minor overhead)
HBase	Skip disk lookups for non-existent rows	Multiple disk seeks per missing key	Unnecessary disk read
Web Crawlers	URL deduplication — don't crawl same URL twice	HashSet storage of billions of URLs	Occasionally skip an uncrawled URL
Password Breach Detection	Is this password in a known breach database?	Sending raw password to server API	Extra k-anonymity API check (privacy safe)
CDN / Proxy Cache	Skip DB lookup for content definitely not cached	Cache miss lookup latency for uncached content	Occasional unnecessary cache miss check

6. The Counting Bloom Filter — Supporting Deletes

Standard Bloom filters don't support deletion. The Counting Bloom Filter extends each bit to a small counter (typically 4 bits, allowing counts 0–15). Insertion increments k counters; deletion decrements k counters. A position's effective "bit" is whether its counter is >0.

Standard Bloom Filter bit array:
  [0][1][0][1][1][0][1][0]  ← only set/unset

Counting Bloom Filter counter array:
  [0][2][0][1][1][0][3][0]  ← counts: can decrement

Insert "apple" → adds to counters at positions 1, 4, 6:
  counters: [0][3][0][1][2][0][4][0]

Delete "apple" → decrements counters at positions 1, 4, 6:
  counters: [0][2][0][1][1][0][3][0]  ← back to pre-insert state

Counter overflow (>15 for 4-bit) is a rare edge case — use saturation
(stop at 15, don't wrap) to prevent false negatives from overflow.

Watch Out — Deleting Items Never Added

If you attempt to delete an item from a Counting Bloom Filter that was never added, you will decrement counters that were set by other items — causing false negatives. Always ensure you only delete items you previously added. In practice, this means maintaining a separate authoritative set of what's in the filter (defeating some of the memory savings) or accepting that the application guarantees delete-only-if-inserted semantics.

7. Bloom Filter Variants

Several variants address specific limitations of the standard Bloom filter:

Scalable Bloom Filter: Grows dynamically by adding new standard filters when capacity is reached. Maintains a target FPR as n grows. Useful when the total number of elements is unknown upfront.
Partitioned Bloom Filter: Splits the m bits into k partitions, one per hash function. Improves cache locality during lookups. Used in hardware implementations.
Blocked Bloom Filter: Divides the bit array into blocks that fit in a cache line. A lookup reads only one cache line — significantly better cache performance than standard Bloom filter.
Cuckoo Filter: Supports deletion natively, has better space efficiency at low FPR (<3%), and faster lookups than Bloom filter. Uses cuckoo hashing instead of a bit array. Preferred over Counting Bloom Filter in most new designs.

How We Research and Update This Guide

We test the underlying formula or workflow, compare outputs with reliable references, and revise examples whenever the page content changes.

The workflow or formula is tested directly in the tool and compared against independent reference examples.
Examples are kept practical so readers can verify the result without hidden assumptions.
Pages are revised whenever the interface, calculation flow, or surrounding guidance materially changes.

Frequently Asked Questions — Bloom Filter

What is a Bloom filter and how does it work?

Why can a Bloom filter have false positives but not false negatives?

How do you calculate the false positive rate of a Bloom filter?

Can you delete elements from a Bloom filter?

How does Google Chrome use a Bloom filter?

How does Cassandra use Bloom filters?