1. The Core Idea

Imagine you're building a web crawler and want to avoid visiting the same URL twice. You could store all visited URLs in a HashSet — but with 10 billion URLs, that's ~640GB of memory. Alternatively, use a Bloom filter at ~12 bits per URL = ~15GB. The trade-off: 1% false positives (occasionally skip a URL you haven't visited), but zero false negatives (never revisit a URL you already crawled).

2. How a Bloom Filter Works

A Bloom filter consists of two components: a bit array of m bits (all initialized to 0) and k independent hash functions, each of which maps an element to one of the m positions in the array.

Bloom Filter: m=18 bits, k=3 hash functions (h1, h2, h3)

Initial state:
  [0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0]
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

INSERT "apple":
  h1("apple") = 1, h2("apple") = 5, h3("apple") = 13
  Set bits 1, 5, 13 → 1
  [0][1][0][0][0][1][0][0][0][0][0][0][0][1][0][0][0][0]

INSERT "mango":
  h1("mango") = 4, h2("mango") = 11, h3("mango") = 16
  Set bits 4, 11, 16 → 1
  [0][1][0][0][1][1][0][0][0][0][0][1][0][1][0][0][1][0]

QUERY "apple": check bits 1, 5, 13 → all 1 → MAYBE PRESENT ✓
QUERY "grape": check bits 2, 7, 14 → bit 2 = 0 → DEFINITELY NOT PRESENT ✓
QUERY "peach": check bits 1, 11, 16 → all 1 → MAYBE PRESENT (FALSE POSITIVE!)
               "peach" was never inserted, but all 3 bits happen to be 1

3. False Positive Rate and Sizing

The false positive probability increases as more elements are inserted. Two key formulas:

Bloom filter sizing formulas # Given: # n = expected number of elements to insert # p = desired false positive rate (e.g. 0.01 for 1%) # Optimal number of bits (m): m = -n × ln(p) / (ln(2))^2 m = -n × ln(p) / 0.4805 # simplified # Optimal number of hash functions (k): k = (m / n) × ln(2) k = (m / n) × 0.6931 # Example: 1 million URLs at 1% false positive rate # m = -1,000,000 × ln(0.01) / 0.4805 # m = -1,000,000 × (-4.605) / 0.4805 # m ≈ 9,585,058 bits ≈ 1.14 MB (vs ~64 MB for a HashSet!) # k = (9,585,058 / 1,000,000) × 0.693 ≈ 6.6 → round to 7 # Actual false positive rate with m bits, k functions, n elements: # FPR ≈ (1 - e^(-k×n/m))^k # Reference table: # FPR m/n (bits per element) k (hash functions) # 1% 9.6 bits 6-7 # 0.1% 14.4 bits 10 # 0.01% 19.2 bits 13

Memory Comparison: HashSet vs Bloom Filter

Storing 1 billion URLs in a HashSet: each URL ~50 bytes + 8-byte pointer = ~58GB. Storing the same in a Bloom filter at 1% FPR: 9.6 bits per URL = ~1.2GB — a 48x reduction. The trade-off is 1 in 100 queries may incorrectly say "seen before" for a URL you haven't visited. For a web crawler, this means occasionally missing 1% of URLs — acceptable for most use cases.

4. Implementation

Python — Bloom filter implementation import math import hashlib class BloomFilter: def __init__(self, n: int, p: float = 0.01): """ n: expected number of elements p: desired false positive rate """ self.m = math.ceil(-n * math.log(p) / (math.log(2) ** 2)) self.k = math.ceil((self.m / n) * math.log(2)) self.bits = bytearray(math.ceil(self.m / 8)) print(f"Bloom filter: {self.m} bits ({self.m/8/1024:.1f} KB), {self.k} hash functions") def _hashes(self, item: str): """Generate k hash positions using double hashing""" h1 = int(hashlib.md5(item.encode()).hexdigest(), 16) h2 = int(hashlib.sha1(item.encode()).hexdigest(), 16) for i in range(self.k): yield (h1 + i * h2) % self.m def add(self, item: str): for pos in self._hashes(item): byte_idx, bit_idx = divmod(pos, 8) self.bits[byte_idx] |= (1 << bit_idx) def contains(self, item: str) -> bool: """Returns False = DEFINITELY not present; True = PROBABLY present""" for pos in self._hashes(item): byte_idx, bit_idx = divmod(pos, 8) if not (self.bits[byte_idx] & (1 << bit_idx)): return False return True # Example: URL deduplication for a web crawler bloom = BloomFilter(n=1_000_000, p=0.01) # → 9,585,058 bits (1.14 MB), 7 hash functions bloom.add("https://example.com") bloom.add("https://google.com") bloom.contains("https://example.com") # True (definitely or probably seen) bloom.contains("https://bing.com") # False (DEFINITELY not seen)

5. Real-World Use Cases

SystemUse CaseWhat It AvoidsConsequence of False Positive
Chrome Safe BrowsingCheck if URL is potentially maliciousServer API call for safe URLsExtra server round-trip (minor latency)
Apache CassandraCheck if SSTable might contain a key before reading itDisk I/O on SSTables that don't have the keyRead an SSTable that doesn't have the row (minor overhead)
HBaseSkip disk lookups for non-existent rowsMultiple disk seeks per missing keyUnnecessary disk read
Web CrawlersURL deduplication — don't crawl same URL twiceHashSet storage of billions of URLsOccasionally skip an uncrawled URL
Password Breach DetectionIs this password in a known breach database?Sending raw password to server APIExtra k-anonymity API check (privacy safe)
CDN / Proxy CacheSkip DB lookup for content definitely not cachedCache miss lookup latency for uncached contentOccasional unnecessary cache miss check

6. The Counting Bloom Filter — Supporting Deletes

Standard Bloom filters don't support deletion. The Counting Bloom Filter extends each bit to a small counter (typically 4 bits, allowing counts 0–15). Insertion increments k counters; deletion decrements k counters. A position's effective "bit" is whether its counter is >0.

Standard Bloom Filter bit array:
  [0][1][0][1][1][0][1][0]  ← only set/unset

Counting Bloom Filter counter array:
  [0][2][0][1][1][0][3][0]  ← counts: can decrement

Insert "apple" → adds to counters at positions 1, 4, 6:
  counters: [0][3][0][1][2][0][4][0]

Delete "apple" → decrements counters at positions 1, 4, 6:
  counters: [0][2][0][1][1][0][3][0]  ← back to pre-insert state

Counter overflow (>15 for 4-bit) is a rare edge case — use saturation
(stop at 15, don't wrap) to prevent false negatives from overflow.

Watch Out — Deleting Items Never Added

If you attempt to delete an item from a Counting Bloom Filter that was never added, you will decrement counters that were set by other items — causing false negatives. Always ensure you only delete items you previously added. In practice, this means maintaining a separate authoritative set of what's in the filter (defeating some of the memory savings) or accepting that the application guarantees delete-only-if-inserted semantics.

7. Bloom Filter Variants

Several variants address specific limitations of the standard Bloom filter:

  • Scalable Bloom Filter: Grows dynamically by adding new standard filters when capacity is reached. Maintains a target FPR as n grows. Useful when the total number of elements is unknown upfront.
  • Partitioned Bloom Filter: Splits the m bits into k partitions, one per hash function. Improves cache locality during lookups. Used in hardware implementations.
  • Blocked Bloom Filter: Divides the bit array into blocks that fit in a cache line. A lookup reads only one cache line — significantly better cache performance than standard Bloom filter.
  • Cuckoo Filter: Supports deletion natively, has better space efficiency at low FPR (<3%), and faster lookups than Bloom filter. Uses cuckoo hashing instead of a bit array. Preferred over Counting Bloom Filter in most new designs.

How We Research and Update This Guide

We test the underlying formula or workflow, compare outputs with reliable references, and revise examples whenever the page content changes.

  • The workflow or formula is tested directly in the tool and compared against independent reference examples.
  • Examples are kept practical so readers can verify the result without hidden assumptions.
  • Pages are revised whenever the interface, calculation flow, or surrounding guidance materially changes.

Frequently Asked Questions — Bloom Filter