Bloom Filter #8615
Merged
Changes from 29 commits
Commits
32 commits
173ab0e  Bloom filter with tests (isidroas)
08bc970  has functions constant (isidroas)
0448109  fix type (isidroas)
486dcbc  isort (isidroas)
4111807  passing ruff (isidroas)
e6ce098  type hints (isidroas)
e4d39db  type hints (isidroas)
7629686  from fail to erro (isidroas)
3926167  captital leter (isidroas)
280ffa0  type hints requested by boot (isidroas)
5d460aa  descriptive name for m (isidroas)
cc54095  more descriptibe arguments II (isidroas)
78d19fd  moved movies_test to doctest (isidroas)
8b1bec0  commented doctest (isidroas)
28e6691  removed test_probability (isidroas)
2fd7196  estimated error (isidroas)
314237d  added types (isidroas)
9b01472  again hash_ (isidroas)
c132d50  Update data_structures/hashing/bloom_filter.py (isidroas)
313c80c  from b to bloom (isidroas)
18e0dde  Update data_structures/hashing/bloom_filter.py (isidroas)
54041ff  Update data_structures/hashing/bloom_filter.py (isidroas)
483a2a0  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
174ce08  syntax error in dict comprehension (isidroas)
00cc60e  from goodfather to godfather (isidroas)
35fa5f5  removed Interestellar (isidroas)
5cd20ea  forgot the last Godfather (isidroas)
7617143  Revert "removed Interestellar" (isidroas)
799171a  pretty dict (isidroas)
1a71f4c  Apply suggestions from code review (cclauss)
4e0263f  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
e746746  Update bloom_filter.py (cclauss)
@@ -0,0 +1,102 @@
"""
See https://en.wikipedia.org/wiki/Bloom_filter

The use of this data structure is to test membership in a set.
Compared to Python's built-in set() it is more space-efficient.
In the following example, only 8 bits of memory will be used:
>>> bloom = Bloom(size=8)

Initially, the filter contains all zeros:
>>> bloom.bitstring
'00000000'

When an element is added, two bits are set to 1
since there are 2 hash functions in this implementation:
>>> "Titanic" in bloom
False
>>> bloom.add("Titanic")
>>> bloom.bitstring
'01100000'
>>> "Titanic" in bloom
True

However, sometimes only one bit is added
because both hash functions return the same value
>>> bloom.add("Avatar")
>>> bloom.format_hash("Avatar")
'00000100'
>>> bloom.bitstring
'01100100'

Not added elements should return False ...
>>> not_present_films = ("The Godfather", "Interstellar", "Parasite", "Pulp Fiction")
>>> {
...   film: bloom.format_hash(film)
...   for film in not_present_films
... } # doctest: +NORMALIZE_WHITESPACE
{'The Godfather': '00000101',
 'Interstellar': '00000011',
 'Parasite': '00010010',
 'Pulp Fiction': '10000100'}
>>> any(film in bloom for film in not_present_films)
False

but sometimes there are false positives:
>>> "Ratatouille" in bloom
True
>>> bloom.format_hash("Ratatouille")
'01100000'

The probability increases with the number of added elements
>>> bloom.estimated_error_rate()
0.140625
>>> bloom.add("The Godfather")
>>> bloom.estimated_error_rate()
0.25
>>> bloom.bitstring
'01100101'
"""
from hashlib import md5, sha256

HASH_FUNCTIONS = (sha256, md5)


class Bloom:
    def __init__(self, size: int = 8) -> None:
        self.bitarray = 0b0
        self.size = size

    def add(self, value: str) -> None:
        h = self.hash_(value)
        self.bitarray |= h

    def exists(self, value: str) -> bool:
        h = self.hash_(value)
        return (h & self.bitarray) == h

    def __contains__(self, other: str) -> bool:
        return self.exists(other)

    def format_bin(self, bitarray: int) -> str:
        res = bin(bitarray)[2:]
        return res.zfill(self.size)

    @property
    def bitstring(self) -> str:
        return self.format_bin(self.bitarray)

    def hash_(self, value: str) -> int:
        res = 0b0
        for func in HASH_FUNCTIONS:
            b = func(value.encode()).digest()
            position = int.from_bytes(b, "little") % self.size
            res |= 2**position
        return res

    def format_hash(self, value: str) -> str:
        return self.format_bin(self.hash_(value))

    def estimated_error_rate(self) -> float:
        n_ones = bin(self.bitarray).count("1")
        k = len(HASH_FUNCTIONS)
        return (n_ones / self.size) ** k
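The docstring notes that an element normally sets two bits, but only one when both hash functions land on the same position. A minimal standalone sketch of that step may make it easier to see: `bit_positions` and `to_mask` are hypothetical helpers written for illustration and are not part of this PR; they only mirror the arithmetic in `hash_` (digest read as a little-endian integer, reduced modulo the filter size).

```python
from hashlib import md5, sha256

# Same two hash functions as HASH_FUNCTIONS in the diff.
HASH_FUNCTIONS = (sha256, md5)


def bit_positions(value: str, size: int = 8) -> list[int]:
    """Hypothetical helper (not in the PR): one bit index per hash function."""
    return [
        int.from_bytes(func(value.encode()).digest(), "little") % size
        for func in HASH_FUNCTIONS
    ]


def to_mask(positions: list[int]) -> int:
    """OR the selected bit indices into a single integer mask."""
    mask = 0
    for position in positions:
        mask |= 1 << position
    return mask


if __name__ == "__main__":
    for film in ("Titanic", "Avatar"):
        positions = bit_positions(film)
        # Two equal positions collapse into a single set bit in the mask.
        print(film, positions, bin(to_mask(positions))[2:].zfill(8))
```

If the two positions coincide, the mask has a single set bit, which is the "Avatar" case described in the docstring.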
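`estimated_error_rate` derives its figure from the bits actually set: the fill ratio raised to the number of hash functions. For comparison, here is a rough sketch of the textbook expectation computed from the number of inserted items instead, assuming independent and uniformly distributed hash functions. Both function names are illustrative and do not appear in the PR.

```python
def fill_ratio_error_rate(n_ones: int, size: int, k: int = 2) -> float:
    """Same formula as estimated_error_rate in the diff: (set bits / size) ** k."""
    return (n_ones / size) ** k


def expected_error_rate(n_items: int, size: int, k: int = 2) -> float:
    """Textbook expectation, assuming independent, uniform hash functions."""
    # Probability that a given bit is still 0 after n_items * k bit writes.
    p_zero = (1 - 1 / size) ** (k * n_items)
    return (1 - p_zero) ** k


if __name__ == "__main__":
    # Docstring state after adding "Titanic" and "Avatar": 3 of 8 bits set.
    print(fill_ratio_error_rate(3, 8))  # 0.140625, matching the doctest
    # Expectation for 2 items in 8 bits with 2 hash functions.
    print(expected_error_rate(2, 8))
```

The two numbers need not agree: the fill-ratio version reflects the bits a particular run happened to set (including collisions such as "Avatar"), while the expectation averages over all possible hash outcomes.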