-
Notifications
You must be signed in to change notification settings - Fork 29k
SPARK-18252: Using RoaringBitmap for bloom filters #15917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| * Best perfomance is on x86_64 platform | ||
| * Based on implementation https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp | ||
| */ | ||
| final class Murmur3_128 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain why we need to add a custom murmurhash implementation? it seems like conceptually we're just swapping bitmaps for bitmapss. What's the license of this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This decision came from fact that RoaringBitmap supports only int indexes(and can have only Int.MaxValue bits), since we use RoaringBitmap[] to represent vector way larger than Int.MaxValue, we need good hash function to generate long hash values. This implementation is based on smhasher and solr implementation which are both freely available I suppose. Also this implementation must be way faster on 64 bit platform than Murmur3_x86_32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than make a new implementation, use Guava's Hashing? Scala and Spark itself already have murmur hash impls but they seem 32-bit only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well Guava`s implementation is allocating a lot of temp objects(works only with ByteBuffer source). So I decided to do standalone implementation - all tests for Hasher are taken from Guava lib.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, maybe that makes sense. If bothering with a custom impl, why not 64 bits instead of 128? it would avoid needing a long[2] to take the result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact the is Murmur hash version which produces 64 - it is called Murmur 2 Wiki, but it is previous version of Murmur, current is Murmur 3 which is faster and give better distribution(Murmur2 has some flaws). Since in bloom filter we always need 1 or more hash functions, creating 128 bit hash(two longs) is pretty good I suppose.
| @@ -0,0 +1,30 @@ | |||
| package org.apache.spark.util.sketch; | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing a license
| import java.io.IOException; | ||
| import java.util.Arrays; | ||
|
|
||
| class RoaringBitmapArray { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably a dumb question but why do we need an array of bitmaps? I thought RoaringBitmap was a BitArray counterpart?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, RoaringBitmap supports only int indexes javadoc. Since we need to support bloom filters with size more than Int.MaxValue(BitArray supported 64*IntMaxValue) I decided to partition bloom filter with RoaringBitmap[]. Now Bloom filter can be Long.MaxValue large in size(number of bits).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I see: RoaringBitmap/RoaringBitmap#109
I wonder if it's worth using RoaringBitmap then, but, of course making the improvements we want from RB would take work too, and, removing code helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, the main point of this improvement is to save memory - now even filters with few elements acquire space needed for full bloom filter. With this improvement bloom filter size will increase as we insert values in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's certainly worth documenting this in the scaladoc BTW, why the class has to exist.
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @cloud-fan for an initial look before we get further
| import java.io.IOException; | ||
| import java.util.Arrays; | ||
|
|
||
| class RoaringBitmapArray { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's certainly worth documenting this in the scaladoc BTW, why the class has to exist.
| return (int) Math.ceil(numBits / (double)Integer.MAX_VALUE); | ||
| } | ||
|
|
||
| private static RoaringBitmap[] initialVector(int numOfBuckets){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small style things: space before braces, and after casts, as in a few lines above. Spaces around elements of a for loop, as below.
|
|
||
| /** Number of bits */ | ||
| long bitSize() { | ||
| return numBits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 space indent everywhere
|
|
||
| @Override | ||
| public int hashCode() { | ||
| return Arrays.hashCode(data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth incorporating bitCount, numBits? it's not wrong to omit them, but might be didier.
| @Override | ||
| public boolean equals(Object other) { | ||
| if (this == other) return true; | ||
| if (other == null || !(other instanceof RoaringBitmapArray)) return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use braces, newline for return
| if (other == null || !(other instanceof RoaringBitmapArray)) return false; | ||
| RoaringBitmapArray that = (RoaringBitmapArray) other; | ||
| return (this.bitCount == that.bitCount) && | ||
| (this.numBits == that.numBits) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4-space continuation indents
| class RoaringBitmapArray { | ||
|
|
||
| private final RoaringBitmap[] data; | ||
| private long bitCount; // number of 1`s in bitset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth maintaining this? it slows down paths like set; is it called frequently enough to cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is only need to calculate expectedFpp (false positive rate) which is not very frequent operation, I agree. will remove
|
|
||
| return new BloomFilterImpl(optimalNumOfHashFunctions(expectedNumItems, numBits), numBits); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go ahead and back out this spurious change; there's another newline at the end of the next file
| * Best perfomance is on x86_64 platform | ||
| * Based on implementation https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp | ||
| */ | ||
| final class Murmur3_128 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, maybe that makes sense. If bothering with a custom impl, why not 64 bits instead of 128? it would avoid needing a long[2] to take the result?
|
Jenkins add to whitelist |
|
Test build #68782 has finished for PR 15917 at commit
|
|
Please see my comment on the jira ticket. I'm not sure this is a good idea. Also the current implementation ignores versioning. |
|
@srowen It seems like I have fixed all your comments. |
|
Test build #68796 has finished for PR 15917 at commit
|
|
Also cc @lemire Can you take a look at the discussion on the JIRA ticket and chime in? https://issues.apache.org/jira/browse/SPARK-18252 |
|
One of the core Roaring authors (@gssiyankai) worked specifically on Roaring and Bloom filters. Any thoughts Gregory? |
|
Test build #68797 has finished for PR 15917 at commit
|
|
Yes all good points, this is more complex than I had imagined, especially with the good point about needing to support older serialized forms. This may not be worth it, yes. It would be nice to have just one implementation one way or the other, rather than 2. At the least, it may be fine to improve the existing implementation and punt on the rest. |
|
@ponkin mind closing the pull request given your findings on https://issues.apache.org/jira/browse/SPARK-18252 ? |
Closes apache#15689 Closes apache#14640 Closes apache#15917 Closes apache#16188 Closes apache#16206
What changes were proposed in this pull request?
How was this patch tested?
All previous tests are valid
Added more accurate test for Byte type
Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.