Skip to content

Conversation

@ponkin
Copy link

@ponkin ponkin commented Nov 17, 2016

What changes were proposed in this pull request?

  • Bloom filter is now using RoaringBitmap as bit vector - to save memory usage
  • Bloom filter is using Murmur 128 bit hash

How was this patch tested?

All previous tests are valid
Added more accurate test for Byte type

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

* Best perfomance is on x86_64 platform
* Based on implementation https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
*/
final class Murmur3_128 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why we need to add a custom murmurhash implementation? it seems like conceptually we're just swapping bitmaps for bitmapss. What's the license of this code?

Copy link
Author

@ponkin ponkin Nov 17, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This decision came from fact that RoaringBitmap supports only int indexes(and can have only Int.MaxValue bits), since we use RoaringBitmap[] to represent vector way larger than Int.MaxValue, we need good hash function to generate long hash values. This implementation is based on smhasher and solr implementation which are both freely available I suppose. Also this implementation must be way faster on 64 bit platform than Murmur3_x86_32

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than make a new implementation, use Guava's Hashing? Scala and Spark itself already have murmur hash impls but they seem 32-bit only

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well Guava`s implementation is allocating a lot of temp objects(works only with ByteBuffer source). So I decided to do standalone implementation - all tests for Hasher are taken from Guava lib.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, maybe that makes sense. If bothering with a custom impl, why not 64 bits instead of 128? it would avoid needing a long[2] to take the result?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact the is Murmur hash version which produces 64 - it is called Murmur 2 Wiki, but it is previous version of Murmur, current is Murmur 3 which is faster and give better distribution(Murmur2 has some flaws). Since in bloom filter we always need 1 or more hash functions, creating 128 bit hash(two longs) is pretty good I suppose.

@@ -0,0 +1,30 @@
package org.apache.spark.util.sketch;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a license

import java.io.IOException;
import java.util.Arrays;

class RoaringBitmapArray {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a dumb question but why do we need an array of bitmaps? I thought RoaringBitmap was a BitArray counterpart?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, RoaringBitmap supports only int indexes javadoc. Since we need to support bloom filters with size more than Int.MaxValue(BitArray supported 64*IntMaxValue) I decided to partition bloom filter with RoaringBitmap[]. Now Bloom filter can be Long.MaxValue large in size(number of bits).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I see: RoaringBitmap/RoaringBitmap#109
I wonder if it's worth using RoaringBitmap then, but, of course making the improvements we want from RB would take work too, and, removing code helps.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, the main point of this improvement is to save memory - now even filters with few elements acquire space needed for full bloom filter. With this improvement bloom filter size will increase as we insert values in it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's certainly worth documenting this in the scaladoc BTW, why the class has to exist.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @cloud-fan for an initial look before we get further

import java.io.IOException;
import java.util.Arrays;

class RoaringBitmapArray {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's certainly worth documenting this in the scaladoc BTW, why the class has to exist.

return (int) Math.ceil(numBits / (double)Integer.MAX_VALUE);
}

private static RoaringBitmap[] initialVector(int numOfBuckets){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small style things: space before braces, and after casts, as in a few lines above. Spaces around elements of a for loop, as below.


/** Number of bits */
long bitSize() {
return numBits;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 space indent everywhere


@Override
public int hashCode() {
return Arrays.hashCode(data);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth incorporating bitCount, numBits? it's not wrong to omit them, but might be didier.

@Override
public boolean equals(Object other) {
if (this == other) return true;
if (other == null || !(other instanceof RoaringBitmapArray)) return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use braces, newline for return

if (other == null || !(other instanceof RoaringBitmapArray)) return false;
RoaringBitmapArray that = (RoaringBitmapArray) other;
return (this.bitCount == that.bitCount) &&
(this.numBits == that.numBits) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4-space continuation indents

class RoaringBitmapArray {

private final RoaringBitmap[] data;
private long bitCount; // number of 1`s in bitset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth maintaining this? it slows down paths like set; is it called frequently enough to cache?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is only need to calculate expectedFpp (false positive rate) which is not very frequent operation, I agree. will remove


return new BloomFilterImpl(optimalNumOfHashFunctions(expectedNumItems, numBits), numBits);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go ahead and back out this spurious change; there's another newline at the end of the next file

* Best perfomance is on x86_64 platform
* Based on implementation https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
*/
final class Murmur3_128 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, maybe that makes sense. If bothering with a custom impl, why not 64 bits instead of 128? it would avoid needing a long[2] to take the result?

@srowen
Copy link
Member

srowen commented Nov 17, 2016

Jenkins add to whitelist

@SparkQA
Copy link

SparkQA commented Nov 17, 2016

Test build #68782 has finished for PR 15917 at commit 2cd92c3.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Nov 17, 2016

Please see my comment on the jira ticket. I'm not sure this is a good idea.

Also the current implementation ignores versioning.

@ponkin
Copy link
Author

ponkin commented Nov 17, 2016

@srowen It seems like I have fixed all your comments.

@SparkQA
Copy link

SparkQA commented Nov 17, 2016

Test build #68796 has finished for PR 15917 at commit 66f58d1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Nov 17, 2016

Also cc @lemire

Can you take a look at the discussion on the JIRA ticket and chime in? https://issues.apache.org/jira/browse/SPARK-18252

@lemire
Copy link
Contributor

lemire commented Nov 17, 2016

One of the core Roaring authors (@gssiyankai) worked specifically on Roaring and Bloom filters. Any thoughts Gregory?

@SparkQA
Copy link

SparkQA commented Nov 17, 2016

Test build #68797 has finished for PR 15917 at commit 487da8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Nov 18, 2016

Yes all good points, this is more complex than I had imagined, especially with the good point about needing to support older serialized forms. This may not be worth it, yes.

It would be nice to have just one implementation one way or the other, rather than 2. At the least, it may be fine to improve the existing implementation and punt on the rest.

@rxin
Copy link
Contributor

rxin commented Nov 18, 2016

@ponkin mind closing the pull request given your findings on https://issues.apache.org/jira/browse/SPARK-18252 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants