vlmutolo · 2 hours ago
This article is a little confusing. I think this is a roundabout way to invent the blocked bloom filter with k=2 bits inserted per element.

It seems like the authors wanted to use a single hash for performance (?). Maybe they correctly determined that naive Bloom filters have poor cache locality and reinvented block bloom filters from there.

Overall, I think blocked Bloom filters should be the default most people reach for. They completely solve the cache locality issue (a single cache miss per element lookup), and they sacrifice only something like a 10–15% space increase to do it. I had a simple implementation running at something like 20 ns per query with maybe k=9. It would be about 9x that for naive Bloom filters.

There’s some discussion in the article about using a single hash to come up with various indexing locations, but it’s simpler to just think of block bloom filters as:

1. Hash-0 gets you the block index

2. Hash-1 through hash-k get you the bits inside the block

If your implementation slices up a single hash to divide it into multiple smaller hashes, that’s fine.
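To make that two-step picture concrete, here's a toy sketch (names and parameters are mine, not from any real library): one wide hash is sliced into a block index plus k in-block bit indexes, so every probe for a key touches a single block.

```python
import hashlib

BLOCK_BITS = 512   # one 64-byte cache line = 512 bits
K = 8              # bits set per element

class BlockedBloomFilter:
    """Toy blocked Bloom filter: all k bits for a key land in one block."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.blocks = [0] * num_blocks   # each block modeled as a 512-bit int

    def _indexes(self, key):
        # One 128-bit hash, sliced up: low bits pick the block ("hash-0"),
        # then eight 9-bit slices pick bits inside it ("hash-1..k").
        h = int.from_bytes(
            hashlib.blake2b(key.encode(), digest_size=16).digest(), "little")
        block = h % self.num_blocks
        h >>= 32
        bits = []
        for _ in range(K):
            bits.append(h & (BLOCK_BITS - 1))   # 9-bit slice -> [0, 512)
            h >>= 9
        return block, bits

    def add(self, key):
        block, bits = self._indexes(key)
        for b in bits:
            self.blocks[block] |= 1 << b

    def __contains__(self, key):
        block, bits = self._indexes(key)
        word = self.blocks[block]
        return all(word >> b & 1 for b in bits)
```

A real implementation would store each block as a cache-line-aligned byte array rather than a Python big int, but the index-slicing logic is the same.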

sakras · 1 hour ago
Yeah, I kind of think the authors didn't conduct a thorough enough literature review here. There are well-known relations between the number of hash functions and the FPR, cache-blocking and register-blocking are classic techniques ("Cache-, Hash-, and Space-Efficient Bloom Filters" by Putze et al.), and there are even ways of generating bit patterns from a single hash function that work well (shamelessly shilling my own blog post on the topic: https://save-buffer.github.io/bloom_filter.html)

I also find the use of atomics to build the filter confusing here. If you're doing a join, you're presumably doing a batch of hashes, so it'd be much more efficient to partition your Bloom filter, lock the partitions, and do a bulk insertion.
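A rough sketch of that partition-then-bulk-insert idea (everything here is my own illustrative naming, not from the article): bucket the batch's hashes by partition first with no locks held, then take each partition's lock exactly once and OR in its accumulated mask.

```python
from collections import defaultdict
from threading import Lock

class PartitionedBloom:
    """Toy partitioned Bloom filter (k=1 for brevity): one lock per
    partition, and bulk inserts acquire each lock at most once."""

    def __init__(self, num_partitions, bits_per_partition):
        self.bits_per_partition = bits_per_partition
        self.parts = [0] * num_partitions        # each partition: big-int bitset
        self.locks = [Lock() for _ in range(num_partitions)]

    def bulk_insert(self, hashes):
        # Phase 1: bucket bit positions by partition, no locks held.
        buckets = defaultdict(int)
        for h in hashes:
            p = h % len(self.parts)
            bit = (h >> 32) % self.bits_per_partition
            buckets[p] |= 1 << bit
        # Phase 2: one lock acquisition per touched partition.
        for p, mask in buckets.items():
            with self.locks[p]:
                self.parts[p] |= mask

    def __contains__(self, h):
        p = h % len(self.parts)
        bit = (h >> 32) % self.bits_per_partition
        return self.parts[p] >> bit & 1 == 1
```

The point is the two-phase shape: per-element atomics become one lock round-trip per partition per batch.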

thomasmg · 25 minutes ago
Your blog post is great! Except for one detail: you used modulo n. If n is not known at compile time, multiply+shift is much faster [1]. Division and modulo (remainder) are slow, except on Apple silicon (I don't know what they did there). BTW, for blocked Bloom filters there are some SIMD variants that seem to be simpler than yours [2] (maybe I'm wrong; I didn't look at the details, it just seems yours uses more code). I implemented a register-blocked one in Java here [3].

Bulk insertion: yes, if there are many keys, bulk insertion is faster. For xor filters, I used radix sort before insertion [4] (I should have documented the code better), but for fuse filters and blocked Bloom filters it might not be worth it, unless the filter is huge.

[1] https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-... [2] https://github.com/FastFilter/fastfilter_cpp/blob/master/src... [3] https://github.com/FastFilter/fastfilter_java/blob/master/fa... [4] https://github.com/FastFilter/fastfilter_cpp/blob/master/src...

ozgrakkurt · 17 minutes ago
You can actually make those two bits more independent afaik.

https://github.com/apache/parquet-format/blob/master/BloomFi...

https://github.com/facebook/rocksdb/blob/main/util/bloom_imp...

The first one is useful for grasping the idea; the second one is more comprehensive. Both do multiple bit loads but try to make the bits as independent as possible, as far as I can understand.
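The core trick in that style of filter is remixing one hash with a distinct odd constant per probe, so the bit positions are less correlated than naively reusing slices of the same hash. A sketch of the idea (the salt constants below are the ones I believe the Parquet split-block spec uses, but treat the exact values and layout as illustrative, not as either implementation above):

```python
# Eight odd 32-bit salt constants, one per 32-bit word of a 256-bit block.
SALTS = [0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
         0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31]

def probe_bits(h32):
    """Derive 8 loosely independent bit positions from a single 32-bit
    hash: multiply by a distinct odd salt, keep the top 5 bits of the
    32-bit product, giving one position in [0, 32) per word."""
    return [((h32 * s) & 0xFFFFFFFF) >> 27 for s in SALTS]
```

Because each salt scrambles the hash differently, the eight positions behave far more independently than, say, h, h+1, h+2, ..., even though only one hash was computed.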

Also, the hash function has a huge effect on Bloom filter performance. I was getting 2x the performance using xxhash3 instead of wyhash, even though wyhash is the faster hash function afaik.

pkoird · 2 hours ago
Clever. My first impression was that surely this saturates the filter too fast, since we're setting more bits at once, but it looks like the math checks out. It's one of those non-intuitive things that I'm glad I learned today.
lemagedurage · 2 hours ago
True, I had the same feeling. The article does go over 256K elements in a Bloom filter of 2M bits. After 1M elements, using 2 bits actually increases the false positive rate, but at that point the false positive rate is already far too high to be useful.
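The crossover is easy to check with the standard approximation p ≈ (1 − e^(−kn/m))^k, assuming "2M" means 2·10^6 bits (my reading, not stated in the comment):

```python
import math

def fpr(n, m, k):
    """Approximate false positive rate of a Bloom filter with n keys,
    m bits, and k hash functions: (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

m = 2_000_000
# At the article's load of 256K keys, k=2 clearly beats k=1
# (roughly 0.05 vs 0.12):
low_load = (fpr(256_000, m, 1), fpr(256_000, m, 2))
# Well past capacity at 1M keys, k=1 edges ahead again, but both
# rates are already around 0.4:
high_load = (fpr(1_000_000, m, 1), fpr(1_000_000, m, 2))
```

So the extra bit only hurts once the filter is overloaded to the point of being useless anyway, which matches the intuition above.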
h33t-l4x0r · 1 hour ago
Hmm, Bloom filters seem important. I'm wondering why my CS education never even touched on them; tbh it's triggering my imposter syndrome.
benmanns · 1 hour ago
They were only barely touched on in my CS education, so don't feel too left out. Spend an evening or two on the Wikipedia category for probabilistic data structures[0]. With a CS education you should have the baseline knowledge to find them really fascinating. Enjoy!

Oh, and I don’t find myself actually implementing any of these very often or knowing that they are in use. I occasionally use things like APPROX_COUNT_DISTINCT in Snowflake[1], which is a HyperLogLog (linked in the Wiki).

[0]: https://en.wikipedia.org/wiki/Category:Probabilistic_data_st...

[1]: https://docs.snowflake.com/en/sql-reference/functions/approx...

dheera · 1 minute ago
My education didn't touch upon it but I've been grilled on it multiple times in interviews.

I learned about them after the first time I got grilled and rejected. Sucks to be the first company that grilled me about it. Thanks for the tip, though; you just didn't stick around long enough to see how fast I learn.