Correct. I've been going deep into this looking for a unicorn implementation friendly to distributed environments:
* Mutable - streaming applications
* Mergeable - so you can avoid querying N filters all the time or share slices.
* Scalable - so you don't have to pre-size
* Performant
* Serialize to bytes
* Language - Implemented in multiple languages. Many are in C only which is suboptimal (and often too complicated to easily port).
* Bonus: Thread safe implementation
Typically you can get some of these, but not all. Sometimes it's just not possible with the structure (e.g. mutability in an xor filter), sometimes it's an implementation gap (serialization, etc.).
In my case, finding the magical combination in Java has been quite difficult. The only way was to go custom and relax some requirements (like auto-scaling structures).
After doing a lot of testing I ended up with pre-sized bloom filters behind an MPSC (multi-producer, single-consumer) model. The tradeoff of having to monitor/pre-size filters is one I'm not happy with, but weighed against the other tradeoffs it made sense.
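To make the MPSC idea concrete, here is a minimal sketch, not my production code: many producer threads push items onto a bounded queue and exactly one writer thread mutates the filter, so the filter itself needs no locking. `MpscFilterWriter`, its methods, and the two-probe `BitSet` stand-in for the real Bloom filter are all hypothetical names and simplifications made up for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch: many producers enqueue, one thread owns the filter. */
final class MpscFilterWriter implements AutoCloseable {
    private static final byte[] STOP = new byte[0]; // poison pill, compared by identity
    private final BlockingQueue<byte[]> queue;
    private final BitSet bits;                      // mutated only by the writer thread
    private final int numBits;
    private final Thread writer;

    MpscFilterWriter(int numBits, int queueCapacity) {
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
        this.queue = new ArrayBlockingQueue<>(queueCapacity);
        this.writer = new Thread(this::drain, "filter-writer");
        this.writer.start();
    }

    /** Called from any number of producer threads; blocks when the queue is full. */
    void submit(byte[] item) {
        try {
            queue.put(item);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while enqueueing", e);
        }
    }

    /** Single consumer loop: the only code that writes to the bit set. */
    private void drain() {
        try {
            for (byte[] item = queue.take(); item != STOP; item = queue.take()) {
                bits.set(probe1(item));
                bits.set(probe2(item));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Stand-in hashing (assumption): two probes derived from Arrays.hashCode.
    private int probe1(byte[] item) {
        return Math.floorMod(Arrays.hashCode(item), numBits);
    }

    private int probe2(byte[] item) {
        return Math.floorMod(Integer.rotateLeft(Arrays.hashCode(item) * 0x9E3779B9, 15), numBits);
    }

    /** Call only after all producers have finished; waits for the queue to drain. */
    @Override
    public void close() {
        submit(STOP);
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while closing", e);
        }
    }

    /** Safe to call only after close() has returned in this sketch. */
    boolean mightContain(byte[] item) {
        return bits.get(probe1(item)) && bits.get(probe2(item));
    }
}
```

The bounded queue gives backpressure for free, and `close()` relies on FIFO ordering: everything submitted before the poison pill is applied before the writer exits. Concurrent reads while the writer is running are deliberately out of scope here.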
Cuckoo was far too slow (and merging was iffy). I got a huge speed increase from switching and customizing implementations (Guava -> Spark -> fastfilter), and another big boost from using xxh3 (or xxhash) rather than murmur, which is pretty popular in off-the-shelf implementations.
Porting CQF and much of the latest research was just too much (and many implementations lack merging, etc., so you have to really understand the structure to do it right).
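For contrast, a plain Bloom filter covers the "mergeable" and "serialize to bytes" boxes with very little code, which is a big part of why I settled on it. A rough sketch under stated assumptions: `SimpleBloomFilter` and its methods are hypothetical names, the hash is an FNV-1a plus mixer stand-in (not xxhash, and not any library's implementation), and the sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. Merging is only valid between filters with the same bit count, hash count, and hash function.

```java
import java.nio.ByteBuffer;

/** Hypothetical pre-sized Bloom filter with merge and byte serialization. */
final class SimpleBloomFilter {
    private final long[] words;   // bit array, 64 bits per word
    private final long numBits;   // m, rounded up to a multiple of 64
    private final int numHashes;  // k

    private SimpleBloomFilter(long[] words, int numHashes) {
        this.words = words;
        this.numBits = words.length * 64L;
        this.numHashes = numHashes;
    }

    /** Pre-size for an expected item count n and target false-positive rate p. */
    static SimpleBloomFilter create(long expectedItems, double fpp) {
        double ln2 = Math.log(2);
        long m = (long) Math.ceil(-expectedItems * Math.log(fpp) / (ln2 * ln2));
        int wordCount = Math.toIntExact(Math.max(1, (m + 63) / 64));
        int k = (int) Math.max(1, Math.round((double) m / expectedItems * ln2));
        return new SimpleBloomFilter(new long[wordCount], k);
    }

    // Stand-in 64-bit hash (assumption): FNV-1a followed by a splitmix64-style mixer.
    private static long hash64(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h = (h ^ (h >>> 30)) * 0xbf58476d1ce4e5b9L;
        h = (h ^ (h >>> 27)) * 0x94d049bb133111ebL;
        return h ^ (h >>> 31);
    }

    // Double hashing: probe i uses h1 + i * h2, both halves of one 64-bit hash.
    private long index(long hash, int i) {
        long h1 = hash & 0xffffffffL;
        long h2 = (hash >>> 32) | 1L;
        return (h1 + i * h2) % numBits;
    }

    void put(byte[] item) {
        long h = hash64(item);
        for (int i = 0; i < numHashes; i++) {
            long idx = index(h, i);
            words[(int) (idx >>> 6)] |= 1L << (idx & 63);
        }
    }

    boolean mightContain(byte[] item) {
        long h = hash64(item);
        for (int i = 0; i < numHashes; i++) {
            long idx = index(h, i);
            if ((words[(int) (idx >>> 6)] & (1L << (idx & 63))) == 0) return false;
        }
        return true;
    }

    /** Union of two filters: a bitwise OR, valid only for identical parameters. */
    void merge(SimpleBloomFilter other) {
        if (numBits != other.numBits || numHashes != other.numHashes) {
            throw new IllegalArgumentException("filters must share size and hash count");
        }
        for (int i = 0; i < words.length; i++) words[i] |= other.words[i];
    }

    /** Layout (made up for this sketch): k, word count, then the words, big-endian. */
    byte[] toBytes() {
        ByteBuffer buf = ByteBuffer.allocate(8 + words.length * 8);
        buf.putInt(numHashes).putInt(words.length);
        for (long w : words) buf.putLong(w);
        return buf.array();
    }

    static SimpleBloomFilter fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int k = buf.getInt();
        long[] words = new long[buf.getInt()];
        for (int i = 0; i < words.length; i++) words[i] = buf.getLong();
        return new SimpleBloomFilter(words, k);
    }
}
```

The catch is exactly the one above: because merge needs identical parameters, every node has to agree on the pre-sizing up front, which is where the monitoring burden comes from.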