After the Bloom Filter post the other day, I've been doing a deep dive into probabilistic data structures. This one was obviously on the list, along with: skip list, bloom filter, count-min sketch, linear counting, LogLog, HyperLogLog. Are there any major ones I'm missing, or even better, any lesser-known ones I should be aware of? Are there any good resources/courses on this class of data structure/algorithm?
MinHash works well if you need to approximate Jaccard similarity. To approximate cosine similarity, you can use Charikar's SimHash: http://d3s.mff.cuni.cz/~holub/sw/shash/.
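To make the MinHash idea concrete, here's a minimal sketch in Go (the names `minhashSignature` and `estimateJaccard` are my own, and seeding FNV-1a with the hash-function index is a stand-in for a proper family of independent hash functions): the fraction of signature slots on which two sets agree is an unbiased estimate of their Jaccard similarity.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// minhashSignature computes a k-value MinHash signature for a set of strings.
// Each of the k "hash functions" is simulated by prefixing the input with the
// function index before hashing with FNV-1a.
func minhashSignature(set []string, k int) []uint64 {
	sig := make([]uint64, k)
	for i := range sig {
		sig[i] = ^uint64(0) // start each slot at max uint64
	}
	for _, s := range set {
		for i := 0; i < k; i++ {
			h := fnv.New64a()
			h.Write([]byte{byte(i)}) // seed with the hash-function index
			h.Write([]byte(s))
			if v := h.Sum64(); v < sig[i] {
				sig[i] = v
			}
		}
	}
	return sig
}

// estimateJaccard returns the fraction of signature slots that agree, which
// estimates the Jaccard similarity |A∩B| / |A∪B| of the underlying sets.
func estimateJaccard(a, b []uint64) float64 {
	match := 0
	for i := range a {
		if a[i] == b[i] {
			match++
		}
	}
	return float64(match) / float64(len(a))
}

func main() {
	a := minhashSignature([]string{"red", "green", "blue", "cyan"}, 128)
	b := minhashSignature([]string{"red", "green", "blue", "pink"}, 128)
	// True Jaccard is 3/5 = 0.60; the estimate's standard error is about
	// sqrt(p(1-p)/k) ≈ 0.04 at k = 128.
	fmt.Printf("estimated Jaccard: %.2f\n", estimateJaccard(a, b))
}
```

The longer the signature, the tighter the estimate; 128 slots of 8 bytes each is already far smaller than most sets you'd compare this way.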
You might be interested in reading the section at the end of v.3 of Knuth's "The Art of Computer Programming" on superimposed coding. It was first developed by Mooers in the late 1940s as a way to reduce the number of punches needed on a punched card.
Mooers was annoyed that computer scientists decades later re-discovered the principles he first worked on, but without citing him. Knuth points out the connection.
For the uninformed like me, what is the real-world application for this? This is 100% curiosity and not criticism.
Basically I'm curious, in plain English: what kind of application would use this, and for what? Also, am I understanding correctly that this is essentially a sort of low-fidelity hash table?
Bloom filters are [one of?] the most used 'negative cache' structures. Consulting a Bloom filter can give a decisive answer that a key is not in a collection (i.e. the key was never hashed into it). You can then avoid a (presumably more expensive) lookup. I used Bloom filters often in my database work.
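As a sketch of that negative-cache pattern (toy code, not from any particular database; a `[]bool` stands in for a real bit array, and index-seeded FNV-1a for a proper hash family):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter: m bits and k hash functions, each derived
// from FNV-1a by prefixing the key with the function index.
type bloom struct {
	bits []bool
	k    int
}

func newBloom(m, k int) *bloom { return &bloom{bits: make([]bool, m), k: k} }

func (b *bloom) indexes(key string) []int {
	idx := make([]int, b.k)
	for i := 0; i < b.k; i++ {
		h := fnv.New64a()
		h.Write([]byte{byte(i)})
		h.Write([]byte(key))
		idx[i] = int(h.Sum64() % uint64(len(b.bits)))
	}
	return idx
}

func (b *bloom) Add(key string) {
	for _, i := range b.indexes(key) {
		b.bits[i] = true
	}
}

// MaybeContains returns false only when the key was definitely never added;
// true means "possibly present", and the caller should fall through to the
// expensive lookup.
func (b *bloom) MaybeContains(key string) bool {
	for _, i := range b.indexes(key) {
		if !b.bits[i] {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 3)
	f.Add("alice")
	f.Add("bob")
	if !f.MaybeContains("mallory") {
		// The decisive negative: no need to hit disk or the network.
		fmt.Println("'mallory' definitely absent: skip the expensive lookup")
	}
}
```

The one-sided error is the whole point: a "no" is certain, a "yes" is only probable, and the false-positive rate is tuned via the bit-array size and hash count.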
I was not familiar with cuckoo filters (I am familiar with cuckoo hashing, though); using them as a 'better Bloom filter' seems to be a relatively new development. A quick googling found 'Cuckoo Filter: Practically Better Than Bloom' [https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf].
The paper also cites the main advantage of cuckoo filters over Bloom filters: the ability to remove a key. A (vanilla) Bloom filter requires a rebuild of the entire filter if a key is removed from the set. There are more details in the paper; it seems a good read (I'm going through it now).
Here's one example. Web browsers often use Bloom filters or similar data structures to check for malicious URLs. They first check the URL against the local filter, and if a match is found, they double-check against an online API to make sure it wasn't a false positive.
Shipping a full list of malicious URLs would be too expensive, but a Bloom filter fits in a fraction of the space and still eliminates the vast majority of the API calls.
They're useful in cases like this, where the stored set is much smaller than the universe of possible keys (e.g., all URLs). Once membership approaches 50% of the universe, a plain bitset becomes more efficient (accuracy per bit of space) than a Bloom filter or any other probabilistic structure.
Say you are writing a spider and want to determine whether you have to revisit a website today or tomorrow. The small chance of visiting it tomorrow instead of today, even though the crawler should have visited it today, is a trade-off you would take instantly if it improves lookup speed and space requirements, and it can make something like this feasible in the first place without a huge budget.
Everything related to caching when working with a huge dataset is a good use case. Another case is when the dataset does not fit into memory but you need very fast lookups.
Edit: To clarify, you are obviously not caching the data itself, but a flag for whether to use the cache/do nothing or compute something.
> For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimised Bloom filters. Some possible use-cases that depend on approximated set-membership queries would be databases, caches, routers, and storage systems where it is used to decide if a given item is in a (usually large) set, with some small false positive probability. Alternatively, given it is designed to be a viable replacement to Bloom filters, it can also be used to reduce the space required in probabilistic routing tables, speed longest-prefix matching for IP addresses, improve network state management and monitoring, and encode multicast forwarding information in packets, among many other applications.
That's from the original research paper and the repository itself; thank you for asking!
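To make the mechanics concrete, here's a stripped-down sketch of the paper's "partial-key cuckoo hashing" in Go (the 1-byte fingerprint, 256 buckets, and `maxKicks` limit are toy parameters; real implementations make different choices). The key property is that the alternate bucket index is computed from the current index and the fingerprint alone, so entries can be relocated without the original key:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

const (
	numBuckets = 1 << 8 // must be a power of two for the XOR trick below
	bucketSize = 4
	maxKicks   = 500
)

// cuckooFilter stores a short fingerprint of each key in one of two buckets.
type cuckooFilter struct {
	buckets [numBuckets][bucketSize]byte // 0 marks an empty slot
}

func hash64(b []byte) uint64 {
	h := fnv.New64a()
	h.Write(b)
	return h.Sum64()
}

func fingerprint(key []byte) byte {
	f := byte(hash64(key))
	if f == 0 {
		f = 1 // reserve 0 for "empty"
	}
	return f
}

// altIndex is an involution: altIndex(altIndex(i, f), f) == i, because
// numBuckets is a power of two and XOR is its own inverse.
func altIndex(i uint64, f byte) uint64 {
	return (i ^ hash64([]byte{f})) % numBuckets
}

func (c *cuckooFilter) insertAt(i uint64, f byte) bool {
	for s := range c.buckets[i] {
		if c.buckets[i][s] == 0 {
			c.buckets[i][s] = f
			return true
		}
	}
	return false
}

func (c *cuckooFilter) Insert(key []byte) bool {
	f := fingerprint(key)
	i := hash64(key) % numBuckets
	if c.insertAt(i, f) || c.insertAt(altIndex(i, f), f) {
		return true
	}
	// Both candidate buckets are full: kick a random resident fingerprint
	// to its alternate bucket, cuckoo-style, until a slot opens up.
	for k := 0; k < maxKicks; k++ {
		s := rand.Intn(bucketSize)
		f, c.buckets[i][s] = c.buckets[i][s], f
		i = altIndex(i, f)
		if c.insertAt(i, f) {
			return true
		}
	}
	return false // filter is considered full
}

func (c *cuckooFilter) Lookup(key []byte) bool {
	f := fingerprint(key)
	i := hash64(key) % numBuckets
	for _, b := range []uint64{i, altIndex(i, f)} {
		for s := range c.buckets[b] {
			if c.buckets[b][s] == f {
				return true
			}
		}
	}
	return false
}

// Delete removes one stored copy of the key's fingerprint. Unlike a plain
// Bloom filter no rebuild is needed, but the caller must know the key
// really was inserted.
func (c *cuckooFilter) Delete(key []byte) bool {
	f := fingerprint(key)
	i := hash64(key) % numBuckets
	for _, b := range []uint64{i, altIndex(i, f)} {
		for s := range c.buckets[b] {
			if c.buckets[b][s] == f {
				c.buckets[b][s] = 0
				return true
			}
		}
	}
	return false
}

func main() {
	var c cuckooFilter
	c.Insert([]byte("hello"))
	fmt.Println(c.Lookup([]byte("hello"))) // true: no false negatives
	c.Delete([]byte("hello"))
	fmt.Println(c.Lookup([]byte("hello"))) // false: the only copy was removed
}
```

Because the fingerprint is stored instead of the key, two keys with the same fingerprint and buckets are indistinguishable, which is exactly where the false-positive rate (and the delete caveat discussed below) comes from.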
One could also use it to verify membership (or non-membership) in a set without having access to the set.
Suppose I have a secret list of websites from which you shouldn't load ads. I can then just publish a Bloom filter, which you can use in your browser to check whether a web request should be allowed, without ever having access to the original list. I know this is contrived, but hopefully you get the idea.
What is the situation with patents on the cuckoo filter? I saw no fewer than two pending applications with some quick googling, one by NetSpeed and one by TI, and a quick USPTO search shows over a dozen patents.
Is the entire useful concept patented at this point?
What I don't understand is the comment on the delete function:
// tries deleting 'bonjour' from the filter, but may delete another element:
// this can happen when another byte slice with the same fingerprint as
// 'bonjour' is 'deleted'
If it deletes another element, then the guarantee of no false negatives no longer holds. Is that right?
Yes, you are correct. Drawing from the example in the README: you could 'delete' "bonjour", assuming it had the same fingerprint as "buongiorno", and subsequent lookups for "buongiorno" would then return negatives (assuming it was entered only once). The original research paper makes no explicit guarantee of no false negatives; the invariant can only be maintained if, prior to deleting an element, you ensure the element is definitely in the filter (e.g., based on records in external storage).
Even if the element is definitely in the filter, another element can have the same fingerprint. Thus, deleting the element which is definitely in the filter will also cause false negatives for that other element. Isn't that right?
Are you assuming the other element is also in the filter? My assumption is that your 'deletions' occur only when you know beforehand that the element is already in the filter. If two elements with the same fingerprint are added to the filter, the fingerprint is stored two separate times. Deleting either one effectively removes only a single copy, thereby not affecting the other (you could be deleting the fingerprint that technically belongs to the other element, but it's all the same), so no, in this case you do not get false negatives.