Intersections of large distributed datasets
Heres the problem:
You have two (or more) datasets that you want to return the intersection of, but they are only accessible over a slow link. You can’t upload the smaller set to the larger set and have that do the intersection, so what do you do?
Example:
You have a distributed search engine where each node holds a list of urls that contain one term. A search comes in for “Hello World”, and you want to return the results of the Intersection of the urls that contain “Hello”, and the urls that contain “World”.
My solution:
Both lists are sorted in some way (and both are sorted the same way).
- NodeA sends NodeB it’s lowest numbered id.
- NodeB sends NodeA it’s lowest numbered id that is greater than or equal to NodeA’s id, and greater than any item it has previously sent.
- NodeA and NodeB repeat step 2 until there are no more ids.
- NodeA and NodeB now both have the intersection of NodeA’s ids and NodeB’s ids (as they’ve both been sent an id that they have too).
This will take at most 2*n operations where n is the size of the intersection. [Edit: hrm? where did I come up with that from. It’s obviously not!]
As an optimisation you could split the list up into “n” divisions, and send the lowest one in each division each time.