Intersections of large distributed datasets

Heres the problem:
You have two (or more) datasets that you want to return the intersection of, but they are only accessible over a slow link. You can’t upload the smaller set to the larger set and have that do the intersection, so what do you do?

Example:
You have a distributed search engine where each node holds a list of urls that contain one term. A search comes in for “Hello World”, and you want to return the results of the Intersection of the urls that contain “Hello”, and the urls that contain “World”.
My solution:
Both lists are sorted in some way (and both are sorted the same way).

  1. NodeA sends NodeB it’s lowest numbered id.
  2. NodeB sends NodeA it’s lowest numbered id that is greater than or equal to NodeA’s id, and greater than any item it has previously sent.
  3. NodeA and NodeB repeat step 2 until there are no more ids.
  4. NodeA and NodeB now both have the intersection of NodeA’s ids and NodeB’s ids (as they’ve both been sent an id that they have too).

This will take at most 2*n operations where n is the size of the intersection. [Edit: hrm? where did I come up with that from. It’s obviously not!]
As an optimisation you could split the list up into “n” divisions, and send the lowest one in each division each time.

Comments are closed.