Distributed Caching
Sunday, November 6th, 2005One neat idea I thought would be a content addressible web. Have a DHT, that indexes sha1(content) -> URL(content). If you have another copy of the content, you update the DHT to register your “mirror” of that content. When you want to fetch a document with a uri of say sha1:0a4d55a8d778e5022fab701977c5d840bbc486d0 then you look up the DHT for 0a4d55a8d778e5022fab701977c5d840bbc486d0, which returns you a list of url’s. You can pick the “best” one, a random one, or break the file up into sections and fetch it using byte ranges from multiple places simultaniously. Once you have the file, you save it into your local “cache” which is exported to the world via HTTP, and you register your new URL in the DHT.
If you want to do this as a normal web proxy you could also store in the dht sha1(url) -> sha1(content) mappings too.
Vunerabilities:
- People could register sha1(content) -> url(other content). This would be obvious (after you’ve downloaded the content), as the content you downloaded didn’t match the sha1(content) key you originally had. This could be rather annoying if the file is large. Possibly you could grab random byte ranges that you already have from other url’s, and verify they are identical. Those byte ranges could be short (1k?), so long as they are random then you can use it to verify that the content is legit. If the byte ranges disagree you’re kinda screwed tho, which one is correct? If you have sha1(url)->sha1(content) mappings, you can grab the same byte range from the original url and use it to prove which one is correct conclusively.
- To get around the above issue, the attacker may attack the original url so that you cannot verify the content.
- An attacker may also have most of the file intact, but have corrupted small sections. At the end of the download you know that the file is corrupt (because the sha1(content) doesn’t match), but you don’t know where the error is. rsync’ing with the original url will find and correct the error fast enough, but who runs an rsync server on their webserver anyway?
- People could register bogus sha1(url) -> sha1(content) mappings. There is no way to detect this. Maybe do random byte ranges from the original url, and verify it against the content you’re downloading. This also has the problem above.
- sha1(url) -> sha1(content) mappings may become obsolete quickly. If the original webserver supports it you could use sha1(len(url)+”:”+url+lastmodified) which you can get from a HEAD request. We could also ask for an X-Sha1: header which we could use directly and bypass the dht sha1(uri)->sha1(content) mapping all together.
Hopefully this would mean that the original hoster has a much lower bandwidth load, they only occasionally have people requesting small byte ranges off them to do verification, most of the bandwidth consumption is actually being downloaded from other hosts. Popular files get quickly mirrored around the internet.
Abusers will find that their “corrupt” data quickly gets drowned out by lots of people with uncorrupt versions. If you pick one at random then chances are you’ll pick the uncorrupt versions. However, there is still the gazillion trojanned hosts approach to poisoning the DHT. I guess if you give much higher preference to people who have given you good data in the past, then you will tend to avoid these. The worst an attacker can do anyway is slow down your download massively (by forcing you to waste time and download multiple, useless copies) rather than convince you to accept corrupted data.
All of this could be very easily written in an afternoon as a web proxy by someone that knows java and or C#, as both languages have practically everything here as libraries, all that is required is writting a bit of glue.