Suppose an adversary with 10,000 to 100,000 machines wanted to poison
a BitTorrent torrent with bogus (spoofed) files. Because BitTorrent
hashes each piece (the hashes are stored in the metainfo file), this
would have to be an attack of bogus (spoofed) pieces: a hacked node
would advertise to the tracker that it has a piece, but would
distribute a bogus piece instead.
To combat such an attack we would want to know quickly whether a piece
is bogus, more quickly than downloading the entire piece and checking
its hash. Typically a BitTorrent piece is 256 KiB. For a 640 MiB
CD image, there are 2,560 256 KiB pieces, each with a 20-byte SHA-1
checksum, for a torrent metainfo file of about 50 KiB.
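A quick sanity check of that arithmetic, as a sketch in Python
(assuming binary units for the image size):

    PIECE = 256 * 1024            # 256 KiB per piece
    SHA1_LEN = 20                 # bytes per SHA-1 checksum
    image = 640 * 1024 * 1024     # 640 MiB CD image

    pieces = image // PIECE
    print(pieces, pieces * SHA1_LEN)  # 2560 pieces, 51200 bytes (~50 KiB)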
Ideally, we want hashes at a much finer resolution, but that makes the
metainfo file a lot larger. The key idea is to distribute the
high-resolution file of hashes itself through BitTorrent, and to do so
recursively. This is, very roughly, the idea behind the Merkle Hash
Trees planned for the BitTorrent 2 (BT2) protocol.
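A minimal sketch of the recursive construction in Python (hypothetical
helpers, not the actual BT2 format; the 6,464-byte chunk size is
motivated below):

    import hashlib

    CHUNK = 6464  # chunk size, chosen below for ~1% total overhead

    def hash_level(data: bytes) -> bytes:
        # Concatenated SHA-512 hashes of each CHUNK-byte chunk of data.
        return b"".join(hashlib.sha512(data[i:i + CHUNK]).digest()
                        for i in range(0, len(data), CHUNK))

    def build_levels(data: bytes) -> list[bytes]:
        # Hash recursively until one 64-byte ``root'' checksum remains.
        levels = [data]
        while len(levels[-1]) > 64:
            levels.append(hash_level(levels[-1]))
        return levels  # levels[0] is the file; levels[-1] is the root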
Assume a 1 GB file. If each 6,464-byte chunk is hashed with a 64-byte
SHA-512 (SHA-2) checksum, there are 9,900,990 bytes of first-level
overhead. To transfer this 9.9 MB of hashes we use BitTorrent again
(still under attack from the spoofing adversary), so we hash each of
its 6,464-byte chunks with SHA-512 for an additional 98,029 bytes of
second-level overhead. Recursing again, the third-level file of hashes
is 970 bytes. The sum of the infinite series is 10 MB, or 1% of the
original 1 GB file: each level is 64/6,464 = 1/101 the size of the
level below it, and the geometric series 1/101 + 1/101^2 + ... sums to
exactly 1/100. The chunk size 6,464 was chosen for one-percent
overhead; it is obviously tweakable for different overheads and
checksum widths.
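The same series, computed numerically (a sketch that assumes ceiling
division for the final partial chunk at each level):

    size, total = 10**9, 0            # 1 GB file
    while size > 64:
        size = -(-size // 6464) * 64  # ceil(size/6464) chunks, 64 hash bytes each
        total += size
    print(total)                      # 10,000,128 bytes: almost exactly 1%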
For downloading, the metainfo file need only contain the ``root''
checksum, and the client would first download the depth-1 portion of
the hash tree. This should be easy because it is the first thing
everyone downloads, so there should be lots of copies. Then, using the
hashes of the depth-1 level, the client downloads the next level,
eventually working down to the leaves, which contain the actual data
in 6,464-byte chunks.
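A sketch of one level of that top-down verification; fetch_chunk is a
hypothetical callback that asks some peer for chunk i:

    import hashlib

    HASH_LEN = 64  # bytes per SHA-512 checksum

    def download_level(level_hashes: bytes, fetch_chunk) -> bytes:
        # level_hashes is an already-verified file of hashes; fetch each
        # chunk of the level below and check it before accepting it.
        chunks = []
        for i in range(0, len(level_hashes), HASH_LEN):
            expected = level_hashes[i:i + HASH_LEN]
            chunk = fetch_chunk(i // HASH_LEN)
            if hashlib.sha512(chunk).digest() != expected:
                raise ValueError(f"bogus chunk {i // HASH_LEN}: bad peer")
            chunks.append(chunk)
        return b"".join(chunks)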
We assume no propagation of information about which peers are bad,
because that is a hard problem. (Who do you trust?)
Presumably it is worth wasting 6,464 bytes of download to mark a
particular peer (an IP/port tuple?) as bad. With 100,000 bad nodes,
however, that is up to 646,400,000 bytes of wasted communication.
Furthermore, during the time it takes to test all those bad nodes, the
adversary can change IP addresses. Ugh.
Probabilistically, though, assume the adversary only has the resources
to create a 100:1 bad-node-to-good-node ratio (still a tremendous
number of nodes for popular files, which may have hundreds or
thousands of real downloaders, i.e., good nodes). Assuming a swarm
size of 20, we need to ask about 2,000 machines to find 20 good nodes,
wasting 1,980 * 6,464 = 12,798,720 bytes, or about 12.8 MB, of
communication.
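The same estimate in code (using the 100:1 ratio and swarm of 20 from
above):

    bad_per_good, swarm, chunk = 100, 20, 6464
    asked = swarm * bad_per_good      # ~2,000 peers probed to find 20 good ones
    wasted = (asked - swarm) * chunk  # one 6,464-byte chunk per bad peer tested
    print(wasted)                     # 12,798,720 bytes, about 12.8 MB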