Suppose an adversary with 10,000 to 100,000 machines wanted to poison a BitTorrent torrent with bogus files (spoofed files). Because BitTorrent has hashed pieces (the hashes stored in the metainfo file), this would be an attack of bogus pieces (spoofed pieces). A hacked node would advertise (to the tracker) that it has a piece but distribute bogus pieces.
To combat such an attack we would want to know quickly whether a piece was bogus, quicker than downloading the entire piece and checking the hash. Typically a BitTorrent piece is 256 KiB. For a 640MB CD-image, there 2,500 256KiB pieces, each with a 20 byte SHA-1 checksum for a torrent metainfo file of about 50 KiB.
Ideally, we want hashes at a much finer resolution, but that makes the metainfo file a lot larger. The key idea is to distribute the high-resolution file of hashes itself through BitTorrent, and do so recursively. This is very roughly the idea of Merkle Hash Trees to be part of the BitTorrent 2 (BT2) protocol.
Assume a 1 GiB file. If each 6,464-byte is hashed with a 64 byte SHA-512 (SHA2) checksum, there's 9,900,990 bytes of first-level overhead. To transfer this 9.9 MiB of hashes, we use the BitTorrent (still under attack from the spoofing adversary) we hash each 6,464 byte chunk with SHA-512 for an additional 98,029 bytes of second-level overhead. Recursing again, the third-level file of hashes is 970 bytes. The sum of the infinite series is 10 MiB, or 1% of the original 1 GiB file. The chunk size 6,464 was chosen for one-percent overhead; obviously tweakable for different overheads and checksum widths.
For downloading, the metainfo file need only contain the ``root'' checksum, and first the client would download the depth-1 portion of the hash tree. This should be easy because it's the first thing everyone downloads, so there should be lots of copies. Then using the hashes of the depth-1 level, download the next level, eventually working to the leaves which contain the actual data, in 6,464 byte chunks.
We assume no propagation of information of what peers are bad, because that is a hard problem. (Who do you trust?)
One assumes it is worth it to waste 6,464 bytes of download to mark that a particular peer (an IP / port tuple?) is bad. With 100,000 bad nodes however, that's up to 646,400,000 bytes of wasted communication. Furthermore, during the time it takes to test all those bad nodes, the adversary can change IP addresses. Ugh.
Probabilistically, though, assume the adversary only has resources to create a 100:1 bad-node to good-node ratio (this is still a tremendous number of nodes for popular files which my have hundreds or thousands of real downloaders, i.e., good nodes). Assuming a swarm size of 20, we need to ask 2,000 machines to find 20 good nodes, so wasted communication of 1980*6464=12.8 MiB of wasted communication.
1 comment :
We only need 80-bits of collision resistance security (160-bit hash), so we don't need the entire 512 bits (64 bytes) of SHA-512. Because SHA-1 is dead, we can simply truncate SHA-512 (or SHA-256 if one is feeling lucky) to 160 bits (20 bytes), which reduces chunk sizes from 6,464 bytes to 2,020 (still at 1% overhead), and reduces the amount of wasted communication required to determine a node is bad in a poisoning attack.
Post a Comment