Friday, March 19, 2010

[sbphocde] Cracking Colin Percival's Scrypt in hardware

Part 1: Off-chip memory

Percival's analysis uses the price of on-chip memory (memory on the ASIC itself), which, like L1 cache, is very expensive.  However, for a brute-force attack on a password, we do not need memory that close: we can use off-chip RAM, e.g., commodity DDR2 SDRAM.

Off-chip RAM has considerably greater latency than on-chip, but, perhaps surprisingly, a brute-force attacker does not care about latency.  While waiting for a memory request to complete, suspend one thread and work on another (we have millions of passwords to try).  Assuming one application of the hash function takes 64 clock cycles (corresponding to the 64 rounds of SHA-256), and assuming a memory latency of at most 64 cycles, two threads per core are sufficient: while one thread's request is outstanding, the other thread's 64 cycles of hashing hide the wait entirely.
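
A small sketch of the latency-hiding argument in Python (the 64-cycle figures are the assumptions above; everything else is just the general formula):

def threads_needed(compute_cycles, latency_cycles):
    # While one thread waits on memory, the others keep the core busy:
    # ceil(latency / compute) threads fill the wait, plus the waiting
    # thread itself.
    return 1 + -(-latency_cycles // compute_cycles)   # ceiling division

print(threads_needed(64, 64))    # -> 2, as above
print(threads_needed(64, 640))   # -> 11, if latency were ten times worse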

What will matter is memory bandwidth.  A core issues a memory request every 64 cycles, and one request is 256 bits, so on average 2 bytes per ASIC clock cycle per core.  For 1000 cores at 1 GHz, that is 2000 GB/s, which is 157 memory channels of DDR3-1600 (12.8 GB/s peak per channel).  Each thread accesses 2^14 (interactive passwords) to 2^20 (file encryption) memory entries, and each entry is 32 bytes, so between 0.5 MB and 32 MB per thread, or 1 GB to 64 GB per chip.  I imagine an architecture in which, instead of DIMMs, each memory chip sits on an independent bus to the cracking chip.  It might be difficult to fit all those pins on the package and all those wires on the circuit board.  Optical transport, perhaps?
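
Spelled out, the arithmetic above looks like this (a quick Python sketch; the 2 bytes per cycle per core figure is taken from the text, and 12.8 GB/s is the standard peak bandwidth of a single 64-bit DDR3-1600 channel):

import math

cores = 1000
clock_hz = 1e9
bytes_per_cycle_per_core = 2                    # figure used in the text
bandwidth = cores * clock_hz * bytes_per_cycle_per_core   # bytes per second
print(bandwidth / 1e9, "GB/s")                  # -> 2000.0 GB/s

channel = 12.8e9                                # peak bytes/s of one DDR3-1600 channel
print(math.ceil(bandwidth / channel), "channels")   # -> 157

entry_bytes = 32
threads = cores * 2                             # two threads per core
for entries in (2**14, 2**20):
    per_thread_mb = entries * entry_bytes / 2**20
    per_chip_gb = threads * entries * entry_bytes / 2**30
    print(per_thread_mb, "MB per thread,", round(per_chip_gb, 1), "GB per chip")
# -> 0.5 MB per thread, ~1 GB per chip; 32 MB per thread, ~62.5 GB per chip,
#    roughly the 1 GB to 64 GB range quoted above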

For 10000 cores per chip, multiply by 10 as necessary.  The memory does begin to get expensive, and I don't know about the cost of the bandwidth.  But I think a custom ASIC design will easily hit an average price of $100,000 per chip including all the one-time costs of design and fabrication, which is comparable to the price of 640 GB of RAM per chip.

Seymour Cray said, "You can always buy bandwidth" (but low latency must be cleverly designed).

Although the cryptographic hash function bounces all over memory, for a brute-force attacker trying many passwords in parallel, the access pattern is actually quite predictable at the very large scale.  Each thread is confined to a 0.5 MB to 32 MB segment of memory, and because we know how the threads and cores are interleaved, we know which segment of memory will be accessed when (we just don't know where within each segment).  This fact is useful if the RAM must be accessed in banks.
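
A toy illustration of this point (my own sketch; it assumes a plain round-robin interleaving of threads, which a real chip need not use, but any fixed schedule gives the same property):

NUM_THREADS = 2000                 # e.g. 1000 cores x 2 threads per core

def segment_accessed(request_slot):
    # Under a fixed round-robin schedule, the thread (and therefore the
    # memory segment, since each thread owns one segment) serviced at a
    # given request slot is a pure function of the slot number.
    return request_slot % NUM_THREADS

# We know in advance that request slot 12345 touches thread 345's segment,
# even though the offset inside that segment depends on the password.
print(segment_accessed(12345))     # -> 345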

Part 2: Hard drives

Having established that the problem is not latency bound, why not use even slower, cheaper memory, namely hard disk drives?  This allows for scaling if even more memory is required.  Because of the access delay, each core might need dozens or hundreds of threads.  It might even be necessary to store each suspended thread's state off chip.

We again take advantage of the fact that the pattern of memory accesses, now hard drive accesses, is predictable at the large scale.  Although we will need to do small seeks within a segment, we can avoid large seeks by arranging memory requests so that the next one falls in the segment physically immediately after the current one.
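
A minimal sketch of that ordering (my own illustration, assuming the threads' segments are laid out contiguously on the platter in thread order):

def sweep_order(pending_requests):
    # pending_requests: list of (thread_id, offset_within_segment) pairs.
    # With segments laid out contiguously in thread order, sorting by
    # thread_id sorts by physical position on the platter, so servicing
    # requests in this order only ever requires short, adjacent seeks.
    return sorted(pending_requests)

pending = [(7, 100), (2, 5), (5, 901), (3, 42)]
print(sweep_order(pending))   # -> [(2, 5), (3, 42), (5, 901), (7, 100)]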

Once again, we face the formidable problem of bandwidth.  A 1000-core chip calls for 2,000,000 MB/s, but a hard drive's maximum sustained bandwidth is about 130 MB/s, off by a factor of about 15,000.  It is rather inconceivable to have 15,000 hard drives per chip, so we need a better way.
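
The gap in numbers (same figures as in the text):

needed_mb_per_s = 2000000      # per 1000-core chip, from the estimate above
per_drive_mb_per_s = 130       # typical maximum sustained hard drive bandwidth
print(needed_mb_per_s / per_drive_mb_per_s)   # -> ~15384 drives per chip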

Note that 130 MB/s is the bandwidth of a contiguous read off the hard drive.  Our access pattern interleaves lots of short seeks, so performance will likely be worse, perhaps by a factor of thousands.

I believe the solution is multiple read/write heads per hard drive platter.  Because of the predictable access pattern, it is all right if each head has limited range, and if groups of heads must move in parallel.  Can we pack thousands or even millions of read heads over a platter?
