Ken's blog: [dkmthxbr] Whistleblowing a lot of data

Consider the problem of a whistleblower who seeks to disclose bad behavior of, say, a company, or maybe a government. However, the electronic documents proving the bad behavior are extremely large, say, a petabyte.

There are two ways to initially gather the data. One is to disguise the collection as part of the backup process, so that a sudden burst of high-bandwidth access to the data is not noticed. A variation is to target a finished backup, which is less likely to disrupt access to the live data. The other way is to sample the data randomly at low bandwidth over a long period of time.
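To get a feel for how long "a long period of time" might be, here is a rough back-of-the-envelope sketch. The function name and the link speeds are my own assumptions for illustration, a petabyte is taken as 10^15 bytes, and protocol overhead is ignored.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15  # bytes; decimal petabyte (an assumption, 2**50 is also common)


def days_to_copy(num_bytes: int, bits_per_second: float) -> float:
    """Days needed to move num_bytes at a sustained rate, ignoring overhead."""
    return num_bytes * 8 / bits_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical sustained rates that might hide inside normal traffic.
    rates = [("10 Mbit/s", 10e6), ("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)]
    for label, rate in rates:
        print(f"{label}: {days_to_copy(PETABYTE, rate):,.0f} days")
```

At a sustained 100 Mbit/s a petabyte takes about 926 days, roughly two and a half years. Even at 1 Gbit/s it takes about three months, which is why piggybacking on a backup is attractive.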

Having obtained the data, and somehow having transported it across the perimeter, the next problem is how to release the data. This is where things get technically interesting. The bad company will use all available legal and extra-legal means to suppress the data, in hopes of denying the leak ever occurred. A large bad company with political connections will probably be able to enlist / bribe / blackmail the government into aiding the suppression.

Do we have a censorship-resistant medium or technology that can handle such a large amount of data?

Consider the shortcomings of Bittorrent. Bittorrent is designed under the assumption that any particular downloader can store the entire download, which is unrealistic at a petabyte. And during the initial period after the release, there is exactly one seed, which is very vulnerable to attack.

Prior to release, the whistleblower should first replicate the data in many places, or else there will be a single point of failure for the bad company to attack. Such replication is expensive. The whistleblower will need to buy or rent tens or hundreds of petabytes of storage.
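The arithmetic behind "tens or hundreds of petabytes" is simple, but it helps to see how fast the bill grows. This is only a sketch: the function names are mine, and the price per terabyte-month is a made-up placeholder, not a quote from any real provider.

```python
def total_petabytes(data_pb: float, copies: int) -> float:
    """Raw storage needed to hold `copies` full replicas of the data."""
    return data_pb * copies


def monthly_cost(data_pb: float, copies: int, price_per_tb_month: float) -> float:
    """Monthly rental cost, with 1 PB counted as 1000 TB."""
    return total_petabytes(data_pb, copies) * 1000 * price_per_tb_month


if __name__ == "__main__":
    PLACEHOLDER_PRICE = 10.0  # hypothetical currency units per TB per month
    for copies in (10, 30, 100):
        pb = total_petabytes(1, copies)
        cost = monthly_cost(1, copies, PLACEHOLDER_PRICE)
        print(f"{copies:>3} copies: {pb:,.0f} PB, {cost:,.0f} per month")
```

Whatever the real price turns out to be, the cost scales linearly with the number of copies, so the replication factor is the number the whistleblower has to budget around.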

Here is a combination of cloud computing, peer-to-peer, and financial technologies that might work. The whistleblower rents storage from a distributed grid service such as Tahoe. We assume that the grid node owners are to be paid. The data is stored encrypted so as not to tip off any node owner about the data before the collection is complete. When the collection is complete, and naturally replicated, since replication is how distributed storage works, the encryption key is released, and each node is instantly transformed into a node on a public peer-to-peer network such as (modified) Bittorrent or Freenet to facilitate additional replication and distribution. This software does not yet exist, though all the pieces do.
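The "store encrypted, release the key later" step can be sketched in a few lines. To keep it self-contained I use a toy stream cipher built from SHA-256 in counter mode; this is a stand-in of my own for illustration only, with no authentication, and a real grid would use a vetted authenticated cipher. All function names here are hypothetical.

```python
import hashlib
import secrets


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce, and a running counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_with_keystream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypts or decrypts; XOR with the keystream is its own inverse."""
    ks = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))


def encrypt_chunks(key: bytes, chunks: list) -> list:
    """Return (nonce, ciphertext) pairs, one per chunk, to hand to grid nodes."""
    sealed = []
    for chunk in chunks:
        nonce = secrets.token_bytes(16)  # fresh nonce so chunks never share a keystream
        sealed.append((nonce, xor_with_keystream(key, nonce, chunk)))
    return sealed


def decrypt_chunks(key: bytes, sealed: list) -> list:
    """Recover the plaintext chunks once the key has been published."""
    return [xor_with_keystream(key, nonce, ct) for nonce, ct in sealed]


if __name__ == "__main__":
    key = secrets.token_bytes(32)  # kept secret until the collection is complete
    documents = [b"memo one", b"memo two, somewhat longer"]
    stored_on_grid = encrypt_chunks(key, documents)
    # Before release, node owners hold only nonces and ciphertext.
    # After release, anyone with the published key can recover the documents.
    assert decrypt_chunks(key, stored_on_grid) == documents
```

The point of the design is that the expensive part, replicating a petabyte, happens while the data is still unreadable, and the release itself is just the publication of a 32-byte key.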

The original grid node owners are paid, but in this case, they are paid with IOUs. Upon public release of the data, each owner is given the opportunity to forgive the IOU if he or she feels the release of the data was good for society. If so, no money needs to change hands, reducing the paper trail. If not, third parties, perhaps organizations dedicated to Freedom or Sunlight, can help pay off the debt (such organizations should maintain cash on hand in case a need like this arises).

We will need an application that will allow any user to browse this distributed collection of documents.

This exercise is motivated by Facebook. It seems companies will not take user privacy seriously (i.e., not record data, or record data encrypted only with a user's public key) until a massive breach occurs resulting in a tremendous public backlash. Consider a breach of the entire Facebook database. Facebook does not delete anything, ever, so they still maintain records of every account you've deleted, every picture you've untagged yourself from, every inbox message you've deleted, every picture and profile you've looked at (and when), and every link you've clicked.

If it's going to take a massive breach like this for companies to take privacy seriously, perhaps one, just one, patriotic (patriotic for the good of society) employee will make it happen.

Or, given what I have described above, perhaps the imminent feasible threat of a complete and unsuppressible data breach by one, just one, disloyal employee or hacker will convince them to take the measures I described to guard user privacy.

But probably not, given the financial incentives: Facebook's stock price is entirely based on this huge amount of user data it has collected, that it alone can access, or sell, in its entirety without anyone's permission. The stock option owners aren't going to willingly give that up, no matter how large the threat.

Friday, April 09, 2010
