Friday, April 10, 2015

[ntqhgtdh] Comparing Wikileaks CableGate and cables.csv

The complete collection of cables published as a SQL dump by Wikileaks ("Cable database in searchable format - for developers (updated!)") differs substantially from cables.csv, the file whose password was divulged by David Leigh in his book WIKILEAKS: Inside Julian Assange's War on Secrecy. Here are some scripts to compare the two, assuming you have both sources. (cables.csv is from z.7z on cryptome.org.)

The methodology was to reformat the CSV to look like the SQL dump, normalize both, then compare the two dumps. The scripts find 5189 cables that differ, though this count likely still includes many trivial formatting differences.
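
As a rough illustration of the normalize-then-compare idea, here is a minimal Python sketch. It is not the scripts from this post: it assumes both sources have already been loaded into dictionaries mapping a cable identifier to its body text, and the function names normalize and differing_ids are made up for this example. The normalization shown (unifying line endings and collapsing whitespace) is only one guess at what "trivial formatting differences" might cover.

```python
import re


def normalize(text):
    """Reduce formatting noise: unify line endings, collapse whitespace runs."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def differing_ids(csv_cables, sql_cables):
    """Return the sorted ids found in both sources whose normalized bodies differ.

    Both arguments are assumed to be dicts of {cable_id: body_text}.
    """
    common = csv_cables.keys() & sql_cables.keys()
    return sorted(
        cable_id
        for cable_id in common
        if normalize(csv_cables[cable_id]) != normalize(sql_cables[cable_id])
    )
```

Cables that differ only in whitespace or line endings drop out of the result, so whatever remains is a shorter list to inspect by hand.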

Technically, one interesting difference is that the CSV contains timestamps at minute resolution, while the SQL dump has the timestamps "fuzzed" to a resolution of one day: all messages are at midnight. However, the full timestamps often (but not always) also appear in the header text of both sources.
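
To keep the fuzzed timestamps from flagging every cable as different, one option is to compare dates only. The sketch below assumes particular timestamp formats for each source; the format strings and the names to_day and same_day are assumptions for illustration, so adjust them to whatever the files actually contain.

```python
from datetime import datetime

# Assumed formats, not confirmed against the actual files.
CSV_FORMAT = "%m/%d/%Y %H:%M"      # minute resolution
SQL_FORMAT = "%Y-%m-%d %H:%M:%S"   # always midnight in the SQL dump


def to_day(stamp, fmt):
    """Parse a timestamp string and keep only its calendar date."""
    return datetime.strptime(stamp, fmt).date()


def same_day(csv_stamp, sql_stamp, csv_fmt=CSV_FORMAT, sql_fmt=SQL_FORMAT):
    """True if both timestamps fall on the same calendar day."""
    return to_day(csv_stamp, csv_fmt) == to_day(sql_stamp, sql_fmt)
```

This treats the timestamp as matching whenever the day agrees, which is the most the SQL dump can confirm on its own.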

More substantially, the SQL dump has quite a few redactions, with the removed text often replaced by "XXXXXXXXXX" or an ellipsis "...".
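
A quick way to triage differing cables is to look for those two markers in the SQL text. The following is a hypothetical sketch: the names redaction_markers and looks_redacted are invented here, and the pattern (ten or more consecutive "X" characters, or three or more dots) is an assumption based only on the two forms mentioned above. Ordinary ellipses in the original prose will also match, so the check compares marker counts against the CSV version instead of trusting the SQL text alone.

```python
import re

# Assumed marker shapes: a run of 10+ "X", or a run of 3+ dots.
REDACTION_PATTERN = re.compile(r"X{10,}|\.{3,}")


def redaction_markers(text):
    """Return every substring of text that looks like a redaction marker."""
    return REDACTION_PATTERN.findall(text)


def looks_redacted(csv_text, sql_text):
    """True if the SQL version has more marker-like runs than the CSV version."""
    return len(redaction_markers(sql_text)) > len(redaction_markers(csv_text))
```

This is only a heuristic: a redaction that leaves no marker behind, or one that uses some other placeholder, would go unnoticed.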

(See also previous post about Cablegate.)
