Here are some Perl scripts to process the leaked WikiLeaks diplomatic cables. The main point is that a simple Perl script and regular expressions were sufficient to read and process the SQL database dump; a heavyweight tool such as PostgreSQL was not needed.
The starting point of the scripts was cable_db_full.7z, whose checksums are:
$ for file in cksum md5sum sha1sum sha224sum sha256sum sha384sum sha512sum ; do $file cable_db_full.7z ; done
1163740332 377551297 cable_db_full.7z
820ca80dbc35932d7b9a90b1a8f9f5b9 cable_db_full.7z
586919127a911e59d88cbe27d42ddd6f6deb2683 cable_db_full.7z
365af63eefc1262f4ab9e23e020993f7a869d0998594e298d7c994ba cable_db_full.7z
d24179a96b20ca174e2819329a53f327f40a51cec2bb70cf902412e4dac2e63c cable_db_full.7z
aac241cf0431b6400f304881bb1d7636647152faa3026021b08337a8f633afa859f996b0f44f6265611d1ef47e7921aa cable_db_full.7z
d55603bd8c189d3a059293a88d0489285346d260fb4e50877ce6c9db5c5176826634a5bc1e144308e139f6f15ec671f24be082c6b88c82a9dcfad9d1e6d5591d cable_db_full.7z
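To check a downloaded copy against one of the values above, something like the following could be used. This is only a sketch: the function name verify_sha256 is made up for illustration, and it assumes the GNU coreutils sha256sum used in the listing.

```shell
# Hypothetical helper: compare a file's SHA-256 against an expected hex digest.
# Returns 0 on a match and 1 on a mismatch.
verify_sha256() {
    vs_expected=$1
    vs_file=$2
    # sha256sum prints "<digest>  <filename>"; keep only the digest.
    vs_actual=$(sha256sum "$vs_file" | cut -d' ' -f1)
    if [ "$vs_actual" = "$vs_expected" ]; then
        echo "OK: $vs_file"
    else
        echo "MISMATCH: $vs_file" >&2
        return 1
    fi
}
```

For example, verify_sha256 d24179a96b20ca174e2819329a53f327f40a51cec2bb70cf902412e4dac2e63c cable_db_full.7z should report OK if the archive is the same one used here.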
Some checksums of the uncompressed file:
$ for file in cksum md5sum sha1sum sha224sum sha256sum sha384sum sha512sum ; do $file cable_db_full.sql ; done
3951629327 1750929365 cable_db_full.sql
4b0367b3af82c932ff6a07fdc353793d cable_db_full.sql
fa279172ec413dfd15faa3bfac0f45704d19807c cable_db_full.sql
663b657338db933cb7a244532cb97aa09a5816f5a36d03e2f94683d4 cable_db_full.sql
6951708faefe9757772ac171b7d40429ebda2c1a48ef0c538732c077dd86faaa cable_db_full.sql
1645e2270e1bf7e166fc5a272b15a4a142620185b15c81c321174692540d17341d106f646872ba9d2991d173b2e3e961 cable_db_full.sql
28d1607e647a5de1d0709dba9c88b13c9f029aa2b973586e43c4743252ca1d592a2417d75e9835cdb222d2d125e5f1596423b596708b9bde6af32df19bf45533 cable_db_full.sql
Incidentally, we wonder what the original file was, namely the encrypted torrent whose key was divulged by David Leigh. Update: possibly cables.csv (from http://cryptome.org/z/z.7z), whose checksums follow, though there is no guarantee that I got the authoritative version:
$ for file in cksum md5sum sha1sum sha224sum sha256sum sha384sum sha512sum ; do $file cables.csv ; done
252081690 1730507223 cables.csv
ecd3ca941abfb4174cce43a97fa56cce cables.csv
9a3c594cf96acdbef98fde58d0556520f3e73c92 cables.csv
bf75328626bf5922f71a205618d555d4acb4dca4e08d4e639ab1d2a4 cables.csv
ad9b7530f038e4237f1729843d1dc096a63162a4ff8024e596ec3b2a8e64b139 cables.csv
05dc31d92a893bc53c05c939d2c148f76c16ef9e9e7e4cc4bbe4e3b17f5b431c8383ee2622b8328d599e0a24a963abfc cables.csv
5a503ffaa1d8efccd198409c3b889de506a18854c215b1f069e99b6e494be87c67e34d270b427d921e43809daaaa67c05da5c8a452ab413d415e4a4434fc6062 cables.csv
The scripts do the following, in order:
00-7bitclean.pl makes the data 7-bit clean, replacing as many of the Unicode and other encoded characters as possible with equivalent ASCII characters, then hexadecimal-escaping the rest. There are quite a few cables with seeming junk in the header. Cable 238502 (refid 09CONAKRY766) is the only cable with significant junk, possibly encrypted, in the body.
10-line-lengths.pl analyzes the lengths of the lines, seeking a knee in the curve.
20-wrap.pl wraps all the cables to a specified line length, making them easier to read.
30-separate.pl writes each cable to its own file. After separating, we did some experiments with compression, whose results are noted in the script.