Thursday, February 02, 2012

[lefoezyy] Some notes on Google Books Ngrams

Just looking at the 1-grams (unigrams):
A complete list of case-folded words and their counts (5591668 lines), sorted in descending order of count.

Recompression
139321248 Dec 14 2010 googlebooks-eng-all-1gram-20090715-1.csv.lzma
181843105 Dec 14 2010 googlebooks-eng-all-1gram-20090715-1.csv.bz2
206135630 Jan 31 18:33 googlebooks-eng-all-1gram-20090715-1.csv.zip
969963702 Dec 14 2010 googlebooks-eng-all-1gram-20090715-1.csv
Sizes in bytes. The file compresses better with lzma than with zip or bzip2.
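
To put numbers on the comparison, here is a small Python sketch that turns the sizes from the listing into percentages of the uncompressed file. The byte counts are copied from the listing; the names `sizes`, `uncompressed` and `ratios` are made up for this illustration.

```python
# Sizes in bytes, copied from the listing above.
sizes = {
    "lzma": 139321248,
    "bz2": 181843105,
    "zip": 206135630,
}
uncompressed = 969963702


def ratios(sizes, original):
    """Return each compressed size as a percentage of the original size."""
    return {name: 100.0 * size / original for name, size in sizes.items()}


# Print the formats from smallest to largest compressed size.
for name, pct in sorted(ratios(sizes, uncompressed).items(), key=lambda kv: kv[1]):
    print(f"{name}\t{pct:.1f}%")
```

This only compares the one file listed here; other inputs or compressor settings may rank differently.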

Building the word count list with Perl:
perl -nlwe '@F=split /\t/;$c{lc$F[0]}+=$F[2];END{for(keys%c){print "$_\t$c{$_}"}}'
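
For readers who do not use Perl, the same aggregation can be sketched in Python. Like the one-liner, it assumes tab-separated rows with the word in the first field and the count in the third. The sort into descending order is an extra step that the one-liner does not perform. The function names and the sample rows are invented for illustration and are not real data.

```python
from collections import Counter


def count_words(lines):
    """Sum the third field per case-folded first field.

    Mirrors the Perl one-liner: $F[0] is the word, $F[2] is the count.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        counts[fields[0].lower()] += int(fields[2])
    return counts


def descending(counts):
    """Return (word, count) pairs, largest count first, ties broken by word."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up sample rows in the assumed layout, not real data.
sample = [
    "The\t1990\t5\t3\t2\n",
    "the\t1991\t7\t4\t3\n",
    "cat\t1990\t2\t1\t1\n",
]
for word, n in descending(count_words(sample)):
    print(f"{word}\t{n}")
```

Passing an open file handle instead of `sample` should work the same way, since the function only iterates over lines.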

It is curious that the word "the" does not occur in part 0.
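
A check like this can be sketched in Python as a scan over the first field of each row. The function name `word_occurs` and the sample rows are made up for illustration. The real file for part 0 is not opened here.

```python
def word_occurs(lines, word):
    """Return True if any row's first tab-separated field equals `word`, ignoring case."""
    target = word.lower()
    return any(
        line.rstrip("\n").split("\t", 1)[0].lower() == target
        for line in lines
    )


# Made-up rows standing in for one part of the data set.
part = [
    "theory\t1990\t4\t2\t1\n",
    "cat\t1990\t2\t1\t1\n",
]
print(word_occurs(part, "the"))
```

The comparison is on the whole first field, so a word such as "theory" does not count as an occurrence of "the".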
