Now THAT's Open Data -- The Google NGram Viewer Corpus
FINISHED --2016-08-18 15:30:00--
Total wall clock time: 3d 23h 19m 17s
Downloaded: 6833 files, 3.2T in 3d 14h 55m 31s (10.7 MB/s)
Converted links in 0 files in 0 seconds.
It’s 10% of all the books published up until 2008, on one (large) hard drive. It’s only the N-grams, rather than the full text with punctuation that you get from Project Gutenberg. But N = 5. So you could reconstruct a lot of the books from that, if it weren’t for that pesky minimum document (book) frequency threshold of 40 books…
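To make the "N = 5" point concrete, here is a minimal sketch of what a 5-gram is and why overlapping 5-grams could, in principle, be chained back into running text. The `ngrams` helper, the whitespace tokenizer, and the sample sentence are all illustrative assumptions, not how Google built the corpus.

```python
from collections import Counter


def ngrams(tokens, n=5):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Naive whitespace tokenization (an assumption; the real corpus
# was tokenized with its own rules).
text = "it was the best of times it was the worst of times"
tokens = text.split()

grams = ngrams(tokens, 5)
counts = Counter(grams)  # the corpus stores counts like these, per year

# Each 5-gram shares 4 tokens with the next one, which is what
# would let you stitch them back together into the original text.
for a, b in zip(grams, grams[1:]):
    assert a[1:] == b[:-1]

print(len(grams), "five-grams, e.g.", grams[0])
```

The catch is the frequency threshold: any 5-gram that is too rare never makes it into the files, so the chain breaks wherever a book says something unusual.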
If you’re less hoardy than me, you can just query the data through the slick API for fun things like officiating a “race” between Python, Ruby, C/C++, and Perl.
Unfortunately, I’d already used quite a bit of the 4 TB USB drive (downloading the Project Gutenberg books, among other things). So I scrolled further up the terminal log to see:
storage.googleapis.com/books/ngrams/books/googlebooks-spa-all-5gram-20090715-99.csv.zip: No space left on device
Cannot write to ‘storage.googleapis.com/books/ngrams/books/googlebooks-spa-all-5gram-20090715-99.csv.zip’ (Success).
[email protected]:/media/nas/data/corpora/googlebooks$ df -h
Filesystem              Size  Used Avail Use% Mounted on
//totalgood/disk1_pt1/  3.7T  3.7T   20K 100% /media/nas
Time to RAID up, I guess, before I rerun:
wget -r --accept gz,zip --mirror --adjust-extension --convert-links --no-parent --backup-converted --level 1 http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
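Before kicking off another four-day mirror, a quick pre-flight check on free space would have saved me the surprise. This is only a sketch: `has_room`, `TARGET`, and `NEEDED` are names I made up, the 3.2T figure is just what wget reported before the disk filled (so treat it as a lower bound), and reading wget's "T" as TiB is an assumption.

```python
import shutil

TIB = 2 ** 40


def has_room(path, needed_bytes):
    """True if the filesystem holding `path` has at least needed_bytes free."""
    return shutil.disk_usage(path).free >= needed_bytes


# Lower bound taken from the wget summary; the full corpus needs more.
NEEDED = int(3.2 * TIB)
# Hypothetical target; point this at the real mount, e.g. /media/nas
TARGET = "."

if has_room(TARGET, NEEDED):
    print("enough free space, go ahead and mirror")
else:
    free_tib = shutil.disk_usage(TARGET).free / TIB
    print(f"only {free_tib:.2f} TiB free, need at least {NEEDED / TIB:.1f} TiB")
```

`shutil.disk_usage` reports on whatever filesystem the path lives on, so it should work for a mounted NAS share as well as a local drive.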