HUML Day 4 -- Natural Language Processing
So we’re having a lot of fun, finally. The students are all able to run python, ipython notebooks, and install python packages, even on Windows! Thank you Anaconda! And we worked on a project to mine the Hack University slack channels for conversation text. We even got a simple generative model working, comparable to the one in Grus’s Data_Science_from_Scratch, Chapter 20.
But the dearth of humanish text on our Slack channel has inspired me to seek out text that was a bit more thoughtfully generated.
Download It All
I went big. My server (and ISP) is running flat out to download the entire Gutenberg collection. I put together a python app to build the excludes list (based on an ls-R file in the root directory). I don’t want to download any of the images or ISOs or even AVIs that are floating around on Gutenberg. My bot’s pretty bookish and only cares about *.txt, not even HTML. I kicked off the rsync -avz process and detached tmux about 15 minutes ago.
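The excludes-list app itself isn't shown here. A minimal sketch of the idea, assuming the ls-R file is plain `ls -R` output (directory headers ending in a colon, one entry per line) and that one `*.ext` pattern per unwanted extension is enough for --exclude-from. The function name `build_excludes` and the sample listing are made up for illustration:

```python
import os

def build_excludes(ls_r_lines, keep_ext=".txt"):
    """Return sorted rsync exclude patterns for every extension except keep_ext."""
    patterns = set()
    for raw in ls_r_lines:
        name = raw.strip()
        # Skip blank separators and directory headers such as "./1:".
        if not name or name.endswith(":"):
            continue
        ext = os.path.splitext(name)[1]
        # Compare case-insensitively, but keep the original case in the
        # pattern because rsync patterns are case-sensitive.
        if ext and ext.lower() != keep_ext:
            patterns.add("*" + ext)
    return sorted(patterns)

# Hypothetical fragment of an ls-R listing, not real Gutenberg data.
sample = """.:
GUTINDEX.ALL
1

./1:
100.txt
100.zip
100-h
cover.JPG
""".splitlines()

print("\n".join(build_excludes(sample)))
```

One caveat with this approach: a directory whose name contains a dot would also produce a pattern, so the resulting list is worth reading over before handing it to rsync.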
Unfortunately it wiped out a few archival photos before I realized that it was syncing the entire drive rather than a subfolder. The sync part of rsync involves a lot of rm -rf. And, man, those trailing slashes on the end of an rsync destination path are dangerous. Got that fixed, but I’m beginning to worry that my --exclude-from list is too thorough.
find . -type f returns 0 regular files. But there are about 30k directories in a deep tree so far, growing at about 1k per minute, so maybe it just does the tree first:
$ find . -type d | wc -l
28443
$ find . -type d | wc -l
34118
$ find . -type d | wc -l
34123
$ find . -type f
$ find . -type f | wc -l
0
$ find . -type d | wc -l
34941
Like a watched pot, things seem to be slowing down. Better stop multi-tasking the samba server with finds.
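If the repeated finds are what's loading the share, one walk can produce both counts at once. A rough sketch, assuming Python's os.walk is an acceptable stand-in for find; `count_tree` is a made-up helper name, and it counts every non-directory entry as a file, which is slightly looser than find's -type f:

```python
import os
import tempfile

def count_tree(root):
    """Count subdirectories and non-directory entries below root in one walk."""
    n_dirs = n_files = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        n_dirs += len(dirnames)
        n_files += len(filenames)
    return n_dirs, n_files

# Small throwaway tree to show the call; the paths are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "2", "2"))
    open(os.path.join(tmp, "2", "2", "22082.txt"), "w").close()
    print(count_tree(tmp))
```

Unlike find . -type d, this does not count the starting directory itself, so the directory total comes out one lower.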
Of course I’m getting lots of warnings about being unable to set the datetime on my NAS drive, but maybe that’s normal.
2/2/0/8/22081/22081-h/images/
2/2/0/8/22081/22081-page-images/
2/2/0/8/22082/
rsync: mkstemp "/media/$USER/nas/guten/2/2/0/8/22082/.22082.txt.bE6NwA" failed: Operation not permitted (1)
rsync: failed to set times on "/media/$USER/nas/guten/2/2/2/5/22254/22254-h/images": Operation not permitted (1)
rsync: failed to set times on "/media/$USER/nas/guten/2/2/2/5/22255": Operation not permitted (1)
2/2/0/8/22083/
I hope it’s OK to use the -z option with Gutenberg’s ftp server. I guess I’ll know in the morning.
Turns out the problem was that I was trying to preserve owner, group, device, and other Linux file properties, which is what -a asks rsync to do. I only want the text, so I revised the recommended rsync command for my CIFS FAT32 drive:
rsync -rgvz --delete-before --fake-super --exclude-from=data/excludes.txt firstname.lastname@example.org::gutenberg /media/$USER/nas/guten/
But that didn’t work either. In the end, I just dumbed down rsync by limiting it to DOS capabilities:
rsync -rvz --delete-during --exclude-from=data/excludes.txt email@example.com::gutenberg /media/$USER/nas/guten/
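Given the deleted photos earlier, a command like this is worth rehearsing with rsync's --dry-run flag, which reports what would be transferred or deleted without touching the destination. A minimal sketch that only builds the argument list; `rsync_command` is a made-up helper, and HOST and USER are placeholders, not the real server or account:

```python
import shlex

def rsync_command(source, dest, excludes_file, dry_run=True):
    """Build the argument list for a recursive rsync that deletes extras."""
    args = ["rsync", "-rvz", "--delete-during",
            "--exclude-from=" + excludes_file]
    if dry_run:
        # --dry-run lists the planned transfers and deletions
        # without changing anything at the destination.
        args.append("--dry-run")
    return args + [source, dest]

# HOST and USER are placeholders; substitute the real values.
cmd = rsync_command("HOST::gutenberg", "/media/USER/nas/guten/",
                    "data/excludes.txt")
print(" ".join(shlex.quote(a) for a in cmd))
```

The list can be passed to subprocess.run(cmd, check=True). Once the dry-run output shows deletions only under the intended subfolder, the same call with dry_run=False gives the real command.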
Now the machines are finally working together to give me some human text in machine-readable form!
...
etext97/3babb10.txt
etext97/brnte10.txt
etext97/bstjg10.txt
etext97/grybr10.txt
etext97/morem10.txt
etext97/svyrd10.txt
etext97/wtrbs10.txt
etext98/.message
etext98/allyr10.txt
etext98/mspcd10.txt
etext98/sesli10.txt
etext99/.message
images/README
pg/articles/kushalbio
pg/dev/.htaccess
pg/hartinfo/annual95
pg/hartinfo/savenet994
sent 1,950,622 bytes  received 37,220,139,399 bytes  911,066.81 bytes/sec
total size is 37,232,866,927  speedup is 1.00
All done! Thank you, fibersphere, for sustaining about 7.3 Mbit/s (911,066.81 bytes/sec in the rsync summary) over a continuous 37 GB download! And thank you most of all, Project Gutenberg, for maintaining a reliable connection and consistently high bandwidth.
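The figures in the rsync summary can be turned into a line rate and a rough duration. A quick sketch of the arithmetic, assuming rsync's bytes/sec figure is total bytes (sent plus received) divided by elapsed time:

```python
# Numbers copied from the rsync summary.
sent = 1_950_622            # bytes
received = 37_220_139_399   # bytes
rate = 911_066.81           # bytes/sec

mbit_per_s = rate * 8 / 1e6               # bytes/sec -> megabits/sec
hours = (sent + received) / rate / 3600   # elapsed time implied by the rate

print(f"{mbit_per_s:.2f} Mbit/s over about {hours:.1f} hours")
# prints: 7.29 Mbit/s over about 11.3 hours
```

The speedup of 1.00 in the same summary fits a first-time download: nothing was on disk yet, so essentially the whole collection had to cross the wire.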
Now, shall I try out some Julia or stick with Python (really Cython)?