Slides -- Bioinformatics FAQ Bot
FAQ Bot
Holiday Party Vegan Risotto
Vegan Risotto
Linux Phone for Christmas
I’m excited about my new Christmas present for myself, a Linux phone. Really hope I can cut down on the distractions and manipulations by my OS and app providers. And I may soon be able to integrate nlpia-bot into it (after I finish plugging it up to mastodon.social). The decision between the Pine64 PinePhone and the Libre Phone developer edition was a no-brainer for a tinkerer like me.
Turning Points
I’m listening to the audiobook “Upheaval: Turning Points for Nations in Crisis” by Jared Diamond. Diamond talks at length about the decisions around turning points in his own life and their parallels with those changes in governments and societies. And Fall is when I start thinking about turning points in my life.
Slides -- A Smarter Chatbot
A Smarter Chatbot
nlpia-bot
Had a great time with Austin and Xavier mashing up Parul Pandey’s question-answering chatbot with nlpia-bot. She made it super simple.
Getting started with NLP
At Manceps, our interns are building web and mobile apps to interface with their natural language model for unredacting the Mueller report. Here are some of the exercises they used to get up to speed on Python and NLP quickly.
Docvectors using spaCy for Springboard
One of my Springboard mentees asked how she should compute document vectors using the word2vec vectors available within a parsed document object from the spaCy parser.
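One common answer, sketched here as an assumption rather than as the approach from the post itself, is to average the word vectors of a document's tokens. The toy two-dimensional vectors and the `doc_vector` helper below are hypothetical stand-ins; with spaCy the per-token vectors would come from each token's `.vector` attribute.

```python
def doc_vector(token_vectors):
    """Average equal-length word vectors into a single document vector."""
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    count = len(token_vectors)
    # Mean of each dimension across all tokens in the document
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Hypothetical 2-D word vectors for a three-token document
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(doc_vector(toy_vectors))
```

Averaging ignores word order, so it works best as a quick baseline feature for some other model.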
Nessvectors for San Diego Python User Group
I had a lot of fun playing with words at the monthly Python User Group meeting in San Diego this week. Congratulations to Torin Panick @torrinp for winning a free copy of NLP in Action. For those of you who missed out, I’ll give out one free eBook code and a 42% discount code next time. And I’ll be a bit more organized about the competition ;).
Unredact the Mueller Report?
What if the latest language models from Google were so good that they could unredact the Mueller Report? We gave it a shot at the monthly Portland Python User Group for May. BERT came up with some surprising results. The slides and code are here: PDF ODP py
Word Patterns
Word patterns are what you can use to match or generate phrases. They’re usually called grammars in math and computer science courses. But this post is about word grammars rather than character grammars. And the word "grammar" means something very precise to a lot of people, so I don’t want to step on any toes by using the word incorrectly. So I’ll just talk about word patterns.
Infinite-vocabulary word embeddings
Word embeddings are at the core of the most impressive natural language models. Dialog systems, abstractive summarizers, universal sentence embeddings, question answering systems and even unsupervised knowledge extraction engines all rely on broad vocabularies of word embeddings. But even the 1M-word vocabulary of Word2vec and GloVe embeddings isn’t broad enough to solve the most useful challenges for natural language processing, such as medical record summarization, or even dialog engines that can handle the ever-expanding vocabulary of teenagers.
SSH Server On Office PC Behind Building's NAT Router
Say you’re leasing space in an office building for your startup and you share the network with all the other tenants. This could be a wireless router or hard-wired ethernet router. The problem is you don’t have the password for the admin page on that router. So you can’t expose a port on your server for ssh or web hosting or whatever. Normally you’d just add a port-forwarding rule on the router to send 22 and 80 and 443 all through to your server. But that might mess up somebody else using the same router to serve up their page.
Data Science Trends
Springboard Data Science Careers students keep asking me which specialization they should pursue. And they often want to know which specializations are most likely to hire a junior data scientist coming out of Springboard. I try to encourage my students to pursue something that they are good at, because there will always be a market for someone who is good at what they do. But if you really want to follow the crowd and go where the employers are hiring, check out the AIIndex.org 2018 report. It looks like NLP was popular in 2016 and 2017 but may be overtaken by computer vision and “deep learning” by 2020. This roughly corresponds to the widespread deployment of self-driving cars, which will eventually replace approximately 10% of the US workforce with machines. And those driving and logistics jobs have been “transformed” into data science jobs over the past few years. So if you’re a full-time Lyft driver, now might be a good time to start taking night classes in Data Science and getting reconnected to your nerdy friends.
Nginx web server setup
Each time I have to set up a domain name service table or database access for a web server I forget how to do it. And there doesn’t seem to be a good online guide for it. So here are my notes.
Raspberry Pi Camera Configuration
I’ll eventually figure out where I put my notes on configuring a Raspberry Pi camera for streaming video and offline object detection. But for now, check out the BerryNet repo. These guys have done it right!
Open Source Teleprompter?
I’m recording some instructional videos for a Natural Language Processing In Motion course for Manning Publishing and maybe a Data Science for Healthcare course for UCSD. I tried using Camtasia to simultaneously record the slides on one monitor and the talking head (webcam). And I tried using LibreOffice in presentation mode to show the slides/animations on my laptop screen and read from the external display (slide notes). But in display mode LibreOffice puts the notes to the right in the middle of the screen and my eyes weren’t looking at the camera. Is there a better way to set up a “teleprompter” and webcam so that the top line is always near the webcam? I’d prefer open source and free or low cost.
Poetix
A big thanks to Philip R. Baldwin for sharing this clever AI-generated sonnet.
Sentence Embedding
Sentence embeddings took off in 2017. When Google released their Universal Sentence Encoder last year, researchers took notice. Google trained their sentence embedding on a massive corpus of text, everything from Wikipedia and news articles to FAQs and forums. And then they refined the accuracy by training it on the Stanford Natural Language Inference corpus. Like word2vec, this enabled NLP enthusiasts to leverage Google’s text-scraping and cleaning infrastructure to build their own models using transfer learning. Transfer learning is just a fancy way of using one model within another. Usually you’re just doing “activation” or “inference” with the pretrained model and then using its output as a feature (input) for some other model.
NLP Word Usage Trends
Here are some example Google N-gram Viewer queries that I used while researching the NLP in Action book, including one to decide how to spell “n-gram” ;)
NLP Hacks for Writers
NLP Hacks for Writers
Default to Open
You Decide
Hyperspace Topology Games
Play around with these geometries in your brain. Then see what happens for real when you do this with high dimensional vectors, like word vectors (Mikolov).
git
Git
Abstract Hyperparameter Optimization Machines Better Than Humans
Advances in neural networks and deep learning have renewed interest in algorithms to automate the tuning of the expanding list of hyper-parameters for these high-dimensional models. Open source libraries such as scikit-learn provide ready access to simple but inefficient algorithms such as exhaustive search and random search. Recently, Snoek et al. showed that statistical hyper-parameter optimization approaches produce better results than humans and are more efficient than exhaustive or random approaches in high-dimensional domains such as image and speech machine learning.[1] Similarly, Bergstra et al. improved efficiency and performance further with their Sequential Model-Based Global Optimization (SMBO) approach, which approximates the computationally complex model training step with a heuristic.[2] In this paper we will demonstrate these hyper-parameter optimization algorithms on several toy and real-world problems, including machine learning problem types not previously optimized with SMBO.
Gluten-Free Antioxidant Oatmeal Cookies
Gluten-Free Oatmeal Cookies
XZ 7Zip Performance on the Latest Python 3.6 Source Code Release
With the Python 3.6 release today, I noticed the source package compression extension wasn’t one I was familiar with. Turns out it’s the old 7zip format updated for ‘nix file metadata (owner, permissions, sticky bit, etc). So I played around with it to see how it performs at its maximal and extremely maximal compression levels.
Hyper-Indexing with LSHash (Locality Sensitive Hashing)
Indexing topic vectors from an LSI Model is more difficult than it seems. My first instinct was to use the 3D indexer plugin for PostgreSQL, PostGIS. After all, that’s the typical example I keep in my head for indexing. You create a discrete “on or off” label for each location based on whether it is present or absent within a grid point. This allows you to efficiently find it (and any nearby points) with a query containing WHERE grid = 'A11' for the letter/int 2D indexing system that you see on old paper road maps from AAA.
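The road-map grid idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the `grid_label` and `build_index` helpers, the 10-unit cell size, and the sample points are all hypothetical, and the letter scheme only covers 26 columns.

```python
import string
from collections import defaultdict


def grid_label(x, y, cell_size=10.0):
    """Map a 2D point to a road-map style cell label such as 'A11'."""
    col = int(x // cell_size)       # column index becomes a letter (A..Z only)
    row = int(y // cell_size) + 1   # row index becomes a 1-based integer
    return f"{string.ascii_uppercase[col]}{row}"


def build_index(points, cell_size=10.0):
    """Group points by cell label so a lookup is a single dictionary access."""
    index = defaultdict(list)
    for x, y in points:
        index[grid_label(x, y, cell_size)].append((x, y))
    return index


index = build_index([(5, 105), (7, 108), (15, 3)])
print(index["A11"])  # the points that share cell A11
```

The dictionary lookup plays the role of the WHERE clause. The trouble with topic vectors is that the number of grid cells explodes as dimensions are added, which is what motivates hashing approaches like LSH.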
PyDX is Awesome!
Watched a lot of great Python talks at PyDX this weekend. Here are some memorable ones:
Automation-Safer-Than-Manual
Interactive automation is much better than fully manual keyboard bashing for a lot of Linux tasks. It’s taken decades but many Linux distributions have finally made it possible to install Linux automatically without too much hassle. But other mundane tasks like adding or swapping out a hard drive are a real bear. And the online instructions (especially at Canonical’s Ubuntu docs site) sound overly protective and cautious, encouraging the user to do everything by hand instead of automating things with a script. And they often get critical steps wrong, endangering your data and your computer.
Python-Birth-Microsecond-Paradox
Cole got bit by the Birthday Paradox when using Python's random.randint() and time.time() to generate a random number to tag a DB record with a unique ID. I think Hannes does something similar to ensure user-provided files are all unique, even if a user uploads the exact same file twice.
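To see why random IDs collide sooner than intuition suggests, here is a minimal sketch of the standard birthday-problem calculation. The `collision_probability` helper and the example numbers are illustrative assumptions, not code from the post.

```python
import math


def collision_probability(n_ids, n_possible):
    """Chance that at least two of n_ids uniform random draws from n_possible values match."""
    if n_ids > n_possible:
        return 1.0  # pigeonhole: more draws than values guarantees a repeat
    # P(all unique) = product over k of (1 - k / n_possible), summed in log space for stability
    log_p_unique = sum(math.log1p(-k / n_possible) for k in range(n_ids))
    return 1.0 - math.exp(log_p_unique)


# Classic case: 23 people and 365 birthdays gives a bit over a 50% chance of a match
print(round(collision_probability(23, 365), 3))

# Hypothetical ID scheme: 10,000 records tagged with random integers below one million
print(collision_probability(10_000, 1_000_000))
```

The number of IDs you can safely issue grows only with the square root of the ID space, so a much wider random range (or a UUID) is the usual way out.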
Now THAT's Open Data -- The Google NGram Viewer Corpus
Now THAT’s Open Data!
Comparison of Hybrid Mobile App Javascript Frameworks
History Temp Panic
Cpu Gpu Temp Sensors Log
PenTesting Peanut Gallery
Really enjoyed getting a crash course in InfoSec and PenTesting by Dean at the Ctrl-H HackerSpace meetup. Here’s how to get some tools for easy, ethical hacking.
Wildlife Survey and Cowboy Drone
I spend a lot of time hiking around in the snow taking pictures of animal tracks and maintaining wildlife survey cameras for Cascadia Wild. And I can’t help but daydream about Drone/Robot assistants doing a lot of this for me.
Upgrading 14.04 to 15.10 on a Dual-Boot HP Spectre Laptop
Tune Down the Trackpad
Dual Boot HP Spectre 360 Laptop
I love my new Spectre laptop with the fold-back screen. It’ll make an awesome picture frame or navigation tablet at the end of its life. But to keep it relevant I configured it for dual boot with Ubuntu. I need Windows 10 because QuickBooks still hasn’t gotten with the Open Source program.
HPC on a Budget
The halfling (half-length) PCIe Nvidia GeForce 970 card I ordered requires 1 PCIe 3 slot, but also needs physical clearance for the connectors to poke out through 2 slots in the back of the chassis. So form-factor planning can be a bitch. The Free Geek chassis I’m using has all the PCI slots free (including a PCIe 2), so plenty of holes in the metal chassis, but the PCIe 2.0 slot is at the wrong end of the series, and the Nvidia card needs the blocked side for its connectors. Back to Newegg she goes.
Review
Review
PyPi Packaging with PyScaffolding
PyScaffold (pip install PyScaffold or pyscaffold) is awesome tooling. It adds a nice putup command to your shell. The putup command creates a boiler-plate directory structure for any python project. It can even set up .tox and .travis test config files, documentation build scripts, and a django project for you, if you ask it to. And it is very git aware. The only thing I add to my git hooks is a pandoc line to translate my README.md into README.rst so that both my github-trained fingers and ReST-loving PyPi can be happy.
Rabbit Hole of Automation
I got carried away with automating my development process when I discovered this pre-commit hook that makes sure your python imports are sorted, like Two Scoops recommends. I noticed a hooks.yaml file that revealed that FalconSocial’s hook is actually a plugin for Yelp’s awesome pre-commit framework.
Neural Net Brainstorm
Cole’s class on neural nets inspired some “out of the box” thinking about how brains work and how we train neural nets. Students asked about the performance of regularization vs random dropout, and the computational bottlenecks for random dropout.
Smaller than Baby Steps with Julia
Julia has some impressive performance stats, so I gave it a whirl, or half a whirl.
HUML Day 4 -- Natural Language Processing
Finally Rolling
git
Git
Machine Learning Introduction
Hack University Machine Learning Introduction
Your Own Private Cloud and NAS Drive
The Buffalo Airport Extreme is pretty expensive ($100), but when coupled with a cheap multi-TB USB 3.0 drive, it makes a pretty nice personal cloud. You can even download all of the Wikipedia and Wikimedia Commons dumps directly to the drive without passing through your precious laptop SSD. 10 Mbps rates are no problem for most USB 3.0 drives.
Getting Started with your PiBot TiddlyBot
I helped my teenage nephew get started on his Kickstarter TiddlyBot Christmas present over the holidays. We used a Linux laptop (Ubuntu) and recorded all the tedious setup steps so you can spend more time programming your bot and less time getting set up.
B-Machine Learning
The “B” isn’t for Bot, it’s for “Benefit”, as in B-Corporation. What do B-Corps have to do with Machine Learning?
Inspiring Night -- John Irving Explaining his Craft
It was inspiring, almost magical, listening to John Irving explain his art, his insight into life, at Portland Art Museum. OPB hosted him with the towering church organ of the First Congregational United Church of Christ as a backdrop. John Irving’s intellect and humor eventually dwarfed the organ.
Hacking Oregon's Hidden Political Connections
Notes from Data User Group Meetup -- Text Mining Meets Neural Nets
Here are my notes from the Data User Group and PDX Data Engineering Meetup presentation titled “Text Mining Meets Neural Nets: Mining the Bio-medical Literature”, presented by Dan Sullivan, the enterprise architect for Cambria Health and Ph.D. student at Virginia Tech (the Biomed Institute).
TFNW BYOB
TFNW BYOB
AI Solves Problems for Which there are No Known Efficient Solutions
I’m not a huge fan of the “Daisy AI Podcast” but he often rattles off a lot of interesting information quickly, like in his 2013 podcast.
Awesome-Data-Mining-Introduction
I loved this blog post by Raymond Li that Aleck forwarded tonight: Top 10 Data Mining Algorithms. It’s approachable even by people who’ve never used any of these tools. And yet it’s so rich with information that I learned about some new techniques I’d never heard of and it cleared up some misconceptions I had about some algorithms (SVMs in particular).
Neural Nets Demystified
Draft of Neural Nets Demystified
Neural Nets Demystified
Purchasing Electronics with BitCoin
The “withdrawal” option on Kraken worked well when I used it to purchase a “refurbished” Brother laser printer on NewEgg. All you need to do is
Gaussian Mixture Model
Working on this Kaggle challenge (Otto Product Categorization), it’s becoming clear that the most appropriate hard-coded model is a Bayesian Classifier. And you don’t need the “gamification” clues to tell you that. Though the clues helped. “I’m a strict Bayesian, you know” was the acknowledgment message I received last week with my first decision-tree submission (within spitting distance of the benchmark). Clever. I love Kaggle for this! For the same reason I love Stack Overflow… they use influence techniques for the TotalGood rather than focusing on monetization (their own financial gain).
Connect Mac WiFi with Comcast Motorola Surfboard Extreme SBG121 or SBG6580
Larissa and house guests are often complaining about sluggish Internet with our Comcast Motorola Router and Modem. So I tried a lot of things. In the end, I think it was the “IP Flooding” filter that was gumming up the works.
Data Science Group Talk -- Neural Nets Demystified
Portia Burton asked me to speak about Neural Nets at the next Data Science Group meetup. So here’s the abstract…
Dev Resources
Keeping Up
Soul Food
Curry Chicken Sandwiches
Model and Diagram Any Database Using SQLAlchemy
I needed to model and diagram (ERD) a client’s database schema in order to understand their machine learning task. They don’t use Django, so I can’t just manage.py inspectdb and manage.py graph_models. But fortunately, sqlalchemy makes both of these tasks easy.
Graph Theory Basics, and Speech Recognition with Neural Nets
Here are the highlights from this week’s “Talking Machines” podcast from @tlkngmchns. Thank you Thunder for turning me on to this awesome podcast!
Language Trivia
Ever wonder why capital letters have mostly straight lines, especially in Latin? Carving is much easier with straight lines. Think of all those Greek and Roman buildings and their location names carved in stone. You’d straighten all the curves too if you had to carve someone’s name into a piece of granite. Lower case letters came much later in history, once we started writing with ink.
Install Mongo DB on Fedora 20 for the Ubiquiti UniFi Access Point
Chick swears by his new Ubiquiti WiFi access point. So I purchased the High Power version from Amazon using Prime and it arrived in only 36 hours on a Saturday! Maybe having the Ubiquiti HQ here in Portland helped.
PyCon 2015 -- Predict Weather with PyBrain, Attribution Do-Over
Here’s an attribution “do-over” for my PyCon 2015 lightning talk. I didn’t even capitalize PyBrain correctly. So here’s my belated thank you to Lynn Root for herding us Lightning Talk cats with grace, and the videographer and sound crew that pulled off this technical juggling act without once dropping a ball. And a big thanks to the PyBrain creators led by IDSIA Professor Jürgen Schmidhuber, contributors, and supporters. PyBrain is an awesome library. My talk, and work for my employer, wouldn’t have been possible without it. I can only blame my attribution FAIL on public speaking nerves and my inability to maintain a stable WiFi connection as I tried to create the slides in the seconds leading up to podium time.
PyCon 2015 -- Predict Weather With PyBrain
Here are the latest slides for a PyCon 2015 lightning talk on neural nets “Predict Weather with PyBrain”, with a little help(er) from pug-ann. Apologies if you attempted to follow along and execute the code on the slides. WiFi dropped before I could save updated slides.com reveal.js slides. So the slides didn’t reflect the latest version of pug-ann. I’ve got to start building slides locally. The typos were embarrassing. TL;DR: A 6-node neural net can predict the max temperature in Portland a day in advance with about 5 deg C (10 deg F) 1-sigma error.
Cleanup of Artificial Neural Net Subpackage (Module) for PUG
For February’s Python User Group I did a lightning talk and live demo using pybrain to predict the weather. It took a whole weekend to pay off the code quality debt from the hacking I was doing during Kyle Gorman’s awesome NLP talk.
PDDL Parser for AI Planning
If you need to parse PDDL for the AI Planning class at Coursera, check out this script. It’s pretty basic and hasn’t been tested on the DWR problem descriptions, but I’m really enjoying playing around with my first “compiler”. I’m sure I’ve done things the “wrong way”, but the pyparsing package is very intuitive and seems forgiving of my mistakes.
Picture-in-Picture Talking Head Presentations
Once you have your reveal.js slides and live CYOA voting set up (see previous blog posts), you need to record both your computer screen with the slides and a video of your talking head. This is how I did it for the “Creative Challenge” assignment in the Coursera “AI Planning” class.
Predictive Analytics War Stories Video
Thank you David Barton and Innovation Enterprise for recording my presentation at the Predictive Analytics Summit in San Diego. It really knocked down my ego a notch to see my awkwardness. You’ve motivated me to practice.
Predictive Analytics War Stories
Reveal.js and slides.com enable remote-controlled presentations like this one at #PASanDiego. The dynamic voting slide has to be hosted separately, though, because the iframe doesn’t seem to refresh regularly.
Predictive Analytics Innovation Summit highlights
Clement Farabet, Twitter, presented some awesome demonstrations of image clustering using an open source Deep Learning library, Torch7. This is definitely my favorite talk so far at #PASanDiego
Predictive Analytics Innovation Summit highlights
The first day at #PASanDiego organized by @IE_analytics has been interesting. I haven’t heard a lot of controversial insights, but it’s been useful nonetheless.
Transparent Histograms
Spent a lot of this week working on prettifying bar charts, histograms and animations for some reveal.js slides.
Another Challenge Do-Over
I failed another coding challenge and couldn’t just put it out of my mind. The challenge is this. You’re given a passage with any number of sentences and words in it, but some of the words have slashes between them instead of spaces to indicate “or”, like “The brown/black/crazy cat crossed the road.” Your objective is to parse those strings and return a list of strings with all the possible alternative interpretations of the phrases. The unspoken, unmet challenge is to then process these alternatives to be the logical interpretations that a human would make, to resolve ambiguities when the slashed words aren’t all the same part of speech and aren’t intended to be just swapped for one another. Perhaps the ambiguity is whether the slash means “or” or “and”. In the 30 minutes I had, I never got past the recursion and book-keeping of the parsing. But here’s what I came up with, complete with doctests that pass.
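For reference, the basic expansion step (not the harder part-of-speech disambiguation) can be sketched without explicit recursion by leaning on `itertools.product`. This is an alternative illustration, not the doctested solution from the post, and the `expand_slashes` name is hypothetical.

```python
from itertools import product


def expand_slashes(sentence):
    """Return every sentence formed by choosing one option for each slashed token."""
    # Each whitespace-separated token becomes a list of one or more alternatives
    options = [token.split("/") for token in sentence.split()]
    # The Cartesian product picks one alternative per token position
    return [" ".join(choice) for choice in product(*options)]


for variant in expand_slashes("The brown/black/crazy cat crossed the road."):
    print(variant)
```

One known gap in this sketch: punctuation stays attached to the last alternative only, so "cat/dog." would yield "cat" and "dog." rather than "cat." and "dog.".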
Automata and Machine Intelligence
More and more, the smart people I meet are talking about Automata, Natural Language Processing, and Graph Search (AI/MI Planning) all in the same breath. I’ve taken MOOCs on all 3, but think I need to revisit automata. Math proofs rely on automata to model machine intelligence. And they are at the core of understanding what is possible with AI/MI. And I’m finding some interesting connections that I missed the first time around.
Love Python? Interested in NLP?
I gave an introduction to Natural Language Processing with python at the PDX python user group and showed how to use two of Bostock’s awesome graph optimization and visualization tools in his D3 library. Here’s a screenshot of one of my favorites:
Graph Search Using Networkx
I’m having fun with a traveling salesman, minimum spanning tree problem over here. Check it out for pretty graph diagrams and some cool Networkx python examples.
Artificial Neural Nets for Prediction with Python (pybrain)
I’ve forked the pybrain package and started to hobsonify it to suit my tastes, make it more pythonic, and correct some documentation errors that render some shortcuts unusable.
Finally a Decent Open Source Blog Framework
I’m loving this Jekyll thing. You won’t see many pull requests from me, but this thing sure is an efficient blogging tool.
You're up and running!
Next you can update your site name, avatar and other options using the _config.yml file in the root of your repository (shown below).