Notes from Data User Group Meetup -- Text Mining Meets Neural Nets
Here are my notes from the Data User Group and PDX Data Engineering Meetup presentation titled “Text Mining Meets Neural Nets: Mining the Bio-medical Literature,” presented by Dan Sullivan, enterprise architect at Cambia Health and a Ph.D. student at Virginia Tech (the Biomed Institute).
Dan’s thesis involves using machine learning to classify sentences according to their relevance to the topic of “virulence factors.”
Meaning of Words
All natural language processing starts with words. There are two main approaches to assigning meaning to words:
- a logical definition of a word (e.g., chair)
- a fuzzy representation of a chair/object (as image recognition does)
Generative Models
Noam Chomsky promoted the idea that words have no inherent meaning; rather, statistical rules determine how words are sequenced and how those sequences trigger “meaning,” or a response, in a brain that can perceive them.
- the meaning of a word is largely a function of how it relates to other words
- there are hundreds of biomedical ontologies (graphs of relationships between concepts); building them is very labor intensive
Topic Modeling
Topic modeling (similar to sentiment analysis):
- LSA: latent semantic analysis
- LDA: latent Dirichlet allocation; models the distribution of topics across a document and infers topics from the data, with no training labels needed (a sketch follows below)
- starts with conditional probabilities: the probability of a topic, and the probability of a word given a topic
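As a minimal sketch of LDA in practice, here is scikit-learn’s LatentDirichletAllocation on a made-up toy corpus (the documents and topic count are illustrative assumptions, not from the talk):

```python
# Toy LDA: infer a per-document topic distribution from raw word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "salmonella virulence factor expression",       # made-up documents
    "gradient descent trains the neural network",
    "bacterial toxin secretion and virulence",
]
counts = CountVectorizer().fit_transform(docs)      # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)              # rows sum to ~1
print(doc_topics)                                   # topic mix per document
```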
Classification
Another quintessential topic:
- SVMs work well (linear models)
- ensembles work OK (random forests)
- k-NN works OK
e.g., classify a sentence as being about virulence factors or not
- requires labeled data to train (see the sketch below)
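A hedged sketch of that labeled-data setup with a linear SVM in scikit-learn; the sentences and labels below are invented for illustration:

```python
# Classify sentences as "about virulence factors" (1) or not (0).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = [
    "The toxin is a known virulence factor in Salmonella.",  # made up
    "Samples were incubated at 37 degrees for two hours.",   # made up
]
labels = [1, 0]  # labeled training data is required

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(sentences, labels)
print(clf.predict(["Adhesins contribute to bacterial virulence."]))
```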
TFIDF
tf(t, d) = number of occurrences of term t in document d
idf(t, D) = log(N / |{d in D : t in d}|), where D is the corpus of all documents and N = |D|
tfidf(t, d, D) = tf(t, d) * idf(t, D)
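The same formulas, hand-rolled in Python (no IDF smoothing; the toy corpus is made up):

```python
# Direct translation of tf, idf, and tfidf above.
import math

corpus = [
    "virulence factor in salmonella",
    "salmonella outbreak reported",
    "neural nets for text",
]

def tf(t, d):
    return d.split().count(t)                  # occurrences of t in d

def idf(t, D):
    n = sum(t in d.split() for d in D)         # |{d in D : t in d}|
    return math.log(len(D) / n)                # log(N / ...)

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

print(tfidf("salmonella", corpus[0], corpus))  # common term, lower score
print(tfidf("virulence", corpus[0], corpus))   # rarer term, higher score
```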
Note: a “one hot” representation uses only a single count/1 for a single word; each word is represented by its own vector position.
Reducing Dimensions
- ignoring stopwords
- word stemming
- lowercasing
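These three steps, sketched with NLTK (assuming its stop-word corpus has been downloaded):

```python
# Stop-word removal, stemming, and lowercasing in one pass.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Run nltk.download("stopwords") once beforehand if needed.
STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def reduce_dimensions(text):
    tokens = text.lower().split()                    # lowercasing
    tokens = [t for t in tokens if t not in STOP]    # drop stop words
    return [STEMMER.stem(t) for t in tokens]         # stemming

print(reduce_dimensions("The toxins are known virulence factors"))
# e.g. ['toxin', 'known', 'virul', 'factor'] (stems vary by stemmer)
```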
SVMs
- 90% accuracy on virulence-factor classification of sentences
- more data failed to help
- tried about 10 different models; none did much better
- parameter tuning is the last, marginal improvement
- feature engineering is expensive, slow, doesn’t scale, and is domain-specific
- the TFIDF representation doesn’t capture semantic similarity, so he got rid of it
Deep Learning
- word representations
  - word embeddings (unstructured)
    - the vector math works (see the sketch after this list)
    - relationships appear as additive direction/distance offsets:
      - plurality
      - gender (e.g., king − man + woman ≈ queen)
      - country-capital relationships
      - part-of
- classification model
  - neural nets (structured)
    - enabled by GPUs
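A sketch of that additive-offset property using gensim; the pretrained-vector file name (“vectors.bin”) is an assumption for illustration, not something from the talk:

```python
# Word analogies via vector offsets (needs a pretrained word2vec-format file).
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)  # placeholder path

# Gender direction: king - man + woman ~ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Country-capital direction: paris - france + japan ~ tokyo
print(wv.most_similar(positive=["paris", "japan"], negative=["france"], topn=3))
```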
Implementations
Word2Vec
- CBOW (continuous bag of words): works better if you have common terms used a lot
- skip-gram (slower)
GloVe
- based on word co-occurrences (very similar to Word2Vec)
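Both Word2Vec training modes in a minimal gensim sketch (gensim 4.x API assumed; the toy sentences are invented):

```python
# sg=0 selects CBOW; sg=1 selects skip-gram.
from gensim.models import Word2Vec

sentences = [
    ["antimalaria", "drugs", "treat", "malaria"],        # made-up corpus
    ["salmonella", "expresses", "virulence", "factors"],
]
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(cbow.wv.similarity("malaria", "antimalaria"))
```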
Negation is missed by CBOW (even with bigrams and trigrams):
- “antimalaria” is at about 0.7 cosine similarity to “malaria”
- “dehydrate” and “hydrate” are close to each other by cosine distance
Taxonomic membership is missed:
- a particular/specific Salmonella, called staphylococci, wasn’t even on the list
It’s all about word neighborhoods rather than synonymy or semantic similarity.
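For reference, the cosine similarity behind these neighborhood comparisons (the vectors below are toy values, not real embeddings):

```python
# Cosine similarity: 1.0 = same direction, 0.0 = orthogonal.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

malaria = np.array([0.2, 0.9, 0.1])      # toy vector, made up
antimalaria = np.array([0.3, 0.8, 0.2])  # toy vector, made up
print(cosine_similarity(malaria, antimalaria))
```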
Three Neuron Activation Functions
In order of performance on the biomedical NLP example:
- ReLU (rectified linear unit): good performance (linear above a threshold)
- hyperbolic tangent (-1 to +1)
- sigmoid (0 to 1): worst performance
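The three activations, for reference, as NumPy functions:

```python
# The activation functions compared above.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # zero below the threshold, linear above

def tanh(x):
    return np.tanh(x)                # squashes to (-1, 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squashes to (0, 1)

x = np.linspace(-3.0, 3.0, 7)
print(relu(x))
print(tanh(x))
print(sigmoid(x))
```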
Stochastic Gradient Descent
- like randomized balls rolling downhill to local minima
- use multiple “balls”
- second-order gradient descent just converges to local minima faster
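A minimal sketch of the “ball rolling downhill” picture, fitting a one-parameter line with SGD (all values here are toy assumptions):

```python
# One "ball": take a noisy downhill step from one random sample at a time.
import random

data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]  # true slope w = 3

def grad(w, x, y):
    return 2.0 * (w * x - y) * x  # d/dw of the squared error (w*x - y)^2

w, lr = 0.0, 0.01
for _ in range(2000):
    x, y = random.choice(data)    # "stochastic": one random sample per step
    w -= lr * grad(w, x, y)       # roll downhill
print(w)                          # lands near 3.0 (a minimum)
```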
Enhancements to basic NNs
- Convolutional neural nets
  - a panning window to identify features
  - max-pooling operation: take the maxes within each window (sketched below)
- Recurrent nets are applied to language a lot
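A NumPy sketch of the max-pooling step (window size and feature values are arbitrary):

```python
# 1-D max pooling: keep only the max within each non-overlapping window.
import numpy as np

def max_pool_1d(features, window=2):
    trimmed = features[: len(features) // window * window]
    return trimmed.reshape(-1, window).max(axis=1)

feature_map = np.array([0.1, 0.8, 0.3, 0.5, 0.9, 0.2])
print(max_pool_1d(feature_map))  # -> [0.8 0.5 0.9]
```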
Manifold Hypothesis
The manifold hypothesis: entities that cluster around each other in hyperdimensional space often lie on a manifold of much lower dimension.
Note: Saddle points are common in hyperdimensional space.
Needs/Ideas/TBD
- Mathematical model of semantics (analogy algebra)
Tools
- Theano
- Torch
- PyLearn2
- Lasagne
- Keras
- DeepDist (works on Spark)
- Deeplearning4j (Java + Scala; Hadoop and Spark)
References
- Word2Vec using CBOW, skip-grams, GloVe
- SVMs for NLP
- Unsupervised sentiment analysis
- Latent Semantic Indexing (LSI)
- Latent Dirichlet allocation (LDA)
- “Probabilistic Topic Models”
- Neural Nets for NLP using Gradient Descent
- Improved (stochastic) Gradient Descent
- Deep Learning Resources
- Deep Learning Reading List
- Extracting semantics