Draft of Neural Nets Demystified
Neural Nets Demystified
- Demystify
- Dig Deeper
Note:
First I’ll suck you in with a simple example (predicting Portland weather). Then I’ll show you how to play around at the frontier of the state of the art.
- Thoughts about the upcoming PDX Data Science Meetup
- “Neural Nets Demystified.”
Classification
The most basic ML task is classification
In NN lingo, this is called “association”
So let’s predict “rain” (1) or “no rain” (0) for PDX tomorrow
Supervised Learning
We have historical “examples” of rain and shine
Since we know the classification (training set)…
Supervised classification (association)
Rain, Shine, Partly-Cloudy ?
Wunderground lists several possible “conditions” or classes
If we wanted to predict them all
We would just make a binary classifier for each one
All classification problems can be reduced to binary classification
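One way to picture that reduction is a one-vs-rest split, where each class gets its own 0/1 target column. This is only a sketch: the helper name `one_vs_rest_targets` and the example labels are made up for illustration, not taken from Wunderground.

```python
def one_vs_rest_targets(labels):
    """Turn a list of class labels into one 0/1 target list per class."""
    classes = sorted(set(labels))
    return {c: [1 if label == c else 0 for label in labels] for c in classes}


# Made-up example labels (hypothetical, not real weather data)
conditions = ["rain", "shine", "partly-cloudy", "rain"]
targets = one_vs_rest_targets(conditions)
# targets["rain"] is the training target for the "rain vs. everything else" classifier
```

Each of the resulting target lists can then train its own binary classifier.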
Perceptron
Sounds mysterious, like a “flux capacitor” or something…
It’s just a multiply and threshold check:
Perceptron
(Diagram of a perceptron)
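In code, the multiply and threshold check might look like the sketch below. The feature names, weights, and threshold are invented for illustration; a trained perceptron would learn its own.

```python
def perceptron(inputs, weights, threshold=0.0):
    """Multiply inputs by weights, sum them, then check against a threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0


# Hypothetical features: [humidity, cloud cover], both scaled 0 to 1
rain = perceptron([0.9, 0.8], weights=[0.6, 0.7], threshold=1.0)
shine = perceptron([0.2, 0.1], weights=[0.6, 0.7], threshold=1.0)
```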
Need something a little better
Works fine for “using” (activating) your NN
But for learning ([backpropagation](https://en.wikipedia.org/wiki/Backpropagation)) you need it to be predictable…
[Sigmoid](https://en.wikipedia.org/wiki/Perceptron)
Again, sounds mysterious… like a transcendental function
It is a transcendental function, but the word “sigmoid” just means curved and smooth, like the letter “C”
What Greek letter do you think of when you hear me say Sigma?
“Σ”
What Roman (English) letter does it most look like?
- “E”?
- “S”?
- “C”?
Sigma
You didn’t know this was a Latin/Greek class, did you…
Σ (uppercase) σ (lowercase) ς (last letter in word) c (alternatively)
Most English speakers think of an “S” when they hear “Sigma”, so the meaning has evolved to mean S-shaped.
That’s what we want, something smooth, shaped like an “S”
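A minimal sketch of such a smooth, S-shaped function, assuming the logistic function (the most common choice of sigmoid; the notes don't say which one is used):

```python
import math


def sigmoid(x):
    """Logistic sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_slope(x):
    """Derivative of the sigmoid: the smooth slope a trainer can rely on."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

Unlike a hard threshold, the slope is defined everywhere, which is what makes the effect of a small weight change predictable.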
The trainer ([backpropagator](https://en.wikipedia.org/wiki/Backpropagation)) can predict the change in weights required
Wants to nudge the output closer to the target
- target: known classification for training examples
- output: predicted classification your network spits out
But just a nudge.
Don’t get greedy and push all the way to the answer, because your linear slope predictions are wrong, and there may be nonlinear interactions between the weights (multiple layers).
So set the learning rate (\alpha) to something less than 1: the portion of the predicted nudge you want to “dial back” to.
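Here is a rough sketch of one nudge for a single sigmoid neuron, assuming a squared-error delta rule. The function name `nudge`, the training example, and `alpha=0.5` are all illustrative choices, not anything prescribed by PyBrain.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def nudge(weights, inputs, target, alpha=0.1):
    """Return weights moved a fraction alpha of the way the slope suggests."""
    output = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
    error = target - output
    slope = output * (1.0 - output)  # sigmoid derivative at this output
    return [w + alpha * error * slope * x for w, x in zip(weights, inputs)]


weights = [0.0, 0.0]
inputs, target = [1.0, 0.5], 1  # made-up training example ("rain")
before = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
for _ in range(100):
    weights = nudge(weights, inputs, target, alpha=0.5)
after = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
```

Each pass moves the output a little closer to the target instead of jumping straight to it.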
Example: Predict Rain in Portland
- PyBrain
- pug-ann (helper functions TBD PyBrain2)
Get historical weather for Portland then …
- Backpropagate: train a perceptron
- Activate: predict the weather for tomorrow!
NN Advantages
- Easy
- No math!
- No tuning!
- Just plug and chug.
- General
- One model can apply to many problems
- Advanced
- They often beat all other “tuned” approaches
Disadvantage #1: Slow training
- 24+ hr for complex Kaggle example on laptop
- 90x30x20x10 model degrees of freedom
- 90 input dimensions (regressors)
- 30 nodes for hidden layer 1
- 20 nodes for hidden layer 2
- 10 output dimensions (predicted values)
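To get a feel for how many numbers that model has to learn, you can count the connections between layers. A small sketch, assuming a plain fully connected feed-forward net (the helper name `count_weights` is hypothetical):

```python
def count_weights(layer_sizes, biases=True):
    """Count trainable parameters in a fully connected feed-forward net."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        if biases:
            total += n_out     # one bias per receiving node
    return total


n_weights = count_weights([90, 30, 20, 10], biases=False)  # 2700 + 600 + 200
n_params = count_weights([90, 30, 20, 10])                 # plus 60 biases
```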
Disadvantage #2: They don’t scale (unparallelizable)
- Fully-connected NNs can’t be easily hyper-parallelized (GPU)
- Large matrix multiplications
- Layers depend on all elements of previous layers
Scaling Workaround
At the Kaggle workshop we discussed parallelizing linear algebra
Scaling Workaround Limitations
But tiles must be shared/consolidated and there’s redundancy
- Data flow: Main -> CPU -> GPU -> GPU cache (and back)
- Data com (RAM xfer) is limiting
- Data RAM size (at each stage) is limiting
- Each GPU is equivalent to a 16-core node
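The tiling idea can be sketched as a blocked matrix multiply. This is an assumption about what “tiles” means here, written as serial pure Python just to show the data dependencies; a real GPU version would hand each output tile to a separate worker.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply matrices (lists of rows) one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            # Each (i0, j0) output tile could go to a separate worker...
            for k0 in range(0, k, tile):
                # ...but it needs a full row of A tiles and column of B tiles,
                # so the same input tiles get copied to many workers.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for kk in range(k0, min(k0 + tile, k)):
                            c[i][j] += a[i][kk] * b[kk][j]
    return c


product = matmul_tiled([[1, 2, 3], [4, 5, 6]], [[1, 0], [0, 1], [1, 1]])
```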
Disadvantage #3: They overfit
- Too many nodes = overfitting
What is the big O?
- Degrees of freedom grow with number of nodes & layers
- Each layer’s nodes connected to each previous layer’s
- That’s a lot of wasted “freedom”
O(N^2)
Not so fast, big O…
Rule of thumb
NOT N**2
But M * N**2
N: number of nodes
M: number of layers
assert(M * N**2 < len(training_set) / 10.)
I’m serious… put this into your code. I wasted a lot of time training models for Kaggle that overfitted.
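One way to wire that rule of thumb into a training script is a small guard function. The name `check_capacity` and the numbers below are hypothetical; only the `M * N**2` versus a tenth of the training set comparison comes from the rule above.

```python
def check_capacity(num_nodes, num_layers, num_examples):
    """Raise AssertionError if M * N**2 is not under a tenth of the examples."""
    freedom = num_layers * num_nodes ** 2
    assert freedom < num_examples / 10., (
        "%d degrees of freedom is too many for %d examples"
        % (freedom, num_examples))
    return freedom


# Hypothetical sizes: 2 layers of 10 nodes, 5000 training examples
freedom = check_capacity(num_nodes=10, num_layers=2, num_examples=5000)
```

Calling it before training starts means an oversized net fails in a second rather than after a day of wasted training.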
You do need to know math!
- To imprint your net with the structure (math) of the problem
- Feature analysis or transformation (conventional ML)
- Choosing the activation function and segmenting your NN
- Prune and evolve your NN
This is a virtuous cycle!
- More structure (no longer fully connected)
- Each independent path (segment) is parallelizable!
- Automatic tuning, pruning, evolving is all parallelizable!
- Just train each NN separately
- Check back in with Prefrontal to “compete”
Structure you can play with (textbook)
- limit connections
jargon: receptive fields
- limit weights
jargon: weight sharing
All the rage: convolutional networks
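Put receptive fields and weight sharing together and you get a convolution: each output only sees a small window of inputs, and every window reuses the same weights. A 1-D sketch with a made-up signal and kernel:

```python
def conv1d(inputs, kernel):
    """Slide one shared set of weights across the inputs (no padding).

    Strictly this is cross-correlation (the kernel is not flipped),
    which is what most NN libraries call convolution.
    """
    width = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, inputs[i:i + width]))
        for i in range(len(inputs) - width + 1)
    ]


# 8 inputs -> 6 outputs using only 3 weights,
# versus 8 * 6 = 48 weights for a fully connected layer
signal = [0, 0, 1, 1, 1, 0, 0, 0]
edges = conv1d(signal, kernel=[-1, 0, 1])
```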
Unconventional structure to play with
New ideas, no jargon yet, just crackpot names
- limit weight ranges (e.g. -1 to 1, 0 to 1, etc)
- weight “snap to grid” (snap learning)
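These two ideas aren't spelled out in detail, so the following is just one guess at how they could be applied to a single weight after each training step. The function names, the default range, and the grid step are all assumptions.

```python
def clip_weight(w, low=-1.0, high=1.0):
    """Keep a weight inside the range [low, high]."""
    return max(low, min(high, w))


def snap_weight(w, step=0.25):
    """Round a weight to the nearest multiple of step ("snap to grid")."""
    return round(w / step) * step


clipped = [clip_weight(w) for w in [1.7, -3.0, 0.3]]
snapped = [snap_weight(w) for w in [0.3, 0.9, -0.6]]
```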
Joke: “What’s the difference between a scientist and a crackpot?”
Ans: “P-value”
- High-Probability null hypothesis
- Not Published
- Not Peer-reviewed
- No PyPI package
I’m a crackpot!
Resources
- keras.io: Scalable Python NNs