Neural Nets Demystified

Demystify
Dig Deeper

Classification

The most basic ML task is classification

In NN lingo, this is called “association”

So lets predict “rain” (1) “no rain” (0) for PDX tomorrow

Supervised Learning

We have historical “examples” of rain and shine

Weather Underground

Since we know the classification (training set)…

Supervised classification (association)

Rain, Shine, Partly-Cloudy ?

Wunderground lists several possible “conditions” or classes

If we wanted to predict them all

We would just make a binary classifier for each one

All classification problems can be reduced a binary classification

Perceptron

Sounds mysterious, like a “flux capacitor” or something…

It’s just a multiply and threshold check:

if (weights * inputs) > 0:
    output = 1
else:
    output = 0

Need something better

Fine for “using” (activating) a NN
But for "training" (backpropagation) need ...

Doesn’t change direction: monotonic
Doesn’t jump around: smooth

Sigmoid

Again, sounds mysterious… like a transcendental function

It is a transcendental function, but the word just means

Curved, smooth like the letter “C”

What Greek letter do you think of when I say “Sigma”?

“Σ”

What Roman (English) character?

“E”?
“S”?
“C”?

Sigma

You didn’t know this was a Latin class, did you…

Σ (uppercase)
σ (lowercase)
ς (last letter in word)
c (alternatively)

Most English speakers think of an “S” when they hear “Sigma”.

So the meaning has evolved to mean S-shaped.

That’s what we want

something smooth, shaped like an “S”

so it goes from 0 to 1 in an S shape

Perceptron

Linear

Time Series

The weights in a NN form a sequence of matrices

One matrix for each mess of connections between layers

Once you've trained the NN you can disply them as heat maps

Look for structure and oportunities to "prune"

Input biases are the 1st column of weights

(in the first matrix of weights)

weight output column vector

Output biases are the first row of weights

(in the last matrix of weights)

weight input row vector

6 Perceptrons = 6 Rows

13 Outputs = 13 columns

Trainer ((backpropagator)[https://en.wikipedia.org/wiki/Backpropagation])

can predict the change in weights required Wants to nudge the output closer to the target

target: known classification for training examples output: predicted classification your network spits out

But just a nudge.

Don’t get greedy and push all the way to the answer Because your linear sloper predictions are wrong And there may be nonlinear interactions between the weights (multiply layers)

So set the learning rate (\alpha) to somthething less than 1 the portion of the predicted nudge you want to “dial back” to

Example: Predict Rain in Portland

PyBrain
pug-ann (helper functions TBD PyBrain2)

Get historical weather for Portland then …

Backpropagate: train a perceptron
Activate: predict the weather for tomorrow!

NN Advantages

Easy
- No math!
- No tuning!
- Just plug and chug.
General
- One model can apply to many problems
Advanced
- They often beat all other “tuned” approaches

Disadvantage #1: Slow training

24+ hr for complex Kaggle example on laptop
90x30x20x10 model degrees freedom
- 90 input dimensions (regressors)
- 30 nodes for hidden layer 1
- 20 nodes for hidden layer 2
- 10 output dimensions (predicted values)

Disadvantage #2: They don’t scale (unparallelizable)

Fully-connected NNs can’t be easily hyper-parallelized (GPU)
- Large matrix multiplications
- Layers depend on all elements of previous layers

Scaling Workaround

At Kaggle workshop we discussed paralleling linear algebra

Split matrices up and work on “tiles”
Theano, Keras for python
PLASMA for BLAS

Scaling Workaround Limitations

But tiles must be shared/consolidated and theirs redundancy

Data flow: Main -> CPU -> GPU -> GPU cache (and back)
Data com (RAM xfer) is limiting
Data RAM size (at each stage) is limiting
Each GPU is equivalent to 16 core node

Disadvantage #3: They overfit

Too manu nodes = overfitting

What is the big O?

Degrees of freedom grow with number of nodes & layers
Each layer’s nodes connected to each previous layer’s
That a lot of wasted “freedom”

O(N^2)

Rule of thumb

NOT N**2

But M * N**2

N: number of nodes M: number of layers

assert(M * N**2 < len(training_set) / 10.)

I’m serious… put this into your code. I wasted a lot of time training models for Kaggle that overfitted.

You do need to know math!

To imprint your net with the structure (math) of the problem
- Feature analysis or transformation (conventional ML)
- Choosing the activation function and segmenting your NN
Prune and evolve your NN

This is a virtuous cycle!

More structure (no longer fully connected)
- Each independent path (segment) is parallelizable!
Automatic tuning, pruning, evolving is all parallelizable!
- Just train each NN separately
- Check back in with Prefrontal to “compete”

Structure you can play with (textbook)

limit connections

jargon: receptive fields

limit weights

jargon: weight sharing

All the rage: convolutional networks

Unconventional structure to play with

New ideas, no jargon yet, just crackpot names

limit weight ranges (e.g. -1 to 1, 0 to 1, etc)
weight “snap to grid” (snap learning)

Joke: “What’s the difference between a scientist and a crackpot?”

Ans: “P-value”

No PHD
High-Probability null hypothesis
Not Published
Not Peer-reviewed
No PyPi package

I’m a crackpot!

Resources

keras.io: Scalable Python NNs
Neural Network Design: Free NN Textbook!
pug-ann: Helpers for PyBrain and Wunderground
PyBrain2: We’re working on it