Cleanup of Artificial Neural Net Subpackage (Module) for PUG
For February’s Python User Group I gave a lightning talk and live demo using pybrain to predict the weather. It took a whole weekend to pay off the code-quality debt from the hacking I did during Kyle Gorman’s awesome NLP talk. My goals for the cleanup:
- Improve prediction accuracy
- Modularize and generalize the approach
- Eliminate validation cheating
  - no future data used in training
  - non-overlapping validation and training data sets
Gotchas with time-series data
With time-series data and “tapped delay lines” (a linear, finite impulse response filter), it’s easy to accidentally pollute your training set with validation data and vice versa. For instance, if you don’t segment your data manually, pybrain segments it using random sampling. That leaks data between the two sets, because each sample vector in the dataset you give pybrain includes time samples that also appear in N other sample vectors, where N is the order of your filter (the number of tapped delay lines in your block diagram of the filter).
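Here’s a minimal sketch of that leak in plain Python, without pybrain. The names make_windows and shared_time_indices, the 75/25 split, and the sizes are illustrative assumptions, not the code from my package. Each sample vector is represented only by the time indices it touches (the delayed inputs plus the target), which is enough to count how many raw time samples end up on both sides of a split.

```python
import random


def make_windows(n_samples, n_taps):
    """One tuple of time indices per sample vector: n_taps inputs plus the target."""
    return [tuple(range(start, start + n_taps + 1))
            for start in range(n_samples - n_taps)]


def shared_time_indices(train, validation):
    """Time indices that appear in both the training and the validation windows."""
    train_times = {t for window in train for t in window}
    validation_times = {t for window in validation for t in window}
    return train_times & validation_times


n_samples, n_taps = 100, 7  # illustrative sizes, not the weather model's
windows = make_windows(n_samples, n_taps)
n_train = int(0.75 * len(windows))

# Random segmentation: shuffle the windows, then cut.
rng = random.Random(0)
shuffled = windows[:]
rng.shuffle(shuffled)
random_leak = shared_time_indices(shuffled[:n_train], shuffled[n_train:])

# Chronological segmentation: cut in time order and skip n_taps windows
# after the cut so no validation window reaches back into training data.
chrono_train = windows[:n_train]
chrono_validation = windows[n_train + n_taps:]
chrono_leak = shared_time_indices(chrono_train, chrono_validation)

print(len(random_leak), len(chrono_leak))
```

The random split always shares some time samples between the two sets, while the chronological split with a gap shares none, at the cost of throwing away n_taps windows at the boundary.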
So I reworked the function that builds a neural net from a pybrain dataset as well as the function that builds a dataset from a pandas time-series dataframe. I also added a delays argument to specify the irregularly sampled time-series rows to use in your FIR filter. I tested it by predicting the weather here in Camas, WA (daily max temp) to within 6%. Not super-great, but the model is pretty simple, and it only takes 5 lines to exercise the general functions for gathering the data and then training and testing the model. Perhaps there’s still some “cheating” hidden in the pybrain training and data segmentation.
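To make the delays idea concrete, here’s a rough sketch in plain Python of pairing each target with irregularly spaced past values. The function name build_delay_dataset, its signature, and the temperature values are made up for illustration; the real helpers work on pandas dataframes and pybrain datasets and may differ.

```python
def build_delay_dataset(series, delays):
    """Pair each target value with the earlier values selected by `delays`.

    series: numbers in chronological order (e.g. daily max temperature).
    delays: positive integers, how many steps back each input looks.
    """
    if not delays or min(delays) < 1:
        # A delay of 0 or less would feed the target (or the future) to the model.
        raise ValueError("delays must be positive integers")
    max_delay = max(delays)
    dataset = []
    # Start at max_delay so every delayed input exists.
    for t in range(max_delay, len(series)):
        inputs = [series[t - d] for d in delays]
        dataset.append((inputs, series[t]))
    return dataset


temps = [50, 52, 55, 53, 51, 54, 58, 60, 57, 56]  # made-up values
rows = build_delay_dataset(temps, delays=[1, 2, 7])
for inputs, target in rows:
    print(inputs, "->", target)
```

With delays=[1, 2, 7] each row uses yesterday, the day before, and the same weekday last week, so the first usable target is the eighth value in the series.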
pip install -e [email protected]:hobson/[email protected]#egg=pug-nlp-master
Try it out on your city, or explore other model configurations to help me improve the accuracy. You might also find the wunderground API wrapper useful if you need historical weather data for your project. I’ll push it up to the cheese shop this week to make it as easy to install as:
pip install pug
I’m looking forward to wrapping these helpers with meta-helpers:
- A “pruner” to eliminate useless inputs (features), nodes (activation functions), and connections (weights).
- An “explorer” to try new inputs and node/connection types to explore all the possibilities and improve performance.
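I haven’t written the pruner yet, but the simplest version of the input-pruning half might look something like this sketch. It swaps in a plain Pearson-correlation filter, which is only one possible criterion and says nothing about pruning nodes or weights. The names pearson and prune_inputs, the threshold, and the data are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def prune_inputs(columns, target, min_abs_corr=0.1):
    """Names of the input columns whose |correlation| with the target reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= min_abs_corr]


# Made-up inputs: one useful, one constant, one uncorrelated with the target.
columns = {
    "yesterday": [1, 2, 3, 4, 5],
    "constant": [7, 7, 7, 7, 7],
    "noise": [1, -1, 0, -1, 1],
}
target = [2, 4, 6, 8, 10]
kept = prune_inputs(columns, target)
print(kept)
```

A linear correlation filter will miss inputs that only matter nonlinearly or in combination, so a real pruner would more likely retrain with each input removed and compare validation error.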
I also want to:
- Build a non-overlapping time series dataset segmenter for validation and cross-validation testing
- Build a pybrain training algorithm that takes any user-specified error function rather than the MSE calc it uses by default
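As a starting point for the segmenter, here’s a sketch of forward-chaining folds: each validation block is contiguous, training data comes only from before it, and a gap keeps overlapping delay-line windows out of both sets. The name forward_chaining_folds and its parameters are my assumptions for illustration, and nothing here touches pybrain.

```python
def forward_chaining_folds(n_samples, n_folds, gap):
    """Yield (train_indices, validation_indices) pairs in time order.

    Training always ends at least `gap` samples before validation starts,
    so setting gap to the number of delay taps keeps the windows from overlapping.
    """
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        start = k * fold_size
        # The last fold absorbs any leftover samples.
        stop = n_samples if k == n_folds else start + fold_size
        train = list(range(0, max(start - gap, 0)))
        validation = list(range(start, stop))
        yield train, validation


folds = list(forward_chaining_folds(n_samples=20, n_folds=3, gap=2))
for train, validation in folds:
    print(len(train), "training samples ->", validation[0], "to", validation[-1])
```

The early folds train on very little data, which is the price of never letting future samples into training.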