This tutorial is modeled after a series of tutorials by Jake Vanderplas. Text, code, and licenses to those may be found here
This tutorial can be read and executed at https://tinyurl.com/la-ml-demo
We are often faced with a pile of data and little notion of what to do with it. We know that the data reflects some information about the real world, but not necessarily what that information is or how to get at it.
A model is a representation of a real-world process that we want to understand. Models can be qualitative or quantitative, complex or simple. Broadly speaking, they are useful in two ways:
- They provide understanding. Models are simpler than the real world, and the specific "physics" of a model can often be interpreted in terms of plain language. Frequently the interpretability of a model is among the most attractive of its attributes.
- They have predictive power. You can use a good model of the real world to predict the behavior of hypothetical data or data you have not seen before. You can use these predictions to provide guidance for future data collection, policy, and model refinements.
Machine learning is the process of building statistical models using computers. These models have tunable parameters (usually numbers), which are adjusted to fit existing data. You can then use that model to predict values from data that the model has not seen before.
There are many different classes of model, and many different methods of fitting and predicting, but they all follow this general pattern.
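As a minimal sketch of that fit-then-predict pattern, the snippet below fits a straight line to synthetic data with numpy and then evaluates it on unseen inputs. The data, the "true" slope and intercept, and all variable names are made up for illustration and are not part of the tutorial exercises.

```python
import numpy as np

# Synthetic "existing data": a noisy straight line (illustrative values only).
rng = np.random.default_rng(0)
x_train = np.linspace(0, 10, 50)
y_train = 2.0 * x_train + 1.0 + rng.normal(scale=0.5, size=x_train.size)

# Fit: adjust the model's two tunable parameters (slope and intercept)
# so that the line matches the existing data in the least-squares sense.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Predict: apply the fitted model to inputs it has not seen before.
x_new = np.array([12.0, 15.0])
y_pred = slope * x_new + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print("predictions:", y_pred)
```

The fitted parameters should land close to the values used to generate the data, which is also what makes a simple model like this one easy to interpret.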
This tutorial consists of several exercises that introduce the user to simple machine learning. They use the de facto standard Python scientific software stack, principally numpy, scipy, matplotlib, pandas, and scikit-learn.
First is an introduction to machine learning and some motivating examples:
Second is an introduction to scikit-learn, the workhorse of Python machine learning:
Finally, we perform an example analysis of ridership data for Metro bikeshare: