This repository contains the course materials from Math 157: Intro to Mathematical Software.
Creative Commons BY-SA 4.0 license.
License: OTHER
February 26, 2018: Statistics (and pandas)
Administrivia:
CAPE evaluations are available! They close Monday, March 19 at 8am. Since this course is highly experimental, your feedback will be very helpful for shaping future offerings.
Attendance scores through Wednesday, February 21 are now posted on TritonEd. Contact Peter with any issues.
My office hours this week will be held Thursday, 3:30-4:30 (rather than 4-5).
In order to reduce latency, on Saturday, I archived the master chat room (see
master-chat-archive1.sage-chat
) and created a new empty chat with the same file name as the old one (master-chat.sage-chat
).Grades for HW5, and the HW6 solution set, will be available shortly.
HW 7 is now available. Some early comments:
Problem 1: The "Orange" dataset is not available via statsmodels. However, you can access it directly from R using the
rpy2
module.Problem 3a: The mpg dataset is in ggplot, not statsmodels:
from ggplot import mpg
Advance notice for week 9:
No sections on Tuesday, March 6. However, during this time, you are welcome to use APM 6402 as a study room; we will also try to monitor the chat room.
There will be an extra virtual office hour Thursday 6-7pm.
Thomas's office hours (usually Tuesday 11am-12pm) are moved to Friday 11:30am-12:50pm.
Added in class: Peter's office hours (usually Wednesday 3-5pm) are moved to Wednesday 5-7pm.
Advance notice for week 10:
No lectures on Monday, March 12 or Wednesday, March 14. You may wish to use this time to meet your assigned group for Part 2 of the final project.
There will be a lecture on Friday, March 16, on the topic of "Where to go from here?" This lecture will not be counted for course attendance; that is, the last lecture for which attendance counts is Friday, March 9.
My office hours on Thursday, March 15 are cancelled. All other sections and office hours meet as scheduled.
The final project
The final project has been assigned.
Look for
assignments/2018-03-16
for both parts. The files are called2018-03-16-part1.ipynb
and2018-3-16-part2.ipynb
.Part 2 points to this Google Form; please fill it out as soon as possible, preferably by Wednesday's lecture.
Remember, the final project is not a homework assignment and so cannot be dropped; it counts for 20% of the total grade (each of the two parts being weighted equally).
On the other hand, only your best 6 of 8 homeworks will count. Use this information to decide how to allocate your time between now and the end of the term.
Pause to go over the final project (both parts).
statsmodels
"Python module that allows users to explore data, estimate statistical models, and perform statistical tests."
Documentation: http://statsmodels.sourceforge.net/stable/
Here we follow the statsmodels "Getting started" tutorial.
We download the Guerry dataset, a collection of historical data used in support of Andre-Michel Guerry’s 1833 Essay on the Moral Statistics of France. The data set is hosted online in comma-separated values format (CSV) by the Rdatasets repository. We could download the file locally and then load it using read_csv, but pandas takes care of all of this automatically for us.
dept | Region | Department | Crime_pers | Crime_prop | Literacy | Donations | Infants | Suicides | MainCity | ... | Crime_parents | Infanticide | Donation_clergy | Lottery | Desertion | Instruction | Prostitutes | Distance | Area | Pop1831 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | E | Ain | 28870 | 15890 | 37 | 5098 | 33120 | 35039 | 2:Med | ... | 71 | 60 | 69 | 41 | 55 | 46 | 13 | 218.372 | 5762 | 346.03 |
1 | 2 | N | Aisne | 26226 | 5521 | 51 | 8901 | 14572 | 12831 | 2:Med | ... | 4 | 82 | 36 | 38 | 82 | 24 | 327 | 65.945 | 7369 | 513.00 |
2 | 3 | C | Allier | 26747 | 7925 | 13 | 10973 | 17044 | 114121 | 2:Med | ... | 46 | 42 | 76 | 66 | 16 | 85 | 34 | 161.927 | 7340 | 298.26 |
3 | 4 | E | Basses-Alpes | 12935 | 7289 | 46 | 2733 | 23018 | 14238 | 1:Sm | ... | 70 | 12 | 37 | 80 | 32 | 29 | 2 | 351.399 | 6925 | 155.90 |
4 | 5 | E | Hautes-Alpes | 17488 | 8174 | 69 | 6962 | 23076 | 16171 | 1:Sm | ... | 22 | 23 | 64 | 79 | 35 | 7 | 1 | 320.280 | 5549 | 129.10 |
5 rows × 23 columns
We select the variables of interest and look at the bottom 5 rows:
Department | Lottery | Literacy | Wealth | Region | |
---|---|---|---|---|---|
81 | Vienne | 40 | 25 | 68 | W |
82 | Haute-Vienne | 55 | 13 | 67 | C |
83 | Vosges | 14 | 62 | 82 | E |
84 | Yonne | 51 | 47 | 30 | C |
85 | Corse | 83 | 49 | 37 | NaN |
Notice that there is one missing observation in the Region column. We eliminate it using a DataFrame method provided by pandas:
Department | Lottery | Literacy | Wealth | Region | |
---|---|---|---|---|---|
80 | Vendee | 68 | 28 | 56 | W |
81 | Vienne | 40 | 25 | 68 | W |
82 | Haute-Vienne | 55 | 13 | 67 | C |
83 | Vosges | 14 | 62 | 82 | E |
84 | Yonne | 51 | 47 | 30 | C |
Some statistics...
Substantive motivation and model: We want to know whether literacy rates in the 86 French departments are associated with per capita wagers on the Royal Lottery in the 1820s. We need to control for the level of wealth in each department, and we also want to include a series of dummy variables on the right-hand side of our regression equation to control for unobserved heterogeneity due to regional effects. The model is estimated using ordinary least squares regression (OLS).
Use patsy‘s to create design matrices, then use statsmodels to do an ordinary least squares fit.
Dep. Variable: | Lottery | R-squared: | 0.338 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.287 |
Method: | Least Squares | F-statistic: | 6.636 |
Date: | Mon, 26 Feb 2018 | Prob (F-statistic): | 1.07e-05 |
Time: | 22:43:58 | Log-Likelihood: | -375.30 |
No. Observations: | 85 | AIC: | 764.6 |
Df Residuals: | 78 | BIC: | 781.7 |
Df Model: | 6 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 38.6517 | 9.456 | 4.087 | 0.000 | 19.826 | 57.478 |
Region[T.E] | -15.4278 | 9.727 | -1.586 | 0.117 | -34.793 | 3.938 |
Region[T.N] | -10.0170 | 9.260 | -1.082 | 0.283 | -28.453 | 8.419 |
Region[T.S] | -4.5483 | 7.279 | -0.625 | 0.534 | -19.039 | 9.943 |
Region[T.W] | -10.0913 | 7.196 | -1.402 | 0.165 | -24.418 | 4.235 |
Literacy | -0.1858 | 0.210 | -0.886 | 0.378 | -0.603 | 0.232 |
Wealth | 0.4515 | 0.103 | 4.390 | 0.000 | 0.247 | 0.656 |
Omnibus: | 3.049 | Durbin-Watson: | 1.785 |
---|---|---|---|
Prob(Omnibus): | 0.218 | Jarque-Bera (JB): | 2.694 |
Skew: | -0.340 | Prob(JB): | 0.260 |
Kurtosis: | 2.454 | Cond. No. | 371. |
statsmodels also provides graphics functions. For example, we can draw a plot of partial regression for a set of regressors by:
Machine Learning in Python
Simple and efficient tools for data mining and data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable - BSD license
Reminder: machine learning means trying to predict an unknown function based on some sample values. We'll illustrate this with an example from this tutorial.
The "digits" dataset consists of low-resolution scans of 1797 hand-written digits from the ZIP (postal) code numbers on letters.
It looks like each entry of digits.images
is an 8x8 array representing "pixels" in the image, and the corresponding entry of digits.target
indicates which numeral is represented by the image. To get a better idea of what is being represented here, let's plot one of these images.
Example/goal here:
"In the case of the digits dataset, the task is to predict, given an image, which digit it represents. We are given 1797 samples of each of the 10 possible classes (the digits zero through nine) on which we fit an estimator to be able to predict the classes to which unseen samples belong."
In this demonstration, we will use a support vector machine for this purpose. We'll set parameters manually, but scikit-learn also includes some functionality for tuning these parameters automatically.
This clf is a "classifier". Right now it doesn't know anything at all about our actual data.
We train it by passing in our data to the fit method...
"We use all the images of our dataset apart from the last one. We select this training set with the [:-1] Python syntax, which produces a new array that contains all but the last entry of digits.data."
** Exercise for you right now: ** Run the prediction on all 1797 scanned values to see how many are correct. Note that this is not actually meaningful, as we use some of these scanned values as training data (that is, we are performing an overfit).
To avoid overfitting, let's use only 1000 of the digits as training data, and set aside the rest as testing data. How accurate do you expect the predictions to be?
As a bonus, here are plots of some low-dimensional embeddings of the digits dataset:
http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html