
This repository contains the course materials from Math 157: Intro to Mathematical Software.

Creative Commons BY-SA 4.0 license.

Kernel: SageMath 8.1

Math 157: Intro to Mathematical Software

UC San Diego, winter 2018

February 23, 2018: Introduction to R and statistics (part 2 of 2)

Administrivia:

  • The final project will be assigned shortly. Look for a folder called assignments/2018-03-16 for both parts.

Added in class:

  • Attendance scores will be updated soon (hopefully by Monday).

Regarding Homework 6:

  • For problem 1c, it should read "how would the crossover value depend (asymptotically) on M?" (rather than "on N").

  • For problem 4a, the values I gave are too big to handle in CoCalc. You may use these parameters instead:

p = 2430985831
E = EllipticCurve(GF(p), [0,1,0,1,-1])
g = E(1142679369, 725589986)
  • For problem 5c, by "one conclusion you drew from the data" I mean a statement about the "real world". (Fake example: "people born in January have bigger ears than people born in July.")

Pause here for additional questions.

Side-by-side comparison of R and Python

In this lecture, we make a side-by-side comparison of various types of data analysis functionality in R and Python. This is inspired by this blog post from a company called Dataquest, which I had not heard of until I started preparing this lecture. (Reminder: I am still not a statistician! Nor am I a basketball fan, but what the heck.)

In order to switch back and forth efficiently, I will work in SageMath and use the extension I described last time to switch individual cells over to R. Since we'll be using pandas, I'll turn off the Sage preparser.

preparser(False) # Turn off the Sage preparser
import pandas as pd
%load_ext rpy2.ipython

To begin, we need some data to analyze. Here, we'll use a dataset from Open Source Sports consisting of historical data about men's basketball (NBA) players from (sometime in the past) until 2012. You will find this file in the same folder as this notebook. Let's start by importing this data into R and Python.

%%R
nba <- read.csv("basketball_players.csv")
nba = pd.read_csv("basketball_players.csv")

How big is this dataset?

%%R
dim(nba)
[1] 23751 42
nba.shape
(23751, 42)

Let's look at the first row to get a sense of what the data looks like.

%%R
head(nba, 2)
   playerID year stint tmID lgID GP GS minutes points oRebounds dRebounds
1 abramjo01 1946     1  PIT  NBA 47  0       0    527         0         0
2 aubucch01 1946     1  DTF  NBA 30  0       0     65         0         0
  rebounds assists steals blocks turnovers  PF fgAttempted fgMade ftAttempted
1        0      35      0      0         0 161         834    202         178
2        0      20      0      0         0  46          91     23          35
  ftMade threeAttempted threeMade PostGP PostGS PostMinutes PostPoints
1    123              0         0      0      0           0          0
2     19              0         0      0      0           0          0
  PostoRebounds PostdRebounds PostRebounds PostAssists PostSteals PostBlocks
1             0             0            0           0          0          0
2             0             0            0           0          0          0
  PostTurnovers PostPF PostfgAttempted PostfgMade PostftAttempted PostftMade
1             0      0               0          0               0          0
2             0      0               0          0               0          0
  PostthreeAttempted PostthreeMade note
1                  0             0
2                  0             0
nba.head(2)
playerID year stint tmID lgID GP GS minutes points oRebounds ... PostBlocks PostTurnovers PostPF PostfgAttempted PostfgMade PostftAttempted PostftMade PostthreeAttempted PostthreeMade note
0 abramjo01 1946 1 PIT NBA 47 0 0 527 0 ... 0 0 0 0 0 0 0 0 0 NaN
1 aubucch01 1946 1 DTF NBA 30 0 0 65 0 ... 0 0 0 0 0 0 0 0 0 NaN

2 rows × 42 columns

Let's compute averages of the various statistics. (Not that these are particularly meaningful, but just as a demonstration.)

%%R
sapply(nba, mean)
      playerID           year          stint           tmID
            NA   1982.9142352      1.0346512             NA
          lgID             GP             GS        minutes
            NA     47.9643383      0.9356238   1097.2966612
        points      oRebounds      dRebounds       rebounds
   492.1306892     50.3825944    112.8252705    209.0642078
       assists         steals         blocks      turnovers
   107.0603764     29.3477327     18.0552398     55.3417961
            PF    fgAttempted         fgMade    ftAttempted
   112.8513747    410.5997642    188.5476401    135.5579976
        ftMade threeAttempted      threeMade         PostGP
   101.4439813     36.4243611     12.5785020      3.1805819
        PostGS    PostMinutes     PostPoints  PostoRebounds
     0.1397836     74.5521452     32.6656141      3.0950276
 PostdRebounds   PostRebounds    PostAssists     PostSteals
     7.1340996     14.0891752      6.8242600      1.7684308
    PostBlocks  PostTurnovers         PostPF PostfgAttempted
     1.1938445      3.5183782      7.8783630     27.0765441
    PostfgMade PostftAttempted     PostftMade PostthreeAttempted
    12.2920719      9.5058734      7.1634457      2.6217422
 PostthreeMade           note
     0.9185297             NA
nba.mean()
year                  1982.914235
stint                    1.034651
GP                      47.964338
GS                       0.935624
minutes               1097.296661
points                 492.130689
oRebounds               50.382594
dRebounds              112.825271
rebounds               209.064208
assists                107.060376
steals                  29.347733
blocks                  18.055240
turnovers               55.341796
PF                     112.851375
fgAttempted            410.599764
fgMade                 188.547640
ftAttempted            135.557998
ftMade                 101.443981
threeAttempted          36.424361
threeMade               12.578502
PostGP                   3.180582
PostGS                   0.139784
PostMinutes             74.552145
PostPoints              32.665614
PostoRebounds            3.095028
PostdRebounds            7.134100
PostRebounds            14.089175
PostAssists              6.824260
PostSteals               1.768431
PostBlocks               1.193844
PostTurnovers            3.518378
PostPF                   7.878363
PostfgAttempted         27.076544
PostfgMade              12.292072
PostftAttempted          9.505873
PostftMade               7.163446
PostthreeAttempted       2.621742
PostthreeMade            0.918530
dtype: float64

Let's look for correlations among columns using a scatterplot.

%%R
library(GGally)
ggpairs(nba[,c("assists", "fgMade", "turnovers")])
Image in a Jupyter notebook
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(nba[["assists", "fgMade", "turnovers"]])
plt.show()
Image in a Jupyter notebook

Note that in both of these examples, we had to reference external packages.

  • In R, GGally is an extension of the widely used plotting package ggplot2.

  • In Python, Seaborn is a visualization library based on matplotlib.

Let's do a cluster plot. To do this, we need to remove columns which do not contain numeric values.

%%R
library(cluster)
set.seed(1)
isGoodCol <- function(col){
  sum(is.na(col)) == 0 && is.numeric(col)
}
goodCols <- sapply(nba, isGoodCol)
clusters <- kmeans(nba[,goodCols], centers=5)
labels <- clusters$cluster
from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=5, random_state=1)
good_columns = nba._get_numeric_data().dropna(axis=1)
kmeans_model.fit(good_columns)
labels = kmeans_model.labels_

In R, we used the cluster package. In Python, we used a module from scikit-learn, a widely used package for machine learning. More on what that phrase means shortly.
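As a quick extra check (not part of the original comparison), we can count how many players landed in each cluster. This is a minimal sketch reusing the labels array from the scikit-learn fit above; pandas was already imported earlier.

import pandas as pd  # already imported above
# Count how many rows were assigned to each of the 5 clusters.
print(pd.Series(labels).value_counts())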

Let's plot the clusters we just computed.

%%R
nba2d <- prcomp(nba[,goodCols], center=TRUE)
twoColumns <- nba2d$x[,1:2]
clusplot(twoColumns, labels)
Image in a Jupyter notebook
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(good_columns)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)
plt.show()
Image in a Jupyter notebook

These visualizations use principal component analysis. Roughly speaking, this means trying to express a large number of correlated variables in terms of a smaller number of uncorrelated variables. In geometric terms, imagine your data as a collection of points in a high-dimensional space, but suppose that they lie on (or close to) a low-dimensional subset; then points within the dataset should be describable in terms of a small number of "independent coordinates". (E.g., points on a sphere sit in 3-space, but you need only two coordinates, say latitude and longitude, to locate them.)
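To quantify how much of the original variation the two components above capture, we can inspect the explained variance ratios. This is a minimal sketch reusing pca_2 from the scikit-learn cell above; the printed numbers are not reproduced here.

# Fraction of the total variance captured by each of the two
# principal components, and their combined total.
print(pca_2.explained_variance_ratio_)
print(pca_2.explained_variance_ratio_.sum())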

Let's try some machine learning now. The general framework of machine learning is: one has a function from some domain to some codomain, and one would like to be able to "predict" its value at points of the domain. Of course one can do this by a trivial lookup if one has a full value table for the function; for this to be meaningful, one instead should "train" on a small subset of the domain, then "test" elsewhere on the domain to see whether the predictions hold up. It is cheating if the "training" and "testing" data overlap: evaluating a model on the same data it was trained on hides overfitting, i.e., the tendency of a model to memorize its training data rather than to generalize.

%%R
trainRowCount <- floor(0.8 * nrow(nba))
set.seed(1)
trainIndex <- sample(1:nrow(nba), trainRowCount)
train <- nba[trainIndex,]
test <- nba[-trainIndex,]
train = nba.sample(frac=0.8, random_state=1)
test = nba.loc[~nba.index.isin(train.index)]

In this example, we took 80% of the data, sampled randomly, to be the training data and the remaining 20% to be the testing data.
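As a quick sanity check (not something we did above), one can verify the sizes of the two pieces and confirm that they do not overlap; here is a minimal sketch using the pandas objects just defined.

# Sizes of the training and testing sets, and a check that
# no row index appears in both pieces.
print(len(train), len(test))
print(train.index.intersection(test.index).empty)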

Let's try doing some predictions based on linear regression. Say we want to predict assists in terms of field goals.

%%R
fit <- lm(assists ~ fgMade, data=train)
predictions <- predict(fit, test)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train[["fgMade"]], train["assists"])
predictions = lr.predict(test[["fgMade"]])

Let's look at some summary results.

%%R
summary(fit)
Call:
lm(formula = assists ~ fgMade, data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-492.02  -36.04  -13.54    9.87  902.68

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.53946    0.96897   12.94   <2e-16 ***
fgMade       0.50158    0.00359  139.71   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 95.16 on 18998 degrees of freedom
Multiple R-squared:  0.5067, Adjusted R-squared:  0.5067
F-statistic: 1.952e+04 on 1 and 18998 DF,  p-value: < 2.2e-16
import statsmodels.formula.api as sm
model = sm.ols(formula='assists ~ fgMade', data=train)
fitted = model.fit()
fitted.summary()
OLS Regression Results
Dep. Variable: assists R-squared: 0.506
Model: OLS Adj. R-squared: 0.506
Method: Least Squares F-statistic: 1.948e+04
Date: Fri, 23 Feb 2018 Prob (F-statistic): 0.00
Time: 22:42:58 Log-Likelihood: -1.1344e+05
No. Observations: 19001 AIC: 2.269e+05
Df Residuals: 18999 BIC: 2.269e+05
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 12.8753 0.961 13.400 0.000 10.992 14.759
fgMade 0.4989 0.004 139.556 0.000 0.492 0.506
Omnibus: 10185.359 Durbin-Watson: 2.020
Prob(Omnibus): 0.000 Jarque-Bera (JB): 112607.349
Skew: 2.355 Prob(JB): 0.00
Kurtosis: 13.957 Cond. No. 376.
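The summaries above describe the fit on the training data only. As an extra step (not in the original notebook), here is a minimal sketch of how one might score the scikit-learn predictions computed earlier against the held-out test set.

from sklearn.metrics import mean_squared_error, r2_score
# Compare predicted assists to the actual assist counts in the test set.
print(mean_squared_error(test["assists"], predictions))
print(r2_score(test["assists"], predictions))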

We'll take a closer look at machine learning in a later lecture.

The takeaway

Some differences between R and Python from the point of view of statistics:

  • R uses a functional approach, where everything you want to do is a named function that you can call. Python uses an object-oriented approach, where most things you want to do are methods of the particular types of objects to which they apply.

  • R includes a lot of basic functionality for statistics by default, whereas Python does not; this functionality has to be loaded from packages.

  • R has a large ecosystem of small packages, including many for specialized tasks. Python has a smaller ecosystem of larger packages, which cover most standard functionality but miss some specialized things.

  • A lot of functionality on each side was modeled on the other. For instance, pandas DataFrames are quite consciously modeled on their R counterparts.

  • For "pure statistics", the R code is generally simpler than the Python code. For "general computing", Python tends to be easier. For instance, if you wanted to scrape data off a web page and then do some analysis, that scraping step is much easier in Python because the ecosystem (being the product of many people who are not all statisticians) includes very good packages for manipulating HTML files.

Particularly for this last reason, I will focus more on the Python scientific stack than R in the remainder of this course. However, if you are familiar with scientific computing in Python, you should be able to get up to speed with R fairly quickly. (Most of the statistics courses taught at UCSD use R in some form or another.)