This repository contains the course materials from Math 157: Intro to Mathematical Software.
Creative Commons BY-SA 4.0 license.
Math 157: Intro to Mathematical Software
UC San Diego, winter 2018
February 23, 2018: Introduction to R and statistics (part 2 of 2)
Administrivia:
The final project will be assigned shortly. Look for a folder called
assignments/2018-03-16
for both parts.
Added in class:
Attendance scores will be updated soon (hopefully by Monday).
Regarding Homework 6:
For problem 1c, it should read "how would the crossover value depend (asymptotically) on ?" (rather than "on ").
For problem 4a, the values I gave are too big to handle in CoCalc. You may use these parameters instead:
For problem 5c, by "one conclusion you drew from the data" I mean a statement about the "real world". (Fake example: "people born in January have bigger ears than people born in July.")
Pause here for additional questions.
Side-by-side comparison of R and Python
In this lecture, we make a side-by-side comparison of various types of data analysis functionality in R and Python. This is inspired by this blog post from a company called Dataquest, which I had not heard of until I started preparing this lecture. (Reminder: I am still not a statistician! Nor am I a basketball fan, but what the heck.)
In order to switch back and forth efficiently, I will work in SageMath and use the extension I described last time to switch individual cells over to R. Since we'll be using pandas, I'll turn off the Sage preparser.
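As a sketch of that setup (assuming the R bridge is rpy2's IPython extension; the exact mechanism described last time may differ):

```python
# Turn off the Sage preparser so plain Python syntax,
# as expected by pandas, passes through unchanged.
preparser(False)

# Load an R cell magic; a cell beginning with %%R then runs in R.
# (This assumes the rpy2 extension is available.)
%load_ext rpy2.ipython
```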
To begin, we need some data to analyze. Here, we'll use a dataset from Open Source Sports consisting of historical data about men's basketball (NBA) players from (sometime in the past) until 2012. You will find this file in the same folder as this notebook. Let's start by importing this data into R and Python.
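On the Python side, a minimal sketch (the file name basketball_players.csv is an assumption; substitute the actual name of the CSV in this folder):

```python
import pandas as pd

# Read the historical player statistics into a DataFrame.
nba = pd.read_csv("basketball_players.csv")
```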
How big is this dataset?
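In Python, for instance:

```python
# (rows, columns) of the DataFrame read in above.
nba.shape
```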
Let's look at the first couple of rows to get a sense of what the data looks like.
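In Python this is a one-liner:

```python
# Display the first two rows (the display truncates the 42 columns).
nba.head(2)
```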
| playerID | year | stint | tmID | lgID | GP | GS | minutes | points | oRebounds | ... | PostBlocks | PostTurnovers | PostPF | PostfgAttempted | PostfgMade | PostftAttempted | PostftMade | PostthreeAttempted | PostthreeMade | note |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | abramjo01 | 1946 | 1 | PIT | NBA | 47 | 0 | 0 | 527 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
1 | aubucch01 | 1946 | 1 | DTF | NBA | 30 | 0 | 0 | 65 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
2 rows × 42 columns
Let's compute averages of the various statistics. (Not that these are particularly meaningful, but just as a demonstration.)
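A Python sketch:

```python
# Column-wise means of the numeric columns.
nba.mean(numeric_only=True)
```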
Let's look for correlations among columns using a scatterplot.
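In Python, one such scatterplot (using two columns that reappear in the regression below) might look like:

```python
import matplotlib.pyplot as plt

# Each point is one player-season; look for a linear trend.
plt.scatter(nba["fgMade"], nba["assists"], s=2)
plt.xlabel("field goals made")
plt.ylabel("assists")
plt.show()
```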
Let's do a cluster plot. To do this, we need to remove columns which do not contain numeric values.
In R, we used the cluster package. In Python, we used a module from scikit-learn, a widely used package for machine learning. More on what that phrase means shortly.
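A Python sketch of the clustering step (the choice of 5 clusters is arbitrary, made for illustration):

```python
from sklearn.cluster import KMeans

# Keep only numeric columns, and fill missing values so KMeans can run.
numeric = nba.select_dtypes(include="number").fillna(0)

# Group the player-seasons into 5 clusters.
kmeans = KMeans(n_clusters=5, random_state=1)
labels = kmeans.fit_predict(numeric)
```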
Let's plot the clusters we just computed.
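A sketch, projecting onto two dimensions with principal component analysis (explained below) and coloring by cluster:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the numeric data (from the clustering sketch above)
# onto its first two principal components.
coords = PCA(n_components=2).fit_transform(numeric)

# Color each point by the cluster label computed above.
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=2)
plt.show()
```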
These visualizations use principal component analysis. Roughly speaking, this means trying to express a large number of correlated variables in terms of a smaller number of uncorrelated variables. In geometric terms, imagine your data as a collection of points in a high-dimensional space, but suppose that they lie close to some low-dimensional subset; then points within the dataset should be describable in terms of a small number of "independent coordinates". (E.g., if you had points on a sphere, they sit in 3-space but you need only two coordinates to locate them, say latitude and longitude.)
Let's try some machine learning now. The general framework of machine learning is: one has a function from some domain to some codomain, and one would like to be able to "predict" the value at some point of the domain. Of course one can do this by a trivial lookup if one has a full value table for the function; for this to be meaningful, one instead should "train" on a small subset of the domain, then "test" elsewhere on the domain to see if the predictions hold up. It is cheating if the "training" and "testing" data overlap. (A related pitfall, overfitting, is tuning a model so closely to its training data that it fails to generalize to unseen data.)
In this example, we took 80% of the data, sampled randomly, to be the training data and the remaining 20% to be the testing data.
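In Python, scikit-learn provides a convenience function for exactly this split:

```python
from sklearn.model_selection import train_test_split

# Randomly split the rows: 80% for training, 20% for testing.
train, test = train_test_split(nba, test_size=0.2, random_state=1)
```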
Let's try doing some predictions based on linear regression. Say we want to predict assists in terms of field goals.
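A Python sketch using statsmodels (whose output format matches the summary below):

```python
import statsmodels.formula.api as smf

# Ordinary least squares: model assists as a linear function
# of field goals made, fit on the training data.
model = smf.ols("assists ~ fgMade", data=train).fit()

# Predict assists for the held-out test data.
predictions = model.predict(test)
```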
Let's look at some summary results.
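In statsmodels:

```python
# Full regression summary: coefficients, R-squared, diagnostics.
model.summary()
```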
Dep. Variable: | assists | R-squared: | 0.506 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.506 |
Method: | Least Squares | F-statistic: | 1.948e+04 |
Date: | Fri, 23 Feb 2018 | Prob (F-statistic): | 0.00 |
Time: | 22:42:58 | Log-Likelihood: | -1.1344e+05 |
No. Observations: | 19001 | AIC: | 2.269e+05 |
Df Residuals: | 18999 | BIC: | 2.269e+05 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 12.8753 | 0.961 | 13.400 | 0.000 | 10.992 | 14.759 |
fgMade | 0.4989 | 0.004 | 139.556 | 0.000 | 0.492 | 0.506 |
Omnibus: | 10185.359 | Durbin-Watson: | 2.020 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 112607.349 |
Skew: | 2.355 | Prob(JB): | 0.00 |
Kurtosis: | 13.957 | Cond. No. | 376. |
We'll take a closer look at machine learning in a later lecture.
The takeaway
Some differences between R and Python from the point of view of statistics:
R uses a functional approach, where everything you want to do is a named function that you can call. Python uses an object-oriented approach, where most things you want to do are methods of particular types of objects to which they apply. (See the short illustration after this list.)
R includes a lot of basic functionality for statistics by default, whereas Python does not; this functionality has to be loaded from packages.
R has a large ecosystem of small packages, including many for specialized tasks. Python has a smaller ecosystem of larger packages, which include most standard functionality but miss some specialized things.
A lot of functionality on both sides was modeled on the other. For instance, pandas DataFrames are quite consciously modeled on their R counterparts.
For "pure statistics", the R code is generally simpler than the Python code. For "general computing", Python tends to be easier. For instance, if you wanted to scrape data off a web page and then do some analysis, that scraping step is much easier in Python because the ecosystem (being the product of many people who are not all statisticians) includes very good packages for manipulating HTML files.
Particularly for this last reason, I will focus more on the Python scientific stack than R in the remainder of this course. However, if you are familiar with scientific computing in Python, you should be able to get up to speed with R fairly quickly. (Most of the statistics courses taught at UCSD use R in some form or another.)