This repository contains the course materials from Math 157: Intro to Mathematical Software.
Creative Commons BY-SA 4.0 license.
Math 157: Intro to Mathematical Software
UC San Diego, winter 2018
Homework 7: due March 2, 2018
Please enter all answers within this notebook unless otherwise specified. As usual, don't forget to cite sources and collaborators.
Throughout this problem set, use the SageMath 8.1 kernel except as specified. You may find the following declarations useful:
This homework consists of 5 problems, each of equal value.
Problem 1: Emulation of R in Python
Grading criteria: correctness of code.
Demonstrate Python analogues of the following R code blocks from the previous homework. Hints:
- The Python `statsmodels` module includes the submodule `datasets`, which simulates the corresponding R package.
- The R `pairs` function can be simulated using the pandas function `scatter_matrix`.
- The seaborn function `FacetGrid` allows you to set up a grid in which each entry corresponds to a particular value of a conditioning variable. Use this, and the matplotlib scatter plot functionality, to simulate the R function `coplot`.
- The statsmodels module `mosaicplot` can simulate the R function `mosaicplot`.
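As a rough illustration of the `scatter_matrix` hint, the sketch below calls `pandas.plotting.scatter_matrix` on a small synthetic DataFrame. The column names and random values are placeholders, not one of the R datasets from the previous homework, and the `Agg` backend is selected only so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; unnecessary inside a notebook

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic placeholder data (an assumption): three numeric columns, 50 rows.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x", "y", "z"])

# One panel per pair of columns, in the spirit of R's pairs(df);
# the diagonal shows a histogram of each column.
axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")

# The result is a square array of matplotlib Axes, one row/column per variable.
grid_shape = axes.shape
```

With a real dataset in place of `df`, the same call gives the pairwise view that `pairs` produces in R.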
Problem 2: Sunspots revisited
Grading criteria: correctness of code and results.
Let `sunspots` be the sunactivity dataframe (defined below for you).
2a. For how many years was the activity ?
2b. Make a histogram plot of all activity from 1900 to the end of the dataset.
2c. Which year(s) had the highest activity?
| YEAR | SUNACTIVITY |
|---|---|
| 1957.0 | 190.2 |
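The three parts of this problem come down to a boolean mask, a label slice, and a maximum lookup. The sketch below uses a toy stand-in for `sunspots`: only the 1957 value is taken from the output above, the other rows are invented placeholders, and the cutoff `threshold` is hypothetical because the problem's actual cutoff is not reproduced in this text.

```python
import pandas as pd

# Toy stand-in for the sunactivity dataframe (YEAR index, SUNACTIVITY column).
# Only the 1957 -> 190.2 row comes from the output above; the rest is made up.
sunspots = pd.DataFrame(
    {"SUNACTIVITY": [40.0, 190.2, 15.5, 110.3]},
    index=pd.Index([1899.0, 1957.0, 1976.0, 1980.0], name="YEAR"),
)

# 2a-style count: years whose activity exceeds a cutoff (hypothetical value).
threshold = 100.0
n_years = int((sunspots["SUNACTIVITY"] > threshold).sum())

# 2b-style selection: label-based slice from 1900 to the end of the data.
recent = sunspots.loc[1900.0:]

# 2c-style lookup: keep every row that attains the maximum, so ties survive.
peak = sunspots["SUNACTIVITY"].max()
peak_years = sunspots[sunspots["SUNACTIVITY"] == peak]
```

In a notebook, `recent["SUNACTIVITY"].hist()` would then draw the histogram for part 2b.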
Problem 3: Pivot tables
Grading criteria: correctness of code and explanations.
3a. Load the "mpg" R dataset from the Python ggplot library into the variable `mpg`.
| | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| 1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| 3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| 4 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
3b. Using the pandas `pivot_table` command, create a pandas DataFrame that tells you the average "cty" and "hwy" (city and highway miles per gallon) for each manufacturer.
| manufacturer | cty | hwy |
|---|---|---|
| audi | 17.611111 | 26.444444 |
| chevrolet | 15.000000 | 21.894737 |
| dodge | 13.135135 | 17.945946 |
| ford | 14.000000 | 19.360000 |
| honda | 24.444444 | 32.555556 |
| hyundai | 18.642857 | 26.857143 |
| jeep | 13.500000 | 17.625000 |
| land rover | 11.500000 | 16.500000 |
| lincoln | 11.333333 | 17.000000 |
| mercury | 13.250000 | 18.000000 |
| nissan | 18.076923 | 24.615385 |
| pontiac | 17.000000 | 26.400000 |
| subaru | 19.285714 | 25.571429 |
| toyota | 18.529412 | 24.911765 |
| volkswagen | 20.925926 | 29.222222 |
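A table of this shape can be produced with `pandas.pivot_table`. The sketch below is an illustration rather than the notebook's own cell: it rebuilds only the five `mpg` rows printed in 3a (the name `mpg_head` is made up here), so its averages differ from the full-dataset values in the table.

```python
import pandas as pd

# The five mpg rows shown in 3a (all Audi A4), restricted to the relevant columns.
mpg_head = pd.DataFrame({
    "manufacturer": ["audi"] * 5,
    "year": [1999, 1999, 2008, 2008, 1999],
    "cty": [18, 21, 20, 21, 16],
    "hwy": [29, 29, 31, 30, 26],
})

# Mean city/highway mileage per manufacturer ("mean" is also the default aggfunc).
by_maker = pd.pivot_table(
    mpg_head, index="manufacturer", values=["cty", "hwy"], aggfunc="mean"
)

# Changing the index column groups by model year instead.
by_year = pd.pivot_table(
    mpg_head, index="year", values=["cty", "hwy"], aggfunc="mean"
)
```

Replacing `mpg_head` with the full `mpg` DataFrame should give one row per manufacturer or per model year, respectively.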
3c. Has the average city mileage improved from 1999 to 2008? Has the average highway mileage improved from 1999 to 2008?
| year | cty | hwy |
|---|---|---|
| 1999 | 17.017094 | 23.427350 |
| 2008 | 16.700855 | 23.452991 |
3d. Create a scatterplot of pairs (displ, hwy) for all cars in 1999, and another for all cars in 2008.
3e. What effect does increasing displacement have on highway gas mileage?
Greater displacement tends to make highway gas mileage worse.
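One way to back this statement with a number is a correlation coefficient between `displ` and `hwy`. The sketch below computes it on just the five rows printed in 3a, so it only illustrates the call; a real conclusion needs the full `mpg` DataFrame. The name `mpg_head` is made up for this example.

```python
import pandas as pd

# displ and hwy for the five mpg rows shown in 3a (not the full dataset).
mpg_head = pd.DataFrame({
    "displ": [1.8, 1.8, 2.0, 2.0, 2.8],
    "hwy": [29, 29, 31, 30, 26],
})

# Pearson correlation; a negative value means hwy tends to fall as displ rises.
r = mpg_head["displ"].corr(mpg_head["hwy"])
```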
Problem 4: Irises
Grading criteria: correctness of code and explanations.
The iris dataset is a famous example used in statistics education.
4a. Load the iris dataset into a pandas DataFrame and use the `describe` command to see some basic statistics.
| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
4b. Plot all of the sepal (length, width) pairs in a scatterplot, and the petal (length, width) pairs in another scatterplot.
4c. Compute the average petal width for each of the "species"-categories.
| species | petal_width |
|---|---|
| setosa | 0.246 |
| versicolor | 1.326 |
| virginica | 2.026 |
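A per-species average like this is a one-line `groupby`. The sketch below assumes the copy of the iris data bundled with scikit-learn (the text does not show how the notebook loaded it) and renames the columns to match the headers used above.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled iris data and rename its columns
# to the names that appear in the tables above.
data = load_iris()
iris = pd.DataFrame(
    data.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
iris["species"] = data.target_names[data.target]  # integer codes -> species names

# Basic statistics (4a) and the mean petal width per species (4c).
summary = iris.describe()
mean_width = iris.groupby("species")["petal_width"].mean()
```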
Problem 5: Machine learning with irises
Grading criteria: correctness and relevance of code.
5a. The Wikipedia article on the iris dataset asserts:
The use of this data set in cluster analysis however is not common, since the data set only contains two clusters with rather obvious separation.
Demonstrate this by performing a clustering computation and showing that it fails to separate the three species.
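One possible clustering computation is k-means with three clusters, followed by a cross-tabulation of cluster labels against the true species. The sketch below is an assumption about how to approach 5a, not the notebook's own solution; the choice of `KMeans`, `n_init=10`, and `random_state=0` is illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

data = load_iris()
species = pd.Series(data.target_names[data.target], name="species")

# Ask for three clusters, one per species, using the four measurements only.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = pd.Series(km.fit_predict(data.data), name="cluster")

# Rows are true species, columns are cluster labels, entries are counts.
table = pd.crosstab(species, clusters)

# A perfect separation would leave exactly three nonzero cells, one per row.
nonzero_cells = int((table.values > 0).sum())
```

If a species row in `table` spreads over more than one column, that species was split across clusters, which is the failure 5a asks you to exhibit.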
5b. Use the scikit-learn SVM classifier to classify species. Use a random sample of 80% of the initial data for training and the other 20% for testing, and report the accuracy rate of your predictions.
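A minimal sketch of the 80/20 workflow with scikit-learn follows. The default `SVC` settings and the fixed `random_state=0` are illustrative assumptions; the problem only asks for a random split, so the reported accuracy will vary with the seed.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_iris()

# Hold out a random 20% of the 150 samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

# Fit a support vector classifier with default settings on the training part.
clf = SVC()
clf.fit(X_train, y_train)

# Fraction of held-out samples whose species is predicted correctly.
accuracy = clf.score(X_test, y_test)
```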