Contact
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutSign UpSign In
| Download

This repository contains the course materials from Math 157: Intro to Mathematical Software.

Creative Commons BY-SA 4.0 license.

Views: 3037
License: OTHER
Kernel: SageMath 8.1

Math 157: Intro to Mathematical Software

UC San Diego, winter 2018

Homework 7: due March 2, 2018

Please enter all answers within this notebook unless otherwise specified. As usual, don't forget to cite sources and collaborators.

Through this problem set, use the SageMath 8.1 kernel except as specified. You may find the following declarations useful:

preparser(False) # Turn off Sage preparser import pandas as pd # Load pandas import numpy as np # Load numpy import matplotlib # Load matplotlib import matplotlib.pyplot as plt # Load pyplot matplotlib.style.use('ggplot') # Use R-style plotting in matplotlib import seaborn as sns # Load seaborn

This homework consists of 5 problems, each of equal value.

Problem 1: Emulation of R in Python

Grading criteria: correctness of code.

Demonstrate Python analogues of the following R code blocks from the previous homework. Hints:

  • The Python statsmodels module includes the submodule datasets which simulates the corresponding R package.

  • The R pairs function can be simulated using the pandas function scatter_matrix.

  • The seaborn function FacetGrid allows you to set up a grid in which each entry corresponds to a particular value of a conditioning variable. Use this, and the matplotlib scatter plot functionality, to simulate the R function coplot.

  • The statsmodels module mosaicplot can simulate the R function mosaicplot.

from statsmodels import datasets %load_ext rpy2.ipython
%%R pairs(mtcars)
# Your Python equivalent goes here. Warning: this dataset is not available via statsmodels, and the FacetGrid function may crash the kernel.
%%R require(stats); require(graphics) coplot(circumference ~ age | Tree, data = Orange, show.given = FALSE)
# Your Python equivalent goes here.
%%R x <- apply(HairEyeColor, c(1, 2), sum) x mosaicplot(x)
df = datasets.get_rdataset("HairEyeColor")
from statsmodels.graphics.mosaicplot import mosaic mosaic(df,[1,2])
--------------------------------------------------------------------------- IndexError Traceback (most recent call last) <ipython-input-28-ac441a8a3c27> in <module>() 1 from statsmodels.graphics.mosaicplot import mosaic ----> 2 mosaic(df,[Integer(1),Integer(2)]) /ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/graphics/mosaicplot.pyc in mosaic(data, index, ax, horizontal, gap, properties, labelizer, title, statistic, axes_label, label_rotation) 618 fig, ax = utils.create_mpl_ax(ax) 619 # normalize the data to a dict with tuple of strings as keys --> 620 data = _normalize_data(data, index) 621 # split the graph into different areas 622 rects = _hierarchical_split(data, horizontal=horizontal, gap=gap) /ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/graphics/mosaicplot.pyc in _normalize_data(data, index) 315 contingency = OrderedDict() 316 for key, value in iteritems(data): --> 317 new_key = tuple(key[i] for i in index) 318 contingency[new_key] = value 319 data = contingency /ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/graphics/mosaicplot.pyc in <genexpr>((i,)) 315 contingency = OrderedDict() 316 for key, value in iteritems(data): --> 317 new_key = tuple(key[i] for i in index) 318 contingency[new_key] = value 319 data = contingency IndexError: tuple index out of range
Image in a Jupyter notebook
from rpy2.robjects import r, pandas2ri
pandas2ri.ri2py(....)
# Your Python equivalent goes here

Problem 2: Sunspots revisited

Grading criteria: correctness of code and results.

Let sunspots be the sunactivity dataframe (defined below for you).

from statsmodels import datasets sunspots = datasets.sunspots.load_pandas().data.set_index("YEAR")
from statsmodels import datasets sunspots = datasets.sunspots.load_pandas().data.set_index("YEAR")
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-12-1fde0097a6c5> in <module>() ----> 1 s[True] /ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key) 621 key = com._apply_if_callable(key, self) 622 try: --> 623 result = self.index.get_value(self, key) 624 625 if not is_scalar(result): /ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/numeric.pyc in get_value(self, series, key) 343 344 k = _values_from_object(key) --> 345 loc = self.get_loc(k) 346 new_values = _values_from_object(series)[loc] 347 /ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/numeric.pyc in get_loc(self, key, method, tolerance) 402 pass 403 return super(Float64Index, self).get_loc(key, method=method, --> 404 tolerance=tolerance) 405 406 @cache_readonly /ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in get_loc(self, key, method, tolerance) 2522 return self._engine.get_loc(key) 2523 except KeyError: -> 2524 return self._engine.get_loc(self._maybe_cast_indexer(key)) 2525 2526 indexer = self.get_indexer([key], method=method, tolerance=tolerance) pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Float64HashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Float64HashTable.get_item() KeyError: 1.0

2a. For how many years was the activity 100\geq 100?

2b. Make a histogram plot of all activity from 1900 to the end of the dataset.

sunspots[1900:]
SUNACTIVITY
YEAR
1900.0 9.5
1901.0 2.7
1902.0 5.0
1903.0 24.4
1904.0 42.0
1905.0 63.5
1906.0 53.8
1907.0 62.0
1908.0 48.5
1909.0 43.9
1910.0 18.6
1911.0 5.7
1912.0 3.6
1913.0 1.4
1914.0 9.6
1915.0 47.4
1916.0 57.1
1917.0 103.9
1918.0 80.6
1919.0 63.6
1920.0 37.6
1921.0 26.1
1922.0 14.2
1923.0 5.8
1924.0 16.7
1925.0 44.3
1926.0 63.9
1927.0 69.0
1928.0 77.8
1929.0 64.9
... ...
1979.0 155.4
1980.0 154.6
1981.0 140.4
1982.0 115.9
1983.0 66.6
1984.0 45.9
1985.0 17.9
1986.0 13.4
1987.0 29.4
1988.0 100.2
1989.0 157.6
1990.0 142.6
1991.0 145.7
1992.0 94.3
1993.0 54.6
1994.0 29.9
1995.0 17.5
1996.0 8.6
1997.0 21.5
1998.0 64.3
1999.0 93.3
2000.0 119.6
2001.0 111.0
2002.0 104.0
2003.0 63.7
2004.0 40.4
2005.0 29.8
2006.0 15.2
2007.0 7.5
2008.0 2.9

109 rows × 1 columns

2c. Which year(s) had the highest activity?

Problem 3: Pivot tables

Grading criteria: correctness of code and explanations.

3a. Load the "mpg" R dataset from the Python ggplot library into the variable mpg.

from ggplot import mpg mpg.describe()
displ year cyl cty hwy
count 234.000000 234.000000 234.000000 234.000000 234.000000
mean 3.471795 2003.500000 5.888889 16.858974 23.440171
std 1.291959 4.509646 1.611534 4.255946 5.954643
min 1.600000 1999.000000 4.000000 9.000000 12.000000
25% 2.400000 1999.000000 4.000000 14.000000 18.000000
50% 3.300000 2003.500000 6.000000 17.000000 24.000000
75% 4.600000 2008.000000 8.000000 19.000000 27.000000
max 7.000000 2008.000000 8.000000 35.000000 44.000000

3b. Using the pandas pivot_table command, cretae a pandas DataFrame that tells you the average "cty" and "hwy" (city and highway miles per gallon) for each manufacturer.

3c. Has the average city mileage improved from 1999 to 2008? Has the average highway mileage improved from 1999 to 2008?

3d. Create a scatterplot of pairs (displ, hwy) for all cars in 1999, and another for all cars in 2008.

3e. What effect does increasing displacement have on highway gas mileage?

Problem 4: Irises

Grading criteria: correctness of code and explanations.

The iris dataset is a famous example used in statistics education.

4a. Load the iris dataset into a pandas DataFrame and use the describe command to see some basic statistics.

import statsmodels.api as sm iris = sm.datasets.get_rdataset('iris').data iris
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead. from pandas.core import datetools
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa
10 5.4 3.7 1.5 0.2 setosa
11 4.8 3.4 1.6 0.2 setosa
12 4.8 3.0 1.4 0.1 setosa
13 4.3 3.0 1.1 0.1 setosa
14 5.8 4.0 1.2 0.2 setosa
15 5.7 4.4 1.5 0.4 setosa
16 5.4 3.9 1.3 0.4 setosa
17 5.1 3.5 1.4 0.3 setosa
18 5.7 3.8 1.7 0.3 setosa
19 5.1 3.8 1.5 0.3 setosa
20 5.4 3.4 1.7 0.2 setosa
21 5.1 3.7 1.5 0.4 setosa
22 4.6 3.6 1.0 0.2 setosa
23 5.1 3.3 1.7 0.5 setosa
24 4.8 3.4 1.9 0.2 setosa
25 5.0 3.0 1.6 0.2 setosa
26 5.0 3.4 1.6 0.4 setosa
27 5.2 3.5 1.5 0.2 setosa
28 5.2 3.4 1.4 0.2 setosa
29 4.7 3.2 1.6 0.2 setosa
... ... ... ... ... ...
120 6.9 3.2 5.7 2.3 virginica
121 5.6 2.8 4.9 2.0 virginica
122 7.7 2.8 6.7 2.0 virginica
123 6.3 2.7 4.9 1.8 virginica
124 6.7 3.3 5.7 2.1 virginica
125 7.2 3.2 6.0 1.8 virginica
126 6.2 2.8 4.8 1.8 virginica
127 6.1 3.0 4.9 1.8 virginica
128 6.4 2.8 5.6 2.1 virginica
129 7.2 3.0 5.8 1.6 virginica
130 7.4 2.8 6.1 1.9 virginica
131 7.9 3.8 6.4 2.0 virginica
132 6.4 2.8 5.6 2.2 virginica
133 6.3 2.8 5.1 1.5 virginica
134 6.1 2.6 5.6 1.4 virginica
135 7.7 3.0 6.1 2.3 virginica
136 6.3 3.4 5.6 2.4 virginica
137 6.4 3.1 5.5 1.8 virginica
138 6.0 3.0 4.8 1.8 virginica
139 6.9 3.1 5.4 2.1 virginica
140 6.7 3.1 5.6 2.4 virginica
141 6.9 3.1 5.1 2.3 virginica
142 5.8 2.7 5.1 1.9 virginica
143 6.8 3.2 5.9 2.3 virginica
144 6.7 3.3 5.7 2.5 virginica
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

4b. Plot all of the sepal (length, width) pairs in a scatterplot, and the petal (length, width) pairs in another scatterplot.

import matplotlib.pyplot as plt #x_data = iris.iloc[:,0] #x_data y_data = iris['Sepal.Width'] y_data #plt.scatter(x_data, y_data)
0 3.5 1 3.0 2 3.2 3 3.1 4 3.6 5 3.9 6 3.4 7 3.4 8 2.9 9 3.1 10 3.7 11 3.4 12 3.0 13 3.0 14 4.0 15 4.4 16 3.9 17 3.5 18 3.8 19 3.8 20 3.4 21 3.7 22 3.6 23 3.3 24 3.4 25 3.0 26 3.4 27 3.5 28 3.4 29 3.2 ... 120 3.2 121 2.8 122 2.8 123 2.7 124 3.3 125 3.2 126 2.8 127 3.0 128 2.8 129 3.0 130 2.8 131 3.8 132 2.8 133 2.8 134 2.6 135 3.0 136 3.4 137 3.1 138 3.0 139 3.1 140 3.1 141 3.1 142 2.7 143 3.2 144 3.3 145 3.0 146 2.5 147 3.0 148 3.4 149 3.0 Name: Sepal.Width, Length: 150, dtype: float64

4c. Compute the average petal width for each of the "species"-categories.

Problem 5: Machine learning with irises

Grading criteria: correctness and relevance of code.

5a. The Wikipedia article on the iris dataset asserts:

The use of this data set in cluster analysis however is not common, since the data set only contains two clusters with rather obvious separation.

Demonstrate this by performing a clustering computation and showing that it fails to separate the three species.

5b. Use the scikit-learn SVM classifier to classify species. Use a random sample of 80% of the initial data for training and the other 20% for testing, and report the accuracy rate of your predictions.

from rpy2.robjects import r, pandas2ri import pandas as pd import matplotlib.pyplot as plt pandas2ri.activate() iris = pd.DataFrame(r['iris'])
iris.axes()
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-43-46df5a53509d> in <module>() ----> 1 iris.labels /ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name) 3608 if (name in self._internal_names_set or name in self._metadata or 3609 name in self._accessors): -> 3610 return object.__getattribute__(self, name) 3611 else: 3612 if name in self._info_axis: AttributeError: 'DataFrame' object has no attribute 'labels'
from sklearn import svm from sklearn.utils import shuffle import statsmodels.formula.api as sm iris = shuffle(iris) clf = svm.SVC(gamma=0.001, C=100.) train = iris.sample(frac=0.8) test = iris.loc[~iris.index.isin(train.index)] clf.fit(train[['Sepal.Length','Sepal.Width','Petal.Length','Petal.Width']],train['Species']) predictions = clf.predict(test[['Sepal.Length','Sepal.Width','Petal.Length','Petal.Width']]) predictions model = sm.ols(formula='Species ~ Sepal, Petal', data=train) fitted = model.fit() fitted.summary()
--------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-41-8b82712dca2c> in <module>() 9 predictions = clf.predict(test[['Sepal.Length','Sepal.Width','Petal.Length','Petal.Width']]) 10 predictions ---> 11 model = sm.ols(formula='Species ~ Sepal, Petal', data=train) 12 fitted = model.fit() 13 fitted.summary() /ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/base/model.pyc in from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs) 153 154 tmp = handle_formula_data(data, None, formula, depth=eval_env, --> 155 missing=missing) 156 ((endog, exog), missing_idx, design_info) = tmp 157 /ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/formula/formulatools.pyc in handle_formula_data(Y, X, formula, depth, missing) 63 if data_util._is_using_pandas(Y, None): 64 result = dmatrices(formula, Y, depth, return_type='dataframe', ---> 65 NA_action=na_action) 66 else: 67 result = dmatrices(formula, Y, depth, return_type='dataframe', /ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/highlevel.pyc in dmatrices(formula_like, data, eval_env, NA_action, return_type) 308 eval_env = EvalEnvironment.capture(eval_env, reference=1) 309 (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env, --> 310 NA_action, return_type) 311 if lhs.shape[1] == 0: 312 raise PatsyError("model is missing required outcome variables") /ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/highlevel.pyc in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type) 163 return iter([data]) 164 design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env, --> 165 NA_action) 166 if design_infos is not None: 167 return build_design_matrices(design_infos, data, /ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/highlevel.pyc in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action) 68 data_iter_maker, 69 eval_env, ---> 70 NA_action) 71 else: 72 return None /ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/build.pyc in design_matrix_builders(termlists, data_iter_maker, eval_env, NA_action) 694 factor_states, 695 data_iter_maker, --> 696 NA_action) 697 # Now we need the factor infos, which encapsulate the knowledge of 698 # how to turn any given factor into a chunk of data: /ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/build.pyc in _examine_factor_types(factors, factor_states, data_iter_maker, NA_action) 441 for data in data_iter_maker(): 442 for factor in list(examine_needed): --> 443 value = factor.eval(factor_states[factor], data) 444 if factor in cat_sniffers or guess_categorical(value): 445 if factor not in cat_sniffers: /ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/eval.pyc in eval(self, memorize_state, data) 564 return self._eval(memorize_state["eval_code"], 565 memorize_state, --> 566 data) 567 568 __getstate__ = no_pickling /ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/eval.pyc in _eval(self, code, memorize_state, data) 549 memorize_state["eval_env"].eval, 550 code, --> 551 inner_namespace=inner_namespace) 552 553 def memorize_chunk(self, state, which_pass, data): /ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/compat.pyc in call_and_wrap_exc(msg, origin, f, *args, **kwargs) 115 def call_and_wrap_exc(msg, origin, f, *args, **kwargs): 116 try: --> 117 return f(*args, **kwargs) 118 except Exception as e: 119 if sys.version_info[0] >= 3: /ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/eval.pyc in eval(self, expr, source_name, inner_namespace) 164 code = compile(expr, source_name, "eval", self.flags, False) 165 return eval(code, {}, VarLookupDict([inner_namespace] --> 166 + self._namespaces)) 167 168 @classmethod <string> in <module>() NameError: name 'Sepal' is not defined
sm.ols?