This repository contains the course materials from Math 157: Intro to Mathematical Software.
Creative Commons BY-SA 4.0 license.
License: OTHER
Math 157: Intro to Mathematical Software
UC San Diego, winter 2018
Homework 7: due March 2, 2018
Please enter all answers within this notebook unless otherwise specified. As usual, don't forget to cite sources and collaborators.
Through this problem set, use the SageMath 8.1 kernel except as specified. You may find the following declarations useful:
This homework consists of 5 problems, each of equal value.
Problem 1: Emulation of R in Python
Grading criteria: correctness of code.
Demonstrate Python analogues of the following R code blocks from the previous homework. Hints:
The Python
statsmodels
module includes the submoduledatasets
which simulates the corresponding R package.The R
pairs
function can be simulated using the pandas functionscatter_matrix
.The seaborn function
FacetGrid
allows you to set up a grid in which each entry corresponds to a particular value of a conditioning variable. Use this, and the matplotlib scatter plot functionality, to simulate the R functioncoplot
.The statsmodels module
mosaicplot
can simulate the R functionmosaicplot
.
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-28-ac441a8a3c27> in <module>()
1 from statsmodels.graphics.mosaicplot import mosaic
----> 2 mosaic(df,[Integer(1),Integer(2)])
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/graphics/mosaicplot.pyc in mosaic(data, index, ax, horizontal, gap, properties, labelizer, title, statistic, axes_label, label_rotation)
618 fig, ax = utils.create_mpl_ax(ax)
619 # normalize the data to a dict with tuple of strings as keys
--> 620 data = _normalize_data(data, index)
621 # split the graph into different areas
622 rects = _hierarchical_split(data, horizontal=horizontal, gap=gap)
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/graphics/mosaicplot.pyc in _normalize_data(data, index)
315 contingency = OrderedDict()
316 for key, value in iteritems(data):
--> 317 new_key = tuple(key[i] for i in index)
318 contingency[new_key] = value
319 data = contingency
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/graphics/mosaicplot.pyc in <genexpr>((i,))
315 contingency = OrderedDict()
316 for key, value in iteritems(data):
--> 317 new_key = tuple(key[i] for i in index)
318 contingency[new_key] = value
319 data = contingency
IndexError: tuple index out of range
Problem 2: Sunspots revisited
Grading criteria: correctness of code and results.
Let sunspots
be the sunactivity dataframe (defined below for you).
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-12-1fde0097a6c5> in <module>()
----> 1 s[True]
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
621 key = com._apply_if_callable(key, self)
622 try:
--> 623 result = self.index.get_value(self, key)
624
625 if not is_scalar(result):
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/numeric.pyc in get_value(self, series, key)
343
344 k = _values_from_object(key)
--> 345 loc = self.get_loc(k)
346 new_values = _values_from_object(series)[loc]
347
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/numeric.pyc in get_loc(self, key, method, tolerance)
402 pass
403 return super(Float64Index, self).get_loc(key, method=method,
--> 404 tolerance=tolerance)
405
406 @cache_readonly
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in get_loc(self, key, method, tolerance)
2522 return self._engine.get_loc(key)
2523 except KeyError:
-> 2524 return self._engine.get_loc(self._maybe_cast_indexer(key))
2525
2526 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Float64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Float64HashTable.get_item()
KeyError: 1.0
2a. For how many years was the activity ?
2b. Make a histogram plot of all activity from 1900 to the end of the dataset.
SUNACTIVITY | |
---|---|
YEAR | |
1900.0 | 9.5 |
1901.0 | 2.7 |
1902.0 | 5.0 |
1903.0 | 24.4 |
1904.0 | 42.0 |
1905.0 | 63.5 |
1906.0 | 53.8 |
1907.0 | 62.0 |
1908.0 | 48.5 |
1909.0 | 43.9 |
1910.0 | 18.6 |
1911.0 | 5.7 |
1912.0 | 3.6 |
1913.0 | 1.4 |
1914.0 | 9.6 |
1915.0 | 47.4 |
1916.0 | 57.1 |
1917.0 | 103.9 |
1918.0 | 80.6 |
1919.0 | 63.6 |
1920.0 | 37.6 |
1921.0 | 26.1 |
1922.0 | 14.2 |
1923.0 | 5.8 |
1924.0 | 16.7 |
1925.0 | 44.3 |
1926.0 | 63.9 |
1927.0 | 69.0 |
1928.0 | 77.8 |
1929.0 | 64.9 |
... | ... |
1979.0 | 155.4 |
1980.0 | 154.6 |
1981.0 | 140.4 |
1982.0 | 115.9 |
1983.0 | 66.6 |
1984.0 | 45.9 |
1985.0 | 17.9 |
1986.0 | 13.4 |
1987.0 | 29.4 |
1988.0 | 100.2 |
1989.0 | 157.6 |
1990.0 | 142.6 |
1991.0 | 145.7 |
1992.0 | 94.3 |
1993.0 | 54.6 |
1994.0 | 29.9 |
1995.0 | 17.5 |
1996.0 | 8.6 |
1997.0 | 21.5 |
1998.0 | 64.3 |
1999.0 | 93.3 |
2000.0 | 119.6 |
2001.0 | 111.0 |
2002.0 | 104.0 |
2003.0 | 63.7 |
2004.0 | 40.4 |
2005.0 | 29.8 |
2006.0 | 15.2 |
2007.0 | 7.5 |
2008.0 | 2.9 |
109 rows × 1 columns
2c. Which year(s) had the highest activity?
Problem 3: Pivot tables
Grading criteria: correctness of code and explanations.
3a. Load the "mpg" R dataset from the Python ggplot library into the variable mpg
.
displ | year | cyl | cty | hwy | |
---|---|---|---|---|---|
count | 234.000000 | 234.000000 | 234.000000 | 234.000000 | 234.000000 |
mean | 3.471795 | 2003.500000 | 5.888889 | 16.858974 | 23.440171 |
std | 1.291959 | 4.509646 | 1.611534 | 4.255946 | 5.954643 |
min | 1.600000 | 1999.000000 | 4.000000 | 9.000000 | 12.000000 |
25% | 2.400000 | 1999.000000 | 4.000000 | 14.000000 | 18.000000 |
50% | 3.300000 | 2003.500000 | 6.000000 | 17.000000 | 24.000000 |
75% | 4.600000 | 2008.000000 | 8.000000 | 19.000000 | 27.000000 |
max | 7.000000 | 2008.000000 | 8.000000 | 35.000000 | 44.000000 |
3b. Using the pandas pivot_table
command, cretae a pandas DataFrame that tells you the average "cty" and "hwy" (city and highway miles per gallon) for each manufacturer.
3c. Has the average city mileage improved from 1999 to 2008? Has the average highway mileage improved from 1999 to 2008?
3d. Create a scatterplot of pairs (displ, hwy) for all cars in 1999, and another for all cars in 2008.
3e. What effect does increasing displacement have on highway gas mileage?
Problem 4: Irises
Grading criteria: correctness of code and explanations.
The iris dataset is a famous example used in statistics education.
4a. Load the iris dataset into a pandas DataFrame and use the describe
command to see some basic statistics.
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
11 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
12 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
13 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
14 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
15 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
16 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
17 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
18 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
19 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
20 | 5.4 | 3.4 | 1.7 | 0.2 | setosa |
21 | 5.1 | 3.7 | 1.5 | 0.4 | setosa |
22 | 4.6 | 3.6 | 1.0 | 0.2 | setosa |
23 | 5.1 | 3.3 | 1.7 | 0.5 | setosa |
24 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
25 | 5.0 | 3.0 | 1.6 | 0.2 | setosa |
26 | 5.0 | 3.4 | 1.6 | 0.4 | setosa |
27 | 5.2 | 3.5 | 1.5 | 0.2 | setosa |
28 | 5.2 | 3.4 | 1.4 | 0.2 | setosa |
29 | 4.7 | 3.2 | 1.6 | 0.2 | setosa |
... | ... | ... | ... | ... | ... |
120 | 6.9 | 3.2 | 5.7 | 2.3 | virginica |
121 | 5.6 | 2.8 | 4.9 | 2.0 | virginica |
122 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
123 | 6.3 | 2.7 | 4.9 | 1.8 | virginica |
124 | 6.7 | 3.3 | 5.7 | 2.1 | virginica |
125 | 7.2 | 3.2 | 6.0 | 1.8 | virginica |
126 | 6.2 | 2.8 | 4.8 | 1.8 | virginica |
127 | 6.1 | 3.0 | 4.9 | 1.8 | virginica |
128 | 6.4 | 2.8 | 5.6 | 2.1 | virginica |
129 | 7.2 | 3.0 | 5.8 | 1.6 | virginica |
130 | 7.4 | 2.8 | 6.1 | 1.9 | virginica |
131 | 7.9 | 3.8 | 6.4 | 2.0 | virginica |
132 | 6.4 | 2.8 | 5.6 | 2.2 | virginica |
133 | 6.3 | 2.8 | 5.1 | 1.5 | virginica |
134 | 6.1 | 2.6 | 5.6 | 1.4 | virginica |
135 | 7.7 | 3.0 | 6.1 | 2.3 | virginica |
136 | 6.3 | 3.4 | 5.6 | 2.4 | virginica |
137 | 6.4 | 3.1 | 5.5 | 1.8 | virginica |
138 | 6.0 | 3.0 | 4.8 | 1.8 | virginica |
139 | 6.9 | 3.1 | 5.4 | 2.1 | virginica |
140 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
141 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |
142 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
143 | 6.8 | 3.2 | 5.9 | 2.3 | virginica |
144 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
4b. Plot all of the sepal (length, width) pairs in a scatterplot, and the petal (length, width) pairs in another scatterplot.
4c. Compute the average petal width for each of the "species"-categories.
Problem 5: Machine learning with irises
Grading criteria: correctness and relevance of code.
5a. The Wikipedia article on the iris dataset asserts:
The use of this data set in cluster analysis however is not common, since the data set only contains two clusters with rather obvious separation.
Demonstrate this by performing a clustering computation and showing that it fails to separate the three species.
5b. Use the scikit-learn SVM classifier to classify species. Use a random sample of 80% of the initial data for training and the other 20% for testing, and report the accuracy rate of your predictions.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-43-46df5a53509d> in <module>()
----> 1 iris.labels
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
3608 if (name in self._internal_names_set or name in self._metadata or
3609 name in self._accessors):
-> 3610 return object.__getattribute__(self, name)
3611 else:
3612 if name in self._info_axis:
AttributeError: 'DataFrame' object has no attribute 'labels'
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-41-8b82712dca2c> in <module>()
9 predictions = clf.predict(test[['Sepal.Length','Sepal.Width','Petal.Length','Petal.Width']])
10 predictions
---> 11 model = sm.ols(formula='Species ~ Sepal, Petal', data=train)
12 fitted = model.fit()
13 fitted.summary()
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/base/model.pyc in from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
153
154 tmp = handle_formula_data(data, None, formula, depth=eval_env,
--> 155 missing=missing)
156 ((endog, exog), missing_idx, design_info) = tmp
157
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/formula/formulatools.pyc in handle_formula_data(Y, X, formula, depth, missing)
63 if data_util._is_using_pandas(Y, None):
64 result = dmatrices(formula, Y, depth, return_type='dataframe',
---> 65 NA_action=na_action)
66 else:
67 result = dmatrices(formula, Y, depth, return_type='dataframe',
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/highlevel.pyc in dmatrices(formula_like, data, eval_env, NA_action, return_type)
308 eval_env = EvalEnvironment.capture(eval_env, reference=1)
309 (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 310 NA_action, return_type)
311 if lhs.shape[1] == 0:
312 raise PatsyError("model is missing required outcome variables")
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/highlevel.pyc in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
163 return iter([data])
164 design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 165 NA_action)
166 if design_infos is not None:
167 return build_design_matrices(design_infos, data,
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/highlevel.pyc in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
68 data_iter_maker,
69 eval_env,
---> 70 NA_action)
71 else:
72 return None
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/build.pyc in design_matrix_builders(termlists, data_iter_maker, eval_env, NA_action)
694 factor_states,
695 data_iter_maker,
--> 696 NA_action)
697 # Now we need the factor infos, which encapsulate the knowledge of
698 # how to turn any given factor into a chunk of data:
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/build.pyc in _examine_factor_types(factors, factor_states, data_iter_maker, NA_action)
441 for data in data_iter_maker():
442 for factor in list(examine_needed):
--> 443 value = factor.eval(factor_states[factor], data)
444 if factor in cat_sniffers or guess_categorical(value):
445 if factor not in cat_sniffers:
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/eval.pyc in eval(self, memorize_state, data)
564 return self._eval(memorize_state["eval_code"],
565 memorize_state,
--> 566 data)
567
568 __getstate__ = no_pickling
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/eval.pyc in _eval(self, code, memorize_state, data)
549 memorize_state["eval_env"].eval,
550 code,
--> 551 inner_namespace=inner_namespace)
552
553 def memorize_chunk(self, state, which_pass, data):
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/compat.pyc in call_and_wrap_exc(msg, origin, f, *args, **kwargs)
115 def call_and_wrap_exc(msg, origin, f, *args, **kwargs):
116 try:
--> 117 return f(*args, **kwargs)
118 except Exception as e:
119 if sys.version_info[0] >= 3:
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/eval.pyc in eval(self, expr, source_name, inner_namespace)
164 code = compile(expr, source_name, "eval", self.flags, False)
165 return eval(code, {}, VarLookupDict([inner_namespace]
--> 166 + self._namespaces))
167
168 @classmethod
<string> in <module>()
NameError: name 'Sepal' is not defined