CoCalc -- 2018-03-02.ipynb

| Download

This repository contains the course materials from Math 157: Intro to Mathematical Software.

Creative Commons BY-SA 4.0 license.

Project: Support and Testing

Path: math157 / assignments / 2018-03-02 / 2018-03-02.ipynb

Views: ³⁰³⁷
License: OTHER

Kernel: SageMath 8.1

Math 157: Intro to Mathematical Software

UC San Diego, winter 2018

Homework 7: due March 2, 2018

Please enter all answers within this notebook unless otherwise specified. As usual, don't forget to cite sources and collaborators.

Through this problem set, use the SageMath 8.1 kernel except as specified. You may find the following declarations useful:

preparser(False)                    # Turn off Sage preparser
import pandas as pd                 # Load pandas
import numpy as np                  # Load numpy
import matplotlib                   # Load matplotlib
import matplotlib.pyplot as plt     # Load pyplot
matplotlib.style.use('ggplot')      # Use R-style plotting in matplotlib
import seaborn as sns               # Load seaborn

This homework consists of 5 problems, each of equal value.

Problem 1: Emulation of R in Python

Grading criteria: correctness of code.

Demonstrate Python analogues of the following R code blocks from the previous homework. Hints:

The Python statsmodels module includes the submodule datasets which simulates the corresponding R package.
The R pairs function can be simulated using the pandas function scatter_matrix.
The seaborn function FacetGrid allows you to set up a grid in which each entry corresponds to a particular value of a conditioning variable. Use this, and the matplotlib scatter plot functionality, to simulate the R function coplot.
The statsmodels module mosaicplot can simulate the R function mosaicplot.

In [13]:

from statsmodels import datasets
%load_ext rpy2.ipython

In [0]:

%%R
pairs(mtcars)

In [1]:

# Your Python equivalent goes here.  Warning: this dataset is not available via statsmodels, and the FacetGrid function may crash the kernel.

In [0]:

%%R
require(stats); require(graphics)
coplot(circumference ~ age | Tree, data = Orange, show.given = FALSE)

In [12]:

# Your Python equivalent goes here.

In [0]:

%%R
x <- apply(HairEyeColor, c(1, 2), sum)
x
mosaicplot(x)

In [20]:

df = datasets.get_rdataset("HairEyeColor")

In [28]:

from statsmodels.graphics.mosaicplot import mosaic
mosaic(df,[1,2])

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-28-ac441a8a3c27> in <module>()
      1 from statsmodels.graphics.mosaicplot import mosaic
----> 2 mosaic(df,[Integer(1),Integer(2)])

/ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/graphics/mosaicplot.pyc in mosaic(data, index, ax, horizontal, gap, properties, labelizer, title, statistic, axes_label, label_rotation)
    618     fig, ax = utils.create_mpl_ax(ax)
    619     # normalize the data to a dict with tuple of strings as keys
--> 620     data = _normalize_data(data, index)
    621     # split the graph into different areas
    622     rects = _hierarchical_split(data, horizontal=horizontal, gap=gap)
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/graphics/mosaicplot.pyc in _normalize_data(data, index)
    315     contingency = OrderedDict()
    316     for key, value in iteritems(data):
--> 317         new_key = tuple(key[i] for i in index)
    318         contingency[new_key] = value
    319     data = contingency
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/graphics/mosaicplot.pyc in <genexpr>((i,))
    315     contingency = OrderedDict()
    316     for key, value in iteritems(data):
--> 317         new_key = tuple(key[i] for i in index)
    318         contingency[new_key] = value
    319     data = contingency
IndexError: tuple index out of range

In [19]:

from rpy2.robjects import r, pandas2ri

In [0]:

pandas2ri.ri2py(....)

In [0]:

# Your Python equivalent goes here

Problem 2: Sunspots revisited

Grading criteria: correctness of code and results.

Let sunspots be the sunactivity dataframe (defined below for you).

In [0]:

from statsmodels import datasets
sunspots = datasets.sunspots.load_pandas().data.set_index("YEAR")

In [30]:

from statsmodels import datasets
sunspots = datasets.sunspots.load_pandas().data.set_index("YEAR")

In [11]:

In [12]:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-12-1fde0097a6c5> in <module>()
----> 1 s[True]

/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
    621         key = com._apply_if_callable(key, self)
    622         try:
--> 623             result = self.index.get_value(self, key)
    624 
    625             if not is_scalar(result):
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/numeric.pyc in get_value(self, series, key)
    343 
    344         k = _values_from_object(key)
--> 345         loc = self.get_loc(k)
    346         new_values = _values_from_object(series)[loc]
    347 
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/numeric.pyc in get_loc(self, key, method, tolerance)
    402             pass
    403         return super(Float64Index, self).get_loc(key, method=method,
--> 404                                                  tolerance=tolerance)
    405 
    406     @cache_readonly
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in get_loc(self, key, method, tolerance)
   2522                 return self._engine.get_loc(key)
   2523             except KeyError:
-> 2524                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2525 
   2526         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Float64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Float64HashTable.get_item()
KeyError: 1.0

2a. For how many years was the activity $\geq 100$ ?

In [0]:

2b. Make a histogram plot of all activity from 1900 to the end of the dataset.

In [33]:

sunspots[1900:]

	SUNACTIVITY
YEAR
1900.0	9.5
1901.0	2.7
1902.0	5.0
1903.0	24.4
1904.0	42.0
1905.0	63.5
1906.0	53.8
1907.0	62.0
1908.0	48.5
1909.0	43.9
1910.0	18.6
1911.0	5.7
1912.0	3.6
1913.0	1.4
1914.0	9.6
1915.0	47.4
1916.0	57.1
1917.0	103.9
1918.0	80.6
1919.0	63.6
1920.0	37.6
1921.0	26.1
1922.0	14.2
1923.0	5.8
1924.0	16.7
1925.0	44.3
1926.0	63.9
1927.0	69.0
1928.0	77.8
1929.0	64.9
...	...
1979.0	155.4
1980.0	154.6
1981.0	140.4
1982.0	115.9
1983.0	66.6
1984.0	45.9
1985.0	17.9
1986.0	13.4
1987.0	29.4
1988.0	100.2
1989.0	157.6
1990.0	142.6
1991.0	145.7
1992.0	94.3
1993.0	54.6
1994.0	29.9
1995.0	17.5
1996.0	8.6
1997.0	21.5
1998.0	64.3
1999.0	93.3
2000.0	119.6
2001.0	111.0
2002.0	104.0
2003.0	63.7
2004.0	40.4
2005.0	29.8
2006.0	15.2
2007.0	7.5
2008.0	2.9

109 rows × 1 columns

In [0]:

2c. Which year(s) had the highest activity?

In [0]:

Problem 3: Pivot tables

Grading criteria: correctness of code and explanations.

3a. Load the "mpg" R dataset from the Python ggplot library into the variable mpg.

In [3]:

from ggplot import mpg
mpg.describe()

	displ	year	cyl	cty	hwy
count	234.000000	234.000000	234.000000	234.000000	234.000000
mean	3.471795	2003.500000	5.888889	16.858974	23.440171
std	1.291959	4.509646	1.611534	4.255946	5.954643
min	1.600000	1999.000000	4.000000	9.000000	12.000000
25%	2.400000	1999.000000	4.000000	14.000000	18.000000
50%	3.300000	2003.500000	6.000000	17.000000	24.000000
75%	4.600000	2008.000000	8.000000	19.000000	27.000000
max	7.000000	2008.000000	8.000000	35.000000	44.000000

3b. Using the pandas pivot_table command, cretae a pandas DataFrame that tells you the average "cty" and "hwy" (city and highway miles per gallon) for each manufacturer.

In [0]:

3c. Has the average city mileage improved from 1999 to 2008? Has the average highway mileage improved from 1999 to 2008?

In [0]:

3d. Create a scatterplot of pairs (displ, hwy) for all cars in 1999, and another for all cars in 2008.

In [0]:

3e. What effect does increasing displacement have on highway gas mileage?

In [0]:

Problem 4: Irises

Grading criteria: correctness of code and explanations.

The iris dataset is a famous example used in statistics education.

4a. Load the iris dataset into a pandas DataFrame and use the describe command to see some basic statistics.

In [34]:

import statsmodels.api as sm
iris = sm.datasets.get_rdataset('iris').data
iris

/ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
5	5.4	3.9	1.7	0.4	setosa
6	4.6	3.4	1.4	0.3	setosa
7	5.0	3.4	1.5	0.2	setosa
8	4.4	2.9	1.4	0.2	setosa
9	4.9	3.1	1.5	0.1	setosa
10	5.4	3.7	1.5	0.2	setosa
11	4.8	3.4	1.6	0.2	setosa
12	4.8	3.0	1.4	0.1	setosa
13	4.3	3.0	1.1	0.1	setosa
14	5.8	4.0	1.2	0.2	setosa
15	5.7	4.4	1.5	0.4	setosa
16	5.4	3.9	1.3	0.4	setosa
17	5.1	3.5	1.4	0.3	setosa
18	5.7	3.8	1.7	0.3	setosa
19	5.1	3.8	1.5	0.3	setosa
20	5.4	3.4	1.7	0.2	setosa
21	5.1	3.7	1.5	0.4	setosa
22	4.6	3.6	1.0	0.2	setosa
23	5.1	3.3	1.7	0.5	setosa
24	4.8	3.4	1.9	0.2	setosa
25	5.0	3.0	1.6	0.2	setosa
26	5.0	3.4	1.6	0.4	setosa
27	5.2	3.5	1.5	0.2	setosa
28	5.2	3.4	1.4	0.2	setosa
29	4.7	3.2	1.6	0.2	setosa
...	...	...	...	...	...
120	6.9	3.2	5.7	2.3	virginica
121	5.6	2.8	4.9	2.0	virginica
122	7.7	2.8	6.7	2.0	virginica
123	6.3	2.7	4.9	1.8	virginica
124	6.7	3.3	5.7	2.1	virginica
125	7.2	3.2	6.0	1.8	virginica
126	6.2	2.8	4.8	1.8	virginica
127	6.1	3.0	4.9	1.8	virginica
128	6.4	2.8	5.6	2.1	virginica
129	7.2	3.0	5.8	1.6	virginica
130	7.4	2.8	6.1	1.9	virginica
131	7.9	3.8	6.4	2.0	virginica
132	6.4	2.8	5.6	2.2	virginica
133	6.3	2.8	5.1	1.5	virginica
134	6.1	2.6	5.6	1.4	virginica
135	7.7	3.0	6.1	2.3	virginica
136	6.3	3.4	5.6	2.4	virginica
137	6.4	3.1	5.5	1.8	virginica
138	6.0	3.0	4.8	1.8	virginica
139	6.9	3.1	5.4	2.1	virginica
140	6.7	3.1	5.6	2.4	virginica
141	6.9	3.1	5.1	2.3	virginica
142	5.8	2.7	5.1	1.9	virginica
143	6.8	3.2	5.9	2.3	virginica
144	6.7	3.3	5.7	2.5	virginica
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica

150 rows × 5 columns

4b. Plot all of the sepal (length, width) pairs in a scatterplot, and the petal (length, width) pairs in another scatterplot.

In [36]:

import matplotlib.pyplot as plt
#x_data = iris.iloc[:,0]
#x_data
y_data = iris['Sepal.Width']
y_data
#plt.scatter(x_data, y_data)

    3.5
    3.0
    3.2
    3.1
    3.6
    3.9
    3.4
    3.4
    2.9
    3.1
   3.7
   3.4
   3.0
   3.0
   4.0
   4.4
   3.9
   3.5
   3.8
   3.8
   3.4
   3.7
   3.6
   3.3
   3.4
   3.0
   3.4
   3.5
   3.4
   3.2
      ... 
  3.2
  2.8
  2.8
  2.7
  3.3
  3.2
  2.8
  3.0
  2.8
  3.0
  2.8
  3.8
  2.8
  2.8
  2.6
  3.0
  3.4
  3.1
  3.0
  3.1
  3.1
  3.1
  2.7
  3.2
  3.3
  3.0
  2.5
  3.0
  3.4
  3.0
Name: Sepal.Width, Length: 150, dtype: float64

4c. Compute the average petal width for each of the "species"-categories.

In [0]:

Problem 5: Machine learning with irises

Grading criteria: correctness and relevance of code.

5a. The Wikipedia article on the iris dataset asserts:

The use of this data set in cluster analysis however is not common, since the data set only contains two clusters with rather obvious separation.

Demonstrate this by performing a clustering computation and showing that it fails to separate the three species.

In [0]:

5b. Use the scikit-learn SVM classifier to classify species. Use a random sample of 80% of the initial data for training and the other 20% for testing, and report the accuracy rate of your predictions.

In [37]:

from rpy2.robjects import r, pandas2ri
import pandas as pd
import matplotlib.pyplot as plt
pandas2ri.activate()
iris = pd.DataFrame(r['iris'])

In [43]:

iris.axes()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-43-46df5a53509d> in <module>()
----> 1 iris.labels

/ext/sage/sage-8.1/local/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
   3608         if (name in self._internal_names_set or name in self._metadata or
   3609                 name in self._accessors):
-> 3610             return object.__getattribute__(self, name)
   3611         else:
   3612             if name in self._info_axis:
AttributeError: 'DataFrame' object has no attribute 'labels'

In [41]:

from sklearn import svm
from sklearn.utils import shuffle
import statsmodels.formula.api as sm
iris = shuffle(iris)
clf = svm.SVC(gamma=0.001, C=100.)
train = iris.sample(frac=0.8)
test = iris.loc[~iris.index.isin(train.index)]
clf.fit(train[['Sepal.Length','Sepal.Width','Petal.Length','Petal.Width']],train['Species'])
predictions = clf.predict(test[['Sepal.Length','Sepal.Width','Petal.Length','Petal.Width']])
predictions
model = sm.ols(formula='Species ~ Sepal, Petal', data=train)
fitted = model.fit()
fitted.summary()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-41-8b82712dca2c> in <module>()
      9 predictions = clf.predict(test[['Sepal.Length','Sepal.Width','Petal.Length','Petal.Width']])
     10 predictions
---> 11 model = sm.ols(formula='Species ~ Sepal, Petal', data=train)
     12 fitted = model.fit()
     13 fitted.summary()
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/base/model.pyc in from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
    153 
    154         tmp = handle_formula_data(data, None, formula, depth=eval_env,
--> 155                                   missing=missing)
    156         ((endog, exog), missing_idx, design_info) = tmp
    157 
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/statsmodels/formula/formulatools.pyc in handle_formula_data(Y, X, formula, depth, missing)
     63         if data_util._is_using_pandas(Y, None):
     64             result = dmatrices(formula, Y, depth, return_type='dataframe',
---> 65                                NA_action=na_action)
     66         else:
     67             result = dmatrices(formula, Y, depth, return_type='dataframe',
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/highlevel.pyc in dmatrices(formula_like, data, eval_env, NA_action, return_type)
    308     eval_env = EvalEnvironment.capture(eval_env, reference=1)
    309     (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
--> 310                                       NA_action, return_type)
    311     if lhs.shape[1] == 0:
    312         raise PatsyError("model is missing required outcome variables")
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/highlevel.pyc in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
    163         return iter([data])
    164     design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
--> 165                                       NA_action)
    166     if design_infos is not None:
    167         return build_design_matrices(design_infos, data,
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/highlevel.pyc in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
     68                                       data_iter_maker,
     69                                       eval_env,
---> 70                                       NA_action)
     71     else:
     72         return None
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/build.pyc in design_matrix_builders(termlists, data_iter_maker, eval_env, NA_action)
    694                                                    factor_states,
    695                                                    data_iter_maker,
--> 696                                                    NA_action)
    697     # Now we need the factor infos, which encapsulate the knowledge of
    698     # how to turn any given factor into a chunk of data:
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/build.pyc in _examine_factor_types(factors, factor_states, data_iter_maker, NA_action)
    441     for data in data_iter_maker():
    442         for factor in list(examine_needed):
--> 443             value = factor.eval(factor_states[factor], data)
    444             if factor in cat_sniffers or guess_categorical(value):
    445                 if factor not in cat_sniffers:
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/eval.pyc in eval(self, memorize_state, data)
    564         return self._eval(memorize_state["eval_code"],
    565                           memorize_state,
--> 566                           data)
    567 
    568     __getstate__ = no_pickling
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/eval.pyc in _eval(self, code, memorize_state, data)
    549                                  memorize_state["eval_env"].eval,
    550                                  code,
--> 551                                  inner_namespace=inner_namespace)
    552 
    553     def memorize_chunk(self, state, which_pass, data):
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/compat.pyc in call_and_wrap_exc(msg, origin, f, *args, **kwargs)
    115 def call_and_wrap_exc(msg, origin, f, *args, **kwargs):
    116     try:
--> 117         return f(*args, **kwargs)
    118     except Exception as e:
    119         if sys.version_info[0] >= 3:
/ext/sage/sage-8.1/local/lib/python2.7/site-packages/patsy/eval.pyc in eval(self, expr, source_name, inner_namespace)
    164         code = compile(expr, source_name, "eval", self.flags, False)
    165         return eval(code, {}, VarLookupDict([inner_namespace]
--> 166                                             + self._namespaces))
    167 
    168     @classmethod
<string> in <module>()
NameError: name 'Sepal' is not defined

In [0]:

sm.ols?

In [0]: