
Math 480 - Homework 6

Due 6pm on May 13, 2016

There are 5 problems. All problems have equal weight.

# Always gets run when you start this worksheet -- makes things nice for pandas.
%auto
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
%default_mode python
%typeset_mode True

Problem 1 -- Your CSV file

  • (1.a) Search for a CSV dataset online (google: "[a keyword] filetype:csv") and try to load it into pandas. Make sure it contains at least one column with numbers!

  • (1.b) Load the file as a Pandas dataframe, and compute the sum, mean, max, min, etc. of columns with numbers (use the describe method on a dataframe).

  • (1.c) Use a command from the Pandas visualization tools to draw at least one plot that illustrates your data.
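The (1.a)-(1.c) workflow can be sketched end to end; a minimal sketch, where the CSV text and the city/population column names are invented stand-ins for a downloaded file:

```python
# Minimal sketch of the 1.a-1.c workflow.  io.StringIO stands in for a
# downloaded CSV file; the city/population data is made up for illustration.
import io
import matplotlib
matplotlib.use("Agg")  # headless backend so the plot call works without a display
import pandas as pd

csv_text = "city,population\nSeattle,737015\nTacoma,219346\nSpokane,228989\n"
df = pd.read_csv(io.StringIO(csv_text))        # (1.b) load as a dataframe

stats = df.describe()                          # count/mean/std/min/quartiles/max
ax = df.plot.bar(x="city", y="population")     # (1.c) one plot of the data
print(int(stats.loc["count", "population"]))   # -> 3
```

With a real downloaded file you would pass its filename to pd.read_csv instead of the StringIO buffer.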

10 Points Total

Part A (2 points total)

All or nothing: data must be loaded into pandas from a CSV file.

Part B (4 points total)

Award:

  • 4 points for using describe
  • 3 points if they manually computed the values that describe would give.
  • 2 points if their csv file does not load but they used describe correctly

Part C (4 points total)

Award:

  • 4 points for plotting their data.
  • 2 points if they plotted correctly but their csv file did not load.

(1.a) -- make some notes about how you did it here.

# (1.b) -- load and describe
# (1.c) -- visualize
### Solution:
# Googled for "skateboard type:csv", then got the raw file from GitHub:
# https://raw.githubusercontent.com/taulant/nodebox-sandbox/master/Skateboard_Parks.csv
d = pd.read_csv("Skateboard_Parks.csv")
d.describe()
d.plot.line(x='FID', y='PARK_ID')
FID PARK_ID FACILITYID
count 20.00000 20.000000 20.000000
mean 9.50000 869.750000 28009.900000
std 5.91608 456.136623 6.348643
min 0.00000 9.000000 28000.000000
25% 4.75000 646.750000 28004.750000
50% 9.50000 867.500000 28009.500000
75% 14.25000 1157.500000 28015.250000
max 19.00000 1717.000000 28020.000000

Problem 2 -- Creating/Importing Different types of files

This problem is very similar to problem 1, but with more file types (and they are smaller). Pandas can import many types of files, including CSV files, Excel spreadsheets, and much more.

  • (2.a) Find or create small example files (each should have at least 3 rows) in any way you want:

    • prob2.csv -- a CSV file

    • prob2.json -- a JSON file (hint: you can make json files using the json Python module)

    • prob2.xlsx -- an Excel spreadsheet (hint: use Google Docs to make one)

    • prob2.h5 -- an HDF file (hint: create such a file using pandas; e.g., see HDFStore docs)

  • (2.b) Read each of the files above in as Pandas data frames, compute summary statistics about them (with describe), and draw one plot (of your choosing) to illustrate something about the data.
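For the CSV and JSON cases, (2.a) and (2.b) amount to a round trip through pandas; a sketch on an invented toy frame (the HDF variant would additionally require the PyTables package):

```python
# Round-trip sketch for two of the file types: write small files, read them
# back, and summarize.  The data itself is made up.
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})
df.to_csv("prob2.csv", index=False)
df.to_json("prob2.json")

csv_back = pd.read_csv("prob2.csv")    # (2.b) read back as dataframes...
json_back = pd.read_json("prob2.json")
summary = csv_back.describe()          # ...and compute summary statistics
print(summary.loc["mean", "x"])        # -> 2.0
```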

(2.a)

  • Explain how you got (or created) your data:

  • When you're done there should be files prob2.csv, prob2.json, prob2.xlsx, and prob2.h5 in the same directory as this worksheet.

10 Points Total

Part A (1 point for CSV, JSON, and xlsx. 2 points for h5; 5 points total)

Award (for each file type):

  • Full points if they have the file and explained how they got it (code that makes the file counts, since it is self-explanatory).
  • Half points if one of the requirements for full points is missing.

Part B (1 point per file type import and 1 point for a plot. 5 points total)

Award: (per file type).

  • 1 point if they used read_XXXX(file_name) correctly and used describe.
  • 0.5 points if they did not use describe()
  • 0 points if they did not read from a file (e.g., read from a variable)

Award 1 point for the plot of some data.

Solution

# Creating CSV file
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, ".", "."],
            'postTestScore': ["25,000", "94,000", 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df.to_csv('prob2.csv')
  • JSON was downloaded from here via this example

  • The xlsx file was created in Google Drive and exported as xlsx

# Creating HDF file
# From the example here: http://pandas.pydata.org/pandas-docs/version/0.17.1/io.html#io-hdf5
store = pd.HDFStore('prob2.h5')
np.random.seed(1234)
index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C'])
store["df"] = df
# (2.b)
csv_data = pd.read_csv("prob2.csv")
json_data = pd.read_json("prob2.json")
xls_data = pd.read_excel("prob2.xlsx")
h5_data = pd.read_hdf("prob2.h5")
csv_data.describe()
json_data.describe()
xls_data.describe()
h5_data.describe()
Unnamed: 0 age
count 5.000000 5.000000
mean 2.000000 45.400000
std 1.581139 18.460769
min 0.000000 24.000000
25% 1.000000 36.000000
50% 2.000000 42.000000
75% 3.000000 52.000000
max 4.000000 73.000000
assignee comments id number
count 0.0 5.000000 5.000000e+00 5.000000
mean NaN 2.600000 1.537063e+08 13117.400000
std NaN 4.722288 6.382248e+04 2.073644
min NaN 0.000000 1.536583e+08 13115.000000
25% NaN 0.000000 1.536615e+08 13116.000000
50% NaN 1.000000 1.536700e+08 13117.000000
75% NaN 1.000000 1.537364e+08 13119.000000
max NaN 11.000000 1.538051e+08 13120.000000
Time 12:30 - 2:30 12:30 - 2:30.1 11:45 - 12:45 12:30 - 2:30.2 12:30 - 2:30.3
count 2 4 4 3 4 4
unique 2 4 4 3 4 4
top Location April 4th April 5th MGH April 16th Ryan
freq 1 1 1 1 1 1
A B C
count 8.000000 8.000000 8.000000
mean 0.049540 -0.545917 0.454165
std 1.143521 1.066463 0.622934
min -2.242685 -2.021255 -0.334077
25% -0.398481 -1.279958 -0.038889
50% 0.236777 -0.678556 0.421265
75% 0.883022 0.246429 0.913359
max 1.321158 1.150036 1.432707
Problem 3 -- Sunactivity

Let sunspots be the sunactivity dataframe from the lecture.

  • (3.a) For how many years was the activity ≥ 100? (Hint: how do you get from a list/array of objects to the number of elements in that list/array?)

  • (3.b) Plot a histogram of all activity values beginning with the year 1900.

  • (3.c) Which year(s) had the highest activity?
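All three sub-problems reduce to boolean indexing on a single column; a sketch on a toy frame with invented activity values standing in for the real sunspots data:

```python
# Boolean-indexing sketch for (3.a)-(3.c) on made-up sunspot-like data.
import pandas as pd

toy = pd.DataFrame({"SUNACTIVITY": [12.0, 104.5, 99.9, 154.4, 104.5]},
                   index=[1898.0, 1899.0, 1900.0, 1901.0, 1902.0])
toy.index.name = "YEAR"

n_high = len(toy[toy.SUNACTIVITY >= 100])             # (3.a) count qualifying years
recent = toy[toy.index >= 1900]                       # (3.b) the subset to histogram
peak = toy[toy.SUNACTIVITY == toy.SUNACTIVITY.max()]  # (3.c) year(s) of the maximum
print(n_high, list(peak.index))                       # -> 3 [1901.0]
```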

10 Points Total

Part A (4 points)

Award:

  • 4 points if they 'just got the answer' using Pandas functionality (should be short)
  • 2 points if they used something like a for loop
  • 0 points if they got the wrong answer

Part B (4 points)

Award:

  • 4 points for the correct histogram
  • 2 points if they made a histogram but with the wrong subset of data
  • 0 points for no histogram

Part C (2 points)

Award:

  • 2 points if they gave the correct answer with relatively short code
  • 1 point if they gave the correct answer but for looped through the data
  • 0 points if they gave the wrong answer

# (3.a)
from statsmodels import datasets
sunspots = datasets.sunspots.load_pandas().data.set_index("YEAR")
print "number of years when activity was > 100:"
len(sunspots[sunspots.SUNACTIVITY > 100])
number of years when activity was > 100:
43
# (3.b)
sunspots[1900:].plot.hist()
# (3.c)
m = sunspots.max().SUNACTIVITY
sunspots[sunspots.SUNACTIVITY == m]
SUNACTIVITY
YEAR
1957.0 190.2

Problem 4 -- Iris flowers

All statistics students learn about the extremely famous iris dataset! It lists sepal and petal measurements for three iris species, which one then tries to classify.

  • (4.a) Load the iris data set and use describe to see basic statistics about it. Hint: from statsmodels import datasets; iris = datasets.get_rdataset("iris").data

  • (4.b) Plot all of the sepal (length, width) pairs in a scatterplot, and then plot the petal (length, width) pairs in another scatterplot.

  • (4.c) Compute the average petal width for each of the "species"-categories.
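Part (4.c) is essentially a one-line groupby once the data is loaded; a self-contained sketch on a made-up miniature of the iris layout:

```python
# Per-species average via groupby, on invented data shaped like the iris frame.
import pandas as pd

toy = pd.DataFrame({"Species": ["setosa", "setosa", "virginica", "virginica"],
                    "Petal.Width": [0.2, 0.3, 2.0, 2.2]})
means = toy.groupby("Species")["Petal.Width"].mean()
print(means["setosa"], means["virginica"])  # -> 0.25 2.1
```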

10 Points Total

Part A (1 point)

Award:

  • 1 point for doing iris.describe()

Part B (2 points per plot, 4 points total)

Award: (per plot)

  • 2 points for the correct plot with a simple solution (about 1 or 2 lines)
  • 1 point for the correct plot with a more involved solution
  • 0 points for the wrong plot

Part C (5 points)

Award:

  • 5 points for finding the correct average petal width for each species.
  • 4 points for the correct widths without labeling the species.
  • 2 points for the wrong answer but a reasonable attempt.

# (4.a)
from statsmodels import datasets
iris = datasets.get_rdataset("iris").data
iris.describe()
Sepal.Length Sepal.Width Petal.Length Petal.Width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
# (4.b)
iris.plot.scatter(x='Sepal.Length', y='Sepal.Width')
iris.plot.scatter(x='Petal.Length', y='Petal.Width')
# (4.c)
iris.groupby('Species').describe()
for s in iris.Species.unique():
    print s, iris["Petal.Width"][iris.Species == s].mean()
Petal.Length Petal.Width Sepal.Length Sepal.Width
Species
setosa count 50.000000 50.000000 50.000000 50.000000
mean 1.462000 0.246000 5.006000 3.428000
std 0.173664 0.105386 0.352490 0.379064
min 1.000000 0.100000 4.300000 2.300000
25% 1.400000 0.200000 4.800000 3.200000
50% 1.500000 0.200000 5.000000 3.400000
75% 1.575000 0.300000 5.200000 3.675000
max 1.900000 0.600000 5.800000 4.400000
versicolor count 50.000000 50.000000 50.000000 50.000000
mean 4.260000 1.326000 5.936000 2.770000
std 0.469911 0.197753 0.516171 0.313798
min 3.000000 1.000000 4.900000 2.000000
25% 4.000000 1.200000 5.600000 2.525000
50% 4.350000 1.300000 5.900000 2.800000
75% 4.600000 1.500000 6.300000 3.000000
max 5.100000 1.800000 7.000000 3.400000
virginica count 50.000000 50.000000 50.000000 50.000000
mean 5.552000 2.026000 6.588000 2.974000
std 0.551895 0.274650 0.635880 0.322497
min 4.500000 1.400000 4.900000 2.200000
25% 5.100000 1.800000 6.225000 2.800000
50% 5.550000 2.000000 6.500000 3.000000
75% 5.875000 2.300000 6.900000 3.175000
max 6.900000 2.500000 7.900000 3.800000
setosa 0.246
versicolor 1.326
virginica 2.026

Problem 5 -- Pivot Tables

Large datasets have a problem: they are large. One of the most commonly used techniques for summarizing a large table into a more compact one is the [Pivot Table](https://en.wikipedia.org/wiki/Pivot_table).

Pandas has a very powerful pd.pivot_table function. See also http://pbpython.com/pandas-pivot-table-explained.html
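As a warm-up, here is a minimal pd.pivot_table call on invented data -- the same shape of call (values/index/aggfunc) that parts (5.a) and (5.b) need:

```python
# Average a numeric column per category with pd.pivot_table (toy data).
import pandas as pd

cars = pd.DataFrame({"maker": ["a", "a", "b", "b"],
                     "cty":   [18, 20, 14, 16]})
table = pd.pivot_table(cars, values="cty", index="maker", aggfunc="mean")
print(table.loc["a", "cty"], table.loc["b", "cty"])  # -> 19.0 15.0
```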

Load the miles per gallon data set, which has both numerical and categorical columns:

from statsmodels import datasets
mpg = datasets.get_rdataset("mpg", "ggplot2").data



You will then compute pivot tables, where you aggregate columns of your choice by sum, mean, min or max by category.

  • (5.a) Create a pandas data frame using the pd.pivot_table command that tells you the average "cty" and "hwy" (city and highway miles per gallon) for each manufacturer.

  • (5.b) Has the average city mileage improved from 1999 to 2008? Has the average highway mileage improved from 1999 to 2008?

  • (5.c) Create a scatterplot of pairs (displ, hwy) for all cars in 1999, and another scatterplot for all cars in 2008. Roughly speaking, if you increase the car's displacement, does the highway gas mileage go up or down?

10 Points Total

Deduct 2 points from each part where they use an involved solution (e.g., for-looping). For parts A and B, deduct 1 point if they did not use a pivot table.

Part A (4 points)

Award:

  • 4 points for a correct table showing just the relevant, requested information.
  • 3 points for a correct table showing extra information

Part B (4 points total)

Award:

  • 4 points for a correct table showing just the relevant, requested information.
  • 3 points for a correct table showing extra information

Part C (1 point per plot, 2 points total)

No pivot table is required for this part.

Award: (per plot)

  • 1 point for the correct plot with a simple solution (about 1 or 2 lines)
  • 0.5 points for the correct plot with a more involved solution
  • 0 points for the wrong plot

from statsmodels import datasets
mpg = datasets.get_rdataset("mpg", "ggplot2").data
mpg.head(5)  # only the first few rows
manufacturer model displ year cyl trans drv cty hwy fl class
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
4 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
# (5.a)
pd.pivot_table(mpg, values=['cty', 'hwy'], columns=['manufacturer'], aggfunc=np.mean)
manufacturer audi chevrolet dodge ford honda hyundai jeep land rover lincoln mercury nissan pontiac subaru toyota volkswagen
cty 17.611111 15.000000 13.135135 14.00 24.444444 18.642857 13.500 11.5 11.333333 13.25 18.076923 17.0 19.285714 18.529412 20.925926
hwy 26.444444 21.894737 17.945946 19.36 32.555556 26.857143 17.625 16.5 17.000000 18.00 24.615385 26.4 25.571429 24.911765 29.222222
# (5.b)
pd.pivot_table(mpg, values=['hwy', 'cty'], columns=['year'], aggfunc=np.mean)
year 1999 2008
cty 17.017094 16.700855
hwy 23.427350 23.452991
# (5.c)
mpg[mpg.year==1999].plot.scatter(x='displ', y='hwy')
mpg[mpg.year==2008].plot.scatter(x='displ', y='hwy')