
Math 480 - Homework 6

Due 6pm on May 13, 2016

There are 5 problems. All problems have equal weight.

# Always gets run when you start this worksheet -- makes things nice for pandas.
%auto
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
%default_mode python
%typeset_mode True

Problem 1 -- Your CSV file

  • (1.a) Search for a CSV dataset online (google: "[a keyword] filetype:csv") and try to load it into pandas. Make sure it contains at least one column with numbers!

  • (1.b) Load the file as a Pandas dataframe, and compute the sum, mean, max, min, etc. of columns with numbers (use the describe method on a dataframe).

  • (1.c) Use a command from the Pandas visualization tools to draw at least one plot that illustrates your data.
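The (1.a)-(1.c) workflow can be sketched end to end; a minimal sketch, where the CSV text and the city/population column names are invented stand-ins for a downloaded file:

```python
# Minimal sketch of the 1.a-1.c workflow.  io.StringIO stands in for a
# downloaded CSV file; the city/population data is made up for illustration.
import io
import matplotlib
matplotlib.use("Agg")  # headless backend so the plot call works without a display
import pandas as pd

csv_text = "city,population\nSeattle,737015\nTacoma,219346\nSpokane,228989\n"
df = pd.read_csv(io.StringIO(csv_text))        # (1.b) load as a dataframe

stats = df.describe()                          # count/mean/std/min/quartiles/max
ax = df.plot.bar(x="city", y="population")     # (1.c) one plot of the data
print(int(stats.loc["count", "population"]))   # -> 3
```

With a real downloaded file you would pass its filename to pd.read_csv instead of the StringIO buffer.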

10 Points Total

Part A (2 points total)

All or nothing: data must be loaded into pandas from a CSV file.

Part B (4 points total)

Award:

  • 4 points for using describe
  • 3 points if they manually computed the values that describe would give.
  • 2 points if their csv file does not load but they used describe correctly

Part C (4 points total)

Award:

  • 4 points for plotting their data.
  • 2 points if they plotted correctly but their csv file did not load.

(1.a) -- make some notes about how you did it here.

# (1.b) -- load and describe
# (1.c) -- visualize
### Solution:
# Googled for "skateboard type:csv", then got the raw file from GitHub:
# https://raw.githubusercontent.com/taulant/nodebox-sandbox/master/Skateboard_Parks.csv
d = pd.read_csv("Skateboard_Parks.csv")
d.describe()
d.plot.line(x='FID', y='PARK_ID')
FID PARK_ID FACILITYID
count 20.00000 20.000000 20.000000
mean 9.50000 869.750000 28009.900000
std 5.91608 456.136623 6.348643
min 0.00000 9.000000 28000.000000
25% 4.75000 646.750000 28004.750000
50% 9.50000 867.500000 28009.500000
75% 14.25000 1157.500000 28015.250000
max 19.00000 1717.000000 28020.000000

Problem 2 -- Creating/Importing Different types of files

This problem is very similar to problem 1, but with more file types (and they are smaller). Pandas can import many types of files, including CSV files, Excel spreadsheets, and much more.

  • (2.a) Find or create small example files (each should have at least 3 rows) in any way you want:

    • prob2.csv -- a CSV file

    • prob2.json -- a JSON file (hint: you can make json files using the json Python module)

    • prob2.xlsx -- an Excel spreadsheet (hint: use Google Docs to make one)

    • prob2.h5 -- an HDF file (hint: create such a file using pandas; e.g., see HDFStore docs)

  • (2.b) Read each of the files above in as Pandas data frames, compute summary statistics about them (with describe), and draw one plot (of your choosing) to illustrate something about the data.
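For the CSV and JSON cases, (2.a) and (2.b) amount to a round trip through pandas; a sketch on an invented toy frame (the HDF variant would additionally require the PyTables package):

```python
# Round-trip sketch for two of the file types: write small files, read them
# back, and summarize.  The data itself is made up.
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})
df.to_csv("prob2.csv", index=False)
df.to_json("prob2.json")

csv_back = pd.read_csv("prob2.csv")    # (2.b) read back as dataframes...
json_back = pd.read_json("prob2.json")
summary = csv_back.describe()          # ...and compute summary statistics
print(summary.loc["mean", "x"])        # -> 2.0
```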

(2.a)

  • Explain how you got (or created) your data:

  • When you're done there should be files prob2.csv, prob2.json, prob2.xlsx, and prob2.h5 in the same directory as this worksheet.

10 Points Total

Part A (1 point for CSV, JSON, and xlsx. 2 points for h5; 5 points total)

Award (for each file type):

  • Full points if they have the file and explained how they got it (code that makes the file counts, since it is self-explanatory).
  • Half points if one of the requirements for full points is missing.

Part B (1 point per file type import and 1 point for a plot. 5 points total)

Award: (per file type).

  • 1 point if they used read_XXXX(file_name) correctly and used describe.
  • 0.5 points if they did not use describe()
  • 0 points if they did not read from a file (e.g., read from a variable)

Award 1 point for the plot of some data.

Solution

# Creating CSV file
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, ".", "."],
            'postTestScore': ["25,000", "94,000", 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df.to_csv('prob2.csv')
  • JSON was downloaded from here via this example

  • The xlsx file was created in Google Drive and exported as xlsx

# Creating HDF file
# From the example here: http://pandas.pydata.org/pandas-docs/version/0.17.1/io.html#io-hdf5
store = pd.HDFStore('prob2.h5')
np.random.seed(1234)
index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C'])
store["df"] = df
# (2.b)
csv_data = pd.read_csv("prob2.csv")
json_data = pd.read_json("prob2.json")
xls_data = pd.read_excel("prob2.xlsx")
h5_data = pd.read_hdf("prob2.h5")
csv_data.describe()
json_data.describe()
xls_data.describe()
h5_data.describe()
Unnamed: 0 age
count 5.000000 5.000000
mean 2.000000 45.400000
std 1.581139 18.460769
min 0.000000 24.000000
25% 1.000000 36.000000
50% 2.000000 42.000000
75% 3.000000 52.000000
max 4.000000 73.000000
assignee comments id number
count 0.0 5.000000 5.000000e+00 5.000000
mean NaN 2.600000 1.537063e+08 13117.400000
std NaN 4.722288 6.382248e+04 2.073644
min NaN 0.000000 1.536583e+08 13115.000000
25% NaN 0.000000 1.536615e+08 13116.000000
50% NaN 1.000000 1.536700e+08 13117.000000
75% NaN 1.000000 1.537364e+08 13119.000000
max NaN 11.000000 1.538051e+08 13120.000000
Time 12:30 - 2:30 12:30 - 2:30.1 11:45 - 12:45 12:30 - 2:30.2 12:30 - 2:30.3
count 2 4 4 3 4 4
unique 2 4 4 3 4 4
top Location April 4th April 5th MGH April 16th Ryan
freq 1 1 1 1 1 1
A B C
count 8.000000 8.000000 8.000000
mean 0.049540 -0.545917 0.454165
std 1.143521 1.066463 0.622934
min -2.242685 -2.021255 -0.334077
25% -0.398481 -1.279958 -0.038889
50% 0.236777 -0.678556 0.421265
75% 0.883022 0.246429 0.913359
max 1.321158 1.150036 1.432707
Problem 3 -- Sunactivity

Let sunspots be the sunactivity dataframe from the lecture.

  • (3.a) For how many years was the activity ≥ 100? (Hint: how do you get from a list/array of objects to the number of elements in that list/array?)

  • (3.b) Plot a histogram of all activity values beginning with the year 1900.

  • (3.c) Which year(s) had the highest activity?
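All three sub-problems reduce to boolean indexing on a single column; a sketch on a toy frame with invented activity values standing in for the real sunspots data:

```python
# Boolean-indexing sketch for (3.a)-(3.c) on made-up sunspot-like data.
import pandas as pd

toy = pd.DataFrame({"SUNACTIVITY": [12.0, 104.5, 99.9, 154.4, 104.5]},
                   index=[1898.0, 1899.0, 1900.0, 1901.0, 1902.0])
toy.index.name = "YEAR"

n_high = len(toy[toy.SUNACTIVITY >= 100])             # (3.a) count qualifying years
recent = toy[toy.index >= 1900]                       # (3.b) the subset to histogram
peak = toy[toy.SUNACTIVITY == toy.SUNACTIVITY.max()]  # (3.c) year(s) of the maximum
print(n_high, list(peak.index))                       # -> 3 [1901.0]
```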

10 Points Total

Part A (4 points)

Award:

  • 4 points if they 'just got the answer' using Pandas functionality (should be short)
  • 2 points if they used something like a for loop
  • 0 points if they got the wrong answer

Part B (4 points)

Award:

  • 4 points for the correct histogram
  • 2 points if they made a histogram but with the wrong subset of data
  • 0 points for no histogram

Part C (2 points)

Award:

  • 2 points if they gave the correct answer with relatively short code
  • 1 point if they gave the correct answer but for looped through the data
  • 0 points if they gave the wrong answer

# (3.a)
from statsmodels import datasets
sunspots = datasets.sunspots.load_pandas().data.set_index("YEAR")
print "number of years when activity was > 100:"
len(sunspots[sunspots.SUNACTIVITY > 100])
number of years when activity was > 100:
43
# (3.b)
sunspots[1900:].plot.hist()
# (3.c)
m = sunspots.max().SUNACTIVITY
sunspots[sunspots.SUNACTIVITY == m]
SUNACTIVITY
YEAR
1957.0 190.2

Problem 4 -- Iris flowers

All statistics students learn about the extremely famous iris dataset! It lists sepal and petal measurements for three iris species, which one then tries to classify.

  • (4.a) Load the iris data set and use describe to see basic statistics about it. Hint: from statsmodels import datasets; iris = datasets.get_rdataset("iris").data

  • (4.b) Plot all of the sepal (length, width) pairs in a scatterplot, and then plot the petal (length, width) pairs in another scatterplot.

  • (4.c) Compute the average petal width for each of the "species"-categories.
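Part (4.c) is essentially a one-line groupby once the data is loaded; a self-contained sketch on a made-up miniature of the iris layout:

```python
# Per-species average via groupby, on invented data shaped like the iris frame.
import pandas as pd

toy = pd.DataFrame({"Species": ["setosa", "setosa", "virginica", "virginica"],
                    "Petal.Width": [0.2, 0.3, 2.0, 2.2]})
means = toy.groupby("Species")["Petal.Width"].mean()
print(means["setosa"], means["virginica"])  # -> 0.25 2.1
```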

10 Points Total

Part A (1 point)

Award:

  • 1 point for doing iris.describe()

Part B (2 points per plot, 4 points total)

Award: (per plot)

  • 2 points for the correct plot with a simple solution (about 1 or 2 lines)
  • 1 point for the correct plot with a more involved solution
  • 0 points for the wrong plot

Part C (5 points)

Award:

  • 5 points for finding the correct average petal width for each species.
  • 4 points for the correct widths without labeling the species.
  • 2 points for the wrong answer but a reasonable attempt.

# (4.a)
from statsmodels import datasets
iris = datasets.get_rdataset("iris").data
iris.describe()
Sepal.Length Sepal.Width Petal.Length Petal.Width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
# (4.b)
iris.plot.scatter(x='Sepal.Length', y='Sepal.Width')
iris.plot.scatter(x='Petal.Length', y='Petal.Width')
# (4.c)
iris.groupby('Species').describe()
for s in iris.Species.unique():
    print s, iris["Petal.Width"][iris.Species == s].mean()
Petal.Length Petal.Width Sepal.Length Sepal.Width
Species
setosa count 50.000000 50.000000 50.000000 50.000000
mean 1.462000 0.246000 5.006000 3.428000
std 0.173664 0.105386 0.352490 0.379064
min 1.000000 0.100000 4.300000 2.300000
25% 1.400000 0.200000 4.800000 3.200000
50% 1.500000 0.200000 5.000000 3.400000
75% 1.575000 0.300000 5.200000 3.675000
max 1.900000 0.600000 5.800000 4.400000
versicolor count 50.000000 50.000000 50.000000 50.000000
mean 4.260000 1.326000 5.936000 2.770000
std 0.469911 0.197753 0.516171 0.313798
min 3.000000 1.000000 4.900000 2.000000
25% 4.000000 1.200000 5.600000 2.525000
50% 4.350000 1.300000 5.900000 2.800000
75% 4.600000 1.500000 6.300000 3.000000
max 5.100000 1.800000 7.000000 3.400000
virginica count 50.000000 50.000000 50.000000 50.000000
mean 5.552000 2.026000 6.588000 2.974000
std 0.551895 0.274650 0.635880 0.322497
min 4.500000 1.400000 4.900000 2.200000
25% 5.100000 1.800000 6.225000 2.800000
50% 5.550000 2.000000 6.500000 3.000000
75% 5.875000 2.300000 6.900000 3.175000
max 6.900000 2.500000 7.900000 3.800000
setosa 0.246
versicolor 1.326
virginica 2.026

Problem 5 -- Pivot Tables

Large datasets have a problem: they are large. One of the most commonly used techniques for summarizing a large table into a more compact one is the [Pivot Table](https://en.wikipedia.org/wiki/Pivot_table).

Pandas has a very powerful pd.pivot_table function. See also http://pbpython.com/pandas-pivot-table-explained.html
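As a warm-up, here is a minimal pd.pivot_table call on invented data -- the same shape of call (values/index/aggfunc) that parts (5.a) and (5.b) need:

```python
# Average a numeric column per category with pd.pivot_table (toy data).
import pandas as pd

cars = pd.DataFrame({"maker": ["a", "a", "b", "b"],
                     "cty":   [18, 20, 14, 16]})
table = pd.pivot_table(cars, values="cty", index="maker", aggfunc="mean")
print(table.loc["a", "cty"], table.loc["b", "cty"])  # -> 19.0 15.0
```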

Load the miles per gallon data set, which has both numerical and categorical columns:

from statsmodels import datasets
mpg = datasets.get_rdataset("mpg", "ggplot2").data



You will then compute pivot tables, where you aggregate columns of your choice by sum, mean, min or max by category.

  • (5.a) Create a pandas data frame using the pd.pivot_table command that tells you the average "cty" and "hwy" (city and highway miles per gallon) for each manufacturer.

  • (5.b) Has the average city mileage improved from 1999 to 2008? Has the average highway mileage improved from 1999 to 2008?

  • (5.c) Create a scatterplot of pairs (displ, hwy) for all cars in 1999, and another scatterplot for all cars in 2008. Roughly speaking, if you increase the car's displacement, does the highway gas mileage go up or down?

10 Points Total

Deduct 2 points from each part where they use an involved solution (e.g., for-looping). For parts A and B, deduct 1 point if they did not use a pivot table.

Part A (4 points)

Award:

  • 4 points for a correct table showing just the relevant, requested information.
  • 3 points for a correct table showing extra information

Part B (4 points total)

Award:

  • 4 points for a correct table showing just the relevant, requested information.
  • 3 points for a correct table showing extra information

Part C (1 point per plot, 2 points total)

No pivot table is required for this part.

Award: (per plot)

  • 1 point for the correct plot with a simple solution (about 1 or 2 lines)
  • 0.5 points for the correct plot with a more involved solution
  • 0 points for the wrong plot

from statsmodels import datasets
mpg = datasets.get_rdataset("mpg", "ggplot2").data
mpg.head(5)  # only the first few rows
manufacturer model displ year cyl trans drv cty hwy fl class
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
4 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
# (5.a)
pd.pivot_table(mpg, values=['cty', 'hwy'], columns=['manufacturer'], aggfunc=np.mean)
manufacturer audi chevrolet dodge ford honda hyundai jeep land rover lincoln mercury nissan pontiac subaru toyota volkswagen
cty 17.611111 15.000000 13.135135 14.00 24.444444 18.642857 13.500 11.5 11.333333 13.25 18.076923 17.0 19.285714 18.529412 20.925926
hwy 26.444444 21.894737 17.945946 19.36 32.555556 26.857143 17.625 16.5 17.000000 18.00 24.615385 26.4 25.571429 24.911765 29.222222
# (5.b)
pd.pivot_table(mpg, values=['hwy', 'cty'], columns=['year'], aggfunc=np.mean)
year 1999 2008
cty 17.017094 16.700855
hwy 23.427350 23.452991
# (5.c)
mpg[mpg.year==1999].plot.scatter(x='displ', y='hwy')
mpg[mpg.year==2008].plot.scatter(x='displ', y='hwy')