DSE200x Week 4 Notes

Path: Week 4 / Introduction to Pandas.ipynb

Views: ¹⁸⁹⁰

Kernel: Python 3 (Ubuntu Linux)

Pandas

pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.

pandas build upon numpy and scipy providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and data ingestion, the key features of pandas this notebook covers are:

Generating descriptive statistics on data
Data cleaning using built in pandas functions
Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
Merging multiple datasets using dataframes
Working with timestamps and time-series data

Additional Recommended Resources:

pandas Documentation: http://pandas.pydata.org/pandas-docs/stable/
Python for Data Analysis by Wes McKinney
Python Data Science Handbook by Jake VanderPlas

Let's get started with our first pandas notebook!

Import Libraries

In [18]:

import pandas as pd

Introduction to pandas Data Structures

*pandas* has two main data structures it uses, namely, *Series* and *DataFrames*.

pandas Series

pandas Series one-dimensional labeled array.

In [19]:

ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])

In [20]:

ser

tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object

In [21]:

ser.index

Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')

In [22]:

ser.loc[['nancy','bob']]

nancy    300
bob      foo
dtype: object

In [23]:

ser[[4, 3, 1]]

eric    500
dan     bar
bob     foo
dtype: object

In [24]:

ser.iloc[2]

300

In [25]:

'bob' in ser

True

In [26]:

ser

tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object

In [27]:

ser * 2

tom         200
bob      foofoo
nancy       600
dan      barbar
eric       1000
dtype: object

In [28]:

ser[['nancy', 'eric']] ** 2

nancy     90000
eric     250000
dtype: object

pandas DataFrame

pandas DataFrame is a 2-dimensional labeled data structure.

Create DataFrame from dictionary of Python Series

In [29]:

d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}

In [30]:

df = pd.DataFrame(d)
print(df)

          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0

In [31]:

df.index

Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')

In [32]:

df.columns

Index(['one', 'two'], dtype='object')

In [33]:

pd.DataFrame(d, index=['dancy', 'ball', 'apple'])

	one	two
dancy	NaN	4444.0
ball	200.0	222.0
apple	100.0	111.0

In [34]:

pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])

	two	five
dancy	4444.0	NaN
ball	222.0	NaN
apple	111.0	NaN

Create DataFrame from list of Python dictionaries

In [35]:

data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

In [36]:

pd.DataFrame(data)

	alex	alice	dora	ema	joe
0	1.0	NaN	NaN	NaN	2.0
1	NaN	20.0	10.0	5.0	NaN

In [37]:

pd.DataFrame(data, index=['orange', 'red'])

	alex	alice	dora	ema	joe
orange	1.0	NaN	NaN	NaN	2.0
red	NaN	20.0	10.0	5.0	NaN

In [38]:

pd.DataFrame(data, columns=['joe', 'dora','alice'])

	joe	dora	alice
0	2.0	NaN	NaN
1	NaN	10.0	20.0

Basic DataFrame operations

In [39]:

df

	one	two
apple	100.0	111.0
ball	200.0	222.0
cerill	NaN	333.0
clock	300.0	NaN
dancy	NaN	4444.0

In [40]:

df['one']

apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64

In [41]:

df['three'] = df['one'] * df['two']
df

	one	two	three
apple	100.0	111.0	11100.0
ball	200.0	222.0	44400.0
cerill	NaN	333.0	NaN
clock	300.0	NaN	NaN
dancy	NaN	4444.0	NaN

In [42]:

df['flag'] = df['one'] > 250
df

	one	two	three	flag
apple	100.0	111.0	11100.0	False
ball	200.0	222.0	44400.0	False
cerill	NaN	333.0	NaN	False
clock	300.0	NaN	NaN	True
dancy	NaN	4444.0	NaN	False

In [43]:

three = df.pop('three')

In [44]:

three

apple     11100.0
ball      44400.0
cerill        NaN
clock         NaN
dancy         NaN
Name: three, dtype: float64

In [45]:

df

	one	two	flag
apple	100.0	111.0	False
ball	200.0	222.0	False
cerill	NaN	333.0	False
clock	300.0	NaN	True
dancy	NaN	4444.0	False

In [46]:

del df['two']

In [47]:

df

	one	flag
apple	100.0	False
ball	200.0	False
cerill	NaN	False
clock	300.0	True
dancy	NaN	False

In [48]:

df.insert(2, 'copy_of_one', df['one'])
df

	one	flag	copy_of_one
apple	100.0	False	100.0
ball	200.0	False	200.0
cerill	NaN	False	NaN
clock	300.0	True	300.0
dancy	NaN	False	NaN

In [49]:

df['one_upper_half'] = df['one'][:2]
df

	one	flag	copy_of_one	one_upper_half
apple	100.0	False	100.0	100.0
ball	200.0	False	200.0	200.0
cerill	NaN	False	NaN	NaN
clock	300.0	True	300.0	NaN
dancy	NaN	False	NaN	NaN

Case Study: Movie Data Analysis

This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore with it using *pandas*.

Download the Dataset

Please note that you will need to download the dataset. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to fit on the edX platform due to size constraints.

Here are the links to the data source and location:

Data Source: MovieLens web site (filename: ml-20m.zip)
Location: https://grouplens.org/datasets/movielens/

Once the download completes, please make sure the data files are in a directory called movielens in your Week-3-pandas folder.

Let us look at the files in this dataset using the UNIX command ls.

In [50]:

# Note: Adjust the name of the folder to match your local directory

!ls ./movielens

README.txt  genome-tags.csv  links.csv	movies.csv  ratings.csv  tags.csv

In [51]:

!cat ./movielens/movies.csv | wc -l

9743

In [52]:

!head -5 ./movielens/ratings.csv

userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815

Use Pandas to Read the Dataset

In this notebook, we will be using three CSV files: * **ratings.csv :** *userId*,*movieId*,*rating*, *timestamp* * **tags.csv :** *userId*,*movieId*, *tag*, *timestamp* * **movies.csv :** *movieId*, *title*, *genres*

Using the read_csv function in pandas, we will ingest these three files.

In [53]:

movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)

<class 'pandas.core.frame.DataFrame'>

	movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	5	Father of the Bride Part II (1995)	Comedy
5	6	Heat (1995)	Action\|Crime\|Thriller
6	7	Sabrina (1995)	Comedy\|Romance
7	8	Tom and Huck (1995)	Adventure\|Children
8	9	Sudden Death (1995)	Action
9	10	GoldenEye (1995)	Action\|Adventure\|Thriller
10	11	American President, The (1995)	Comedy\|Drama\|Romance
11	12	Dracula: Dead and Loving It (1995)	Comedy\|Horror
12	13	Balto (1995)	Adventure\|Animation\|Children
13	14	Nixon (1995)	Drama
14	15	Cutthroat Island (1995)	Action\|Adventure\|Romance

In [54]:

# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()

	userId	movieId	tag	timestamp
0	2	60756	funny	1445714994
1	2	60756	Highly quotable	1445714996
2	2	60756	will ferrell	1445714992
3	2	89774	Boxing story	1445715207
4	2	89774	MMA	1445715200

In [55]:

ratings = pd.read_csv('./movielens/ratings.csv', sep=',', parse_dates=['timestamp'])
ratings.head()

	userId	movieId	rating	timestamp
0	1	1	4.0	964982703
1	1	3	4.0	964981247
2	1	6	4.0	964982224
3	1	47	5.0	964983815
4	1	50	5.0	964982931

In [56]:

# For current analysis, we will remove timestamp (we will come back to it!)

del ratings['timestamp']
del tags['timestamp']

Data Structures

Series

In [129]:

#Extract 0th row: notice that it is infact a Series

row_0 = tags.iloc[0]
type(row_0)

pandas.core.series.Series

In [58]:

print(row_0)

userId         2
movieId    60756
tag        funny
Name: 0, dtype: object

In [59]:

row_0.index

Index(['userId', 'movieId', 'tag'], dtype='object')

In [60]:

row_0['userId']

2

In [61]:

'rating' in row_0

False

In [62]:

row_0.name

0

In [63]:

row_0 = row_0.rename('first_row')
row_0.name

'first_row'

DataFrames

In [64]:

tags.head()

	userId	movieId	tag
0	2	60756	funny
1	2	60756	Highly quotable
2	2	60756	will ferrell
3	2	89774	Boxing story
4	2	89774	MMA

In [65]:

tags.index

RangeIndex(start=0, stop=3683, step=1)

In [130]:

tags.columns

Index(['userId', 'movieId', 'tag', 'timestamp', 'parsed_time'], dtype='object')

In [67]:

# Extract row 0, 11, 2000 from DataFrame

tags.iloc[ [0,11,2000] ]

	userId	movieId	tag
0	2	60756	funny
11	18	431	gangster
2000	474	5450	women

Descriptive Statistics

Let's look how the ratings are distributed!

In [68]:

ratings['rating'].describe()

count    100836.000000
mean          3.501557
std           1.042529
min           0.500000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: rating, dtype: float64

In [69]:

ratings.describe()

	userId	movieId	rating
count	100836.000000	100836.000000	100836.000000
mean	326.127564	19435.295718	3.501557
std	182.618491	35530.987199	1.042529
min	1.000000	1.000000	0.500000
25%	177.000000	1199.000000	3.000000
50%	325.000000	2991.000000	3.500000
75%	477.000000	8122.000000	4.000000
max	610.000000	193609.000000	5.000000

In [70]:

ratings['rating'].mean()

3.501556983616962

In [71]:

ratings.mean()

userId       326.127564
movieId    19435.295718
rating         3.501557
dtype: float64

In [72]:

ratings['rating'].min()

0.5

In [73]:

ratings['rating'].max()

5.0

In [74]:

ratings['rating'].std()

1.0425292390605359

In [75]:

ratings['rating'].mode()

0    4.0
dtype: float64

In [76]:

ratings.corr()

	userId	movieId	rating
userId	1.000000	0.006773	-0.049348
movieId	0.006773	1.000000	-0.004061
rating	-0.049348	-0.004061	1.000000

In [77]:

filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()

0         False
1         False
2         False
3         False
4         False
5         False
6         False
7         False
8         False
9         False
10        False
11        False
12        False
13        False
14        False
15        False
16        False
17        False
18        False
19        False
20        False
21        False
22        False
23        False
24        False
25        False
26        False
27        False
28        False
29        False
          ...  
100806    False
100807    False
100808    False
100809    False
100810    False
100811    False
100812    False
100813    False
100814    False
100815    False
100816    False
100817    False
100818    False
100819    False
100820    False
100821    False
100822    False
100823    False
100824    False
100825    False
100826    False
100827    False
100828    False
100829    False
100830    False
100831    False
100832    False
100833    False
100834    False
100835    False
Name: rating, Length: 100836, dtype: bool

False

In [78]:

filter_2 = ratings['rating'] > 0
filter_2.all()

True

Data Cleaning: Handling Missing Data

In [79]:

movies.shape

(9742, 3)

In [80]:

#is any row NULL ?

movies.isnull().any()

movieId    False
title      False
genres     False
dtype: bool

Thats nice ! No NULL values !

In [81]:

ratings.shape

(100836, 3)

In [82]:

#is any row NULL ?

ratings.isnull().any()

userId     False
movieId    False
rating     False
dtype: bool

Thats nice ! No NULL values !

In [83]:

tags.shape

(3683, 3)

In [84]:

#is any row NULL ?

tags.isnull().any()

userId     False
movieId    False
tag        False
dtype: bool

We have some tags which are NULL.

In [85]:

tags = tags.dropna()

In [86]:

#Check again: is any row NULL ?

tags.isnull().any()

userId     False
movieId    False
tag        False
dtype: bool

In [87]:

tags.shape

(3683, 3)

Thats nice ! No NULL values ! Notice the number of lines have reduced.

Data Visualization

In [88]:

%matplotlib inline

ratings.hist(column='rating', figsize=(15,10))

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f12e4acc438>]],
      dtype=object)

In [89]:

ratings.boxplot(column='rating', figsize=(15,20))

<matplotlib.axes._subplots.AxesSubplot at 0x7f12e4a63860>

Slicing Out Columns

In [90]:

tags['tag'].head()

            funny
  Highly quotable
     will ferrell
     Boxing story
              MMA
Name: tag, dtype: object

In [91]:

movies[['title','genres']].head()

	title	genres
0	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	Jumanji (1995)	Adventure\|Children\|Fantasy
2	Grumpier Old Men (1995)	Comedy\|Romance
3	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	Father of the Bride Part II (1995)	Comedy

In [92]:

ratings[-10:]

	userId	movieId	rating
100826	610	162350	3.5
100827	610	163937	3.5
100828	610	163981	3.5
100829	610	164179	5.0
100830	610	166528	4.0
100831	610	166534	4.0
100832	610	168248	5.0
100833	610	168250	5.0
100834	610	168252	5.0
100835	610	170875	3.0

In [93]:

tag_counts = tags['tag'].value_counts()
tag_counts[-10:]

tedious                    1
royalty                    1
narnia                     1
Notable Nudity             1
stephen king               1
amazing                    1
short films                1
Lonesome Polecat           1
Disney animated feature    1
Andy Garcia                1
Name: tag, dtype: int64

In [94]:

tag_counts[:10].plot(kind='bar', figsize=(15,10))

<matplotlib.axes._subplots.AxesSubplot at 0x7f12e49b2ef0>

Filters for Selecting Rows

In [95]:

is_highly_rated = ratings['rating'] >= 4.0

ratings[is_highly_rated][30:50]

	userId	movieId	rating
36	1	608	5.0
38	1	661	5.0
40	1	733	4.0
43	1	804	4.0
44	1	919	5.0
45	1	923	5.0
46	1	940	5.0
47	1	943	4.0
48	1	954	5.0
50	1	1023	5.0
51	1	1024	5.0
52	1	1025	5.0
53	1	1029	5.0
55	1	1031	5.0
56	1	1032	5.0
57	1	1042	4.0
58	1	1049	5.0
59	1	1060	4.0
60	1	1073	5.0
61	1	1080	5.0

In [96]:

is_animation = movies['genres'].str.contains('Animation')

movies[is_animation][5:15]

	movieId	title	genres
322	364	Lion King, The (1994)	Adventure\|Animation\|Children\|Drama\|Musical\|IMAX
483	551	Nightmare Before Christmas, The (1993)	Animation\|Children\|Fantasy\|Musical
488	558	Pagemaster, The (1994)	Action\|Adventure\|Animation\|Children\|Fantasy
506	588	Aladdin (1992)	Adventure\|Animation\|Children\|Comedy\|Musical
511	594	Snow White and the Seven Dwarfs (1937)	Animation\|Children\|Drama\|Fantasy\|Musical
512	595	Beauty and the Beast (1991)	Animation\|Children\|Fantasy\|Musical\|Romance\|IMAX
513	596	Pinocchio (1940)	Animation\|Children\|Fantasy\|Musical
522	610	Heavy Metal (1981)	Action\|Adventure\|Animation\|Horror\|Sci-Fi
527	616	Aristocats, The (1970)	Animation\|Children
534	631	All Dogs Go to Heaven 2 (1996)	Adventure\|Animation\|Children\|Fantasy\|Musical\|R...

In [97]:

movies[is_animation].head(15)

	movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
12	13	Balto (1995)	Adventure\|Animation\|Children
44	48	Pocahontas (1995)	Animation\|Children\|Drama\|Musical\|Romance
205	239	Goofy Movie, A (1995)	Animation\|Children\|Comedy\|Romance
272	313	Swan Princess, The (1994)	Animation\|Children
322	364	Lion King, The (1994)	Adventure\|Animation\|Children\|Drama\|Musical\|IMAX
483	551	Nightmare Before Christmas, The (1993)	Animation\|Children\|Fantasy\|Musical
488	558	Pagemaster, The (1994)	Action\|Adventure\|Animation\|Children\|Fantasy
506	588	Aladdin (1992)	Adventure\|Animation\|Children\|Comedy\|Musical
511	594	Snow White and the Seven Dwarfs (1937)	Animation\|Children\|Drama\|Fantasy\|Musical
512	595	Beauty and the Beast (1991)	Animation\|Children\|Fantasy\|Musical\|Romance\|IMAX
513	596	Pinocchio (1940)	Animation\|Children\|Fantasy\|Musical
522	610	Heavy Metal (1981)	Action\|Adventure\|Animation\|Horror\|Sci-Fi
527	616	Aristocats, The (1970)	Animation\|Children
534	631	All Dogs Go to Heaven 2 (1996)	Adventure\|Animation\|Children\|Fantasy\|Musical\|R...

Group By and Aggregate

In [98]:

ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count

	movieId
rating
0.5	1370
1.0	2811
1.5	1791
2.0	7551
2.5	5550
3.0	20047
3.5	13136
4.0	26818
4.5	8551
5.0	13211

In [99]:

average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()

	rating
movieId
1	3.920930
2	3.431818
3	3.259615
4	2.357143
5	3.071429

In [100]:

movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()

	rating
movieId
1	215
2	110
3	52
4	7
5	49

In [101]:

movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()

	rating
movieId
193581	1
193583	1
193585	1
193587	1
193609	1

Merge Dataframes

In [102]:

tags.head()

	userId	movieId	tag
0	2	60756	funny
1	2	60756	Highly quotable
2	2	60756	will ferrell
3	2	89774	Boxing story
4	2	89774	MMA

In [103]:

movies.head()

	movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	5	Father of the Bride Part II (1995)	Comedy

In [104]:

t = movies.merge(tags, on='movieId', how='inner')
t.head()

	movieId	title	genres	userId	tag
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	336	pixar
1	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	474	pixar
2	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	567	fun
3	2	Jumanji (1995)	Adventure\|Children\|Fantasy	62	fantasy
4	2	Jumanji (1995)	Adventure\|Children\|Fantasy	62	magic board game

More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html

Combine aggreagation, merging, and filters to get useful analytics

In [105]:

avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()

	movieId	rating
0	1	3.920930
1	2	3.431818
2	3	3.259615
3	4	2.357143
4	5	3.071429

In [106]:

box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()

	movieId	title	genres	rating
9719	193581	Black Butler: Book of the Atlantic (2017)	Action\|Animation\|Comedy\|Fantasy	4.0
9720	193583	No Game No Life: Zero (2017)	Animation\|Comedy\|Fantasy	3.5
9721	193585	Flint (2017)	Drama	3.5
9722	193587	Bungo Stray Dogs: Dead Apple (2018)	Action\|Animation	3.5
9723	193609	Andrew Dice Clay: Dice Rules (1991)	Comedy	4.0

In [107]:

is_highly_rated = box_office['rating'] >= 4.0

box_office[is_highly_rated][-5:]

	movieId	title	genres	rating
9713	191005	Gintama (2017)	Action\|Adventure\|Comedy\|Sci-Fi	4.5
9716	193571	Silver Spoon (2014)	Comedy\|Drama	4.0
9717	193573	Love Live! The School Idol Movie (2015)	Animation	4.0
9719	193581	Black Butler: Book of the Atlantic (2017)	Action\|Animation\|Comedy\|Fantasy	4.0
9723	193609	Andrew Dice Clay: Dice Rules (1991)	Comedy	4.0

In [108]:

is_comedy = box_office['genres'].str.contains('Comedy')

box_office[is_comedy][:5]

	movieId	title	genres	rating
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	3.920930
2	3	Grumpier Old Men (1995)	Comedy\|Romance	3.259615
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance	2.357143
4	5	Father of the Bride Part II (1995)	Comedy	3.071429
6	7	Sabrina (1995)	Comedy\|Romance	3.185185

In [109]:

box_office[is_comedy & is_highly_rated][-5:]

	movieId	title	genres	rating
9708	190209	Jeff Ross Roasts the Border (2017)	Comedy	4.0
9713	191005	Gintama (2017)	Action\|Adventure\|Comedy\|Sci-Fi	4.5
9716	193571	Silver Spoon (2014)	Comedy\|Drama	4.0
9719	193581	Black Butler: Book of the Atlantic (2017)	Action\|Animation\|Comedy\|Fantasy	4.0
9723	193609	Andrew Dice Clay: Dice Rules (1991)	Comedy	4.0

Vectorized String Operations

In [110]:

movies.head()

	movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	5	Father of the Bride Part II (1995)	Comedy

Split 'genres' into multiple columns

In [111]:

movie_genres = movies['genres'].str.split('|', expand=True)

In [112]:

movie_genres[:10]

	0	1	2	3	4	5	6	7	8	9
0	Adventure	Animation	Children	Comedy	Fantasy	None	None	None	None	None
1	Adventure	Children	Fantasy	None	None	None	None	None	None	None
2	Comedy	Romance	None	None	None	None	None	None	None	None
3	Comedy	Drama	Romance	None	None	None	None	None	None	None
4	Comedy	None	None	None	None	None	None	None	None	None
5	Action	Crime	Thriller	None	None	None	None	None	None	None
6	Comedy	Romance	None	None	None	None	None	None	None	None
7	Adventure	Children	None	None	None	None	None	None	None	None
8	Action	None	None	None	None	None	None	None	None	None
9	Action	Adventure	Thriller	None	None	None	None	None	None	None

Add a new column for comedy genre flag

In [113]:

movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')

In [114]:

movie_genres[:10]

	0	1	2	3	4	5	6	7	8	9	isComedy
0	Adventure	Animation	Children	Comedy	Fantasy	None	None	None	None	None	True
1	Adventure	Children	Fantasy	None	None	None	None	None	None	None	False
2	Comedy	Romance	None	None	None	None	None	None	None	None	True
3	Comedy	Drama	Romance	None	None	None	None	None	None	None	True
4	Comedy	None	None	None	None	None	None	None	None	None	True
5	Action	Crime	Thriller	None	None	None	None	None	None	None	False
6	Comedy	Romance	None	None	None	None	None	None	None	None	True
7	Adventure	Children	None	None	None	None	None	None	None	None	False
8	Action	None	None	None	None	None	None	None	None	None	False
9	Action	Adventure	Thriller	None	None	None	None	None	None	None	False

Extract year from title e.g. (1995)

In [115]:

movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)

In [116]:

movies.tail()

	movieId	title	genres	year
9737	193581	Black Butler: Book of the Atlantic (2017)	Action\|Animation\|Comedy\|Fantasy	2017
9738	193583	No Game No Life: Zero (2017)	Animation\|Comedy\|Fantasy	2017
9739	193585	Flint (2017)	Drama	2017
9740	193587	Bungo Stray Dogs: Dead Apple (2018)	Action\|Animation	2018
9741	193609	Andrew Dice Clay: Dice Rules (1991)	Comedy	1991

More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods

Parsing Timestamps

Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!

In [117]:

tags = pd.read_csv('./movielens/tags.csv', sep=',')

In [118]:

tags.dtypes

userId        int64
movieId       int64
tag          object
timestamp     int64
dtype: object

Unix time / POSIX time / epoch time records time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970

In [119]:

tags.head(5)

	userId	movieId	tag	timestamp
0	2	60756	funny	1445714994
1	2	60756	Highly quotable	1445714996
2	2	60756	will ferrell	1445714992
3	2	89774	Boxing story	1445715207
4	2	89774	MMA	1445715200

In [120]:

tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

Data Type datetime64[ns] maps to either <M8[ns] or >M8[ns] depending on the hardware

In [121]:


tags['parsed_time'].dtype

dtype('<M8[ns]')

In [122]:

tags.head(2)

	userId	movieId	tag	timestamp	parsed_time
0	2	60756	funny	1445714994	2015-10-24 19:29:54
1	2	60756	Highly quotable	1445714996	2015-10-24 19:29:56

Selecting rows based on timestamps

In [123]:

greater_than_t = tags['parsed_time'] > '2015-02-01'

selected_rows = tags[greater_than_t]

tags.shape, selected_rows.shape

((3683, 5), (1710, 5))

Sorting the table using the timestamps

In [124]:

tags.sort_values(by='parsed_time', ascending=True)[:10]

	userId	movieId	tag	timestamp	parsed_time
1756	474	3181	Shakespeare	1137179352	2006-01-13 19:09:12
2212	474	6912	Rita Hayworth can dance!	1137179371	2006-01-13 19:09:31
1636	474	2494	Hungary	1137179426	2006-01-13 19:10:26
1635	474	2494	Holocaust	1137179426	2006-01-13 19:10:26
1497	474	1836	No DVD at Netflix	1137179444	2006-01-13 19:10:44
1961	474	4969	In Netflix queue	1137179563	2006-01-13 19:12:43
2409	474	26242	In Netflix queue	1137179570	2006-01-13 19:12:50
2413	474	27741	In Netflix queue	1137179587	2006-01-13 19:13:07
2231	474	7025	In Netflix queue	1137179593	2006-01-13 19:13:13
2485	474	41997	In Netflix queue	1137179603	2006-01-13 19:13:23

Average Movie Ratings over Time

## Are Movie ratings related to the year of launch?

In [125]:

average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()

	movieId	rating
9719	193581	4.0
9720	193583	3.5
9721	193585	3.5
9722	193587	3.5
9723	193609	4.0

In [126]:

joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()

	movieId	rating
movieId	1.000000	0.027841
rating	0.027841	1.000000

In [127]:

yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]

	year	rating
0	1902	3.500000
1	1903	2.500000
2	1908	4.000000
3	1915	2.000000
4	1916	3.562500
5	1917	4.500000
6	1919	2.000000
7	1920	3.678571
8	1921	4.100000
9	1922	3.531250

In [128]:

yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f12e47a2128>

Do some years look better for the boxoffice movies than others?

Does any data point seem like an outlier in some sense?

In [0]: