Contact
CoCalc Logo Icon
StoreFeaturesDocsShareSupport News AboutSign UpSign In
| Download

DSE200x Week 4 Notes

Views: 1890
Kernel: Python 3 (Ubuntu Linux)


Pandas


pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.

pandas build upon numpy and scipy providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and data ingestion, the key features of pandas this notebook covers are:

  • Generating descriptive statistics on data

  • Data cleaning using built in pandas functions

  • Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data

  • Merging multiple datasets using dataframes

  • Working with timestamps and time-series data

Additional Recommended Resources:

Let's get started with our first pandas notebook!


Import Libraries

import pandas as pd

Introduction to pandas Data Structures


*pandas* has two main data structures it uses, namely, *Series* and *DataFrames*.

pandas Series

pandas Series one-dimensional labeled array.

ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])
ser
tom 100 bob foo nancy 300 dan bar eric 500 dtype: object
ser.index
Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')
ser.loc[['nancy','bob']]
nancy 300 bob foo dtype: object
ser[[4, 3, 1]]
eric 500 dan bar bob foo dtype: object
ser.iloc[2]
300
'bob' in ser
True
ser
tom 100 bob foo nancy 300 dan bar eric 500 dtype: object
ser * 2
tom 200 bob foofoo nancy 600 dan barbar eric 1000 dtype: object
ser[['nancy', 'eric']] ** 2
nancy 90000 eric 250000 dtype: object

pandas DataFrame

pandas DataFrame is a 2-dimensional labeled data structure.

Create DataFrame from dictionary of Python Series

d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']), 'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d) print(df)
one two apple 100.0 111.0 ball 200.0 222.0 cerill NaN 333.0 clock 300.0 NaN dancy NaN 4444.0
df.index
Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')
df.columns
Index(['one', 'two'], dtype='object')
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])
one two
dancy NaN 4444.0
ball 200.0 222.0
apple 100.0 111.0
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
two five
dancy 4444.0 NaN
ball 222.0 NaN
apple 111.0 NaN

Create DataFrame from list of Python dictionaries

data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
pd.DataFrame(data)
alex alice dora ema joe
0 1.0 NaN NaN NaN 2.0
1 NaN 20.0 10.0 5.0 NaN
pd.DataFrame(data, index=['orange', 'red'])
alex alice dora ema joe
orange 1.0 NaN NaN NaN 2.0
red NaN 20.0 10.0 5.0 NaN
pd.DataFrame(data, columns=['joe', 'dora','alice'])
joe dora alice
0 2.0 NaN NaN
1 NaN 10.0 20.0

Basic DataFrame operations

df
one two
apple 100.0 111.0
ball 200.0 222.0
cerill NaN 333.0
clock 300.0 NaN
dancy NaN 4444.0
df['one']
apple 100.0 ball 200.0 cerill NaN clock 300.0 dancy NaN Name: one, dtype: float64
df['three'] = df['one'] * df['two'] df
one two three
apple 100.0 111.0 11100.0
ball 200.0 222.0 44400.0
cerill NaN 333.0 NaN
clock 300.0 NaN NaN
dancy NaN 4444.0 NaN
df['flag'] = df['one'] > 250 df
one two three flag
apple 100.0 111.0 11100.0 False
ball 200.0 222.0 44400.0 False
cerill NaN 333.0 NaN False
clock 300.0 NaN NaN True
dancy NaN 4444.0 NaN False
three = df.pop('three')
three
apple 11100.0 ball 44400.0 cerill NaN clock NaN dancy NaN Name: three, dtype: float64
df
one two flag
apple 100.0 111.0 False
ball 200.0 222.0 False
cerill NaN 333.0 False
clock 300.0 NaN True
dancy NaN 4444.0 False
del df['two']
df
one flag
apple 100.0 False
ball 200.0 False
cerill NaN False
clock 300.0 True
dancy NaN False
df.insert(2, 'copy_of_one', df['one']) df
one flag copy_of_one
apple 100.0 False 100.0
ball 200.0 False 200.0
cerill NaN False NaN
clock 300.0 True 300.0
dancy NaN False NaN
df['one_upper_half'] = df['one'][:2] df
one flag copy_of_one one_upper_half
apple 100.0 False 100.0 100.0
ball 200.0 False 200.0 200.0
cerill NaN False NaN NaN
clock 300.0 True 300.0 NaN
dancy NaN False NaN NaN

Case Study: Movie Data Analysis


This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore with it using *pandas*.

Download the Dataset

Please note that you will need to download the dataset. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to fit on the edX platform due to size constraints.

Here are the links to the data source and location:

Once the download completes, please make sure the data files are in a directory called movielens in your Week-3-pandas folder.

Let us look at the files in this dataset using the UNIX command ls.

# Note: Adjust the name of the folder to match your local directory !ls ./movielens
README.txt genome-tags.csv links.csv movies.csv ratings.csv tags.csv
!cat ./movielens/movies.csv | wc -l
9743
!head -5 ./movielens/ratings.csv
userId,movieId,rating,timestamp 1,1,4.0,964982703 1,3,4.0,964981247 1,6,4.0,964982224 1,47,5.0,964983815

Use Pandas to Read the Dataset


In this notebook, we will be using three CSV files: * **ratings.csv :** *userId*,*movieId*,*rating*, *timestamp* * **tags.csv :** *userId*,*movieId*, *tag*, *timestamp* * **movies.csv :** *movieId*, *title*, *genres*

Using the read_csv function in pandas, we will ingest these three files.

movies = pd.read_csv('./movielens/movies.csv', sep=',') print(type(movies)) movies.head(15)
<class 'pandas.core.frame.DataFrame'>
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
10 11 American President, The (1995) Comedy|Drama|Romance
11 12 Dracula: Dead and Loving It (1995) Comedy|Horror
12 13 Balto (1995) Adventure|Animation|Children
13 14 Nixon (1995) Drama
14 15 Cutthroat Island (1995) Action|Adventure|Romance
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970 tags = pd.read_csv('./movielens/tags.csv', sep=',') tags.head()
userId movieId tag timestamp
0 2 60756 funny 1445714994
1 2 60756 Highly quotable 1445714996
2 2 60756 will ferrell 1445714992
3 2 89774 Boxing story 1445715207
4 2 89774 MMA 1445715200
ratings = pd.read_csv('./movielens/ratings.csv', sep=',', parse_dates=['timestamp']) ratings.head()
userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931
# For current analysis, we will remove timestamp (we will come back to it!) del ratings['timestamp'] del tags['timestamp']

Data Structures

Series

#Extract 0th row: notice that it is infact a Series row_0 = tags.iloc[0] type(row_0)
pandas.core.series.Series
print(row_0)
userId 2 movieId 60756 tag funny Name: 0, dtype: object
row_0.index
Index(['userId', 'movieId', 'tag'], dtype='object')
row_0['userId']
2
'rating' in row_0
False
row_0.name
0
row_0 = row_0.rename('first_row') row_0.name
'first_row'

DataFrames

tags.head()
userId movieId tag
0 2 60756 funny
1 2 60756 Highly quotable
2 2 60756 will ferrell
3 2 89774 Boxing story
4 2 89774 MMA
tags.index
RangeIndex(start=0, stop=3683, step=1)
tags.columns
Index(['userId', 'movieId', 'tag', 'timestamp', 'parsed_time'], dtype='object')
# Extract row 0, 11, 2000 from DataFrame tags.iloc[ [0,11,2000] ]
userId movieId tag
0 2 60756 funny
11 18 431 gangster
2000 474 5450 women

Descriptive Statistics

Let's look how the ratings are distributed!

ratings['rating'].describe()
count 100836.000000 mean 3.501557 std 1.042529 min 0.500000 25% 3.000000 50% 3.500000 75% 4.000000 max 5.000000 Name: rating, dtype: float64
ratings.describe()
userId movieId rating
count 100836.000000 100836.000000 100836.000000
mean 326.127564 19435.295718 3.501557
std 182.618491 35530.987199 1.042529
min 1.000000 1.000000 0.500000
25% 177.000000 1199.000000 3.000000
50% 325.000000 2991.000000 3.500000
75% 477.000000 8122.000000 4.000000
max 610.000000 193609.000000 5.000000
ratings['rating'].mean()
3.501556983616962
ratings.mean()
userId 326.127564 movieId 19435.295718 rating 3.501557 dtype: float64
ratings['rating'].min()
0.5
ratings['rating'].max()
5.0
ratings['rating'].std()
1.0425292390605359
ratings['rating'].mode()
0 4.0 dtype: float64
ratings.corr()
userId movieId rating
userId 1.000000 0.006773 -0.049348
movieId 0.006773 1.000000 -0.004061
rating -0.049348 -0.004061 1.000000
filter_1 = ratings['rating'] > 5 print(filter_1) filter_1.any()
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False ... 100806 False 100807 False 100808 False 100809 False 100810 False 100811 False 100812 False 100813 False 100814 False 100815 False 100816 False 100817 False 100818 False 100819 False 100820 False 100821 False 100822 False 100823 False 100824 False 100825 False 100826 False 100827 False 100828 False 100829 False 100830 False 100831 False 100832 False 100833 False 100834 False 100835 False Name: rating, Length: 100836, dtype: bool
False
filter_2 = ratings['rating'] > 0 filter_2.all()
True

Data Cleaning: Handling Missing Data

movies.shape
(9742, 3)
#is any row NULL ? movies.isnull().any()
movieId False title False genres False dtype: bool

Thats nice ! No NULL values !

ratings.shape
(100836, 3)
#is any row NULL ? ratings.isnull().any()
userId False movieId False rating False dtype: bool

Thats nice ! No NULL values !

tags.shape
(3683, 3)
#is any row NULL ? tags.isnull().any()
userId False movieId False tag False dtype: bool

We have some tags which are NULL.

tags = tags.dropna()
#Check again: is any row NULL ? tags.isnull().any()
userId False movieId False tag False dtype: bool
tags.shape
(3683, 3)

Thats nice ! No NULL values ! Notice the number of lines have reduced.

Data Visualization

%matplotlib inline ratings.hist(column='rating', figsize=(15,10))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f12e4acc438>]], dtype=object)
Image in a Jupyter notebook
ratings.boxplot(column='rating', figsize=(15,20))
<matplotlib.axes._subplots.AxesSubplot at 0x7f12e4a63860>
Image in a Jupyter notebook

Slicing Out Columns

tags['tag'].head()
0 funny 1 Highly quotable 2 will ferrell 3 Boxing story 4 MMA Name: tag, dtype: object
movies[['title','genres']].head()
title genres
0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 Jumanji (1995) Adventure|Children|Fantasy
2 Grumpier Old Men (1995) Comedy|Romance
3 Waiting to Exhale (1995) Comedy|Drama|Romance
4 Father of the Bride Part II (1995) Comedy
ratings[-10:]
userId movieId rating
100826 610 162350 3.5
100827 610 163937 3.5
100828 610 163981 3.5
100829 610 164179 5.0
100830 610 166528 4.0
100831 610 166534 4.0
100832 610 168248 5.0
100833 610 168250 5.0
100834 610 168252 5.0
100835 610 170875 3.0
tag_counts = tags['tag'].value_counts() tag_counts[-10:]
tedious 1 royalty 1 narnia 1 Notable Nudity 1 stephen king 1 amazing 1 short films 1 Lonesome Polecat 1 Disney animated feature 1 Andy Garcia 1 Name: tag, dtype: int64
tag_counts[:10].plot(kind='bar', figsize=(15,10))
<matplotlib.axes._subplots.AxesSubplot at 0x7f12e49b2ef0>
Image in a Jupyter notebook

Filters for Selecting Rows

is_highly_rated = ratings['rating'] >= 4.0 ratings[is_highly_rated][30:50]
userId movieId rating
36 1 608 5.0
38 1 661 5.0
40 1 733 4.0
43 1 804 4.0
44 1 919 5.0
45 1 923 5.0
46 1 940 5.0
47 1 943 4.0
48 1 954 5.0
50 1 1023 5.0
51 1 1024 5.0
52 1 1025 5.0
53 1 1029 5.0
55 1 1031 5.0
56 1 1032 5.0
57 1 1042 4.0
58 1 1049 5.0
59 1 1060 4.0
60 1 1073 5.0
61 1 1080 5.0
is_animation = movies['genres'].str.contains('Animation') movies[is_animation][5:15]
movieId title genres
322 364 Lion King, The (1994) Adventure|Animation|Children|Drama|Musical|IMAX
483 551 Nightmare Before Christmas, The (1993) Animation|Children|Fantasy|Musical
488 558 Pagemaster, The (1994) Action|Adventure|Animation|Children|Fantasy
506 588 Aladdin (1992) Adventure|Animation|Children|Comedy|Musical
511 594 Snow White and the Seven Dwarfs (1937) Animation|Children|Drama|Fantasy|Musical
512 595 Beauty and the Beast (1991) Animation|Children|Fantasy|Musical|Romance|IMAX
513 596 Pinocchio (1940) Animation|Children|Fantasy|Musical
522 610 Heavy Metal (1981) Action|Adventure|Animation|Horror|Sci-Fi
527 616 Aristocats, The (1970) Animation|Children
534 631 All Dogs Go to Heaven 2 (1996) Adventure|Animation|Children|Fantasy|Musical|R...
movies[is_animation].head(15)
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
12 13 Balto (1995) Adventure|Animation|Children
44 48 Pocahontas (1995) Animation|Children|Drama|Musical|Romance
205 239 Goofy Movie, A (1995) Animation|Children|Comedy|Romance
272 313 Swan Princess, The (1994) Animation|Children
322 364 Lion King, The (1994) Adventure|Animation|Children|Drama|Musical|IMAX
483 551 Nightmare Before Christmas, The (1993) Animation|Children|Fantasy|Musical
488 558 Pagemaster, The (1994) Action|Adventure|Animation|Children|Fantasy
506 588 Aladdin (1992) Adventure|Animation|Children|Comedy|Musical
511 594 Snow White and the Seven Dwarfs (1937) Animation|Children|Drama|Fantasy|Musical
512 595 Beauty and the Beast (1991) Animation|Children|Fantasy|Musical|Romance|IMAX
513 596 Pinocchio (1940) Animation|Children|Fantasy|Musical
522 610 Heavy Metal (1981) Action|Adventure|Animation|Horror|Sci-Fi
527 616 Aristocats, The (1970) Animation|Children
534 631 All Dogs Go to Heaven 2 (1996) Adventure|Animation|Children|Fantasy|Musical|R...

Group By and Aggregate

ratings_count = ratings[['movieId','rating']].groupby('rating').count() ratings_count
movieId
rating
0.5 1370
1.0 2811
1.5 1791
2.0 7551
2.5 5550
3.0 20047
3.5 13136
4.0 26818
4.5 8551
5.0 13211
average_rating = ratings[['movieId','rating']].groupby('movieId').mean() average_rating.head()
rating
movieId
1 3.920930
2 3.431818
3 3.259615
4 2.357143
5 3.071429
movie_count = ratings[['movieId','rating']].groupby('movieId').count() movie_count.head()
rating
movieId
1 215
2 110
3 52
4 7
5 49
movie_count = ratings[['movieId','rating']].groupby('movieId').count() movie_count.tail()
rating
movieId
193581 1
193583 1
193585 1
193587 1
193609 1

Merge Dataframes

tags.head()
userId movieId tag
0 2 60756 funny
1 2 60756 Highly quotable
2 2 60756 will ferrell
3 2 89774 Boxing story
4 2 89774 MMA
movies.head()
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
t = movies.merge(tags, on='movieId', how='inner') t.head()
movieId title genres userId tag
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 336 pixar
1 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 474 pixar
2 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 567 fun
3 2 Jumanji (1995) Adventure|Children|Fantasy 62 fantasy
4 2 Jumanji (1995) Adventure|Children|Fantasy 62 magic board game


Combine aggreagation, merging, and filters to get useful analytics

avg_ratings = ratings.groupby('movieId', as_index=False).mean() del avg_ratings['userId'] avg_ratings.head()
movieId rating
0 1 3.920930
1 2 3.431818
2 3 3.259615
3 4 2.357143
4 5 3.071429
box_office = movies.merge(avg_ratings, on='movieId', how='inner') box_office.tail()
movieId title genres rating
9719 193581 Black Butler: Book of the Atlantic (2017) Action|Animation|Comedy|Fantasy 4.0
9720 193583 No Game No Life: Zero (2017) Animation|Comedy|Fantasy 3.5
9721 193585 Flint (2017) Drama 3.5
9722 193587 Bungo Stray Dogs: Dead Apple (2018) Action|Animation 3.5
9723 193609 Andrew Dice Clay: Dice Rules (1991) Comedy 4.0
is_highly_rated = box_office['rating'] >= 4.0 box_office[is_highly_rated][-5:]
movieId title genres rating
9713 191005 Gintama (2017) Action|Adventure|Comedy|Sci-Fi 4.5
9716 193571 Silver Spoon (2014) Comedy|Drama 4.0
9717 193573 Love Live! The School Idol Movie (2015) Animation 4.0
9719 193581 Black Butler: Book of the Atlantic (2017) Action|Animation|Comedy|Fantasy 4.0
9723 193609 Andrew Dice Clay: Dice Rules (1991) Comedy 4.0
is_comedy = box_office['genres'].str.contains('Comedy') box_office[is_comedy][:5]
movieId title genres rating
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 3.920930
2 3 Grumpier Old Men (1995) Comedy|Romance 3.259615
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 2.357143
4 5 Father of the Bride Part II (1995) Comedy 3.071429
6 7 Sabrina (1995) Comedy|Romance 3.185185
box_office[is_comedy & is_highly_rated][-5:]
movieId title genres rating
9708 190209 Jeff Ross Roasts the Border (2017) Comedy 4.0
9713 191005 Gintama (2017) Action|Adventure|Comedy|Sci-Fi 4.5
9716 193571 Silver Spoon (2014) Comedy|Drama 4.0
9719 193581 Black Butler: Book of the Atlantic (2017) Action|Animation|Comedy|Fantasy 4.0
9723 193609 Andrew Dice Clay: Dice Rules (1991) Comedy 4.0

Vectorized String Operations

movies.head()
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy


Split 'genres' into multiple columns


movie_genres = movies['genres'].str.split('|', expand=True)
movie_genres[:10]
0 1 2 3 4 5 6 7 8 9
0 Adventure Animation Children Comedy Fantasy None None None None None
1 Adventure Children Fantasy None None None None None None None
2 Comedy Romance None None None None None None None None
3 Comedy Drama Romance None None None None None None None
4 Comedy None None None None None None None None None
5 Action Crime Thriller None None None None None None None
6 Comedy Romance None None None None None None None None
7 Adventure Children None None None None None None None None
8 Action None None None None None None None None None
9 Action Adventure Thriller None None None None None None None


Add a new column for comedy genre flag


movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')
movie_genres[:10]
0 1 2 3 4 5 6 7 8 9 isComedy
0 Adventure Animation Children Comedy Fantasy None None None None None True
1 Adventure Children Fantasy None None None None None None None False
2 Comedy Romance None None None None None None None None True
3 Comedy Drama Romance None None None None None None None True
4 Comedy None None None None None None None None None True
5 Action Crime Thriller None None None None None None None False
6 Comedy Romance None None None None None None None None True
7 Adventure Children None None None None None None None None False
8 Action None None None None None None None None None False
9 Action Adventure Thriller None None None None None None None False


Extract year from title e.g. (1995)


movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)
movies.tail()
movieId title genres year
9737 193581 Black Butler: Book of the Atlantic (2017) Action|Animation|Comedy|Fantasy 2017
9738 193583 No Game No Life: Zero (2017) Animation|Comedy|Fantasy 2017
9739 193585 Flint (2017) Drama 2017
9740 193587 Bungo Stray Dogs: Dead Apple (2018) Action|Animation 2018
9741 193609 Andrew Dice Clay: Dice Rules (1991) Comedy 1991

Parsing Timestamps

Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!

tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.dtypes
userId int64 movieId int64 tag object timestamp int64 dtype: object

Unix time / POSIX time / epoch time records time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags.head(5)
userId movieId tag timestamp
0 2 60756 funny 1445714994
1 2 60756 Highly quotable 1445714996
2 2 60756 will ferrell 1445714992
3 2 89774 Boxing story 1445715207
4 2 89774 MMA 1445715200
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

Data Type datetime64[ns] maps to either <M8[ns] or >M8[ns] depending on the hardware

tags['parsed_time'].dtype
dtype('<M8[ns]')
tags.head(2)
userId movieId tag timestamp parsed_time
0 2 60756 funny 1445714994 2015-10-24 19:29:54
1 2 60756 Highly quotable 1445714996 2015-10-24 19:29:56

Selecting rows based on timestamps

greater_than_t = tags['parsed_time'] > '2015-02-01' selected_rows = tags[greater_than_t] tags.shape, selected_rows.shape
((3683, 5), (1710, 5))

Sorting the table using the timestamps

tags.sort_values(by='parsed_time', ascending=True)[:10]
userId movieId tag timestamp parsed_time
1756 474 3181 Shakespeare 1137179352 2006-01-13 19:09:12
2212 474 6912 Rita Hayworth can dance! 1137179371 2006-01-13 19:09:31
1636 474 2494 Hungary 1137179426 2006-01-13 19:10:26
1635 474 2494 Holocaust 1137179426 2006-01-13 19:10:26
1497 474 1836 No DVD at Netflix 1137179444 2006-01-13 19:10:44
1961 474 4969 In Netflix queue 1137179563 2006-01-13 19:12:43
2409 474 26242 In Netflix queue 1137179570 2006-01-13 19:12:50
2413 474 27741 In Netflix queue 1137179587 2006-01-13 19:13:07
2231 474 7025 In Netflix queue 1137179593 2006-01-13 19:13:13
2485 474 41997 In Netflix queue 1137179603 2006-01-13 19:13:23

Average Movie Ratings over Time

## Are Movie ratings related to the year of launch?
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean() average_rating.tail()
movieId rating
9719 193581 4.0
9720 193583 3.5
9721 193585 3.5
9722 193587 3.5
9723 193609 4.0
joined = movies.merge(average_rating, on='movieId', how='inner') joined.head() joined.corr()
movieId rating
movieId 1.000000 0.027841
rating 0.027841 1.000000
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean() yearly_average[:10]
year rating
0 1902 3.500000
1 1903 2.500000
2 1908 4.000000
3 1915 2.000000
4 1916 3.562500
5 1917 4.500000
6 1919 2.000000
7 1920 3.678571
8 1921 4.100000
9 1922 3.531250
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7f12e47a2128>
Image in a Jupyter notebook

Do some years look better for the boxoffice movies than others?

Does any data point seem like an outlier in some sense?