Pandas
pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.
pandas build upon numpy and scipy providing easy-to-use data structures and data manipulation functions with integrated indexing.
The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and data ingestion, the key features of pandas this notebook covers are:
Generating descriptive statistics on data
Data cleaning using built in pandas functions
Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
Merging multiple datasets using dataframes
Working with timestamps and time-series data
Additional Recommended Resources:
pandas Documentation: http://pandas.pydata.org/pandas-docs/stable/
Python for Data Analysis by Wes McKinney
Python Data Science Handbook by Jake VanderPlas
Let's get started with our first pandas notebook!
Import Libraries
Introduction to pandas Data Structures
*pandas* has two main data structures it uses, namely, *Series* and *DataFrames*.
pandas Series
pandas Series one-dimensional labeled array.
pandas DataFrame
pandas DataFrame is a 2-dimensional labeled data structure.
Create DataFrame from dictionary of Python Series
one | two | |
---|---|---|
dancy | NaN | 4444.0 |
ball | 200.0 | 222.0 |
apple | 100.0 | 111.0 |
two | five | |
---|---|---|
dancy | 4444.0 | NaN |
ball | 222.0 | NaN |
apple | 111.0 | NaN |
Create DataFrame from list of Python dictionaries
alex | alice | dora | ema | joe | |
---|---|---|---|---|---|
0 | 1.0 | NaN | NaN | NaN | 2.0 |
1 | NaN | 20.0 | 10.0 | 5.0 | NaN |
alex | alice | dora | ema | joe | |
---|---|---|---|---|---|
orange | 1.0 | NaN | NaN | NaN | 2.0 |
red | NaN | 20.0 | 10.0 | 5.0 | NaN |
joe | dora | alice | |
---|---|---|---|
0 | 2.0 | NaN | NaN |
1 | NaN | 10.0 | 20.0 |
Basic DataFrame operations
one | two | |
---|---|---|
apple | 100.0 | 111.0 |
ball | 200.0 | 222.0 |
cerill | NaN | 333.0 |
clock | 300.0 | NaN |
dancy | NaN | 4444.0 |
one | two | three | |
---|---|---|---|
apple | 100.0 | 111.0 | 11100.0 |
ball | 200.0 | 222.0 | 44400.0 |
cerill | NaN | 333.0 | NaN |
clock | 300.0 | NaN | NaN |
dancy | NaN | 4444.0 | NaN |
one | two | three | flag | |
---|---|---|---|---|
apple | 100.0 | 111.0 | 11100.0 | False |
ball | 200.0 | 222.0 | 44400.0 | False |
cerill | NaN | 333.0 | NaN | False |
clock | 300.0 | NaN | NaN | True |
dancy | NaN | 4444.0 | NaN | False |
one | two | flag | |
---|---|---|---|
apple | 100.0 | 111.0 | False |
ball | 200.0 | 222.0 | False |
cerill | NaN | 333.0 | False |
clock | 300.0 | NaN | True |
dancy | NaN | 4444.0 | False |
one | flag | |
---|---|---|
apple | 100.0 | False |
ball | 200.0 | False |
cerill | NaN | False |
clock | 300.0 | True |
dancy | NaN | False |
one | flag | copy_of_one | |
---|---|---|---|
apple | 100.0 | False | 100.0 |
ball | 200.0 | False | 200.0 |
cerill | NaN | False | NaN |
clock | 300.0 | True | 300.0 |
dancy | NaN | False | NaN |
one | flag | copy_of_one | one_upper_half | |
---|---|---|---|---|
apple | 100.0 | False | 100.0 | 100.0 |
ball | 200.0 | False | 200.0 | 200.0 |
cerill | NaN | False | NaN | NaN |
clock | 300.0 | True | 300.0 | NaN |
dancy | NaN | False | NaN | NaN |
Case Study: Movie Data Analysis
This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore with it using *pandas*.
Download the Dataset
Please note that you will need to download the dataset. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to fit on the edX platform due to size constraints.
Here are the links to the data source and location:
Data Source: MovieLens web site (filename: ml-20m.zip)
Once the download completes, please make sure the data files are in a directory called movielens in your Week-3-pandas folder.
Let us look at the files in this dataset using the UNIX command ls.
README.txt genome-tags.csv links.csv movies.csv ratings.csv tags.csv
9743
userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
Use Pandas to Read the Dataset
In this notebook, we will be using three CSV files: * **ratings.csv :** *userId*,*movieId*,*rating*, *timestamp* * **tags.csv :** *userId*,*movieId*, *tag*, *timestamp* * **movies.csv :** *movieId*, *title*, *genres*
Using the read_csv function in pandas, we will ingest these three files.
movieId | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
5 | 6 | Heat (1995) | Action|Crime|Thriller |
6 | 7 | Sabrina (1995) | Comedy|Romance |
7 | 8 | Tom and Huck (1995) | Adventure|Children |
8 | 9 | Sudden Death (1995) | Action |
9 | 10 | GoldenEye (1995) | Action|Adventure|Thriller |
10 | 11 | American President, The (1995) | Comedy|Drama|Romance |
11 | 12 | Dracula: Dead and Loving It (1995) | Comedy|Horror |
12 | 13 | Balto (1995) | Adventure|Animation|Children |
13 | 14 | Nixon (1995) | Drama |
14 | 15 | Cutthroat Island (1995) | Action|Adventure|Romance |
userId | movieId | tag | timestamp | |
---|---|---|---|---|
0 | 2 | 60756 | funny | 1445714994 |
1 | 2 | 60756 | Highly quotable | 1445714996 |
2 | 2 | 60756 | will ferrell | 1445714992 |
3 | 2 | 89774 | Boxing story | 1445715207 |
4 | 2 | 89774 | MMA | 1445715200 |
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1 | 1 | 4.0 | 964982703 |
1 | 1 | 3 | 4.0 | 964981247 |
2 | 1 | 6 | 4.0 | 964982224 |
3 | 1 | 47 | 5.0 | 964983815 |
4 | 1 | 50 | 5.0 | 964982931 |
Data Structures
Series
DataFrames
userId | movieId | tag | |
---|---|---|---|
0 | 2 | 60756 | funny |
1 | 2 | 60756 | Highly quotable |
2 | 2 | 60756 | will ferrell |
3 | 2 | 89774 | Boxing story |
4 | 2 | 89774 | MMA |
userId | movieId | tag | |
---|---|---|---|
0 | 2 | 60756 | funny |
11 | 18 | 431 | gangster |
2000 | 474 | 5450 | women |
Descriptive Statistics
Let's look how the ratings are distributed!
userId | movieId | rating | |
---|---|---|---|
count | 100836.000000 | 100836.000000 | 100836.000000 |
mean | 326.127564 | 19435.295718 | 3.501557 |
std | 182.618491 | 35530.987199 | 1.042529 |
min | 1.000000 | 1.000000 | 0.500000 |
25% | 177.000000 | 1199.000000 | 3.000000 |
50% | 325.000000 | 2991.000000 | 3.500000 |
75% | 477.000000 | 8122.000000 | 4.000000 |
max | 610.000000 | 193609.000000 | 5.000000 |
userId | movieId | rating | |
---|---|---|---|
userId | 1.000000 | 0.006773 | -0.049348 |
movieId | 0.006773 | 1.000000 | -0.004061 |
rating | -0.049348 | -0.004061 | 1.000000 |
Data Cleaning: Handling Missing Data
Thats nice ! No NULL values !
Thats nice ! No NULL values !
We have some tags which are NULL.
Thats nice ! No NULL values ! Notice the number of lines have reduced.
Data Visualization
Slicing Out Columns
title | genres | |
---|---|---|
0 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | Grumpier Old Men (1995) | Comedy|Romance |
3 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | Father of the Bride Part II (1995) | Comedy |
userId | movieId | rating | |
---|---|---|---|
100826 | 610 | 162350 | 3.5 |
100827 | 610 | 163937 | 3.5 |
100828 | 610 | 163981 | 3.5 |
100829 | 610 | 164179 | 5.0 |
100830 | 610 | 166528 | 4.0 |
100831 | 610 | 166534 | 4.0 |
100832 | 610 | 168248 | 5.0 |
100833 | 610 | 168250 | 5.0 |
100834 | 610 | 168252 | 5.0 |
100835 | 610 | 170875 | 3.0 |
Filters for Selecting Rows
userId | movieId | rating | |
---|---|---|---|
36 | 1 | 608 | 5.0 |
38 | 1 | 661 | 5.0 |
40 | 1 | 733 | 4.0 |
43 | 1 | 804 | 4.0 |
44 | 1 | 919 | 5.0 |
45 | 1 | 923 | 5.0 |
46 | 1 | 940 | 5.0 |
47 | 1 | 943 | 4.0 |
48 | 1 | 954 | 5.0 |
50 | 1 | 1023 | 5.0 |
51 | 1 | 1024 | 5.0 |
52 | 1 | 1025 | 5.0 |
53 | 1 | 1029 | 5.0 |
55 | 1 | 1031 | 5.0 |
56 | 1 | 1032 | 5.0 |
57 | 1 | 1042 | 4.0 |
58 | 1 | 1049 | 5.0 |
59 | 1 | 1060 | 4.0 |
60 | 1 | 1073 | 5.0 |
61 | 1 | 1080 | 5.0 |
movieId | title | genres | |
---|---|---|---|
322 | 364 | Lion King, The (1994) | Adventure|Animation|Children|Drama|Musical|IMAX |
483 | 551 | Nightmare Before Christmas, The (1993) | Animation|Children|Fantasy|Musical |
488 | 558 | Pagemaster, The (1994) | Action|Adventure|Animation|Children|Fantasy |
506 | 588 | Aladdin (1992) | Adventure|Animation|Children|Comedy|Musical |
511 | 594 | Snow White and the Seven Dwarfs (1937) | Animation|Children|Drama|Fantasy|Musical |
512 | 595 | Beauty and the Beast (1991) | Animation|Children|Fantasy|Musical|Romance|IMAX |
513 | 596 | Pinocchio (1940) | Animation|Children|Fantasy|Musical |
522 | 610 | Heavy Metal (1981) | Action|Adventure|Animation|Horror|Sci-Fi |
527 | 616 | Aristocats, The (1970) | Animation|Children |
534 | 631 | All Dogs Go to Heaven 2 (1996) | Adventure|Animation|Children|Fantasy|Musical|R... |
movieId | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
12 | 13 | Balto (1995) | Adventure|Animation|Children |
44 | 48 | Pocahontas (1995) | Animation|Children|Drama|Musical|Romance |
205 | 239 | Goofy Movie, A (1995) | Animation|Children|Comedy|Romance |
272 | 313 | Swan Princess, The (1994) | Animation|Children |
322 | 364 | Lion King, The (1994) | Adventure|Animation|Children|Drama|Musical|IMAX |
483 | 551 | Nightmare Before Christmas, The (1993) | Animation|Children|Fantasy|Musical |
488 | 558 | Pagemaster, The (1994) | Action|Adventure|Animation|Children|Fantasy |
506 | 588 | Aladdin (1992) | Adventure|Animation|Children|Comedy|Musical |
511 | 594 | Snow White and the Seven Dwarfs (1937) | Animation|Children|Drama|Fantasy|Musical |
512 | 595 | Beauty and the Beast (1991) | Animation|Children|Fantasy|Musical|Romance|IMAX |
513 | 596 | Pinocchio (1940) | Animation|Children|Fantasy|Musical |
522 | 610 | Heavy Metal (1981) | Action|Adventure|Animation|Horror|Sci-Fi |
527 | 616 | Aristocats, The (1970) | Animation|Children |
534 | 631 | All Dogs Go to Heaven 2 (1996) | Adventure|Animation|Children|Fantasy|Musical|R... |
Group By and Aggregate
movieId | |
---|---|
rating | |
0.5 | 1370 |
1.0 | 2811 |
1.5 | 1791 |
2.0 | 7551 |
2.5 | 5550 |
3.0 | 20047 |
3.5 | 13136 |
4.0 | 26818 |
4.5 | 8551 |
5.0 | 13211 |
rating | |
---|---|
movieId | |
1 | 3.920930 |
2 | 3.431818 |
3 | 3.259615 |
4 | 2.357143 |
5 | 3.071429 |
rating | |
---|---|
movieId | |
1 | 215 |
2 | 110 |
3 | 52 |
4 | 7 |
5 | 49 |
rating | |
---|---|
movieId | |
193581 | 1 |
193583 | 1 |
193585 | 1 |
193587 | 1 |
193609 | 1 |
Merge Dataframes
userId | movieId | tag | |
---|---|---|---|
0 | 2 | 60756 | funny |
1 | 2 | 60756 | Highly quotable |
2 | 2 | 60756 | will ferrell |
3 | 2 | 89774 | Boxing story |
4 | 2 | 89774 | MMA |
movieId | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
movieId | title | genres | userId | tag | |
---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 336 | pixar |
1 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 474 | pixar |
2 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 567 | fun |
3 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | 62 | fantasy |
4 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | 62 | magic board game |
More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html
Combine aggreagation, merging, and filters to get useful analytics
movieId | rating | |
---|---|---|
0 | 1 | 3.920930 |
1 | 2 | 3.431818 |
2 | 3 | 3.259615 |
3 | 4 | 2.357143 |
4 | 5 | 3.071429 |
movieId | title | genres | rating | |
---|---|---|---|---|
9719 | 193581 | Black Butler: Book of the Atlantic (2017) | Action|Animation|Comedy|Fantasy | 4.0 |
9720 | 193583 | No Game No Life: Zero (2017) | Animation|Comedy|Fantasy | 3.5 |
9721 | 193585 | Flint (2017) | Drama | 3.5 |
9722 | 193587 | Bungo Stray Dogs: Dead Apple (2018) | Action|Animation | 3.5 |
9723 | 193609 | Andrew Dice Clay: Dice Rules (1991) | Comedy | 4.0 |
movieId | title | genres | rating | |
---|---|---|---|---|
9713 | 191005 | Gintama (2017) | Action|Adventure|Comedy|Sci-Fi | 4.5 |
9716 | 193571 | Silver Spoon (2014) | Comedy|Drama | 4.0 |
9717 | 193573 | Love Live! The School Idol Movie (2015) | Animation | 4.0 |
9719 | 193581 | Black Butler: Book of the Atlantic (2017) | Action|Animation|Comedy|Fantasy | 4.0 |
9723 | 193609 | Andrew Dice Clay: Dice Rules (1991) | Comedy | 4.0 |
movieId | title | genres | rating | |
---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 3.920930 |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance | 3.259615 |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance | 2.357143 |
4 | 5 | Father of the Bride Part II (1995) | Comedy | 3.071429 |
6 | 7 | Sabrina (1995) | Comedy|Romance | 3.185185 |
movieId | title | genres | rating | |
---|---|---|---|---|
9708 | 190209 | Jeff Ross Roasts the Border (2017) | Comedy | 4.0 |
9713 | 191005 | Gintama (2017) | Action|Adventure|Comedy|Sci-Fi | 4.5 |
9716 | 193571 | Silver Spoon (2014) | Comedy|Drama | 4.0 |
9719 | 193581 | Black Butler: Book of the Atlantic (2017) | Action|Animation|Comedy|Fantasy | 4.0 |
9723 | 193609 | Andrew Dice Clay: Dice Rules (1991) | Comedy | 4.0 |
Vectorized String Operations
movieId | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
Split 'genres' into multiple columns
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Adventure | Animation | Children | Comedy | Fantasy | None | None | None | None | None |
1 | Adventure | Children | Fantasy | None | None | None | None | None | None | None |
2 | Comedy | Romance | None | None | None | None | None | None | None | None |
3 | Comedy | Drama | Romance | None | None | None | None | None | None | None |
4 | Comedy | None | None | None | None | None | None | None | None | None |
5 | Action | Crime | Thriller | None | None | None | None | None | None | None |
6 | Comedy | Romance | None | None | None | None | None | None | None | None |
7 | Adventure | Children | None | None | None | None | None | None | None | None |
8 | Action | None | None | None | None | None | None | None | None | None |
9 | Action | Adventure | Thriller | None | None | None | None | None | None | None |
Add a new column for comedy genre flag
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | isComedy | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Adventure | Animation | Children | Comedy | Fantasy | None | None | None | None | None | True |
1 | Adventure | Children | Fantasy | None | None | None | None | None | None | None | False |
2 | Comedy | Romance | None | None | None | None | None | None | None | None | True |
3 | Comedy | Drama | Romance | None | None | None | None | None | None | None | True |
4 | Comedy | None | None | None | None | None | None | None | None | None | True |
5 | Action | Crime | Thriller | None | None | None | None | None | None | None | False |
6 | Comedy | Romance | None | None | None | None | None | None | None | None | True |
7 | Adventure | Children | None | None | None | None | None | None | None | None | False |
8 | Action | None | None | None | None | None | None | None | None | None | False |
9 | Action | Adventure | Thriller | None | None | None | None | None | None | None | False |
Extract year from title e.g. (1995)
movieId | title | genres | year | |
---|---|---|---|---|
9737 | 193581 | Black Butler: Book of the Atlantic (2017) | Action|Animation|Comedy|Fantasy | 2017 |
9738 | 193583 | No Game No Life: Zero (2017) | Animation|Comedy|Fantasy | 2017 |
9739 | 193585 | Flint (2017) | Drama | 2017 |
9740 | 193587 | Bungo Stray Dogs: Dead Apple (2018) | Action|Animation | 2018 |
9741 | 193609 | Andrew Dice Clay: Dice Rules (1991) | Comedy | 1991 |
Parsing Timestamps
Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!
Unix time / POSIX time / epoch time records time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970
userId | movieId | tag | timestamp | |
---|---|---|---|---|
0 | 2 | 60756 | funny | 1445714994 |
1 | 2 | 60756 | Highly quotable | 1445714996 |
2 | 2 | 60756 | will ferrell | 1445714992 |
3 | 2 | 89774 | Boxing story | 1445715207 |
4 | 2 | 89774 | MMA | 1445715200 |
Data Type datetime64[ns] maps to either <M8[ns] or >M8[ns] depending on the hardware
userId | movieId | tag | timestamp | parsed_time | |
---|---|---|---|---|---|
0 | 2 | 60756 | funny | 1445714994 | 2015-10-24 19:29:54 |
1 | 2 | 60756 | Highly quotable | 1445714996 | 2015-10-24 19:29:56 |
Selecting rows based on timestamps
Sorting the table using the timestamps
userId | movieId | tag | timestamp | parsed_time | |
---|---|---|---|---|---|
1756 | 474 | 3181 | Shakespeare | 1137179352 | 2006-01-13 19:09:12 |
2212 | 474 | 6912 | Rita Hayworth can dance! | 1137179371 | 2006-01-13 19:09:31 |
1636 | 474 | 2494 | Hungary | 1137179426 | 2006-01-13 19:10:26 |
1635 | 474 | 2494 | Holocaust | 1137179426 | 2006-01-13 19:10:26 |
1497 | 474 | 1836 | No DVD at Netflix | 1137179444 | 2006-01-13 19:10:44 |
1961 | 474 | 4969 | In Netflix queue | 1137179563 | 2006-01-13 19:12:43 |
2409 | 474 | 26242 | In Netflix queue | 1137179570 | 2006-01-13 19:12:50 |
2413 | 474 | 27741 | In Netflix queue | 1137179587 | 2006-01-13 19:13:07 |
2231 | 474 | 7025 | In Netflix queue | 1137179593 | 2006-01-13 19:13:13 |
2485 | 474 | 41997 | In Netflix queue | 1137179603 | 2006-01-13 19:13:23 |
Average Movie Ratings over Time
## Are Movie ratings related to the year of launch?movieId | rating | |
---|---|---|
9719 | 193581 | 4.0 |
9720 | 193583 | 3.5 |
9721 | 193585 | 3.5 |
9722 | 193587 | 3.5 |
9723 | 193609 | 4.0 |
movieId | rating | |
---|---|---|
movieId | 1.000000 | 0.027841 |
rating | 0.027841 | 1.000000 |
year | rating | |
---|---|---|
0 | 1902 | 3.500000 |
1 | 1903 | 2.500000 |
2 | 1908 | 4.000000 |
3 | 1915 | 2.000000 |
4 | 1916 | 3.562500 |
5 | 1917 | 4.500000 |
6 | 1919 | 2.000000 |
7 | 1920 | 3.678571 |
8 | 1921 | 4.100000 |
9 | 1922 | 3.531250 |
Do some years look better for the boxoffice movies than others?
Does any data point seem like an outlier in some sense?