Project: Audrey Romjue - FromDataToDiscovery-Fall2018

Kernel: R (R-Project)

Lab 9: Project 2 Introduction

Welcome to the ninth lab session of our course!

Each week, you will be asked to complete a lab exercise just like this one. The labs are important parts of the course; you should be spending recitation time working on these questions.

You will work in groups of two. However, each member of the group must turn in their own lab writeup.

Make sure to ask your partner or your TA if you are not sure that you are understanding the exercises.

Learning Objectives

In today's lab, you will:

Form your team for Project 2
Explore the dataset that we will use in Project 2
Review Classification and Nearest Neighbor Search

Names & NetIDs

Your Name: Audrey Romjue (agr348)

In this lab, I collaborated with: Maren Altman (moa257) and Yanfei Qin (yq617)

Task 1: Forming your group for Project 2

Please form your groups now. Your group should have 2 or 3 members. (You can choose to work with your Project 1 team, or form a new team.)

Once you form your group, please edit the cell below so that it displays the names and netIDs of your team members correctly. Each group member should complete this step in their individual Lab04 Jupyter notebook.

In [1]:

# The code below creates a data frame called "myteam", which has two columns: Name and NetID.
# Please edit the content of this data frame so that it contains the actual names and netIDs of your team members

myteam <- data.frame( Name = c( "Audrey Romjue", "Maren Altman", "Yanfei Qin" ),
                      NetID = c("agr348", "moa257", "yq617") )
myteam

Name	NetID
Audrey Romjue	agr348
Maren Altman	moa257
Yanfei Qin	yq617

In [2]:

# Task 1
# 5 points
# Replace the ... below with "myteaminfo.csv" (including the quotation marks)
# Then, run this code cell after you edit the one above
# This creates a csv file in your Lab04 folder containing your group information
write.csv( myteam, "myteaminfo.csv" )

The Dataset for Project 2

You should discuss this part of the project with your Project 2 group members, but each member should submit their own lab work.

Detailed instructions for Project 2 will be posted on NYU Classes (under Resources) later. In the meantime, the upcoming homework assignments and labs will contain parts that get you started with the project.

In Project 2, you will create a Movie Classifier which will classify movies into one of two genres--action or romance--based on the frequency of words in its script.

Start by loading the packages and the two csv files in this folder.

In [3]:

# load dplyr and ggplot2 as usual
library( dplyr )
library( ggplot2 )

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

In [4]:

# load the main datasets (run this cell)

movies <- read.csv('movies.csv')
stem <- read.csv('stem.csv')

Task 2: Explore the Dataset

Get an initial sense of the contents of each data frame. For each data frame, answer the following questions:

How many variables? How many rows? What are the observations?
What information does each data frame contain? How is it related to the other data frame? What are the differences between them?
The movies dataframe contains a lot of columns. Which column tells us the "class"/"category" that we classify the movies into? Which columns contain attributes of the movies?

Take note of your answers in the cell below. It is currently a code cell; you will need to change it to a "Markdown" cell once you type your answer/notes below.

In [7]:

# Task 2
# any code that you need to run to answer Task 2 should go here

dim(movies)
head(movies, 3)

dim(stem)
head(stem, 5)

242
5006

Title	Genre	Year	Rating	X..Votes	X..Words	i	the	to	a	⋯	psychopath
the terminator	action	1984	8.1	183538	1849	0.04002163	0.04380746	0.02541915	0.02487831	⋯	0.000000000
batman	action	1989	7.6	112731	2836	0.05148096	0.03385049	0.02397743	0.02820875	⋯	0.000000000
tomorrow never dies	action	1997	6.4	47198	4215	0.02870700	0.05432977	0.03036773	0.02182681	⋯	0.000237248

40053
2

Stem	Word
sowel	sowell
everybodyit	everybodyit
uabortu	uabortu
wood	woods
spider	spiders

Task 2, continued

5 points

movies

How many variables? How many rows? What are the observations?
- There are 5006 variables and 242 rows. The observations are movies, with each movie title listed in the first column.
What information does each data frame contain? How is it related to the other data frame? What are the differences between them?
- This data frame contains general information about each movie, such as the year it was made and its genre. It also includes the number of words in each movie and, for each stemmed word, the proportion of the movie's words that it accounts for. It is related to the other data frame in that its word columns are stems that also appear in stem. It is different because stem does not describe any particular movie.
The movies dataframe contains a lot of columns. Which column tells us the "class"/"category" that we classify the movies into? Which columns contain attributes of the movies?
- In the movies dataframe, the column "Genre" tells us the class or category that we classify the movies into. The columns with words, from "i" to "fling", are the attributes of the movies.

stem

How many variables? How many rows? What are the observations?
- There are 2 variables and 40053 rows. Each observation is a word paired with its stem.
What information does each data frame contain? How is it related to the other data frame? What are the differences between them?
- This dataframe contains information about the stems of full words. It is related to the other dataframe because it includes information about words, but is different because it contains more specific information about the words. It is also different because it includes significantly less information: the movies dataframe includes a lot of information about movies, while the stem dataframe is just words.

Task 3: Understanding the Columns of the `movies` Data frame

The columns other than "Title", "Genre", "Year", "Rating", "# Votes" and "# Words" in the movies table are all words that appear in some of the movies in our dataset. These words have been stemmed, or abbreviated heuristically, in an attempt to make different inflected forms of the same base word into the same string. For example, the column "manag" is the sum of proportions of the words "manage", "manager", "managed", and "managerial" (and perhaps others) in each movie. This is a common technique used in machine learning and natural language processing.

Stemming makes it a little tricky to search for the words you want to use, so we have provided the stem data frame that will let you see examples of unstemmed versions of each stemmed word. Run the code below to peek into this data frame.

In [8]:

# Task 3
# 0 points

head(stem)

Stem	Word
sowel	sowell
everybodyit	everybodyit
uabortu	uabortu
wood	woods
spider	spiders
hang	hanging
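
The stem data frame can be searched in both directions: from a stem to the words it covers, and from a word to its stem. The sketch below uses a small stand-in called stem_toy (a hypothetical name); its "manag" rows are assumed from the example in the text, not read from stem.csv.

```r
# Toy stand-in for the stem data frame (the real one has 40053 rows).
# The "manag" rows are an assumption based on the example in the text;
# the other two rows come from the head(stem) output above.
stem_toy <- data.frame(
  Stem = c("manag", "manag", "manag", "manag", "wood", "spider"),
  Word = c("manage", "manager", "managed", "managerial", "woods", "spiders"),
  stringsAsFactors = FALSE
)

# All unstemmed words that collapse to the stem "manag"
manag_words <- stem_toy$Word[ stem_toy$Stem == "manag" ]
manag_words

# Going the other way: which stem (column name) represents "spiders"?
spider_stem <- stem_toy$Stem[ match("spiders", stem_toy$Word) ]
spider_stem
```

With the real data, replacing stem_toy by stem should work the same way, provided the column names are Stem and Word as shown in the output above.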

Task 4: Splitting Into Training and Test Data, part 1

In this "warm up" version, we will split the movies dataframe into two dataframes: training0 and test0. The training data, training0, will consist of the first 2/3 (roughly) of the rows of the movies dataframe while the test data, test0, will consist of the remaining 1/3 of the rows.

We have included partial code below; please complete the code and run it. Make sure that you understand what the code does.

In [9]:

# Task 4
# 4 points

# total number of rows in movies
num_rows <- dim(movies)[1]

# specify the proportion of the data that is used for training
training_proportion <- (2/3) 

num_training <- round( training_proportion * num_rows )   # round the number to the nearest integer

# row indices for the training and the test data:
training_row_indices <- 1:num_training
test_row_indices <- (num_training+1):num_rows

# split the movies dataframe
training0 <- movies[ training_row_indices,  ]
test0 <- movies[ test_row_indices , ]

In [10]:

# Task 4 Check (run this cell)
dim(training0)
dim(test0)

# training0 should have 161 rows and test0 should have 81 rows; both should have 5006 columns

161
5006

81
5006

Task 5: Splitting Into Training and Test Data, part 2

Next, we will split the movies dataframe into two dataframes: training and test. The training data, training, will consist of roughly 2/3 of the rows of the movies dataframe while the test data, test, will consist of the other 1/3 of the rows.

The main difference here is that we will now select the rows for the training data randomly instead of simply choosing the first 2/3 of the rows of the movies dataframe.

We have included partial code below; please complete the code and run it. Make sure that you understand what the code does.

In [11]:

# Task 5 (there is only one thing you need to complete in this code)
# 1 point

# total number of rows in movies
num_rows <- dim(movies)[1]
# specify the proportion of the data that is used for training
training_proportion <- 2/3   # later you can change 2/3 into any number between 0 and 1
num_training <- round( training_proportion * num_rows )   # round the number to the nearest integer

# shuffle all the 242 rows
shuffled_row_indices <- sample( 1:num_rows, num_rows, replace = FALSE)

# the row indices for the training data are the first num_training elements of the shuffled row indices
# the row indices for the test data are the remaining elements of the shuffled row indices
training_row_indices <- shuffled_row_indices[ 1:num_training ]
test_row_indices <- shuffled_row_indices[ (num_training+1):num_rows ]


# split the movies dataframe
training <- movies[ training_row_indices,  ]
test <- movies[ test_row_indices, ]
head(training)
head(test)

	Title	Genre	Year	Rating	X..Votes	X..Words	i	the	to	a	⋯	pub	richter
38	seven	action	1979	6.1	259	4313	0.02596800	0.03756086	0.03199629	0.02295386	⋯	0.000000000	0.000000000
151	an american werewolf in london	romance	1981	7.5	24443	4241	0.03984909	0.03631219	0.03136053	0.02805942	⋯	0.000235793	0.000000000
226	the cider house rules	romance	1999	7.5	38836	4963	0.04029821	0.02841024	0.03223856	0.01974612	⋯	0.000000000	0.000000000
143	stranglehold	action	2007	8.6	907	1455	0.02474227	0.04879725	0.02749141	0.02268041	⋯	0.000000000	0.002061856
189	shampoo	romance	1975	6.2	4406	7258	0.04960044	0.01846239	0.03058694	0.02411132	⋯	0.000000000	0.000000000
197	say anything...	romance	1989	7.5	25220	6317	0.05540605	0.02089600	0.03134399	0.02580339	⋯	0.000000000	0.000000000

	Title	Genre	Year	Rating	X..Votes	X..Words	i	the	to	a	⋯	foster	garrison
115	snow falling on cedars	romance	1999	6.7	8483	4088	0.02372798	0.05797456	0.02568493	0.02739726	⋯	0.000000000	0.00000000
23	arctic blue	action	1993	4.8	464	4115	0.02405832	0.03791009	0.02818955	0.02575942	⋯	0.000000000	0.00000000
130	entrapment	romance	1999	6.1	41120	4031	0.02530389	0.05160010	0.02704044	0.03249814	⋯	0.000000000	0.00000000
200	star wars: the empire strikes back	action	1982	8.0	42	5006	0.03096284	0.03415901	0.02636836	0.02097483	⋯	0.000000000	0.00019976
19	never been kissed	romance	1999	5.7	27409	3794	0.04138113	0.03110174	0.03426463	0.01976805	⋯	0.000263574	0.00000000
102	the family man	romance	2000	6.6	34509	5214	0.03835827	0.03049482	0.02531646	0.02953587	⋯	0.000000000	0.00000000
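
Because sample() draws a new shuffle on every run, the rows shown above will differ each time the cell is executed. A minimal sketch of a reproducible split, assuming you want the same shuffle on every run: fix the random seed with set.seed() before sampling. The data frame toy_movies and the seed value 2018 are made up for illustration.

```r
# Toy data frame standing in for movies (12 made-up rows)
toy_movies <- data.frame( id = 1:12,
                          Genre = rep( c("action", "romance"), each = 6 ) )

set.seed(2018)   # any fixed integer works; 2018 is an arbitrary choice

num_rows <- nrow(toy_movies)
num_training <- round( (2/3) * num_rows )

# sample(n) returns a random permutation of 1:n
shuffled <- sample(num_rows)
training_toy <- toy_movies[ shuffled[ 1:num_training ], ]
test_toy <- toy_movies[ shuffled[ (num_training+1):num_rows ], ]

# Sanity checks: the sizes add up and no row lands in both sets
nrow(training_toy) + nrow(test_toy) == num_rows
length( intersect(training_toy$id, test_toy$id) ) == 0

# Genre counts in each part
table(training_toy$Genre)
table(test_toy$Genre)
```

The two table() calls are a quick way to see whether both genres are represented in each part of the split.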

Task 6: Visualizing Movie Attributes

Intuitively, we can be persuaded to believe that certain words are used more frequently in one genre than in another. For example, the word "love" might appear more frequently in a romance movie than in an action movie, while the word "gunshot" might appear more frequently in an action movie than in a romance movie.

Let's check if this is true by creating a scatterplot, with the frequency of the word "love" on the x-axis and the frequency of the word "gunshot" on the y-axis, and with the color of the dots corresponding to the genre of the movie.

In [12]:

# Task 6 , part 1: just run this cell
# 0 points

# First, let's check that the word "love" appears exactly as a column name of the movies data frame.
#    To do this, we use the match() function:
#       - the match function takes two inputs: WORD and LIST.
#       - match( WORD, LIST ) returns the index of the element WORD in the list LIST
#    The function names( DATAFRAME )  returns a list of the column names of the dataframe 

match( "love", names(movies) )
# this should return 113, which means that love is the name of the 113th column of the movies data frame

match( "gunshot", names(movies) )
# this should return 4150, which means that gunshot is the name of the 4150th column of the movies data frame

match("gun", names(movies) )
# this should return <NA>, which means that gun is NOT a column name of the movies data frame

113

4150

<NA>
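
Since match() only finds exact matches, it can help to search the column names by pattern when you are not sure how a word was stemmed. The sketch below uses a short made-up vector col_names in place of names(movies); the entry "gunfir" is a hypothetical stem used only for illustration.

```r
# Hypothetical column names standing in for names(movies)
col_names <- c("Title", "Genre", "love", "gunshot", "gunfir", "money", "feel")

# match() looks for an exact match only, so "gun" is not found (NA)
exact_hit <- match("gun", col_names)
exact_hit

# %in% answers TRUE/FALSE instead of returning an index
has_love <- "love" %in% col_names
has_love

# grep() with value = TRUE returns every name that contains the pattern
gun_like <- grep("gun", col_names, value = TRUE)
gun_like
```

On the real data, grep("gun", names(movies), value = TRUE) would list whichever stemmed columns contain "gun".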

In [13]:

# Task 6, part 2: visualize!
# 5 points

ggplot( training, aes(x=love, y=gunshot, color=factor(Genre))) + geom_point(position = "jitter")
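
A numeric complement to the scatterplot is to compare the average frequency of each word by genre. This is only a sketch: the data frame toy_freq and all of its values are made up for illustration and are not taken from movies.csv.

```r
# Made-up frequencies for six imaginary movies (NOT real data)
toy_freq <- data.frame(
  Genre   = c("action", "action", "action", "romance", "romance", "romance"),
  love    = c(0.001, 0.002, 0.000, 0.004, 0.006, 0.005),
  gunshot = c(0.0003, 0.0000, 0.0006, 0.0000, 0.0000, 0.0003)
)

# tapply( VALUES, GROUPS, FUNCTION ) applies FUNCTION to VALUES within each group
mean_love <- tapply( toy_freq$love, toy_freq$Genre, mean )
mean_gunshot <- tapply( toy_freq$gunshot, toy_freq$Genre, mean )

mean_love
mean_gunshot
```

Running the same two tapply() calls on training instead of toy_freq would show whether the intuition about "love" and "gunshot" holds in the actual data.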

Task 7: Visualizing Movie Attributes, part 2

We will repeat what we did in Task 6, with other pairs of columns:

Try plotting money and feel
Pick two other attributes, plot them. (To see what words are available, enter names(movies); this command lists the column names of the movies data frame.)

In [14]:

# Task 7 (money vs. feel)
# 5 points

ggplot( training, aes(x=money, y=feel, color=factor(Genre)) ) + geom_point(position="jitter")

In [17]:

names(movies)

In [18]:

# Task 7, continued (two other attributes)
# 5 points

ggplot( training, aes(x=egg, y=shrimp, color=factor(Genre)) ) + geom_point(position="jitter")

Task 8: Measuring Distance Between Movies

Our ultimate goal is to build a "Nearest Neighbor Classifier" to predict whether a movie belongs to the action or romance genre.

Working towards this goal, we will start by measuring distances between movies, in terms of just two "features": money and feel.

To make things a little simpler, we will limit our work to a much smaller data frame first, containing just five movies: "Batman Returns", "The Avengers", "Titanic", "Star Wars", and "Shakespeare in Love". We will build a data frame called movies_small which only contains these five movies, with four columns: Title, Genre, money, and feel.
We will then compute the distance from each of these movies to "Batman Returns". First, we need to make our notion of similarity more precise. We will say that the distance between two movies is the straight-line distance between them when we plot their features in a scatter diagram. This distance is called the Euclidean ("yoo-KLID-ee-un") distance, whose formula is $\sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}$ .
For example, in the movie Titanic (in the training set), 0.0009768 of all the words in the movie are "money" and 0.0017094 are "feel". Its distance from Batman Returns on this 2-word feature set is $\sqrt{(0.000502-0.0009768)^2+(0.004016-0.0017094)^2} \approx 0.00235496$ . (If we included more or different features, the distance could be different.)
A third movie, The Avengers (in the training set), has a "money" frequency of 0 and a "feel" frequency of 0.001115.
Create an empty data frame called df with a column called distance. Then, write a while loop that fills in this column with the distance from each movie to the movie "Batman Returns".
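
The Titanic calculation above can be checked with a few lines of R. This is a minimal sketch: euclid is a hypothetical helper name, and the numbers are the rounded frequencies quoted in the text.

```r
# Euclidean distance between two numeric vectors of the same length
euclid <- function( p, q ) {
    sqrt( sum( (p - q)^2 ) )
}

# Rounded (money, feel) frequencies quoted in the text
batman  <- c( money = 0.000502,  feel = 0.004016 )
titanic <- c( money = 0.0009768, feel = 0.0017094 )

d <- euclid( batman, titanic )
d   # approximately 0.00235496
```

Because euclid sums over all elements of the vectors, the same function would also work if more than two features were included.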

In [19]:

# Task 8.1: Creating the smaller data frame
# 5 points

# first find the row numbers in which the movies are in the big movies data frame (all titles are in lower case)
batman_row <- match( "batman returns", movies$Title)
avengers_row <- match( "the avengers", movies$Title)
titanic_row <- match( "titanic", movies$Title)
starwars_row <- match( "star wars", movies$Title)
shakespeare_row <- match( "shakespeare in love", movies$Title)

# find the column numbers in which the columns Title, Genre, money, and feel are
Title_col <- match( "Title", names(movies))
Genre_col <- match( "Genre", names(movies))
money_col <- match( "money", names(movies))
feel_col <- match( "feel", names(movies))

# keep the five rows and the four columns Title, Genre, money, and feel
movies_small <- movies[ c(batman_row, avengers_row, titanic_row, starwars_row, shakespeare_row), c( Title_col, Genre_col, money_col, feel_col)]

head(movies_small)

	Title	Genre	money	feel
242	batman returns	action	0.000502008	0.004016064
6	the avengers	action	0.000000000	0.001115449
29	titanic	romance	0.000976801	0.001709402
217	star wars	action	0.000508647	0.002797558
11	shakespeare in love	romance	0.000692042	0.000461361

In [20]:

# Task 8.2: Computing the distance from each movie to "Batman Returns"
# 5 points

df <- data.frame( distance=double(5) )

count <- 1
while( count <= 5 ){
    
    df$distance[count] <- sqrt( (movies_small$money[1] - movies_small$money[count])^2 + (movies_small$feel[1] - movies_small$feel[count])^2 )
    
    count <- count + 1
}

In [21]:

# Task 8.2 Check

# view the data frame
df

distance
0.000000000
0.002943736
0.002355020
0.001218524
0.003559779
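
The same distances can also be computed without a loop, since R arithmetic works on whole columns at once. The sketch below rebuilds the five movies by hand (values copied from the head(movies_small) output, titles from the match() calls) in a data frame called small, a name chosen here for illustration. It then takes one step towards the nearest neighbor idea by finding the movie closest to "Batman Returns".

```r
# The five movies, typed in from the output above
small <- data.frame(
  Title = c("batman returns", "the avengers", "titanic", "star wars", "shakespeare in love"),
  Genre = c("action", "action", "romance", "action", "romance"),
  money = c(0.000502008, 0.000000000, 0.000976801, 0.000508647, 0.000692042),
  feel  = c(0.004016064, 0.001115449, 0.001709402, 0.002797558, 0.000461361),
  stringsAsFactors = FALSE
)

# Vectorized Euclidean distance from every row to row 1 (Batman Returns)
dist_to_batman <- sqrt( (small$money - small$money[1])^2 +
                        (small$feel  - small$feel[1])^2 )
dist_to_batman

# Nearest neighbor of Batman Returns, excluding Batman Returns itself
others <- small[ -1, ]
nearest <- others[ which.min( dist_to_batman[-1] ), ]
nearest$Title   # "star wars"
nearest$Genre   # "action"
```

With only the two features money and feel, the closest of the other four movies is "star wars", so a 1-nearest-neighbor rule on this tiny set would label "Batman Returns" as action.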