Kernel: R (R-Project)

Lab 9: Project 2 Introduction

Welcome to the ninth lab session of our course!

Each week, you will be asked to complete a lab exercise just like this one. The labs are important parts of the course; you should be spending recitation time working on these questions.

You will work in groups of two. However, each member of the group must turn in their own lab writeup.

Make sure to ask your partner or your TA if you are not sure that you are understanding the exercises.

Learning Objectives

In today's lab, you will:

  1. Form your team for Project 2

  2. Explore the dataset that we will use in Project 2

  3. Review Classification and Nearest Neighbor Search

Names & NetIDs

Your Name: Audrey Romjue (agr348)

In this lab, I collaborated with: Maren Altman (moa257) and Yanfei Qin (yq617)

Task 1: Forming your group for Project 2

Please form your groups now. Your group should have 2 or 3 members. (You can choose to work with your Project 1 team, or form a new team.)

Once you form your group, please edit the cell below so that it displays the names and netIDs of your team members correctly. Each group member should complete this step in their individual Lab04 Jupyter notebook.

# The code below creates a data frame called "myteam", which has two columns: Name and NetID.
# Please edit the content of this data frame so that it contains the actual names and netIDs of your team members
myteam <- data.frame(
    Name = c( "Audrey Romjue", "Maren Altman", "Yanfei Qin" ),
    NetID = c("agr348", "moa257", "yq617")
)
myteam

Name           NetID
Audrey Romjue  agr348
Maren Altman   moa257
Yanfei Qin     yq617
# Task 1
# 5 points
# Replace the ... below with "myteaminfo.csv" (including the quotation marks)
# Then, run this code cell after you edit the one above
# This creates a csv file in your Lab04 folder containing your group information
write.csv( myteam, "myteaminfo.csv" )

The Dataset for Project 2

You should discuss this part of the project with your Project 2 group members, but each member should submit their own lab work.

Detailed instructions for Project 2 will be posted on NYU Classes (under Resources) later. In the meantime, the upcoming homework assignments and labs will contain parts that get you started with the project.

In Project 2, you will create a Movie Classifier which will classify movies into one of two genres--action or romance--based on the frequency of words in its script.

Start by loading the packages and the two csv files in this folder.

# load dplyr and ggplot2 as usual
library( dplyr )
library( ggplot2 )

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

# load the main datasets (run this cell)
movies <- read.csv('movies.csv')
stem <- read.csv('stem.csv')

Task 2: Explore the Dataset

Get an initial sense of the contents of each data frame. For each data frame, answer the following questions:

  1. How many variables? How many rows? What are the observations?

  2. What information does each data frame contain? How is it related to the other data frame? What are the differences between them?

  3. The movies dataframe contains a lot of columns. Which column tells us the "class"/"category" that we classify the movies into? Which columns contain attributes of the movies?

Take note of your answers in the cell below. It is currently a code cell; you will need to change it to a "Markdown" cell once you type your answer/notes below.

# Task 2
# any code that you need to run to answer Task 2 should go here
dim(movies)
head(movies, 3)
dim(stem)
head(stem, 5)
  1. 242
  2. 5006
Title                Genre   Year  Rating  X..Votes  X..Words  i           the         to          a           ...  foster  pub  vegetarian  garrison  grammo  chimney  bikini  richter  psychopath   fling
the terminator       action  1984  8.1     183538    1849      0.04002163  0.04380746  0.02541915  0.02487831  ...  0       0    0           0         0       0        0       0        0.000000000  0
batman               action  1989  7.6     112731    2836      0.05148096  0.03385049  0.02397743  0.02820875  ...  0       0    0           0         0       0        0       0        0.000000000  0
tomorrow never dies  action  1997  6.4     47198     4215      0.02870700  0.05432977  0.03036773  0.02182681  ...  0       0    0           0         0       0        0       0        0.000237248  0
  1. 40053
  2. 2
Stem         Word
sowel        sowell
everybodyit  everybodyit
uabortu      uabortu
wood         woods
spider       spiders

Task 2, continued

5 points

movies

  1. How many variables? How many rows? What are the observations?

    • There are 5006 variables and 242 rows. The observations are movies, with each movie title listed in the first column.

  2. What information does each data frame contain? How is it related to the other data frame? What are the differences between them?

    • This data frame contains general information about each movie, such as the year it was made and its genre. It also includes the number of words in each movie and, for each stemmed word, the proportion of the movie's words that it accounts for. It is related to the stem data frame in that its word columns are stems of the words listed there. It is different because the stem data frame does not refer to any particular movie.

  3. The movies dataframe contains a lot of columns. Which column tells us the "class"/"category" that we classify the movies into? Which columns contain attributes of the movies?

    • In the movies dataframe, the column "Genre" tells us the class or category that we classify the movies into. The columns with words, from "i" to "fling", are the attributes of the movies.

stem

  1. How many variables? How many rows? What are the observations?

    • There are 2 variables and 40053 rows. Each observation is a word paired with its stem.

  2. What information does each data frame contain? How is it related to the other data frame? What are the differences between them?

    • This data frame maps full words to their stems. It is related to the movies data frame because both contain information about words, but it is different because it gives more specific information about the words themselves. It also contains significantly less information: the movies data frame holds many attributes for each movie, while the stem data frame is just a list of words.

Task 3: Understanding the Columns of the movies Data frame

The columns other than "Title", "Genre", "Year", "Rating", "# Votes" and "# Words" in the movies table are all words that appear in some of the movies in our dataset. These words have been stemmed, or abbreviated heuristically, in an attempt to make different inflected forms of the same base word into the same string. For example, the column "manag" is the sum of proportions of the words "manage", "manager", "managed", and "managerial" (and perhaps others) in each movie. This is a common technique used in machine learning and natural language processing.

Stemming makes it a little tricky to search for the words you want to use, so we have provided the stem data frame that will let you see examples of unstemmed versions of each stemmed word. Run the code below to peek into this data frame.

# Task 3
# 0 points
head(stem)

Stem         Word
sowel        sowell
everybodyit  everybodyit
uabortu      uabortu
wood         woods
spider       spiders
hang         hanging
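
To see which full words were folded into a given stem, one option is to filter the stem data frame on its Stem column. The sketch below uses a small made-up stand-in, stem_toy, built from the "manag" example above rather than the real stem.csv, so its rows are illustrative only.

```r
# Toy stand-in for the stem data frame (hypothetical rows based on the "manag" example)
stem_toy <- data.frame(
    Stem = c("manag", "manag", "manag", "manag", "wood"),
    Word = c("manage", "manager", "managed", "managerial", "woods"),
    stringsAsFactors = FALSE
)

# all unstemmed words that share the stem "manag"
manag_words <- stem_toy$Word[ stem_toy$Stem == "manag" ]
manag_words

# the reverse lookup: which stem does the word "woods" map to?
woods_stem <- stem_toy$Stem[ stem_toy$Word == "woods" ]
woods_stem
```

With the real data, the same two lines should work after replacing stem_toy with stem.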

Task 4: Splitting Into Training and Test Data, part 1

In this "warm up" version, we will split the movies dataframe into two dataframes: training0 and test0. The training data, training0, will consist of the first 2/3 (roughly) of the rows of the movies dataframe while the test data, test0, will consist of the remaining 1/3 of the rows.

We have included partial code below; please complete the code and run it. Make sure that you understand what the code does.

# Task 4
# 4 points

# total number of rows in movies
num_rows <- dim(movies)[1]

# specify the proportion of the data that is used for training
training_proportion <- (2/3)
num_training <- round( training_proportion * num_rows ) # round the number to the nearest integer

# row indices for the training and the test data:
training_row_indices <- 1:num_training
test_row_indices <- (num_training+1):num_rows

# split the movies dataframe
training0 <- movies[ training_row_indices, ]
test0 <- movies[ test_row_indices , ]

# Task 4 Check (run this cell)
dim(training0)
dim(test0)
# training0 should have 161 rows and test0 should have 81 rows; both should have 5006 columns
  1. 161
  2. 5006
  1. 81
  2. 5006

Task 5: Splitting Into Training and Test Data, part 2

Next, we will split the movies dataframe into two dataframes: training and test. The training data, training, will consist of roughly 2/3 of the rows of the movies dataframe while the test data, test, will consist of the other 1/3 of the rows.

The main difference here is that we will now select the rows for the training data randomly instead of simply choosing the first 2/3 of the rows of the movies dataframe.

We have included partial code below; please complete the code and run it. Make sure that you understand what the code does.

# Task 5 (there is only one thing you need to complete in this code)
# 1 point

# total number of rows in movies
num_rows <- dim(movies)[1]

# specify the proportion of the data that is used for training
training_proportion <- 2/3 # later you can change 2/3 into any number between 0 and 1
num_training <- round( training_proportion * num_rows ) # round the number to the nearest integer

# shuffle all the 242 rows
shuffled_row_indices <- sample( 1:num_rows, num_rows, replace = FALSE)

# row indices for the training data is the first num_training elements of the shuffled row indices
# row indices for the test data is the remaining elements of the shuffled row indices
training_row_indices <- shuffled_row_indices[ 1:num_training ]
test_row_indices <- shuffled_row_indices[ (num_training+1):num_rows ]

# split the movies dataframe
training <- movies[ training_row_indices, ]
test <- movies[ test_row_indices, ]

head(training)
head(test)
     Title                           Genre    Year  Rating  X..Votes  X..Words  i           the         to          a           ...  foster  pub          vegetarian  garrison  grammo  chimney  bikini  richter      psychopath  fling
38   seven                           action   1979  6.1     259       4313      0.02596800  0.03756086  0.03199629  0.02295386  ...  0       0.000000000  0           0         0       0        0       0.000000000  0           0
151  an american werewolf in london  romance  1981  7.5     24443     4241      0.03984909  0.03631219  0.03136053  0.02805942  ...  0       0.000235793  0           0         0       0        0       0.000000000  0           0
226  the cider house rules           romance  1999  7.5     38836     4963      0.04029821  0.02841024  0.03223856  0.01974612  ...  0       0.000000000  0           0         0       0        0       0.000000000  0           0
143  stranglehold                    action   2007  8.6     907       1455      0.02474227  0.04879725  0.02749141  0.02268041  ...  0       0.000000000  0           0         0       0        0       0.002061856  0           0
189  shampoo                         romance  1975  6.2     4406      7258      0.04960044  0.01846239  0.03058694  0.02411132  ...  0       0.000000000  0           0         0       0        0       0.000000000  0           0
197  say anything...                 romance  1989  7.5     25220     6317      0.05540605  0.02089600  0.03134399  0.02580339  ...  0       0.000000000  0           0         0       0        0       0.000000000  0           0

     Title                                Genre    Year  Rating  X..Votes  X..Words  i           the         to          a           ...  foster       pub  vegetarian  garrison    grammo  chimney  bikini  richter  psychopath  fling
115  snow falling on cedars               romance  1999  6.7     8483      4088      0.02372798  0.05797456  0.02568493  0.02739726  ...  0.000000000  0    0           0.00000000  0       0        0       0        0           0
23   arctic blue                          action   1993  4.8     464       4115      0.02405832  0.03791009  0.02818955  0.02575942  ...  0.000000000  0    0           0.00000000  0       0        0       0        0           0
130  entrapment                           romance  1999  6.1     41120     4031      0.02530389  0.05160010  0.02704044  0.03249814  ...  0.000000000  0    0           0.00000000  0       0        0       0        0           0
200  star wars: the empire strikes back   action   1982  8.0     42        5006      0.03096284  0.03415901  0.02636836  0.02097483  ...  0.000000000  0    0           0.00019976  0       0        0       0        0           0
19   never been kissed                    romance  1999  5.7     27409     3794      0.04138113  0.03110174  0.03426463  0.01976805  ...  0.000263574  0    0           0.00000000  0       0        0       0        0           0
102  the family man                       romance  2000  6.6     34509     5214      0.03835827  0.03049482  0.02531646  0.02953587  ...  0.000000000  0    0           0.00000000  0       0        0       0        0           0
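
Because sample() draws a different shuffle each time the cell runs, the rows that land in training and test change from run to run. One common way to make the split repeatable is to call set.seed() before sampling. The sketch below does this on a made-up 12-row data frame (toy, not the movies data) and checks that the two pieces do not overlap; the seed value 42 is arbitrary.

```r
# Toy data frame standing in for movies (hypothetical values)
toy <- data.frame(id = 1:12, value = (1:12) / 10)

set.seed(42)  # arbitrary seed; any fixed integer makes the shuffle repeatable

num_rows <- nrow(toy)
num_training <- round( (2/3) * num_rows )

# shuffle the row indices, then cut the shuffled vector into two pieces
shuffled <- sample( 1:num_rows, num_rows, replace = FALSE )
train_idx <- shuffled[ 1:num_training ]
test_idx <- shuffled[ (num_training + 1):num_rows ]

toy_training <- toy[ train_idx, ]
toy_test <- toy[ test_idx, ]

# sanity checks: the sizes add up and no row lands in both pieces
nrow(toy_training)
nrow(toy_test)
length( intersect(train_idx, test_idx) )
```

Rerunning the whole block gives the same split every time, which makes results easier to compare within a group.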

Task 6: Visualizing Movie Attributes

Intuitively, we can be persuaded to believe that certain words are used more frequently in one genre than in another. For example, the word "love" might appear more frequently in a romance movie than in an action movie, while the word "gunshot" might appear more frequently in an action movie than in a romance movie.

Let's check if this is true by creating a scatterplot, plotting the frequency of the word "love" on the x-axis and the frequency of the word "gunshot" on the y-axis, with the color of the dots corresponding to the genre of the movie.

# Task 6, part 1: just run this cell
# 0 points

# First, let's check that the word "love" indeed appears as a column of the movies data frame.
# To do this, we use the match() function:
# - the match function takes two inputs: WORD and LIST.
# - match( WORD, LIST ) returns the index of the element WORD in the list LIST
# The function names( DATAFRAME ) returns a list of the column names of the dataframe
match( "love", names(movies) ) # this should return 113, which means that love is the name of the 113th column of the movies data frame
match( "gunshot", names(movies) ) # this should return 4150, which means that gunshot is the name of the 4150th column of the movies data frame
match( "gun", names(movies) ) # this should return <NA>, which means that gun is NOT a column name of the movies data frame
113
4150
<NA>
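
match() also accepts a whole vector of words at once, and the %in% operator is a convenient companion that answers the same question with TRUE or FALSE. The sketch below uses a short made-up vector of column names, toy_names, in place of names(movies).

```r
# Hypothetical column names standing in for names(movies)
toy_names <- c("Title", "Genre", "love", "gunshot", "money", "feel")

candidates <- c("love", "gun", "feel")

# position of each candidate, or NA when it is not a column name
positions <- match( candidates, toy_names )
positions

# TRUE/FALSE version of the same question
is_column <- candidates %in% toy_names
is_column

# keep only the candidates that could actually be plotted
usable <- candidates[ is_column ]
usable
```

This can be handy when picking words for a plot, since a word such as "gun" may only be present under a different stem.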
# Task 6, part 2: visualize!
# 5 points
ggplot( training, aes(x=love, y=gunshot, color=factor(Genre)) ) +
    geom_point(position = "jitter")

[Figure: scatterplot of love (x-axis) vs. gunshot (y-axis) for the training data, points colored by Genre]

Task 7: Visualizing Movie Attributes, part 2

We will repeat what we did in Task 6, with other pairs of columns:

  1. Try plotting money and feel

  2. Pick two other attributes, plot them. (To see what words are available, enter names(movies); this command lists the column names of the movies data frame.)

# Task 7 (money vs. feel)
# 5 points
ggplot( training, aes(x=money, y=feel, color=factor(Genre)) ) +
    geom_point(position="jitter")

[Figure: scatterplot of money (x-axis) vs. feel (y-axis) for the training data, points colored by Genre]
names(movies)
# Task 7, continued (two other attributes)
# 5 points
ggplot( training, aes(x=egg, y=shrimp, color=factor(Genre)) ) +
    geom_point(position="jitter")

[Figure: scatterplot of egg (x-axis) vs. shrimp (y-axis) for the training data, points colored by Genre]

Task 8: Measuring Distance Between Movies

Our ultimate goal is to build a "Nearest Neighbor Classifier" to predict whether a movie belongs to the action or romance genre.

Working towards this goal, we will start by measuring distances between movies, in terms of just two "features": money and feel.

  1. To make things a little simpler, we will limit our work to a much smaller data frame first, containing just five movies: "Batman Returns", "The Avengers", "Titanic", "Star Wars", and "Shakespeare in Love". We will build a data frame called movies_small which only contains these five movies, with four columns: Title, Genre, money, and feel.

  2. We will then compute the distance between each of these movies to "Batman Returns". First, we need to make our notion of similarity more precise. We will say that the distance between two movies is the straight-line distance between them when we plot their features in a scatter diagram. This distance is called the Euclidean ("yoo-KLID-ee-un") distance, whose formula is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.

    For example, in the movie Titanic (in the training set), 0.0009768 of all the words in the movie are "money" and 0.0017094 are "feel". Its distance from Batman Returns on this 2-word feature set is $\sqrt{(0.000502 - 0.0009768)^2 + (0.004016 - 0.0017094)^2} \approx 0.00235496$. (If we included more or different features, the distance could be different.)

    A third movie, The Avengers (in the training set), is 0 "money" and 0.001115 "feel".

    Create an empty data frame called df with a column called distance. Then, write a while loop that fills in this column with the distance from each movie to the movie "Batman Returns".
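
The distance formula can be wrapped in a small helper function. The sketch below defines a hypothetical function, euclid_dist (not part of the lab's starter code), and checks it against the Titanic example using the rounded proportions quoted above.

```r
# Euclidean distance between the points (x1, y1) and (x2, y2)
euclid_dist <- function(x1, y1, x2, y2) {
    sqrt( (x1 - x2)^2 + (y1 - y2)^2 )
}

# Batman Returns: money = 0.000502,  feel = 0.004016
# Titanic:        money = 0.0009768, feel = 0.0017094
d_titanic <- euclid_dist( 0.000502, 0.004016, 0.0009768, 0.0017094 )
d_titanic   # approximately 0.00235496

# the distance from a movie to itself is 0
d_self <- euclid_dist( 0.000502, 0.004016, 0.000502, 0.004016 )
d_self
```

Because the function only uses arithmetic and sqrt(), it also accepts vectors for x2 and y2, which gives all distances to one movie in a single call.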

# Task 8.1: Creating the smaller data frame
# 5 points

# first find the row numbers in which the movies are in the big movies data frame (all titles are in lower case)
batman_row <- match( "batman returns", movies$Title)
avengers_row <- match( "the avengers", movies$Title)
titanic_row <- match( "titanic", movies$Title)
starwars_row <- match( "star wars", movies$Title)
shakespeare_row <- match( "shakespeare in love", movies$Title)

# find the column numbers in which the columns Title, Genre, money, and feel are
Title_col <- match( "Title", names(movies))
Genre_col <- match( "Genre", names(movies))
money_col <- match( "money", names(movies))
feel_col <- match( "feel", names(movies))

# keep the five rows and all four columns (Title included)
movies_small <- movies[ c(batman_row, avengers_row, titanic_row, starwars_row, shakespeare_row),
                        c( Title_col, Genre_col, money_col, feel_col) ]
head(movies_small)

     Title                Genre    money        feel
242  batman returns       action   0.000502008  0.004016064
6    the avengers         action   0.000000000  0.001115449
29   titanic              romance  0.000976801  0.001709402
217  star wars            action   0.000508647  0.002797558
11   shakespeare in love  romance  0.000692042  0.000461361
# Task 8.2: Computing the distance from each movie to "Batman Returns"
# 5 points
df <- data.frame( distance=double(5) )
count <- 1
while( count <= 5 ){
    # row 1 of movies_small is "Batman Returns"
    df$distance[count] <- sqrt( (movies_small$money[1] - movies_small$money[count])^2 +
                                (movies_small$feel[1] - movies_small$feel[count])^2 )
    count <- count + 1
}
# Task 8.2 Check
# view the data frame
df
distance
0.000000000
0.002943736
0.002355020
0.001218524
0.003559779
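
R arithmetic is vectorized, so the same distances can also be computed without a loop. The sketch below re-enters the five rows from the money and feel values printed in the Task 8.1 output (the data frame name small is made up here, and the titles are typed in by hand in the order used to build movies_small). It computes every distance to the first row in one expression and then looks up the closest other movie, which is the core step of a nearest neighbor classifier.

```r
# Rows re-entered by hand from the Task 8.1 output
small <- data.frame(
    Title = c("batman returns", "the avengers", "titanic", "star wars", "shakespeare in love"),
    Genre = c("action", "action", "romance", "action", "romance"),
    money = c(0.000502008, 0.000000000, 0.000976801, 0.000508647, 0.000692042),
    feel  = c(0.004016064, 0.001115449, 0.001709402, 0.002797558, 0.000461361),
    stringsAsFactors = FALSE
)

# distance from every row to row 1 (Batman Returns), in one vectorized step
distance <- sqrt( (small$money[1] - small$money)^2 + (small$feel[1] - small$feel)^2 )
round( distance, 9 )

# nearest neighbor of Batman Returns: smallest distance, excluding the movie itself
others <- small[ -1, ]
nearest <- others[ which.min( distance[-1] ), ]
nearest$Title
nearest$Genre
```

On these two features the closest movie to Batman Returns is Star Wars, so a 1-nearest-neighbor rule would label Batman Returns as action. With more or different features the nearest neighbor could change.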