Kernel: R (R-Project)

Lecture 21: Classification

Today:

  1. Basic Classifier

  2. Nearest Neighbor Classifier

  3. Assessing a Classifier

Setup and Data Upload

library("dplyr")
library("ggplot2")
# ignore the next two lines; these are just to make plots a bit smaller
library('repr')
options(repr.plot.width=3, repr.plot.height=3)
# dataset
cancerdata <- read.csv("breast-cancer.csv")
dim(cancerdata)
head(cancerdata, 3)
  1. 683
  2. 11
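| ID | Clump.Thickness | Uniformity.of.Cell.Size | Uniformity.of.Cell.Shape | Marginal.Adhesion | Single.Epithelial.Cell.Size | Bare.Nuclei | Bland.Chromatin | Normal.Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 0 |
| 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 0 |
| 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 0 |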
# Pick 400 rows to "train" the classifier; use the remaining 283 to test/assess the classifier
trainingdata <- cancerdata[ 1:400, 1:11 ]
testdata <- cancerdata[ 401:683, 1:11 ]
dim(trainingdata)
dim(testdata)
  1. 400
  2. 11
  1. 283
  2. 11
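The split above simply takes the first 400 rows for training. If the rows of a file happen to be stored in some meaningful order, a random split can be more representative. The following is only a minimal sketch of that alternative using base R's sample(); the data frame toy, its values, and the training size n_train are made up for illustration and are not the cancer data.

```r
# Toy data frame standing in for cancerdata (values are made up)
set.seed(1)                                # make the random split reproducible
toy <- data.frame(id = 1:10, Class = rep(c(0, 1), 5))

n_train <- 6                               # hypothetical training size
train_rows <- sample(nrow(toy), n_train)   # row indices chosen at random, without replacement
toy_train <- toy[train_rows, ]             # rows used to "train"
toy_test  <- toy[-train_rows, ]            # all remaining rows, used to test

dim(toy_train)
dim(toy_test)
```

Every row ends up in exactly one of the two pieces, just as in the 400/283 split used in this lecture.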

Exploration and A Simple Classifier (continued from Lecture 20)

# Pick two variables and visualize; see if there is a relationship
ggplot( trainingdata, aes( x = Clump.Thickness, y = Uniformity.of.Cell.Size, color = factor( Class ) ) ) +
  geom_point( position = "jitter" )
[Figure: jittered scatter plot of Clump.Thickness (x) against Uniformity.of.Cell.Size (y), colored by Class]

Observation:

If Clump.Thickness < 7 AND Uniformity.of.Cell.Size < 3.75, then the label should be 0 (not cancer).

Otherwise, the label should be 1 (cancer).

# Simple Classifier
# If Clump.Thickness < ... AND Uniformity.of.Cell.Size < ... , then the label should be 0 (not cancer)
# Otherwise, label should be 1 (cancer)
simple_classifier <- function( thickness, uniformity ){
  if( thickness < 7 && uniformity < 3.75 ){
    label <- 0
  } else {
    label <- 1
  }
  label
}
# test
simple_classifier( 3, 6)
1
# predicting the label of the second row of test data
simple_classifier( testdata$Clump.Thickness[ 2 ], testdata$Uniformity.of.Cell.Size[ 2 ] )
1
# actual label of row 100 of the test data; to check the prediction for row 2, compare with testdata$Class[ 2 ] instead
testdata$Class[ 100 ]
1
dim(testdata)[1]
283
# use loop to predict the label of each row in testdata
num_rows_test <- dim(testdata)[1] # number of rows of testdata
prediction <- data.frame( pred_label = double( num_rows_test ) )
count <- 1
while( count <= num_rows_test ){
  prediction$pred_label[count] <- simple_classifier( testdata$Clump.Thickness[count] , testdata$Uniformity.of.Cell.Size[count] )
  count <- count + 1
}
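The while loop above calls simple_classifier one row at a time, because if() with && only handles a single value. As an aside, the same rule can be applied to whole columns at once with ifelse() and the element-wise &. This is a sketch, not part of the lecture's code: the name simple_classifier_vec and the input values are made up for illustration.

```r
# Vectorized form of the same rule: ifelse() applies the test element-wise
simple_classifier_vec <- function(thickness, uniformity) {
  ifelse(thickness < 7 & uniformity < 3.75, 0, 1)
}

# Made-up values (not taken from the cancer data)
thickness  <- c(3, 8, 5, 10)
uniformity <- c(1, 2, 6, 10)

simple_classifier_vec(thickness, uniformity)
# 0 1 1 1
```

With the real data, a call such as simple_classifier_vec(testdata$Clump.Thickness, testdata$Uniformity.of.Cell.Size) should give the same labels as the loop, without the counter.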
accuracy <- sum( testdata$Class == prediction$pred_label ) / length(testdata$Class)
accuracy
0.978798586572438
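Accuracy is a single number, so it does not show which kind of mistake the classifier makes (labeling a cancer case as 0, or a non-cancer case as 1). One way to see that breakdown is to cross-tabulate actual and predicted labels with table(). The sketch below uses short made-up label vectors, not results from the cancer data.

```r
# Made-up labels for illustration (not results from the cancer data)
actual    <- c(0, 0, 1, 1, 1, 0)
predicted <- c(0, 1, 1, 1, 0, 0)

# Accuracy: fraction of predictions that match the actual labels
mean(actual == predicted)

# Cross-tabulate actual (rows) against predicted (columns) labels
confusion <- table(actual = actual, predicted = predicted)
confusion
```

Here 4 of the 6 predictions match, and the off-diagonal cells of the table count the two kinds of errors (one of each in this toy example). For the lecture's data, the analogous call would be table(actual = testdata$Class, predicted = prediction$pred_label).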