Discussion

Clump together at a set of tables with a TA. Discuss your thoughts about the pre-class reading material.

Game time!

Now that we have this capability and have seen some of the dangers, we're going to spend this week on a game. The game has two goals: 1) build the best predictor that we can, and 2) at all times have an accurate idea of how well that predictor works.

For this game, we've managed to get our hands on some data about two diseases (D1 and D2). Each of these datasets has features in columns and examples in rows. Each feature represents a clinical measurement, while each row represents a person. We want to be able to predict whether or not a person has a disease (the last column).
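As a small illustration of the layout described above, here is a sketch on a tiny made-up array (the values and the three-feature width are assumptions for illustration, not the course data). It shows one way to separate the feature columns from the final label column with numpy slicing.

```python
import numpy as np

# Made-up stand-in data: 3 people (rows), 3 measurements plus 1 label (columns)
data = np.array([
    [0.5, 1.2, 3.1, 0.0],
    [0.7, 0.9, 2.8, 1.0],
    [0.4, 1.1, 3.3, 0.0],
])

# Features: every row, every column except the last
features = data[:, :-1]
# Labels: every row, last column only (disease status)
labels = data[:, -1]

print(features.shape)
print(labels)
```

Slicing with `-1` avoids hard-coding the number of features, which may be handy if the datasets differ in width.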

We'll supply you with four datasets for each disease throughout the week. For the first day, we've given you two of them. We also provide example code to read the data. From there, the path that you take is up to you. We do not know the best predictor, or even what the maximum achievable accuracy is for these data! This is a chance to experiment and find out what best captures disease status.

The machine learning algorithm that we've already introduced, SVM, has many settings that you can change. You've already experimented with the C parameter. You may also want to try different "kernel" settings, other "C" values, or even a different underlying algorithm!
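One possible way to explore these settings is a small loop over kernels and C values. The sketch below is only an illustration: it uses synthetic data from scikit-learn's `make_classification` as a stand-in (an assumption, since the real D1 and D2 files are not used here), and the particular kernels and C values are arbitrary choices, not recommendations.

```python
from sklearn import svm
from sklearn.datasets import make_classification

# Synthetic stand-in for the course data (an assumption, not D1 or D2):
# 200 examples in rows, 20 features in columns.
features, labels = make_classification(
    n_samples=200, n_features=20, random_state=0
)

# First half for training, second half held out for testing
train_features, train_labels = features[:100], labels[:100]
test_features, test_labels = features[100:], labels[100:]

# Record (training accuracy, testing accuracy) for each setting tried
results = {}
for kernel in ("linear", "rbf"):
    for C in (0.01, 1.0, 100.0):
        classifier = svm.SVC(C=C, kernel=kernel)
        classifier.fit(train_features, train_labels)
        train_score = classifier.score(train_features, train_labels)
        test_score = classifier.score(test_features, test_labels)
        results[(kernel, C)] = (train_score, test_score)
        print(kernel, C, "train:", train_score, "test:", test_score)
```

Comparing the training and testing accuracy for each setting gives a rough sense of which changes help on held-out data and which only help on the training set.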

If you feel like trying entirely different algorithms, a few potential ones are demonstrated in scikit-learn's documentation: http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

In the interest of recording your research steps, note whatever you change in the IPython notebook. We provide an example first move below. In every case, please label the move number, the goal (what you hope to implement), the rationale (why you've chosen to implement that, or why you are making that move as a result of the prior move), and an expected accuracy, which you fill out after you build and run your code.

In [ ]:
# D1 - Move #1

# Goal:
# Build an SVM classifier with C=0.00001 for D1.
# Use S1 as a training set and S2 as a testing set

# Rationale:
# We need somewhere to start. We might as well start here.

# Expected Accuracy
# Assessed using a held out test set: 0.5

# numpy provides the tools to easily load our data and split the
# features from the labels
import numpy as np

# We'll use an SVM from scikit learn
from sklearn import svm

# use numpy to load our training set
d1_train = np.loadtxt("D1_S1.csv", delimiter=",")
# features are all rows for the first 200 columns (indices 0-199)
d1_train_features = d1_train[:,:200]
# labels are in all rows at column index 200 (the column after the features)
d1_train_labels = d1_train[:,200]

# use numpy to load our testing set
d1_test = np.loadtxt("D1_S2.csv", delimiter=",")
# features are all rows for the first 200 columns (indices 0-199)
d1_test_features = d1_test[:,:200]
# labels are in all rows at column index 200 (the column after the features)
d1_test_labels = d1_test[:,200]

# Now we're going to construct a classifier. First we need to set up our parameters
# C=0.00001 matches the goal stated at the top of this move
classifier = svm.SVC(C=0.00001, kernel='linear')

# Once our parameters are set, we can fit the classifier to our data
classifier.fit(d1_train_features, d1_train_labels)

# Once we have our classifier, we can apply it back to the training examples and get our score.
# Since this is binary classification, the score is an accuracy.
train_score = classifier.score(d1_train_features, d1_train_labels)
print("Training Accuracy: " + str(train_score))

# We can also apply it back to our testing dataset
test_score = classifier.score(d1_test_features, d1_test_labels)
print("Testing Accuracy: " + str(test_score))
In [ ]:
# D1 - Move #2

# Goal:
# Build an SVM classifier with C= for D1.
# Use S1 as a training set and S2 as a testing set

# Rationale:
# We started somewhere and by raising C we will probably get better training and testing scores.

# Expected Accuracy
# Assessed using a held out test set: >0.5

# numpy provides the tools to easily load our data and split the
# features from the labels
import numpy as np

# We'll use an SVM from scikit learn
from sklearn import svm

# use numpy to load our training set
d1_train = np.loadtxt("D1_S1.csv", delimiter=",")
# features are all rows for the first 200 columns (indices 0-199)
d1_train_features = d1_train[:,:200]
# labels are in all rows at column index 200 (the column after the features)
d1_train_labels = d1_train[:,200]

# use numpy to load our testing set
d1_test = np.loadtxt("D1_S2.csv", delimiter=",")
# features are all rows for the first 200 columns (indices 0-199)
d1_test_features = d1_test[:,:200]
# labels are in all rows at column index 200 (the column after the features)
d1_test_labels = d1_test[:,200]

# Now we're going to construct a classifier. First we need to set up our parameters
classifier = svm.SVC(C=, kernel='linear')

# Once our parameters are set, we can fit the classifier to our data
classifier.fit(d1_train_features, d1_train_labels)

# Once we have our classifier, we can apply it back to the training examples and get our score.
# Since this is binary classification, the score is an accuracy.
train_score = classifier.score(d1_train_features, d1_train_labels)
print("Training Accuracy: " + str(train_score))

# We can also apply it back to our testing dataset
test_score = classifier.score(d1_test_features, d1_test_labels)
print("Testing Accuracy: " + str(test_score))