MATH 157 FINAL: PROJECT MACHINE LEARNING AND ALGORITHMIC BIAS
Introduction to Machine Learning
As a branch of Artificial Intelligence, Machine Learning is a data analysis method that automates analytical model building. The idea is that, with minimal human involvement, systems learn from data, find patterns, and make decisions.
The significance of Machine Learning lies in iteration. As models are exposed to new data, the systems adapt independently, learning to produce decisions and results that are both reliable and repeatable.
Examples and uses of Machine Learning can be found in many of the everyday things we take for granted:
Netflix recommendations based on shows you've watched.
Twitter surfacing posts related to your interests or posts involving you.
Used by health care industries to collect real-time patient data and analyze next steps for patient care.
Used by the government to analyze data related to saving money and spending it efficiently, along with finding fraud and reducing identity theft.
Used by oil and gas companies to find new energy sources and predict failure in refinery sensors. Also used to streamline oil distribution so that it is more efficient and cost-effective.
scikit-learn
To understand the inner workings of Machine Learning, our course introduced the scikit-learn module. This is a Python library with Machine Learning tools and functions that is
Simple and efficient for data mining and data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source and commercially usable
The module covers six main categories of Machine Learning tasks:
Classification
Regression
Clustering
Dimensionality Reduction
Model Selection
Preprocessing
For this presentation we will focus on Classification.
Classification with K-Nearest Neighbors (KNN)
In Machine Learning, Classification is "the process of predicting class or category from observed values or given data points." In other words, it is the task of producing a categorical output, such as black or white, or inside or outside.
In mathematical terms, "classification is the task of approximating a mapping function (f) from input variables (X) to output variables (Y)." In other words, you put an x into some sort of function and you get an output y.
The K-Nearest Neighbors (KNN) algorithm falls under Supervised Learning and is used to predict categorical variables.
Let us look at an example of classification that involves the iris dataset of three types of irises we have seen in class and on past homework assignments.
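The dataset can be loaded directly from scikit-learn. A minimal sketch (variable names are our own, not the exact course code):

```python
from sklearn.datasets import load_iris

# Load the classic iris dataset: 150 samples, 4 features
# (sepal length/width, petal length/width), 3 classes.
iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)            # (150, 4)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']
```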
Next, we build the KNN classifier: given a new iris, it determines which samples in the iris dataset have the closest features and assigns the predominant class among them.
Based on its sepal width and length, we can then predict what kind of iris we have.
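The steps above might look something like this (a sketch, not the exact course code; the new flower's measurements are made up for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# Use only sepal length and sepal width (the first two columns).
X, y = iris.data[:, :2], iris.target

# Fit the classifier: for a new point it finds the closest
# training samples and assigns the majority class among them.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Hypothetical new iris: sepal length 5.0 cm, sepal width 3.5 cm.
print(iris.target_names[knn.predict([[5.0, 3.5]])[0]])  # prints: setosa
```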
Let's now build another KNN classifier and predict another iris, this time with only two features, specifically sepal width and sepal length.
If we ran the code, we would see two plots of the sepal space and the prediction, one using a single nearest neighbor and one using three nearest neighbors.
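A sketch of that comparison without the plots (the test point is made up; near the versicolor/virginica boundary the two models may disagree, so the outputs depend on the data):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target  # sepal length and width only

sample = [[6.0, 3.0]]  # a made-up point in the sepal space

# Compare a single-nearest-neighbor model with a 3-nearest-neighbor one.
results = {}
for k in (1, 3):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    results[k] = iris.target_names[knn.predict(sample)[0]]
    print(k, results[k])
```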
Thus, this iris example gives us a good understanding of a classification problem: the task of determining whether an object is a Setosa, a Versicolor, or a Virginica iris, where the label comes from three distinct categories. Using a KNN classifier, we can easily predict the type of an iris based on the reference database, with the prediction falling into one of these categories.
Algorithmic Bias
So what exactly is Algorithmic Bias?
In the context of Machine Learning, Algorithmic Bias, according to Jaspreet, is "the phenomena of observing results that are systematically prejudiced due to faulty assumptions." In other words, it is the idea that systems obtain data that is not representative of the entire population. The data is gathered from the population in a way that favors the one doing the sampling, or rather is biased toward the one gathering the data: the human!
In general, if data was gathered under a biased perspective of some sort, then once it is input into a system, the system itself will learn to be biased as well! And this is both dangerous and unethical! A machine will only learn from what we put into it, and if we are biased, the machine will follow the bias we taught it!
Let's take a look at how Bias is involved with the Classification with KNN.
Classification with KNN and Algorithmic Bias
Calculate the distance
Before the algorithm performs any step to build KNN, it first measures the distance (Euclidean, Manhattan, Minkowski, or weighted) from the new data point to all the data that is already classified. So one of the cons of KNN is that you must know a meaningful distance function.
KNN uses distance metrics to find similarities or dissimilarities; that is why it requires scaling of the data, because KNN uses the Euclidean distance between two data points to find nearest neighbors.
Its formula is the following: d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
The distances between the first row and every row in the dataset, including itself, can then be computed. Basically, we calculate the distance between each train and test data point.
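One way to compute these distances (a sketch using NumPy on the iris data, not the exact course code):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

# Euclidean distance from the first row to every row, including itself.
distances = np.sqrt(((X - X[0]) ** 2).sum(axis=1))

print(distances[0])   # distance to itself is 0.0
print(distances[:3])
```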
Find k nearest neighbors
After calculating the distance between each train and test data point, the scikit-learn KNN classifier selects the top nearest neighbors according to the value of k. There is no structured method for finding the best value of "K"; we need to experiment with various values by trial and error, treating the training data as unknown.
From here, we see that the computation cost is quite high because we need to compute the distance of each query instance to all training samples.
Looking up the 3 most similar records in this dataset, we find that, as expected, the first record is most similar to itself and sits at the top of the list.
How about 4?
The first 3 most similar neighbors are the same; only the 4th most similar is added.
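This neighbor lookup might be sketched with scikit-learn's NearestNeighbors (the exact indices and distances depend on the data):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

iris = load_iris()
X = iris.data

# Look up the k most similar records to the first row, for k = 3 and 4.
for k in (3, 4):
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    dist, idx = nn.kneighbors([X[0]])
    print(k, idx[0], dist[0])
```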
Therefore, another challenging step in KNN is determining the value of the parameter K (the number of nearest neighbors). A good value for K is determined experimentally, and if the value we pick is not a good one, it may introduce bias into the prediction later.
We will use the iris dataset again here...
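A minimal sketch of such a model (the train/test split, random_state, and k = 7 are our own choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Minkowski distance with p = 2 is exactly the Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=7, metric='minkowski', p=2)
knn.fit(X_train, y_train)

print("train accuracy:", knn.score(X_train, y_train))
print("test accuracy:", knn.score(X_test, y_test))
```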
Here we use the Minkowski distance metric with a value of p = 2, i.e. the KNN classifier uses the Euclidean distance formula.
Let's create another kNN model with 3 neighbors...
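One way to sketch the comparison (again a self-contained example; the split and the second k value are our own choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Compare a 3-neighbor model with a 7-neighbor one on the same split.
scores = {}
for k in (3, 7):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(k, scores[k])
```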
Comparing the two models, we can see that choosing different values of k definitely impacts the predictions on the train and test data, so the accuracy between them will also differ. Based on the iris dataset, it looks like the smaller the value of k we choose, the higher the accuracy score on the train data.
Algorithmic Bias in Real Life
So how specifically can we see bias in our system?
A perfect example of how Machine Learning and Algorithmic Bias go hand-in-hand is gathering data that involved discrimination in order to estimate the average housing income in a certain area, which may reflect racial bias in housing (Kiran 2/28 Lecture).
Here is another interesting one:
According to the "No Free Lunch" theorem (Wolpert and Macready, 1997), all classifiers have the same error rate when averaged over all possible data-generating distributions. Therefore, a classifier must have a certain bias toward certain distributions and functions in order to model those distributions better.
In this article, the author uses an example to illustrate racial bias: a controversial recidivism prediction tool called COMPAS, used to predict criminals' risk of committing crimes in the future. A study by ProPublica showed that the algorithm was twice as likely to label black defendants who did not eventually reoffend as high risk, compared to white defendants.
However, according to the measures COMPAS used, black and white defendants had the same misclassification rate.
It turns out that both results may be correct, since they use different measures of fairness. And no algorithm can perform equally well on both fairness measures if the base recidivism rates differ between blacks and whites; these fairness measures represent inherent tradeoffs.
So how exactly do we deal with the bias in our data and in Machine Learning?
We cannot eliminate bias in the data we receive. Even the most randomly selected sample of a certain population may have some bias involved. What we can do is minimize the bias as much as we can.
Furthermore...
Here are a few tips that might be helpful for managing/combating bias in Machine Learning, assuming the data input into the system is already biased:
Choose the right learning model for the problem
There’s no single model to follow that will avoid bias, but there are parameters that can inform your team as it’s building.
Choose a representative training data set
Making sure the training data is diverse and includes different groups is essential, but segmentation in the model can be problematic unless the real data is similarly segmented.
Monitor performance using real data
Simulate real-world applications as much as possible when building algorithms, since the public will not accept good intentions as an excuse for ethical violations.
Problems to Try:
Problem 1: Simple model of kNN
Load csv files "test-data.csv" and "train-data.csv" into separate dataframes.
Now, we need to predict the missing target variable in the test data, where the target variable is Survived. Start by creating the K-Nearest Neighbors model object and fitting it with the data we have. Then print out your number of neighbors and predict the target on the train dataset and the test dataset.
Finally, check your accuracy on the train dataset and the test dataset.
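A minimal sketch of the workflow (the real exercise loads "train-data.csv" and "test-data.csv"; tiny made-up frames stand in here so the sketch runs on its own, and the columns Age and Fare are illustrative assumptions):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Stand-ins for the CSVs: a labeled train frame and an unlabeled test frame.
train = pd.DataFrame({'Age': [22, 38, 26, 35, 54, 2],
                      'Fare': [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
                      'Survived': [0, 1, 1, 1, 0, 0]})
test = pd.DataFrame({'Age': [30, 40], 'Fare': [8.05, 60.00]})

X_train, y_train = train[['Age', 'Fare']], train['Survived']

# Create the KNN model object and fit it with the data we have.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print("neighbors:", knn.n_neighbors)
print("train predictions:", knn.predict(X_train))
print("test predictions:", knn.predict(test))
```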
SOURCES AND COLLABORATORS
Collaborated with Shannon He A LOT! We both worked together on this project extensively.
Machine Learning Intro: https://www.sas.com/en_us/insights/analytics/machine-learning.html
scikit-learn: https://scipy-lectures.org/packages/scikit-learn/index.html
Understanding of bias: https://towardsdatascience.com/understanding-and-reducing-bias-in-machine-learning-6565e23900ac
KNN: https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/?#
KNN cons: https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
Euclidean: https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
KNN analysis: https://towardsdatascience.com/a-simple-introduction-to-k-nearest-neighbors-algorithm-b3519ed98e
Manage bias: https://techcrunch.com/2018/11/06/3-ways-to-avoid-bias-in-machine-learning/