**Data imputation tutorial:**

Working with real-world data means the data is unlikely to be "clean": it may be missing observations. Rather than simply tossing out those examples, it is important to know how to deal with missing data. This tutorial walks through some simple imputation techniques.

First we need to read in the data.

In [ ]:

# install the impute package from Bioconductor
source("https://bioconductor.org/biocLite.R")
biocLite("impute")

library(ggplot2)
library(impute)
source("helper_functions.R")

In [1]:

# read in data
dataset <- read.csv('data.csv')

# look at first 6 lines
head(dataset)

| id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave.points_mean | ⋯ | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave.points_worst | symmetry_worst | fractal_dimension_worst | X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ⋯ | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NA |
| 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ⋯ | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NA |
| 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ⋯ | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NA |
| 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ⋯ | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NA |
| 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ⋯ | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NA |
| 843786 | M | 12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0.17000 | 0.1578 | 0.08089 | ⋯ | 23.75 | 103.40 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.12440 | NA |

Next we are going to introduce missing values to the feature "area_mean".

In [ ]:

# convert 10% of area_mean feature into NAs
dataset_na <- generate_NAs(dataset, 'area_mean', 0.1)
head(dataset_na[,"area_mean"], 10)
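The internals of `generate_NAs` live in `helper_functions.R` and are not shown here, but a hypothetical version of such a helper might look like the following sketch (the name and arguments match the call above; the body is an assumption):

```r
# hypothetical sketch of the generate_NAs helper: blank out a given
# proportion of a feature's values, chosen uniformly at random
generate_NAs <- function(dataset, feature, proportion) {
  n <- nrow(dataset)
  # pick the requested proportion of rows at random
  na_rows <- sample(n, size = round(proportion * n))
  # set the chosen entries of the feature to NA
  dataset[na_rows, feature] <- NA
  dataset
}
```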

**Mean imputation:**

The first imputation strategy we are going to implement is **mean imputation**. This involves estimating the missing values using the mean of the observed values. Starter code for this function has been provided below:

In [ ]:

# impute missing values using mean imputation
mean_imputation <- function(dataset, feature) {
    # using the dataset and the feature name, impute missing values using mean imputation
    # mean() calculates the mean of a vector
    # is.na() provides the indices for all NAs in the vector
}
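If you get stuck, one possible way to complete the starter function is sketched below (not the only valid answer):

```r
# impute missing values using mean imputation
mean_imputation <- function(dataset, feature) {
  # logical vector marking the NAs in the chosen column
  na_idx <- is.na(dataset[[feature]])
  # replace every NA with the mean of the observed values
  dataset[[feature]][na_idx] <- mean(dataset[[feature]], na.rm = TRUE)
  dataset
}
```

Note that `na.rm = TRUE` is essential: without it, `mean()` returns `NA` whenever the vector contains any missing values.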

In [ ]:

# impute missing values dataset_mi <- mean_imputation(dataset_na, 'area_mean')

Next we want to visualize how well this method works compared to the real values. To do this, we want to generate a scatterplot of the imputed values vs the real values and calculate the correlation between the two. A function has been written to do this for you. You can use it as below or **challenge yourself** and generate your own plot!

In [ ]:

plot_scatterplot(real = dataset$area_mean, imputed = dataset_mi$area_mean)
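If you take up the challenge of making your own plot, a minimal ggplot2 sketch might look like the following. This is a hypothetical re-implementation of the provided `plot_scatterplot` helper, whose actual internals are not shown:

```r
library(ggplot2)

# hypothetical sketch: scatter the imputed values against the real ones
# and report their Pearson correlation in the plot title
plot_scatterplot <- function(real, imputed) {
  r <- cor(real, imputed, use = "complete.obs")
  ggplot(data.frame(real = real, imputed = imputed),
         aes(x = real, y = imputed)) +
    geom_point(alpha = 0.5) +
    # dashed y = x line: perfect imputation would fall exactly on it
    geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
    labs(x = "Real values", y = "Imputed values",
         title = paste("Correlation:", round(r, 3)))
}
```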

**Random imputation:**

Next we are going to implement **random imputation**. This method randomly samples from the observed data to fill in the missing values.

In [ ]:

# impute missing values using random imputation
random_imputation <- function(dataset, feature) {
    # using the data and feature name, impute missing values using random imputation
    # sample() takes a vector and randomly samples from it depending on the size specified;
    # the replace argument says whether to sample with or without replacement
    # consider three steps:
    # 1) first figure out how many NAs you need to impute
    # 2) randomly sample the number of NAs you need from the observed data
    # 3) replace the NAs with the randomly sampled values and return the imputed dataset
}
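Again, one possible way to fill in the starter function, following the three steps in the comments (a sketch, not the only valid answer):

```r
# impute missing values using random imputation
random_imputation <- function(dataset, feature) {
  # 1) find the NAs and count how many need imputing
  na_idx <- is.na(dataset[[feature]])
  observed <- dataset[[feature]][!na_idx]
  # 2) randomly sample that many values from the observed data
  #    (replace = TRUE guards against having more NAs than observations)
  sampled <- sample(observed, size = sum(na_idx), replace = TRUE)
  # 3) replace the NAs with the sampled values
  dataset[[feature]][na_idx] <- sampled
  dataset
}
```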

In [ ]:

# impute missing values dataset_ri <- random_imputation(dataset_na, 'area_mean')

In [ ]:

plot_scatterplot(real = dataset$area_mean, imputed = dataset_ri$area_mean)

**K-Nearest Neighbours:**

A third effective method of data imputation is to apply machine learning models to predict the missing values. This can be done by treating the feature with missing data as the response and using the known values of the other features as predictors. One effective method is known as **k-nearest neighbours (KNN)**. This algorithm looks at the k data points nearest to the missing value and assigns either the majority class, if the feature is categorical, or an average of the neighbours' values, if the feature is continuous.

Check out the following documentation: https://www.rdocumentation.org/packages/caret/versions/6.0-79/topics/preProcess

In [ ]:

# impute missing values using knn
knn_imputation <- function(dataset, k) {
    # impute missing values using knn
    # see function impute.knn from the impute package
}
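One possible way to complete the wrapper is sketched below. It assumes the data frame layout shown earlier (numeric feature columns plus an `id` column and an all-NA `X` column); note that `impute.knn()` expects a numeric matrix, so the data frame is reduced to its usable numeric columns first:

```r
library(impute)   # Bioconductor package providing impute.knn()

# sketch of a KNN imputation wrapper around impute.knn()
knn_imputation <- function(dataset, k) {
  numeric_cols <- sapply(dataset, is.numeric)
  # exclude the id column (its large values would dominate the distances)
  # and any column that is entirely NA, which impute.knn() cannot impute
  usable <- numeric_cols &
    names(dataset) != "id" &
    sapply(dataset, function(x) !all(is.na(x)))
  mat <- as.matrix(dataset[, usable])
  # each missing entry is filled using an average over its k nearest rows
  dataset[, usable] <- impute.knn(mat, k = k)$data
  dataset
}
```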

Try running with several different values of k and see how that changes the correlation of the imputed values with the real values.

In [ ]:

dataset_knn <- knn_imputation(dataset_na, k = 10)

In [ ]:

plot_scatterplot(real = dataset$area_mean, imputed = dataset_knn$area_mean)