Week 3 Practical

This Notebook contains practical assignments for Week 3.

It contains guidance on:

how to read and write data in R
how to load data into a data frame and clean the data ready for analysis
basic usage of for and while loops and conditional statements such as if

It also contains practical tasks that you will have to implement yourself. We will load data stored in your SageMathCloud folder into the R workspace. Any other data can be used as long as it is explicitly described in your notebook and can be accessed via SageMathCloud.

You are free to base your work on the examples given here but you are also welcome to use different methods if you prefer, adding and/or creating new ways of displaying the data. You will need to add descriptions of what you have done in the assigned tasks and a comment on the results obtained using Markdown cells.

You will need to create a new notebook in the Week 3 folder of your SageMathCloud account and name it username_week3.ipynb, replacing username with your own username.

The notebooks will be marked and formative feedback given to you as shown in the class. The last version of your notebook saved by the deadlines indicated on the website will be the one that will receive feedback.

All the notebooks are meant to be used interactively. All code must be written in code cells, which can be executed by an R kernel. The outputs are not shown in this notebook, but the code cells are executable.

Accessing the data

For this practical we will mainly be using the data sets that are provided in the SageMathCloud folder. They are in the form of .txt files or .csv files. This is because the Week 3 practical is intended to provide guidance on how to use basic R commands for reading data stored in files and writing output of your analysis into files. You are free to use other appropriate data of yours that you have uploaded into your R workspace. If you do so you MUST describe the data set using a markdown cell and comment.

We will access data stored as comma-separated values and tab-delimited values and, as an example here, .xls (Excel) files. The latter type will not be used for the final assignment, but it is useful to know that data can be loaded into the R workspace directly from an .xls file.

Comma or tab delimited files

A popular data file format is a text file format where columns are separated by a tab, space or comma. We will use comma and tab delimited files, which have the following extensions:

.csv comma delimited
.txt tab or space delimited

***Reading the data***

In this notebook we will read both types of files into R. R reads tables of data from files with a command called read.table().

Exercise 1: Explore the command read.table() in R help. In a markdown cell give an example of the use of this command for .csv and .txt files.

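For comma-separated files there is also read.csv(), a version of read.table() with defaults suited to .csv files. A typical use of the command read.csv() is: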

data <- read.csv("nameFile.csv", header = TRUE)

This will read the file called "nameFile.csv" from your working directory. With header = TRUE the first line of the file is read as the column names, which are stored with the data. If the file is in a different location you need to specify where it is by giving its path. For example: data <- read.csv("data_wk3/smokers.csv", header = TRUE) will read the file "smokers.csv" in your data folder called data_wk3.
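
As a minimal sketch of the two reading commands, the example below first writes a small made-up table to temporary files and then reads it back with read.csv() and read.table(). The names toy, csv_file and txt_file and the values are purely illustrative and are not part of the course data.

```r
# Hypothetical toy table, written to temporary files so the example is self-contained
toy <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
csv_file <- tempfile(fileext = ".csv")
txt_file <- tempfile(fileext = ".txt")
write.csv(toy, csv_file, row.names = FALSE)
write.table(toy, txt_file, sep = "\t", row.names = FALSE, quote = FALSE)

# read.csv() assumes comma-separated columns
from_csv <- read.csv(csv_file, header = TRUE)

# read.table() needs the separator spelled out for tab-delimited files
from_txt <- read.table(txt_file, header = TRUE, sep = "\t")

# Both calls return a data.frame with the same columns
str(from_csv)
str(from_txt)
```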

After reading the data you need to create an appropriate data.frame for your analysis. The data.frame will have to contain all the information you need about the data. This will be important when we start using the Bioconductor framework.
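
To illustrate what to look for in a freshly created data.frame, here is a small sketch using an invented table. The column names and values are assumptions made up for illustration and are not the smokers data.

```r
# Invented example data; not the smokers data
df <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  group  = c("A", "A", "B", "B"),
  level  = c(10.2, NA, 8.7, 9.9)
)

str(df)              # column types and a preview of the values
dim(df)              # number of rows and columns
colSums(is.na(df))   # number of missing values per column
```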

Exercise 2: Load the data contained in the file smokers.csv and explore it. Check its structure using the command str() and explain in a markdown cell what you have found and whether there are problems with missing data.

Cleaning the data

Once we have loaded the data we want to make sure that all values have the type that we expect. The file also contains Excel's calculations for the t-tests, which we don't need. Data cleaning is typically a rather complex step in data analysis.

Exercise 3: In the smokers data the strings equal and unequal in column 2 create problems when R converts the data into a data.frame. What problems do they create? Suggest ways of coping with this problem, without altering the dimensions of the data.frame, to ensure that column 2 is treated as numeric. (TIP: Our solution was to avoid the problem when reading the file. Explore read.csv() for further information on this.)

***Writing the data***

We can write the output of our manipulations and calculations in R to our working directory or to a path specified in the command. In R we use the command write.table(), or the command write.csv() for comma-separated files. Explore these options in R help before using them.

As an example we can use: write.csv(data,file="nameFile.csv")
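
A short sketch of writing a table and checking the result follows. The data frame results and the file name results_example.csv are invented for illustration, and tempdir() stands in for whichever output folder you actually use.

```r
# Invented results table for illustration
results <- data.frame(gene = c("g1", "g2"), score = c(0.5, 1.25))

# tempdir() stands in for your own output folder
out_file <- file.path(tempdir(), "results_example.csv")

# row.names = FALSE stops R from writing the row numbers as an extra first column
write.csv(results, file = out_file, row.names = FALSE)

# Reading the file back is a quick check that it was written as intended
check <- read.csv(out_file)
print(check)
```

Without row.names = FALSE, write.csv() adds a first column holding the row names, which then appears as an extra column when the file is read again.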

Exercise 4: Create a data.frame from smokers data with NA values and numerical data. Exclude the rows that are calculations from an Excel file (Could they be the last two rows?). Write the data in the directory data_wk3 in a .csv file called "smokers_clean.csv".

Missing Data

Often in a data set we have missing values, expressed within R as NA entries. NA stands for 'Not Available' and means that data is missing. In this context, if we were looking at an Excel file the corresponding cells in the Excel spreadsheet would be empty. These entries can cause problems with R functions and commands. We can tell R to exclude missing values with the parameter na.rm, in functions such as mean().

You can use na.rm as follows:

y<-c(27,40,72,NA,89)
my<-mean(y,na.rm=TRUE)

You can also check which values in a data set are NA using the command is.na() and set them to a fixed value. For example:

y[is.na(y)] <- 0
y
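
Replacing missing values with a fixed value changes the summary statistics, so it is worth comparing the options side by side. The sketch below reuses the vector y from above, redefined so that the block runs on its own; the names mean_default, mean_removed, y_zero and mean_zero are illustrative.

```r
y <- c(27, 40, 72, NA, 89)

sum(is.na(y))      # how many values are missing
which(is.na(y))    # positions of the missing values

mean_default <- mean(y)                 # NA: a missing value propagates to the result
mean_removed <- mean(y, na.rm = TRUE)   # mean of the four observed values

y_zero <- y
y_zero[is.na(y_zero)] <- 0
mean_zero <- mean(y_zero)               # the 0 is counted as a real observation

c(default = mean_default, removed = mean_removed, zero_filled = mean_zero)
```

Here the mean is 57 when the NA is dropped and 45.6 when it is replaced by 0, which shows how the choice of treatment affects the result.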

Exercise 5: In the cleaned smokers data (from Exercise 4) analyse the problem of missing values by calculating descriptive statistics of the data with and without missing values. Visualise the data in the way that best highlights the problem. Explain your choices in a markdown cell. Evaluate whether there is a significant difference between smokers and non-smokers with a t-test. Plot the results.

Exponential and Log manipulations.

In R we can use the command exp(x) to compute $e^x$, and log10(x), log2(x) or logb(x, base=b) to compute $\log_{10}(x)$, $\log_2(x)$ or $\log_b(x)$.
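
A quick sketch of these commands on a small made-up vector is given below. The call log(x), the natural logarithm, is not mentioned above and is included only to show that exp() reverses it.

```r
x <- c(1, 8, 1000)

log2(x)              # base-2 logarithm
log10(x)             # base-10 logarithm
logb(x, base = 3)    # logarithm in an arbitrary base, here 3
log(x)               # natural logarithm (base e)

# exp() is the inverse of the natural logarithm
back <- exp(log(x))
back
```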

Exercise 6: Using the clean smokers data take the $\log_2$ of the data and compare the density distribution of the data with and without the log transformation. Plot the densities on the same plot using the command lines(). Explain in a markdown cell what you can conclude.

An important plot used in bioinformatics is the MA plot. It helps to identify differences between conditions, and outliers are easily detected with this way of visualising the data. The MA plot is a plot of the following quantities:

$$M = \log_2(x) - \log_2(y)$$

$$A = \frac{1}{2}\left(\log_2(x) + \log_2(y)\right)$$

where M represents the log ratio (also known as the log fold change) and A the overall signal. The MA plot is built by plotting M against A, with A on the horizontal axis. This plot, as well as the volcano plot, is available in the limma package under Bioconductor. We will explore them with built-in functions in Week 10.
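
The two quantities can be computed directly from a pair of measurement vectors. The sketch below uses simulated positive intensities, so x and y, the seed and the distribution parameters are all assumptions for illustration rather than real data, and it follows the usual convention of putting A on the horizontal axis.

```r
# Simulated positive intensities for two conditions (made up for illustration)
set.seed(1)
x <- rlnorm(200, meanlog = 5, sdlog = 1)
y <- x * rlnorm(200, meanlog = 0, sdlog = 0.3)

M <- log2(x) - log2(y)            # log2 ratio between the two conditions
A <- 0.5 * (log2(x) + log2(y))    # average log2 signal

plot(A, M, pch = 20, xlab = "A (average log2 signal)", ylab = "M (log2 ratio)")
abline(h = 0, lty = 2)            # M = 0 means equal signal in both conditions
```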

Exercise 7: Build an MA plot of the smokers data using the above formulae. Enrich the plot with labels and colors. In a markdown cell explain why M represents the fold change between samples and A the overall signal.

For Loops and conditional statements

Looping is an important part of a programming script. It allows you to repeat a set of instructions many times. There are many different types of looping structures. In this module we will use for and while loops and if as conditional statements.

For loops

For loops are a very simple repeating structure: they repeat a set of instructions once for each value taken by the counter i. For example:

for(i in 1:10){
	print("Hello world!")
	print(i*i)
}

The set of instructions is grouped in curly brackets { }.
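
Printing is the simplest use of a loop, but often we want to keep the results. A minimal sketch, with the illustrative names n and squares, stores the value computed on each pass in a vector created before the loop starts:

```r
n <- 5
squares <- numeric(n)        # pre-allocate a vector to hold the results

for (i in 1:n) {
  squares[i] <- i * i        # store the result of pass i
}

squares
```

After the loop, squares holds 1, 4, 9, 16 and 25.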

We use for loops to do any sort of repeated operation, not just printing or plotting. They are, however, particularly useful when we need to create plots with many subplots. For example:


# Two simulated samples of 100 normal values with different means
mydata<-rnorm(100,mean=0,sd=1)
mydata2<-rnorm(100,mean=2,sd=1)

# 2 x 2 grid of panels; each pass of the loop fills one row
par(mfrow=c(2, 2))
for(i in 1:2){
    hist(mydata,col='red', prob=TRUE)
    lines(density(mydata), lty=i, col="blue")   # density curve drawn with line type i
    hist(mydata2,col='red',prob=TRUE)
    lines(density(mydata2), lty=i, col="blue")
    #title(main=paste("lty = ",i), col.main="blue")
}