Authors: James Glasbrenner, Helena Gray

CDS-102: Lab 7 Report

"Tidying Data"

Helena Gray
March 8, 2017

Link to workbook: Lab 7 Workbook


Lab Task 1

Load the "tidyverse" package, import the dataset using read_csv, and display the imported data using the print function. Observe that each category of data is compatible with the entries in its column.

# Run this code block to load the Tidyverse package
.libPaths(new = "~/Rlibs")
install.packages("GGally", lib = "~/Rlibs")
library(tidyverse)

# Import the dataset and display it
original_data <- read_csv("Brauer2008_Dataset1.csv")
print(original_data)

Split the values in the NAME column so that each value (separated by ||) is in a separate column. Use the separate() function to do this. Name the columns “name”, “BP”, “MF”, “systematic_name”, and “number”. The separator expression you should use is "\\|\\|" (the regular expression \|\|, with each backslash doubled inside an R string).
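The split described above can be sketched on a toy data frame. The single row below is a made-up placeholder in the same ||-separated layout, not a row read from Brauer2008_Dataset1.csv, and the names toy and toy_split are hypothetical.

```r
library(tidyr)

# Hypothetical one-row stand-in for the NAME column (placeholder values)
toy <- data.frame(
  NAME = "GENE1 || process A || function A || YXX001C || 100",
  stringsAsFactors = FALSE
)

# "\\|\\|" is the R string for the regular expression \|\| (two literal pipes)
toy_split <- separate(
  toy, NAME,
  into = c("name", "BP", "MF", "systematic_name", "number"),
  sep = "\\|\\|"
)

print(toy_split)
```

The pieces still carry the spaces that surrounded each ||, which is why a whitespace-trimming step follows.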


Lab Task 2

Remove the whitespace from the five new columns you created using the function mutate_each(original_data, funs(trimws), name:systematic_name).

original_data_1 <- mutate_each(original_data_1, funs(trimws), name:systematic_name)
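mutate_each() and funs() were deprecated in later dplyr releases. As a minimal sketch, assuming dplyr 1.0.0 or newer, the same trimming can be written with across(). The toy values and the names toy_split and toy_trimmed are hypothetical.

```r
library(dplyr)

# Hypothetical row with the stray spaces left behind by splitting on "||"
toy_split <- data.frame(
  name = "GENE1 ", BP = " process A ", MF = " function A ",
  systematic_name = " YXX001C ", number = " 100",
  stringsAsFactors = FALSE
)

# across() applies trimws() to every column from name through systematic_name;
# the number column lies outside that range and is left unchanged
toy_trimmed <- mutate(toy_split, across(name:systematic_name, trimws))

print(toy_trimmed)
```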

Lab Task 3

The GID, YORF, and GWEIGHT columns aren’t particularly important for any kind of analysis we could do. Remove them from the dataset but leave all the others. Hint: Minus signs - can be used to specify columns to remove.

original_data_1 <- subset(original_data_1, select = -c(GID, YORF, GWEIGHT))

Lab Task 4

The columns G0.05 through U0.3 are actually variables themselves, so they shouldn’t be column names. Use the gather() function to move these variable names into a column. The new column holding the variable names should be called “sample” and the new column holding the values underneath the G0.05 through U0.3 should be called “expression”.
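A small sketch of this reshaping step, using a made-up wide table with only three of the sample columns. The expression values and the names toy_wide and toy_long are hypothetical.

```r
library(tidyr)

# Hypothetical wide table: one row per gene, one column per sample
toy_wide <- data.frame(
  name = c("GENE1", "GENE2"),
  G0.05 = c(-0.24, 0.28),
  G0.1 = c(-0.13, 0.13),
  U0.3 = c(0.35, -0.15),
  stringsAsFactors = FALSE
)

# Move the column names G0.05 through U0.3 into "sample"
# and the values underneath them into "expression"
toy_long <- gather(toy_wide, key = "sample", value = "expression", G0.05:U0.3)

print(toy_long)
```

Each gene now appears once per sample, so the row count becomes the number of genes times the number of gathered columns.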


Lab Task 5

The sample column we just created actually contains two variables, nutrient and rate. Use the separate() function again to split it into those two columns. Use sep = 1 and convert = TRUE as additional arguments.
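The effect of sep = 1 and convert = TRUE can be seen on a toy long-format table. The values and the names toy_long and toy_tidy are hypothetical.

```r
library(tidyr)

# Hypothetical long-format rows with the combined sample label
toy_long <- data.frame(
  name = "GENE1",
  sample = c("G0.05", "N0.1", "U0.3"),
  expression = c(-0.24, 0.02, 0.35),
  stringsAsFactors = FALSE
)

# sep = 1 splits after the first character, so "G0.05" becomes "G" and "0.05";
# convert = TRUE turns the rate strings into numbers
toy_tidy <- separate(
  toy_long, sample,
  into = c("nutrient", "rate"),
  sep = 1, convert = TRUE
)

print(toy_tidy)
```

Without convert = TRUE the rate column would stay as character strings, which would not behave as a continuous axis in a later plot.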


Lab Task 6

Take your newly tidied dataset and apply a filter() to only look at the rows with a name of “LEU1”. Then, create a line plot using ggplot2. The plot should have rate on the horizontal, expression on the vertical, and the separate nutrients plotted as different colors.

leucine <- filter(original_data_1, name == "LEU1")
enzyme.plot <- ggplot(data = leucine, aes(rate, expression, colour = nutrient)) + geom_line(lwd = 1.5)
ggsave("enzymePlot.png", plot = enzyme.plot, device = "png", scale = 1, width = 5, height = 4)

Summary of Results

As a final result of tidying the dataset, a plot is created that shows the expression of the LEU1 gene against rate, with a separate colored line for each nutrient.

Enzyme Plot

Key Questions

What has happened to the dataset after going through the tidying procedure? What are its dimensions like (its number of rows vs. columns) before and after tidying? Do you find it easier to read the data in this format? Why or why not?

The data originally had 5537 rows and 40 columns. After tidying, it has 199332 rows and 8 columns. Despite having many more rows, the data is much easier to read: the way it is organized makes it easier to find specific information by category.
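These dimensions can be checked with a little arithmetic. The sketch below assumes that the 40 original columns are GID, YORF, NAME, and GWEIGHT plus the sample columns G0.05 through U0.3. That breakdown is inferred from the task descriptions and is not stated directly in the report.

```r
rows_before <- 5537
cols_before <- 40

# Assumption: four non-sample columns (GID, YORF, NAME, GWEIGHT)
id_cols <- 4
sample_cols <- cols_before - id_cols

# gather() produces one row per gene per sample column
rows_after <- rows_before * sample_cols

# Five columns split out of NAME, plus nutrient, rate, and expression
cols_after <- 5 + 3

print(c(rows = rows_after, cols = cols_after))
```

Under that assumption, the result matches the 199332 rows and 8 columns reported above.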

How did tidying the data facilitate the use of ggplot2 in Lab Task 6? Would making the same plot be just as easy with the original version of the dataset? Why or why not?

In the original dataset, the nutrient and rate were combined and used as column names, with the expression for each nutrient and rate stored as the values in those columns. The 'name' column also had the name, biological role, and molecular function all lumped together, which was not very useful for organizing the data. These traits would have made it impossible to build the plot directly with ggplot, as there would have been no clear, single columns of values to map against each other.