Link to workbook: Lab 7 Workbook
Load the "tidyverse" package, import the dataset using read_csv, and display the imported data using the print function. Observe that each column's data type is compatible with the entries in that column.
# Run this code block to load the Tidyverse package
.libPaths(new = "~/Rlibs")
install.packages("GGally", lib = "~/Rlibs")
library(tidyverse)
library(dplyr)
library("GGally")

# Import the dataset
original_data <- read_csv("Brauer2008_Dataset1.csv")

# Display the imported data
print(original_data)
Split the values in the NAME column so that each value (separated by ||) is in a separate column. Use the separate() function to do this. Name the columns “name”, “BP”, “MF”, “systematic_name”, and “number”. The separator expression you should use is "\\|\\|" (the regular expression \|\|, with each backslash doubled inside an R string).
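As a rough sketch of this step, the code below runs separate() on a small made-up tibble. The two NAME values and the object names toy and toy_split are invented for illustration and are not taken from the dataset.

```r
library(tidyr)
library(tibble)

# Toy stand-in for the NAME column; these values are invented for illustration
toy <- tibble(
  NAME = c(
    "GENE1 || process one || function one || YAL001C || 111",
    "GENE2 || process two || function two || YBR002W || 222"
  ),
  G0.05 = c(-0.24, 0.28)
)

# "\\|\\|" is the R string for the regex \|\| (two literal pipe characters)
toy_split <- separate(
  toy, NAME,
  into = c("name", "BP", "MF", "systematic_name", "number"),
  sep = "\\|\\|"
)

# The new columns still carry the spaces that surrounded each "||"
print(toy_split)
```

In the workbook, the same call would presumably be applied to original_data and its result stored in original_data_1, the name used in the later steps.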
Remove the whitespace from the five new columns you created using the function mutate_each(original_data, funs(trimws), name:systematic_name).
original_data_1<-mutate_each(original_data_1, funs(trimws), name:systematic_name)
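mutate_each() and funs() are deprecated in recent dplyr releases. If they are unavailable, one possible equivalent uses mutate() with across(), assuming dplyr 1.0.0 or later. The sketch below uses an invented toy tibble rather than the real dataset.

```r
library(dplyr)
library(tibble)

# Toy columns with stray whitespace; values are invented for illustration
toy <- tibble(
  name = c("GENE1 ", "GENE2 "),
  BP = c(" process one ", " process two "),
  MF = c(" function one ", " function two "),
  systematic_name = c(" YAL001C ", " YBR002W "),
  number = c(" 111", " 222")
)

# across() applies trimws to every column in the range name:systematic_name
toy_trimmed <- mutate(toy, across(name:systematic_name, trimws))

# number lies outside the selected range, so its leading space is untouched
print(toy_trimmed)
```

Note that the range name:systematic_name covers four of the five new columns; extending it to name:number would also trim the number column.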
The GID, YORF, and GWEIGHT columns aren’t particularly important for any kind of analysis we could do. Remove them from the dataset but leave all the others. Hint: Minus signs - can be used to specify columns to remove.
original_data_1 <- subset(original_data_1, select = -c(GID, YORF,GWEIGHT))
The columns G0.05 through U0.3 are actually variables themselves, so they shouldn’t be column names. Use the gather() function to move these variable names into a column. The new column holding the variable names should be called “sample” and the new column holding the values underneath the G0.05 through U0.3 should be called “expression”.
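A minimal sketch of the gather() call on a toy tibble follows. Only three of the sample columns are included, and the gene names and expression values are made up for illustration.

```r
library(tidyr)
library(tibble)

# Toy wide-format data: one row per gene, one column per sample
toy <- tibble(
  name = c("GENE1", "GENE2"),
  G0.05 = c(-0.24, 0.28),
  G0.1 = c(-0.13, 0.13),
  U0.3 = c(0.09, -0.20)
)

# Move the column names G0.05 through U0.3 into "sample"
# and the values underneath them into "expression"
toy_long <- gather(toy, key = "sample", value = "expression", G0.05:U0.3)

# One row per gene and sample combination
print(toy_long)
```

The tidyr documentation recommends pivot_longer() for new code; gather() is kept here because it is the function the exercise names.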
The sample column we just created actually contains two variables, nutrient and rate. Use the separate() function again to split it into those two columns. Use sep = 1 and convert = TRUE as additional arguments.
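The effect of sep = 1 and convert = TRUE can be sketched on a toy tibble (values invented for illustration). A numeric sep splits at a character position, so sep = 1 cuts after the first character, and convert = TRUE turns the remaining text into a number.

```r
library(tidyr)
library(tibble)

# Toy long-format data; values are invented for illustration
toy_long <- tibble(
  name = "GENE1",
  sample = c("G0.05", "N0.1", "U0.3"),
  expression = c(-0.24, 0.02, 0.09)
)

# sep = 1 splits after the first character: "G0.05" becomes "G" and "0.05"
# convert = TRUE then converts the rate column from character to numeric
toy_tidy <- separate(
  toy_long, sample,
  into = c("nutrient", "rate"),
  sep = 1,
  convert = TRUE
)

print(toy_tidy)
```

Having rate stored as a number rather than text is what lets ggplot2 treat it as a continuous axis in the plotting step.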
Take your newly tidied dataset and apply a filter() to only look at the rows with a name of “LEU1”. Then, create a line plot using ggplot2. The plot should have rate on the horizontal axis, expression on the vertical axis, and the separate nutrients plotted as different colors.
leucine <- filter(original_data_1, name == "LEU1")
enzyme.plot <- ggplot(data = leucine, aes(rate, expression, colour = nutrient)) + geom_line(lwd = 1.5)
enzyme.plot
ggsave("enzymePlot.png", plot = enzyme.plot, device = "png", scale = 1, width = 5, height = 4)
enzyme.plot
The final result of tidying the dataset is a plot of LEU1 expression against rate, with a separate line for each nutrient.
The data originally had 5537 rows and 40 columns. After tidying, it has 199332 rows and 8 columns: each of the 5537 original rows now appears once for each of the 36 sample columns (5537 × 36 = 199332). Despite having many more rows, the tidied data is much easier to read, and the way it is organized makes it easier to find specific information by category.
In the original dataset, nutrient and rate were combined and used as column names, with the expression for each nutrient and rate combination stored as the values in those columns. The NAME column also had the name, biological process, and molecular function all lumped together, which was not very useful for organizing the data. These traits would have made the data impractical to plot with ggplot, as there were no clear, single columns of values to map against each other.