Authors: bassil alomari, James Glasbrenner
Description: Jupyter notebook CDS-102/Lab Week 07 - Tidying your dataset/CDS-102 Lab Week 07 Workbook.ipynb

CDS-102: Lab 7 Workbook

Name:

March 9, 2017

In [1]:
# Run this code block to load the Tidyverse package
.libPaths(new = "~/Rlibs")
library(tidyverse)

In [2]:
# Import the dataset

Parsed with column specification:
cols(
  .default = col_double(),
  GID = col_character(),
  YORF = col_character(),
  NAME = col_character(),
  GWEIGHT = col_integer()
)
See spec(...) for full column specifications.
In [3]:
original_data

GID YORF NAME GWEIGHT G0.05 G0.1 G0.15 G0.2 G0.25 G0.3 L0.15 L0.2 L0.25 L0.3 U0.05 U0.1 U0.15 U0.2 U0.25 U0.3
GENE1331X A_06_P5820 SFB2 || ER to Golgi transport || molecular function unknown || YNL049C || 1082129 1 -0.24 -0.13 -0.21 -0.15 -0.05 -0.05 0.13 0.20 0.17 0.11 -0.06 -0.26 -0.05 -0.28 -0.19 0.09
GENE4924X A_06_P5866 || biological process unknown || molecular function unknown || YNL095C || 1086222 1 0.28 0.13 -0.40 -0.48 -0.11 0.17 0.02 0.04 0.03 0.01 -1.02 -0.91 -0.59 -0.61 -0.17 0.18
GENE4690X A_06_P1834 QRI7 || proteolysis and peptidolysis || metalloendopeptidase activity || YDL104C || 1085955 1 -0.02 -0.27 -0.27 -0.02 0.24 0.25 -0.07 -0.05 -0.13 -0.04 -0.91 -0.94 -0.42 -0.36 -0.49 -0.47
GENE1177X A_06_P4928 CFT2 || mRNA polyadenylylation* || RNA binding || YLR115W || 1081958 1 -0.33 -0.41 -0.24 -0.03 -0.03 0.00 -0.05 0.02 0.00 0.08 -0.53 -0.51 -0.26 0.05 -0.14 -0.01
GENE511X A_06_P5620 SSO2 || vesicle fusion* || t-SNARE activity || YMR183C || 1081214 1 0.05 0.02 0.40 0.34 -0.13 -0.14 0.00 -0.11 0.04 0.01 -0.45 -0.09 -0.13 0.02 -0.09 -0.03
GENE2133X A_06_P5307 PSP2 || biological process unknown || molecular function unknown || YML017W || 1083036 1 -0.69 -0.03 0.23 0.20 0.00 -0.27 0.25 -0.21 0.12 -0.11 NA -0.65 0.09 0.06 -0.07 -0.10
GENE1002X A_06_P6258 RIB2 || riboflavin biosynthesis || pseudouridylate synthase activity* || YOL066C || 1081766 1 -0.55 -0.30 -0.12 -0.03 -0.16 -0.11 0.27 0.24 0.05 0.19 0.07 -0.31 -0.08 0.12 0.05 0.06
GENE5478X A_06_P7082 VMA13 || vacuolar acidification || hydrogen-transporting ATPase activity, rotational mechanism || YPR036W || 1086860 1 -0.75 -0.12 -0.07 0.02 -0.32 -0.41 0.15 0.15 0.00 0.03 -0.40 -0.02 0.26 0.31 0.14 0.11
GENE2065X A_06_P2554 EDC3 || deadenylylation-independent decapping || molecular function unknown || YEL015W || 1082963 1 -0.24 -0.22 0.14 0.06 0.00 -0.13 0.17 0.07 0.10 0.11 0.01 -0.16 0.07 0.20 0.02 0.10
GENE2440X A_06_P6431 VPS5 || protein retention in Golgi* || protein transporter activity || YOR069W || 1083389 1 -0.16 -0.38 0.05 0.14 -0.04 -0.01 0.11 0.00 0.02 0.09 -0.26 -0.13 -0.10 0.07 -0.04 -0.12
GENE4180X A_06_P6220 || biological process unknown || molecular function unknown || YOL029C || 1085380 1 -0.22 -0.18 0.27 0.18 0.03 -0.04 -0.04 -0.13 -0.08 0.10 -0.02 0.04 0.16 0.02 -0.03 -0.22
GENE5247X A_06_P1410 AMN1 || negative regulation of exit from mitosis* || protein binding || YBR158W || 1086594 1 0.18 0.61 1.55 1.34 0.23 -0.03 -0.35 -0.27 -0.07 -0.11 -1.15 0.41 0.28 0.00 0.17 -0.01
GENE2121X A_06_P2983 SCW11 || cytokinesis, completion of separation || glucan 1,3-beta-glucosidase activity || YGL028C || 1083024 1 -0.67 -0.47 1.16 1.05 -0.18 -0.68 -0.11 0.01 -0.27 -0.51 -1.48 -0.43 -0.27 -0.32 -0.24 -0.15
GENE1985X A_06_P3720 DSE2 || cell wall organization and biogenesis* || glucan 1,3-beta-glucosidase activity || YHR143W || 1082870 1 -0.59 -0.17 1.17 0.85 -0.12 -0.61 -0.39 -0.42 -0.48 -0.65 -1.24 0.41 0.18 0.09 0.13 -0.04
GENE4728X A_06_P2774 COX15 || cytochrome c oxidase complex assembly* || oxidoreductase activity, acting on NADH or NADPH, heme protein as acceptor || YER141W || 1085995 1 -0.28 -0.81 -0.39 0.24 0.01 0.01 -0.18 -0.02 0.15 -0.18 -1.91 -0.31 0.09 -0.24 -0.03 0.19
GENE3153X A_06_P4597 SPE1 || pantothenate biosynthesis* || ornithine decarboxylase activity || YKL184W || 1084207 1 -0.19 0.24 0.03 0.17 0.00 -0.01 0.04 -0.07 0.06 -0.20 -1.53 -0.43 -0.46 -0.73 -0.48 -0.25
GENE3704X A_06_P5667 MTF1 || transcription from mitochondrial promoter || S-adenosylmethionine-dependent methyltransferase activity* || YMR228W || 1084832 1 -0.42 -0.43 -0.36 -0.12 0.05 0.24 -0.07 -0.14 -0.03 -0.04 -0.62 -0.53 -0.30 -0.17 -0.44 -0.35
GENE2141X A_06_P3260 KSS1 || invasive growth (sensu Saccharomyces)* || MAP kinase activity || YGR040W || 1083046 1 -0.76 -0.32 -0.05 -0.27 -0.31 -0.01 0.15 0.06 0.20 -0.11 -0.80 -0.18 -0.05 -0.26 -0.58 -0.18
GENE2978X A_06_P3607 || biological process unknown || molecular function unknown || YHR036W || 1084002 1 -0.91 -0.43 -0.05 -0.09 -0.27 -0.45 -0.08 -0.12 -0.13 -0.05 -1.04 -0.59 -0.47 -0.29 -0.33 -0.20
GENE1203X A_06_P5929 || biological process unknown || molecular function unknown || YNL158W || 1081987 1 -0.47 -0.43 -0.15 0.08 -0.26 -0.25 0.02 -0.17 -0.30 -0.41 -0.67 -0.01 -0.20 -0.36 -0.30 -0.04
GENE3214X A_06_P6219 YAP7 || positive regulation of transcription from RNA polymerase II promoter || RNA polymerase II transcription factor activity || YOL028C || 1084281 1 -0.51 -0.04 0.06 0.26 -0.19 -0.22 -0.16 -0.13 -0.13 -0.03 -1.09 -0.26 0.02 -0.09 -0.43 -0.21
GENE443X A_06_P1322 || proteolysis and peptidolysis || metalloendopeptidase activity || YBR074W || 1081132 1 -1.01 -0.55 -0.72 -0.54 -0.55 -0.19 -0.08 -0.25 0.12 -0.09 NA -0.23 -0.32 -0.49 0.01 0.24
GENE1570X A_06_P6449 YVC1 || cation homeostasis || calcium channel activity* || YOR087W || 1082401 1 -0.40 -0.14 -0.06 0.00 -0.22 -0.07 -0.10 -0.15 0.03 0.09 NA -0.39 -0.30 -0.28 -0.01 0.14
GENE4434X A_06_P2356 CDC40 || nuclear mRNA splicing, via spliceosome* || RNA splicing factor activity, transesterification mechanism* || YDR364C || 1085655 1 -0.19 -0.08 -0.16 -0.10 -0.12 -0.10 -0.38 -0.19 -0.01 0.01 -1.07 -0.01 0.05 -0.10 -0.04 0.12
GENE2486X A_06_P6921 || biological process unknown || molecular function unknown || YPL162C || 1083440 1 -0.10 -0.02 -0.37 -0.09 -0.14 0.10 -0.22 -0.04 -0.10 0.05 -1.19 0.15 -0.09 -0.17 0.05 0.12
GENE2099X A_06_P1729 RMD1 || biological process unknown || molecular function unknown || YDL001W || 1083001 1 -0.22 -0.03 -0.26 -0.19 -0.10 0.15 -0.21 -0.16 -0.10 -0.03 -0.76 -0.03 -0.13 -0.20 -0.13 -0.03
GENE5137X A_06_P2688 PCL6 || regulation of glycogen biosynthesis* || cyclin-dependent protein kinase regulator activity || YER059W || 1086466 1 -0.25 -0.35 0.04 0.15 0.02 0.09 0.12 0.02 0.04 0.13 -1.10 -0.19 -0.12 0.08 0.00 0.06
GENE2691X A_06_P1007 AI4 || RNA splicing* || endonuclease activity || Q0065 || 1083679 1 -0.36 -0.39 0.42 -0.09 0.00 0.12 0.49 0.22 -0.09 -0.80 -1.59 -1.24 -0.65 -0.62 -0.61 -0.81
GENE2673X A_06_P1933 GGC1 || mitochondrial genome maintenance* || guanine nucleotide transporter activity || YDL198C || 1083659 1 0.76 0.33 -0.21 -0.16 0.00 0.09 -0.15 -0.08 -0.72 -1.25 -2.31 -1.70 -1.37 -1.24 -0.73 -0.83
GENE3094X A_06_P1548 SUL1 || sulfate transport || sulfate transporter activity || YBR294W || 1084134 1 -0.32 -0.54 -0.37 -0.64 -0.09 -0.10 -0.56 -0.77 -0.95 -1.32 -5.55 -4.59 -3.34 -1.98 -3.09 -1.79
GENE5335X A_06_P5817 || biological process unknown || molecular function unknown || YNL046W || 1086698 1 0.09 0.23 0.13 0.27 0.11 -0.08 0.02 0.25 0.10 -0.03 -0.13 0.56 0.39 0.38 0.40 0.36
GENE3931X A_06_P4860 RPS0B || protein biosynthesis* || structural constituent of ribosome || YLR048W || 1085091 1 -0.12 -0.42 -0.12 0.09 0.01 0.08 0.24 0.32 0.16 0.12 0.32 0.41 0.03 -0.12 0.16 0.30
GENE2273X A_06_P3220 COS12 || biological process unknown || molecular function unknown || YGL263W || 1083201 1 -1.30 0.09 0.29 -0.03 -0.14 -0.20 0.77 1.07 0.95 1.17 0.51 0.50 1.03 1.64 1.27 1.33
GENE1180X A_06_P6178 || biological process unknown || molecular function unknown || YNR065C || 1081961 1 -0.16 -0.02 0.15 0.17 -0.21 -0.12 0.13 -0.06 -0.21 -0.11 0.22 0.51 0.59 0.14 0.38 0.34
GENE4771X A_06_P2485 IZH1 || lipid metabolism* || metal ion binding || YDR492W || 1086044 1 -0.10 -0.13 0.28 0.46 0.23 -0.25 -0.47 -0.62 -0.60 -0.51 0.39 0.35 0.31 -0.07 0.56 0.44
GENE321X A_06_P7110 || || || YPR064W || 1080987 1 0.13 0.17 0.53 1.21 0.38 -0.02 -0.28 -0.11 -0.31 -0.37 1.21 1.54 0.93 0.50 0.52 0.27
GENE236X A_06_P6294 IZH4 || lipid metabolism* || metal ion binding || YOL101C || 1080893 1 0.64 0.44 1.31 1.35 0.97 0.26 -0.13 -0.73 -0.46 -0.42 3.36 2.87 2.25 2.00 2.32 1.15
GENE2516X A_06_P2042 PST1 || cell wall organization and biogenesis || molecular function unknown || YDR055W || 1083475 1 0.62 0.25 1.12 1.05 0.13 -0.12 0.28 0.15 -0.05 -0.29 1.95 2.00 1.40 0.50 0.81 0.64
GENE1687X A_06_P4130 PRM10 || conjugation with cellular fusion || molecular function unknown || YJL108C || 1082535 1 0.63 0.59 0.73 0.65 0.23 0.29 -0.24 -0.40 -0.42 -0.08 0.54 1.26 0.77 0.61 0.82 0.54
GENE5522X A_06_P4129 || biological process unknown || molecular function unknown || YJL107C || 1086909 1 0.65 0.62 0.71 0.71 0.04 0.10 -0.29 -0.56 -0.73 -0.21 NA 1.06 0.58 0.39 0.53 0.48
GENE2461X A_06_P1902 SFA1 || formaldehyde catabolism || alcohol dehydrogenase activity* || YDL168W || 1083412 1 0.81 0.30 0.16 0.36 0.38 0.37 -0.07 -0.01 -0.09 0.01 0.12 0.53 0.41 0.02 0.40 0.32
GENE5154X A_06_P3834 CAP2 || filamentous growth* || actin filament binding || YIL034C || 1086485 1 0.10 -0.33 0.05 0.25 0.09 -0.05 0.24 0.12 -0.01 -0.05 0.09 0.76 0.35 0.04 0.20 0.14
GENE2896X A_06_P5553 || biological process unknown || molecular function unknown || YMR122W-A || 1083907 1 0.01 -0.22 0.67 0.72 0.36 0.04 -0.19 0.10 0.02 0.05 -0.06 0.87 0.88 0.28 0.49 0.49
GENE4037X A_06_P4180 CIS3 || cell wall organization and biogenesis || structural constituent of cell wall || YJL158C || 1085216 1 -0.01 -0.21 0.20 0.60 0.48 0.18 -0.02 -0.06 0.07 0.03 0.49 1.47 1.13 0.04 0.69 0.32
GENE674X A_06_P7057 || || || YPR012W || 1081403 1 -0.23 0.05 0.25 0.28 0.22 0.27 -0.14 0.02 0.13 0.04 0.32 0.14 -0.01 0.17 0.02 0.27
GENE3957X A_06_P6469 RGS2 || G-protein signaling, coupled to cAMP nucleotide second messenger || GTPase activator activity || YOR107W || 1085121 1 0.00 0.01 -0.13 0.02 -0.07 -0.32 -0.09 -0.32 -0.30 -0.29 0.45 0.24 0.18 -0.24 -0.26 -0.08
GENE2250X A_06_P7164 || biological process unknown || molecular function unknown || YPR117W || 1083173 1 -0.22 -0.51 0.28 0.32 0.00 -0.38 -0.49 -0.57 -0.53 -0.48 0.79 1.04 0.32 -0.14 0.29 -0.16
GENE785X A_06_P7198 || || || YPR150W || 1081527 1 -0.10 -0.24 -0.06 -0.04 -0.37 -0.67 -0.21 -0.16 -0.17 -0.52 -0.13 0.67 0.39 -0.11 0.33 0.09
GENE4483X A_06_P1283 CSG2 || calcium ion homeostasis* || enzyme regulator activity || YBR036C || 1085710 1 0.06 -0.22 -0.09 0.15 -0.12 -0.27 0.15 0.16 0.03 -0.15 0.11 0.26 -0.04 -0.24 -0.05 -0.01
GENE491X A_06_P3540 SPO11 || meiotic DNA double-strand break formation || endodeoxyribonuclease activity, producing 3'-phosphomonoesters || YHL022C || 1081188 1 -0.83 0.21 0.04 -0.19 -0.41 -0.15 0.04 0.27 -0.17 -0.23 NA 0.07 0.33 0.19 0.01 -0.34
GENE4050X A_06_P2650 CHO1 || phosphatidylserine biosynthesis || CDP-diacylglycerol-serine O-phosphatidyltransferase activity || YER026C || 1085231 1 -0.74 -0.63 -0.07 0.11 -0.18 -0.31 0.26 0.11 0.04 -0.16 -0.28 0.81 -0.10 -0.03 -0.13 -0.20
GENE17X A_06_P6055 WSC2 || cell wall organization and biogenesis* || transmembrane receptor activity || YNL283C || 1080641 1 -0.62 -0.26 -0.19 -0.03 -0.09 -0.27 -0.33 -0.21 -0.16 -0.20 0.67 0.78 0.08 -0.34 -0.09 -0.21
GENE4426X A_06_P6690 MYO2 || endocytosis* || microfilament motor activity || YOR326W || 1085645 1 -0.67 -0.38 -0.12 -0.05 -0.15 -0.24 -0.12 -0.24 -0.15 -0.24 0.11 0.69 0.36 -0.12 0.08 -0.04
GENE1274X A_06_P6825 || biological process unknown || molecular function unknown || YPL066W || 1082066 1 -0.14 0.06 0.78 0.81 0.30 0.14 0.28 0.13 0.30 0.06 -0.75 1.91 0.47 0.25 0.37 0.15
GENE410X A_06_P4625 DOA1 || ubiquitin-dependent protein catabolism* || molecular function unknown || YKL213C || 1081094 1 0.12 -0.07 0.14 0.29 0.17 0.09 0.19 0.17 0.03 0.02 0.16 0.28 0.36 0.22 0.29 0.14
GENE2833X A_06_P6094 KRE1 || cell wall organization and biogenesis || structural constituent of cell wall || YNL322C || 1083836 1 0.41 -0.28 0.30 0.50 -0.05 -0.08 0.38 0.23 0.21 0.15 0.32 0.62 0.54 0.01 0.56 0.28
GENE271X A_06_P3243 MTL1 || cell wall organization and biogenesis || molecular function unknown || YGR023W || 1080930 1 0.50 -0.12 0.25 0.24 0.13 0.02 0.25 -0.02 -0.06 -0.10 NA 0.50 0.29 -0.14 0.47 0.27
GENE1691X A_06_P4196 KRE9 || cell wall organization and biogenesis* || molecular function unknown || YJL174W || 1082539 1 0.15 0.09 0.21 0.46 0.19 -0.02 0.37 0.21 0.16 -0.01 -0.68 0.63 0.41 0.09 0.48 0.43
GENE1755X A_06_P4680 UTH1 || mitochondrion organization and biogenesis* || molecular function unknown || YKR042W || 1082610 1 0.63 0.38 0.05 0.12 0.13 -0.01 -0.07 0.02 0.24 0.18 -0.89 0.19 0.03 0.04 0.13 0.19
GENE4255X A_06_P6304 || biological process unknown || molecular function unknown || YOL111C || 1085465 1 0.18 0.05 0.11 0.09 -0.02 0.03 -0.07 -0.08 -0.02 -0.06 0.03 0.14 0.00 -0.21 0.07 0.04
In [17]:
# Split NAME on the literal "||" delimiter (escaped because sep is a regular expression)
cleaning_data_step1 <- separate(original_data, NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|")

In [18]:
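# Trim the whitespace left around each piece by the split (newer dplyr: mutate(across(name:systematic_name, trimws)))
cleaning_data_step2 <- mutate_each(cleaning_data_step1, funs(trimws), name:systematic_name)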
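
The need for the trimming step is easier to see on a small example. The sketch below uses a hypothetical two-row data frame (`toy`, not part of the lab dataset) whose `NAME` values imitate the `||`-delimited format shown above. It assumes only that `tidyr` and `dplyr` are installed, and it trims with plain `mutate()` calls rather than `mutate_each()`.

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for the NAME column of original_data
toy <- data.frame(
  NAME = c("SFB2 || ER to Golgi transport",
           " || biological process unknown"),
  stringsAsFactors = FALSE
)

# Split on the literal "||"; the pipes are escaped because sep is a regex
toy_split <- separate(toy, NAME, c("name", "BP"), sep = "\\|\\|")

# The pieces still carry the spaces that surrounded each "||"
toy_split$name

# trimws() strips leading and trailing whitespace from each value
toy_trimmed <- mutate(toy_split, name = trimws(name), BP = trimws(BP))
toy_trimmed$name
```

Without the trimming step, a later comparison such as `name == "LEU1"` would not match values that still carry a trailing space.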

In [19]:
# Drop the identifier and weight columns that are not needed for the analysis
cleaning_data_step3 <- select(cleaning_data_step2, -number, -GID, -YORF, -GWEIGHT)

In [20]:
# Stack the sample columns into one row per gene and sample
cleaning_data_step4 <- gather(cleaning_data_step3, sample, expression, G0.05:U0.3)

In [23]:
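# Split sample after its first character into nutrient and rate; convert = TRUE makes rate numeric
cleaning_data_step5 <- separate(cleaning_data_step4, sample, c("nutrient", "rate"), sep = 1, convert = TRUE)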
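
The last two steps (reshaping to long form, then splitting the sample label) can be sketched on a small table. The data frame `toy_wide` and its gene names are hypothetical; only the column-name pattern (a nutrient letter followed by a rate) is taken from the dataset above.

```r
library(tidyr)

# Hypothetical wide table: one row per gene, one column per nutrient/rate sample
toy_wide <- data.frame(
  name  = c("geneA", "geneB"),
  G0.05 = c(-0.24, 0.28),
  U0.3  = c(0.09, 0.18),
  stringsAsFactors = FALSE
)

# Stack the sample columns into key/value pairs: one row per gene and sample
toy_long <- gather(toy_wide, sample, expression, G0.05:U0.3)

# sep = 1 splits after the first character, so "G0.05" becomes "G" and "0.05";
# convert = TRUE then turns the rate strings into numbers
toy_tidy <- separate(toy_long, sample, c("nutrient", "rate"), sep = 1, convert = TRUE)
toy_tidy
```

Passing a number to `sep` splits by character position instead of by pattern. That works here because every sample label starts with a single-letter nutrient code.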

In [24]:
glimpse(cleaning_data_step5)

Observations: 199,332
Variables: 7
$ name            <chr> "SFB2", "", "QRI7", "CFT2", "SSO2", "PSP2", "RIB2",...
$ BP              <chr> "ER to Golgi transport", "biological process unknow...
$ MF              <chr> "molecular function unknown", "molecular function u...
$ systematic_name <chr> "YNL049C", "YNL095C", "YDL104C", "YLR115W", "YMR183...
$ nutrient        <chr> "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "...
$ rate            <dbl> 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.0...
$ expression      <dbl> -0.24, 0.28, -0.02, -0.33, 0.05, -0.69, -0.55, -0.7...
In [32]:
# Keep only the rows for the LEU1 gene
data.filtered <- filter(cleaning_data_step5, name == "LEU1")
print(data.filtered)

# A tibble: 36 × 7
    name                   BP                                     MF
   <chr>                <chr>                                  <chr>
1   LEU1 leucine biosynthesis 3-isopropylmalate dehydratase activity
2   LEU1 leucine biosynthesis 3-isopropylmalate dehydratase activity
3   LEU1 leucine biosynthesis 3-isopropylmalate dehydratase activity
4   LEU1 leucine biosynthesis 3-isopropylmalate dehydratase activity
5   LEU1 leucine biosynthesis 3-isopropylmalate dehydratase activity
6   LEU1 leucine biosynthesis 3-isopropylmalate dehydratase activity
7   LEU1 leucine biosynthesis 3-isopropylmalate dehydratase activity
8   LEU1 leucine biosynthesis 3-isopropylmalate dehydratase activity
9   LEU1 leucine biosynthesis 3-isopropylmalate dehydratase activity
10  LEU1 leucine biosynthesis 3-isopropylmalate dehydratase activity
# ... with 26 more rows, and 4 more variables: systematic_name <chr>,
#   nutrient <chr>, rate <dbl>, expression <dbl>
In [33]:
# Plot expression against growth rate, one line per limiting nutrient
ggplot(data = data.filtered) +
  geom_line(mapping = aes(x = rate, y = expression, color = nutrient))

In [ ]: