Contact
CoCalc Logo Icon
StoreFeaturesDocsShareSupport News Sign UpSign In
| Download

Jupyter notebook 03_UNIX-II_assignment/Unix Module II - In Class.ipynb

Views: 183
Kernel: Python 2 (SageMath)

Introduction to Unix II - In Class

For this in class assignment, we are going to download a compressed data file from the web, decompress it, and then do some searching and processing of the file. I have prepared a simple file that contains analysis of how much each chromosome is covered by 6 different histone modification ChIP-seq data sets in 8 different tissues, all from the Roadmaps epigenomics consortium (http://roadmapepigenomics.org). If this doesn't make any sense to you, don't worry, as the contents of this file aren't as important as learning how to manipulate it! The file is located at http://tesla.pcbi.upenn.edu/~alexaml/roadmap_chromosomal_cov.tar.gz. Similarly to the assignment from the first Unix module, just fill in your Unix commands in the cells containing the '$' symbol (don't forget to save when you're done!), and for the written answers, also write in your answers to the appropriate cells.

First, move, as usual, to the directory called move_here:

$ cd move_here

Now, use either wget or curl to download the file (you should keep the same file name as what's on the server):

$ wget http://tesla.pcbi.upenn.edu/~alexaml/roadmap_chromosomal_cov.tar.gz

Next, extract the contents of the file (refer back to the prelab if you need to, and remember that .tar.gz files can be extracted with a single command):

$ tar -xzvf roadmap_chromosomal_cov.tar.gz

Now look at the top of the file to get a sense of what it contains (any of several commands are acceptable for this step):

$ head -8 8tissue_6mark_marks_chrom_coverage.txt

Notice that there are 7 columns. They are fairly self explanatory for the most part, but I will describe what they contain. The 'class' column describes the general categorization of the tissue of interest (e.g. brain, liver, heart, etc). The 'id' column just gives the Roadmap-defined ID for the sample, which we're not concerned with here. The 'type' column gives a more detailed description of the tissue. The 'mark' column describes which kind of histone modification is being analyzed. The 'chr' column describes which chromosome is being looked at. The 'chr_cov' column contains the percentage of the chromosome that is covered by ChIP-seq peaks for this mark in this tissue, while the 'chr_covered_bp' gives the raw number of bases in the chromosome that are covered.

Now let's do some exploration of this data. First, check to make sure that it is tab separated and not space separated by using the cat flags that we learned in the prelab:

$ cat -A 8tissue_6mark_marks_chrom_coverage.txt

Next, let's cut out the 'class' column and list all the unique classes that exist in the data:

$ cut -f1 8tissue_6mark_marks_chrom_coverage.txt|uniq

Now, let's do the same thing on the 'type' column so that we can see how many of the tissue classes actually have more detailed descriptions in this column:

$ cut -f3 8tissue_6mark_marks_chrom_coverage.txt|uniq

So we see here that the 'class' and 'type' columns are pretty much the same, except that the classes are all upper case, and the esophagus and intestine tissues have a 'GI_' prefix for their classes. This has to do with how the data was generated, and doesn't matter for our purposes. Next, let's use a similar approach to find all the different histone marks that are present in this data (the mark column, which is the fourth):

$ cut -f4 8tissue_6mark_marks_chrom_coverage.txt|uniq

Now let's have some more practice with grep. H3K27ac is an important histone modification related to enhancer activity, so let's use grep and the 'wc' command to count how many H3K27ac entries there are:

$ grep "H3K27ac" 8tissue_6mark_marks_chrom_coverage.txt|wc -l

There are a lot of these, so let's use grep to count how many brain samples have H3K27ac entries. You can use a pipe ( | ) to run two grep commands one after another to accomplish this. For example, if we wanted H3K27ac in the liver, we could run:

$ grep "H3K27ac" 8tissue_6mark_marks_chrom_coverage.txt | grep -i "liver" | wc -l

$ grep "H3K27ac" 8tissue_6mark_marks_chrom_coverage.txt | grep -i "brain" | wc -l

Next, let's get some more practice with regular expressions. In the prelab, we only used one type of regular expression, the * wildcard character. We could spend months learning about all the intricacies of regular expressions, but for now let's look at just one other type of regular expression construct. Let's say we want to pull out all the entries for H3K27me3 and H3K36me3, but not any of the other four modifications. We can't just grep for "*me*" because that will also pull out H3K4me1, H3K4me3, and H3K9me3, so we have to use something more specific. In regular expressions, we can specify that we only want specific numbers or characters in a certain part of the query using brackets. Try running the following command to see for yourself:

$ grep "H4K[2,3][6,7]me*" 8tissue_6mark_marks_chrom_coverage.txt | cut -f4 | sort -u

See how this gives us only the marks we're interested in, and notice how we combined the bracket notation with the wildcard notation to get both me1 and me3. Now try running a simpler command for yourself to pull out H3K4me1, H3K4me3, and H3K9me3 and not H3K27me3 and H3K36me3. For this, you only need to use one bracket:

$ grep "H3K[4,9]me*" 8tissue_6mark_marks_chrom_coverage.txt | cut -f4 | sort -u

Then, send the output of the grep command (without the cut and sort commands) to a file called mark_subsets.txt:

$ grep "H3K[4,9]me*" 8tissue_6mark_marks_chrom_coverage.txt > mark_subsets.txt

Now let's practice compression by compressing this file into a tar.gz file, mark_subsets.tar.gz:

$ tar -czvf mark_subsets.tar.gz mark_subsets.txt

Let's also compress this into a zip file, mark_subsets.zip:

$ zip mark_subsets.zip mark_subsets.txt

Homework Exercise (10 Points)

For this assignment, we are going to get some more practice with grep and compression. For this, we will be using a data set from the city of Philadelphia about historic streets. This data set is located at http://data.phl.opendata.arcgis.com/datasets/9409bce14c4e4768a11a8432e80bfa68_0.csv

First, move back to the 'inclass/' directory, make a folder called homework_data/, and move into it:

$ cd ../
$ mkdir homework_data/
$cd homework_data/

Then, use wget with the "-O" flag, or output redirection from curl to download this file into the current directory (homework_data) and name it historic_streets.csv: (1 Point)

$ wget -O historic_streets.csv http://data.phl.opendata.arcgis.com/datasets/9409bce14c4e4768a11a8432e80bfa68_0.csv

Look at the top of this file to see what it looks like:

$ cat -A historic_streets.csv| head

Notice how this is a comma separated file, containing information about historic streets in Philadelphia. Each row contains information about a given street such as the street name, where it comes from, and where it goes.

Give a command to cut out the 5th column of this file (remember that you need to specify that it is comma-delimited), and count how many different, unique streets there are in this column (remember you can pipe results to the wc command to get the number of lines): (1 Point)

$ cut -f5 -d"," historic_streets.csv|uniq|wc -l

(1 Point) How many unique streets are there?

Number of unique streets:286

Now give a command to count how many times each street appears in the 5th column: (1 Point)

$ cut -f5 -d"," historic_streets.csv|uniq -c

Which is the most common street in the 5th column? (1 Point)

Most common street: GERMANTOWN AVE

Now let's use grep to find all the entries that contain 'ERIE AVE', either west or east: (1 Point)

$ grep "ERIE AVE *" historic_streets.csv

Finally, let's look at the very last column, which describes what kind of road each entry is. Give a command to cut out this column (which is the 16th column), find all the unique classes, and count how many entries there are for each class: (1 Point)

$ cut -f16 -d"," historic_streets.csv|uniq -c

Use grep to pull out all the entries that have Red Brick in this column (including the "Red Brick (Molded)" entries), and write this to a file called red_brick_houses.txt. Note that you can search the whole entry for Red Brick, not just specifically the last column (i.e. you don't need to use a cut command first). (1 Point)

$ grep "Red Brick" historic_streets.csv > red_brick_houses.txt

Compress this file into a .zip file called red_brick_houses.zip: (1 Point)

$ zip red_brick_houses.zip red_brick_houses.txt

Also compress it into a .tar.gz file called red_brick_houses.tar.gz: (1 Point)

$ tar -czvf red_brick_houses.tar.gz red_brick_houses.txt

That's it! We hope that these Unix modules have been useful for you to get up to speed with working on the command line!