Jupyter notebook 02_UNIX-1_assignment/Unix Module I - In Class.ipynb

Kernel: Python 2 (SageMath)

Introduction to Unix I - In Class Exercises

For these exercises, we will ask you to write the Unix commands necessary to accomplish certain tasks. Write the appropriate commands in the cells that currently contain only the '$' character.

First, move to the directory called "exercises_move_here" (in the inclass directory):

$ cd exercises_move_here

Then, check what files and directories are present in this directory:

$ ls

Notice that the data is split into three separate directories. Write the command to check the man pages for 'ls' to see how we can recursively list the subdirectories:

$ man ls

Now write the command using this flag to show us what files are in each directory:

$ ls -R
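To see what the recursive flag does without touching the course data, here is a minimal sketch on a throwaway directory. The directory and file names (dir1, dir2, and the two .txt files) are made up for illustration and are not the real exercise layout.

```shell
# Work in a scratch directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical layout: two subdirectories with one file each.
mkdir dir1 dir2
touch dir1/fruit_list1.txt dir2/vegetable_list2.txt

# -R descends into every subdirectory and lists its contents
# under a header line naming that directory.
listing=$(ls -R)
echo "$listing"
```

Each subdirectory gets its own block in the output, which is what makes -R handy for a first look at an unfamiliar data folder.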

Notice how each directory contains some lists of fruits and vegetables. Let's reorganize these files so that the fruit files and the vegetable files each have their own directories. Write the command(s) to make two directories, one called fruit_data and one called vegetable_data. Note that this can be done using a single command or two separate ones.

$ mkdir fruit_data
$ mkdir vegetable_data
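The single-command version mentioned above works because mkdir accepts more than one directory name. A small sketch, run in a scratch directory (the scratch variable is only there for illustration):

```shell
# Work in a scratch directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# mkdir takes several directory names at once.
mkdir fruit_data vegetable_data

# -d lists the directories themselves rather than their contents.
ls -d */
```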

Now, write a command to copy all the fruit data into the fruit_data directory using wildcards. Note that wildcards can be used as part of the path to a file; for example, we can list all the data files from each directory using the command:

$ ls */*.txt

Armed with this knowledge, we can write a command to copy all the fruit data into the fruit_data directory:

$ cp */fruit*.txt fruit_data/

Now write a similar command to copy all the vegetable data into the vegetable_data directory using wildcards:

$ cp */vegetable*.txt vegetable_data
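The two copy commands can be tried end to end on a made-up tree. In this sketch the directory names dir1 to dir3 and the file contents are assumptions; only the fruit*.txt and vegetable*.txt naming pattern comes from the exercise.

```shell
# Work in a scratch directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Hypothetical layout: three data directories, each holding
# one fruit file and one vegetable file.
for i in 1 2 3; do
    mkdir "dir$i"
    echo "apple"  > "dir$i/fruit_list$i.txt"
    echo "carrot" > "dir$i/vegetable_list$i.txt"
done
mkdir fruit_data vegetable_data

# The first * matches each directory name; the second * matches
# the rest of each file name.
cp */fruit*.txt fruit_data/
cp */vegetable*.txt vegetable_data/

ls fruit_data vegetable_data
```

Note that the shell expands the wildcard before cp runs, so cp simply receives a list of matching paths followed by the destination directory.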

Now, move into the fruit_data directory:

$ cd fruit_data

Let's check how many lines are in each of the files. Note that you can do this with three separate commands (just insert new lines starting with '$' if you do this), three arguments to the command, or using wildcards:

$ wc -l fruit_list1.txt fruit_list2.txt fruit_list3.txt
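The wildcard variant can be sketched like this, using small made-up files in a scratch directory. Putting the -l option before the file names is the ordering that works across different Unix systems.

```shell
# Work in a scratch directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Made-up sample files with 2, 1 and 3 lines.
printf 'apple\nbanana\n' > fruit_list1.txt
printf 'cherry\n' > fruit_list2.txt
printf 'date\nelderberry\nfig\n' > fruit_list3.txt

# Option first, then a wildcard that expands to all three files.
# With more than one file, wc also prints a final "total" line.
wc -l fruit_list*.txt
```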

These are pretty long files, so we probably don't want to just "cat" them. Write a command to look at the files using a tool meant to scroll through large files. Remember, less is more.

$ less fruit_list1.txt fruit_list2.txt fruit_list3.txt

Also, try using whichever command you used above with a wildcard argument. You can then scroll through the files by typing ":" followed by "n" to move to the next one, or "p" to move to the previous one. This is very useful for data exploration purposes!

We can see that these files contain lists of many fruits, and are not sorted. However, the lists are spread across three different data files, so first let's write a command to put them into a single file called full_fruit_list.txt:

$ cat fruit_list1.txt fruit_list2.txt fruit_list3.txt > full_fruit_list.txt

Now, write a command to sort this file and pipe that output into one of the tools meant to scroll through large files:

$ sort full_fruit_list.txt | less

So we can see that the entries are not unique. We'd like a sorted, unique list, but first write a command to check how many times each entry was found, after sorting:

$ sort full_fruit_list.txt | uniq -c
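Sorting first matters because uniq only collapses duplicate lines that sit next to each other. The sketch below shows the difference on a tiny made-up list; the file contents and variable names are illustrative only.

```shell
# Work in a scratch directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Made-up list where the two "apple" lines are not adjacent.
printf 'apple\nbanana\napple\n' > full_fruit_list.txt

# Without sorting, uniq sees three separate runs.
unsorted_rows=$(uniq -c full_fruit_list.txt | wc -l | tr -d ' ')

# After sorting, the duplicates are adjacent and get merged.
sorted_rows=$(sort full_fruit_list.txt | uniq -c | wc -l | tr -d ' ')

echo "rows without sorting: $unsorted_rows"
echo "rows with sorting:    $sorted_rows"
```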

Finally, write a command to sort the list, get the unique entries, and write that output to a file called sorted_unique_fruit_list.txt:

$ sort full_fruit_list.txt | uniq > sorted_unique_fruit_list.txt
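As an aside, sort has a -u flag that removes duplicates itself. For plain word lists like these it should give the same result as piping into uniq; the sketch below checks that on a made-up list (the via_pipe.txt and via_flag.txt names are hypothetical).

```shell
# Work in a scratch directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Made-up list with repeated entries.
printf 'banana\napple\nbanana\ncherry\napple\n' > full_fruit_list.txt

# Two routes to a sorted, de-duplicated list.
sort full_fruit_list.txt | uniq > via_pipe.txt
sort -u full_fruit_list.txt > via_flag.txt

# cmp is silent and succeeds when the two files are byte-identical.
cmp via_pipe.txt via_flag.txt && echo "identical"
cat via_flag.txt
```

The pipe version is still worth knowing, since uniq offers options such as -c that sort -u does not.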

However, if you scroll through this list, you may notice that there are some non-fruit items! Write a command to open the file in a text editor, then go through and remove the offending vegetables (remember that tomato is actually a fruit). Finally, save this file with a new name, true_fruit_list.txt. Note that in nano, you can save to a new file using ^O ([Control]+o):

$ nano sorted_unique_fruit_list.txt

Now, move into the vegetable_data directory. Since we are still inside fruit_data, we need to go up one level first:

$ cd ../vegetable_data

Check the number of lines in each file:

$ wc -l *.txt

Combine the three data files into a single file called full_vegetable_list.txt:

$ cat vegetable*.txt > full_vegetable_list.txt

Now sort this list, get the unique entries, and write it to a file called sorted_unique_vegetable_list.txt:

$ sort full_vegetable_list.txt | uniq > sorted_unique_vegetable_list.txt

Open this file and remove any errant fruits that may have shown up:

$ nano sorted_unique_vegetable_list.txt

Finally, move back into the 'inclass/' directory from whence you came:

$ cd ..
$ cd ..

Homework exercise (10 Points)

Now, move to the directory called "assignment_move_here": (1 Point)

$ cd assignment_move_here

Now list the files in this directory:

$ ls

See how there are several Pokémon-related files here, and a directory called output/. Do a listing of this directory to see what is in that directory (if anything):

$ ls output/

So we see that nothing is in the output directory, and so our job will be to fill this in ourselves! First, let's get a sense of what the data looks like. The main file is orig_151_pokemon.txt, so write a command to take a look at that file (1 Point):

$ nano orig_151_pokemon.txt

Notice how, unlike the other files we've been looking at so far, this file contains more than one column. We will learn how to manipulate files like this in the next prelab, but for now we have split this file by column for you into four files: pokemon_names.txt contains the first column (the name of each Pokémon); pokemon_main_types.txt contains the second column (the main type of each Pokémon); pokemon_secondary_types.txt contains the third column (the secondary type of each Pokémon); and pokemon_both_types.txt contains the second and third columns together (the combined main and secondary types).

Now, write a command to sort the pokemon_main_types.txt file and count the number of Pokemon with each unique type (remember you can use the man pages if you forget a certain flag). Finally, send the output of this command to a file called output/type1_counts.txt: (1 Point)

$ sort pokemon_main_types.txt | uniq -c > output/type1_counts.txt

Write a command to print this file: (1 Point)

$ cat output/type1_counts.txt

What's the most common main type for a Pokemon to have? Fill in your answer here: (1 Point)

Most common main type: Water
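Rather than scanning the counts by eye, one option is to sort the output of uniq -c numerically so the largest count comes first. This is a sketch on made-up data: sample_types.txt and its contents are invented for illustration and are NOT the real pokemon_main_types.txt.

```shell
# Work in a scratch directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Made-up sample, not the real contents of pokemon_main_types.txt.
printf 'Water\nFire\nWater\nGrass\nWater\nFire\n' > sample_types.txt

# Count each type, then sort the counts numerically (-n),
# largest first (-r).
sort sample_types.txt | uniq -c | sort -rn > sample_counts.txt
cat sample_counts.txt

# The first line now holds the most common type; awk picks out
# the second field (the type name).
top_type=$(head -n 1 sample_counts.txt | awk '{print $2}')
echo "most common in the sample: $top_type"
```

Ties are not flagged by this approach, so it is still worth looking at the first few lines of the ranked file.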

Now write a similar command for the pokemon_secondary_types.txt file (the secondary type of each Pokemon), and send it to output/type2_counts.txt: (1 Point)

$ sort pokemon_secondary_types.txt | uniq -c > output/type2_counts.txt

How many Pokemon have no secondary type (denoted by 'None' in the type2 column)? Just fill in your answer in the following cell: (1 Point)

Number of Pokemon with no secondary type: 84

What's the most common secondary type, other than None, for a Pokemon? (1 Point)

Most common secondary type: tie between Flying and Poison

Now, let's look at combinations of types. Write a command to sort the pokemon_both_types.txt file, count how many times each unique combination of types appears, and send this to output/combo_type_counts.txt: (1 Point)

$ sort pokemon_both_types.txt | uniq -c > output/combo_type_counts.txt

What's the most common combination of types, where there actually is a secondary type (so ignore entries with 'None' in the second column)? (1 Point)

Most common combination of types: Primary: Grass, Secondary: Poison
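To ignore the 'None' entries automatically, one possible approach is to filter them out with grep -v before counting. The sketch below uses a made-up two-column sample; the space separator and the exact spelling of 'None' at the end of the line are assumptions about the file format, so check the real file before relying on this pattern.

```shell
# Work in a scratch directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Made-up sample; the real column separator is an assumption here.
printf 'Grass Poison\nWater None\nGrass Poison\nFire Flying\nWater None\nWater None\n' > sample_both.txt

# -v keeps only lines that do NOT match; 'None$' matches lines
# ending in None, i.e. those with no secondary type.
grep -v 'None$' sample_both.txt | sort | uniq -c | sort -rn > sample_combo_counts.txt
cat sample_combo_counts.txt

# Fields 2 and 3 of the top line are the two type names.
top_combo=$(head -n 1 sample_combo_counts.txt | awk '{print $2, $3}')
echo "most common combination in the sample: $top_combo"
```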