Kernel: Python 2 (SageMath)

Lesson 4: In-class exercises

Instructions: For each problem, write code in the provided code block. Don't forget to run your code to make sure it works.

1. Simple list and dictionary practice

Using the data below, write code to accomplish the following tasks.

Name	Favorite Food
Wilfred	Steak
Manfred	Duck
Wadsworth	Spaghetti
Jeeves	Ice cream
Mitsworth	Tuna

(A) Make a list of all the names, then loop through the list and print each name out.

In [1]:

names=["Wilfred","Manfred","Wadsworth","Jeeves","Mitsworth"]
for item in names:
    print item

Wilfred
Manfred
Wadsworth
Jeeves
Mitsworth

(B) Below, some of the names and foods have already been added to a dictionary. Fill in the missing entries using the dict[key] = value syntax. Then loop through the dictionary and print each name and food combination in the format:

<NAME>'s favorite food is <FOOD>

In [29]:

favFoods = {"Wilfred":"Steak", "Manfred":"Duck", "Wadsworth":"Spaghetti"}

# add your code below:
favFoods["Jeeves"]="Ice cream"
favFoods["Mitsworth"]="Tuna"
for name in favFoods:
    print name+"'s favorite food is "+favFoods[name]

Manfred's favorite food is Duck
Wadsworth's favorite food is Spaghetti
Wilfred's favorite food is Steak
Mitsworth's favorite food is Tuna
Jeeves's favorite food is Ice cream

(C) In the dictionary from part (B), change Wilfred's favorite food to pizza.

In [30]:

favFoods["Wilfred"]="pizza"
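
The printed order in part (B) differs from the order in which the entries were added, because a Python 2 dictionary does not keep its keys in insertion order. A minimal sketch of one way to get a predictable order is to loop over the sorted keys. The name `fav_foods` and the list `report_lines` are hypothetical, introduced only for this illustration, and the code is written to run under both Python 2 and Python 3.

```python
# Hypothetical copy of the dictionary from part (B), after the change in part (C).
fav_foods = {
    "Wilfred": "pizza",
    "Manfred": "Duck",
    "Wadsworth": "Spaghetti",
    "Jeeves": "Ice cream",
    "Mitsworth": "Tuna",
}

# sorted() returns the keys in alphabetical order, so the output order
# no longer depends on how the dictionary stores its entries internally.
report_lines = []
for name in sorted(fav_foods):
    report_lines.append("%s's favorite food is %s" % (name, fav_foods[name]))

print("\n".join(report_lines))
```

Building the lines with `%s` formatting also makes the spacing around each word easy to see at a glance.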

2. Duplicate removal

Read in the file genes.txt and print only the unique gene IDs (remove the duplicates). Do not assume repeat IDs appear consecutively in the file.

Hint: see the practice exercises from Lesson 4 for an example of how to remove duplicates using a list.

In [48]:

Filename="genes.txt"
inFile=open(Filename,'r')
new_list=[]
for line in inFile:
    gene_id=line.rstrip('\r\n') #remove the line ending so identical IDs compare equal
    if gene_id not in new_list:
        new_list.append(gene_id)
inFile.close()
print new_list

['uc007zzs.1', 'uc009akk.2', 'uc009eyb.1', 'uc008vlv.1', 'uc008wzq.1', 'uc007hnl.1', 'uc008tvu.1', 'uc008vlv.3', 'uc007xgk.1', 'uc009qsh.1', 'uc008all.1', 'uc008eda.1', 'uc007shu.4', 'uc009mor.1', 'uc008fux.1', 'uc007ztg.2', 'uc007nkt.1', 'uc008qul.3', 'uc008ktr.2', 'uc008iwn.1', 'uc009fxp.2', 'uc008vsh.1', 'uc008gkj.2', 'uc007piu.2', 'uc008vsk.1', 'uc008vsv.1', 'uc008kjh.1', 'uc009dri.1', 'uc008vlv.2', 'uc009rxy.1', 'uc008fyq.1', 'uc009act.1', 'uc008lub.1', 'uc007ker.1', 'uc008qiz.2', 'uc008bak.1', 'uc008kcg.1', 'uc009cjg.1', 'uc007vlq.1', 'uc007xog.1', 'uc009avv.1', 'uc008kcg.2', 'uc007kmj.1', 'uc008oaj.1', 'uc007cib.1', 'uc007ket.1', 'uc009rpf.1', 'uc008owo.1', 'uc008jaq.1']
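
The list-based approach checks every new ID against the whole list, which gets slow for large files. As a sketch of an alternative (not the method from the Lesson 4 practice exercises), a set can handle the membership test while a list keeps the order of first appearance. The `lines` list below is a hypothetical stand-in for the contents of genes.txt, and `seen` and `unique_ids` are illustrative names.

```python
# Hypothetical stand-in for the lines of genes.txt (note the repeats are
# not consecutive, and the last line has no trailing newline).
lines = ["uc007zzs.1\n", "uc009akk.2\n", "uc007zzs.1\n", "uc009eyb.1\n", "uc009akk.2"]

seen = set()      # IDs encountered so far (fast membership test)
unique_ids = []   # IDs in order of first appearance

for line in lines:
    gene_id = line.rstrip("\r\n")  # strip the line ending before comparing
    if gene_id not in seen:
        seen.add(gene_id)
        unique_ids.append(gene_id)

for gene_id in unique_ids:
    print(gene_id)
```

Stripping the line ending first matters here: without it, an ID on the last line of a file (no trailing newline) would not match the same ID earlier in the file.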

3. Split practice

Read in the file init_sites.txt and compute the average CDS length (i.e. average the values in the 7th column). Your answer should be 236.36.

In [49]:

Filename="init_sites.txt"
inFile=open(Filename,'r')
inFile.readline() #skip the header line
line_count=0
total=0
for line in inFile:
    line = line.rstrip('\r\n') #remove the trailing line ending
    data = line.split() #split the line into columns on whitespace
    total+=int(data[6]) #7th column (index 6), converted from string to integer
    line_count=line_count+1 #count the data lines
inFile.close()

print "%.2f" % (float(total)/line_count) #float() avoids Python 2 integer division

236.36
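
In Python 2, dividing one integer by another with `/` discards the fractional part, which is why an average can come out as a whole number. The sketch below contrasts floor division with true division on made-up data. The `rows` list is hypothetical (it is not taken from init_sites.txt), and the code is written to behave the same under Python 2 and Python 3.

```python
# Hypothetical rows: a header line plus three whitespace-separated records
# whose 7th column holds the value to average.
rows = [
    "c1 c2 c3 c4 c5 c6 cds_length",
    "a b c d e f 100",
    "a b c d e f 101",
    "a b c d e f 103",
]

total = 0
count = 0
for row in rows[1:]:            # skip the header
    fields = row.split()        # split on whitespace
    total += int(fields[6])     # 7th column has index 6
    count += 1

floor_average = total // count        # integer (floor) division
true_average = float(total) / count   # true division keeps the fraction

print("%d vs %.2f" % (floor_average, true_average))
```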

4. The "many counters" problem

Write a script that reads a file of sequences and tallies how many sequences there are of each length. Use sequences3.txt as input to test your code. After reading through all the sequences, print the sequence length that was the most common.

Hint: you can use a dictionary to keep track of all the tallies, e.g.:

In [ ]:

# HINT CODE

tallyDictionary = {}    # start with an empty dictionary of tallies

seq = "ATGCTGATCGATATA"
length = len(seq)

if length not in tallyDictionary:
    tallyDictionary[length] = 1     # initialize if first occurrence
else:
    tallyDictionary[length] += 1    # otherwise just increment the count

In [5]:

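filename="sequences3.txt"
infile=open(filename, 'r')
tallies={} #maps sequence length -> number of sequences with that length
for line in infile:
    line=line.rstrip('\r\n')
    length=len(line)
    if length not in tallies:
        tallies[length]=1
    else:
        tallies[length]+=1
infile.close()

#after reading every sequence, find the length with the highest tally
mostCommon=max(tallies, key=tallies.get)
print mostCommon

51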
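
The tally-then-pick-the-largest pattern can also be sketched with `collections.Counter` from the standard library, which does the initialize-or-increment bookkeeping itself. This is an alternative to the dictionary approach in the hint, not a replacement for it. The `sequences` list is a hypothetical stand-in for the lines of sequences3.txt.

```python
from collections import Counter

# Hypothetical sequences standing in for the lines of sequences3.txt.
sequences = ["ATGC", "ATGCA", "GGCC", "TTAA", "ATGCAT"]

# Counter maps each sequence length to the number of times it occurs.
length_counts = Counter(len(seq) for seq in sequences)

# most_common(1) returns a one-element list of (length, count) pairs.
most_common_length, how_many = length_counts.most_common(1)[0]

print("Most common length: %d (%d sequences)" % (most_common_length, how_many))
```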

Homework exercise (10 Points)

Codon table

For this question, use codon_table.txt, which contains a list of all possible codons and their corresponding amino acids. We will be using this info to translate a nucleotide sequence into amino acids. Each part of this question builds off the previous parts.

(A) Thinkin' question (short answer, not code): If we want to create a codon dictionary and use it to translate nucleotide sequences, would it be better to use the codons or amino acids as keys? (2 Points)

Since multiple triplet codons can code for the same amino acid, while each codon codes for exactly one amino acid, I would use the codons as the keys.
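
A small sketch of why the direction matters: dictionary keys must be unique, so using amino acids as keys silently overwrites earlier codons. The three pairs below are real entries from the standard genetic code, but the variable names are hypothetical and the list is only a tiny subset of a full codon table.

```python
# (codon, amino acid) pairs: CTT and CTC both code for leucine (L).
pairs = [("CTT", "L"), ("CTC", "L"), ("ATG", "M")]

codon_to_aa = {}
aa_to_codon = {}
for codon, aa in pairs:
    codon_to_aa[codon] = aa   # every codon gets its own entry
    aa_to_codon[aa] = codon   # a later codon overwrites an earlier one

print("codon keys: %d entries" % len(codon_to_aa))
print("amino acid keys: %d entries" % len(aa_to_codon))
```

With codons as keys all three pairs survive, whereas with amino acids as keys only one codon per amino acid remains, so the table could no longer translate every codon.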

(B) Read in codon_table.txt (note that it has a header line) and use it to create a codon dictionary. Then use raw_input() to prompt the user to enter a single codon (e.g. ATG) and print the amino acid corresponding to that codon to the screen. If the nucleotide combination is not a valid codon, print a warning message. (4 Points)

In [3]:

codon="codon_table.txt"
infile=open(codon, 'r')
translation={} #maps codon -> amino acid
infile.readline() #skip the header line
for line in infile:
    line=line.rstrip('\r\n')
    (Codon, AA) = line.split()
    translation[Codon] = AA
infile.close()
name=raw_input("")
if name in translation: #the input is a valid codon
    print translation[name] #print the corresponding amino acid
else:
    print "Warning: "+name+" is not a valid codon"

ATG
M
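
User input often arrives in lowercase or with stray spaces, which would fail an exact dictionary lookup. The sketch below shows one possible way to normalize the text before looking it up. The helper `lookup` and the three-entry `codon_table` are hypothetical. The entries are real codons from the standard genetic code, but the table is only a small subset and is not read from codon_table.txt.

```python
# Hypothetical subset of a codon dictionary (codon -> amino acid).
codon_table = {"ATG": "M", "TGG": "W", "GGG": "G"}

def lookup(user_text, table):
    """Return the amino acid for a codon, or None if it is not in the table."""
    codon = user_text.strip().upper()  # ignore surrounding spaces and case
    return table.get(codon)            # dict.get returns None for a missing key

for text in [" atg ", "TGG", "XYZ"]:
    result = lookup(text, codon_table)
    if result is None:
        print("Warning: %r is not a valid codon" % text)
    else:
        print(result)
```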

(C) Now we will adapt the code in (B) to translate a longer sequence. Instead of prompting the user for a single codon, allow them to enter a longer sequence. First, check that the sequence they entered has a length that is a multiple of 3 (Hint: use the mod operator, %), and print an error message if it is not. If it is valid, then go on to translate every three nucleotides to an amino acid. Print the final amino acid sequence to the screen. We have included some code to help you out. You can either program this function from scratch, or add to the given code. (4 Points)

In [ ]:

#Prompt the user for a sequence


#Check that their sequence is a multiple of 3



#Loop through the sequence, in groups of 3, translating each one as you go
protSeq = "" #Add each amino acid to this string as you loop through the codons. 
for i in range(0,len(request),3): #request is the sequence given by the user
    codon = request[i:i+3] #gets the current codon under consideration
    
    #Use your dictionary to find what AA this codon corresponds to, as in part b. Print an error if it is invalid.
    
    
print "Your protein sequence is: " + protSeq

In [ ]:

codon="codon_table.txt"
infile=open(codon, 'r')
translation={} #maps codon -> amino acid
infile.readline() #skip the header line
for line in infile:
    line=line.rstrip('\r\n')
    (Codon, AA) = line.split()
    translation[Codon] = AA
infile.close()
name=raw_input("")
protSeq = ""
valid = True
if len(name)%3!=0: #the length must be a multiple of 3
    print "Error: sequence length is not a multiple of 3"
    valid = False
else:
    for i in range(0,len(name),3): #step through the sequence one codon at a time
        codon_seq = name[i:i+3]
        if codon_seq in translation:
            protSeq+=translation[codon_seq]
        else:
            print "Error: "+codon_seq+" is not a valid codon"
            valid = False
            break
if valid:
    print "Your protein sequence is: " + protSeq
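
The same steps (length check, then codon-by-codon lookup) can be sketched as a reusable function that can be tested without typing input. The function name `translate`, its choice of returning None on any error, and the three-entry `codon_table` are all assumptions made for this illustration. The table is a tiny subset of the standard genetic code and is not read from codon_table.txt.

```python
# Hypothetical subset of a codon dictionary (codon -> amino acid).
codon_table = {"ATG": "M", "TGG": "W", "GGG": "G"}

def translate(sequence, table):
    """Translate a nucleotide sequence, or return None if it is invalid."""
    if len(sequence) % 3 != 0:       # length must be a multiple of 3
        return None
    amino_acids = []
    for i in range(0, len(sequence), 3):
        codon = sequence[i:i + 3]    # current group of three nucleotides
        if codon not in table:       # unknown codon: give up
            return None
        amino_acids.append(table[codon])
    return "".join(amino_acids)

print(translate("ATGTGGGGG", codon_table))
```

Returning None instead of printing inside the function leaves it to the caller to decide what error message to show.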
