Dictionary

A dictionary can be created by listing key-value pairs inside curly braces.

We can access the value associated with a particular key by putting the key in square brackets after the dictionary name.

Dictionaries are mutable.

In [1]:

# Try to create a dictionary

virus = {'name':'HIV', 'type':'RNA', 'cell':'CD4'}
virus['name']


# Change values of a dictionary

virus['name'] = 'HCV'
print virus

{'cell': 'CD4', 'type': 'RNA', 'name': 'HCV'}
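
Looking up a key that is not in the dictionary raises a KeyError. The sketch below shows two common ways to guard against that, a membership test with in and the get method with a default. The default string 'unknown' is an arbitrary choice for illustration, and the code is written to run under both Python 2 and Python 3.

```python
# Same example dictionary as in the cell above
virus = {'name': 'HIV', 'type': 'RNA', 'cell': 'CD4'}

# A membership test looks at the keys, not the values
has_load = 'load' in virus

# get() returns the default instead of raising KeyError
load = virus.get('load', 'unknown')
name = virus.get('name', 'unknown')

print(has_load)
print(load)
print(name)
```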

Dictionary – Exercise

Make a dictionary called nucleotides with A, T, G and C as the keys and the corresponding nucleotide names as the values.
Print the value assigned to key 'G' in the dictionary nucleotides.
Print the first value stored in nucleotides. (Hint: DictName.values())
Print all keys and values stored in nucleotides.
Ask the user to input a sequence and print the name of each nucleotide using nucleotides.

In [2]:

# Make a dictionary
nucleotides = {'A':'Adenine', 'T':'Thymine', 'C':'Cytosine', 'G': 'Guanine'}
# Print value stored with key 'G'
print nucleotides["G"]

# Print the first value in the dictionary. Hint: DictName.values() 
print nucleotides.values()[0]
vals = nucleotides.values()
print(vals)
print(vals[0])

# Print all keys
print nucleotides.keys()

# Print all values
print nucleotides.values()

# Print all keys and their corresponding values

for key in nucleotides:
    print key, nucleotides[key]

# Use the nucleotides dictionary created earlier

# Ask the user for a sequence
# Check each base against the dictionary
# Print its full name, or stop at the first invalid base

user_input = raw_input("Please enter a sequence: ").upper()
for base in user_input:
    if base in nucleotides:
        print "Full name for " + base + " is " + nucleotides[base]
    else:
        print base + " is not an existing nucleotide!"
        break

Guanine
Adenine
['Adenine', 'Cytosine', 'Thymine', 'Guanine']
Adenine
['A', 'C', 'T', 'G']
['Adenine', 'Cytosine', 'Thymine', 'Guanine']
A Adenine
C Cytosine
T Thymine
G Guanine

Please enter a sequence: 
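
Note that the keys printed above do not come out in the order in which they were written. In Python 2 a dictionary has no guaranteed order, so which value counts as "first" depends on the interpreter. A minimal sketch of one way to get a predictable order is to sort the keys; it also shows items(), which yields key-value pairs directly. It runs under both Python 2 and Python 3.

```python
nucleotides = {'A': 'Adenine', 'T': 'Thymine', 'C': 'Cytosine', 'G': 'Guanine'}

# Sorting the keys gives the same order on every run
ordered = [(base, nucleotides[base]) for base in sorted(nucleotides)]
for base, name in ordered:
    print("%s %s" % (base, name))

# items() yields (key, value) pairs; sorting them orders by key
pairs = sorted(nucleotides.items())
```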

Adding to a Dictionary

You can add entries to Python dictionaries. Ask the user to enter a new key and value for your virus dictionary.
Create two dictionaries containing information about viruses. Can you create a database (using a list) containing the two dictionaries? Print the names of all viruses stored in the database.

In [7]:

# Add entries to a dictionary

virus = {'name':'HIV', 'type':'RNA', 'cell':'CD4'}

virus['load'] = 10000

user_key = raw_input("Please enter a key: ")
user_val = raw_input("Please enter a value: ")
# Add user defined entry to the virus dictionary
virus[user_key] = user_val

print virus
# Create two dictionaries containing information about viruses
virus1 = {'name':'HIV', 'type':'RNA', 'cell':'CD4', 'load':10000}
virus2 = {'name':'HCV', 'type':'RNA', 'cell':'CD4', 'load':3000}

# Create a database (a list) containing the two virus dictionaries. Print every record in the database
db = [virus1, virus2]

for vir in db:
    print vir

Please enter a key: 

Please enter a value: 

{'cell': 'CD4', 'load': 10000, 'type': 'RNA', 'name': 'HIV', 'killrate': '10'}
{'cell': 'CD4', 'load': 10000, 'type': 'RNA', 'name': 'HIV'}
{'cell': 'CD4', 'load': 3000, 'type': 'RNA', 'name': 'HCV'}
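
The loop above prints each record in full. To print only the virus names, as the exercise asks, one possible approach is to look up the 'name' key of every record. The variable names is introduced here for illustration only.

```python
virus1 = {'name': 'HIV', 'type': 'RNA', 'cell': 'CD4', 'load': 10000}
virus2 = {'name': 'HCV', 'type': 'RNA', 'cell': 'CD4', 'load': 3000}
db = [virus1, virus2]

# Collect the 'name' entry of each dictionary in the list
names = [vir['name'] for vir in db]
for name in names:
    print(name)
```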

Modifying a Dictionary

You can change the value stored under a key, e.g.: virus1['load'] = 2000

You can delete an entry with the del statement, just as with lists, e.g.: del virus1['load']

In [8]:

virus1['load'] = 10000
print virus1
virus1['load'] = virus1['load'] + 5000
print virus1
virus1['load'] += 5000
print virus1

# Delete an entry
del virus1['cell']
print virus1

{'cell': 'CD4', 'load': 10000, 'type': 'RNA', 'name': 'HIV'}
{'cell': 'CD4', 'load': 15000, 'type': 'RNA', 'name': 'HIV'}
{'cell': 'CD4', 'load': 20000, 'type': 'RNA', 'name': 'HIV'}
{'load': 20000, 'type': 'RNA', 'name': 'HIV'}
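
Deleting a key that does not exist raises a KeyError. The following sketch, which starts from the state of virus1 printed above, shows pop() as an alternative to del: it removes the key and returns its value, and with a default it does not raise when the key is missing. The variable names removed, deleted_again and missing are illustrative.

```python
virus1 = {'name': 'HIV', 'type': 'RNA', 'load': 20000}

# pop() removes the key and hands back the value it held
removed = virus1.pop('load')

# The key is gone now, so a second del raises KeyError
try:
    del virus1['load']
    deleted_again = True
except KeyError:
    deleted_again = False

# pop() with a default returns the default for a missing key
missing = virus1.pop('cell', None)

print(removed)
print(deleted_again)
print(virus1)
```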

Looping over a Dictionary

You can loop through a dictionary using its keys.

Create a dictionary that holds data for multiple patients.

Calculate the average virus load in HIV and HCV patients.

In [10]:

# Create dictionary that contains virus loads for three different patients

viral_load = {'Pat1':18000, 'Pat2':13000, 'Pat3':2200}
total_load = 0 # Assign a variable to store total virus load

# Loop through the dictionary to calculate sum of all virus loads
for name in viral_load:
    total_load += viral_load[name]
# or use one-liners to calculate sum of all virus loads
total_load = sum(viral_load.values())

# Create a dictionary of dictionaries to store patient specific data

#First initialize the dictionary patients
patients = {}
patients['Pat1']= {'name':'HIV', 'type':'RNA', 'cell':'CD4', 'load':18000}
patients['Pat2']= {'name':'HIV', 'type':'RNA', 'cell':'CD4', 'load':13000}
patients['Pat3']= {'name':'HIV', 'type':'RNA', 'cell':'CD4', 'load':19000}
patients['Pat4']= {'name':'HCV', 'type':'RNA', 'cell':'Hepa', 'load':2200}
patients['Pat5']= {'name':'HCV', 'type':'RNA', 'cell':'Hepa', 'load':8200}


HIV_load = []
HCV_load = []

for pat in patients:
    # Compare strings with ==; "is" tests object identity, not equality
    if patients[pat]["name"] == "HIV":
        HIV_load.append(patients[pat]["load"])
    elif patients[pat]["name"] == "HCV":
        HCV_load.append(patients[pat]["load"])

#print "Average HIV load = ", float(HIV_load)
#print "Average HCV load = ", float(HCV_load)
print(sum(HIV_load)/len(HIV_load))
print(sum(HCV_load)/len(HCV_load))

16666
5200

In [17]:

loads = {}
for pat in patients:
    if patients[pat]["name"] not in loads:
        loads[patients[pat]["name"]] = []
    loads[patients[pat]["name"]].append(patients[pat]["load"])

print loads
for virus in loads:
    print "%s  %d"%(virus, sum(loads[virus])/len(loads[virus]))

{'HCV': [8200, 2200], 'HIV': [18000, 19000, 13000]}
HCV  5200
HIV  16666
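
The averages above are truncated because Python 2 performs integer division on integers. As a hedged alternative, the grouping can be sketched with collections.defaultdict from the standard library, which creates the empty list for a new virus name automatically, and the average can be computed with float division. The patient records are copied from the cell above; averages and virus_name are names chosen for this sketch.

```python
from collections import defaultdict

patients = {
    'Pat1': {'name': 'HIV', 'type': 'RNA', 'cell': 'CD4', 'load': 18000},
    'Pat2': {'name': 'HIV', 'type': 'RNA', 'cell': 'CD4', 'load': 13000},
    'Pat3': {'name': 'HIV', 'type': 'RNA', 'cell': 'CD4', 'load': 19000},
    'Pat4': {'name': 'HCV', 'type': 'RNA', 'cell': 'Hepa', 'load': 2200},
    'Pat5': {'name': 'HCV', 'type': 'RNA', 'cell': 'Hepa', 'load': 8200},
}

# A missing key is initialised with an empty list on first access
loads = defaultdict(list)
for record in patients.values():
    loads[record['name']].append(record['load'])

# float() forces true division under Python 2 as well as Python 3
averages = {}
for virus_name, values in loads.items():
    averages[virus_name] = sum(values) / float(len(values))

for virus_name in sorted(averages):
    print("%s  %.1f" % (virus_name, averages[virus_name]))
```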

Challenge: DNA to Protein sequence

Translate the valid DNA sequences from seqs, given in the code below, to protein sequences using the dictionary codon_table.
Did you take the reading frames into account? Translate the codons for each of the three reading frames.

Start with the following bit of code:

In [25]:

bases = ['T', 'C', 'A', 'G']
codons = [a+b+c for a in bases for b in bases for c in bases]
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codon_table = dict(zip(codons, amino_acids))

# Confused by the code?? Ask us what it means!!

# Loop over the seqs
seqs = ['actgactgactgaattcgactg','caucgaucgcgauacacgaucagcuacg','augcagacgacguacgu','atcgatcgatcgatcacgt','atcgtagctactagctagc','acgatcgtagctacgta','cgaucagucgaucgauccagcga','cguacguagcacaugcagucaguauacguacggacgacgac','catgactgactgatcgatgctgactgactg','atcggatctgaactgactg','actgactgactgactg','caucgaucgcgauacacgaucagcuacg','augcagacgacguacgu','atcgatcgaattcgatcgatcacgt','atcgtagctactagctagc','acgatcgaattcgtagctacgta','cgaucagucgaucgauccagcga','cguacguagcacaugcagucaguauacguacggacgacgac','catgactgactgatcgatgaattcgctgactgactg','aucggauccgaaccgacag']

# Write your code here to translate sequences
# First check what is stored in codons, amino_acids and codon_table variables


# Convert to uppercase and replace U with T so that RNA sequences use the DNA alphabet
for index,item in enumerate(seqs):
    seqs[index] = item.upper().replace('U','T')
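
For anyone puzzled by how codon_table is built in this cell, here is a short sketch that inspects it. The list comprehension generates all 64 codons in the order TTT, TTC, TTA, TTG and so on, the string amino_acids lists the one-letter amino acid codes in the same order (with * marking a stop codon), and zip pairs them up one by one. The variables first_four and stops are added here for illustration.

```python
bases = ['T', 'C', 'A', 'G']
codons = [a + b + c for a in bases for b in bases for c in bases]
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codon_table = dict(zip(codons, amino_acids))

# The last base changes fastest, then the middle one, then the first
first_four = codons[:4]

# Collect every codon that maps to the stop symbol
stops = sorted(codon for codon in codon_table if codon_table[codon] == '*')

print(len(codon_table))
print(first_four)
print(codon_table['ATG'])
print(stops)
```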

In [15]:

proteins = []
for sequence in seqs:
    protein = [""]*3
    for frame in range(3):
        for position in range(frame, len(sequence), 3):
            triplet = sequence[position:position+3]
            if triplet not in codon_table:
                break
            amino_acid = codon_table[triplet]
            protein[frame] += amino_acid
    proteins.append(protein)

print proteins

[['TD*LNST', 'LTD*IRL', '*LTEFD'], ['HRSRYTISY', 'IDRDTRSAT', 'SIAIHDQL'], ['MQTTY', 'CRRRT', 'ADDVR'], ['IDRSIT', 'SIDRSR', 'RSIDH'], ['IVATS*', 'S*LLAS', 'RSY*L'], ['TIVAT', 'RS*LR', 'DRSYV'], ['RSVDRSS', 'DQSIDPA', 'ISRSIQR'], ['RT*HMQSVYVRTT', 'VRSTCSQYTYGRR', 'YVAHAVSIRTDDD'], ['HD*LIDAD*L', 'MTD*SMLTD', '*LTDRC*LT'], ['IGSELT', 'SDLN*L', 'RI*TD'], ['TD*LT', 'LTD*L', '*LTD'], ['HRSRYTISY', 'IDRDTRSAT', 'SIAIHDQL'], ['MQTTY', 'CRRRT', 'ADDVR'], ['IDRIRSIT', 'SIEFDRSR', 'RSNSIDH'], ['IVATS*', 'S*LLAS', 'RSY*L'], ['TIEFVAT', 'RSNS*LR', 'DRIRSYV'], ['RSVDRSS', 'DQSIDPA', 'ISRSIQR'], ['RT*HMQSVYVRTT', 'VRSTCSQYTYGRR', 'YVAHAVSIRTDDD'], ['HD*LIDEFAD*L', 'MTD*SMNSLTD', '*LTDR*IR*LT'], ['IGSEPT', 'SDPNRQ', 'RIRTD']]
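
The three-frame translation can also be wrapped in a function so that it is easy to test on a single sequence. The sketch below is one possible arrangement, not the only one; the function name translate_frames and its parameters are hypothetical. Like the loop above, it stops a frame at the first triplet that is not in the table, which includes an incomplete triplet at the end of the sequence.

```python
bases = ['T', 'C', 'A', 'G']
codons = [a + b + c for a in bases for b in bases for c in bases]
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codon_table = dict(zip(codons, amino_acids))

def translate_frames(sequence, table):
    """Translate the three forward reading frames of a DNA or RNA string."""
    sequence = sequence.upper().replace('U', 'T')
    frames = []
    for frame in range(3):
        protein = ''
        for position in range(frame, len(sequence), 3):
            triplet = sequence[position:position + 3]
            if triplet not in table:
                break  # unknown letter or incomplete trailing triplet
            protein += table[triplet]
        frames.append(protein)
    return frames

# One of the sequences from seqs, used here as a quick check
result = translate_frames('atcggatctgaactgactg', codon_table)
print(result)
```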

In [24]:

start_codon = "ATG"
for sequence in seqs:
    start_pos = [n for n in xrange(len(sequence)) if sequence.find(start_codon, n) == n]
    for i in start_pos:
        triplets = [sequence[n:n+3] for n in range(i,len(sequence),3)]
        protein = ''.join([codon_table.get(triplet,"") for triplet in triplets])
        stop_pos = protein.find("*")
        if stop_pos != -1:
            print sequence
            print triplets
            print protein
            print protein[:stop_pos]+"\n"

CATGACTGACTGATCGATGCTGACTGACTG
['ATG', 'ACT', 'GAC', 'TGA', 'TCG', 'ATG', 'CTG', 'ACT', 'GAC', 'TG']
MTD*SMLTD
MTD

CATGACTGACTGATCGATGAATTCGCTGACTGACTG
['ATG', 'ACT', 'GAC', 'TGA', 'TCG', 'ATG', 'AAT', 'TCG', 'CTG', 'ACT', 'GAC', 'TG']
MTD*SMNSLTD
MTD
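
The same idea, reporting only proteins that begin at a start codon and end at a stop codon, can be sketched as a function that stops translating as soon as it meets the stop. The name find_orfs and its parameters are hypothetical, and the sketch assumes the sequence has already been converted to uppercase DNA letters as in the earlier cell.

```python
bases = ['T', 'C', 'A', 'G']
codons = [a + b + c for a in bases for b in bases for c in bases]
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codon_table = dict(zip(codons, amino_acids))

def find_orfs(sequence, table, start_codon='ATG'):
    """Return the protein for every start codon that is followed by an in-frame stop."""
    orfs = []
    position = sequence.find(start_codon)
    while position != -1:
        protein = ''
        # Step through complete triplets only
        for n in range(position, len(sequence) - 2, 3):
            amino_acid = table[sequence[n:n + 3]]
            if amino_acid == '*':
                orfs.append(protein)  # keep only frames closed by a stop
                break
            protein += amino_acid
        # Look for the next start codon, overlapping ones included
        position = sequence.find(start_codon, position + 1)
    return orfs

orfs = find_orfs('CATGACTGACTGATCGATGCTGACTGACTG', codon_table)
print(orfs)
```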
