CoCalc -- hw4_ex.ipynb

An exercise in cracking a substitution cipher

Views: ⁵⁶⁵

Kernel: Python 3 (Anaconda 5)

Cryptography

Your homework this week is to find the plaintext for the ciphertext below.

C = 'ENEOENFLEOKFXAQNOFDEMTPXOQONPOKPOPOJEOHEXIVWFVWQXKEOKIPHOSFMVMPOPUEMCWPPNAEMVFXEVESFFNFZFOWFDPOEMCWPNEXMWQYVPVWFDFMVAQOMVFXMCWPPNWFDFOVPOVPEVVFOKCEATXQKSFHOQZFXMQVIEOKTFCEAFVWFFKQVPXPUVWFHOKFXSXEKHEVFYEYFXSXEOVEEUVFXSXEKHEVQOSUXPACEATXQKSFQOAQNOFAPZFKTECGVPNPOKPODQVWFOPHSWMEZQOSMVPNQZFUPXPOFIFEXWFDEMKFVFXAQOFKVPTFCPAFEDXQVFXTIWFWEKTFFOPUUFXFKVWFYPMQVQPOPUEMMQMVEOVFKQVPXEVYHOCWECNEMMQCTXQVQMWWHAPXAESERQOFWFXFAEQOFKEVYHOCWUPXVWFOFLVFQSWVIFEXM'

The code that produced this ciphertext is in 171/Crypto/decrypt.ipynb. If you read that, you'll recall that this is ciphertext produced by a substitution cipher.

Substitution ciphers are interesting because they are not crackable by brute force (you can't try all the 403291461126605635584000000 possible mappings of the alphabet to itself). However they are easily crackable by using statistics.

We will use statistical properties of English plaintext to figure out what mapping was used to produce the ciphertext C. We could compute the plaintext statistics ourselves by using a large body of English writing, like a novel from Project Gutenberg. However for simplicity let's just use the statistics that have already been computed here:

https://www3.nd.edu/~busiforc/handouts/cryptography/Letter%20Frequencies.html

In [2]:

#By the way, this is how I computed factorial(26) which is the number of possible transposition ciphers on an alphabet with 26 symbols. Can you figure out how it works?

import functools

functools.reduce(lambda x,y:x*y,range(1,27))

403291461126605635584000000

Strategy

The strategy for breaking the cipher is to find the following:

The frequency of each letter in C.
The frequency of each bigram (sequence of two letters) in C.
The frequency of each trigram (sequence of three letters) in C.

From this data we will make educated guesses about what the mapping is.

We will start making a "guess" dictionary about what the mapping is. Initially we have no guesses.

In [5]:

# Be sure to actually enter this cell and the other cells I wrote already. 

guess = {l:'?' for l in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}
guess

{'A': '?',
 'B': '?',
 'C': '?',
 'D': '?',
 'E': '?',
 'F': '?',
 'G': '?',
 'H': '?',
 'I': '?',
 'J': '?',
 'K': '?',
 'L': '?',
 'M': '?',
 'N': '?',
 'O': '?',
 'P': '?',
 'Q': '?',
 'R': '?',
 'S': '?',
 'T': '?',
 'U': '?',
 'V': '?',
 'W': '?',
 'X': '?',
 'Y': '?',
 'Z': '?'}

After we have information about the most commonly occurring letter, bigrams, and trigrams in C, we will be able to fill in our guess dictionary and apply it to C. This may allow us to make further guesses, eventually cracking the cipher.

To get you started, I wrote the code for finding the most commonly occurring single letters, as well as the most commonly occurring 5-grams.

In [2]:

C = 'ENEOENFLEOKFXAQNOFDEMTPXOQONPOKPOPOJEOHEXIVWFVWQXKEOKIPHOSFMVMPOPUEMCWPPNAEMVFXEVESFFNFZFOWFDPOEMCWPNEXMWQYVPVWFDFMVAQOMVFXMCWPPNWFDFOVPOVPEVVFOKCEATXQKSFHOQZFXMQVIEOKTFCEAFVWFFKQVPXPUVWFHOKFXSXEKHEVFYEYFXSXEOVEEUVFXSXEKHEVQOSUXPACEATXQKSFQOAQNOFAPZFKTECGVPNPOKPODQVWFOPHSWMEZQOSMVPNQZFUPXPOFIFEXWFDEMKFVFXAQOFKVPTFCPAFEDXQVFXTIWFWEKTFFOPUUFXFKVWFYPMQVQPOPUEMMQMVEOVFKQVPXEVYHOCWECNEMMQCTXQVQMWWHAPXAESERQOFWFXFAEQOFKEVYHOCWUPXVWFOFLVFQSWVIFEXM'

def one_gram_freq(C):
    hist = {}
    for l in C:  
        try:
            hist[l] += 1
        except:
            hist[l] = 1
    return hist

  
    
def five_gram_freq(C):
    hist = {}
    for index,letter in enumerate(C):
        if(index <= len(C)-5):
            current_five_gram = C[index:index+5]
            try:
                hist[current_five_gram] += 1
            except:
                hist[current_five_gram] = 1
    return hist

In [0]:

In [3]:

hist = five_gram_freq(C)
sorted(hist.items(), key = lambda x: -x[1])[:7]

[('FXSXE', 3),
 ('AQNOF', 2),
 ('NPOKP', 2),
 ('POKPO', 2),
 ('POPUE', 2),
 ('OPUEM', 2),
 ('EMCWP', 2)]

In [8]:

hist = one_gram_freq(C)
freq = sorted(list(hist.items()),key= lambda x: -x[1])
freq

[('F', 57),
 ('E', 43),
 ('O', 39),
 ('V', 38),
 ('P', 36),
 ('X', 30),
 ('Q', 28),
 ('W', 24),
 ('M', 22),
 ('K', 20),
 ('A', 14),
 ('N', 12),
 ('S', 12),
 ('C', 12),
 ('H', 10),
 ('T', 9),
 ('U', 9),
 ('D', 7),
 ('I', 6),
 ('Y', 6),
 ('Z', 5),
 ('L', 2),
 ('J', 1),
 ('G', 1),
 ('R', 1)]

Observe that F is the most commonly occurring single letter in C. A good guess then (based on the information here) is that F is the encryption of E. We can now update our guess dictionary and see what the partially decrypted C looks like.

In [6]:

guess['F'] = 'E'
guess['E'] = 'A'
def apply_guess(C,guess):
    C_modified = ""
    for c in C:
        if guess[c] != '?':
            C_modified += guess[c].lower()
        else:
            C_modified += c
    return C_modified

apply_guess(C,guess)

'aNaOaNeLaOKeXAQNOeDaMTPXOQONPOKPOPOJaOHaXIVWeVWQXKaOKIPHOSeMVMPOPUaMCWPPNAaMVeXaVaSeeNeZeOWeDPOaMCWPNaXMWQYVPVWeDeMVAQOMVeXMCWPPNWeDeOVPOVPaVVeOKCaATXQKSeHOQZeXMQVIaOKTeCaAeVWeeKQVPXPUVWeHOKeXSXaKHaVeYaYeXSXaOVaaUVeXSXaKHaVQOSUXPACaATXQKSeQOAQNOeAPZeKTaCGVPNPOKPODQVWeOPHSWMaZQOSMVPNQZeUPXPOeIeaXWeDaMKeVeXAQOeKVPTeCPAeaDXQVeXTIWeWaKTeeOPUUeXeKVWeYPMQVQPOPUaMMQMVaOVeKQVPXaVYHOCWaCNaMMQCTXQVQMWWHAPXAaSaRQOeWeXeAaQOeKaVYHOCWUPXVWeOeLVeQSWVIeaXM'

For readability I decided to use a scheme in which "guessed" letters are lowercase and unguessed letters are uppercase. You may want to modify this so that unguessed letters are asterisks or some other symbol.

You do the rest.

Happy hunting. Please don't use an online cracker -- your code will be graded not your answer. Also please don't just download a cracker program written in Python. It will be obvious and you'll get a zero. There will be partial credit given.

In [0]:

In [0]:

In [0]: