CoCalc Public Filesweek 4 / assignment / Caesar Cipher.ipynb
Author: Boyuan Qi
Views : 37
Compute Environment: Ubuntu 20.04 (Default)

# Assignment 1: The Caesar Cipher

## Waypoints

By now you should have completed the fourth waypoint (this was in chapter 10, from week 1), and have the corresponding numerical value answer. This should be given here to confirm you have successfully completed it.

Marks: 2
Answer that will be automatically graded below, ID: cell-55ff9879af308205
In [ ]:
SdudentID=100911761
Waypoint4=72

Test your code from above here(2 points), ID: cell-86a036301ddef2e9
In [ ]:
#test Waypoint4


## Introduction

The ultimate purpose of this assignment is that you are provided with two documents, encrypt.txt and reference_reduced.txt. These are stored on a webserver, and will be download when required. reference_reduced.txt is a generic piece of text in some European (non-English) language. You are going to use this to work out the typical frequencies of different characters in that language. encrypt.txt is another piece of text in that language, but it has been encrypted by a Caesar cipher of unknown offset. You are going to analyse the frequency of the letters in that text, and work out what the unknown offset is.

### Caesar Cipher

The Caesar cipher encrypts letters by shifting them by a fixed offset in the alphabet. For example, an offset of 2 performs the encoding $A\mapsto C,\quad B\mapsto D,\quad C\mapsto E,\quad\ldots\quad X\mapsto Z,\quad Y\mapsto A,\quad Z\mapsto B$ If there is any punctuation (including spaces), these do not get encoded, and are just left as they are. There are different conventions for what you could choose to do with capital letters.

To check your understanding, manually calculate the following (there is no need to write code):

1. Apply the Caesar cipher of offset 4 (so $A\mapsto E$) to the string "Python is great". All letters should be in capitals. Set the variable encoded to the encoded string.
2. The string "P DPSS HWWSF MVY HU PUALYUZOPW" has been encoded with a Caesar cipher of offset 7. Decode it, and set the output to the variable decoded.
3. What is the most commonly used letter in the English language? Give your single character answer as a string named common
4. In a long piece of English text which you know is encrypted using a Caesar cipher, if the most common letter is Z, what is the most likely offset of the encoding? Give the integer answer as variable offset

So, let's say we think the answer to question 4 is 0. You'd write in the box below offset=0. Similarly, you'd write decoded="SOME TEXT".

Marks: 4
Answer that will be automatically graded below, ID: cell-fa4bbdcd7e48caaf
In [ ]:
# YOUR CODE HERE
encoded='TCXLSR MW KVIEX'
decoded='PYTHON IS GREAT'
encoded='P DPSS HWWSF MVY HU PUALYUZOPW'
decoded='I WILL APPLY FOR AN INTERNSHIP'
common='E'
offset=22
decoded='OKIA PATO'

Test your code from above here(1 point), ID: cell-9f20aa9e4bd4e071
In [ ]:
#test to check that the encoded variable is correct

Test your code from above here(1 point), ID: cell-82cb8a7f9a451a7e
In [ ]:
#test to check that the decoded variable is correct

Test your code from above here(1 point), ID: cell-8faf43ef0e7bc281
In [ ]:
#test to check that the common variable is correct

Test your code from above here(1 point), ID: cell-6225659c6e606834
In [ ]:
#test to check that the offset variable is correct


## Character Frequencies

Clearly, we want to be able to identify the most commonly occuring letters if we want to decode an encrypted message.

You are provided with a function that will read text from a given file on a website. Complete the function get_freqs so that it accepts one argument, a string of text. Return a list with 26 elements in it. The first is the number of "a"s in the text, the second is the number of "b"s and so on.

• Punctuation has not been encrypted
• Hint: don't forget about capital letters! For frequency counts, "a" and "A" are equivalent.
• Hint: what useful "string methods" does python provide that might speed up your count?
• Your code most return the answer rather than print it (you can print it as well, but the return is crucial), and it must use the same function name that you've been given.

Look at the tests to understand what the input and output should look like.

Marks: 4
Answer that will be automatically graded below, ID: cell-c09143c67a934fc8
In [ ]:
import requests
# def load_file(fName):
#     """Input: filename (string)
#     Output: file contents as a string"""
#     assert isinstance(fName, str),"Input must be a string"
#     f = open(fName,'r')
#     return f.read()

def load_file(fName):
url="https://www.ma.rhul.ac.uk/akay/teaching/python/"+fName
response=requests.get(url)
#check that it opened correctly
assert response.status_code == requests.codes.ok
return response.text

def get_freqs_from_file(fName):
"""Input: filename (string) to open)
Output: list of 26 elements
The first element of the output is the number of 'a's in the given text, the second element is the number of 'b's"""
assert isinstance(fName, str),"Input must be a string"
return get_freqs(load_file(fName))

def get_freqs(data):
"""Input: string
Output: list of 26 elements
The first element of the output is the number of 'a's in the given string, the second element is the number of 'b's"""
# YOUR CODE HERE
raise NotImplementedError()
s=input()
str1=''
alphabet=['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
for i in range(len(s)):
if s[i] in alphabet:
temp=ord(s[i])
num=(temp-97+3)%26
str1=chr(num+97)
print(str1,end="")
else:
print(" ",end="")


Test your code from above here(1 point), ID: cell-39fa53789544edfd
In [ ]:
#first basic tests
#see how, if you don't understand what the question wants, you can probably use these examples to figure it out?
assert get_freqs("abcdde")==[1,1,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
assert get_freqs("ab? cdde.!")==[1,1,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

Test your code from above here(1 point), ID: cell-9c853ed8d5c06ed7
In [ ]:
#how do you handle mixed cases?
assert get_freqs("Hello World")==[0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 3, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]

Test your code from above here(1 point), ID: cell-437de983f2f88b80
In [ ]:
#a more extensive test

Test your code from above here(1 point), ID: cell-9c33d89abf53c180
In [ ]:
#test the the frequencies from the two given files are correct.


## Vector Inner Product

Recall that if you have two vectors $\underline v_1$ and $\underline v_2$, you can calculate their inner product, $\underline v_1\cdot\underline v_2=\|\underline v_1\|\|\underline v_2\|\cos\theta,$ where $\theta$ is the angle between the two vectors, measuring how similar the directions of the two vectors are.

Complete the function inner_product that accepts $\underline v_1$ and $\underline v_2$ as lists, and returns their inner product.

Don't forget pre-conditions!

Marks: 5
Answer that will be automatically graded below, ID: cell-c9b55875f78ab93e
In [3]:
def inner_product(list1,list2):
'''input: two lists of real numbers of the same length
output: the inner product of the two lists'''
# YOUR CODE HERE

assert all(isinstance(n,int) for n in list1)
assert all(isinstance(n,int) for n in list2)
calculation=sum([m[0]*m[1] for m in zip(list1,list2)]) #Dot product calculation
return(calculation)

print(inner_product([1,2],[1,5]))
print(inner_product([1,0,1],[-2,5,2]))


11 0
Test your code from above here(2 points), ID: cell-bfb30a03c1900850
In [ ]:
#basic tests
assert inner_product([1,2],[1,5])==11
assert inner_product([1,0,1],[-2,5,2])==0

Test your code from above here(1 point), ID: 0fa0b8
In [ ]:
# test preconditions (check how many there are)


Test your code from above here(1 point), ID: 168a0b
In [ ]:
# further test on preconditions


Test your code from above here(1 point), ID: cell-b935a4abba4f6caf
In [ ]:
#test the function applied to the file. If previous hidden test failed, this one will as well.


## Cycling

If we run the command:

inner_product(get_freqs_from_file("encrypt.txt"),get_freqs_from_file("reference.txt"))


we'll get a measure of how similar the character frequency distributions are between the encrypted text and the reference text (it would make more sense if we rescaled based on the lengths of the two individual vectors, but that's unnecessary for what's coming). Of course, we don't expect them to be very similar. But, if we tried every possible decryption of the encrypted text, then probably the correctly decrypted version would have the largest inner product. Of course, we don't have to try all possible decryptions of the encrypted text, we just use the frequency analysis that we've already done.

The cycle_list operation is defined as taking a list (first argument), removing one element from the front, and putting it back on the end. This can be repeated k times (optional second argument). Complete the cycle_list function.

Marks: 2
Answer that will be automatically graded below, ID: cell-6a8f42d8dce8e051
In [2]:
def cycle_list(myList,k=1):
"""
Input: a list and a cycle length
Output: the list cycled k times (defaults to 1)"""
# YOUR CODE HERE

assert isinstance (k,int)
assert k>0
newList=[] #a new list created to enter the shifted elements
for x in range(len(myList)):
shift=myList[(x+1)%len(myList)]
newList.append(shift)
output=newList
if k==1: #order 1
return output
else: #not order 1
return cycle_list(output,k-1)

print(cycle_list([1,2,3,4]))
print(cycle_list([1,2,3,4],k=3))
print(cycle_list([1,2,"a",4,"b"]))

[2, 3, 4, 1] [4, 1, 2, 3] [2, 'a', 4, 'b', 1]
Test your code from above here(1 point), ID: cell-a991ec688e93a51a
In [ ]:
#simple tests only this time
assert cycle_list([1,2,3,4])==[2,3,4,1]
assert cycle_list([1,2,3,4],3)==[4,1,2,3]
assert cycle_list([1,2,"a",4,"b"])==[2,"a",4,"b",1]

Test your code from above here(1 point), ID: cell-5726340677847deb
In [ ]:
cycling_list=[1,2,3,4,5]
#this catches a surprising issue that can arise.
assert cycle_list(cycling_list)==[2,3,4,5,1]
assert cycle_list(cycling_list)==[2,3,4,5,1]


## Bringing it Together

For each of the 26 different possible cycle lengths, we want to apply that cycle to the frequency information of the encrypted text, and calculate the inner product with the reference frequency information. Then we need to determine which was the cycle with the largest inner product. Complete the all_cycles function.

Marks: 4
Answer that will be automatically graded below, ID: cell-64f2bf0e4f2afa1e
In [5]:
def get_best_cycle_length():
'''
Inputs: none
Outputs: length of cycle that gives largest inner product.
'''
#test all the possible Caesar cipher offsets
ans=all_cycles(encrypt_freqs,ref_freqs)
return ans.index(max(ans))    #find the largest entry in the list, and return its position in the list. that's the desired offset

def all_cycles(list1,list2):
'''
Input: two lists of equal length containing frequency data
Output: list that is the same length as the input. Element n (0-indexed) is inner product between n-cycle of first list, and second list'''
# YOUR CODE HERE

assert len(list1)==len(list2)
longList=[]
for x in range(1,len(list1)+1): #We start at 1 because at minimum k=1
brandNew=cycle_list(list1,k=x)
calculation=inner_product(brandNew,list2) #The inner product calculation happens here between the brandNew list and list 2
longList.append(calculation)
bigShift=cycle_list(longList,k=(-1)%len(list1)) #As we can not start from k=0, we must shift the longList
assert len(bigShift)==len(list1)
return bigShift

print(all_cycles([1,0,0,0,0],[1,2,3,7,9]))
print(all_cycles([0,0,1,1,0],[1,2,3,7,9]))

[1, 9, 7, 3, 2] [10, 5, 3, 10, 16]
Test your code from above here(1 point), ID: cell-ba28bca0d8c3fb4c
In [ ]:
assert all_cycles([1,0,0,0,0],[1,2,3,7,9])==[1,9,7,3,2]
assert all_cycles([0,0,1,1,0],[1,2,3,7,9])==[10,5,3,10,16]

In [ ]:


Test your code from above here(1 point), ID: cell-1db7b6179c8c050f
In [ ]:
#test that you find the correct cycle length

Test your code from above here(1 point), ID: cell-42b8136be185d164
In [ ]:
#a sneaky test to check your input and output assertions

Test your code from above here(1 point), ID: cell-018ee4692e5e9177
In [ ]:
#make sure you're actually following the instructions and using the functions as you were asked to


Now that get_best_cycle_length has told us the correct cycle to apply to decrypt the text, we can decrypt it.

In [ ]:
def decrypt(fName,cycle):
letters = "abcdefghijklmnopqrstuvwxyz"
text=load_file(fName)
result=""
for character in text:
if character in letters:
result+=chr((ord(character)-ord("a")-cycle % 26)+ord("a"))
elif character.lower() in letters:
result+=chr((ord(character)-ord("A")-cycle % 26)+ord("A"))
else:
result+=character
return result

print(decrypt("encrypt.txt",get_best_cycle_length())[:1000])


Explain (in your own words) how the for loop works within decrypt: what do the chr and ord functions do? Each member of the pair programming pair should write their own explanation, independently, in a separate copy of the file (which will contain the same solutions to the programming questions).

Marks: 3
Manually graded answer(3 points), ID: cell-1df073bb8f9bc85b

YOUR ANSWER HERE

In [ ]:
chr takes encoding as a parameter，returns the character corresponding to the encoding. And ord takes one character as an argument，returns the corresponding Unicode value.

Manually graded task(3 points), ID: cell-2808c9f1e7304cb7

General feedback on your code structure, style, commenting, tests, explanations etc. You do not have to do anything here.

Marks: 3