
Stylometry_exercise

Kernel: Python 3 (Anaconda 5)

String Processing

Today we will do some string processing and practice writing loops.

#Here is a string with whitespace
a_string = " anaconda "
#This strips the whitespace from the left
a_string.lstrip()
'anaconda '
#This strips the whitespace from the right
a_string.rstrip()
' anaconda'
#Both sides at once
a_string = a_string.strip()
a_string
'anaconda'
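As an aside (not needed below), strip and its one-sided cousins also accept an argument naming the characters to remove, so they are not limited to whitespace. A quick sketch:

```python
# strip/lstrip/rstrip take an optional argument: the set of characters to strip.
s = "***anaconda***"
print(s.strip("*"))    # both ends: 'anaconda'
print(s.lstrip("*"))   # left end only: 'anaconda***'
print(s.rstrip("*"))   # right end only: '***anaconda'
```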
#Turn a comma separated set of items in a string into a list
comma_separated = "eats, shoots, leaves"
comma_separated = comma_separated.split(",")
comma_separated
['eats', ' shoots', ' leaves']
#Turn a space separated set of items in a string into a list
space_separated = "larry moe curly"
space_separated = space_separated.split(" ")
space_separated
['larry', 'moe', 'curly']
#Put them back together again! Join is the opposite of split.
" ".join(space_separated)
'larry moe curly'
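Notice that splitting on ',' earlier left a leading space on ' shoots' and ' leaves'. A common cleanup is to strip each piece after splitting; a sketch using a plain loop:

```python
comma_separated = "eats, shoots, leaves".split(",")
cleaned = []
for item in comma_separated:
    cleaned.append(item.strip())   # drop the stray spaces around each item
print(cleaned)   # ['eats', 'shoots', 'leaves']
```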
# I should have mentioned before that array slicing works on strings...
word = "antidisestablishmentarianism"
word.find("establish"), word[7:7+len("establish")]
(7, 'establish')
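One caveat before we lean on find: when the substring is absent, find returns -1 instead of raising an error, so it pays to check the result before using it as a slice index.

```python
word = "antidisestablishmentarianism"
position = word.find("xyz")
print(position)   # -1: the substring was not found
if position == -1:
    print("not found -- don't use this as a slice index!")
```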
### Exercise
#Here is the shadow password for myself on my system.
#The value between the 2nd and 3rd '$' symbols is my "salt".
#How can you extract my salt?
#For more on this see: https://en.wikipedia.org/wiki/Passwd#Shadow_file
shadow = "hunter:$6$2rrL/yy6$d.yj.i5fpz3ZKvTbQoUva4hyRmpdyzeWJ4UDhQE32ZkzMqM7Y0gI/SjfMXs1d/Dn08i9bWrm5QJxaPIDEr1Hv1:17214:0:99999:7:::"
#Solution:
firstDollar = shadow.find('$')
secondDollar = shadow.find('$', firstDollar+1)
thirdDollar = shadow.find('$', secondDollar+1)
shadow[secondDollar+1:thirdDollar]
'2rrL/yy6'
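An alternative solution: since split works on any separator, splitting the shadow entry on '$' puts the salt at index 2 of the resulting list.

```python
shadow = "hunter:$6$2rrL/yy6$d.yj.i5fpz3ZKvTbQoUva4hyRmpdyzeWJ4UDhQE32ZkzMqM7Y0gI/SjfMXs1d/Dn08i9bWrm5QJxaPIDEr1Hv1:17214:0:99999:7:::"
# split('$') gives ['hunter:', '6', '2rrL/yy6', ...], so the salt is element 2.
salt = shadow.split('$')[2]
print(salt)   # '2rrL/yy6'
```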
#Let's take a sentence and try to make a list of the words in that sentence.
#We want no punctuation.
#We also want all letters in lower case.
#Here's a starting sentence:
sentence = "It is a truth universally acknowledged, that a single man in possession of good fortune must be in want of a wife."
sentence
'It is a truth universally acknowledged, that a single man in possession of good fortune must be in want of a wife.'
#First we replace the commas with the empty string
sentence = sentence.replace(',', '')
sentence
'It is a truth universally acknowledged that a single man in possession of good fortune must be in want of a wife.'
#Then we replace the periods with the empty string
sentence = sentence.replace('.', '')
sentence
'It is a truth universally acknowledged that a single man in possession of good fortune must be in want of a wife'
#Now let's try it in a more automated way.
#Suppose we already know which punctuation marks are included.
#Then the approach below would work to remove all the punctuation.
sentence = "It is a truth universally acknowledged, that a single man in possession of good fortune must be in want of a wife."
punctuation = '.,'
for p in punctuation:
    sentence = sentence.replace(p, '')
sentence
'It is a truth universally acknowledged that a single man in possession of good fortune must be in want of a wife'
#Convert to all lowercase
sentence = sentence.lower()
sentence
'it is a truth universally acknowledged that a single man in possession of good fortune must be in want of a wife'
#Done!
sentence = sentence.split()
sentence
['it', 'is', 'a', 'truth', 'universally', 'acknowledged', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife']
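The whole pipeline above (strip the known punctuation, lowercase, split) fits naturally into one reusable function; a sketch, with the name words_of my own invention:

```python
def words_of(sentence, punctuation='.,'):
    # Remove each known punctuation mark, lowercase, then split on whitespace.
    for p in punctuation:
        sentence = sentence.replace(p, '')
    return sentence.lower().split()

print(words_of("It is a truth universally acknowledged, that a single man in "
               "possession of good fortune must be in want of a wife."))
```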
#Here's a set of the unique words in the sentence.
#That means all the "repeats" have been removed.
set(sentence)
{'a', 'acknowledged', 'be', 'fortune', 'good', 'in', 'is', 'it', 'man', 'must', 'of', 'possession', 'single', 'that', 'truth', 'universally', 'want', 'wife'}
#We can count how many times each word occurs
unique_words = list(set(sentence))
for uw in unique_words:
    print("'{}' occurs {} times".format(uw, sentence.count(uw)))
'fortune' occurs 1 times
'that' occurs 1 times
'a' occurs 3 times
'possession' occurs 1 times
'wife' occurs 1 times
'man' occurs 1 times
'good' occurs 1 times
'must' occurs 1 times
'truth' occurs 1 times
'single' occurs 1 times
'want' occurs 1 times
'it' occurs 1 times
'be' occurs 1 times
'is' occurs 1 times
'universally' occurs 1 times
'acknowledged' occurs 1 times
'of' occurs 2 times
'in' occurs 2 times
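For the record, the standard library already packages this counting pattern: collections.Counter maps each item to its count and can list the most common entries. A sketch on our word list:

```python
from collections import Counter

words = ('it is a truth universally acknowledged that a single man in '
         'possession of good fortune must be in want of a wife').split()
counts = Counter(words)
print(counts['a'])            # 3
print(counts.most_common(3))  # the three most frequent words with their counts
```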
### Exercise...
#Print how many times each letter occurs in the sentence
sentence = "the quick brown fox jumped over the lazy dog"
#Modify this to get the sentence from the user!
## This is a little bit advanced...
## We'll produce a list of the unique words in a bit of text, sorted by frequency.
original_text = '''President Trump is facing a test to his presidency unlike any faced by a modern American leader.

It’s not just that the special counsel looms large. Or that the country is bitterly divided over Mr. Trump’s leadership. Or even that his party might well lose the House to an opposition hellbent on his downfall.

The dilemma — which he does not fully grasp — is that many of the senior officials in his own administration are working diligently from within to frustrate parts of his agenda and his worst inclinations.'''
text = original_text
#The command below where we get the punctuation symbols is called a "list comprehension".
#We will explain them later, but you could also learn about them now by Googling.
#Every string has a built-in method "isalpha" that returns True only if the string
#contains only letters.
punctuation = [t for t in text if not t.isalpha() and t != ' ']
punctuation
['.', '\n', '\n', '’', '.', '.', '’', '.', '.', '\n', '\n', '—', '—', '.']
#unique punctuation symbols
punctuation = list(set(punctuation))
punctuation
['\n', '.', '’', '—']
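If we don't know the punctuation in advance, Python's string module offers the constant string.punctuation holding the ASCII punctuation characters. Note that it omits typographic marks like the curly apostrophe and the em dash, which is one reason the isalpha approach above is more robust for text copied from the web:

```python
import string

print(string.punctuation)           # the ASCII punctuation characters
print('.' in string.punctuation)    # True
print('’' in string.punctuation)    # False: typographic marks are not included
print('—' in string.punctuation)    # False
```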
#Now, should we replace all punctuation with the empty string ''?
#Or, should we replace all punctuation with spaces?
#Both have downsides. In the first case, "one line\nThe next line" becomes
#"one lineThe next line", which creates the weird word lineThe.
#If we replace all punctuation with spaces then "Trump's" becomes "Trump s"
#and it seems like "s" is a word.
#Dealing with possessives and contractions is a dicey subject in and of itself.
#I will replace \n with a space and everything else with the empty string.
text = text.replace('\n', ' ')
#Now we will replace all remaining punctuation marks with the empty string.
for p in punctuation:
    text = text.replace(p, '')
text
'President Trump is facing a test to his presidency unlike any faced by a modern American leader Its not just that the special counsel looms large Or that the country is bitterly divided over Mr Trumps leadership Or even that his party might well lose the House to an opposition hellbent on his downfall The dilemma which he does not fully grasp is that many of the senior officials in his own administration are working diligently from within to frustrate parts of his agenda and his worst inclinations'
#Now convert to lowercase...
text = text.lower()
text
'president trump is facing a test to his presidency unlike any faced by a modern american leader its not just that the special counsel looms large or that the country is bitterly divided over mr trumps leadership or even that his party might well lose the house to an opposition hellbent on his downfall the dilemma which he does not fully grasp is that many of the senior officials in his own administration are working diligently from within to frustrate parts of his agenda and his worst inclinations'
# We make a list of pairs. The first part of each pair is the word count, and the second part is the word counted.
# The weird thing I'm doing adding spaces (" "+word+" ") is to stop 'a' from being counted once for every freestanding word and also once for every time the letter occurs!
# By adding spaces we only count the 'a's that show up as standalone words.
text = " " + text + " "
words = text.split()
unique_words = set(words)
count_and_word = []
for word in unique_words:
    count_and_word.append([text.count(" "+word+" "), word])
count_and_word
[[1, 'unlike'], [1, 'over'], [1, 'does'], [1, 'own'], [1, 'divided'], [1, 'he'], [1, 'officials'], [1, 'house'], [1, 'grasp'], [1, 'faced'], [2, 'or'], [1, 'many'], [1, 'parts'], [6, 'his'], [1, 'modern'], [3, 'is'], [1, 'by'], [4, 'that'], [1, 'leadership'], [1, 'even'], [1, 'are'], [1, 'from'], [1, 'dilemma'], [1, 'trump'], [1, 'party'], [1, 'fully'], [1, 'inclinations'], [1, 'worst'], [1, 'frustrate'], [1, 'opposition'], [1, 'counsel'], [1, 'which'], [1, 'hellbent'], [1, 'bitterly'], [1, 'and'], [1, 'its'], [1, 'diligently'], [1, 'lose'], [1, 'within'], [1, 'country'], [2, 'of'], [1, 'facing'], [1, 'well'], [1, 'test'], [2, 'a'], [1, 'in'], [1, 'agenda'], [1, 'senior'], [1, 'special'], [1, 'looms'], [1, 'american'], [2, 'not'], [1, 'working'], [1, 'on'], [3, 'to'], [1, 'mr'], [1, 'large'], [5, 'the'], [1, 'president'], [1, 'trumps'], [1, 'downfall'], [1, 'might'], [1, 'administration'], [1, 'an'], [1, 'leader'], [1, 'just'], [1, 'any'], [1, 'presidency']]
#We may want to print out the frequency pairs sorted by frequency, like this:
for word in sorted(count_and_word, key=lambda x: -x[0]):
    print(word)
[6, 'his'] [5, 'the'] [4, 'that'] [3, 'is'] [3, 'to'] [2, 'or'] [2, 'of'] [2, 'a'] [2, 'not'] [1, 'unlike'] [1, 'over'] [1, 'does'] [1, 'own'] [1, 'divided'] [1, 'he'] [1, 'officials'] [1, 'house'] [1, 'grasp'] [1, 'faced'] [1, 'many'] [1, 'parts'] [1, 'modern'] [1, 'by'] [1, 'leadership'] [1, 'even'] [1, 'are'] [1, 'from'] [1, 'dilemma'] [1, 'trump'] [1, 'party'] [1, 'fully'] [1, 'inclinations'] [1, 'worst'] [1, 'frustrate'] [1, 'opposition'] [1, 'counsel'] [1, 'which'] [1, 'hellbent'] [1, 'bitterly'] [1, 'and'] [1, 'its'] [1, 'diligently'] [1, 'lose'] [1, 'within'] [1, 'country'] [1, 'facing'] [1, 'well'] [1, 'test'] [1, 'in'] [1, 'agenda'] [1, 'senior'] [1, 'special'] [1, 'looms'] [1, 'american'] [1, 'working'] [1, 'on'] [1, 'mr'] [1, 'large'] [1, 'president'] [1, 'trumps'] [1, 'downfall'] [1, 'might'] [1, 'administration'] [1, 'an'] [1, 'leader'] [1, 'just'] [1, 'any'] [1, 'presidency']
#Or, maybe we want to print them out sorted alphabetically by word...
for word in sorted(count_and_word, key=lambda x: x[1]):
    print(word)
[2, 'a'] [1, 'administration'] [1, 'agenda'] [1, 'american'] [1, 'an'] [1, 'and'] [1, 'any'] [1, 'are'] [1, 'bitterly'] [1, 'by'] [1, 'counsel'] [1, 'country'] [1, 'dilemma'] [1, 'diligently'] [1, 'divided'] [1, 'does'] [1, 'downfall'] [1, 'even'] [1, 'faced'] [1, 'facing'] [1, 'from'] [1, 'frustrate'] [1, 'fully'] [1, 'grasp'] [1, 'he'] [1, 'hellbent'] [6, 'his'] [1, 'house'] [1, 'in'] [1, 'inclinations'] [3, 'is'] [1, 'its'] [1, 'just'] [1, 'large'] [1, 'leader'] [1, 'leadership'] [1, 'looms'] [1, 'lose'] [1, 'many'] [1, 'might'] [1, 'modern'] [1, 'mr'] [2, 'not'] [2, 'of'] [1, 'officials'] [1, 'on'] [1, 'opposition'] [2, 'or'] [1, 'over'] [1, 'own'] [1, 'parts'] [1, 'party'] [1, 'presidency'] [1, 'president'] [1, 'senior'] [1, 'special'] [1, 'test'] [4, 'that'] [5, 'the'] [3, 'to'] [1, 'trump'] [1, 'trumps'] [1, 'unlike'] [1, 'well'] [1, 'which'] [1, 'within'] [1, 'working'] [1, 'worst']
# In each of these cases you might have noticed that the "key" is telling the sort function what matters to you for sorting.
# Let's do it one more time, sorting on the basis of word length:
for word in sorted(count_and_word, key=lambda x: len(x[1])):
    print(word)
[2, 'a'] [1, 'he'] [2, 'or'] [3, 'is'] [1, 'by'] [2, 'of'] [1, 'in'] [1, 'on'] [3, 'to'] [1, 'mr'] [1, 'an'] [1, 'own'] [6, 'his'] [1, 'are'] [1, 'and'] [1, 'its'] [2, 'not'] [5, 'the'] [1, 'any'] [1, 'over'] [1, 'does'] [1, 'many'] [4, 'that'] [1, 'even'] [1, 'from'] [1, 'lose'] [1, 'well'] [1, 'test'] [1, 'just'] [1, 'house'] [1, 'grasp'] [1, 'faced'] [1, 'parts'] [1, 'trump'] [1, 'party'] [1, 'fully'] [1, 'worst'] [1, 'which'] [1, 'looms'] [1, 'large'] [1, 'might'] [1, 'unlike'] [1, 'modern'] [1, 'within'] [1, 'facing'] [1, 'agenda'] [1, 'senior'] [1, 'trumps'] [1, 'leader'] [1, 'divided'] [1, 'dilemma'] [1, 'counsel'] [1, 'country'] [1, 'special'] [1, 'working'] [1, 'hellbent'] [1, 'bitterly'] [1, 'american'] [1, 'downfall'] [1, 'officials'] [1, 'frustrate'] [1, 'president'] [1, 'leadership'] [1, 'opposition'] [1, 'diligently'] [1, 'presidency'] [1, 'inclinations'] [1, 'administration']
## Okay, one more time (seriously), but this time big words come first:
for word in sorted(count_and_word, key=lambda x: -len(x[1])):
    print(word)
[1, 'administration'] [1, 'inclinations'] [1, 'leadership'] [1, 'opposition'] [1, 'diligently'] [1, 'presidency'] [1, 'officials'] [1, 'frustrate'] [1, 'president'] [1, 'hellbent'] [1, 'bitterly'] [1, 'american'] [1, 'downfall'] [1, 'divided'] [1, 'dilemma'] [1, 'counsel'] [1, 'country'] [1, 'special'] [1, 'working'] [1, 'unlike'] [1, 'modern'] [1, 'within'] [1, 'facing'] [1, 'agenda'] [1, 'senior'] [1, 'trumps'] [1, 'leader'] [1, 'house'] [1, 'grasp'] [1, 'faced'] [1, 'parts'] [1, 'trump'] [1, 'party'] [1, 'fully'] [1, 'worst'] [1, 'which'] [1, 'looms'] [1, 'large'] [1, 'might'] [1, 'over'] [1, 'does'] [1, 'many'] [4, 'that'] [1, 'even'] [1, 'from'] [1, 'lose'] [1, 'well'] [1, 'test'] [1, 'just'] [1, 'own'] [6, 'his'] [1, 'are'] [1, 'and'] [1, 'its'] [2, 'not'] [5, 'the'] [1, 'any'] [1, 'he'] [2, 'or'] [3, 'is'] [1, 'by'] [2, 'of'] [1, 'in'] [1, 'on'] [3, 'to'] [1, 'mr'] [1, 'an'] [2, 'a']
# Let's do it once more, sorting first on word length; if two words have the same length, we additionally sort them alphabetically.
for word in sorted(count_and_word, key=lambda x: (len(x[1]), x[1])):
    print(word)
[2, 'a'] [1, 'an'] [1, 'by'] [1, 'he'] [1, 'in'] [3, 'is'] [1, 'mr'] [2, 'of'] [1, 'on'] [2, 'or'] [3, 'to'] [1, 'and'] [1, 'any'] [1, 'are'] [6, 'his'] [1, 'its'] [2, 'not'] [1, 'own'] [5, 'the'] [1, 'does'] [1, 'even'] [1, 'from'] [1, 'just'] [1, 'lose'] [1, 'many'] [1, 'over'] [1, 'test'] [4, 'that'] [1, 'well'] [1, 'faced'] [1, 'fully'] [1, 'grasp'] [1, 'house'] [1, 'large'] [1, 'looms'] [1, 'might'] [1, 'parts'] [1, 'party'] [1, 'trump'] [1, 'which'] [1, 'worst'] [1, 'agenda'] [1, 'facing'] [1, 'leader'] [1, 'modern'] [1, 'senior'] [1, 'trumps'] [1, 'unlike'] [1, 'within'] [1, 'counsel'] [1, 'country'] [1, 'dilemma'] [1, 'divided'] [1, 'special'] [1, 'working'] [1, 'american'] [1, 'bitterly'] [1, 'downfall'] [1, 'hellbent'] [1, 'frustrate'] [1, 'officials'] [1, 'president'] [1, 'diligently'] [1, 'leadership'] [1, 'opposition'] [1, 'presidency'] [1, 'inclinations'] [1, 'administration']
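A side note on those key functions: negating the key (as in -x[0]) only works for numbers. sorted also accepts reverse=True, which flips the order for any key, including strings; a sketch on a few of the pairs above:

```python
pairs = [[2, 'or'], [6, 'his'], [1, 'he']]
# Descending by count, equivalent to key=lambda x: -x[0]:
print(sorted(pairs, key=lambda x: x[0], reverse=True))
# reverse=True also handles cases negation can't, e.g. reverse-alphabetical:
print(sorted(pairs, key=lambda x: x[1], reverse=True))
```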
# Maybe, instead of raw count, we want to know what *percentage* of the document is constituted by a given word...
# To do that we need the total number of words in the document...
total_number_words = len(words)
# And then we just divide each count by the total in order to get the percentage...
for pair in count_and_word:
    pair[0] = pair[0]/total_number_words
count_and_word
[[0.011363636363636364, 'unlike'], [0.011363636363636364, 'over'], [0.011363636363636364, 'does'], [0.011363636363636364, 'own'], [0.011363636363636364, 'divided'], [0.011363636363636364, 'he'], [0.011363636363636364, 'officials'], [0.011363636363636364, 'house'], [0.011363636363636364, 'grasp'], [0.011363636363636364, 'faced'], [0.022727272727272728, 'or'], [0.011363636363636364, 'many'], [0.011363636363636364, 'parts'], [0.06818181818181818, 'his'], [0.011363636363636364, 'modern'], [0.03409090909090909, 'is'], [0.011363636363636364, 'by'], [0.045454545454545456, 'that'], [0.011363636363636364, 'leadership'], [0.011363636363636364, 'even'], [0.011363636363636364, 'are'], [0.011363636363636364, 'from'], [0.011363636363636364, 'dilemma'], [0.011363636363636364, 'trump'], [0.011363636363636364, 'party'], [0.011363636363636364, 'fully'], [0.011363636363636364, 'inclinations'], [0.011363636363636364, 'worst'], [0.011363636363636364, 'frustrate'], [0.011363636363636364, 'opposition'], [0.011363636363636364, 'counsel'], [0.011363636363636364, 'which'], [0.011363636363636364, 'hellbent'], [0.011363636363636364, 'bitterly'], [0.011363636363636364, 'and'], [0.011363636363636364, 'its'], [0.011363636363636364, 'diligently'], [0.011363636363636364, 'lose'], [0.011363636363636364, 'within'], [0.011363636363636364, 'country'], [0.022727272727272728, 'of'], [0.011363636363636364, 'facing'], [0.011363636363636364, 'well'], [0.011363636363636364, 'test'], [0.022727272727272728, 'a'], [0.011363636363636364, 'in'], [0.011363636363636364, 'agenda'], [0.011363636363636364, 'senior'], [0.011363636363636364, 'special'], [0.011363636363636364, 'looms'], [0.011363636363636364, 'american'], [0.022727272727272728, 'not'], [0.011363636363636364, 'working'], [0.011363636363636364, 'on'], [0.03409090909090909, 'to'], [0.011363636363636364, 'mr'], [0.011363636363636364, 'large'], [0.056818181818181816, 'the'], [0.011363636363636364, 'president'], [0.011363636363636364, 'trumps'], 
[0.011363636363636364, 'downfall'], [0.011363636363636364, 'might'], [0.011363636363636364, 'administration'], [0.011363636363636364, 'an'], [0.011363636363636364, 'leader'], [0.011363636363636364, 'just'], [0.011363636363636364, 'any'], [0.011363636363636364, 'presidency']]
## As a consistency check we should make sure that all of these percentages add up to approximately 1.
## It might be off just a smidge because of rounding errors.
sum(x[0] for x in count_and_word)
0.9999999999999994
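Rather than eyeballing that 0.999..., we can let Python check the closeness for us with math.isclose (shown here on hypothetical percentages with a small rounding error, not the notebook's actual values):

```python
import math

percentages = [0.5, 0.25, 0.25000000000000006]  # hypothetical values
total = sum(percentages)
print(math.isclose(total, 1.0))   # True: within floating-point tolerance of 1
```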

Exercise

Define a variable to contain a sentence with commas and a period (and possibly other punctuation) and with standard capitalization. For instance:

sentence = "Happy families are all alike; every unhappy family is unhappy in its own way."

Then process the sentence to arrive at a list of the constituent words, all lowercase, with no whitespace and no punctuation.

Exercise

Below is a list of email addresses of students in the class. I want to use this to make a group for bulk email using Outlook Webmail. Unfortunately Outlook requires that email addresses be separated by a semicolon when they are entered in bulk. Can you use Python to make a semicolon-separated list of email addresses?

[email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected] [email protected]