Shared3-GettingStartedwithProgramming.htmlOpen in CoCalc
Jupyter html version of 3-GettingStartedwithProgramming.ipynb
3-GettingStartedwithProgramming

Mathematics and Computing Getting Started with Programming </h1>

Dr. Jeff Knisley

In the first notebook, we introduced the Python language and used it to explore how mathematics can be implemented and advanced via algorithms. In this notebook, we begin to explore the relationship between algorithms and programs, as most people tend to think of "programming a computer" rather than "equipping a computer with algorithms."

The difference is both subtle and obvious. An algorithm is a sequence of steps designed to solve a problem, or equivalently, there are mathematical problems (e.g., estimating the square root of a positive number) whose solution is in the form of a sequence of steps. What is obvious is that a program is an implementation of an algorithm -- usually on a machine, most often a computer. What is subtle is that a program includes both the algorithm and the types of objects the algorithm is applied to.

This subtle yet obvious relationship is analogous to a cooking recipe. Although we think of the recipe as the sequence of steps that leads to the desired food dish, the types (and amounts) of ingredients at the top of the recipe is as necessary as the sequence of steps. Of course, substitutions are possible, and some ingredients ( a teaspoon of nutmeg ) are more specific than others (season to taste with some type of salt).

Similarly, we must convert algorithms into recipes implemented on a computer, which we tend to call programs. Let's first configure the notebook and then we'll explore in more detail the types of ingredients (data structures) included in Python.

Note: We are using Python 3 for this assignment, so no need to import from the __future__ (Python 3 is the future!)

Strings¶

As we saw in the first notebook, a string is a special type of list, one that is made solely of alphanumeric characters (and punctuation). (Please review first notebook if you do not remember what a string is). Strings are defined by enclosure in either single quotes ' ', double quotes " ... ", or triple quotes, ''' ... ''' or """ ... """.

Characters in a string are accessed using an index inside square brackets, beginning at 0. Thus, if

MyString = ' Hello World '

then MyString[0] returns the string ' ', which is the whitespace character that begins the string. To further explore this example, let's notice that len(MyString) is 22, and that we can index each character as follows:

                               1         2
0123456789012345678901
MyString = '     Hello World!     '



Thus, MyString[5] returns the string 'H', and MyString[5:10] returns the string "Hello".

In [1]:
MyString = '     Hello World!     '
MyString

Out[1]:
'     Hello World!     '
In [2]:
MyString[0]

Out[2]:
' '
In [3]:
MyString[5]

Out[3]:
'H'
In [4]:
MyString[5:10]

Out[4]:
'Hello'

There are also a number of methods associated with strings. If a string is bound to a name such as MyString, then

dir(MyString)

produces a list of the methods that can be applied to the string.

In [5]:
dir(MyString)

Out[5]:
['__add__',
'__class__',
'__contains__',
'__delattr__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__getitem__',
'__getnewargs__',
'__gt__',
'__hash__',
'__init__',
'__iter__',
'__le__',
'__len__',
'__lt__',
'__mod__',
'__mul__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__rmod__',
'__rmul__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'capitalize',
'casefold',
'center',
'count',
'encode',
'endswith',
'expandtabs',
'find',
'format',
'format_map',
'index',
'isalnum',
'isalpha',
'isdecimal',
'isdigit',
'isidentifier',
'islower',
'isnumeric',
'isprintable',
'isspace',
'istitle',
'isupper',
'join',
'ljust',
'lower',
'lstrip',
'maketrans',
'partition',
'replace',
'rfind',
'rindex',
'rjust',
'rpartition',
'rsplit',
'rstrip',
'split',
'splitlines',
'startswith',
'strip',
'swapcase',
'title',
'translate',
'upper',
'zfill']

For example, we can strip the spaces off of either end of MyString via MyString.strip().

In [6]:
MyString.strip()  # removes leading and trailing spaces

Out[6]:
'Hello World!'

The result does not change MyString, but instead creates a copy of the string with the leading and trailing spaces removed. Likewise, the methods attached to the MyString object produce a new string object, while the original MyString object remains unchanged.

In [7]:
MyString.upper()

Out[7]:
'     HELLO WORLD!     '
In [8]:
MyString.find('lo')

Out[8]:
8
In [9]:
MyString.split(' ')  #  ' '  is a space character

Out[9]:
['', '', '', '', '', 'Hello', 'World!', '', '', '', '', '']
In [10]:
MyString.replace("World","Universe")

Out[10]:
'     Hello Universe!     '

Casting and Comprehension¶

Obviously, a string is not an integer, and a float (decimal) is not a string. Likewise, a list is not a tuple and a tuple is not a string and so on. That is, each object has its own type, which can be revealed using the type command.

In [11]:
type("Hello World")

Out[11]:
str
In [12]:
type( 1.2345 )

Out[12]:
float
In [13]:
type( [1,2,3,4] )

Out[13]:
list

Casting an object is the process of creating a new object from the old but with a different type. For example, we may want to use a tuple as a list, or we might want to use a decimal as a string. To do so, we simply apply the command for that type.

In [14]:
str(1.2345)

Out[14]:
'1.2345'
In [15]:
list(  (1,2,3,4) )

Out[15]:
[1, 2, 3, 4]

It is important to note casting creates a new object. For example, below we bind the number 1.2345 to a and then cast it to a string, which we call b. This does not change the object bound to a!

In [16]:
a = 1.2345

In [17]:
b = str( a )

In [18]:
a

Out[18]:
1.2345
In [19]:
type(a)

Out[19]:
float
In [20]:
b

Out[20]:
'1.2345'
In [21]:
type(b)

Out[21]:
str

In contrast, comprehension is the concept of creating a new object using the content of a given object. That is, whereas casting an object is a new object with a different, comprehension is a new object with modified contents.

For example, list comprehension is the act of making a new list from any iterable object in the form

[ f(item) for item in IterableObject ]

where f(item) is some function or manipulation that is applied sequentially to each item in the object.

For example, the next few cells change a number into a list of its digits.

In [22]:
number = 1235

In [23]:
[ digit for digit in str(number) ]

Out[23]:
['1', '2', '3', '5']

That is, we cast a number into a string using str(number), and then we use comprehension to create a list from the individual letters in the string.

Moreover, if we so desired, we could convert each digit string into a numerical digit by casting via the command int.

In [24]:
[ int(digit) for digit in str(number) ]

Out[24]:
[1, 2, 3, 5]

To see how useful list comprehension is, let's recall that dir(MyString) produced a huge list of properties and methods attached to MyString, but in general, we are not interested in the ones that begin with an underscore _ because they tend to be used internally in Python rather than by programmers.

In [25]:
dir(MyString)

Out[25]:
['__add__',
'__class__',
'__contains__',
'__delattr__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__getitem__',
'__getnewargs__',
'__gt__',
'__hash__',
'__init__',
'__iter__',
'__le__',
'__len__',
'__lt__',
'__mod__',
'__mul__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__rmod__',
'__rmul__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'capitalize',
'casefold',
'center',
'count',
'encode',
'endswith',
'expandtabs',
'find',
'format',
'format_map',
'index',
'isalnum',
'isalpha',
'isdecimal',
'isdigit',
'isidentifier',
'islower',
'isnumeric',
'isprintable',
'isspace',
'istitle',
'isupper',
'join',
'ljust',
'lower',
'lstrip',
'maketrans',
'partition',
'replace',
'rfind',
'rindex',
'rjust',
'rpartition',
'rsplit',
'rstrip',
'split',
'splitlines',
'startswith',
'strip',
'swapcase',
'title',
'translate',
'upper',
'zfill']

We can use list comprehension, however, to obtain only the names we are interested in! To do so, we add a conditional statement which says that we want names that do not begin with a _, which in Python is

name[0] != '_'

where != is Python for "not the same as".

In [26]:
[ name for name in dir(MyString) if name[0] != '_'   ]

Out[26]:
['capitalize',
'casefold',
'center',
'count',
'encode',
'endswith',
'expandtabs',
'find',
'format',
'format_map',
'index',
'isalnum',
'isalpha',
'isdecimal',
'isdigit',
'isidentifier',
'islower',
'isnumeric',
'isprintable',
'isspace',
'istitle',
'isupper',
'join',
'ljust',
'lower',
'lstrip',
'maketrans',
'partition',
'replace',
'rfind',
'rindex',
'rjust',
'rpartition',
'rsplit',
'rstrip',
'split',
'splitlines',
'startswith',
'strip',
'swapcase',
'title',
'translate',
'upper',
'zfill']

In fact, we can include only those names that begin with a given first letter. For example,

name[0] == 's'

means we only want names associated with MyString that begin with the letter 's'.

IMPORTANT NOTE!!!!! The logical concept of equality as "the same as" is denoted by 2 equal signs, ==.

This is because one equal sign alone is used to assign values to a name.

In [27]:
[ name for name in dir(MyString) if name[0] == 's'   ]

Out[27]:
['split', 'splitlines', 'startswith', 'strip', 'swapcase']

Casting and comprehension are extremely useful, so much so that new features of Python and even entirely new langages (e.g., Julia) are based on the importance of types, casting, and comprehension.

Dictionaries¶

In addition to lists and tuples, a very useful type of object is the dictionary object. In particular, the command dict creates a Python dictionary, which is a set (order does matter) of key-value pairs of the form

{ key1:value1, key2:value2, … }

For example, dict( a = 5, b=7, c=3 ) creates the dictionary object

{ 'a': 5, 'b': 7, 'c':3 }

Below are additional examples of dictionary uses in Python.

In [28]:
MyDictionary = dict( a = 5, b=7, c=3 ) # bind to the name MyDictionary
MyDictionary

Out[28]:
{'a': 5, 'b': 7, 'c': 3}
In [29]:
MyDictionary['a']

Out[29]:
5
In [30]:
MyDictionary['c']

Out[30]:
3
In [31]:
MyDictionary['b'] = 11
MyDictionary

Out[31]:
{'a': 5, 'b': 11, 'c': 3}

We can also get the keys and values from a dictionary separately.

In [32]:
MyDictionary.keys() #returns the keys for the dictionary

Out[32]:
dict_keys(['c', 'a', 'b'])
In [33]:
MyDictionary.values() #returns the keys for the dictionary

Out[33]:
dict_values([3, 5, 11])

Alternately, a dictionary can be defined as a set of key-value pairs. However, notice also that like a set, the order in which the values are entered is not preserved. Instead, Python chooses the order, but all the key-value pairs are maintained !!

In [34]:
{ 'a':2, 'c':7, 11:14 }

Out[34]:
{'c': 7, 'a': 2, 11: 14}
In [35]:
_['a']

Out[35]:
2

Dictionaries are also quite useful, as we are about to discover.

Putting Python to work¶

We could continue in this fashion indefinitely, but ultimately, it distracts from the reason that computers languages exist -- namely, because people want computers to do things that they and others will find useful and exciting!

Instead, as the experiences of programmers and scientists the world over has shown repeatedly, one of the better ways to learn a new language is to

1. have a problem to solve and

2. hack a similar approach until it works for us (where "hack" means modifying until suitable for the task at hand).

In fact, our entire usage of Python in this course is motivated by

1. A Mathematical problem to solve that requires a computational approach.

2. An approach to the problem, which you the student may be asked to "hack" in some straightforward fashion.

Thus, we will motivate our initial foray into programming with Python via a problem to be solved. Our problem is to use computers to analyze strings. Sometimes, this is called text mining, which broadly is the problem of

__ _counting the number of occurences of a given pattern in a string._ __

For example, this is one of the most important computational problems in genomics, because a DNA molecule is nothing more than a (typically very very long) string which only includes the letters A, C, G, and T, which represent base pairs.

In [36]:
DNAsnippet = "AGTCGAATCGACAAAATCG"


Although DNA molecules tend to have millions or billions of base pairs, we should always remember one of the fundamental principles of programming:

Use a full range of examples to TEST the implementation of an algorithm as a program .

Thus, we start small with a string that will test our program.

Patterns are Regular Expressions¶

In the text mining mindset, a pattern is a special type of string known as a regular expression. Just as $\LaTeX$ has over many decades of development become the lingua franca of textual scientific communication, so also many decades of development have produced the regular expressions format for finding patterns in (often extremely large) strings. In Python, this means importing the re package.

In [37]:
import re


Regular expressions can themselves solve most simple pattern matching problems. We will see this using the findall method of re, which does exactly what the name implies.

In [38]:
re.findall('pattern','String with one pattern or many patterns')

Out[38]:
['pattern', 'pattern']

Thus, we can now "find all" occurences of a given pattern in a given string, such as our DNAsnippet.

In [39]:
re.findall("CG", DNAsnippet)

Out[39]:
['CG', 'CG', 'CG']
In [40]:
re.findall("TA", DNAsnippet)

Out[40]:
[]

So clearly, the pattern CG occurs twice in DNAsnippet, whereas the pattern TA does not occur at all in

DNAsnippet = "AGTCGAATCGACAAAATCG"

However, we can do much more. The reason patterns are called regular expressions is that they can find a variety of patterns, such as all the 3 letter substrings that begin with the letter A. To do so, we place a period (".") in a place where any letter can occur (that is, the period is a wild card character -- whereas to search for a period, we use "\.").

In [41]:
re.findall("A..", DNAsnippet)

Out[41]:
['AGT', 'AAT', 'ACA', 'AAA']

Other characters also have special meaning. For example, the + sign means "one or more" of whatever it follows. Thus, the pattern 'A+' means one or more repetitions of the letter A.

In [42]:
re.findall("A+", DNAsnippet)

Out[42]:
['A', 'AA', 'A', 'AAAA']

To find 2 or more repetitions of the letter A, we simply use "AA+", which means the letter A followed by one or more repetitions of the letter A.

In [43]:
re.findall("AA+",DNAsnippet)

Out[43]:
['AA', 'AAAA']

Indeed, there are many tutorials available online.

Exercise 1¶

The shortest viral genomes are those in the virus family </i>Circoviridae</i>. The string below is the complete genome of the Porcine Cirocovirus and is 1,767 basepairs (i.e., letters) in length.

In [44]:
PorcineCirocovirusGenome = "ACCAGCGCACTTCGGCAGCGGCAGCACCTCGGCAGCACCTCAGCAGCAACATGCCCAGCAAGAAGAGTGGA"+\
"AGAAGCGGACCCCAACCACATAAAAGGTGGGTGTTCACGCTGAATAATCCTTCCGAAGACGAGCGCAAGAAAATACGGGAGCTCCCAATCTCCCTATT"+\
"TGATTATTTTATTGTTGGCGAGGAAGGTAATGAGGAGGGCCGAACACCTCACCTACAGGGGTTCGCTAATTTTGTGAAGAAGCAAACTTTTAATAAAG"+\
"TGAAGTGGTATTTTGGTGCCCGCTGCCACATCGAGAAAGCGAAAGGAACAGATCAGCAGAATAAAGAATATTGCAGTAAAGAAGGCAACTTACTGATA"+\
"GAATGTGGAGCTCCTAGATCTCAAGGACAACGGAGTGACCTCTCTACTGCTGTGAGTACCTTGTTGGAGAGCGGGAGTCTGGTGACCGTTGCAGAGCA"+\
"GCACCCTGTAACGTTTGTCAGAAATTTCCGCGGGCTGGCTGAACTTTTGAAAGTGAGCGGGAAAATGCAGAAGCGTGATTGGAAGACGAATGTACACG"+\
"TCATTGTGGGGCCACCTGGGTGTGGCAAAAGCAAATGGGCTGCTAATTTTGCAGACCCGGAAACCACATACTGGAAACCACCTAGAAACAAGTGGTGG"+\
"GATGGTTACCATGGTGAAGAAGTGGTTGTTATTGATGACTTTTATGGCTGGCTGCCGTGGGATGATCTACTGAGACTCTGTGATCGATATCCTTTGAC"+\
"TGTTGAGACTAAAGGTGGAACTGTACCTTTTTTGGCCCGCAGTATTCTGATTACCAGCAATCAGACCCCGTTGGAATGGTACTCCTCAACTGCTGTCC"+\
"CAGCTGTAGAAGCTCTCTATCGGAGGATTACTTCCTTGGTATTTTGGAAGAATGCTACAGAACAATCCACGGAGGAAGGGGGCCAGTTCGTCACCCTT"+\
"TCCCCCCCATGCCCTGAATTTCCATATGAAATAAATTACTGAGTCTTTTTTATCACTTCGTAATGGTTTTTATTATTCACTTAGGGTTAAGTGGGGGG"+\
"TCTTTAAGATTAAATTCTCTGAATTGTACATACATGGTTATACGGATATTGTAGTCCTGGTCGTATATACTGTTTTCGAATGCAGTGCCGAGGCCTAC"+\
"ATGGTCTACATTTCCAGTAGTTTGTAGTCTCAGCCAGAGTTGATTTCTTTTGTTATTGGGTTGGAAGTAATCGATTGTCCTATCAAGGACAGGTTTCG"+\
"GGGTAAAGTACCGGGAGTGGTAGGAGAAGGGCTGGGTTATGGTATGGCGGGAGGAGTAGTTTACATAGGGGTCATAGGTTAGGGCATTGGCCTTTGTT"+\
"ACAAAGTTATCATCTAGAATAACAGCAGTGGAGCCCACTCCCCTGTCACCCTGGGTGATTGGGGAGCAGGGCCAGAATTCAACCTTAACCTTTCTTAT"+\
"TCTGTAGTATTCAAAGGGCACAGTGAGGGGGTTTGAGCCCCCTCCTGGGGGAAGAAAATCATTAATATTAAATCTCATCATGTCCACATTCCAGGAGG"+\
"GCGTTCTGACTGTGGTTTTCTTGACAGTATAACCGATGGTGCGGGAGAGGCGGGTGTTGAAAATGCCATTTTTCCTTCTCCAGCGGTAACGGTGGCGG"+\
"GGGTGGACGAGCCAGGGGCGGCGGCGGAGGATCTGGCTAAGATGGCTGCGGGGGCGGTGTCTTCGTCTGCGGAAACGCCTCCTTGGATACGTCATCGC"+\
"TGAAAACGAAAGAAGTGCGCTGTAAGTATT"


Enter the pattern which would allow you to find all occurences in the PorcineCirocovirusGenome in which there is a single letter between two occurences of the letter A.

In [45]:
re.findall("A.A",PorcineCirocovirusGenome)

Out[45]:
['ACA',
'AGA',
'AGA',
'AGA',
'ACA',
'AAA',
'ATA',
'AGA',
'AGA',
'AAA',
'ACA',
'ACA',
'AGA',
'AAA',
'ATA',
'ACA',
'AGA',
'AAA',
'ACA',
'AGA',
'ATA',
'AGA',
'ATA',
'AAA',
'ATA',
'AGA',
'ACA',
'AGA',
'AGA',
'AGA',
'AAA',
'AAA',
'AGA',
'AGA',
'ACA',
'AAA',
'AAA',
'AGA',
'AAA',
'ACA',
'AAA',
'AGA',
'ACA',
'AGA',
'AGA',
'ATA',
'AGA',
'AAA',
'AGA',
'AGA',
'AGA',
'ACA',
'ACA',
'ATA',
'AAA',
'AAA',
'AGA',
'AAA',
'ACA',
'ACA',
'ATA',
'ATA',
'ATA',
'ACA',
'ACA',
'AGA',
'ACA',
'AAA',
'AGA',
'ACA',
'ATA',
'ACA',
'AGA',
'ATA',
'ACA',
'AGA',
'AAA',
'ACA',
'AGA',
'AAA',
'ATA',
'AAA',
'ACA',
'ACA',
'ATA',
'AGA',
'AAA',
'AGA',
'AAA',
'ATA',
'AAA',
'AAA']

The shortest non-viral genome is more than 500,000 letters (basepairs) long (for a certain bacterium), so we won't go in that direction. Instead, we will explore some simple text mining applications.

A Simple Text Mining Application¶

In some sense, regular expressions make all basic problems rather easy. Thus, let's move on to a slightly more general problem -- one that is performed daily on millions of web pages by hundreds of different companies and organizations (e.g., Google).

The Problem: Given a document as a list of words, count the number of words in the document that satisfy a given pattern.

We will solve this problem using Python dictionaries, because key-value pairs are ideal for counting the number of times a pattern occurs in a string. For example, execute the cell belows:

In [46]:
TargetString = "There is a tree near the street."
PatternCounts = dict()

In [47]:
re.findall('ee',TargetString) #all occurences of the pattern 'ee'

Out[47]:
['ee', 'ee']
In [48]:
PatternCounts['ee'] = len(re.findall('ee',TargetString))

PatternCounts

Out[48]:
{'ee': 2}
In [49]:
re.findall('aa', TargetString)  #all occurences of the pattern 'aa'

Out[49]:
[]
In [50]:
PatternCounts['aa'] = len(re.findall('aa',TargetString))

PatternCounts

Out[50]:
{'aa': 0, 'ee': 2}

However, we would not want to use findall to determine how many times the word 'tree' occurs in the TargetString, because it would not only find the word 'tree', but it would also find the occurence of 'tree' inside of the word street.

In [51]:
re.findall('tree',"There is a tree near the street.")

Out[51]:
['tree', 'tree']

Instead, we split the string at the spaces to create a list of words, after which we use a regular expression match object.

If, Else, and Match¶

The re.match command is used in the form

re.match('pattern',TargetString)

If the pattern matches the beginning of the TargetString, then a match object is returned. Otherwise, if there is no match, then the value returned is None.

In [52]:
re.match('tree','trees')

Out[52]:
<_sre.SRE_Match object; span=(0, 4), match='tree'>
In [53]:
re.match('tree','street')


We ofen use re.match in conjunction with a Python if statement, which is of the form

if condition_to_be_tested :

First Command to execute if condition is True

Second Commmand to execute if condition is True

Again, whitespace (i.e., indentation) is used to show what is to be done if the condition is True. Try it out below.

REMINDER Python is case sensitive -- True must have a capital T

In [55]:
condition = True  # Change to True to see what happens differently

if( condition) :
print("The condition variable now has the value True")

The condition variable now has the value True


We can also include an else statement to be executed when the condition is False. This leads to if statements in the form

if condition_to_be_tested :

First Command to execute if condition is True

Second Commmand to execute if condition is True

else:

First Command to execute if condition is False

Second Commmand to execute if condition is False

In [56]:
condition = False  # Change to True to see what happens differently

if( condition) :
print("The condition variable now has the value True")
else:
print("The condition has the value False")

The condition has the value False


When combined with re.match, it is possible to determine when two "words" begin with the same pattern.

In [57]:
if( re.match('tree','tree.') ):
print("True")
else:
print("False")

True


A None value produces no output. However, in logical terms, None has a value of False

In [58]:
if( re.match('tree','street') ):
print("True")
else:
print("False")

False


Thus, we need only split a document into individual words and count the number of matches of a given pattern.

Pattern Counting in Strings¶

With all these data types, the program (i.e., 'recipe') for counting patterns becomes rather straightforward. If given a Document and a list of strings as Patterns, do the following:

1. Convert the Document into a WordList.
2. Create a dictionary called PatternCounts whose keys are the Patterns and whose initial values are 0.
3. For each word in WordList
1. Compare to each pattern in PatternCounts .
2. If a match, then increase the value of that pattern by 1

The resulting dictionary will then be the counts of the occurences of the patterns in the document.

For example, suppose we have the poem "Trees" by Joyce Kilmer (an American poet and soldier who died in battle in WW1), which we load as a triple quote in the next cell.

In [59]:
TreesByJoyceKilmer = """\
I THINK that I shall never see
A poem lovely as a tree.

A tree whose hungry mouth is prest
Against the sweet earth's flowing breast;

A tree that looks at God all day,
And lifts her leafy arms to pray;

A tree that may in summer wear
A nest of robins in her hair;

Upon whose bosom snow has lain;
Who intimately lives with rain.

Poems are made by fools like me,
But only God can make a tree."""


The Joyce Kilmer Memorial Forest in Western North Carolina is an old growth forest containing many, many trees, but how many time does the word "tree" occur in the poem it memorializes? To answer that, we want to count the number of times the pattern 'tree' occurs as a word in the poem, once we have converted the poem to all lower case.

In [60]:
treesbyjoycekilmer = TreesByJoyceKilmer.lower()
treesbyjoycekilmer

Out[60]:
"i think that i shall never see\na poem lovely as a tree.\n  \na tree whose hungry mouth is prest \nagainst the sweet earth's flowing breast; \n  \na tree that looks at god all day,\nand lifts her leafy arms to pray; \n  \na tree that may in summer wear \na nest of robins in her hair; \n  \nupon whose bosom snow has lain; \nwho intimately lives with rain.\n  \npoems are made by fools like me, \nbut only god can make a tree."

Now we must convert the poem into a list of words, which we do using the string method split.

In [61]:
WordList = treesbyjoycekilmer.split()
WordList[:10]  #The first 10 words in the list

Out[61]:
['i', 'think', 'that', 'i', 'shall', 'never', 'see', 'a', 'poem', 'lovely']

Matching is not as easy as we may think, however, as revealed by the last 10 words in WordList.

In [62]:
WordList[-10:]  #The last 10 words in the list

Out[62]:
['fools', 'like', 'me,', 'but', 'only', 'god', 'can', 'make', 'a', 'tree.']

The problem is that the "tree." has a period in it, but re can handle it easily. The match method defines a match to be when a string is the same as a pattern (regular expression) for the length of the regular expression.

Thus, we can continue, trusting that re.match will handle any subtleties of matching. Next, we define a dictionary with the pattern "tree" (lower case since we converted the poem to lower case).

In [63]:
PatternCounts = { 'tree':0 }


Now it is simply a matter of checking each word in the WordList to see if it matches the pattern. To do so, we do the following:

1. for each word in the WordList
1. compare the word to each of the keys in PatternCounts
2. if a word matches a pattern, then increase the count for that pattern by 1

That is, the matching part of the algorithm requires the use of a Python if statement. Specifically, the if statement executes its commands if there is a match, but does not if None (i.e., no match) occurs.

Note: We repeat the PatternCounts statement since initialization is an important part of the algorithm.

In [64]:
PatternCounts = { 'tree':0 }

for word in WordList:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1
PatternCounts

Out[64]:
{'tree': 5}

The += operator changes the value of the object by increasing it by the increment, which in this case means adding 1 and thus increasing the count of the number of matches.

Exercise 2¶

The algorithm above is repeated below. Modify it so that it counts the number of times the word that occurs in the poem.

In [66]:
PatternCounts = { 'tree':0 }

for word in WordList:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1
PatternCounts

Out[66]:
{'tree': 5}
In [67]:
PatternCounts = { 'that':0 }

for word in WordList:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1
PatternCounts

Out[67]:
{'that': 3}

A Simple Text Mining Application¶

The algorithm we just developed applies not only to short poems, but also to much larger documents. For example, when you uploaded this notebook, you should have also uploaded a text file with the name

USconstitution.txt

If so, then the next cell will show it as one of the files in the files list.

In [68]:
ls

1-IntroducingTheJupyterNotebookKEY.ipynb  Assignment3Supplemental.py
2-GettingStartedwithPython (1).ipynb      GettingStarted/
3-GettingStartedwithProgramming.ipynb     USconstitution.txt


The next cell uses a Python context ( the with structure) to read the USconstitution into Python in as a triple quote. A context allows Python to do "cleanup" (e.g., close the file) if anything strange happens it is not anticipating.

In [69]:
with open('USconstitution.txt','r') as TheFile:
USconstitution[:100]

Out[69]:
'We the People of the United States, in Order to form a more perfect Union,\nestablish Justice, insure'

Let's first do some pre-processing. As you can see in the first 100 characters listed above, the constitution doesn't follow modern conventions for capitalization. Also, there are characters (e.g., \n is the newline commmand) that have been inserted that were not in the original. We thus reduce all to lower case and remove strange characters.

In [70]:
usconstitution = USconstitution.lower()
# replace anything not a letter or number by a space
constitution = re.sub('[^a-z0-9]',' ',usconstitution)
constitution[:100]

Out[70]:
'we the people of the united states  in order to form a more perfect union  establish justice  insure'

Now we transform the constitution into a list of words.

In [71]:
WordList = constitution.split()
WordList[:10]  #The first 10 words in the list

Out[71]:
['we', 'the', 'people', 'of', 'the', 'united', 'states', 'in', 'order', 'to']

And now we can search for patterns. There are probably not any "trees" in the constitution, but words like "president" and "congress" are important!

In [72]:
PatternCounts = { 'president':0, 'congress':0 }

for word in WordList:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1
PatternCounts

Out[72]:
{'congress': 60, 'president': 121}

The president gets more mention than congress, but there is an issue. The word "senate" also refers to congress.

In [73]:
PatternCounts = { 'president':0, 'congress':0, 'senate':0 }

for word in WordList:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1
PatternCounts

Out[73]:
{'congress': 60, 'president': 121, 'senate': 28}

Exercise 3¶

Copy the previous cell's code into the next cell and add more search terms to PatternCounts. The terms you add should be justification for how balanced the constitution is in its discussion of the executive, legislative, and judicial branches -- at least as evidenced by word counts from the constitution.

In [75]:
PatternCounts = { 'president':0, 'congress':0, 'senate':0, 'executive':0, 'legislative':0, 'judicial':0 }

for word in WordList:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1
PatternCounts

Out[75]:
{'congress': 60,
'executive': 13,
'judicial': 7,
'legislative': 1,
'president': 121,
'senate': 28}
In [ ]:


The Executive branch is discussed or brought up the most (13 times) followed by judicial (7 times), and then legislative is the least noted as it was only counted one time.

First Letter Distributions¶

Amazingly, armed with nothing more than regular expressions and our simple dictionary-based algorithm, we can do all sorts of "text mining" types of applications, including visualizations!

First, however, let's set up the notebook for inline plots and load the pyplot module with the name plt.

In [76]:
%matplotlib inline
from matplotlib import pyplot as plt    #Matlab like plotting


That is, once we have converted a document into words -- i.e., into individual 'strings' -- then we can count occurences of patterns we may be interested in.

For example, we could count the number of words that begin with a vowel, since re.match returns a match if pattern occurs at the beginning of a word.

In [77]:
PatternCounts = { 'a':0, 'e':0, 'i':0, 'o':0, 'u':0  }

Document = """Attention!  Attention! This is a test of the emergency
we would not have said this is a test of the emergency
notification system.  We would have said
'Attention! Attention! This is an emergency'.
This concludes our test of the emergency notification system."""

WordList = Document.lower().split()

for word in WordList:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1
PatternCounts

Out[77]:
{'a': 8, 'e': 5, 'i': 4, 'o': 4, 'u': 0}

More importantly, if given a list of numbers or a document containing numbers, we can count how often the first digit of a number in the list is a 1, 2, 3, 4, 5, 6, 7, 8, or 9, respectively.

For example, the list below contains distances in light years to the 300 Brightest Stars.

In [78]:
LightYearsTo300BrightestStars = [9,310,4,37,25,42,770,11,144,430,530,17,320,65,600,260,34,25,3000,350,78,
430,52,700,88,240,130,111,1300,101,820,81,590,124,840,1800,145,101,270,630,
82,420,105,180,80,500,180,430,126,66,96,220,61,97,200,720,127,170,47,93,360,
36,610,130,1400,690,75,570,78,1500,230,148,920,55,380,400,65,550,310,79,210,
670,460,77,84,84,200,3000,49,540,72,140,220,460,390,58,530,1300,400,165,89,
160,73,60,77,270,170,140,99,37,520,510,116,310,520,1100,310,460,170,88,440,
39,1300,59,82,89,150,360,360,110,570,160,35,24,430,77,330,63,40,980,600,240,
370,102,39,71,170,230,180,200,160,440,170,460,540,730,390,610,260,1600,215,
185,88,760,220,420,250,96,132,1800,83,124,147,480,820,205,530,340,2600,300,
2000,310,85,340,900,250,100,149,150,139,410,101,86,48,570,78,540,220,240,370,
420,44,340,230,220,53,26,225,86,127,151,45,510,108,62,600,99,285,185,101,290,
630,215,101,560,160,185,370,102,175,500,530,350,380,1100,190,325,81,72,155,
120,165,180,1300,440,900,57,185,50,500,73,135,1100,740,200,730,150,235,370,
475,116,47,210,64,495,27,140,420,260,125,135,880,19,118,390,117,82,2100,112,
12,1200,420,220,250,130,790,59,115,270,117,29,135,1900,370,97,155,290,130,155,105
]


We want to count based on the first digit of each number. That is, we want to use the numbers as strings of digits , and consequently, we use list comprehension to transform each number into a string.

In [79]:
NumbersAsStrings = [ str(number) for number in LightYearsTo300BrightestStars ]
NumbersAsStrings[:10]  #the first 10 numbers

Out[79]:
['9', '310', '4', '37', '25', '42', '770', '11', '144', '430']

Now we simply count using the same approach as above (we don't care about the 0 digit, so it is omitted).

In [80]:
PatternCounts = { '1':0, '2':0, '3':0, '4':0, '5':0,
'6':0, '7':0, '8':0, '9':0 }

for word in NumbersAsStrings:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1
PatternCounts

Out[80]:
{'1': 90,
'2': 44,
'3': 36,
'4': 31,
'5': 28,
'6': 17,
'7': 21,
'8': 21,
'9': 12}

Notice that the digit 1 occurs more frequently than the other digits. We can in fact visualize this using a column graph in pyplot.

In [81]:
numdigits = [1,2,3,4,5,6,7,8,9]
strdigits = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

Counts = [ PatternCounts[digit] for digit in strdigits ]

plt.bar( numdigits, Counts, align='center')
plt.xticks( numdigits ) # make sure all the digits are on the x-axis
plt.xlabel("Digits")
plt.ylabel("Counts");


That '1' occurs more often than '2', and '2' more often than '3', etcetera, is a well-known phenomenon known as Benford's law, which we will explore next.

Benford's Law¶

Benford's law says that in "naturally occurring" random numbers, the first digit of a number is 1 more often than it is 2, 3, 4, 5, 6, 7, 8, or 9. In particular, in base 10, the digit $d$ occurs with probability

$\Pr(d) = \log_{10}(d+1) - \log_{10}(d)$

.

In the Python numerical package numpy, the base 10 logarithm is log10(x)

In [82]:
import numpy

In [83]:
numpy.log10(1000)

Out[83]:
3.0

Since we will be using the logarithm rather often, let's import it itself so we don't always have to type "numpy" in front of it.

In [84]:
from numpy import log10


We can now use list comprehension to create the Benford's law distribution. In particular,

$Pr(d) = \log_{10}\left( \frac{d+1}d \right) = \log_{10}\left(1+\frac 1 d \right)$

Let's try it out (remember, that from the \_\_future\_\_ we need to have imported division).

In [85]:
[ log10(1+1/d) for d in range(1,10) ]

Out[85]:
[0.3010299956639812,
0.17609125905568124,
0.12493873660829992,
0.09691001300805642,
0.079181246047624818,
0.066946789630613221,
0.057991946977686733,
0.051152522447381291,
0.045757490560675143]

Now notice how similar this is to the PatternCounts.values() divided by the number of stars (300).

In [86]:
[ count/300 for count in Counts ]

Out[86]:
[0.3,
0.14666666666666667,
0.12,
0.10333333333333333,
0.09333333333333334,
0.056666666666666664,
0.07,
0.07,
0.04]

Not exactly, but the same type of falling off of frequency after the digit 1 which occurs about 30% of the time.

Benford's law is a phenomenon closely related to how numbers are represented in base 10 (or more generally, in base b). It is often "explained" in terms of how numerical values depend on units. In particular, "naturally occurring" random numbers tend to be associated with units like feet, inches, meters, kilograms, etcetera. A change of units implies a change in the numerical quantity. For example, 1 inch = 2.54 centimeters.

However, scaling a quantity implies a carry law. For example, multiplying by 2 is the same as adding a number to itself, and because of the possible final "carry" of 1, the digit 1 is more likely than any other digit to be the first digit of the result -- scaling 5.2 by 2 yields 10.4, which has a first digit of 1 because of the carry.

Let's look at another example, which are populations of the countries in the world in 2010 from largest to smallest.

In [87]:
CountriesPopulations =   [1360720000,1240380000,317541000,249866000,201032714,185649000,173615000,152518015,143657134,
127220000,118395054,99133000,89708900,86613986,85964400,80586000,77217000,76667864,67514000,65926261,
65820916,63705000,59917907,53259000,52981991,50219669,47464000,46704314,45439822,44928923,44354000,
40117096,38700000,38502396,37964000,35357000,35295770,34035000,33174000,30475144,30183400,30004000,
29994272,28946101,26494504,25500100,25235000,24895000,24658823,23700715,23382836,23373517,23202000,
21898000,21263403,20609294,20386799,20277597,20121641,17322796,17165000,17129076,16838100,16634603,
16363000,15679000,15438384,15302000,15135000,14580290,13567338,12973808,12825000,11296000,11184873,
11167325,10886500,10824200,10815197,10537222,10513800,10496000,10487289,10413211,10323000,10163000,
10027254,9906000,9639741,9477100,9468100,9445281,8555072,8501502,8264070,8134100,8112200,8104000,
7282041,7184000,7181505,7059653,6783374,6580800,6547800,6340000,6333000,6202000,6191155,6190280,
6071045,5663133,5627235,5450614,5415459,5399200,5240000,5096300,4822000,4667096,4616000,4593100,
4483800,4512900,4448000,4420549,4294000,4290612,3957000,3791622,3615086,3559500,3461041,3405813,
3286314,3065850,3017000,2944459,2821977,2931300,2711476,2113077,2074000,2062294,2061644,2045239,
2024904,2005200,1849000,1815606,1704000,1672000,1622000,1328019,1311870,1257900,1250000,1234571,
1066409,873000,865878,858038,840974,784894,743798,744160,620029,598200,581344,567000,537000,
534189,491875,416055,405739,392291,393162,351461,349728,325620,317280,294906,285000,268270,
264652,258958,237549,212645,187820,187356,180000,159358,150563,109000,106461,106405,103328,
103036,101484,101351,99000,90945,86295,84497,76098,71293,64237,63085,56483,56086,55519,55456,
54000,53883,48244,37429,36979,36942,36136,33540,31458,30001,29537,28502,23296,20901,14974,
13452,13135,11323,9945,8938,6081,4922,4000,2655,2563,2302,2072,1613,1411,839,596,56]


Now we convert the numbers into strings...

In [88]:
NumbersAsStrings = [ str(number) for number in CountriesPopulations ]
NumbersAsStrings[:5]  #first 5

Out[88]:
['1360720000', '1240380000', '317541000', '249866000', '201032714']

... and apply our algorithm to NumbersAsStrings.

In [89]:
PatternCounts = { '1':0, '2':0, '3':0, '4':0, '5':0,
'6':0, '7':0, '8':0, '9':0 }

for word in NumbersAsStrings:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1
PatternCounts

Out[89]:
{'1': 65,
'2': 45,
'3': 33,
'4': 22,
'5': 24,
'6': 17,
'7': 11,
'8': 18,
'9': 9}

We divide each value in PatternCounts by the total, thus producing a Frequency distribution.

This is shown in the column graph below:

In [90]:
numdigits = [1,2,3,4,5,6,7,8,9]
strdigits = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

TotalCount = len( NumbersAsStrings )
Frequencies = [ PatternCounts[digit]/TotalCount for digit in strdigits ]

plt.bar( numdigits, Frequencies, align='center')
plt.xticks( numdigits ) # make sure all the digits are on the x-axis
plt.xlabel("Digits")
plt.ylabel("Frequencies")
plt.title("Does resemble Benford's law distribution");


Benford is not Always true!¶

Benford's law does not apply to all "naturally occurring" random numbers, however. For example, heights of human males in units of decimal feet mostly have either a 5 or a 6 as their first digit, with a few 7's and a few 4's. Execute the following sequence of cells to see an example where Benford's law does not hold.

In [91]:
MaleHeightsInAClassIOnceTaught = [5.5 ,
5.25,  5.62,  5.49,  5.52,  5.62,  5.41,  4.78,  6.26,
5.76,  5.99,  5.82,  6.02,  5.21,  5.55,  5.47,  5.57,  5.46,
5.4 ,  5.36,  5.56,  5.81,  5.73,  5.19,  6.43,  6.24,  5.53,
5.47,  5.82,  5.5 ,  5.72,  6.09,  5.74,  5.57,  5.77,  5.76,
5.85,  5.27,  5.23,  5.49,  5.71,  5.07,  6.04,  5.68,  5.67,
5.69,  5.86,  5.56,  6.1 ,  5.84,  5.43,  5.67,  5.04,  5.91,
5.36,  5.99,  5.29,  6.94,  5.45,  5.24,  5.76,  5.85,  5.61,
5.64,  5.45,  5.07,  5.75,  5.9 ,  5.85,  5.29,  5.69,  6.13,
5.54,  5.83,  5.29,  5.82,  5.42,  6.21,  5.74,  5.44,  5.95,
5.57,  5.51,  5.39,  5.8 ,  5.55,  5.65,  5.71,  5.82,  5.56,
5.52,  5.59,  5.54,  5.56,  5.74,  5.7 ,  5.73,  5.57,  5.79,
5.68,  5.66,  5.97,  5.95,  5.3 ,  5.35,  5.57,  6.15,  5.23,
5.64,  6.01,  6.12,  5.46,  5.34,  5.46,  5.76,  5.37,  5.32,
5.47,  5.89,  5.52,  5.47,  5.51,  5.33,  5.51,  5.84,  5.67,
5.9 ,  5.61 ]

In [92]:
NumbersAsStrings = [ str(number) for number in MaleHeightsInAClassIOnceTaught ]
NumbersAsStrings[:5]  #first 5

Out[92]:
['5.5', '5.25', '5.62', '5.49', '5.52']
In [93]:
PatternCounts = { '1':0, '2':0, '3':0, '4':0, '5':0,
'6':0, '7':0, '8':0, '9':0 }

for word in NumbersAsStrings:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1
PatternCounts

Out[93]:
{'1': 0, '2': 0, '3': 0, '4': 1, '5': 114, '6': 13, '7': 0, '8': 0, '9': 0}
In [94]:
numdigits = [1,2,3,4,5,6,7,8,9]
strdigits = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

TotalCount = len( NumbersAsStrings )
Frequencies = [ PatternCounts[digit]/TotalCount for digit in strdigits ]

plt.bar( numdigits, Frequencies, align='center')
plt.xticks( numdigits ) # make sure all the digits are on the x-axis
plt.xlabel("Digits")
plt.ylabel("Frequencies")
plt.title("Male heights do not seem to satisfy Benford's law.");


Gaussian distributed numbers ( those whose distributions resemble a "bell curve" ) in general do not tend to satisfy Benford's law, nor does uniformly randomly generated numbers such as those produced by a random number generator.

In [95]:
UniformlyRandom = numpy.random.randint(0,100,1000)
UniformlyRandom[:20]

Out[95]:
array([64, 97, 49, 87, 11, 50, 50, 25, 41, 49, 30, 30, 79, 93, 77, 66, 27,
82, 60, 77])
In [96]:
NumbersAsStrings = [ str(number) for number in UniformlyRandom]
PatternCounts = { '1':0, '2':0, '3':0, '4':0, '5':0,
'6':0, '7':0, '8':0, '9':0 }

for word in NumbersAsStrings:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1

numdigits = [1,2,3,4,5,6,7,8,9]
strdigits = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

TotalCount = len( NumbersAsStrings )
Frequencies = [ PatternCounts[digit]/TotalCount for digit in strdigits ]

plt.bar( numdigits, Frequencies, align='center')
plt.xticks( numdigits ) # make sure all the digits are on the x-axis
plt.xlabel("Digits")
plt.ylabel("Frequencies")
plt.title("Uniformly Random numbers do not satisfy Benford's law");

In [97]:
GaussianRandom = numpy.random.normal(50,25,1000)
GaussianRandom[:20]

Out[97]:
array([  50.94841232,   61.49065243,   41.04927409,   54.21446354,
37.4745367 ,   55.29785662,   67.365609  ,   34.64523101,
2.62123397,   67.63679727,   84.25281332,   40.84565143,
45.41958426,  100.65804182,  -17.20894648,   58.28237972,
43.25404614,   80.54533311,   64.54837152,   94.46212781])
In [98]:
NumbersAsStrings = [ str(number) for number in GaussianRandom]
PatternCounts = { '1':0, '2':0, '3':0, '4':0, '5':0,
'6':0, '7':0, '8':0, '9':0 }

for word in NumbersAsStrings:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1

numdigits = [1,2,3,4,5,6,7,8,9]
strdigits = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

TotalCount = len( NumbersAsStrings )
Frequencies = [ PatternCounts[digit]/TotalCount for digit in strdigits ]

plt.bar( numdigits, Frequencies, align='center')
plt.xticks( numdigits ) # make sure all the digits are on the x-axis
plt.xlabel("Digits")
plt.ylabel("Frequencies")
plt.title("Gaussian (bell curve) Distributed numbers do not satisfy Benford's law");


Fraud Detection¶

In spite of its failures in some instances, Benford's law is a valuable tool -- and one that is necessarily implemented computationally. Population numbers, financial numbers, and many other sets of numbers can be expected to have a leading digit of 1 about 30% of the time.

In [99]:
username = "hammese"
%run -i Assignment3Supplemental.py


Loaded the data sets CountyPopulations1990, NYSE10Feb2017, CorporateFinancial, EnronFinancial, KimmyTheKite, MannyTheMooch, and SammyTheSnake.

<matplotlib.figure.Figure at 0x7fa10f879b70>

If the cell executed correctly, there are now 7 additional data sets in memory. For example, CountyPopulations1990 is the population of each county in the United States in the 1990 census; and it satisfies Benford's law.

In [100]:
NumbersAsStrings = [ str(number) for number in CountyPopulations1990]
PatternCounts = { '1':0, '2':0, '3':0, '4':0, '5':0,
'6':0, '7':0, '8':0, '9':0 }

for word in NumbersAsStrings:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1

numdigits = [1,2,3,4,5,6,7,8,9]
strdigits = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

TotalCount = len( NumbersAsStrings )
Frequencies = [ PatternCounts[digit]/TotalCount for digit in strdigits ]

plt.bar( numdigits, Frequencies, align='center')
plt.plot( range(1,10),BenProbs[1:], 'k-' )
plt.plot( range(1,10),BenProbs[1:], 'k.', markersize = 10 )
plt.xticks( numdigits ) # make sure all the digits are on the x-axis
plt.xlabel("Digits")
plt.ylabel("Frequencies")
plt.title("County Population 1990 distribution (blue) close to Benford's law (black). ");


If this was a statistics course, we would at this point introduce the Kolmogorov-Smirnov test for determining if two sets of data come from the sample distribution. However, we will simply notice visually that the distribution represented by the blue bars does resemble the theoretical (black lines and dots) Benford law distribution.

For example, the New York Stock Exchange closing prices on February 10, 2017 also have a Benford law distribution.

In [101]:
NumbersAsStrings = [ str(number) for number in NYSE10Feb2017]
PatternCounts = { '1':0, '2':0, '3':0, '4':0, '5':0,
'6':0, '7':0, '8':0, '9':0 }

for word in NumbersAsStrings:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1

numdigits = [1,2,3,4,5,6,7,8,9]
strdigits = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

TotalCount = len( NumbersAsStrings )
Frequencies = [ PatternCounts[digit]/TotalCount for digit in strdigits ]

plt.bar( numdigits, Frequencies, align='center')
plt.plot( range(1,10),BenProbs[1:], 'k-' )
plt.plot( range(1,10),BenProbs[1:], 'k.', markersize = 10 )
plt.xticks( numdigits ) # make sure all the digits are on the x-axis
plt.xlabel("Digits")
plt.ylabel("Frequencies")
plt.title("NYSE closing prices distribution (blue) is close to Benford's law (black). ");


Financial data in general follow a Benford's law distribution -- unless it has been forged. That is,

_ __People tend to fabricate numbers that are not 'naturally random' and thus do not satisfy Benford's law__ _

It is, in fact, a standard practice for law enforcement and tax officials to use Benford's law to test if a suspect "cooked the books" -- i.e., made up numbers -- or if the numbers are naturally random as would be expected.

For example, here is financial data (sales figures) for a large corporation for its 2014 fiscal year.

In [102]:
NumbersAsStrings = [ str(number) for number in CorporateFinancial]
PatternCounts = { '1':0, '2':0, '3':0, '4':0, '5':0,
'6':0, '7':0, '8':0, '9':0 }

for word in NumbersAsStrings:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1

numdigits = [1,2,3,4,5,6,7,8,9]
strdigits = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

TotalCount = len( NumbersAsStrings )
Frequencies = [ PatternCounts[digit]/TotalCount for digit in strdigits ]

plt.bar( numdigits, Frequencies, align='center')
plt.plot( range(1,10),BenProbs[1:], 'k-' )
plt.plot( range(1,10),BenProbs[1:], 'k.', markersize = 10 )
plt.xticks( numdigits ) # make sure all the digits are on the x-axis
plt.xlabel("Digits")
plt.ylabel("Frequencies")
plt.title("Typical Financial data lead-digit distribution (blue) resembles Benford's law (black). ");


But now consider fraudulent financial data. In particular, although many of the numbers are from actual transactions, a large fraction of these numbers were "made up" by a human in the act of committing financial fraud.

In [103]:
NumbersAsStrings = [ str(number) for number in FraudFinancial ]
PatternCounts = { '1':0, '2':0, '3':0, '4':0, '5':0,
'6':0, '7':0, '8':0, '9':0 }

for word in NumbersAsStrings:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1

numdigits = [1,2,3,4,5,6,7,8,9]
strdigits = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

TotalCount = len( NumbersAsStrings )
Frequencies = [ PatternCounts[digit]/TotalCount for digit in strdigits ]

plt.bar( numdigits, Frequencies, align='center')
plt.plot( range(1,10),BenProbs[1:], 'k-' )
plt.plot( range(1,10),BenProbs[1:], 'k.', markersize = 10 )
plt.xticks( numdigits ) # make sure all the digits are on the x-axis
plt.xlabel("Digits")
plt.ylabel("Frequencies")
plt.title("Fraudulent data lead-digit distribution (blue) versus Benford's law (black). ");


This is, in fact, your task in the remainder of this notebook. Copy and paste code from above, and modify as needed, to answer the following:

Exercise 4¶

Do the first 100 numbers in the Fibonnacci sequence satisfy Benford's law? Explain why or why not.

In [104]:
First100FibonacciNumbers = [1,1,2,3,5,8,13,21,34,55,89,144,233,377,610,987,1597,2584,4181,6765,
10946,17711,28657,46368,75025,121393,196418,317811,514229,832040,
1346269,2178309,3524578,5702887,9227465,14930352,24157817,39088169,63245986,
102334155,165580141,267914296,433494437,701408733,1134903170,1836311903,
2971215073,4807526976,7778742049,12586269025,20365011074,32951280099,
53316291173,86267571272,139583862445,225851433717,365435296162,
591286729879,956722026041,1548008755920,2504730781961,4052739537881,
6557470319842,10610209857723,17167680177565,27777890035288,44945570212853,
72723460248141,117669030460994,190392490709135,308061521170129,
498454011879264,806515533049393,1304969544928657,2111485077978050,3416454622906707,
5527939700884757,8944394323791464,14472334024676221,23416728348467685,
37889062373143906,61305790721611591,99194853094755497,160500643816367088,
259695496911122585,420196140727489673,679891637638612258,1100087778366101931,
1779979416004714189,2880067194370816120,4660046610375530309,7540113804746346429,
12200160415121876738,19740274219868223167,31940434634990099905,
51680708854858323072,83621143489848422977,135301852344706746049,
218922995834555169026,354224848179261915075]

In [105]:
NumbersAsStrings = [ str(number) for number in First100FibonacciNumbers ]
PatternCounts = { '1':0, '2':0, '3':0, '4':0, '5':0,
'6':0, '7':0, '8':0, '9':0 }

for word in NumbersAsStrings:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1

numdigits = [1,2,3,4,5,6,7,8,9]
strdigits = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

TotalCount = len( NumbersAsStrings )
Frequencies = [ PatternCounts[digit]/TotalCount for digit in strdigits ]

plt.bar( numdigits, Frequencies, align='center')
plt.plot( range(1,10),BenProbs[1:], 'k-' )
plt.plot( range(1,10),BenProbs[1:], 'k.', markersize = 10 )
plt.xticks( numdigits ) # make sure all the digits are on the x-axis
plt.xlabel("Digits")
plt.ylabel("Frequencies")
plt.title("First 100 Fibonacci Numbers (blue) versus Benford's law (black). ");


Based off of the graph shows that the First 100 Fibonacci Numbers do in face satisfy Benford's law. There are a few locations on the graph, such as digit 8, where there is a little discrepancy between the frequency of the Fibonacci and Benford's, but not enough for me to consider it to not satisfy.

Exercise 5¶

There are 3 "suspects" in a fraud investigation who submitted the following numbers in their financial documents -- KimmyTheKite, MannyTheMooch, and SammyTheSnake. Would you suspect any -- or all -- of committing fraud and why? (Show your work)

In [106]:
print( 'KimmyTheKite = {0}\n'.format(KimmyTheKite) )
print( 'MannyTheMooch = {0}\n'.format(MannyTheMooch) )
print( 'SammyTheSnake = {0}\n'.format(SammyTheSnake) )

KimmyTheKite = [789.51, 291.34, 625.62, 716.81, 398.15, 748.03, 114.23, 874.33, 900.46, 20.49, 405.48, 966.71, 575.34, 802.39, 840.77, 782.53, 31.32, 380.27, 190.04, 946.33, 51.56, 597.62, 208.64, 690.01, 950.21, 173.86, 142.84, 688.81, 775.68, 798.95, 945.08, 890.14, 570.7, 690.96, 27.0, 862.13, 750.4, 177.73, 37.2, 195.95, 250.83, 699.53, 475.05, 143.55, 637.97, 434.47, 159.59, 169.44, 851.18, 734.08, 611.24, 136.02, 66.52, 445.13, 264.52, 306.57, 566.37, 418.1, 229.9, 473.91, 511.07, 567.82, 421.53, 324.49, 579.0, 257.8, 312.03, 262.5, 138.25, 265.67, 148.21, 348.16, 108.45, 267.77, 418.26, 433.25, 493.06, 243.36, 130.45, 356.85, 392.08, 200.28, 140.47, 219.44, 358.27, 191.03, 177.95, 357.37, 323.49, 368.53, 183.47, 110.83, 294.69, 173.66, 137.74, 177.22, 217.32, 190.57, 28.17, 299.32]

MannyTheMooch = [528.3, 189.45, 104.26, 311.47, 857.88, 359.29, 652.3, 168.57, 75.61, 781.55, 455.76, 920.31, 610.9, 328.13, 926.24, 309.39, 851.43, 959.78, 940.73, 873.24, 490.9, 521.41, 297.56, 111.2, 906.68, 494.27, 236.37, 610.01, 419.28, 885.18, 195.27, 823.97, 478.1, 419.78, 184.53, 469.63, 153.81, 130.75, 445.18, 169.29, 227.23, 59.84, 814.87, 506.29, 194.83, 527.74, 796.65, 290.9, 222.74, 666.5, 558.67, 646.08, 669.01, 276.93, 435.01, 223.8, 368.77, 346.81, 564.13, 400.37, 717.8, 130.3, 705.72, 25.17, 743.34, 268.01, 503.79, 218.77, 633.11, 147.24, 385.45, 226.6, 278.43, 225.69, 210.36, 355.77, 39.34, 187.84, 371.98, 226.53, 213.98, 323.34, 380.3, 312.51, 121.44, 132.63, 179.71, 246.42, 128.26, 232.56, 175.37, 157.87, 186.24, 193.34, 110.68, 151.98, 121.51, 119.02, 108.06, 110.0]

SammyTheSnake = [691.57, 236.91, 746.72, 184.68, 939.65, 742.83, 681.64, 344.12, 574.95, 184.35, 434.81, 877.24, 810.56, 142.18, 550.4, 642.51, 616.06, 122.47, 418.29, 265.27, 310.66, 229.47, 689.6, 102.52, 54.59, 510.58, 297.23, 548.54, 239.91, 211.07, 527.25, 501.11, 206.97, 500.8, 987.97, 789.85, 90.07, 866.68, 323.34, 371.2, 715.11, 792.48, 265.47, 704.21, 672.1, 935.33, 92.97, 592.57, 308.57, 78.04, 684.09, 557.97, 724.22, 945.56, 37.12, 935.87, 140.79, 220.23, 442.19, 818.76, 838.71, 266.59, 499.6, 721.06, 874.43, 482.69, 620.51, 866.66, 887.94, 529.43, 124.22, 709.69, 825.75, 347.8, 438.96, 654.54, 796.03, 822.69, 937.49, 387.02, 588.99, 207.74, 493.08, 134.7, 983.44, 662.67, 577.89, 586.03, 444.97, 592.45, 301.14, 293.07, 696.98, 92.38, 591.66, 9.06, 328.29, 530.32, 9.17, 165.89]


In [107]:
NumbersAsStrings = [ str(number) for number in KimmyTheKite ]
PatternCounts = { '1':0, '2':0, '3':0, '4':0, '5':0,
'6':0, '7':0, '8':0, '9':0 }

for word in NumbersAsStrings:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1

numdigits = [1,2,3,4,5,6,7,8,9]
strdigits = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

TotalCount = len( NumbersAsStrings )
Frequencies = [ PatternCounts[digit]/TotalCount for digit in strdigits ]

plt.bar( numdigits, Frequencies, align='center')
plt.plot( range(1,10),BenProbs[1:], 'k-' )
plt.plot( range(1,10),BenProbs[1:], 'k.', markersize = 10 )
plt.xticks( numdigits ) # make sure all the digits are on the x-axis
plt.xlabel("Digits")
plt.ylabel("Frequencies")
plt.title("KimmyTheKite (blue) versus Benford's law (black). ");

In [108]:
NumbersAsStrings = [ str(number) for number in MannyTheMooch ]
PatternCounts = { '1':0, '2':0, '3':0, '4':0, '5':0,
'6':0, '7':0, '8':0, '9':0 }

for word in NumbersAsStrings:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1

numdigits = [1,2,3,4,5,6,7,8,9]
strdigits = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

TotalCount = len( NumbersAsStrings )
Frequencies = [ PatternCounts[digit]/TotalCount for digit in strdigits ]

plt.bar( numdigits, Frequencies, align='center')
plt.plot( range(1,10),BenProbs[1:], 'k-' )
plt.plot( range(1,10),BenProbs[1:], 'k.', markersize = 10 )
plt.xticks( numdigits ) # make sure all the digits are on the x-axis
plt.xlabel("Digits")
plt.ylabel("Frequencies")
plt.title("MannyTheMooch (blue) versus Benford's law (black). ");

In [109]:
NumbersAsStrings = [ str(number) for number in SammyTheSnake ]
PatternCounts = { '1':0, '2':0, '3':0, '4':0, '5':0,
'6':0, '7':0, '8':0, '9':0 }

for word in NumbersAsStrings:
for pattern in PatternCounts.keys():
if( re.match(pattern,word) ):
PatternCounts[pattern] += 1

numdigits = [1,2,3,4,5,6,7,8,9]
strdigits = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

TotalCount = len( NumbersAsStrings )
Frequencies = [ PatternCounts[digit]/TotalCount for digit in strdigits ]

plt.bar( numdigits, Frequencies, align='center')
plt.plot( range(1,10),BenProbs[1:], 'k-' )
plt.plot( range(1,10),BenProbs[1:], 'k.', markersize = 10 )
plt.xticks( numdigits ) # make sure all the digits are on the x-axis
plt.xlabel("Digits")
plt.ylabel("Frequencies")
plt.title("SammyTheSnake (blue) versus Benford's law (black). ");


SammyTheSnake could easily be suspected for committing fraud based off of the results of the graph. KimmyTheKite could be a little iffy on whether or not she committed fraud... her graph is a little off when it comes to aligning with Benford's law, but not nearly as off as SammyTheSnake. MannyTheMooch seems to be safe. His pretty closely align with Benford's law.