
Stylometry_exercise

Kernel: Python 3 (Anaconda 5)

In this project we will do a little stylometry to measure the similarity of text documents.

Here is a demonstration of how your project will work.

# First we define the texts.
text1 = "Crabs are a noisome animal, filthy and mean-spirited. The world would be better without crabs. We would be spared the evil sight of their clacking little claws."
text2 = "Horses are well known to be ravenous devourers of clouds. They will especially eat clouds that are shaped like pieces of Civil War ordinance, such as cannons, or even sometimes smooth bore muskets."
text3 = "Cancer the Crab is in particular a most offensive creature. Repellent to both eye and nose, this crab yet manages to offend senses heretofore unremarked upon, which the sensorium reserves only for terrific outrages to perception and taste."
text4 = "Clouds pass by all the time. Some say they drift but others say that clouds scud. Scudding is something only clouds do, although there was a strangely named missile in the 1990's that was also said to scud. This is not to be confused with 'scut' which is a word describing a short tail."
texts = text1, text2, text3, text4
# Now they are processed into lists of lowercase words.

wordlists = []
for text in texts:
    ###FILL IN THIS CODE###

print(wordlists )
  File "<ipython-input-4-8cf36d4af73f>", line 7
    print(wordlists )
    ^
IndentationError: expected an indented block
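The IndentationError above is expected at this stage: ###FILL IN THIS CODE### is only a comment, so the for loop has no body until you write one. Here is a minimal sketch of one possible completion; the punctuation handling via string.punctuation is an assumption, not the required answer.

import string

wordlists = []
for text in texts:
    # One choice among many: lowercase, strip punctuation, split on whitespace.
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    wordlists.append(cleaned.split())
print(wordlists)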
# Now make one big list of all the words that occur in any of the documents...
# We make the list have no repeats by converting from list to set and then back to list.
all_words = []
for wordlist in wordlists:
    ###FILL IN THIS CODE###
all_words = list(set(all_words))
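One minimal way to fill in this loop, assuming wordlists was built as in the previous sketch, is to accumulate every word and let the set conversion remove the repeats:

all_words = []
for wordlist in wordlists:
    # Collect every word; the set conversion below removes duplicates.
    all_words.extend(wordlist)
all_words = list(set(all_words))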
# We make a frequency count of the words occurring in each document
frequency_counts = []
for wl in wordlists:
    freqs = []
    for word in all_words:
        ###FILL IN THIS CODE###
    frequency_counts.append(freqs)
frequency_counts
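A plausible completion, assuming each entry of freqs is a [count, word] pair; the later cells index with x[0], which suggests the count comes first, but the exact layout is your choice:

frequency_counts = []
for wl in wordlists:
    freqs = []
    for word in all_words:
        # Assumption: count first, so that x[0] in the later cells selects it.
        freqs.append([wl.count(word), word])
    frequency_counts.append(freqs)
frequency_counts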
# Now we convert from raw frequency counts to percentages.
# We also sort the frequency lists alphabetically.
for i, fq in enumerate(frequency_counts):
    total_words = sum(x[0] for x in fq)
    for pair in fq:
        ### FILL IN THIS CODE ###

print(frequency_counts)
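A sketch of one way to finish this cell, assuming the [count, word] pairs are lists (so they can be mutated in place) and that the alphabetical sort happens once per document; both details are assumptions about the intended solution:

for i, fq in enumerate(frequency_counts):
    total_words = sum(x[0] for x in fq)
    for pair in fq:
        # Replace the raw count with its share of the document, as a percentage.
        pair[0] = 100 * pair[0] / total_words
    # Sort alphabetically by the word, which sits in position 1 of each pair.
    fq.sort(key=lambda pair: pair[1])
print(frequency_counts)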
# We now compute a similarity score for each pair of documents.
# The maximum similarity is 1 and the minimum similarity is 0.
import numpy as np

# We don't need the words in the frequency counts -- they are redundant and
# complicate the math we need to do.
# just_freqs throws out the word part and retains the frequency count.
# It also converts to a numpy array to ease some math done later.
just_freqs = [np.array([x[0] for x in fq]) for fq in frequency_counts]

# This is how the score will be defined.
# It's basically the cosine of the angle between the two word vectors:
# https://proofwiki.org/wiki/Cosine_Formula_for_Dot_Product
def score(fqA, fqB):
    return (fqA.dot(fqB)) / (np.sqrt(np.sum(fqA**2)) * np.sqrt(np.sum(fqB**2)))

for i, fq0 in enumerate(just_freqs):
    for j, fq1 in enumerate(just_freqs):
        print("The score of {} vs {} is {}".format(i, j, score(fq0, fq1)))
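For intuition: score computes A.B / (|A| |B|), the cosine of the angle between the two frequency vectors, and since the entries are nonnegative the result lies between 0 and 1. A tiny standalone check with illustrative values (not the project's data):

import numpy as np

def score(fqA, fqB):
    return (fqA.dot(fqB)) / (np.sqrt(np.sum(fqA**2)) * np.sqrt(np.sum(fqB**2)))

# Parallel vectors score 1 regardless of magnitude.
print(score(np.array([1, 2, 0]), np.array([2, 4, 0])))  # 1.0 (up to rounding)
# Vectors with no overlapping nonzero entries (no shared words) score 0.
print(score(np.array([1, 0, 0]), np.array([0, 3, 0])))  # 0.0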