Kernel: Python 3
In [1]:
Working with Text Data
Types of data represented as strings
Example application: Sentiment analysis of movie reviews
In [2]:
File ‘data/aclImdb_v1.tar.gz’ already there; not retrieving.
In [3]:
data/aclImdb
├── test
│ ├── neg
│ └── pos
└── train
├── neg
├── pos
└── unsup
7 directories
In [4]:
In [5]:
type of text_train: <class 'list'>
length of text_train: 25000
text_train[6]:
b"This movie has a special way of telling the story, at first i found it rather odd as it jumped through time and I had no idea whats happening.<br /><br />Anyway the story line was although simple, but still very real and touching. You met someone the first time, you fell in love completely, but broke up at last and promoted a deadly agony. Who hasn't go through this? but we will never forget this kind of pain in our life. <br /><br />I would say i am rather touched as two actor has shown great performance in showing the love between the characters. I just wish that the story could be a happy ending."
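The code cells in this export were not preserved. A minimal sketch of how such labeled review folders are typically loaded with scikit-learn's `load_files`, shown on a tiny temporary stand-in for `data/aclImdb/train` (the two one-line "reviews" are made up for illustration):

```python
# Sketch: load_files treats each subfolder name ("neg", "pos") as a class label.
import os
import tempfile

import numpy as np
from sklearn.datasets import load_files

root = tempfile.mkdtemp()
for label, text in [("neg", b"boring plot"), ("pos", b"wonderful acting")]:
    os.makedirs(os.path.join(root, label))
    with open(os.path.join(root, label, "0.txt"), "wb") as f:
        f.write(text)

reviews = load_files(root)
text_train, y_train = reviews.data, reviews.target
print("type of text_train:", type(text_train))
print("length of text_train:", len(text_train))
print("Samples per class:", np.bincount(y_train))
```

With the default `encoding=None`, the documents come back as raw bytes, which is why the review above prints with a `b"..."` prefix.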
In [5]:
In [6]:
array([0, 1])
In [7]:
Samples per class (training): [12500 12500]
In [8]:
Number of documents in test data: 25000
Samples per class (test): [12500 12500]
Representing text data as Bag of Words
Applying bag-of-words to a toy dataset
In [9]:
In [10]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
In [11]:
Vocabulary size: 13
Vocabulary content:
{'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, 'is': 6, 'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5, 'to': 11, 'be': 0}
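A sketch of the fit that produces this vocabulary, assuming `CountVectorizer` with default settings (lowercasing, tokens of at least two word characters; `bards_words` is the two-sentence toy corpus used throughout this section):

```python
from sklearn.feature_extraction.text import CountVectorizer

bards_words = ["The fool doth think he is wise,",
               "but the wise man knows himself to be a fool"]

vect = CountVectorizer()
vect.fit(bards_words)
print("Vocabulary size:", len(vect.vocabulary_))
print("Vocabulary content:\n", vect.vocabulary_)
```

Note that the single-character token "a" does not appear: the default `token_pattern` only matches tokens of two or more word characters.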
In [12]:
bag_of_words: <2x13 sparse matrix of type '<class 'numpy.int64'>'
with 16 stored elements in Compressed Sparse Row format>
In [13]:
Dense representation of bag_of_words:
[[0 0 1 1 1 0 1 0 0 1 1 0 1]
[1 1 0 1 0 1 0 1 1 1 0 1 1]]
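A sketch of the transform step: `transform` returns a SciPy sparse matrix, and `toarray()` produces the dense view shown above.

```python
from sklearn.feature_extraction.text import CountVectorizer

bards_words = ["The fool doth think he is wise,",
               "but the wise man knows himself to be a fool"]

vect = CountVectorizer().fit(bards_words)
bag_of_words = vect.transform(bards_words)
print(repr(bag_of_words))       # compressed sparse row matrix
print(bag_of_words.toarray())   # dense 2x13 count matrix
```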
Bag-of-words for movie reviews
In [14]:
X_train:
<25000x74849 sparse matrix of type '<class 'numpy.int64'>'
with 3431196 stored elements in Compressed Sparse Row format>
In [15]:
Number of features: 74849
First 20 features:
['00', '000', '0000000000001', '00001', '00015', '000s', '001', '003830', '006', '007', '0079', '0080', '0083', '0093638', '00am', '00pm', '00s', '01', '01pm', '02']
Features 20010 to 20030:
['dratted', 'draub', 'draught', 'draughts', 'draughtswoman', 'draw', 'drawback', 'drawbacks', 'drawer', 'drawers', 'drawing', 'drawings', 'drawl', 'drawled', 'drawling', 'drawn', 'draws', 'draza', 'dre', 'drea']
Every 2000th feature:
['00', 'aesir', 'aquarian', 'barking', 'blustering', 'bête', 'chicanery', 'condensing', 'cunning', 'detox', 'draper', 'enshrined', 'favorit', 'freezer', 'goldman', 'hasan', 'huitieme', 'intelligible', 'kantrowitz', 'lawful', 'maars', 'megalunged', 'mostey', 'norrland', 'padilla', 'pincher', 'promisingly', 'receptionist', 'rivals', 'schnaas', 'shunning', 'sparse', 'subset', 'temptations', 'treatises', 'unproven', 'walkman', 'xylophonist']
In [16]:
Mean cross-validation accuracy: 0.88
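The 0.88 above is the result of cross-validating `LogisticRegression` on the bag-of-words features. A self-contained sketch of the same step, run on a tiny synthetic corpus (the texts and labels below are invented for illustration, not taken from the IMDB data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical, trivially separable mini-corpus.
texts = ["great wonderful film", "loved it great", "wonderful fun acting",
         "terrible boring film", "awful boring plot", "terrible awful acting"] * 5
labels = ([1] * 3 + [0] * 3) * 5

X = CountVectorizer().fit_transform(texts)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(scores.mean()))
```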
In [17]:
Best cross-validation score: 0.89
Best parameters: {'C': 0.1}
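A sketch of how such a grid search over the regularization parameter `C` is typically set up, again on an invented mini-corpus rather than the real reviews:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

texts = ["great wonderful film", "loved it great", "wonderful fun acting",
         "terrible boring film", "awful boring plot", "terrible awful acting"] * 5
labels = ([1] * 3 + [0] * 3) * 5

X = CountVectorizer().fit_transform(texts)
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X, labels)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters:", grid.best_params_)
```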
In [18]:
Test score: 0.88
In [19]:
X_train with min_df: <25000x27271 sparse matrix of type '<class 'numpy.int64'>'
with 3354014 stored elements in Compressed Sparse Row format>
In [20]:
First 50 features:
['00', '000', '007', '00s', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '100', '1000', '100th', '101', '102', '103', '104', '105', '107', '108', '10s', '10th', '11', '110', '112', '116', '117', '11th', '12', '120', '12th', '13', '135', '13th', '14', '140', '14th', '15', '150', '15th', '16', '160', '1600', '16mm', '16s', '16th']
Features 20010 to 20030:
['repentance', 'repercussions', 'repertoire', 'repetition', 'repetitions', 'repetitious', 'repetitive', 'rephrase', 'replace', 'replaced', 'replacement', 'replaces', 'replacing', 'replay', 'replayable', 'replayed', 'replaying', 'replays', 'replete', 'replica']
Every 700th feature:
['00', 'affections', 'appropriately', 'barbra', 'blurbs', 'butchered', 'cheese', 'commitment', 'courts', 'deconstructed', 'disgraceful', 'dvds', 'eschews', 'fell', 'freezer', 'goriest', 'hauser', 'hungary', 'insinuate', 'juggle', 'leering', 'maelstrom', 'messiah', 'music', 'occasional', 'parking', 'pleasantville', 'pronunciation', 'recipient', 'reviews', 'sas', 'shea', 'sneers', 'steiger', 'swastika', 'thrusting', 'tvs', 'vampyre', 'westerns']
In [21]:
Best cross-validation score: 0.89
Stop-words
In [22]:
Number of stop words: 318
Every 10th stopword:
['they', 'of', 'who', 'found', 'none', 'co', 'full', 'otherwise', 'never', 'have', 'she', 'neither', 'whereby', 'one', 'any', 'de', 'hence', 'wherever', 'whose', 'him', 'which', 'nine', 'still', 'from', 'here', 'what', 'everything', 'us', 'etc', 'mine', 'find', 'most']
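The list above is scikit-learn's built-in English stop-word list, which ships as a frozenset:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print("Number of stop words:", len(ENGLISH_STOP_WORDS))
print("Every 10th stopword:\n", sorted(ENGLISH_STOP_WORDS)[::10])
```

Passing `stop_words="english"` to `CountVectorizer` applies this same list during tokenization, which is what produces the smaller 26,966-feature matrix below.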
In [23]:
X_train with stop words:
<25000x26966 sparse matrix of type '<class 'numpy.int64'>'
with 2149958 stored elements in Compressed Sparse Row format>
In [24]:
Best cross-validation score: 0.88
Rescaling the Data with tf-idf
In [25]:
Best cross-validation score: 0.89
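A tf-idf score like this typically comes from chaining `TfidfVectorizer` and `LogisticRegression` in a pipeline and grid-searching `C`; a sketch on the same invented mini-corpus as above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

texts = ["great wonderful film", "loved it great", "wonderful fun acting",
         "terrible boring film", "awful boring plot", "terrible awful acting"] * 5
labels = ([1] * 3 + [0] * 3) * 5

# The vectorizer sits inside the pipeline so each cross-validation split
# recomputes idf weights on its own training fold only.
pipe = make_pipeline(TfidfVectorizer(min_df=1), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid, cv=5).fit(texts, labels)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
```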
In [26]:
Features with lowest tfidf:
['poignant' 'disagree' 'instantly' 'importantly' 'lacked' 'occurred'
'currently' 'altogether' 'nearby' 'undoubtedly' 'directs' 'fond' 'stinker'
'avoided' 'emphasis' 'commented' 'disappoint' 'realizing' 'downhill'
'inane']
Features with highest tfidf:
['coop' 'homer' 'dillinger' 'hackenstein' 'gadget' 'taker' 'macarthur'
'vargas' 'jesse' 'basket' 'dominick' 'the' 'victor' 'bridget' 'victoria'
'khouri' 'zizek' 'rob' 'timon' 'titanic']
In [27]:
Features with lowest idf:
['the' 'and' 'of' 'to' 'this' 'is' 'it' 'in' 'that' 'but' 'for' 'with'
'was' 'as' 'on' 'movie' 'not' 'have' 'one' 'be' 'film' 'are' 'you' 'all'
'at' 'an' 'by' 'so' 'from' 'like' 'who' 'they' 'there' 'if' 'his' 'out'
'just' 'about' 'he' 'or' 'has' 'what' 'some' 'good' 'can' 'more' 'when'
'time' 'up' 'very' 'even' 'only' 'no' 'would' 'my' 'see' 'really' 'story'
'which' 'well' 'had' 'me' 'than' 'much' 'their' 'get' 'were' 'other'
'been' 'do' 'most' 'don' 'her' 'also' 'into' 'first' 'made' 'how' 'great'
'because' 'will' 'people' 'make' 'way' 'could' 'we' 'bad' 'after' 'any'
'too' 'then' 'them' 'she' 'watch' 'think' 'acting' 'movies' 'seen' 'its'
'him']
Investigating model coefficients
In [28]:
[figure output could not be rendered]
Bag of words with more than one word (n-grams)
In [29]:
bards_words:
['The fool doth think he is wise,', 'but the wise man knows himself to be a fool']
In [30]:
Vocabulary size: 13
Vocabulary:
['be', 'but', 'doth', 'fool', 'he', 'himself', 'is', 'knows', 'man', 'the', 'think', 'to', 'wise']
In [31]:
Vocabulary size: 14
Vocabulary:
['be fool', 'but the', 'doth think', 'fool doth', 'he is', 'himself to', 'is wise', 'knows himself', 'man knows', 'the fool', 'the wise', 'think he', 'to be', 'wise man']
In [32]:
Transformed data (dense):
[[0 0 1 1 1 0 1 0 0 1 0 1 0 0]
[1 1 0 0 0 1 0 1 1 0 1 0 1 1]]
In [33]:
Vocabulary size: 39
Vocabulary:
['be', 'be fool', 'but', 'but the', 'but the wise', 'doth', 'doth think', 'doth think he', 'fool', 'fool doth', 'fool doth think', 'he', 'he is', 'he is wise', 'himself', 'himself to', 'himself to be', 'is', 'is wise', 'knows', 'knows himself', 'knows himself to', 'man', 'man knows', 'man knows himself', 'the', 'the fool', 'the fool doth', 'the wise', 'the wise man', 'think', 'think he', 'think he is', 'to', 'to be', 'to be fool', 'wise', 'wise man', 'wise man knows']
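A sketch of the `ngram_range` parameter behind the last two outputs: it sets the minimum and maximum length of the token sequences extracted.

```python
from sklearn.feature_extraction.text import CountVectorizer

bards_words = ["The fool doth think he is wise,",
               "but the wise man knows himself to be a fool"]

# Bigrams only: 6 + 8 bigrams with no overlap between the two sentences.
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(bards_words)
print("Bigram vocabulary size:", len(bigrams.vocabulary_))

# Unigrams through trigrams: 13 + 14 + 12 features.
uni_to_tri = CountVectorizer(ngram_range=(1, 3)).fit(bards_words)
print("Unigram-to-trigram vocabulary size:", len(uni_to_tri.vocabulary_))
```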
In [34]:
Best cross-validation score: 0.91
Best parameters:
{'logisticregression__C': 100, 'tfidfvectorizer__ngram_range': (1, 3)}
In [35]:
<matplotlib.colorbar.Colorbar at 0x7fa987ed74e0>
[figure output could not be rendered]
In [36]:
(-22, 22)
[figure output could not be rendered]
In [37]:
(-22, 22)
[figure output could not be rendered]
Advanced tokenization, stemming and lemmatization
In [39]:
In [40]:
Lemmatization:
['our', 'meeting', 'today', 'be', 'bad', 'than', 'yesterday', ',', 'i', 'be', 'scar', 'of', 'meet', 'the', 'client', 'tomorrow', '.']
Stemming:
['our', 'meet', 'today', 'wa', 'wors', 'than', 'yesterday', ',', 'i', "'m", 'scare', 'of', 'meet', 'the', 'client', 'tomorrow', '.']
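The stemming output above matches what NLTK's Porter-family stemmers produce (e.g. "was" is truncated to "wa", "worse" to "wors"). A sketch assuming `nltk` is installed; the lemmatization row, by contrast, used spaCy, which additionally requires a downloaded language model:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ["our", "meeting", "today", "was", "worse", "than", "yesterday"]
# Stemming drops suffixes by rule, without knowing the word's part of speech.
print("Stemming:", [stemmer.stem(t) for t in tokens])
```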
In [41]:
In [42]:
X_train_lemma.shape: (25000, 21637)
X_train.shape: (25000, 27271)
In [43]:
Best cross-validation score (standard CountVectorizer): 0.721
Best cross-validation score (lemmatization): 0.731
Topic Modeling and Document Clustering
Latent Dirichlet Allocation
In [44]:
In [42]:
In [43]:
lda.components_.shape: (10, 10000)
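A sketch of the LDA fit that yields a `components_` array of shape (n_topics, n_words), run on a small invented corpus instead of the 10,000-word review vocabulary:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = ["great wonderful film loved it", "wonderful fun acting great film",
         "terrible boring film awful plot", "terrible awful acting boring",
         "spaceship alien planet crew", "alien planet spaceship attack"] * 3

X = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=3, learning_method="batch",
                                max_iter=10, random_state=0)
# fit_transform returns the per-document topic weights.
document_topics = lda.fit_transform(X)
print("lda.components_.shape:", lda.components_.shape)
```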
In [47]:
In [48]:
topic 0 topic 1 topic 2 topic 3 topic 4
-------- -------- -------- -------- --------
between war funny show didn
family world comedy series saw
young us guy episode thought
real american laugh tv am
us our jokes episodes thing
director documentary fun shows got
work history humor season 10
both years re new want
beautiful new hilarious years going
each human doesn television watched
topic 5 topic 6 topic 7 topic 8 topic 9
-------- -------- -------- -------- --------
action kids role performance horror
effects action cast role house
nothing animation john john killer
budget children version actor gets
script game novel cast woman
minutes disney director plays dead
original fun both jack girl
director old played michael around
least 10 mr oscar goes
doesn kid young father wife
In [49]:
In [50]:
In [51]:
topic 7 topic 16 topic 24 topic 25 topic 28
-------- -------- -------- -------- --------
thriller worst german car beautiful
suspense awful hitler gets young
horror boring nazi guy old
atmosphere horrible midnight around romantic
mystery stupid joe down between
house thing germany kill romance
director terrible years goes wonderful
quite script history killed heart
bit nothing new going feel
de worse modesty house year
performances waste cowboy away each
dark pretty jewish head french
twist minutes past take sweet
hitchcock didn kirk another boy
tension actors young getting loved
interesting actually spanish doesn girl
mysterious re enterprise now relationship
murder supposed von night saw
ending mean nazis right both
creepy want spock woman simple
topic 36 topic 37 topic 41 topic 45 topic 51
-------- -------- -------- -------- --------
performance excellent war music earth
role highly american song space
actor amazing world songs planet
cast wonderful soldiers rock superman
play truly military band alien
actors superb army soundtrack world
performances actors tarzan singing evil
played brilliant soldier voice humans
supporting recommend america singer aliens
director quite country sing human
oscar performance americans musical creatures
roles performances during roll miike
actress perfect men fan monsters
excellent drama us metal apes
screen without government concert clark
plays beautiful jungle playing burton
award human vietnam hear tim
work moving ii fans outer
playing world political prince men
gives recommended against especially moon
topic 53 topic 54 topic 63 topic 89 topic 97
-------- -------- -------- -------- --------
scott money funny dead didn
gary budget comedy zombie thought
streisand actors laugh gore wasn
star low jokes zombies ending
hart worst humor blood minutes
lundgren waste hilarious horror got
dolph 10 laughs flesh felt
career give fun minutes part
sabrina want re body going
role nothing funniest living seemed
temple terrible laughing eating bit
phantom crap joke flick found
judy must few budget though
melissa reviews moments head nothing
zorro imdb guy gory lot
gets director unfunny evil saw
barbra thing times shot long
cast believe laughed low interesting
short am comedies fulci few
serial actually isn re half
In [52]:
b'I love this movie and never get tired of watching. The music in it is great.\n'
b"I enjoyed Still Crazy more than any film I have seen in years. A successful band from the 70's decide to give it another try.\n"
b'Hollywood Hotel was the last movie musical that Busby Berkeley directed for Warner Bros. His directing style had changed or evolved to the point that this film does not contain his signature overhead shots or huge production numbers with thousands of extras.\n'
b"What happens to washed up rock-n-roll stars in the late 1990's? They launch a comeback / reunion tour. At least, that's what the members of Strange Fruit, a (fictional) 70's stadium rock group do.\n"
b'As a big-time Prince fan of the last three to four years, I really can\'t believe I\'ve only just got round to watching "Purple Rain". The brand new 2-disc anniversary Special Edition led me to buy it.\n'
b"This film is worth seeing alone for Jared Harris' outstanding portrayal of John Lennon. It doesn't matter that Harris doesn't exactly resemble Lennon; his mannerisms, expressions, posture, accent and attitude are pure Lennon.\n"
b"The funky, yet strictly second-tier British glam-rock band Strange Fruit breaks up at the end of the wild'n'wacky excess-ridden 70's. The individual band members go their separate ways and uncomfortably settle into lackluster middle age in the dull and uneventful 90's: morose keyboardist Stephen Rea winds up penniless and down on his luck, vain, neurotic, pretentious lead singer Bill Nighy tries (and fails) to pursue a floundering solo career, paranoid drummer Timothy Spall resides in obscurity on a remote farm so he can avoid paying a hefty back taxes debt, and surly bass player Jimmy Nail installs roofs for a living.\n"
b"I just finished reading a book on Anita Loos' work and the photo in TCM Magazine of MacDonald in her angel costume looked great (impressive wings), so I thought I'd watch this movie. I'd never heard of the film before, so I had no preconceived notions about it whatsoever.\n"
b'I love this movie!!! Purple Rain came out the year I was born and it has had my heart since I can remember. Prince is so tight in this movie.\n'
b"This movie is sort of a Carrie meets Heavy Metal. It's about a highschool guy who gets picked on alot and he totally gets revenge with the help of a Heavy Metal ghost.\n"
In [53]:
[figure output could not be rendered]