Kernel: Python [conda env:py37]
In [1]:
Unsupervised Learning and Preprocessing
Types of unsupervised learning
Challenges in unsupervised learning
Preprocessing and Scaling
In [2]:
Different Kinds of Preprocessing
Applying Data Transformations
In [3]:
(426, 30)
(143, 30)
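The two shapes above are consistent with a default 75/25 split of the 569-sample breast cancer dataset. A minimal sketch of how they could be produced, assuming `load_breast_cancer` and `train_test_split` were used (the `random_state=1` seed is an assumption and does not affect the shapes):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

# Default test_size is 0.25, so 569 samples split into 426 train / 143 test.
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

print(X_train.shape)
print(X_test.shape)
```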
In [4]:
In [5]:
MinMaxScaler(copy=True, feature_range=(0, 1))
In [6]:
transformed shape: (426, 30)
per-feature minimum before scaling:
[ 6.981 9.71 43.79 143.5 0.053 0.019 0. 0. 0.106
0.05 0.115 0.36 0.757 6.802 0.002 0.002 0. 0.
0.01 0.001 7.93 12.02 50.41 185.2 0.071 0.027 0.
0. 0.157 0.055]
per-feature maximum before scaling:
[ 28.11 39.28 188.5 2501. 0.163 0.287 0.427 0.201
0.304 0.096 2.873 4.885 21.98 542.2 0.031 0.135
0.396 0.053 0.061 0.03 36.04 49.54 251.2 4254.
0.223 0.938 1.17 0.291 0.577 0.149]
per-feature minimum after scaling:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.]
per-feature maximum after scaling:
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.]
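The output above shows every training feature mapped onto the range [0, 1]. A sketch of the likely steps, assuming the scaler was fit on the training split only (the split seed is an assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

# Learn the per-feature minimum and maximum from the training data only.
scaler = MinMaxScaler()
scaler.fit(X_train)

# Apply the learned shift and scale to the training data.
X_train_scaled = scaler.transform(X_train)

print("transformed shape: {}".format(X_train_scaled.shape))
print("per-feature minimum after scaling:\n{}".format(X_train_scaled.min(axis=0)))
print("per-feature maximum after scaling:\n{}".format(X_train_scaled.max(axis=0)))
```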
In [7]:
per-feature minimum after scaling:
[ 0.034 0.023 0.031 0.011 0.141 0.044 0. 0. 0.154 -0.006
-0.001 0.006 0.004 0.001 0.039 0.011 0. 0. -0.032 0.007
0.027 0.058 0.02 0.009 0.109 0.026 0. 0. -0. -0.002]
per-feature maximum after scaling:
[0.958 0.815 0.956 0.894 0.811 1.22 0.88 0.933 0.932 1.037 0.427 0.498
0.441 0.284 0.487 0.739 0.767 0.629 1.337 0.391 0.896 0.793 0.849 0.745
0.915 1.132 1.07 0.924 1.205 1.631]
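The test-set minima and maxima above are not exactly 0 and 1 because the scaler reuses the training minimum and range. The toy arrays below are hypothetical and only illustrate that `MinMaxScaler.transform` matches the manual formula `(x - train_min) / (train_max - train_min)`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: the test set reaches beyond the training range.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[0.5, 25.0], [4.0, 12.0]])

scaler = MinMaxScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# The same transformation written out by hand with training statistics.
train_min = X_train.min(axis=0)
train_range = X_train.max(axis=0) - train_min
manual = (X_test - train_min) / train_range

print(X_test_scaled)
```

Values below 0 or above 1 mark test points that lie outside the range seen during training.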
Scaling training and test data the same way
In [8]:
In [9]:
The effect of preprocessing on supervised learning
In [10]:
Test set accuracy: 0.63
In [11]:
Scaled test set accuracy: 0.97
In [12]:
SVM test accuracy: 0.96
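The accuracies above suggest a kernel SVM that improves sharply once the features are rescaled. A hedged sketch of such a comparison follows; `C=100`, `gamma="auto"` (the default in older scikit-learn releases) and `random_state=0` are assumptions, and the exact scores may differ from those printed above depending on the library version:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Fit on the raw features, whose scales differ by orders of magnitude.
svm = SVC(C=100, gamma="auto")
svm.fit(X_train, y_train)
raw_acc = svm.score(X_test, y_test)
print("Test set accuracy: {:.2f}".format(raw_acc))

# Fit the scaler on the training data, then apply it to both splits.
scaler = MinMaxScaler().fit(X_train)
svm.fit(scaler.transform(X_train), y_train)
scaled_acc = svm.score(scaler.transform(X_test), y_test)
print("Scaled test set accuracy: {:.2f}".format(scaled_acc))
```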
Dimensionality Reduction, Feature Extraction and Manifold Learning
Principal Component Analysis (PCA)
In [13]:
Applying PCA to the cancer dataset for visualization
In [14]:
In [15]:
In [16]:
Original shape: (569, 30)
Reduced shape: (569, 2)
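The shapes above correspond to projecting all 569 samples onto two principal components. A minimal sketch, assuming the features were standardized with `StandardScaler` before PCA (the scaler choice is an assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()

# PCA is sensitive to feature scale, so give each feature unit variance first.
X_scaled = StandardScaler().fit_transform(cancer.data)

# Keep only the first two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Original shape: {}".format(X_scaled.shape))
print("Reduced shape: {}".format(X_pca.shape))
```

Each row of `pca.components_` is one principal direction expressed in the original 30 features.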
In [17]:
[Figure: y-axis labeled 'Second principal component']
In [18]:
PCA component shape: (2, 30)
In [19]:
PCA components:
[[ 0.219 0.104 0.228 0.221 0.143 0.239 0.258 0.261 0.138 0.064
0.206 0.017 0.211 0.203 0.015 0.17 0.154 0.183 0.042 0.103
0.228 0.104 0.237 0.225 0.128 0.21 0.229 0.251 0.123 0.132]
[-0.234 -0.06 -0.215 -0.231 0.186 0.152 0.06 -0.035 0.19 0.367
-0.106 0.09 -0.089 -0.152 0.204 0.233 0.197 0.13 0.184 0.28
-0.22 -0.045 -0.2 -0.219 0.172 0.144 0.098 -0.008 0.142 0.275]]
In [20]:
[Figure: y-axis labeled 'Principal components']
Eigenfaces for feature extraction
In [21]:
In [22]:
people.images.shape: (3023, 87, 65)
Number of classes: 62
In [23]:
Alejandro Toledo 39 Alvaro Uribe 35 Amelie Mauresmo 21
Andre Agassi 36 Angelina Jolie 20 Ariel Sharon 77
Arnold Schwarzenegger 42 Atal Bihari Vajpayee 24 Bill Clinton 29
Carlos Menem 21 Colin Powell 236 David Beckham 31
Donald Rumsfeld 121 George Robertson 22 George W Bush 530
Gerhard Schroeder 109 Gloria Macapagal Arroyo 44 Gray Davis 26
Guillermo Coria 30 Hamid Karzai 22 Hans Blix 39
Hugo Chavez 71 Igor Ivanov 20 Jack Straw 28
Jacques Chirac 52 Jean Chretien 55 Jennifer Aniston 21
Jennifer Capriati 42 Jennifer Lopez 21 Jeremy Greenstock 24
Jiang Zemin 20 John Ashcroft 53 John Negroponte 31
Jose Maria Aznar 23 Juan Carlos Ferrero 28 Junichiro Koizumi 60
Kofi Annan 32 Laura Bush 41 Lindsay Davenport 22
Lleyton Hewitt 41 Luiz Inacio Lula da Silva 48 Mahmoud Abbas 29
Megawati Sukarnoputri 33 Michael Bloomberg 20 Naomi Watts 22
Nestor Kirchner 37 Paul Bremer 20 Pete Sampras 22
Recep Tayyip Erdogan 30 Ricardo Lagos 27 Roh Moo-hyun 32
Rudolph Giuliani 26 Saddam Hussein 23 Serena Williams 52
Silvio Berlusconi 33 Tiger Woods 23 Tom Daschle 25
Tom Ridge 33 Tony Blair 144 Vicente Fox 32
Vladimir Putin 49 Winona Ryder 24
In [24]:
In [25]:
Test set score of 1-nn: 0.23
In [26]:
In [27]:
X_train_pca.shape: (1547, 100)
In [28]:
Test set accuracy: 0.31
In [29]:
pca.components_.shape: (100, 5655)
In [30]:
In [31]:
In [32]:
In [33]:
[Figure: y-axis labeled 'Second principal component']
Non-Negative Matrix Factorization (NMF)
Applying NMF to synthetic data
In [34]:
Applying NMF to face images
In [35]:
In [36]:
In [37]:
In [38]:
[Figure: y-axis labeled 'Signal']
In [39]:
Shape of measurements: (2000, 100)
In [40]:
Recovered signal shape: (2000, 3)
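The two shapes above describe 2000 time steps observed through 100 mixed measurements, reduced back to 3 components. The sketch below reproduces only the shapes: the three source signals and the mixing matrix are hypothetical stand-ins, since the original signal generator is not shown.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)

# Three hypothetical non-negative source signals, one per column.
S = np.c_[np.sin(2 * t) + 1, np.abs(np.sin(3 * t)), t % 1.0]

# Mix the 3 sources into 100 non-negative measurements per time step.
A = rng.uniform(size=(100, 3))
X = S @ A.T
print("Shape of measurements: {}".format(X.shape))

# Factorize X into 3 non-negative components.
nmf = NMF(n_components=3, random_state=42, max_iter=500)
S_ = nmf.fit_transform(X)
print("Recovered signal shape: {}".format(S_.shape))
```

NMF requires non-negative input, which is why all three illustrative signals are shifted or rectified to stay at or above zero.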
In [41]:
In [42]:
Manifold Learning with t-SNE
In [43]:
In [44]:
[Figure: y-axis labeled 'Second principal component']
In [45]:
In [46]:
[Figure: x-axis labeled 't-SNE feature 1']
Clustering
k-Means clustering
In [47]:
In [48]:
In [49]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
In [50]:
Cluster memberships:
[0 2 2 2 1 1 1 2 0 0 2 2 1 0 1 1 1 0 2 2 1 2 1 0 2 1 1 0 0 1 0 0 1 0 2 1 2
2 2 1 1 2 0 2 2 1 0 0 0 0 2 1 1 1 0 1 2 2 0 0 2 1 1 2 2 1 0 1 0 2 2 2 1 0
0 2 1 1 0 2 0 2 2 1 0 0 0 0 2 0 1 0 0 2 2 1 1 0 1 0]
In [51]:
[0 2 2 2 1 1 1 2 0 0 2 2 1 0 1 1 1 0 2 2 1 2 1 0 2 1 1 0 0 1 0 0 1 0 2 1 2
2 2 1 1 2 0 2 2 1 0 0 0 0 2 1 1 1 0 1 2 2 0 0 2 1 1 2 2 1 0 1 0 2 2 2 1 0
0 2 1 1 0 2 0 2 2 1 0 0 0 0 2 0 1 0 0 2 2 1 1 0 1 0]
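The two identical label arrays above are what one would expect when comparing `kmeans.labels_` with `kmeans.predict` on the training data. A sketch under that assumption, using a synthetic `make_blobs` dataset (the dataset and seeds are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic two-dimensional data; 100 samples by default.
X, y = make_blobs(random_state=1)

# n_init is set explicitly because its default changed across versions.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

# Labels assigned during fitting.
labels = kmeans.labels_
print("Cluster memberships:\n{}".format(labels))

# Predicting on the training set assigns each point to its nearest center.
predicted = kmeans.predict(X)
print(predicted)
```

The cluster numbers themselves are arbitrary: rerunning with a different seed can permute them without changing the grouping.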
In [52]:
In [53]:
Failure cases of k-Means
In [54]:
[Figure: y-axis labeled 'Feature 1']
In [55]:
[Figure: y-axis labeled 'Feature 1']
In [56]:
[Figure: y-axis labeled 'Feature 1']
Vector Quantization - Or Seeing k-Means as Decomposition
In [57]:
In [58]:
[Figures: y-axis labeled 'nmf']
In [59]:
Cluster memberships:
[9 2 5 4 2 7 9 6 9 6 1 0 2 6 1 9 3 0 3 1 7 6 8 6 8 5 2 7 5 8 9 8 6 5 3 7 0
9 4 5 0 1 3 5 2 8 9 1 5 6 1 0 7 4 6 3 3 6 3 8 0 4 2 9 6 4 8 2 8 4 0 4 0 5
6 4 5 9 3 0 7 8 0 7 5 8 9 8 0 7 3 9 7 1 7 2 2 0 4 5 6 7 8 9 4 5 4 1 2 3 1
8 8 4 9 2 3 7 0 9 9 1 5 8 5 1 9 5 6 7 9 1 4 0 6 2 6 4 7 9 5 5 3 8 1 9 5 6
3 5 0 2 9 3 0 8 6 0 3 3 5 6 3 2 0 2 3 0 2 6 3 4 4 1 5 6 7 1 1 3 2 4 7 2 7
3 8 6 4 1 4 3 9 9 5 1 7 5 8 2]
In [60]:
Distance feature shape: (200, 10)
Distance features:
[[0.922 1.466 1.14 ... 1.166 1.039 0.233]
[1.142 2.517 0.12 ... 0.707 2.204 0.983]
[0.788 0.774 1.749 ... 1.971 0.716 0.944]
...
[0.446 1.106 1.49 ... 1.791 1.032 0.812]
[1.39 0.798 1.981 ... 1.978 0.239 1.058]
[1.149 2.454 0.045 ... 0.572 2.113 0.882]]
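The (200, 10) array above has one column per cluster center, which matches the output of `KMeans.transform`. A sketch assuming a 200-point `make_moons` dataset and 10 clusters (the dataset parameters and seeds are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# Use many more clusters than input features to get a richer representation.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# transform returns the distance from every point to every cluster center.
distance_features = kmeans.transform(X)
print("Distance feature shape: {}".format(distance_features.shape))
print("Distance features:\n{}".format(distance_features))
```

These distances can serve as a 10-dimensional representation of the original two-dimensional points.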
Agglomerative Clustering
In [61]:
In [62]:
[Figure: y-axis labeled 'Feature 1']
Hierarchical Clustering and Dendrograms
In [63]:
In [64]:
[Figure: dendrogram, y-axis labeled 'Cluster distance']
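A dendrogram with a cluster-distance axis can be built from a SciPy linkage array. The sketch below assumes Ward linkage on a 12-point `make_blobs` dataset (both are assumptions) and uses `no_plot=True` so that it runs without a plotting backend:

```python
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.datasets import make_blobs

X, y = make_blobs(random_state=0, n_samples=12)

# Each row of the linkage array records one merge:
# the two merged clusters, their distance, and the size of the new cluster.
linkage_array = ward(X)
print("Linkage array shape: {}".format(linkage_array.shape))

# Compute the dendrogram layout without drawing it.
tree = dendrogram(linkage_array, no_plot=True)
print("Leaf order: {}".format(tree["leaves"]))
```

With n points there are always n - 1 merges, so 12 points give 11 rows. Dropping `no_plot=True` draws the tree with matplotlib.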
DBSCAN
In [65]:
Cluster memberships:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
In [66]:
min_samples: 2 eps: 1.000000 cluster: [-1 0 0 -1 0 -1 1 1 0 1 -1 -1]
min_samples: 2 eps: 1.500000 cluster: [0 1 1 1 1 0 2 2 1 2 2 0]
min_samples: 2 eps: 2.000000 cluster: [0 1 1 1 1 0 0 0 1 0 0 0]
min_samples: 2 eps: 3.000000 cluster: [0 0 0 0 0 0 0 0 0 0 0 0]
min_samples: 3 eps: 1.000000 cluster: [-1 0 0 -1 0 -1 1 1 0 1 -1 -1]
min_samples: 3 eps: 1.500000 cluster: [0 1 1 1 1 0 2 2 1 2 2 0]
min_samples: 3 eps: 2.000000 cluster: [0 1 1 1 1 0 0 0 1 0 0 0]
min_samples: 3 eps: 3.000000 cluster: [0 0 0 0 0 0 0 0 0 0 0 0]
min_samples: 5 eps: 1.000000 cluster: [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
min_samples: 5 eps: 1.500000 cluster: [-1 0 0 0 0 -1 -1 -1 0 -1 -1 -1]
min_samples: 5 eps: 2.000000 cluster: [-1 0 0 0 0 -1 -1 -1 0 -1 -1 -1]
min_samples: 5 eps: 3.000000 cluster: [0 0 0 0 0 0 0 0 0 0 0 0]
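The table above shows two trends: a larger `eps` merges points into fewer clusters, and a larger `min_samples` turns more points into noise (label -1). The hand-made points below are hypothetical and chosen so that each outcome can be verified by hand:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of three points each, plus one isolated point.
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
              [10.0, 0.0]])

results = {}
for min_samples, eps in [(3, 0.5), (5, 0.5), (3, 8.0)]:
    # fit_predict returns one label per point; -1 marks noise.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    results[(min_samples, eps)] = labels.tolist()
    print("min_samples: {} eps: {:f} cluster: {}".format(
        min_samples, eps, labels))
```

With `min_samples=3` and `eps=0.5` the two groups become clusters and the isolated point is noise. Raising `min_samples` to 5 leaves no core points, so everything is noise. Raising `eps` to 8.0 connects all points into a single cluster.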
In [67]:
[Figure: y-axis labeled 'Feature 1']
Comparing and evaluating clustering algorithms
Evaluating clustering with ground truth
In [68]:
In [69]:
Accuracy: 0.00
ARI: 1.00
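An accuracy of 0.00 alongside an ARI of 1.00 arises when two labelings describe the same grouping under swapped cluster names. The two small label lists below are illustrative:

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

# The same partition of five points, with the names 0 and 1 swapped.
clusters1 = [0, 0, 1, 1, 0]
clusters2 = [1, 1, 0, 0, 1]

# Accuracy demands identical label values, so every point counts as wrong.
acc = accuracy_score(clusters1, clusters2)
print("Accuracy: {:.2f}".format(acc))

# ARI only asks which points share a cluster, so the match is perfect.
ari = adjusted_rand_score(clusters1, clusters2)
print("ARI: {:.2f}".format(ari))
```

Because cluster labels carry no inherent meaning, metrics such as ARI are the appropriate choice for comparing clusterings.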
Evaluating clustering without ground truth
In [70]:
Comparing algorithms on the faces dataset
In [71]:
Analyzing the faces dataset with DBSCAN
In [72]:
Unique labels: [-1]
In [73]:
Unique labels: [-1]
In [74]:
Unique labels: [-1 0]
In [75]:
Number of points per cluster: [ 32 2031]
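If the counts above were produced with `np.bincount(labels + 1)` (an assumption, since the cell's code is not shown), the first entry counts noise points and the second counts members of cluster 0. The shift by one is needed because `np.bincount` rejects negative values. A toy example with hypothetical labels:

```python
import numpy as np

# Hypothetical DBSCAN output: -1 marks noise, 0 is the only cluster.
labels = np.array([-1, 0, 0, -1, 0, 0])

# Shift labels by one so that noise (-1) lands in bin 0.
counts = np.bincount(labels + 1)
print("Number of points per cluster: {}".format(counts))
```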
In [76]:
In [77]:
eps=1
Number of clusters: 1
Cluster sizes: [2063]
eps=3
Number of clusters: 1
Cluster sizes: [2063]
eps=5
Number of clusters: 1
Cluster sizes: [2063]
eps=7
Number of clusters: 14
Cluster sizes: [2004 3 14 7 4 3 3 4 4 3 3 5 3 3]
eps=9
Number of clusters: 4
Cluster sizes: [1307 750 3 3]
eps=11
Number of clusters: 2
Cluster sizes: [ 413 1650]
eps=13
Number of clusters: 2
Cluster sizes: [ 120 1943]
In [78]:
Analyzing the faces dataset with k-Means
In [79]:
Cluster sizes k-means: [155 175 238 75 358 257 91 219 323 172]
In [80]:
In [81]:
Analyzing the faces dataset with agglomerative clustering
In [82]:
cluster sizes agglomerative clustering: [169 660 144 329 217 85 18 261 31 149]
In [83]:
ARI: 0.09
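An ARI this close to 0 indicates that the two clusterings being compared have little in common. The sketch below shows the same kind of comparison on a hypothetical, well-separated `make_blobs` dataset (not the faces data), where k-means and agglomerative clustering should agree almost perfectly:

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three compact, widely separated blobs (illustrative data only).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_agg = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# ARI is symmetric and ignores how the clusters are numbered.
ari = adjusted_rand_score(labels_agg, labels_km)
print("ARI: {:.2f}".format(ari))
```

On data without clear cluster structure, such as the face images, the two algorithms can partition the points very differently, which drives the score toward 0.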
In [84]:
[Figure: dendrogram, y-axis labeled 'Cluster distance']
In [85]:
In [86]:
cluster sizes agglomerative clustering: [ 43 120 100 194 56 58 127 22 6 37 65 49 84 18 168 44 47 31
78 30 166 20 57 14 11 29 23 5 8 84 67 30 57 16 22 12
29 2 26 8]
Summary of Clustering Methods
Summary and Outlook
In [87]: