Jupyter notebook 2018-19/Week 11 - Clustering/handouts/IrisCluster-Interactive.ipynb
Demonstration of k-means and hierarchical clustering
Information on the iris data here: https://en.wikipedia.org/wiki/Iris_flower_data_set
Information on the sklearn machine learning library here: http://scikit-learn.org/
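We evaluate the different clustering methods using the adjusted Rand index, a chance-corrected measure of agreement with the true labels. A value of 1 denotes perfect agreement, values near 0 indicate agreement no better than random labelling, and negative values are possible. This evaluation is only possible when class labels are known.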
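As a small illustration of how the index behaves, the sketch below uses `adjusted_rand_score` from `sklearn.metrics` on made-up toy labels (the label lists are hypothetical and not taken from the iris data). It shows that the score ignores how the clusters are named and that it can drop below zero.

```python
from sklearn.metrics import adjusted_rand_score

# Toy labels, invented purely for illustration
true_labels = [0, 0, 1, 1]

# Same grouping, but the cluster names are swapped
renamed = [1, 1, 0, 0]

# Every cluster mixes the two true classes
mixed = [0, 1, 0, 1]

score_renamed = adjusted_rand_score(true_labels, renamed)
score_mixed = adjusted_rand_score(true_labels, mixed)

print("renamed clusters:", score_renamed)
print("mixed clusters:  ", score_mixed)
```

The first score is 1 because only the grouping matters, not the label values. The second is negative, which means the agreement is worse than expected by chance.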
Import Iris dataset from sklearn and put into a pandas DataFrame
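A minimal sketch of this step, assuming `load_iris` from `sklearn.datasets`. The shortened column names and the use of the species name as the row index are choices made here to match the summary table, not necessarily what the original notebook cell did.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Column names chosen to match the summary table (assumption)
feature_names = ["Sepal length", "Sepal width", "Petal length", "Petal width"]

# Map the integer class labels (0, 1, 2) to species names
species = iris.target_names[iris.target]

# Rows are samples, columns are features, index is the species name
df = pd.DataFrame(iris.data, columns=feature_names, index=species)

print(df.shape)
print(df.head())
```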
Data summary: rows are samples, columns are features
Species | Sepal length (cm) | Sepal width (cm) | Petal length (cm) | Petal width (cm) |
---|---|---|---|---|
setosa | 5.1 | 3.5 | 1.4 | 0.2 |
setosa | 4.9 | 3.0 | 1.4 | 0.2 |
setosa | 4.7 | 3.2 | 1.3 | 0.2 |
setosa | 4.6 | 3.1 | 1.5 | 0.2 |
setosa | 5.0 | 3.6 | 1.4 | 0.2 |
Calculate PCA so that we can visualise the clusters of data in 2D
We will only use PCA to help visualise the data; it plays no role in any of the clustering algorithms used here.
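The projection could be computed along these lines, assuming `PCA` from `sklearn.decomposition` with two components. The variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples x 4 features

# Project the 4-dimensional data onto its first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)
# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
```

The two columns of `X_pca` can then serve as the x and y coordinates of a scatter plot.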
Run the k-means algorithm
Return the cluster labels for each data item
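A minimal sketch of these two steps, assuming `KMeans` from `sklearn.cluster`. The choices of `n_clusters=3`, `n_init=10` and `random_state=0` are assumptions made here for reproducibility, not settings confirmed by the handout.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # clustering uses the original 4 features

# Assumed settings: 3 clusters, 10 restarts, fixed seed
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# Fit the model and return one cluster label per data item
cluster_labels = kmeans.fit_predict(X)

print(cluster_labels[:10])
print(kmeans.cluster_centers_.shape)
```

The integer values of the cluster labels are arbitrary, so cluster 0 need not correspond to class 0.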
Compare cluster labels and class labels on the PCA projection
Note that the clustering was not done in the PCA space; it was done in the original 4-dimensional data space.
What does the Adjusted Rand Index tell us here?
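One way to explore this question is to compute the index for several values of k. The sketch below assumes `KMeans` and `adjusted_rand_score` from scikit-learn. The range of k and the fixed seed are choices made here, and the exact scores depend on the scikit-learn version and the seed.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X, y_true = iris.data, iris.target

ari_by_k = {}
for k in range(2, 7):  # assumed range of cluster counts
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Compare the cluster labels with the known species labels
    ari_by_k[k] = adjusted_rand_score(y_true, labels)

for k, ari in ari_by_k.items():
    print(f"k={k}: adjusted Rand index = {ari:.3f}")
```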
Hierarchical clustering
For a very nice tutorial see this page.
Each row of the linkage matrix describes one merge step of the algorithm. The last column gives the size of the cluster formed at that step. Already in the second step we see a cluster of three points.
The first two columns are the indices of the two items being merged. So in the first step point 9 and point 34 are merged into a single cluster. Any index at or above the data size (here we have 150 points) denotes a cluster formed in an earlier step rather than an original point. The second row shows 37 and 150 merging, but point 150 does not exist because the data indices run from 0 to 149 (remember that Python uses 0-based counting). So this denotes the merging of point 37 with a new 'point' 150, which represents the cluster containing 9 and 34. Every time a cluster is formed it is given the next index, so the cluster containing 9, 34 and 37 will be numbered 151, and so on.
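The indexing rule can be checked with a short sketch, assuming `linkage` from `scipy.cluster.hierarchy`. The handout does not state which linkage method was used, so `method="ward"` is an assumption, and `members` is a hypothetical helper written for this example. The exact merges may differ from those described above depending on the method and on the version of the dataset.

```python
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import load_iris

X = load_iris().data
n = X.shape[0]  # 150 original points, indexed 0..149

# Assumed linkage method; the handout does not say which one it uses
Z = linkage(X, method="ward")

def members(idx, Z, n):
    """Return the original point indices contained in item `idx`."""
    idx = int(idx)
    if idx < n:
        return [idx]  # an original data point
    # Index n + i refers to the cluster formed in row i of Z
    left, right = Z[idx - n, 0], Z[idx - n, 1]
    return members(left, Z, n) + members(right, Z, n)

# Columns: item index, item index, merge distance, cluster size
print(Z[:3])

# Points in the cluster created by the second merge (index n + 1)
print(sorted(members(n + 1, Z, n)))
```

The number of points returned for cluster `n + i` always equals the last column of row `i`.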
Now let us plot a dendrogram summarising the clustering.
Each cutoff height defines a specific number of clusters. We show these on the PCA plot below the dendrogram.
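The effect of the cutoff can be sketched with `fcluster` from `scipy.cluster.hierarchy`, which cuts the tree at a given height. The Ward linkage and the cutoff values below are assumptions chosen for illustration, not the settings of the interactive widget.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X, y_true = iris.data, iris.target

Z = linkage(X, method="ward")  # assumed linkage method

cluster_counts = []
for cutoff in (5.0, 10.0, 20.0):  # illustrative cutoff heights
    # Merges above the cutoff height are undone, leaving flat clusters
    labels = fcluster(Z, t=cutoff, criterion="distance")
    n_clusters = len(set(labels))
    cluster_counts.append(n_clusters)
    ari = adjusted_rand_score(y_true, labels)
    print(f"cutoff={cutoff}: {n_clusters} clusters, ARI={ari:.3f}")

# Alternatively, ask directly for at most three clusters
labels_three = fcluster(Z, t=3, criterion="maxclust")
```

Raising the cutoff can only merge clusters, so the number of clusters never increases as the cutoff grows.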