Authors: Tim Yao, Daniel You, phonchi chung
Description: Jupyter notebook Lab4-Clustering.ipynb

# Clustering

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# use seaborn plotting defaults
import seaborn as sns; sns.set()


## Introduction to K-means

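K-means learns the underlying pattern from the data alone, without using labels. The algorithm itself is quite simple. Below we randomly generate some data points.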

In [2]:
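# make_blobs is importable from sklearn.datasets directly;
# the samples_generator submodule was removed in newer scikit-learn releases
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=0.60)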
plt.scatter(X[:, 0], X[:, 1], s=50);


In [3]:
from sklearn.cluster import KMeans
est = KMeans(4)  # 4 clusters
est.fit(X)
y_kmeans = est.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='rainbow');


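K-means uses an Expectation-Maximization (EM) style algorithm, which consists of two parts:

1. Guess the initial cluster centers.
2. Repeat the following steps until convergence:
   A. Assign each data point to the nearest cluster center.
   B. Set each cluster's center to the mean of the points newly assigned to it.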
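
The two parts listed above can be sketched in a few lines of NumPy. This is only a minimal illustration, not how `sklearn.cluster.KMeans` is implemented: the function name `kmeans_sketch`, its parameters, the random choice of initial centers from the data, and the toy array `X_demo` are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    """Hypothetical helper: plain k-means following the steps listed above."""
    rng = np.random.RandomState(seed)
    # Step 1: initial guess, here n_clusters distinct data points chosen at random
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Step 2A: assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2B: move each center to the mean of its assigned points
        # (a center with no assigned points is left where it is)
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Toy data (assumed for illustration): two well-separated 2-D blobs
rng_demo = np.random.RandomState(42)
X_demo = np.vstack([rng_demo.randn(50, 2),
                    rng_demo.randn(50, 2) + [10, 10]])
centers_demo, labels_demo = kmeans_sketch(X_demo, 2)
```

The result depends on the initial guess, so a sketch like this can settle in a poor local optimum on harder data; rerunning with several seeds is one common way to reduce that risk.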
In [9]:
from fig_code import plot_kmeans_interactive
plot_kmeans_interactive();

None

## Classifying digits with K-means

In [10]:
from sklearn.datasets import load_digits
digits = load_digits() # Helper function

In [11]:
est = KMeans(n_clusters=10)
clusters = est.fit_predict(digits.data)
est.cluster_centers_.shape

(10, 64)

In [12]:
fig = plt.figure(figsize=(8, 3))
for i in range(10):
    # Show each 64-dimensional cluster center as an 8x8 image
    ax = fig.add_subplot(2, 5, 1 + i, xticks=[], yticks=[])
    ax.imshow(est.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)


In [8]:
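from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
    # Relabel cluster i with the most common true digit among its members
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]
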
from sklearn.decomposition import PCA

X = PCA(2).fit_transform(digits.data)

# plt.get_cmap replaces plt.cm.get_cmap, which newer matplotlib releases removed
kwargs = dict(cmap=plt.get_cmap('rainbow', 10),
              edgecolor='none', alpha=0.6)
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
ax[0].scatter(X[:, 0], X[:, 1], c=labels, **kwargs)
ax[0].set_title('learned cluster labels')

ax[1].scatter(X[:, 0], X[:, 1], c=digits.target, **kwargs)
ax[1].set_title('true labels');


In [11]:
from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)

0.78853644963828606

In [12]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(digits.target, labels))

plt.imshow(confusion_matrix(digits.target, labels),
           cmap='Blues', interpolation='nearest')
plt.colorbar()
plt.grid(False)
plt.ylabel('true')
plt.xlabel('predicted');

[[177   0   0   0   1   0   0   0   0   0]
 [  0 154  24   1   0   1   2   0   0   0]
 [  1  10 147  13   0   0   0   4   0   2]
 [  0   7   0 155   0   2   0   7   0  12]
 [  0   9   0   0 162   0   0  10   0   0]
 [  0   0   0   1   2 136   1   0   0  42]
 [  1   3   0   0   0   0 177   0   0   0]
 [  0   4   0   0   0   5   0 170   0   0]
 [  0 105   3   2   0   7   2   3   0  52]
 [  0  20   0   6   0   7   0   8   0 139]]
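
As a rough reading aid for the confusion matrix printed above, the sketch below hard-codes those counts (copied from the output, not recomputed) and derives the recall for each digit along with the overall accuracy. The names `cm`, `recall`, and `overall_accuracy` are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Counts copied from the printed confusion matrix (rows: true digit, columns: predicted)
cm = np.array([
    [177,   0,   0,   0,   1,   0,   0,   0,   0,   0],
    [  0, 154,  24,   1,   0,   1,   2,   0,   0,   0],
    [  1,  10, 147,  13,   0,   0,   0,   4,   0,   2],
    [  0,   7,   0, 155,   0,   2,   0,   7,   0,  12],
    [  0,   9,   0,   0, 162,   0,   0,  10,   0,   0],
    [  0,   0,   0,   1,   2, 136,   1,   0,   0,  42],
    [  1,   3,   0,   0,   0,   0, 177,   0,   0,   0],
    [  0,   4,   0,   0,   0,   5,   0, 170,   0,   0],
    [  0, 105,   3,   2,   0,   7,   2,   3,   0,  52],
    [  0,  20,   0,   6,   0,   7,   0,   8,   0, 139],
])

# Recall per true digit: correct predictions divided by the row total
recall = cm.diagonal() / cm.sum(axis=1).astype(float)

# Overall accuracy: sum of the diagonal divided by the total count
overall_accuracy = cm.trace() / float(cm.sum())

for digit, r in enumerate(recall):
    print("digit %d: recall %.2f" % (digit, r))
print("overall accuracy: %.4f" % overall_accuracy)
```

In this particular run no cluster was mapped to the digit 8, so its column is empty and its recall is zero. Most true 8s (105 of 174) were assigned to 1 and another 52 to 9.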