Data Cleaning and Preprocessing
Counting the unique values in each column raises a `TypeError`, because `np.unique` sorts its input and our columns with missing values mix `NaN` (a float) with strings:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-55bb33e8ea81> in <module>
      1 # Count the unique values in our data set.
      2 for column in pokemon:
----> 3     unique_vals = np.unique(pokemon[column])
      4     nr_values = len(unique_vals)
      5     if nr_values < 10:

/anaconda3/lib/python3.7/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
    231     ar = np.asanyarray(ar)
    232     if axis is None:
--> 233         ret = _unique1d(ar, return_index, return_inverse, return_counts)
    234         return _unpack_tuple(ret)
    235

/anaconda3/lib/python3.7/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
    279         aux = ar[perm]
    280     else:
--> 281         ar.sort()
    282         aux = ar
    283         mask = np.empty(aux.shape, dtype=np.bool_)

TypeError: '<' not supported between instances of 'float' and 'str'
```
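One way around this is to use pandas' `Series.unique()`, which does not sort, instead of `np.unique` (or to fill the missing values first). A minimal sketch on a toy frame, since the full data set isn't shown here (the column names and values below are illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the pokemon DataFrame: 'type2' mixes strings
# and NaN (a float), which is exactly what broke np.unique's sort.
pokemon = pd.DataFrame({
    "type1": ["grass", "fire", "water", "grass"],
    "type2": ["poison", np.nan, np.nan, "poison"],
})

for column in pokemon:
    # Series.unique() returns values in order of appearance without sorting,
    # so mixed NaN/str columns are fine. An alternative is
    # np.unique(pokemon[column].fillna("none")).
    unique_vals = pokemon[column].unique()
    nr_values = len(unique_vals)
    if nr_values < 10:
        print(f"{column}: {nr_values} unique values -> {unique_vals}")
```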
Feature Engineering - Numeric Representation
|   | against_bug_0.25 | against_bug_0.5 | against_bug_1.0 | against_bug_2.0 | against_bug_4.0 | against_dark_0.25 | against_dark_0.5 | against_dark_1.0 | against_dark_2.0 | against_dark_4.0 | ... | type2_water | generation_1 | generation_2 | generation_3 | generation_4 | generation_5 | generation_6 | generation_7 | is_legendary_0 | is_legendary_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |

5 rows × 1669 columns
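A table like the one above comes from one-hot encoding every column, so that each (column, value) pair becomes its own 0/1 feature. A minimal sketch with `pd.get_dummies` on a small hypothetical subset of the data (column names and values are illustrative, not the full 1669-column result):

```python
import pandas as pd

# Hypothetical subset of the cleaned pokemon DataFrame.
pokemon = pd.DataFrame({
    "against_bug": [1.0, 1.0, 1.0, 0.5],
    "type2": ["poison", "poison", "none", "water"],
    "generation": [1, 1, 1, 1],
    "is_legendary": [0, 0, 0, 1],
})

# One-hot encode every column: each distinct value becomes a 0/1 column,
# e.g. against_bug_0.5, type2_water, generation_1, is_legendary_0.
encoded = pd.get_dummies(pokemon, columns=pokemon.columns.tolist())
print(encoded.columns.tolist())
```

Each row of the encoded frame then contains exactly one 1 per original column.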
K Means Clustering
What is Inertia?
Inertia is the within-cluster sum-of-squares criterion: the sum of squared distances between each point and the centroid of its assigned cluster. It is a metric of how internally coherent the clusters are. Inertia assumes that clusters are convex and isotropic, so it performs poorly on elongated clusters or manifolds with irregular shapes; hierarchical clustering is usually a better choice there. Inertia is also unreliable in very high-dimensional spaces, where Euclidean distances become inflated, and it is not a normalised metric, so its raw value is hard to interpret on its own.
| Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 |
|---|---|---|---|---|---|---|
| 113 | 151 | 90 | 163 | 132 | 89 | 63 |
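As a sketch of how inertia behaves in practice, here is a minimal elbow-style loop with scikit-learn's `KMeans` on random stand-in data (the real encoded feature matrix would be used instead):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))  # stand-in for the encoded feature matrix

# Inertia (km.inertia_) is the within-cluster sum of squares. It decreases
# monotonically as k grows, so it is read from an "elbow" plot rather than
# used as an absolute quality score.
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

print(inertias)
```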
Running Principal Component Analysis (PCA) to Visualize & Improve Results
- What is it?
PCA is a dimensionality reduction technique that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. The goal of PCA is to extract the most important information from the data table, compressing the size of the data set while keeping only that important information; this information is then expressed through the new variables, the principal components.
- Additional Info:
The first principal component is constructed to have the largest possible variance (inertia), so it explains the largest share of the total inertia of the data table; consequently, the fewer principal components you keep, the less inertia is retained after PCA. Every component you add increases the retained inertia, but each successive component explains less than the one before it, since the components are computed in order of decreasing variance. Retained inertia should therefore not be the criterion for choosing the number of principal components: it grows monotonically as you add components. A better criterion is to keep enough components to reach 95% explained variance.
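The 95% rule can be sketched with scikit-learn's `PCA` and its `explained_variance_ratio_` attribute; the data below is random stand-in data, and the 0.95 threshold is the criterion described above:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Random data multiplied by a mixing matrix, so the 20 features are correlated.
X = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 20))

pca = PCA().fit(X)

# explained_variance_ratio_ is sorted in decreasing order: each successive
# component explains less variance, and the cumulative sum reaches 1.0.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance >= 95%.
n_components = int(np.searchsorted(cumulative, 0.95)) + 1
print(n_components)
```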