📚 The CoCalc Library - books, templates and other resources
Path: cocalc-examples / introduction_to_ml_with_python / 04-representing-data-feature-engineering.ipynb
License: OTHER
Kernel: Python [conda env:py37]
In [1]:
Representing Data and Engineering Features
Categorical Variables
One-Hot-Encoding (Dummy variables)
In [2]:
|   | age | workclass | education | gender | hours-per-week | occupation | income |
|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Male | 40 | Adm-clerical | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Male | 13 | Exec-managerial | <=50K |
| 2 | 38 | Private | HS-grad | Male | 40 | Handlers-cleaners | <=50K |
| 3 | 53 | Private | 11th | Male | 40 | Handlers-cleaners | <=50K |
| 4 | 28 | Private | Bachelors | Female | 40 | Prof-specialty | <=50K |
Checking string-encoded categorical data
In [3]:
Male 21790
Female 10771
Name: gender, dtype: int64
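The input code cells were not preserved in this export. As a minimal, self-contained sketch of the check above (toy data, not the actual adult dataset), `value_counts` lists each distinct string value with its frequency, which is a quick way to spot typos or inconsistent encodings in a categorical column:

```python
import pandas as pd

# Toy frame standing in for the adult data.
df = pd.DataFrame({"gender": ["Male", "Female", "Male", "Male"]})
counts = df["gender"].value_counts()
print(counts)
```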
In [4]:
Original features:
['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income']
Features after get_dummies:
['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'gender_ Female', 'gender_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners', 'occupation_ Machine-op-inspct', 'occupation_ Other-service', 'occupation_ Priv-house-serv', 'occupation_ Prof-specialty', 'occupation_ Protective-serv', 'occupation_ Sales', 'occupation_ Tech-support', 'occupation_ Transport-moving', 'income_ <=50K', 'income_ >50K']
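A hedged sketch of how `pd.get_dummies` produces a column list like the one above: it one-hot-encodes every object or categorical column and passes numeric columns through unchanged. (The stray spaces in names like `workclass_ ?` come from leading whitespace in the raw data values, not from `get_dummies` itself.)

```python
import pandas as pd

# Tiny illustrative frame, not the actual adult dataset.
demo = pd.DataFrame({"age": [39, 50],
                     "workclass": ["State-gov", "Private"]})
dummies = pd.get_dummies(demo)
# Numeric 'age' is kept as-is; 'workclass' becomes one indicator
# column per distinct value.
print(list(dummies.columns))
```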
In [5]:
|   | age | hours-per-week | workclass_ ? | workclass_ Federal-gov | ... | occupation_ Tech-support | occupation_ Transport-moving | income_ <=50K | income_ >50K |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 40 | 0 | 0 | ... | 0 | 0 | 1 | 0 |
| 1 | 50 | 13 | 0 | 0 | ... | 0 | 0 | 1 | 0 |

2 rows × 46 columns
In [6]:
X.shape: (32561, 44) y.shape: (32561,)
In [7]:
Test score: 0.81
Numbers Can Encode Categoricals
In [8]:
|   | Integer Feature | Categorical Feature |
|---|---|---|
| 0 | 0 | socks |
| 1 | 1 | fox |
| 2 | 2 | socks |
| 3 | 1 | box |
In [9]:
|   | Integer Feature | Categorical Feature_box | Categorical Feature_fox | Categorical Feature_socks |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 |
| 2 | 2 | 0 | 0 | 1 |
| 3 | 1 | 1 | 0 | 0 |
In [10]:
|   | Integer Feature_0 | Integer Feature_1 | Integer Feature_2 | Categorical Feature_box | Categorical Feature_fox | Categorical Feature_socks |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 | 1 | 0 | 0 |
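The table above comes from telling pandas to treat the integer column as categorical too. A sketch, assuming the same toy frame: convert the integers to strings (or pass `columns=` to `get_dummies`) so both columns get dummy-encoded.

```python
import pandas as pd

demo_df = pd.DataFrame({"Integer Feature": [0, 1, 2, 1],
                        "Categorical Feature": ["socks", "fox", "socks", "box"]})
# get_dummies would leave the integer column untouched; casting it to
# string makes it categorical, so both columns are dummy-encoded.
demo_df["Integer Feature"] = demo_df["Integer Feature"].astype(str)
all_dummies = pd.get_dummies(demo_df)
print(list(all_dummies.columns))   # 3 + 3 = 6 indicator columns
```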
OneHotEncoder and ColumnTransformer: Categorical Variables with scikit-learn
In [11]:
[[1. 0. 0. 0. 0. 1.]
[0. 1. 0. 0. 1. 0.]
[0. 0. 1. 0. 0. 1.]
[0. 1. 0. 1. 0. 0.]]
In [12]:
['x0_0' 'x0_1' 'x0_2' 'x1_box' 'x1_fox' 'x1_socks']
In [13]:
|   | age | workclass | education | gender | hours-per-week | occupation | income |
|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Male | 40 | Adm-clerical | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Male | 13 | Exec-managerial | <=50K |
| 2 | 38 | Private | HS-grad | Male | 40 | Handlers-cleaners | <=50K |
| 3 | 53 | Private | 11th | Male | 40 | Handlers-cleaners | <=50K |
| 4 | 28 | Private | Bachelors | Female | 40 | Prof-specialty | <=50K |
In [14]:
In [15]:
(24420, 44)
/home/andy/checkout/scikit-learn/sklearn/preprocessing/data.py:617: DataConversionWarning: Data with input dtype int64 were all converted to float64 by StandardScaler.
return self.partial_fit(X, y)
/home/andy/checkout/scikit-learn/sklearn/base.py:462: DataConversionWarning: Data with input dtype int64 were all converted to float64 by StandardScaler.
return self.fit(X, **fit_params).transform(X)
/home/andy/checkout/scikit-learn/sklearn/pipeline.py:605: DataConversionWarning: Data with input dtype int64 were all converted to float64 by StandardScaler.
res = transformer.transform(X)
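The cell inputs are missing here; as a self-contained sketch of the `ColumnTransformer` pattern these outputs come from, continuous columns are scaled and categorical ones one-hot-encoded inside a single object (toy frame below, with column names borrowed from the adult data):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"age": [39, 50, 38],
                   "hours-per-week": [40, 13, 40],
                   "workclass": ["State-gov", "Self-emp-not-inc", "Private"]})
# Each (name, transformer, columns) triple is applied to its own
# column subset; the results are concatenated side by side.
ct = ColumnTransformer(
    [("scaling", StandardScaler(), ["age", "hours-per-week"]),
     ("onehot", OneHotEncoder(), ["workclass"])])
X = ct.fit_transform(df)
print(X.shape)   # 2 scaled columns + 3 indicator columns
```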
In [16]:
Test score: 0.81
/home/andy/checkout/scikit-learn/sklearn/pipeline.py:605: DataConversionWarning: Data with input dtype int64 were all converted to float64 by StandardScaler.
res = transformer.transform(X)
In [17]:
OneHotEncoder(categorical_features=None, categories=None,
dtype=<class 'numpy.float64'>, handle_unknown='error',
n_values=None, sparse=False)
Convenient ColumnTransformer creation with make_column_transformer
In [18]:
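A sketch of `make_column_transformer`, which builds the same transformer as above but names each step automatically after its class (the argument order is `(transformer, columns)` in scikit-learn 0.22 and later):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"age": [39, 50, 38],
                   "hours-per-week": [40, 13, 40],
                   "workclass": ["State-gov", "Self-emp-not-inc", "Private"]})
ct = make_column_transformer(
    (StandardScaler(), ["age", "hours-per-week"]),
    (OneHotEncoder(), ["workclass"]))
X = ct.fit_transform(df)
print(X.shape)
```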
Binning, Discretization, Linear Models, and Trees
In [19]:
In [20]:
In [21]:
bin edges:
[array([-2.967, -2.378, -1.789, -1.2 , -0.612, -0.023, 0.566, 1.155,
1.744, 2.333, 2.921])]
In [22]:
<120x10 sparse matrix of type '<class 'numpy.float64'>'
with 120 stored elements in Compressed Sparse Row format>
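A sketch of the binning step behind these outputs, assuming 120 synthetic points of a single continuous feature: `KBinsDiscretizer` with ten equal-width bins stores eleven bin edges and, with its default one-hot encoding, emits one sparse indicator column per bin.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(120, 1))   # synthetic stand-in data
kb = KBinsDiscretizer(n_bins=10, strategy="uniform")  # encode="onehot" default
X_binned = kb.fit_transform(X)
print(kb.bin_edges_)        # 11 edges delimiting the 10 bins
print(X_binned.shape)       # one indicator column per bin
```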
In [23]:
[[-0.753]
[ 2.704]
[ 1.392]
[ 0.592]
[-2.064]
[-2.064]
[-2.651]
[ 2.197]
[ 0.607]
[ 1.248]]
array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]])
In [24]:
In [25]:
[figure output; x-axis label: 'Input feature']
Interactions and Polynomials
In [26]:
(120, 11)
In [27]:
In [28]:
(120, 20)
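The shapes `(120, 11)` and `(120, 20)` above suggest stacking the original feature next to the bin indicators, then adding products of the feature with each indicator so a linear model can fit a separate slope per bin. A hedged, self-contained sketch:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(120, 1))
kb = KBinsDiscretizer(n_bins=10, strategy="uniform", encode="onehot-dense")
X_binned = kb.fit_transform(X)

# Original feature + 10 bin indicators: one global slope, per-bin offsets.
X_combined = np.hstack([X, X_binned])
# Indicators + feature*indicator products: a separate slope per bin.
X_product = np.hstack([X_binned, X * X_binned])
print(X_combined.shape, X_product.shape)
```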
In [29]:
In [30]:
In [31]:
X_poly.shape: (120, 10)
In [32]:
Entries of X:
[[-0.753]
[ 2.704]
[ 1.392]
[ 0.592]
[-2.064]]
Entries of X_poly:
[[ -0.753 0.567 -0.427 0.321 -0.242 0.182 -0.137
0.103 -0.078 0.058]
[ 2.704 7.313 19.777 53.482 144.632 391.125 1057.714
2860.36 7735.232 20918.278]
[ 1.392 1.938 2.697 3.754 5.226 7.274 10.125
14.094 19.618 27.307]
[ 0.592 0.35 0.207 0.123 0.073 0.043 0.025
0.015 0.009 0.005]
[ -2.064 4.26 -8.791 18.144 -37.448 77.289 -159.516
329.222 -679.478 1402.367]]
In [33]:
Polynomial feature names:
['x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7', 'x0^8', 'x0^9', 'x0^10']
In [34]:
In [35]:
In [36]:
In [37]:
X_train.shape: (379, 13)
X_train_poly.shape: (379, 105)
In [38]:
Polynomial feature names:
['1', 'x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x0 x4', 'x0 x5', 'x0 x6', 'x0 x7', 'x0 x8', 'x0 x9', 'x0 x10', 'x0 x11', 'x0 x12', 'x1^2', 'x1 x2', 'x1 x3', 'x1 x4', 'x1 x5', 'x1 x6', 'x1 x7', 'x1 x8', 'x1 x9', 'x1 x10', 'x1 x11', 'x1 x12', 'x2^2', 'x2 x3', 'x2 x4', 'x2 x5', 'x2 x6', 'x2 x7', 'x2 x8', 'x2 x9', 'x2 x10', 'x2 x11', 'x2 x12', 'x3^2', 'x3 x4', 'x3 x5', 'x3 x6', 'x3 x7', 'x3 x8', 'x3 x9', 'x3 x10', 'x3 x11', 'x3 x12', 'x4^2', 'x4 x5', 'x4 x6', 'x4 x7', 'x4 x8', 'x4 x9', 'x4 x10', 'x4 x11', 'x4 x12', 'x5^2', 'x5 x6', 'x5 x7', 'x5 x8', 'x5 x9', 'x5 x10', 'x5 x11', 'x5 x12', 'x6^2', 'x6 x7', 'x6 x8', 'x6 x9', 'x6 x10', 'x6 x11', 'x6 x12', 'x7^2', 'x7 x8', 'x7 x9', 'x7 x10', 'x7 x11', 'x7 x12', 'x8^2', 'x8 x9', 'x8 x10', 'x8 x11', 'x8 x12', 'x9^2', 'x9 x10', 'x9 x11', 'x9 x12', 'x10^2', 'x10 x11', 'x10 x12', 'x11^2', 'x11 x12', 'x12^2']
In [39]:
Score without interactions: 0.621
Score with interactions: 0.753
In [40]:
Score without interactions: 0.788
Score with interactions: 0.761
Univariate Nonlinear Transformations
In [41]:
In [42]:
Number of feature appearances:
[28 38 68 48 61 59 45 56 37 40 35 34 36 26 23 26 27 21 23 23 18 21 10 9
17 9 7 14 12 7 3 8 4 5 5 3 4 2 4 1 1 3 2 5 3 8 2 5
2 1 2 3 3 2 2 3 3 0 1 2 1 0 0 3 1 0 0 0 1 3 0 1
0 2 0 1 1 0 0 0 0 1 0 0 2 2 0 1 1 0 0 0 0 1 1 0
0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0
1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
In [43]:
[figure output; x-axis label: 'Value']
In [44]:
Test score: 0.622
In [45]:
In [46]:
[figure output; x-axis label: 'Value']
In [47]:
Test score: 0.875
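A sketch of the transformation applied between the two scores: count-valued features are strongly right-skewed, and `log(X + 1)` (the `+ 1` keeps zero counts defined) makes their distribution much more symmetric, which helps linear models such as Ridge.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic Poisson counts standing in for the integer features above.
X = rng.poisson(lam=3.0, size=(1000, 1))
X_log = np.log(X + 1)        # equivalently np.log1p(X); defined at X == 0
print(X.max(), X_log.max())  # the long right tail is compressed
```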
Automatic Feature Selection
Univariate statistics
In [48]:
X_train.shape: (284, 80)
X_train_selected.shape: (284, 40)
In [49]:
[ True True True True True True True True True False True False
True True True True True True False False True True True True
True True True True True True False False False True False True
False False True False False False False True False False True False
False True False True False False False False False False True False
True False False False False True False True False False False False
True True False True False False False False]
In [50]:
Score with all features: 0.930
Score with only selected features: 0.940
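A self-contained sketch of the experiment: append 50 noise features to the 30 cancer features, then let `SelectPercentile` keep the best-scoring half by a univariate F-test (the train/test split used above is omitted here for brevity):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile

cancer = load_breast_cancer()
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
X_w_noise = np.hstack([cancer.data, noise])   # 30 real + 50 noise features

# The default score function is the ANOVA F-test (f_classif).
select = SelectPercentile(percentile=50)
X_selected = select.fit_transform(X_w_noise, cancer.target)
print(X_w_noise.shape, X_selected.shape)
```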
Model-based Feature Selection
In [51]:
In [52]:
X_train.shape: (284, 80)
X_train_l1.shape: (284, 40)
In [53]:
In [54]:
Test score: 0.951
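A sketch of model-based selection with `SelectFromModel`: a random forest is fit once, and every feature whose importance clears the chosen threshold (here the median importance, i.e. roughly half the features) is kept.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

cancer = load_breast_cancer()
select = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median")   # keep features above the median importance
X_selected = select.fit_transform(cancer.data, cancer.target)
print(cancer.data.shape, X_selected.shape)
```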
Iterative feature selection
In [55]:
In [56]:
Test score: 0.951
In [57]:
Test score: 0.951
Utilizing Expert Knowledge
In [58]:
In [59]:
Citibike data:
starttime
2015-08-01 00:00:00 3
2015-08-01 03:00:00 0
2015-08-01 06:00:00 9
2015-08-01 09:00:00 41
2015-08-01 12:00:00 39
Freq: 3H, Name: one, dtype: int64
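A sketch of the resampling step that produces a series like the one above: each rental contributes a 1 at its start time, and `resample("3h").sum()` counts rentals per three-hour window (synthetic timestamps here, not the Citi Bike data):

```python
import pandas as pd

starttimes = pd.to_datetime(
    ["2015-08-01 00:10", "2015-08-01 00:45", "2015-08-01 03:20",
     "2015-08-01 06:05", "2015-08-01 07:55"])
rentals = pd.Series(1, index=starttimes)
# Sum the 1s inside each 3-hour window to get rentals per window.
per_3h = rentals.resample("3h").sum()
print(per_3h)
```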
In [60]:
[figure output; y-axis label: 'Rentals']
In [61]:
In [62]:
In [63]:
Test-set R^2: -0.04
In [64]:
Test-set R^2: 0.60
In [65]:
Test-set R^2: 0.84
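The R^2 jump from 0.60 to 0.84 appears to come from adding the day of week alongside the hour of day. A sketch of how those two expert features can be pulled out of a `DatetimeIndex` (synthetic index here; the notebook uses the Citi Bike timestamps):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2015-08-01", periods=6, freq="3h")
# Day of week (Monday=0) and hour of day as a two-column feature matrix.
X_hour_week = np.hstack([np.array(idx.dayofweek).reshape(-1, 1),
                         np.array(idx.hour).reshape(-1, 1)])
print(X_hour_week)
```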
In [66]:
Test-set R^2: 0.13
In [67]:
In [68]:
Test-set R^2: 0.62
In [69]:
Test-set R^2: 0.85
In [70]:
In [71]:
In [72]:
[figure output; y-axis label: 'Feature magnitude']