
Notebook Instructions

You can run the notebook document sequentially (one cell at a time) by pressing shift + enter. While a cell is running, a [*] will display on the left. When it has finished running, a number will display on the left indicating the order in which it was run in the notebook (e.g., [8]).

Enter edit mode by pressing Enter or using the mouse to click on a cell's editor area. Edit mode is indicated by a green cell border and a prompt showing in the editor area.

Hyperparameter tuning

Hyperparameters cannot be learned by the model but instead need to be specified by the user before training the model. In this notebook, we will find the best hyperparameters for the random forest model created in the previous section using the random search and grid search cross-validation techniques.
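To make this distinction concrete, here is a minimal sketch (added for illustration, using a tiny toy dataset rather than the stock data below): hyperparameters such as n_estimators are chosen by the user when the model is created, while quantities such as feature_importances_ are learned from the data only after fit is called.

# Hyperparameters are set by the user at construction time
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=10, max_depth=3)
print(model.get_params()['n_estimators'])   # 10, available before any training

# Learned quantities only exist after the model is fit on data (toy data here)
X_toy = [[0.1], [0.2], [0.3], [0.4]]
y_toy = [0.0, 0.1, 0.2, 0.3]
model.fit(X_toy, y_toy)
print(model.feature_importances_)           # learned from the data during fit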

Let's start with the steps below, which you already know:

  1. Import the data

  2. Define predictor variables and a target variable

  3. Split the data into train and test dataset

import pandas as pd

data = pd.read_csv('AAPL.csv')

# Returns
data['ret1'] = data.Adj_Close.pct_change()
data['ret5'] = data.ret1.rolling(5).sum()
data['ret10'] = data.ret1.rolling(10).sum()
data['ret20'] = data.ret1.rolling(20).sum()
data['ret40'] = data.ret1.rolling(40).sum()

# Standard deviation
data['std5'] = data.ret1.rolling(5).std()
data['std10'] = data.ret1.rolling(10).std()
data['std20'] = data.ret1.rolling(20).std()
data['std40'] = data.ret1.rolling(40).std()

# Future returns
data['retFut1'] = data.ret1.shift(-1)

# Define predictor variables (X) and a target variable (y)
data = data.dropna()
predictor_list = ['ret1', 'ret5', 'ret10', 'ret20', 'ret40',
                  'std5', 'std10', 'std20', 'std40']
X = data[predictor_list]
y = data.retFut1

# Split the data into train and test dataset
train_length = int(len(data) * 0.80)
X_train = X[:train_length]
X_test = X[train_length:]
y_train = y[:train_length]
y_test = y[train_length:]

The main hyperparameters of the random forest method are n_estimators, max_features, max_depth, min_samples_leaf, and bootstrap. We have defined a range of values for each of these hyperparameters below.

import numpy as np

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=10, stop=20, num=5)]

# Number of features to consider at every split
max_features = [round(x, 2) for x in np.linspace(start=0.3, stop=1.0, num=5)]

# Maximum depth of the tree
max_depth = [round(x, 2) for x in np.linspace(start=2, stop=10, num=5)]

# Minimum number of samples required at each leaf node
min_samples_leaf = [int(x) for x in np.linspace(start=300, stop=600, num=10)]

# Method of selecting training subset for training each tree
bootstrap = [True, False]

# Create the random grid
param_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap}
param_grid
{'bootstrap': [True, False], 'max_depth': [2.0, 4.0, 6.0, 8.0, 10.0], 'max_features': [0.3, 0.47, 0.65, 0.82, 1.0], 'min_samples_leaf': [300, 333, 366, 400, 433, 466, 500, 533, 566, 600], 'n_estimators': [10, 12, 15, 17, 20]}
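Before searching, it helps to see how large this grid is. The quick count below (added for illustration only) shows why sampling a fixed number of settings is much cheaper than trying every combination.

# Total number of hyperparameter combinations in param_grid
total_combinations = 1
for values in param_grid.values():
    total_combinations *= len(values)
print(total_combinations)   # 5 * 5 * 5 * 10 * 2 = 2500 combinations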

The RandomizedSearchCV class from the sklearn.model_selection module is used to find the best hyperparameter values.

from sklearn.model_selection import RandomizedSearchCV

# Run the below line to see details about the RandomizedSearchCV class
help(RandomizedSearchCV)
Help on class RandomizedSearchCV in module sklearn.model_selection._search (output condensed): RandomizedSearchCV implements "fit" and "score" methods and performs a cross-validated search over hyperparameters, sampling a fixed number (n_iter) of parameter settings from the specified distributions rather than trying every combination as GridSearchCV does. Its main parameters include estimator, param_distributions, n_iter, scoring, cv, refit, and random_state; after fitting, the attributes cv_results_, best_estimator_, best_score_, best_params_, and best_index_ are available.
# Create the base model to tune
from sklearn.ensemble import RandomForestRegressor

random_forest = RandomForestRegressor()

The RandomizedSearchCV class takes the following parameters as input:

  1. estimator: The base estimator model for which the best hyperparameter values are found.

  2. param_distributions: Dictionary with parameter names as keys and lists of values (or distributions) to try.

  3. n_iter: Number of parameter settings that are sampled in the search.

  4. random_state: The random seed value used for sampling.

# Random search of parameters by searching across 50 different combinations
rf_random = RandomizedSearchCV(estimator=random_forest,
                               param_distributions=param_grid,
                               n_iter=50,
                               random_state=42)

# Fit the model to find the best hyperparameter values
rf_random.fit(X_train, y_train)
RandomizedSearchCV(cv=None, error_score='raise', estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False), fit_params=None, iid=True, n_iter=50, n_jobs=1, param_distributions={'n_estimators': [10, 12, 15, 17, 20], 'max_features': [0.3, 0.47, 0.65, 0.82, 1.0], 'bootstrap': [True, False], 'max_depth': [2.0, 4.0, 6.0, 8.0, 10.0], 'min_samples_leaf': [300, 333, 366, 400, 433, 466, 500, 533, 566, 600]}, pre_dispatch='2*n_jobs', random_state=42, refit=True, return_train_score='warn', scoring=None, verbose=0)

The best hyperparameter values for the random forest model are found below.

rf_random.best_params_
{'bootstrap': True, 'max_depth': 10.0, 'max_features': 0.47, 'min_samples_leaf': 366, 'n_estimators': 20}
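Besides best_params_, the fitted search object also records the cross-validated score of every sampled setting. The sketch below (added for illustration; cv_results_ contains many more columns than the ones selected here) shows one way to inspect and rank them.

# Put the cross-validation results into a DataFrame and rank the sampled settings
results = pd.DataFrame(rf_random.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score').head())

# The mean cross-validated score of the best setting
print(rf_random.best_score_)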

In this step, we train the model created using the best hyperparameter values.

# Assign the best model to best_random_forest
best_random_forest = rf_random.best_estimator_

# Initialize random_state to 42
best_random_forest.random_state = 42

# Fit the best random forest model on the train dataset
best_random_forest.fit(X_train, y_train)
RandomForestRegressor(bootstrap=False, criterion='mse', max_depth=10.0, max_features=0.3, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=533, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=1, oob_score=False, random_state=42, verbose=0, warm_start=False)

Grid search

Similarly, we can find the best model using the grid search cross-validation technique. Since this method is time consuming, as it tries out all possible combinations, we have defined fewer hyperparameter values below for illustration purposes only. You may specify more values for each hyperparameter.

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=10, stop=20, num=3)]

# Number of features to consider at every split
max_features = [round(x, 2) for x in np.linspace(start=0.3, stop=1.0, num=3)]

# Minimum number of samples required at each leaf node
min_samples_leaf = [int(x) for x in np.linspace(start=300, stop=600, num=3)]

# Method of selecting training subset for training each tree
bootstrap = [True, False]

# Create the parameter grid
param_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap}
param_grid
{'bootstrap': [True, False], 'max_features': [0.3, 0.65, 1.0], 'min_samples_leaf': [300, 450, 600], 'n_estimators': [10, 15, 20]}
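For comparison with the random search above, GridSearchCV fits one model per combination per cross-validation fold (3 folds by default in this scikit-learn version). The quick count below is added for illustration only.

# Exhaustive grid: every combination is tried
grid_size = 1
for values in param_grid.values():
    grid_size *= len(values)
print(grid_size)       # 3 * 3 * 3 * 2 = 54 candidate settings
print(grid_size * 3)   # 162 model fits with the default 3-fold cross validation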

The code below finds the best hyperparameter values.

from sklearn.model_selection import GridSearchCV

# Uncomment the below line to see details about the GridSearchCV class
# help(GridSearchCV)

# Grid search of parameters by searching all the possible combinations
rf_grid = GridSearchCV(estimator=random_forest,
                       param_grid=param_grid)

# Fit the model to find the best hyperparameter values
rf_grid.fit(X_train, y_train)

# Best hyperparameter values
rf_grid.best_params_
{'bootstrap': False, 'max_features': 0.3, 'min_samples_leaf': 300, 'n_estimators': 15}

Practice

You can try for yourself how the random forest models created through RandomizedSearchCV and GridSearchCV perform on the test dataset.
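As a starting point, the sketch below shows one possible way to do this; it assumes the objects defined above and uses mean squared error as the evaluation metric, but you may prefer a different measure.

# One possible way to compare the tuned models on the test dataset
from sklearn.metrics import mean_squared_error

for name, model in [('Random search', best_random_forest),
                    ('Grid search', rf_grid.best_estimator_)]:
    predictions = model.predict(X_test)
    print('{}: {:.6f}'.format(name, mean_squared_error(y_test, predictions)))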