Feature selection via grid search in supervised models

Feature selection, together with hyperparameter tuning, is probably the most important part of building a machine learning model. How can we select the right set of features? And is that choice related to the model's hyperparameters?

Let’s see an example in Python.

The purpose of feature selection

Feature selection is one of the most fascinating and probably underestimated fields in machine learning. Many people give too much importance to the model itself and assume that a complex model will automatically learn which variables matter most.

My experience as a data scientist has proven that simple algorithms are able to generalize better than complicated ones and that feature selection is often more important than the model itself. If you choose the wrong features, no model will learn anything. If you choose the right features, even a simple model could achieve good results.

Unsupervised or supervised?

Unsupervised feature selection covers techniques that don't rely on the performance of any particular model, but only on the data itself. They are applied before model training, so they are model-free. Examples include scoring the variables most correlated with the target using Pearson's correlation coefficient, the chi-square statistic, mutual information and so on. This kind of feature selection is quite powerful, but it can be unreliable if it isn't followed by a proper model. For example, Pearson's correlation coefficient measures linear correlation, so if the model is non-linear, the features selected by a linear criterion may not be the best possible set.
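
As a rough sketch of what such model-free scoring can look like (on a synthetic dataset from make_regression, purely for illustration), the features can be ranked by Pearson's correlation or by mutual information before any model is trained:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import mutual_info_regression

# Synthetic regression data, just for illustration
X, y = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)

# Pearson's correlation coefficient of each feature with the target
pearson = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Mutual information, which can also capture non-linear dependence
mi = mutual_info_regression(X, y, random_state=0)

# Rank the features by each criterion (highest score first)
print(np.argsort(-np.abs(pearson)))
print(np.argsort(-mi))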

Supervised feature selection chooses the set of input features that maximizes a model's performance. This kind of feature selection is model-dependent, because different models may interpret the input data in different ways and weigh feature importance from different points of view. We want a good model, so searching for a feature set that maximizes our model's performance is reasonable, but we must choose the model a priori, and this choice can introduce a bias into our analysis.

A mixed approach

If your model has hyperparameters (e.g. Random Forests), things become more difficult. How do you choose the hyperparameter values and the features? Do you choose the features before the hyperparameter values, or do you first optimize the hyperparameters on all the features and then keep only the most relevant inputs?

I don't think feature selection is independent of hyperparameter tuning. The same model with different hyperparameter values is effectively a different model, so it may weigh the input features differently and show different performance even on the same features. Conversely, changing the features of a model while keeping the same hyperparameter values may affect performance due to collinearity or, more generally, the curse of dimensionality.

So, I think the right answer is to choose the features and the hyperparameter values in the same search procedure. This becomes possible if we treat feature selection as part of the hyperparameter tuning process.

Everything will become clearer with this example. We'll start with an unsupervised approach based on Pearson's correlation coefficient. We'll calculate the correlation coefficient between every feature and the target variable, convert it into an F-score and sort the variables by this score. Then we'll take the k variables with the highest score, optimize the hyperparameters of our model trained on those features, and repeat the process for different values of k until every combination of features and hyperparameter values has been checked. The features/hyperparameters combination that maximizes the average performance in a 5-fold cross-validation is the one we are looking for.

Example in Python

Here follows an example of this procedure in Python. You can find the whole code on GitHub here.

The selection of the k best variables is done by the SelectKBest module of scikit-learn. This object selects the most important features according to a given scoring function.
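
Used on its own (a quick standalone sketch, before we wire it into the pipeline below), SelectKBest works like this:

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectKBest, f_regression

X, y = load_boston(return_X_y=True)

# Keep only the 5 features with the highest F-score
selector = SelectKBest(f_regression, k=5)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)                       # (506, 5) with the Boston data
print(selector.get_support(indices=True))    # indices of the selected columns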

Then we combine SelectKBest and our supervised model in a Pipeline object. Finally, we perform hyperparameter tuning with GridSearchCV, treating k as a hyperparameter of the pipeline. Remember, pipelines in scikit-learn work exactly like models, so every parameter of each object included in the pipeline is exposed as a pipeline hyperparameter.

For our example, we’ll use the Boston data included in sklearn. Since the target variable is a real number, we are facing a regression problem. The models we’re going to use in this example are Linear Regression and Random Forest regression.

Let’s import some libraries first.

import numpy as np
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

Then we’ll load the Boston dataset.

data, target = load_boston(return_X_y=True)

Now we can perform our analysis with the models.

Linear Regression

The first thing to do is to define a pipeline that contains the feature selector and the model.

pipeline = Pipeline(
    [
        ('selector', SelectKBest(f_regression)),
        ('model', LinearRegression())
    ]
)

The f_regression argument passed to the SelectKBest constructor tells the selector to score the variables using an F-score computed from Pearson's correlation coefficient between each feature and the target variable. After the feature selection, a Linear Regression on the selected features is performed.
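
For the curious, here is a minimal sketch of that relationship: the F-score used by f_regression can be reconstructed from Pearson's r as F = r^2 / (1 - r^2) * (n - 2), where n is the number of samples.

import numpy as np
from sklearn.datasets import load_boston
from sklearn.feature_selection import f_regression

X, y = load_boston(return_X_y=True)

# F-scores as computed by scikit-learn
f_scores, p_values = f_regression(X, y)

# The same scores reconstructed from Pearson's correlation coefficient
n = len(y)
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
f_manual = r ** 2 / (1 - r ** 2) * (n - 2)

print(np.allclose(f_scores, f_manual))   # expected: True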

Then, we define the GridSearchCV object that performs a grid search on the number of features to use.

search = GridSearchCV(
    estimator=pipeline,
    param_grid={'selector__k': [3, 4, 5, 6, 7, 8, 9, 10]},
    n_jobs=-1,
    scoring="neg_mean_squared_error",
    cv=5,
    verbose=3
)

This object acts exactly like a model (so it has fit and predict methods). When we fit it, it calculates the average value of the scoring metric (the mean squared error with a minus sign) over a 5-fold cross-validation (cv=5) for each value of k, the number of most relevant variables to keep. Finally, the grid search chooses the value of k that maximizes the average score across the folds.

As you can see, the param_grid value contains a dictionary with one key, selector__k. Note the double underscore inside this name: it's the special syntax GridSearchCV uses to refer to the k parameter of the object called selector in the pipeline.
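
If you're ever unsure about the exact parameter names, you can list them directly from the pipeline:

# Every key returned here is a valid name for param_grid;
# among them you'll find 'selector__k' and 'selector__score_func'
print(sorted(pipeline.get_params().keys()))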

We can now fit the grid search and check the best value for k and the best score achieved.

search.fit(data, target)

search.best_params_
# {'selector__k': 3}

search.best_score_
# -36.4236890153343

As you can see, the selector has kept the 3 variables with the highest F-score.
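
If you want to know which columns those are, you can inspect the fitted selector inside the best pipeline (a quick sketch; the column names come from calling load_boston() again without return_X_y):

# The best pipeline refitted on the whole dataset
best_selector = search.best_estimator_.named_steps['selector']

# Indices and names of the selected columns
print(best_selector.get_support(indices=True))
print(load_boston().feature_names[best_selector.get_support()])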

Let’s see what happens with a model that has hyperparameters.

Random Forest

For the Random Forest model, we can define another pipeline.

pipeline = Pipeline(
    [
        ('selector', SelectKBest(f_regression)),
        ('model', RandomForestRegressor(random_state=0))
    ]
)

We’ll perform the hyperparameter tuning only on the number of trees, which is the n_estimators parameter of the RandomForestRegressor object. This value will span from 10 to 190 with steps of 10.

The grid search is then:

search = GridSearchCV(
    estimator=pipeline,
    param_grid={
        'selector__k': [3, 4, 5, 6, 7, 8, 9, 10],
        'model__n_estimators': np.arange(10, 200, 10)
    },
    n_jobs=-1,
    scoring="neg_mean_squared_error",
    cv=5,
    verbose=3
)

This new search spans the k values and the n_estimators values simultaneously. With 8 values of k, 19 values of n_estimators and 5 folds for each combination, it performs 760 different fits.

The final result is:

search.fit(data, target)

search.best_params_
# {'model__n_estimators': 110, 'selector__k': 6}

search.best_score_
# -22.170138432624004

So the grid search has found 6 features to consider and a model with 110 trees.
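
To see how the score varies across the whole grid, the full cross-validation results can be put into a DataFrame (a quick sketch, assuming pandas is available):

import pandas as pd

results = pd.DataFrame(search.cv_results_)
results = results.sort_values('mean_test_score', ascending=False)

# Best combinations of k and n_estimators, with their average scores
print(results[['param_selector__k', 'param_model__n_estimators', 'mean_test_score']].head())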

Conclusions

In this article, I've described a technique that mixes feature selection and hyperparameter tuning in a single procedure, by treating the feature set as a hyperparameter itself. We cannot know in advance which variables are most important for our model, and this is even more true when the model has hyperparameters: different hyperparameter values may work better or worse with different sets of features, so feature selection should be done together with hyperparameter tuning.

Of course, feature selection adds a new dimension to hyperparameter tuning, which increases the number of iterations of a grid search. A random search instead of a grid search may be more useful for finding a good solution quickly, or, if you have one, a Spark cluster can parallelize the computations and speed the program up.
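
As a rough sketch of that random-search alternative (same pipeline and grid as above, but sampling only a fixed number of combinations; n_iter=30 is an arbitrary choice here):

from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions={
        'selector__k': [3, 4, 5, 6, 7, 8, 9, 10],
        'model__n_estimators': np.arange(10, 200, 10)
    },
    n_iter=30,                            # 30 random combinations instead of all 152
    n_jobs=-1,
    scoring="neg_mean_squared_error",
    cv=5,
    random_state=0
)
search.fit(data, target)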