Which models are interpretable?

Model explanation is an essential task in supervised machine learning. Explaining how a model represents information is crucial to understanding the dynamics that govern our data. Let's look at some models that are easy to interpret.

Why do we need to interpret our models?

Data Scientists have the role of extracting information from raw data. They aren't engineers, nor are they software developers. They dig inside data and extract the gold from the mine.

Knowing what a model does and how it works is part of this job. Black-box models, although they sometimes perform better than other models, aren't a good idea if we need to learn something from our data. On the contrary, models that are, by their nature, good at explaining how they convert data into information should be preferred and studied in depth.

Machine learning models are a way to translate information into a suitable and understandable language, which is math. If we can translate something into math, we can learn more from it, because math is a language we can master as deeply as we need. So, a model is not only an algorithm we can run to predict whether a customer will buy something or not. It's a way to understand why a customer with a particular profile will probably buy something while another customer, with a different profile, is not going to buy anything.

So, model interpretation is crucial to give the proper shape to the information we’re looking for.

Interpretation via feature importance

One possible way to interpret how a model "thinks" about our data is to look at the importance of the features. Importance is usually a non-negative number: the higher it is, the more weight the model gives to that particular feature.

In this article, I'll show some models that provide their own interpretation of feature importance, using Python and the scikit-learn library. Let's always remember that different models assign different importance to the features. That's completely normal, because each model is one way to look at that wonderful, complex prism that is information. We never have the complete view, so feature importance strongly depends on the model we choose.

Let’s first import some libraries and the “wine” dataset of scikit-learn.

import numpy as np
from sklearn.datasets import load_wine,load_diabetes
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression,LinearRegression,Lasso,Ridge,ElasticNet
from sklearn.svm import LinearSVC,LinearSVR
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

Let’s now load our dataset and store the names of the features.

X,y = load_wine(return_X_y=True)
features = load_wine()['feature_names']

Now, we are ready to calculate feature importance for different types of models.

Decision trees

Tree-based models calculate the importance of a feature according to the total improvement in node purity that its splits provide across the entire tree. If a feature produces splits that substantially increase the purity of the resulting nodes, it is certainly important. For convenience, importance scores in tree-based models are normalized so that they sum up to 1.

In scikit-learn, every decision tree-based model has a property called feature_importances_ which contains the importance of the features. It can be accessed after fitting our model.

Let’s see what happens with a simple decision tree model.

tree = DecisionTreeClassifier()
tree.fit(X,y)

plt.barh(features,tree.feature_importances_)
Feature importance according to decision tree

As we can see, some features have an importance equal to 0: the tree never uses them for a split, perhaps because they don't improve purity as much as, for example, the "proline" feature.
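
Since the importance scores are normalized to sum to 1, we can check that directly and, if we prefer exact numbers over a bar plot, pair each feature name with its score. This is just a small sketch based on the tree fitted above.

# The scores sum to 1 by construction
print(tree.feature_importances_.sum())

# Pair each feature with its score and sort, highest first
ranked = sorted(zip(features, tree.feature_importances_), key=lambda t: t[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")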

Another tree-based model that is very useful for feature importance is the random forest. Generally speaking, ensemble models average, for each feature, the importance scores given by the individual weak learners, and the final scores are then normalized so that they sum to 1, just like in a single decision tree.

Here’s how to calculate feature importance for a random forest model.

rf = RandomForestClassifier()
rf.fit(X,y)

plt.barh(features,rf.feature_importances_)
Feature importance according to random forest
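
As a quick sanity check of the averaging described above, we can compare the forest's scores with the mean of the per-tree scores exposed through the estimators_ attribute. This is only a sketch and assumes the rf model fitted above.

# The forest's importance should match the mean of the per-tree importances
per_tree = np.array([t.feature_importances_ for t in rf.estimators_])
print(np.allclose(per_tree.mean(axis=0), rf.feature_importances_))  # expected: True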

Feature importance given by a random forest can be used to perform feature selection, for example.
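
Here's a minimal sketch of that idea using SelectFromModel, which keeps only the features whose importance exceeds a threshold (the mean importance, in this case); selector and selected are just illustrative names.

from sklearn.feature_selection import SelectFromModel

# Keep only the features whose importance is above the mean importance
selector = SelectFromModel(RandomForestClassifier(), threshold="mean")
selector.fit(X, y)
selected = [f for f, keep in zip(features, selector.get_support()) if keep]
print(selected)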

Even the Gradient Boosting Decision Tree model can give us its own interpretation of feature importance.

gb = GradientBoostingClassifier()
gb.fit(X,y)

plt.barh(features,gb.feature_importances_)
Feature importance according to gradient boosting decision tree

This is true even for XGBoost.

xgboost = xgb.XGBClassifier()
xgboost.fit(X,y)

plt.barh(features,xgboost.feature_importances_)
Feature importance according to XGBoost
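
Keep in mind that XGBoost supports several definitions of importance ("weight", "gain", "cover" and their totals), and feature_importances_ exposes only one of them, depending on the importance_type setting and the library version. As a rough sketch, the underlying booster can report the others directly; the keys are f0, f1, ... in feature order.

# Importance by gain, straight from the booster
print(xgboost.get_booster().get_score(importance_type="gain"))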

Linear models

Linear models can give us feature importance too: it is simply the absolute value of a feature's coefficient. If we work with multi-class linear models for classification, we sum, for each feature, the absolute values of its coefficients across the classes.

All linear models require standardized or normalized features when it comes to calculating feature importance: the coefficients are comparable with each other only if the features share the same scale. That's why I'll use scikit-learn's pipeline object to standardize our data before training the model.

Let's see an example with a Support Vector Machine with a linear kernel.

svm = make_pipeline(StandardScaler(),LinearSVC())

svm.fit(X,y)

plt.barh(features,np.abs(svm[1].coef_).sum(axis=0))
Feature importance according to Support Vector Machine
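
To make the aggregation explicit: with the three wine classes, LinearSVC fits one coefficient vector per class, so coef_ has shape (n_classes, n_features) and the sum over axis 0 yields one score per feature. A quick check:

print(svm[1].coef_.shape)                      # (3, 13): one row of coefficients per class
print(np.abs(svm[1].coef_).sum(axis=0).shape)  # (13,): one importance score per feature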

The same applies to logistic regression.

logit = make_pipeline(StandardScaler(),LogisticRegression())

logit.fit(X,y)

plt.barh(features,np.abs(logit[1].coef_).sum(axis=0))
Feature importance according to logistic regression

The other linear models that can be used for feature importance are regression models, so we have to load a regression dataset to see how they work. For the following examples, I’ll use the “diabetes” dataset.

X,y = load_diabetes(return_X_y=True)
features = load_diabetes()['feature_names']

Let’s see how linear regression works for calculating feature importance.

lr = make_pipeline(StandardScaler(),LinearRegression())

lr.fit(X,y)

plt.barh(features,np.abs(lr[1].coef_))
Feature importance according to linear regression
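
If we're curious about what standardization does to the coefficients, a small sketch (valid for ordinary least squares, using the illustrative name raw_lr for a model fitted on the unscaled data) shows that the standardized coefficients are just the raw ones rescaled by each feature's standard deviation.

# Rescaling the inputs rescales the OLS coefficients by the same factor
raw_lr = LinearRegression().fit(X, y)
print(np.allclose(lr[1].coef_, raw_lr.coef_ * X.std(axis=0)))  # expected: True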

A very powerful model that can be used for feature importance (and for feature selection too) is Lasso regression.

lasso = make_pipeline(StandardScaler(),Lasso())

lasso.fit(X,y)

plt.barh(features,np.abs(lasso[1].coef_))
Feature importance according to Lasso regression
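
Because the L1 penalty can shrink coefficients exactly to zero, Lasso implicitly performs feature selection. How many coefficients survive depends on the regularization strength alpha; this sketch simply lists the features that keep a non-zero coefficient with the default alpha.

# Features with a non-zero coefficient are the ones Lasso keeps
coef = lasso[1].coef_
kept = [f for f, c in zip(features, coef) if c != 0]
print(f"{len(kept)} features kept out of {len(features)}: {kept}")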

The closest friend to Lasso regression is Ridge regression, which is helpful as well.

ridge = make_pipeline(StandardScaler(),Ridge())

ridge.fit(X,y)

plt.barh(features,np.abs(ridge[1].coef_))
Feature importance according to Ridge regression

The last model we're going to see, Elastic Net regression, mixes the Lasso and Ridge penalties together.

en = make_pipeline(StandardScaler(),ElasticNet())

en.fit(X,y)

plt.barh(features,np.abs(en[1].coef_))
Feature importance according to Elastic Net regression
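
The balance between the two penalties is controlled by the l1_ratio parameter (1.0 behaves like Lasso, values close to 0 like Ridge). As a rough sketch, using the illustrative name en_l1, we can shift the mix towards the L1 penalty and see how the importance profile changes.

# A mostly-L1 Elastic Net: closer to Lasso, so more coefficients shrink towards zero
en_l1 = make_pipeline(StandardScaler(),ElasticNet(alpha=1.0,l1_ratio=0.9))
en_l1.fit(X,y)

plt.barh(features,np.abs(en_l1[1].coef_))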

Conclusions

Model interpretation is often done by calculating feature importance, and some models give us their own interpretation of the importance of the features. For models that don't provide feature importance natively, we can use model-agnostic approaches such as SHAP (very useful for explaining neural networks, for example). Proper use of feature importance can add more value to a Data Science project than the predictive power of the model itself.
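
For instance, scikit-learn itself ships a model-agnostic technique, permutation importance, which shuffles one feature at a time and measures how much the score drops. Here is a minimal sketch applied to the ridge pipeline fitted above; SHAP would follow a similar workflow with its own explainer objects.

from sklearn.inspection import permutation_importance

# Shuffle each feature and measure the drop in score: the larger the drop, the more important the feature
result = permutation_importance(ridge, X, y, n_repeats=10, random_state=0)
plt.barh(features, result.importances_mean)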