Training a model is a complex process that requires a lot of effort and analysis. Once a model is ready, we know it won’t be valid forever and that we’ll need to retrain it. How can we decide when a model needs to be retrained? There are some techniques that can help us.
The need for retraining
Data changes every day; Big Data changes every minute. Every model we create necessarily learns from the past. If the statistics of our data don’t change in the future, our model will likely remain stable and efficient. But if they do change, the model may become obsolete and need to be retrained.
The issue is not only that the data itself changes, but that the distribution of our data changes. Even shifts in the correlations between features may introduce patterns that the model isn’t able to handle properly.
Think about the introduction of a new product, a marketing push that produces an increase in sales, or, even worse, a change in a search engine’s algorithm that dramatically lowers organic traffic. All of these are exogenous events that change the distribution of our data and may affect the performance of a model trained on now-obsolete data.
So, the deployment of a model is not the last piece of the puzzle. After the Data Scientist has done their job, the Machine Learning Engineer still has work to do to ensure the quality of the service provided by the model.
So, how can we figure out whether a model needs to be retrained? Let’s look at some techniques.
Scheduled retraining
One of the most common approaches is to schedule the retraining of the model, for example once a month or once a quarter. This way, the model is refreshed with new data and follows the changes in the probability distribution of the features.
In online trading, this is called Walk Forward Optimization. The general idea is that, if the model is retrained periodically, it never becomes stale.
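To make the idea concrete, here is a minimal walk-forward sketch in Python. The data, window sizes and model are placeholders I’m assuming for illustration, not a prescription: the model is retrained on the most recent window of time-ordered data and then scored on the period that follows it.

```python
# A minimal walk-forward retraining sketch. X and y are assumed to be a
# feature matrix and target ordered by time; sizes and model are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1200, 5))                           # placeholder features, time-ordered
y = (X[:, 0] + rng.normal(size=1200) > 0).astype(int)    # placeholder target

train_window = 600   # how much history each retrain sees
test_window = 100    # how far ahead each model is used before the next retrain

for start in range(0, len(X) - train_window - test_window + 1, test_window):
    train_slice = slice(start, start + train_window)
    test_slice = slice(start + train_window, start + train_window + test_window)

    # Retrain on the most recent window only, then score the next period.
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X[train_slice], y[train_slice])
    auc = roc_auc_score(y[test_slice], model.predict_proba(X[test_slice])[:, 1])
    print(f"retrained at t={start + train_window}, next-period AuROC={auc:.3f}")
```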
While this approach may appear useful, it has some flaws. First of all, training a model requires a lot of Exploratory Data Analysis that cannot be fully automated (that’s why I’ve created an entire online course around this topic). Then, an automated retraining pipeline should perform feature selection automatically, because features that weren’t useful in the past may become useful now, and useful features may become useless with new data. This entire process is slow, time-consuming and may lead to approximations. If you have a strong infrastructure, maybe you can afford it. Otherwise, you should definitely take a look at the following technique.
Rolling performance evaluation
Since retraining a model is a complex task, we should perform it only when necessary. A general approach is to monitor the performance of the model on a rolling test dataset that is constantly refreshed with new data. If the model’s performance on this dataset is acceptable, there’s no need to retrain it. If the performance decreases, it’s time to trigger a retrain.
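As a rough sketch, the check can be as simple as a function that compares the model’s metric on the rolling dataset with a threshold chosen by the Data Scientist. The names `X_rolling` and `y_rolling` and the use of AuROC are assumptions for illustration.

```python
# A minimal monitoring sketch, assuming `model` is already trained and that
# X_rolling, y_rolling hold the most recent labeled observations.
from sklearn.metrics import roc_auc_score

def needs_retraining(model, X_rolling, y_rolling, threshold):
    """Return True if the rolling AuROC falls below the chosen threshold."""
    rolling_auc = roc_auc_score(y_rolling, model.predict_proba(X_rolling)[:, 1])
    return rolling_auc < threshold

# Example usage (threshold chosen by the Data Scientist):
# if needs_retraining(model, X_rolling, y_rolling, threshold=0.75):
#     trigger_retraining_pipeline()
```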
How to quantify “acceptable” is up to the Data Scientist. I’ve seen models retrained after a drop in AuROC of about 10%, for example. My suggestion is to take the average performance of the model on the rolling test dataset and compare it with the 95% confidence interval of the same metric on the original test dataset. If the performance on the new data is lower than the lower bound of that confidence interval, something is probably wrong and the model is losing its ability to generalize.
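Here is one way this check could look in Python, assuming a classifier scored with AuROC. The bootstrap of the original test set and the variable names are illustrative assumptions, not a fixed recipe.

```python
# A sketch of the confidence-interval check described above. X_test, y_test
# (original test set) and X_roll, y_roll (rolling set) are assumed to be
# NumPy arrays; the bootstrap size is illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_lower_bound(model, X_test, y_test, n_boot=1000, seed=0):
    """Lower bound of the bootstrap 95% confidence interval of the AuROC."""
    rng = np.random.default_rng(seed)
    scores = model.predict_proba(X_test)[:, 1]
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_test), len(y_test))
        if len(np.unique(y_test[idx])) < 2:
            continue  # skip resamples that contain a single class
        aucs.append(roc_auc_score(y_test[idx], scores[idx]))
    return np.percentile(aucs, 2.5)

def retrain_needed(model, X_test, y_test, X_roll, y_roll):
    """Trigger a retrain if the rolling AuROC falls below the CI lower bound."""
    rolling_auc = roc_auc_score(y_roll, model.predict_proba(X_roll)[:, 1])
    return rolling_auc < auc_lower_bound(model, X_test, y_test)
```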
Other models may require a human check to decide whether they need to be retrained. Maybe there isn’t a single performance indicator that drives the decision, but a more complex decision scheme based on several things (e.g. precision and AuROC, R-squared and MAPE, or specific correlations between the target and some features).
The idea is that the model has to be monitored after deployment in production, and somebody should constantly look at its performance in order to decide when to trigger retraining.
The flaw of this approach is that, once the model has been labeled as inefficient, a new model must be trained from scratch, and this may take a lot of time. The worst case is when you need new features that aren’t present in your original dataset; collecting them is the most time-consuming part of the entire process. In the meantime, you’re stuck with an inefficient model running in production that doesn’t provide the value you originally paid for.
So, deploying a model to production and then building a monitoring and response process around it requires a careful upfront cost-benefit analysis.
How to mitigate the risk
If you want a stable model in production, my suggestion is to use bagging models like Random Forest. As long as the distribution of the data stays the same, bagging models are more robust to new data arriving over time. Because bagging reduces the variance term of the bias-variance tradeoff, retraining the model on fresh data gives you almost the same performance, so you can often keep the model as it is.
By contrast, boosting-based models may be more sensitive to small changes in the data and therefore less stable.
My suggestion is always to measure the variance of your performance metrics in cross-validation (don’t worry, it’s just the variance of your performance indicator across the folds) and select the most stable model, i.e. the one with the lowest variance. It may not be the best-performing model, but it will be more robust as time passes.
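As an illustration, here is a sketch of this comparison using scikit-learn’s `cross_val_score` on synthetic data; the two candidate models and the AuROC metric are assumptions for the example.

```python
# A minimal sketch of comparing model stability via cross-validation variance.
# The dataset is synthetic and the candidate models are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    # The mean tells you how good the model is; the variance across folds
    # tells you how stable it is. Prefer the candidate with the lowest variance.
    print(f"{name}: mean AuROC={scores.mean():.3f}, variance={scores.var():.5f}")
```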
Conclusions
Retraining a model is a complex task that cannot be fully automated unless you accept some approximations. If you can afford continuous, rolling retraining, go for it. Otherwise, set up a constant monitoring process and retrain quickly when required. Each of these techniques has its costs and benefits and should always be analyzed in a proper business case.