Adding endogenous features can improve model accuracy

Endogenous features are additional features built from the existing data itself, without bringing in any external source. A Fourier transform or a wavelet decomposition of the raw signal is a good example.

To observe the effect of endogenous features, I implement the MNIST classifier built with XGBoost in the book Practical Gradient Boosting. The example notebook in the Github repo below shows that the addition of an endogenous feature improves the accuracy of the XGBoost model from 97.78% to 98.15%.
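A minimal sketch of the idea, using scikit-learn's small digits dataset as a lightweight stand-in for MNIST and a 2D Fourier magnitude as the endogenous feature (the notebook's exact setup and feature choice may differ):

```python
# Hedged sketch: augmenting pixel features with a Fourier-based endogenous feature.
# The notebook uses full MNIST; load_digits is used here as a lightweight stand-in.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_digits(return_X_y=True)

# Endogenous feature: magnitude of the 2D FFT of each image, flattened
# and concatenated with the raw pixels (no external data is introduced).
fft_mag = np.abs(np.fft.fft2(X.reshape(-1, 8, 8))).reshape(len(X), -1)
X_aug = np.hstack([X, fft_mag])

X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, random_state=0)
clf = XGBClassifier(n_estimators=200, max_depth=6)
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```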

<Github repo>

Optimizing XGBoost with XGBoost

Typically, hyperparameter tuning is performed by doing a grid search over the space of hyperparameter values.

However, the space of hyperparameter values can be large, and training a model for each candidate configuration is time-consuming even with parallelization.

Randomized search algorithms, e.g. RandomizedSearchCV and HalvingRandomSearchCV in scikit-learn, may provide satisfactory results at low computational cost. Nevertheless, they can still miss a better configuration for the model.

Surrogate models make it possible to search for good hyperparameters without exhausting the space, by steering the search toward the most promising candidates.

In this notebook, we train an XGBoost model to predict the score that a given hyperparameter configuration would achieve. This way, we do not have to run a complete training to evaluate each candidate.

The general algorithm can be summarized as follows: n configurations are drawn at random; the surrogate model estimates the gain for each one; the most promising configuration is then actually trained and evaluated; and the score obtained is fed back to enrich the surrogate model.

As a case study, we apply this surrogate-based search to the California housing dataset and compare the results.
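The sketch below illustrates the loop under some assumptions: the search space (max_depth, learning_rate, n_estimators), the candidate counts, and the use of cross-validated R² as the score are all illustrative choices, not necessarily those of the notebook.

```python
# Hedged sketch of the surrogate-model search loop described above;
# the notebook's exact candidate generation and scoring may differ.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X, y = fetch_california_housing(return_X_y=True)
rng = np.random.default_rng(0)

def sample_config():
    # Hypothetical search space: (max_depth, learning_rate, n_estimators).
    return np.array([rng.integers(2, 10),
                     rng.uniform(0.01, 0.3),
                     rng.integers(50, 400)])

def true_score(cfg):
    # Expensive evaluation: a full cross-validated training run.
    model = XGBRegressor(max_depth=int(cfg[0]), learning_rate=cfg[1],
                         n_estimators=int(cfg[2]))
    return cross_val_score(model, X, y, cv=3, scoring="r2").mean()

# Seed the surrogate with a few fully evaluated configurations.
configs = [sample_config() for _ in range(5)]
scores = [true_score(c) for c in configs]
surrogate = XGBRegressor(n_estimators=100)

for _ in range(10):
    surrogate.fit(np.vstack(configs), np.array(scores))
    # Draw candidate configurations and let the surrogate rank them cheaply.
    candidates = np.vstack([sample_config() for _ in range(200)])
    best = candidates[np.argmax(surrogate.predict(candidates))]
    # Only the most promising candidate pays for a real training run;
    # its true score then enriches the surrogate's training set.
    configs.append(best)
    scores.append(true_score(best))

print("best config:", configs[int(np.argmax(scores))], "score:", max(scores))
```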

<Github repo>

Automatic feature extraction for time series using the tsfresh package

At the time of writing this post (June 22nd, 2023), the tsfresh package contains 783 built-in functions to calculate meaningful features from a time series.

These features can be added to the existing training set as endogenous features, which can help improve the accuracy of baseline machine learning models. Interesting features generated by tsfresh include quantiles at different levels, autocorrelations at different lags, auto-regressive coefficients, the entropy of the time series, and its energy.
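A minimal sketch of the extraction step, on a toy long-format series (the notebook's data and extraction settings may differ):

```python
# Hedged sketch: extracting tsfresh features from a toy time series so they
# can be appended to a training set as endogenous features.
import numpy as np
import pandas as pd
from tsfresh import extract_features

# Long-format frame expected by tsfresh: one row per (series id, time step).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": np.repeat(np.arange(10), 50),
    "time": np.tile(np.arange(50), 10),
    "value": rng.normal(size=500),
})

# One row of features per series id; includes quantiles, autocorrelations,
# AR coefficients, entropy, energy, and many more.
features = extract_features(df, column_id="id", column_sort="time")
print(features.shape)
```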

<Github repo>

Calculating prediction quantiles using XGBoost

Oftentimes, a user finds it more insightful to have a confidence interval alongside a point prediction. The interval serves as a proxy for how far the point prediction may deviate; a wide interval signals that the prediction is not very reliable.

In this notebook, we estimate the confidence interval of a prediction using a quantile objective function. A quantile objective can also be used to generate a distribution of predictions, which may serve as a source of additional features for downstream tasks.

The log cosh function is used as a smooth approximation of the quantile (pinball) loss; a sketch of the resulting objective for the α quantile is given below.
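One plausible smooth form (an assumption; the notebook's exact formula may differ) replaces the absolute value in the pinball loss with log cosh. Writing $u = y - \hat{y}$ for the residual:

$$
\ell_\alpha(u) = \frac{(2\alpha - 1)\,u + \log\cosh(u)}{2}
$$

Since $\log\cosh(u) \approx |u| - \log 2$ for large $|u|$, this recovers the usual pinball loss $\max(\alpha u, (\alpha - 1)u)$ up to a constant, while remaining twice differentiable so that XGBoost can use its gradient and Hessian.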

A case study involving the California housing data is also provided.
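A minimal sketch of such a custom objective, assuming the smooth form above (the notebook's implementation may differ), with one booster per quantile so that the two predictions bracket the point estimate:

```python
# Hedged sketch of a smooth quantile objective for XGBoost, assuming the loss
# ((2*alpha - 1)*u + log(cosh(u)))/2 with u = y - y_pred; the notebook's exact
# objective may differ.
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

def log_cosh_quantile(alpha):
    def objective(y_pred, dtrain):
        u = dtrain.get_label() - y_pred
        grad = -((2 * alpha - 1) + np.tanh(u)) / 2   # d loss / d y_pred
        hess = (1 - np.tanh(u) ** 2) / 2             # second derivative, always > 0
        return grad, hess
    return objective

X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
dtrain, dtest = xgb.DMatrix(X_tr, label=y_tr), xgb.DMatrix(X_te)

# One booster per quantile gives lower/upper bounds around the point prediction.
lower = xgb.train({"max_depth": 5}, dtrain, num_boost_round=200,
                  obj=log_cosh_quantile(0.05))
upper = xgb.train({"max_depth": 5}, dtrain, num_boost_round=200,
                  obj=log_cosh_quantile(0.95))
print(lower.predict(dtest)[:5], upper.predict(dtest)[:5])
```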

<Github repo>