Data Cleaning • OOS

Out-of-sample forecasting exercises typically focus on re-estimating the forecasting model at each time step for which a historical forecast is simulated. However, it is equally important to re-treat the information set at each time t, lest the information set itself be infected with look-ahead bias.

Common ways for look-ahead bias to enter the information set include cleaning outliers, standardizing series, imputing missing values, and performing dimension reduction. For example, a forecaster working in 2020 may want to impute missing values and reaches for a favorite time series imputation technique, let’s say a simple historical average. However, if the average is calculated over the full time series (and the series has not previously been made at least first-order stationary), then future observations are being used to fill in past values. As a result, the model, through the imputed observations, will use future information to estimate its parameters. To avoid this source of look-ahead bias, the forecaster needs to re-calculate the historical average, and impute the missing data, using only observations available through each historical forecasting date.
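The difference between full-sample and as-of-date imputation can be seen in a minimal sketch. It is written in plain Python for illustration only and is not how OOS implements imputation; the function name impute_mean_as_of and the toy series are hypothetical.

```python
from statistics import mean

def impute_mean_as_of(series, t):
    """Fill gaps in series[0..t] using only values observed through index t.

    Missing values are represented by None. Hypothetical helper, for
    illustration only.
    """
    window = series[: t + 1]
    observed = [x for x in window if x is not None]
    if not observed:
        return list(window)  # nothing observed yet, so leave the gaps
    fill = mean(observed)
    return [fill if x is None else x for x in window]

series = [1.0, None, 3.0, 5.0, 11.0]

# Forecast date t = 2: the gap is filled with the mean of 1.0 and 3.0.
as_of_2 = impute_mean_as_of(series, 2)

# Using the full sample instead lets 5.0 and 11.0 leak into the past.
full_sample = impute_mean_as_of(series, len(series) - 1)
```

With this toy series, the as-of-date fill for the second observation is 2.0, while the full-sample fill is 5.0, so a model estimated at t = 2 on the full-sample version would already reflect the later, larger values.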

Another pernicious, yet common, example of such look-ahead bias occurs when cross-validation is used to train ML models without respecting the time series dimension of an experiment. For example, it is good practice to standardize data before estimating a neural network. However, if all data are standardized before entering the training routine, then values in the training set during CV will inevitably have been standardized with data from the hold-out set. That is, the training data will contain information from the test set.
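One way to keep the hold-out set out of the scaling step is to fit the standardization on each training fold and apply those same parameters to the matching test fold. The sketch below assumes an expanding-window split and is written in plain Python for illustration; fit_scaler, apply_scaler, and expanding_splits are hypothetical names, not OOS functions.

```python
from statistics import mean, pstdev

def fit_scaler(train):
    """Estimate mean and standard deviation from the training fold only."""
    mu = mean(train)
    sigma = pstdev(train)
    return mu, (sigma if sigma > 0 else 1.0)  # guard against a constant fold

def apply_scaler(values, mu, sigma):
    """Standardize values with parameters estimated elsewhere."""
    return [(x - mu) / sigma for x in values]

def expanding_splits(n, min_train):
    """Yield (train_idx, test_idx) pairs where the test point follows the train window."""
    for end in range(min_train, n):
        yield list(range(end)), [end]

data = [2.0, 4.0, 6.0, 8.0, 100.0]

scaled_folds = []
for train_idx, test_idx in expanding_splits(len(data), min_train=3):
    train = [data[i] for i in train_idx]
    test = [data[i] for i in test_idx]
    mu, sigma = fit_scaler(train)  # the test fold is never seen here
    scaled_folds.append((apply_scaler(train, mu, sigma),
                         apply_scaler(test, mu, sigma)))
```

In the first fold the scaler is estimated from 2.0, 4.0, and 6.0 only, so the large final observation cannot influence how earlier training data are scaled.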

Out-of-sample information set cleaning can easily be implemented within OOS through its two workhorse forecasting routines, forecast_univariate and forecast_multivariate. At each time step a user may perform the following cleaning:

Clean Outliers
1. Winsorize
2. Trim
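The two outlier treatments above can be sketched with thresholds computed from data through the forecast date only. This is an illustration in plain Python, not the OOS implementation: the function names are hypothetical, the quantile levels are arbitrary, and trimming is assumed here to mean setting outliers to missing (None).

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def clean_outliers_as_of(series, t, lower, upper, method="winsorize"):
    """Treat outliers in series[0..t] using thresholds estimated through t only."""
    window = series[: t + 1]
    ranked = sorted(window)
    lo, hi = quantile(ranked, lower), quantile(ranked, upper)
    if method == "winsorize":
        # Pull extreme values back to the threshold.
        return [min(max(x, lo), hi) for x in window]
    # Trim (assumed definition): mark extreme values as missing.
    return [x if lo <= x <= hi else None for x in window]

series = [1.0, 2.0, 3.0, 4.0, 100.0, 5.0]

# Thresholds are based on the first five observations; the sixth is unseen.
winsorized = clean_outliers_as_of(series, 4, 0.25, 0.75, method="winsorize")
trimmed = clean_outliers_as_of(series, 4, 0.25, 0.75, method="trim")
```

Repeating the call with a later t re-estimates the thresholds, which is the sense in which cleaning is redone at each time step.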

Impute Missing Observations
1. Interpolation
2. Kalman Filter
3. Fill-Forward
4. Average
5. Moving Average
6. Seasonal Decomposition

Note: imputing missing data is handled through the imputeTS package.

Dimension Reduction
1. Principal Components

Note: principal components are created using the princomp function from the stats package.
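The same as-of-date logic applies to dimension reduction: loadings should be estimated only from rows available at the forecast date. The sketch below is an illustration in plain Python and is not princomp; it substitutes simple power iteration to recover the first principal component, and the function name and toy data are hypothetical.

```python
import math

def first_component_as_of(rows, t, iters=200):
    """Loadings and scores of the first principal component from rows[0..t].

    Uses power iteration on the sample covariance matrix. A sketch only:
    it fails if the starting vector is orthogonal to the leading component.
    """
    window = rows[: t + 1]
    n, k = len(window), len(window[0])
    means = [sum(r[j] for r in window) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in window]
    # Sample covariance; the choice of divisor does not change the direction.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(k)]
           for i in range(k)]
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(k)) for c in centered]
    return v, scores

rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0], [50.0, -3.0]]

# Loadings at t = 3 are estimated before the final row is observed.
loadings, scores = first_component_as_of(rows, 3)
```

Calling the function again with t = 4 would re-estimate the loadings with the final row included, so the factor fed to the model at each date reflects only what was known at that date.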