I have month-end data for about 75 variables (numeric and categorical factors, but mostly numeric) for the last 5 years. I have a dependent variable that I’d like to understand the key drivers of, and be able to predict the probability of with new data. Typically I would use a random forest or LASSO regression, but I’m struggling given the data’s time-series nature: I understand that random forests and most standard regression models assume independent observations, yet I have sequential month-end data points.
So what should I do? Should I just ignore the time-series nature and run the models as-is? I know there are models for everything, but I’m not familiar with another strong option for tackling this problem.
You should never ignore the time-series nature. You could keep using the same methods, but after adding lagged and seasonal-lagged features (of both the DV and the IVs) to your data.
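For what it’s worth, here is a minimal sketch of that feature engineering with pandas; the file name, the column names (`month_end`, `dv`, `x1`), and the lag depths are all placeholders you’d swap for your own:

```python
import pandas as pd

df = pd.read_csv("monthly_data.csv", parse_dates=["month_end"])
df = df.sort_values("month_end")

# Plain lags of the dependent variable and one predictor.
for lag in (1, 2, 3):
    df[f"dv_lag{lag}"] = df["dv"].shift(lag)
    df[f"x1_lag{lag}"] = df["x1"].shift(lag)

# Seasonal lag: with month-end data, the same month one year back is 12 steps.
df["dv_lag12"] = df["dv"].shift(12)

# The earliest rows now contain NaNs from shifting; drop them before modeling.
df = df.dropna()
```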
This is where I’m struggling, because this suggests I essentially at least double the number of features by creating lags for each variable. Feels off, you know?
I believe you could first use a bagging or boosting model (random forest or XGBoost) to map out which variables are mostly noise, and then choose the final model once you have more dependable features.
One approach would be to run an experiment with the N variables in a random order, dropping the noisier ones first. Then perhaps try dropping a few more and creating lags for some of the surviving ones.
Next, try to find the best possible final model using the features that are left.
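A hedged sketch of that screening step with scikit-learn, assuming a binary DV and a frame `df` that already holds the lagged features from earlier; the importance cutoff here is arbitrary:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# df is assumed to already contain the DV plus the (lagged) candidate features.
X = df.drop(columns=["dv"])
y = df["dv"]

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X, y)

# Rank features by impurity importance and keep the stronger half.
importances = pd.Series(rf.feature_importances_, index=X.columns)
keep = importances[importances > importances.median()].index
X_reduced = X[keep]
```

One caveat: impurity importances are biased toward high-cardinality features, so permutation importance on a held-out window is often a more dependable ranking.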
Before all of this, in order to get a feel for which variables make the most sense, I would also look at them qualitatively, taking into consideration the eventual use case and the quality of the data (nulls, outliers, distributions, etc.).
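A quick pandas pass surfaces most of those quality issues; again just a sketch, with the file name assumed:

```python
import pandas as pd

df = pd.read_csv("monthly_data.csv")

print(df.isna().mean().sort_values(ascending=False))  # share of nulls per column
print(df.describe().T)                                # distributions at a glance

# Crude outlier screen: share of values more than 3 SDs from the column mean.
num = df.select_dtypes("number")
print(((num - num.mean()).abs() > 3 * num.std()).mean().sort_values(ascending=False))
```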
As others have already mentioned, exercise extreme caution when cross-validating to avoid contaminating the test set, and watch out for time-series leakage.
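scikit-learn’s TimeSeriesSplit is one simple guard here: each fold trains only on the past and evaluates only on the future. A sketch, reusing the hypothetical `X_reduced` and `y` from above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Rows must be in chronological order for this to mean anything.
tscv = TimeSeriesSplit(n_splits=5)
model = RandomForestClassifier(n_estimators=500, random_state=0)

scores = cross_val_score(model, X_reduced, y, cv=tscv, scoring="roc_auc")
print(scores)
```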
Yes, you would be more than doubling the feature count, but if this is time-series data where you suspect autocorrelation and lagged dependencies, then without taking those properties into account you’d be drawing false conclusions from the methods you mentioned.
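You can check whether that suspicion holds before paying the cost. A minimal sketch with statsmodels, assuming the DV is numeric (or a 0/1 indicator) and sorted chronologically:

```python
from statsmodels.tsa.stattools import acf

# Large spikes at lags 1-3 (carry-over) or lag 12 (seasonality) would
# justify building the corresponding lag features.
print(acf(df["dv"].dropna(), nlags=12))
```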
If size is an issue, use your judgement to select which variables should get lagged features and how many lag steps. Also, in some instances, simply encoding temporal features may suffice.
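For example, a trend index plus a cyclical month encoding captures seasonality without any lags at all; the column names here are placeholders:

```python
import numpy as np

# Assumes df["month_end"] is already a datetime column.
df["trend"] = np.arange(len(df))                  # linear time index
month = df["month_end"].dt.month
df["month_sin"] = np.sin(2 * np.pi * month / 12)  # cyclical month encoding
df["month_cos"] = np.cos(2 * np.pi * month / 12)
```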
You can still use those tools for variable selection, but you may want to try modeling the response as a multivariate autoregressive time series with exogenous variables (a VARX-type model).
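One hedged way to try that in Python is statsmodels’ VARMAX; which columns go in the endogenous block, which stay exogenous, and the lag order are all assumptions you’d need to revisit for your data:

```python
from statsmodels.tsa.statespace.varmax import VARMAX

# Hypothetical split: the DV plus one tightly coupled driver as endogenous,
# two other predictors treated as exogenous regressors.
endog = df[["dv", "x1"]]
exog = df[["x2", "x3"]]

model = VARMAX(endog, exog=exog, order=(2, 0))  # VAR(2) with exogenous terms
result = model.fit(disp=False)
print(result.summary())
```

If the DV is binary, a linear VARX is a stretch; in that case the lagged-feature route into a logistic LASSO or gradient boosting is probably the more natural fit.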