[MRG] Cross-validation for time series (inserting gaps between the training set and the test set) #13761
Conversation
In your examples in the docstrings, why do the training sets sometimes contain larger indices than the test sets? That would mean training a model on the future and predicting data from the past. Notice how for TimeSeriesSplit all the training indices precede the test indices:
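For reference, a minimal sketch of that ordering with the current `TimeSeriesSplit`:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)
# train: [0 1 2] test: [3]
# train: [0 1 2 3] test: [4]
# train: [0 1 2 3 4] test: [5]
```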
Yes, it is totally possible to train models on future data and validate them on past data.
Help needed. The tests passed locally in my build but failed in some other builds. What could be the cause?
Sorry, I've not had time to look at this yet. Have you checked the build logs? https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=2928
I finally found the cause: the different interpretations of …, where Linux pylatest_conda interprets it as …, while Linux py35_conda_openblas and Linux py35_np_atlas interpret it as …. According to the numpy manual, the first one is the correct interpretation, even for numpy ….
ooooooooooooooo|||||||||||||xxxxxxxxxxxxx|||||||||||||||||||||||oooooooooooooooooooooo

See here for more explanation.
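Reading the diagram as o = training sample, | = gap, x = test sample (my interpretation; the thread does not spell out the legend), here is a minimal sketch of producing such a split. The helper name and signature are assumptions, not part of this PR:

```python
import numpy as np

def gap_split(n_samples, test_start, test_size, gap_size):
    # Hypothetical helper: one contiguous test block, with `gap_size`
    # samples dropped on each side of it; everything else is training.
    test = np.arange(test_start, test_start + test_size)
    left_train = np.arange(0, max(test_start - gap_size, 0))
    right_train = np.arange(
        min(test_start + test_size + gap_size, n_samples), n_samples
    )
    return np.concatenate([left_train, right_train]), test

# 20 samples, test block [8, 12), gap of 2 on each side:
train, test = gap_split(20, 8, 4, 2)
print(train)  # [ 0  1  2  3  4  5 14 15 16 17 18 19]
print(test)   # [ 8  9 10 11]
```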
btw people have told me we should just implement https://topepo.github.io/caret/data-splitting.html#data-splitting-for-time-series though I don't think it has gaps?
I'm not convinced we want to promote the gap approach here. The caret approach seems useful, but it is surprising that there is no way to limit the number of splits: it just generates every possible test set. There is also no gap.
No, it doesn't. Another R package, named blockCV, provides this functionality. It is about spatial data, but it also applies to time series (a time series can be treated as a 1-D series in space).
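For reference, a sketch of the caret-style rolling-origin splitting discussed above (modelled on caret's createTimeSlices; the Python rendering and parameter names are my own, not caret's API):

```python
def rolling_origin_splits(n_samples, initial_window, horizon, fixed_window=True):
    # Caret-style time slices: one split per possible origin, i.e. every
    # possible test set is generated (no n_splits limit, and no gap).
    for end in range(initial_window, n_samples - horizon + 1):
        start = end - initial_window if fixed_window else 0
        yield list(range(start, end)), list(range(end, end + horizon))

# 8 samples, training window of 3, forecast horizon of 2:
for train, test in rolling_origin_splits(8, 3, 2):
    print(train, test)
# [0, 1, 2] [3, 4]
# [1, 2, 3] [4, 5]
# [2, 3, 4] [5, 6]
# [3, 4, 5] [6, 7]
```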


Time series have temporal dependence, which may cause information leaks during cross-validation. One way to mitigate this risk is to introduce gaps between the training set and the test set. This PR implements such a feature for leave-p-out, K-fold, and the naive train-test split.
As for the walk-forward one, @kykosic is implementing a similar feature (among others) for the `TimeSeriesSplit` class in #13204. I reckon his implementation promising, so I refrain from reinventing the wheel.

Concerning my implementation, I "refactored" the whole structure while keeping the same public API: `GapCrossValidator` replaces `BaseCrossValidator` and becomes the base class from which `GapLeavePOut` and `GapKFold` derive. Although not tested, all other subclasses, I believe, can derive from the new `GapCrossValidator`. I put quotation marks around the word "refactored" because I didn't really touch the original code; instead, my code currently coexists with it.

Classes and functions added: `GapCrossValidator`, `GapLeavePOut`, `GapKFold`, and a gap-aware train-test split function.
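A hedged sketch of how the new splitters might be used with the existing model-selection tools; the constructor parameters `gap_before` and `gap_after` are assumptions, not confirmed by this description:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, GapKFold  # GapKFold: added by this PR

rng = np.random.RandomState(0)
X, y = rng.rand(100, 3), rng.rand(100)

# Drop 2 samples on each side of every test fold
# (`gap_before`/`gap_after` are assumed parameter names).
cv = GapKFold(n_splits=5, gap_before=2, gap_after=2)
print(cross_val_score(Ridge(), X, y, cv=cv))
```

Since the public API is unchanged, any such splitter exposing `split`/`get_n_splits` can be passed wherever scikit-learn accepts a `cv` object.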
Related issues and PRs
#6322, #13204
Related users
@kykosic, @amueller, @jnothman, @cbrummitt