[MRG] Cross-validation for time series (inserting gaps between the training set and the test set) #13761
Conversation
In your examples in the docstrings, why do the training sets sometimes contain larger indices than the test sets? That would mean training a model on the future and predicting data from the past. Notice how for TimeSeriesSplit all the training indices precede the test indices:
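For reference, a minimal sketch of that ordering with the current `TimeSeriesSplit`:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)
# train: [0 1 2] test: [3]
# train: [0 1 2 3] test: [4]
# train: [0 1 2 3 4] test: [5]
```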
Yes, it is totally possible to train models on future data and validate them on past data.
Help needed. The tests passed locally in my build but failed in some other builds. What could be the cause?
Sorry, I've not had time to look at this yet. Have you checked the build logs? https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=2928
I finally found the cause: the different interpretations of …, where Linux pylatest_conda interprets it as …, while Linux py35_conda_openblas and Linux py35_np_atlas interpret it as …. According to the numpy manual, the first one is the correct interpretation, even for numpy ….
ooooooooooooooo|||||||||||||xxxxxxxxxxxxx|||||||||||||||||||||||oooooooooooooooooooooo

See here for more explanation.
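Reading the diagram as o = training sample, | = gap, x = test sample (my interpretation; the thread does not spell out the legend), here is a minimal sketch of producing such a split. The helper name and signature are assumptions, not part of this PR:

```python
import numpy as np

def gap_split(n_samples, test_start, test_size, gap_size):
    # Hypothetical helper: one contiguous test block, with `gap_size`
    # samples dropped on each side of it; everything else is training.
    test = np.arange(test_start, test_start + test_size)
    left_train = np.arange(0, max(test_start - gap_size, 0))
    right_train = np.arange(
        min(test_start + test_size + gap_size, n_samples), n_samples
    )
    return np.concatenate([left_train, right_train]), test

# 20 samples, test block [8, 12), gap of 2 on each side:
train, test = gap_split(20, 8, 4, 2)
print(train)  # [ 0  1  2  3  4  5 14 15 16 17 18 19]
print(test)   # [ 8  9 10 11]
```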
btw people have told me we should just implement https://topepo.github.io/caret/data-splitting.html#data-splitting-for-time-series though I don't think it has gaps?
I'm not convinced we want to promote the gap approach here. The caret approach seems useful, but it is surprising that there is no way to limit the number of splits: it just generates every possible test set. There is also no gap.
No, it doesn't. Another R package, named blockCV, provides this functionality. It is about spatial data, but it also applies to time series (a time series can be treated as a 1-D series in space).
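For reference, a sketch of the caret-style rolling-origin splitting discussed above (modelled on caret's createTimeSlices; the Python rendering and parameter names are my own, not caret's API):

```python
def rolling_origin_splits(n_samples, initial_window, horizon, fixed_window=True):
    # Caret-style time slices: one split per possible origin, i.e. every
    # possible test set is generated (no n_splits limit, and no gap).
    for end in range(initial_window, n_samples - horizon + 1):
        start = end - initial_window if fixed_window else 0
        yield list(range(start, end)), list(range(end, end + horizon))

# 8 samples, training window of 3, forecast horizon of 2:
for train, test in rolling_origin_splits(8, 3, 2):
    print(train, test)
# [0, 1, 2] [3, 4]
# [1, 2, 3] [4, 5]
# [2, 3, 4] [5, 6]
# [3, 4, 5] [6, 7]
```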


Time series have temporal dependence, which may cause information leaks during cross-validation. One way to mitigate this risk is to introduce gaps between the training set and the test set. This PR implements such a feature for leave-p-out, K-fold, and the naive train-test split.
As for the walk-forward one, @kykosic is implementing a similar feature (among others) for the `TimeSeriesSplit` class in #13204. I reckon his implementation promising, so I refrain from reinventing the wheel.

Concerning my implementation, I "refactored" the whole structure while keeping the same public API: `GapCrossValidator` replaces `BaseCrossValidator` and becomes the base class from which `GapLeavePOut` and `GapKFold` derive. Although not tested, all other subclasses, I believe, can derive from the new `GapCrossValidator`. I put quotation marks around the word "refactored" because I didn't really touch the original code; instead, my code currently coexists with it.

Classes and functions added: `GapCrossValidator`, `GapLeavePOut`, `GapKFold`, and a gap-aware train-test split function.
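A hedged sketch of how the new splitters might be used with the existing model-selection tools; the constructor parameters `gap_before` and `gap_after` are assumptions, not confirmed by this description:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, GapKFold  # GapKFold: added by this PR

rng = np.random.RandomState(0)
X, y = rng.rand(100, 3), rng.rand(100)

# Drop 2 samples on each side of every test fold
# (`gap_before`/`gap_after` are assumed parameter names).
cv = GapKFold(n_splits=5, gap_before=2, gap_after=2)
print(cross_val_score(Ridge(), X, y, cv=cv))
```

Since the public API is unchanged, any such splitter exposing `split`/`get_n_splits` can be passed wherever scikit-learn accepts a `cv` object.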
Related issues and PRs
#6322, #13204
Related users
@kykosic, @amueller, @jnothman, @cbrummitt