Description
This is an RFC on the lack of cloning of estimators in the pipeline.
Current situation
The Pipeline
modifies in the fit method the estimators that are given in it's steps
argument, in violation of the scikit-learn convention (bad bad coder).
Specific issue raised
To implement caching of the fit of transformers reliably, cloning of them is crucial (#7990), this way the caching is dependent only on the model parameters of the transformers.
In the PR (#7990), the first attempt was to implement a new class, CachedPipeline
that would behave like the Pipeline
but clone the transformers and cache them. The drawback is the multiplication of classes that are very similar, which makes features harder to discover and give surprises as there is a subttle difference between Pipeline
and CachedPipeline
.
Proposal
The proposal (put forward IRL by @ogrisel, but that has my favor too) is to deprecate the fact that pipeline.steps
is modified and introduce a .steps_
attribute (and a .named_steps_
attribute). That way we could merge Pipeline
and CachedPipeline
.
Implementation
The difficulty is the deprecation path, as often. Two options:
-
Make
steps
andnamed_steps
be properties, and add a warning upon access. Make it so that, for two releases, they return the modified estimators, ie what is stored insteps_
andnamed_steps_
. In two releases,named_steps
dies in favor ofnamed_steps_
, and we remove the properties. -
Create a new class, for instance
Pipe
, that has the new logic with optional caching.
I have no strong opinion on which deprecation path is best. Option 1 is more work but may less intrusive on the users.