The Wayback Machine - https://web.archive.org/web/20210117020640/https://github.com/scikit-learn/scikit-learn/pull/13307

Feature names with input features #13307

Open · wants to merge 63 commits into base: master
Conversation

@amueller (Member) commented Feb 27, 2019

Builds on top of #12627 but provides users with a nicer interface.
Again using this example:
https://scikit-learn.org/dev/auto_examples/compose/plot_column_transformer_mixed_types.html

We now have:

clf.named_steps['classifier'].input_features_

or alternatively:

clf.get_feature_names(X.columns.map(lambda x: x.upper()))
clf.named_steps['classifier'].input_features_

Transformers not implementing get_feature_names after this PR are:
['AdditiveChi2Sampler', 'FunctionTransformer', 'Imputer' (deprecated one), 'IterativeImputer', 'KBinsDiscretizer', 'KernelCenterer', 'KernelPCA', 'MissingIndicator']

There are possibly others that I can't test, like TfidfTransformer.

One issue I haven't thought about enough is naming for transformers like PCA.
Producing "pca0", "pca1", etc. is fine in a Pipeline, but in a ColumnTransformer it will probably lead to redundant names.

Currently suggested names:
feature_names_in_, feature_names_out_ (?), and update_feature_names(feature_names_in). (Or should the argument be input_feature_names here? It should probably be consistent, right?)
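To make the redundancy concern concrete, here is a minimal sketch of the generic naming scheme and how a ColumnTransformer prefix would duplicate it. The helper name is hypothetical, not part of scikit-learn's API:

```python
def generate_feature_names(prefix, n_features):
    # Hypothetical helper mirroring the "pca0, pca1, ..." scheme
    # discussed above; not part of scikit-learn's API.
    return [f"{prefix}{i}" for i in range(n_features)]

# In a plain Pipeline the generic names are unambiguous:
print(generate_feature_names("pca", 3))
# ['pca0', 'pca1', 'pca2']

# In a ColumnTransformer, the step name is prefixed as well,
# which makes the names redundant:
print([f"pca__{n}" for n in generate_feature_names("pca", 2)])
# ['pca__pca0', 'pca__pca1']
```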

@adrinjalali (Member) commented Feb 27, 2019

repeating our discussion: it'd be nice if we had the pipeline set the feature names after each individual fit instead of at the end.
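A rough sketch of that idea, with toy stand-ins rather than scikit-learn's actual Pipeline: each step records its input names as input_features_ before the next step is fitted, so the names are available during fit rather than only afterwards.

```python
class SelectFirstTwo:
    # Toy transformer: keeps the first two columns.
    def fit_transform(self, X):
        return [row[:2] for row in X]

    def get_feature_names(self, input_features):
        return list(input_features[:2])

class RecordingClassifier:
    # Toy final estimator; does nothing but fit.
    def fit(self, X):
        return self

def fit_with_names(steps, X, feature_names):
    # Hypothetical behaviour: set feature names after each individual
    # fit instead of once at the end of the whole pipeline fit.
    names = list(feature_names)
    for _, est in steps:
        est.input_features_ = names
        if hasattr(est, "fit_transform"):
            X = est.fit_transform(X)
            names = est.get_feature_names(names)
        else:
            est.fit(X)
    return steps

steps = fit_with_names(
    [("select", SelectFirstTwo()), ("clf", RecordingClassifier())],
    [[5.1, 3.5, 1.4, 0.2]],
    ["sepal length", "sepal width", "petal length", "petal width"],
)
print(dict(steps)["clf"].input_features_)
# ['sepal length', 'sepal width']
```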

@amueller (Member, Author) commented Feb 27, 2019

@adrinjalali can you give an example with sklearn steps that would benefit from that?

@amueller (Member, Author) commented Feb 27, 2019

I guess it's @jnothman's example of preprocessing different features created by a CountVectorizer depending on the words. But I think we decided we don't want to support that, right?
Even if you set them, that wouldn't help, as the next estimator doesn't have access to the previous one.

['city_category__city_London', 'city_category__city_Paris', 'city_category__city_Sallisaw',
 'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
 'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
 'title_bow__wrath']
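The prefixed names above follow ColumnTransformer's '<step name>__<feature name>' convention; a minimal sketch of that composition (the helper is illustrative, not the real implementation):

```python
def column_transformer_names(fitted_steps):
    # fitted_steps: list of (step_name, output_feature_names) pairs.
    # Each output name is prefixed with its step name, joined by '__'.
    return [f"{step}__{feat}"
            for step, feats in fitted_steps
            for feat in feats]

print(column_transformer_names([
    ("city_category", ["city_London", "city_Paris", "city_Sallisaw"]),
    ("title_bow", ["how", "last", "learned"]),
]))
# ['city_category__city_London', 'city_category__city_Paris',
#  'city_category__city_Sallisaw', 'title_bow__how', 'title_bow__last',
#  'title_bow__learned']
```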

@amueller (Author, Member) commented Feb 27, 2019 (minimized review comment)

This is somewhat backwards-incompatible but maybe not that bad?

@GaelVaroquaux (Member) commented Mar 1, 2019 (minimized review comment)

If we rename get_feature_names, we no longer have the problem of backward incompatibility.

Resolved review threads on: sklearn/pipeline.py, sklearn/compose/_column_transformer.py
@adrinjalali (Member) commented Apr 5, 2019

I put a draft for a feature name SLEP here: scikit-learn/enhancement_proposals#17

@amueller (Member, Author) commented Apr 5, 2019

How would you implement feature names for FunctionTransformer? Have a feature_names parameter that can be 'passthrough' or a callable?
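A sketch of that suggestion, with a hypothetical feature_names parameter (not an existing FunctionTransformer option):

```python
def resolve_feature_names(input_features, feature_names="passthrough"):
    # Hypothetical resolution logic for a FunctionTransformer-style
    # ``feature_names`` parameter: either the names pass through
    # unchanged, or a callable maps input names to output names.
    if feature_names == "passthrough":
        return list(input_features)
    if callable(feature_names):
        return list(feature_names(input_features))
    raise ValueError("feature_names must be 'passthrough' or a callable")

print(resolve_feature_names(["a", "b"]))
# ['a', 'b']
print(resolve_feature_names(
    ["a", "b"],
    feature_names=lambda names: [f"log({n})" for n in names]))
# ['log(a)', 'log(b)']
```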

@jorisvandenbossche (Member) commented Apr 5, 2019

> @jorisvandenbossche did you post your SLEP draft? Can you make it a PR on the SLEP repo?

Sorry for the slow reply here. The draft I wrote a month ago: https://hackmd.io/mip6r4t6QriuUUtnWTllKw?both#

I will update it one of the coming days with the latest discussions here, and then make a PR on the SLEP repo

@adrinjalali (Member) commented Apr 10, 2019

@jorisvandenbossche I added some notes inside your draft. I think we can already have a SLEP PR based on it to continue the discussion there.

@jorisvandenbossche (Member) commented May 28, 2019

I finally put up a PR with the SLEP text I wrote in March: scikit-learn/enhancement_proposals#18

> I added some notes inside your draft. I think we can already have a SLEP PR based on it to continue the discussion there.

@adrinjalali Thanks! I didn't yet address any of the comments (just cleaned up the version that was in that draft). So could you copy your comments onto the PR?

amueller added 3 commits May 31, 2019
# Conflicts:
#	examples/compose/plot_column_transformer_mixed_types.py
#	sklearn/base.py
#	sklearn/impute.py
#	sklearn/pipeline.py
#	sklearn/tests/test_base.py
#	sklearn/tests/test_pipeline.py
@ajing commented Feb 2, 2020

Any updates on this PR?

@adrinjalali (Member) commented Feb 2, 2020

@ajing the conversation is being continued across a bunch of different PRs, you can follow the conversation mostly here:
scikit-learn/enhancement_proposals#17
scikit-learn/enhancement_proposals#18
scikit-learn/enhancement_proposals#25

(A comment by @thomasjpfan was marked as off-topic.)

@adrinjalali (Member) commented Feb 18, 2020

The main issue with having the pipeline pass around feature names is that the transformers won't have access to the feature names at fit time. An alternative is to pass the feature names as a parameter to fit, which breaks our current API; I guess with @amueller's implementation we decided not to go down that path.

@thomasjpfan (Member) commented Feb 18, 2020

How would having the feature names at fit time benefit a user?

@adrinjalali (Member) commented Feb 18, 2020

It wouldn't with the existing transformers in our lib right now, but it would for cases where feature names are important at fit time in third party transformers. For instance, in the context of fairness, pretty much every transformer needs to know which are the sensitive features.
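As an illustration (a toy third-party transformer, not a real library), dropping sensitive columns requires knowing the feature names during fit, not just afterwards:

```python
class DropSensitive:
    # Toy third-party transformer: it must know *at fit time* which
    # columns are sensitive, so feature names are needed during fit.
    def __init__(self, sensitive=("gender", "age")):
        self.sensitive = set(sensitive)

    def fit(self, X, feature_names):
        # Resolve sensitive column names to indices during fit.
        self.keep_idx_ = [i for i, name in enumerate(feature_names)
                          if name not in self.sensitive]
        return self

    def transform(self, X):
        return [[row[i] for i in self.keep_idx_] for row in X]

t = DropSensitive().fit(None, ["gender", "income", "age", "zipcode"])
print(t.transform([[0, 100, 30, 12345]]))
# [[100, 12345]]
```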

@thomasjpfan (Member) commented Feb 18, 2020

Ah, I placed #13307 (comment) in the wrong issue; I was supposed to put it in the one that defined get_feature_names everywhere: #12627

> It wouldn't with the existing transformers in our lib right now, but it would for cases where feature names are important at fit time in third party transformers. For instance, in the context of fairness, pretty much every transformer needs to know which are the sensitive features.

I think getting the feature names through, and allowing the transformers to use the feature names through DataArray, can be seen as two separate problems. Imagine we have DataArray: how do we get the feature names passed into the classifier in a pipeline? Either we call pipe[:-1].transform(dataarray) and look at the columns, or we store the names in the classifier and inspect them directly. Calling transform seems a bit much just to get the feature names, and storing all the names in the classifier seems a little wasteful. (I am always uneasy with adding more requirements to the API.)
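The two options can be contrasted with toy objects (not scikit-learn's actual Pipeline API): recompute names on demand by walking all but the last step, or store them once on the final estimator at fit time.

```python
class ToyScaler:
    # Toy transformer: names pass through with a prefix.
    def get_feature_names(self, input_features):
        return [f"scaled_{n}" for n in input_features]

class ToySelector:
    # Toy transformer: keeps only the first name.
    def get_feature_names(self, input_features):
        return list(input_features[:1])

# Option 1: recompute on demand by walking all but the last step.
def names_into_classifier(transform_steps, input_names):
    names = list(input_names)
    for step in transform_steps:
        names = step.get_feature_names(names)
    return names

print(names_into_classifier([ToyScaler(), ToySelector()], ["a", "b"]))
# ['scaled_a']

# Option 2: store them once on the final estimator at fit time.
class ToyClassifier:
    pass

clf = ToyClassifier()
clf.input_features_ = names_into_classifier(
    [ToyScaler(), ToySelector()], ["a", "b"])
print(clf.input_features_)
# ['scaled_a']
```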

@amueller amueller mentioned this pull request Feb 18, 2020
@jnothman (Member) commented May 22, 2020

@thomasjpfan (Member) left a comment

Some API thoughts.

>>> pipe.get_feature_names(iris.feature_names)
>>> pipe.named_steps.select.input_features_
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> pipe.named_steps.clf.input_features_
array(['petal length (cm)', 'petal width (cm)'], dtype='<U17')
Comment on lines +162 to +166

@thomasjpfan (Member) commented May 25, 2020 (minimized review comment)

Calling get_feature_names changes the attribute input_features_. Do we have another place where we update an attribute outside of fit?

You can also provide custom feature names for a more human readable format using
``get_feature_names``::

>>> pipe.get_feature_names(iris.feature_names)

@thomasjpfan (Member) commented May 25, 2020 (minimized review comment)

This returning nothing is strange. It's more of a "set_input_feature_names".

8 participants