Feature names with input features #13307
Conversation
# Conflicts: # sklearn/base.py # sklearn/impute.py # sklearn/preprocessing/data.py
|
repeating our discussion: it'd be nice if we had the pipeline set the feature names after each individual fit instead of at the end. |
|
@adrinjalali can you give an example with sklearn steps that would benefit from that? |
|
I guess it's @jnothman's example of preprocessing different features created by a count vectorizer depending on the words. But I think we decided we don't want to support that, right? |
… into get_input_features
| 'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable', | ||
| 'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson', | ||
| 'title_bow__wrath'] | ||
| ['city_category__city_London', 'city_category__city_Paris', 'city_category__city_Sallisaw', |
amueller
Feb 27, 2019
Author
Member
This is somewhat backwards-incompatible but maybe not that bad?
This is somewhat backwards-incompatible but maybe not that bad?
GaelVaroquaux
Mar 1, 2019
Member
If we rename get_feature_names, we no longer have the problem of backward incompatibility.
If we rename get_feature_names, we no longer have the problem of backward incompatibility.
|
I put a draft for a feature name SLEP here: scikit-learn/enhancement_proposals#17 |
|
How would you implement feature names for |
Sorry for the slow reply here. The draft I wrote a month ago: https://hackmd.io/mip6r4t6QriuUUtnWTllKw?both# I will update it one of the coming days with the latest discussions here, and then make a PR on the SLEP repo |
|
@jorisvandenbossche I added some notes inside your draft. I think we can already have a SLEP PR based on it to continue the discussion there. |
|
I finally put up a PR with the SLEP text I wrote in March: scikit-learn/enhancement_proposals#18
@adrinjalali Thanks! I didn't yet address any of the comments (just cleaned up the version that was in that draft). So could you copy your comments onto the PR? |
# Conflicts: # examples/compose/plot_column_transformer_mixed_types.py # sklearn/base.py # sklearn/impute.py # sklearn/pipeline.py # sklearn/tests/test_base.py # sklearn/tests/test_pipeline.py
|
Any updates on this PR? |
|
@ajing the conversation is being continued across a bunch of different PRs, you can follow the conversation mostly here: |
This comment was marked as off-topic.
This comment was marked as off-topic.
|
The main issue with having the pipeline passing around feature names, is that the transformers won't have access to the feature names at |
|
How would having the feature names at fit time benefit a user? |
|
It wouldn't with the existing transformers in our lib right now, but it would for cases where feature names are important at fit time in third party transformers. For instance, in the context of fairness, pretty much every transformer needs to know which are the sensitive features. |
|
Ah I have placed #13307 (comment) in the wrong issue, I was suppose to put this one the get that defined
I think getting the feature names through and allowing the transformers to use the feature names through |
|
Any estimator dealing with heterogeneous features should have access to
feature names / descriptors / annotations. There is no reason to believe
that all annotations can be determined from dtype or distribution within a
sample.
|
|
Some API thoughts. |
| >>> pipe.get_feature_names(iris.feature_names) | ||
| >>> pipe.named_steps.select.input_features_ | ||
| ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] | ||
| >>> pipe.named_steps.clf.input_features_ | ||
| array(['petal length (cm)', 'petal width (cm)'], dtype='<U17') |
thomasjpfan
May 25, 2020
Member
Calling get_feature_names changes the attribute input_features_. Do we have another place where we update an attribute outside of fit?
Calling get_feature_names changes the attribute input_features_. Do we have another place where we update an attribute outside of fit?
| You can also provide custom feature names for a more human readable format using | ||
| ``get_feature_names``:: | ||
|
|
||
| >>> pipe.get_feature_names(iris.feature_names) |
thomasjpfan
May 25, 2020
Member
This returning nothing is strange. It's more of "set_input_feature_names".
This returning nothing is strange. It's more of "set_input_feature_names".


Build on top of #12627 but provides users with a nicer interface:
Again using this example:
https://scikit-learn.org/dev/auto_examples/compose/plot_column_transformer_mixed_types.html
We now have:
or alternatively:
Transformers not implementing get_feature_names after this PR are:
['AdditiveChi2Sampler', 'FunctionTransformer', 'Imputer' (deprecated one), 'IterativeImputer', 'KBinsDiscretizer', 'KernelCenterer', 'KernelPCA', 'MissingIndicator']
possibly others that I can't test, like TfidfTransformer.
One issue I haven't thought about enough is naming for transformers like pca.
Producing "pca0, pca1" etc is good in a pipeline but in a ColumnTransformer it will probably lead to redundant names.
Currently suggested names:
feature_names_in_,feature_names_out_(?),update_feature_names(feature_names_in)(or should it be input_feature_names here? probably better be consistent, right?)