[MRG] ENH Add get_feature_names for various transformers #6431
Conversation
|
@jnothman Sure! |
|
Hello @jnothman and @nelson-liu , |
| """ | ||
| if input_features is not None: | ||
| input_features = list(map(lambda input_feature: 'f(' + |
kmike
Sep 12, 2016
Contributor
What's the reason function name is not used here (self.func.__name__)?
What's the reason function name is not used here (self.func.__name__)?
jnothman
Sep 12, 2016
Member
I don't consider it very stable...?
I don't consider it very stable...?
kmike
Sep 12, 2016
Contributor
Right, there can be lambdas, etc, but that's still the best information available. It allow to e.g. get a more inspectable feature names if several FunctionTransformers are used in a FeatureUnion. Is this use case uncommon?
Right, there can be lambdas, etc, but that's still the best information available. It allow to e.g. get a more inspectable feature names if several FunctionTransformers are used in a FeatureUnion. Is this use case uncommon?
kmike
Nov 15, 2016
Contributor
To clarify, repr is unstable, but func.__name__ is more deterministic:
>>> def foo(x): pass
>>> repr(foo)
'<function foo at 0x106683950>'
>>> foo.__name__
'foo'
To clarify, repr is unstable, but func.__name__ is more deterministic:
>>> def foo(x): pass
>>> repr(foo)
'<function foo at 0x106683950>'
>>> foo.__name__
'foo'
jnothman
Nov 15, 2016
Member
We've got to balance a few things here:
- a machine-readable description of a feature's provenance (ideally a structured representation)
- a human readable (in limited memory!) description of a feature as distinct from a neighbour
- backwards compatibility issues (although I am willing to label some of these things as experimental and subject to change)
We've got to balance a few things here:
- a machine-readable description of a feature's provenance (ideally a structured representation)
- a human readable (in limited memory!) description of a feature as distinct from a neighbour
- backwards compatibility issues (although I am willing to label some of these things as experimental and subject to change)
kmike
Nov 16, 2016
Contributor
Yeah, good points.
a machine-readable description of a feature's provenance (ideally a structured representation)
Are you saying that string formatting is not powerful enough for these use cases, and there is a need for more structured representation (a dict, a special object)? What additional information are you thinking of?
a human readable (in limited memory!) description of a feature as distinct from a neighbour
In eli5 library instead of feature name lists we used a special object which generates feature names on a fly if they are autogenerated ('pca__x0' as opposed to 'petal width'), in order to avoid keeping it all in memory, and to save CPU time when only a subset of feature names is needed. For example, a common task is to get feature names which correspond to non-zero coefficients in a particular document, or feature names for top N features with largest coefficients. CPU was even more important than memory because there were noticeable delays in interactive usage.
Yeah, good points.
a machine-readable description of a feature's provenance (ideally a structured representation)
Are you saying that string formatting is not powerful enough for these use cases, and there is a need for more structured representation (a dict, a special object)? What additional information are you thinking of?
a human readable (in limited memory!) description of a feature as distinct from a neighbour
In eli5 library instead of feature name lists we used a special object which generates feature names on a fly if they are autogenerated ('pca__x0' as opposed to 'petal width'), in order to avoid keeping it all in memory, and to save CPU time when only a subset of feature names is needed. For example, a common task is to get feature names which correspond to non-zero coefficients in a particular document, or feature names for top N features with largest coefficients. CPU was even more important than memory because there were noticeable delays in interactive usage.
|
Yes, I can see the appeal in that approach, but still don't have a clear I can also see that __ might be an appropriate way to designate function On 16 November 2016 at 19:01, Mikhail Korobov notifications@github.com
|
| output_feature_names : list of string, length len(input_features) | ||
| """ | ||
| if input_features is not None: |
jnothman
Jan 18, 2017
Member
I think this logic is incorrect. func is not applied to each feature independently, but to X as a whole. The only way to get sensible feature names out of a FunctionTransformer is to have another parameter specify get_feature_names, essentially, or else to count the number of columns output from transform and make up default names on that basis.
I think this logic is incorrect. func is not applied to each feature independently, but to X as a whole. The only way to get sensible feature names out of a FunctionTransformer is to have another parameter specify get_feature_names, essentially, or else to count the number of columns output from transform and make up default names on that basis.
kmike
Jan 18, 2017
Contributor
Do you think a string template (passed to FunctionTransformer constructor?) is enough?
Do you think a string template (passed to FunctionTransformer constructor?) is enough?
jnothman
Jan 18, 2017
•
Member
No, something like FunctionTransformer(lambda X_df: X_df['col_a', 'col_b']) should have feature names ['col_a', 'col_b'].
No, something like FunctionTransformer(lambda X_df: X_df['col_a', 'col_b']) should have feature names ['col_a', 'col_b'].
| output_feature_names : list of string, length len(input_features) | ||
| """ | ||
| return input_features |
lucasdavid
Dec 12, 2017
Contributor
What if some of the features had only nan in them? Shouldn't this method clip them from input_features, as Imputer does to the column itself?
What if some of the features had only nan in them? Shouldn't this method clip them from input_features, as Imputer does to the column itself?


This is a PR for #6425 .
I've added
get_feature_namesfor scalers, normalizers and imputers.Is the purpose of this function to maintain compatibility when
get_feature_names()is implemented in Pipeline?Can @jnothman please have a look?