ENH Adds feature_name_combiner to OneHotEncoder #22506
Conversation
This change makes sure that OneHotEncoder.get_feature_names_out returns unique feature names, as every transformer should do. The approach is different from the one discussed in the issue, erring on the side of working by default rather than forcing the user to provide a custom function that transforms (feature_name, value) into the feature name of the encoded column. It fixes scikit-learn#22488. I'm happy to change the approach here, but I'd expect this to be good enough for the majority of use cases.

This also changes the behaviour of the case explained in scikit-learn#16593. Instead of raising by accident on an impossible operation between str and int, we cast the feature name to str, because we're producing strings at the end after all. The exception was accidental. I'm not sure whether we should close that other issue, because there's a discussion on whether we should raise if column names are integers.

One could argue that we should remove the str(feature_name) piece from this PR. I think we shouldn't, because this change enables users to use OneHotEncoder without renaming fields. I suspect they have integer feature names because they don't care about interpreting them after the pipeline, at least until the expectation of feature names is clearly set.
There's a failing changelog check. I'm not sure I fully understand the process.
Thanks for the PR. I still prefer the callable. Having custom suffixing logic feels like a lot to maintain for the edge case of [None, "None"].
sklearn/preprocessing/_encoders.py
Outdated
name = original_name + "__" + str(suffix_count)
while name in name_counts:
    suffix_count += 1
    name = original_name + "__" + str(suffix_count)
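For context, a minimal self-contained sketch of the deduplication this fragment implements; make_unique and its internals are illustrative names, not the PR's actual helper:

def make_unique(names):
    # Probe upwards with "__<n>" suffixes until the candidate name is free.
    seen = set()
    unique_names = []
    for original_name in names:
        name = original_name
        suffix_count = 0
        while name in seen:
            suffix_count += 1
            name = original_name + "__" + str(suffix_count)
        seen.add(name)
        unique_names.append(name)
    return unique_names

print(make_unique(["x0_None", "x0_None__1", "x0_None__2", "x0_None"]))
# ['x0_None', 'x0_None__1', 'x0_None__2', 'x0_None__3']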
As with any automatic solution, there will be weird edge cases. For example, with this PR:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
df = pd.DataFrame([["None"], [None], ["None__1"], ["None__2"]])
enc = OneHotEncoder().fit(df)
print(enc.get_feature_names_out())
# ['x0_None' 'x0_None__1' 'x0_None__2' 'x0_None__3']
It becomes difficult to figure out the correspondence between the categories and the feature names.
If we go with this, we would need to explain the suffixing behavior and how the feature names match up with the categories.
Totally agree about the edge cases; it'd be very difficult to figure out where a given name came from.
The question that comes to mind is which option is preferable:
- (more forgiving) this implementation as a default solution (always working, but sometimes confusing) + ability to change the naming function
- (more explicit) raising in case of non-unique feature names + ability to change the naming function.
It also might make sense to change the default function from using str to using repr in some future version.
Which kind of interface did you have in mind for that naming function? One possible signature is (feature: str, value: str) -> str.
It would be (feature: Any, category: Any) -> str. Sometimes the category can be ints or None. And the feature can be ints.
The nice thing about the callable is that we can have options denoted as strings. Let's say the parameter is called feature_name_combiner (for the lack of a better name). It can accept a callable to allow users to combine the feature and category themselves. In the future, we can add an option like feature_name_combiner="auto_suffix" to do auto suffixing, which would be a more forgiving setting.
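To make the proposed interface concrete, here is a sketch of how the callable could look from the user's side (hypothetical usage of the parameter discussed above, not yet merged at this point):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# The callable receives the raw feature name and category (either may be
# a non-str such as an int or None) and must return a str.
def combiner(feature, category):
    return f"{feature}={category}"

df = pd.DataFrame({"color": ["red", "blue", "green"]})
enc = OneHotEncoder(feature_name_combiner=combiner).fit(df)
print(enc.get_feature_names_out())
# ['color=blue' 'color=green' 'color=red']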
It would be (feature: Any, category: Any) -> str. Sometimes the category can be ints or None. And the feature can be ints.
You're right. I figured out I had written the wrong signature, but didn't have access to my computer after that. Hope that didn't confuse you too much.
The nice thing about the callable is that we can have options denoted as strings. Let's say the parameter is called feature_name_combiner (for the lack of a better name). It can accept a callable to allow users to combine the feature and category themselves. In the future, we can add an option like feature_name_combiner="auto_suffix" to do auto suffixing, which would be a more forgiving setting.
That's a good point, that interface would make the configuration easier, and is already present in many transformers.
sklearn/preprocessing/_encoders.py
Outdated
@@ -692,12 +693,14 @@ def get_feature_names(self, input_features=None):
     feature_names = []
     for i in range(len(cats)):
-        names = [input_features[i] + "_" + str(t) for t in cats[i]]
+        names = [str(input_features[i]) + "_" + str(t) for t in cats[i]]
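To illustrate the one-line change: the old code concatenated the feature name directly, which raises for non-str names, while the new code casts it first (a trivial sketch of the behaviour, not library output):

feature_name, category = 0, "a"  # integer column name, as in #16593
# old: feature_name + "_" + str(category)
#      -> TypeError: unsupported operand type(s) for +: 'int' and 'str'
new = str(feature_name) + "_" + str(category)  # -> "0_a"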
This is another big issue, where we have not decided what to do. In #21754 (review) we elected to error and give a better error message. I wouldn't include this fix in this PR, as it increases the scope; better to keep the focus on the None issue.
The changelog check is asking for a changelog entry.
@thomasjpfan I think I'll need help running the failing tests in a local env. How do I run these doctests locally?
@thomasjpfan this is ready for another review. I think I've included all the changes we've discussed.
Thanks for the update!

I am not sold on the feature_name_combiner name yet. I am open to exploring better names.
sklearn/preprocessing/_encoders.py
Outdated
if self.drop_idx_ is not None and self.drop_idx_[i] is not None:
    names.pop(self.drop_idx_[i])
feature_names.extend(names)
feature_names = self._check_feature_names_uniqueness(feature_names)
I do not think we currently enforce uniqueness. For example, StandardScaler is basically a noop returning the input features without checking for uniqueness. We also add a little more overhead for uniqueness checking.

In this case, OneHotEncoder will not have unique names for the edge case of [None, "None"]. OneHotEncoder does not usually accept multiple dtypes:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
# Fails
enc.fit_transform([[1], ["1"]])
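By contrast, a single object-dtype column holding both None and "None" does fit, so that is the one case where the default naming collides (a hedged illustration, assuming the default str-based naming without deduplication):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder().fit(pd.DataFrame([["None"], [None]]))
print(enc.get_feature_names_out())
# ['x0_None' 'x0_None']  <- two categories, one duplicated feature name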
I think we can reduce the scope of this PR by only including the callable support. We can follow up with a PR on uniqueness checking.
I do not think we currently enforce uniqueness. For example, StandardScaler is basically a noop returning the input features without checking for uniqueness.

You're right, uniqueness is not currently enforced. I think it should be, though. Otherwise some later part of the pipeline might fail (like it did for me in the linked issue) and users have to go through a debugging process to figure out that the real cause is actually OneHotEncoder.
We also add a little more overhead for uniqueness checking.
This is a valid point. How latency-sensitive do you think get_feature_names_out is?
I think we can reduce the scope of this PR by only including the callable support. We can follow up with a PR on uniqueness checking.
I can certainly do that if you think that another discussion is needed around how we implement the uniqueness check.
This is a valid point. How latency-sensitive do you think get_feature_names_out is?

It will matter a bit when we want to support pandas output in the future, which would call get_feature_names_out in all transformers.
I can certainly do that if you think that there is another discussion needed around how we implement uniqueness check.
It's worth opening an issue and discussing. SLEP007 does not say anything about uniqueness checking right now. I think there are valid arguments for uniqueness checking, but I would not want that discussion to block this PR. Supporting a callable is already a net improvement.
It's worth opening an issue and discussing. SLEP007 does not say anything about uniqueness checking right now. I think there are valid arguments for uniqueness checking, but I would not want that discussion to block this PR. Supporting a callable is already a net improvement.
Thanks for the additional context, I wasn't aware of SLEPs. If there's no policy around unique feature names, the question is really why ColumnTransformer checks that, but I guess that's a discussion for the original issue #22488.
An alternative name I had in mind was "feature_names_schema". It looks OK for string references like "append-string", but less good when you're passing a lambda in. Can we consider a function a schema? Maybe yes 🤷
@thomasjpfan I've applied the changes you've suggested. What do you say about changing feature_name_combiner to the alternative name above? Also, it'd be good to know why you're not sold on feature_name_combiner.
I am usually not happy with the first name I come up with. Edit: Combiner is a word.
Let's stick with combiner for now. It's likely good enough.
Is there anything else I need to do before it's ready to get merged?
sklearn/preprocessing/_encoders.py
Outdated
elif callable(self.feature_name_combiner):
    return self.feature_name_combiner
else:
    raise ValueError(
Also we need to check that this error is raised correctly for invalid feature_name_combiners.
sklearn/preprocessing/_encoders.py
Outdated
if isinstance(
    self.feature_name_combiner, str
) and self.feature_name_combiner not in ("concat_string"):
We can call _get_feature_name_combiner here to do the error checking and not use the return.
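In other words, something like the following (a hypothetical, self-contained sketch of the suggested pattern; EncoderSketch is a stand-in, not the merged OneHotEncoder code):

class EncoderSketch:
    def __init__(self, feature_name_combiner="concat_string"):
        self.feature_name_combiner = feature_name_combiner

    def _get_feature_name_combiner(self):
        if self.feature_name_combiner == "concat_string":
            return lambda feature, category: str(feature) + "_" + str(category)
        elif callable(self.feature_name_combiner):
            return self.feature_name_combiner
        else:
            raise ValueError(
                f"Invalid feature_name_combiner: {self.feature_name_combiner!r}"
            )

    def fit(self):
        # Validate eagerly in fit, discarding the return value.
        self._get_feature_name_combiner()
        return self

EncoderSketch(feature_name_combiner=42).fit()  # -> ValueError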
sklearn/preprocessing/_encoders.py
Outdated
@@ -476,6 +496,34 @@ def infrequent_categories_(self):
            for category, indices in zip(self.categories_, infrequent_indices)
        ]

    def _validate_keywords(self):
@glemaitre My understanding is that we don't need this one anymore since the introduction of parameter constraints. Should I just remove it?
Yes, this is taken care of by the _parameter_constraints indeed.
I think I've got the code in a fine place, but I don't really understand the errors around documentation. I'd appreciate it if somebody could help with it. Update: I got it working.
sklearn/preprocessing/_encoders.py
Outdated
@@ -322,6 +322,16 @@ class OneHotEncoder(_BaseEncoder):
        .. versionadded:: 1.1
        Read more in the :ref:`User Guide <one_hot_encoder_infrequent_categories>`.

    feature_name_combiner : "concat_string" or callable, default="concat_string"
The error is for the word "string". I think the semantics are clear enough to use "concat" here instead of "concat_string".
When you say error, do you mean the literal error CI is raising?
When you say error, do you mean the literal error CI is raising?

Yes, I am literally speaking about the CI error. The CI error is a false positive. The numpydoc test sees the word "string" and fails, thinking it should be "str".
sklearn/preprocessing/_encoders.py
Outdated
:meth:`get_feature_names_out`.

`"concat_string"` concatenates encoded feature name and category separated by
'_'. E.g. feature X with values 1, 6, 7 create feature names `X_1, X_6, X_7`.
I think we can extend this out with a code snippet and state that "concat" does: feature + "_" + str(category). This way a reader can see that "feature" does not call str.
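The suggested docstring snippet might look something like this (a sketch of the proposed wording, not the merged text):

# "concat" behaves like this combiner; note that only the category goes
# through str(), while the feature name is used as-is:
def concat_combiner(feature, category):
    return feature + "_" + str(category)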
@thomasjpfan It's finally green 😁
Hey @glemaitre, I think I've addressed all the concerns from your review, and got the PR to the green state.
Hi @mariokostelac, we have a new workflow trying to push forward already-approved PRs. Do you mind fixing the conflicts, if you are still interested in working on this? Then I will apply a "waiting for second review" label, hoping this will motivate the reviewers.
LGTM
Thanks @mariokostelac, merging.
Reference Issues/PRs
Fixes #22488
See also #16593
Forces a rebase for #21754