ENH Adds feature_name_combiner to OneHotEncoder #22506


Merged

Conversation

mariokostelac
Contributor

This change makes sure that OneHotEncoder.get_feature_names_out returns unique feature names, as every transformer should. The approach is different from the one discussed in the issue, erring on the side of working by default rather than forcing the user to provide a custom function that transforms (feature_name, value) into the feature name of the encoded column. It's less flexible, but works without user intervention.

Fixes #22488.

I'm happy to change the approach here, but I'd expect this to be good enough for the majority of use cases.


This also changes the behaviour of the case explained in #16593. Instead of raising on the impossible int + str operation, we cast the feature name to str; we're producing string feature names after all. The original exception was accidental.

I'm not sure whether we should close that other issue because there's a discussion on whether we should raise if column names are integers.

One could argue that we should remove the str(feature_name) piece from this PR.

I think we shouldn't, because this change enables users with integer column names to use OneHotEncoder without renaming fields. I suspect they have integer feature names because they don't care about interpreting them after the pipeline, at least until the expectation of feature names is clearly set.

Reference Issues/PRs

Fixes #22488
See also #16593
Forces a rebase for #21754

@mariokostelac
Contributor Author

There's a failing changelog check. I'm not sure I fully understand the process - the **In Development** section confuses me. Also, I'm not sure whether I have to cut a section for a new version (1.0.3) or put the change under the development section.

Member

@thomasjpfan thomasjpfan left a comment

Thanks for the PR. I still prefer the callable. Having custom suffixing logic feels like a lot to maintain for the edge case of [None, "None"].

# Suffixing logic from the PR (context added): start at suffix 1 and
# increment until the name is not already taken.
suffix_count = 1
name = original_name + "__" + str(suffix_count)
while name in name_counts:
    suffix_count += 1
    name = original_name + "__" + str(suffix_count)
Member

As with any automatic solution, there will be weird edge cases. For example, with this PR:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame([["None"], [None], ["None__1"], ["None__2"]])
enc = OneHotEncoder().fit(df)

print(enc.get_feature_names_out())
# ['x0_None' 'x0_None__1' 'x0_None__2' 'x0_None__3']

It becomes difficult to figure out the correspondence between the categories and the feature names.

If we go with this, we would need to explain the suffixing behavior and how the feature names match up with the categories.

Contributor Author

Totally agree about the edge cases; it'd be very difficult to figure out where a given feature name came from.

The question that comes to mind is which option is preferable:

  • (more forgiving) this implementation as a default solution (always working, but sometimes confusing) + ability to change the naming function
  • (more explicit) raising in case of non-unique feature names + ability to change the naming function.

It also might make sense to change the default function from using str to using repr in a future version.


Which kind of interface did you have in mind for that naming function?
One possible implementation is (feature: str, value: str) -> str.

Member

It would be (feature: Any, category: Any) -> str. Sometimes the category can be ints or None. And the feature can be ints.

The nice thing about the callable is that we can have options denoted as strings. Let's say the parameter is called feature_name_combiner (for the lack of a better name). It can accept a callable to allow users to combine the feature and category themselves. In the future, we can add an option like feature_name_combiner="auto_suffix" to do auto suffixing, which would be a more forgiving setting.
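As a rough sketch (not the actual scikit-learn internals; the helper name and structure here are hypothetical), the proposed (feature: Any, category: Any) -> str interface could look like this:

```python
from typing import Any, Callable, Iterable, List

# Hypothetical sketch of the proposed parameter: a combiner takes
# (feature, category) and returns the output feature name.
def build_feature_names(
    input_features: Iterable[Any],
    categories: Iterable[Iterable[Any]],
    combiner: Callable[[Any, Any], str] = lambda f, c: f"{f}_{c}",
) -> List[str]:
    names = []
    for feature, cats in zip(input_features, categories):
        names.extend(combiner(feature, c) for c in cats)
    return names

# Default concat-style behaviour collides on [None, "None"]:
print(build_feature_names(["x0"], [[None, "None"]]))
# ['x0_None', 'x0_None']
# A user-supplied combiner using repr keeps them distinct:
print(build_feature_names(["x0"], [[None, "None"]],
                          combiner=lambda f, c: f"{f}={c!r}"))
# ['x0=None', "x0='None'"]
```

This also illustrates why a callable handles the [None, "None"] edge case without any built-in suffixing logic.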

Contributor Author

It would be (feature: Any, category: Any) -> str. Sometimes the category can be ints or None. And the feature can be ints.

You're right. I figured out I had written the wrong signature but didn't have access to my computer after that. Hope that didn't confuse you too much.

The nice thing about the callable is that we can have options denoted as strings. Let's say the parameter is called feature_name_combiner (for the lack of a better name). It can accept a callable to allow users to combine the feature and category themselves. In the future, we can add an option like feature_name_combiner="auto_suffix" to do auto suffixing, which would be a more forgiving setting.

That's a good point, that interface would make the configuration easier, and is already present in many transformers.

@@ -692,12 +693,14 @@ def get_feature_names(self, input_features=None):

feature_names = []
for i in range(len(cats)):
names = [input_features[i] + "_" + str(t) for t in cats[i]]
names = [str(input_features[i]) + "_" + str(t) for t in cats[i]]
Member

This is another big issue, where we have not decided what to do. In #21754 (review) we elected to error and give a better error message.

I wouldn't include this fix in this PR as it increases the scope and focus on the None issue.

@thomasjpfan
Member

There's a failing changelog check. I'm not sure I fully understand the process - the In Development section confuses me. Also, I'm not sure whether I have to cut a section for a new version (1.0.3) or put the change under the development section.

The changelog check is asking for a changelog entry in doc/whats_new/v1.1.rst in the sklearn.preprocessing section. If you scroll down you should see the sklearn.preprocessing section of the changelog.

@mariokostelac mariokostelac force-pushed the mario/suffix_ohe_feature_names branch 2 times, most recently from 64e212b to e7cc82b Compare February 17, 2022 12:24
@mariokostelac
Contributor Author

@thomasjpfan I think I'll need help running the failing tests in my local env. How do I run these doctests locally?

@mariokostelac mariokostelac force-pushed the mario/suffix_ohe_feature_names branch from e7cc82b to d8a7eda Compare February 17, 2022 12:52
@mariokostelac
Contributor Author

@thomasjpfan this is ready for another review. I think I've included all the changes we've discussed.

Member

@thomasjpfan thomasjpfan left a comment

Thanks for the update!

I am not sold on the feature_name_combiner name yet. I am open to exploring better names.

if self.drop_idx_ is not None and self.drop_idx_[i] is not None:
names.pop(self.drop_idx_[i])
feature_names.extend(names)
feature_names = self._check_feature_names_uniqueness(feature_names)
Member

I do not think we currently enforce uniqueness. For example, StandardScaler is basically a no-op, returning the input features without checking for uniqueness. We would also add a little more overhead with uniqueness checking.

In this case, OneHotEncoder will not have unique names for the edge case of [None, "None"]. OneHotEncoder does not usually accept multiple dtypes:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()

# Fails
enc.fit_transform([[1], ["1"]])

I think we can reduce the scope of this PR by only including the callable support. We can follow up with a PR on uniqueness checking.

Contributor Author

I do not think we currently enforce uniqueness. For example, StandardScaler is basically a no-op, returning the input features without checking for uniqueness.

You're right, uniqueness is not currently enforced; I think it should be, though. Otherwise some later part of the pipeline might fail (like it did for me in the linked issue) and users have to go through a debugging process to figure out that the real cause is actually OneHotEncoder.

We also add a little more overhead for uniqueness checking.

This is a valid point. How latency sensitive do you think get_feature_names_out is?

I think we can reduce the scope of this PR by only including the callable support. We can follow up with a PR on uniqueness checking.

I can certainly do that if you think that there is another discussion needed around how we implement uniqueness check.
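For reference, the kind of uniqueness check under discussion could be sketched as follows (the helper name is made up; this is not the PR's actual implementation):

```python
from collections import Counter

# Hypothetical sketch: fail fast with the duplicated names, instead of
# letting a later pipeline step (e.g. ColumnTransformer) error out.
def check_feature_names_uniqueness(names):
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        raise ValueError(f"Output feature names are not unique: {duplicates}")
    return names

check_feature_names_uniqueness(["x0_a", "x0_b"])  # passes
# check_feature_names_uniqueness(["x0_None", "x0_None"]) would raise ValueError
```

The Counter pass is linear in the number of names, which is the small overhead mentioned above.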

Member

This is a valid point. How latency sensitive do you think get_feature_names_out is?

It will matter a bit when we want to support pandas output in the future, which would call get_feature_names_out in all transformers.

I can certainly do that if you think that there is another discussion needed around how we implement uniqueness check.

It's worth opening an issue and discussing. The SLEP007 does not say anything about uniqueness checking right now. I think there are valid arguments for uniqueness checking, but I would not want that discussion to block this PR. Supporting a callable is already a net improvement.

Contributor Author

It's worth opening an issue and discussing. The SLEP007 does not say anything about uniqueness checking right now. I think there are valid arguments for uniqueness checking, but I would not want that discussion to block this PR. Supporting a callable is already a net improvement.

Thanks for the additional context, I wasn't aware of SLEPs. If there's no policy around unique feature names, the question is really why ColumnTransformer checks that, but I guess that's a discussion for the original issue #22488.

@mariokostelac
Contributor Author

I am not sold on the feature_name_combiner name yet. I am open to exploring better names.

An alternative name I had in mind was "feature_names_schema". It looks OK for string references like "append-string", but less so when you're passing in a lambda. Can we consider a function a schema? Maybe yes 🤷

@mariokostelac mariokostelac force-pushed the mario/suffix_ohe_feature_names branch from fad2234 to 361f60f Compare February 23, 2022 07:57
mario-at-intercom and others added 3 commits February 23, 2022 08:58
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@mariokostelac
Contributor Author

@thomasjpfan I've applied the changes you suggested. What do you say about changing feature_name_combiner to feature_name_schema?

Also, it'd be good to know why you're not sold on feature_name_combiner - might help with getting a better name.

@thomasjpfan
Member

thomasjpfan commented Feb 23, 2022

Also, it'd be good to know why you're not sold on feature_name_combiner - might help with getting a better name.

I am usually not happy with the first name I come up with. Also, I do not think "combiner" is an English word. Maybe feature_name_merger?

Edit: Combiner is a word.

@mariokostelac
Contributor Author

mariokostelac commented Feb 23, 2022

Both work for me. I think combiner is a word, but maybe not as common.

Merger is certainly a word, but it usually refers to the act of merging rather than an agent that merges something.

You don't like the feature_name_schema as a name?

@thomasjpfan
Member

Let's stick with combiner for now. It's likely good enough.

@mariokostelac
Contributor Author

Is there anything else I need to do before it's ready to get merged?

elif callable(self.feature_name_combiner):
return self.feature_name_combiner
else:
raise ValueError(
Member

Also we need to check that this error is raised correctly for invalid feature_name_combiners.

Comment on lines 387 to 389
if isinstance(
self.feature_name_combiner, str
) and self.feature_name_combiner not in ("concat_string"):
Member

We can call _get_feature_name_combiner here to do the error checking and ignore the return value.

mario-at-intercom and others added 2 commits February 24, 2022 07:53
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@mariokostelac mariokostelac force-pushed the mario/suffix_ohe_feature_names branch from 8f1024d to d83b03b Compare June 30, 2022 10:46
@@ -476,6 +496,34 @@ def infrequent_categories_(self):
for category, indices in zip(self.categories_, infrequent_indices)
]

def _validate_keywords(self):
Contributor Author

@glemaitre My understanding is that we don't need this one anymore since the introduction of parameter constraints. Should I just remove it?

Member

@glemaitre glemaitre Jun 30, 2022

Yes, this is taken care of by the _parameter_constraints indeed.

@mariokostelac
Contributor Author

mariokostelac commented Jun 30, 2022

I think I've got the code in a fine place, but I don't really understand the errors around documentation. I'd appreciate it if somebody could help with it.

Update: I got it working.

@@ -322,6 +322,16 @@ class OneHotEncoder(_BaseEncoder):
.. versionadded:: 1.1
Read more in the :ref:`User Guide <one_hot_encoder_infrequent_categories>`.

feature_name_combiner : "concat_string" or callable, default="concat_string"
Member

@thomasjpfan thomasjpfan Jun 30, 2022

The error is for the word "string". I think the semantics are clear enough to use "concat" here instead of "concat_string".

Contributor Author

When you say error, do you mean the literal error the CI is raising?

Member

When you say error, do you mean the literal error the CI is raising?

Yes, I am literally speaking about the CI error. The CI error is a false positive. The numpydoc check sees the word "string" and fails, thinking it should be "str".

:meth:`get_feature_names_out`.

`"concat_string"` concatenates the encoded feature name and category, separated by
'_'. E.g. feature X with values 1, 6, 7 creates feature names `X_1, X_6, X_7`.
Member

I think we can extend this out with a code snippet and state "concat" does: feature + "_" + str(category). This way a reader can see that "feature" does not call str.
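A snippet along those lines (illustrative, mirroring the suggested docstring wording) makes the asymmetry visible:

```python
# The "concat" behaviour as described above: str() is applied to the
# category, but not to the feature name, so an int feature name fails.
def concat(feature, category):
    return feature + "_" + str(category)

print(concat("X", 1))     # 'X_1'
print(concat("X", None))  # 'X_None'
# concat(0, 1) would raise TypeError, since the feature name
# is not cast to str.
```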

@mariokostelac
Contributor Author

@thomasjpfan It's finally green 😁

@mariokostelac
Contributor Author

Hey @glemaitre, I think I've addressed all the concerns from your review, and got the PR to the green state.

@glemaitre glemaitre self-requested a review July 20, 2022 12:06
@cmarmo
Contributor

cmarmo commented Oct 28, 2022

Hi @mariokostelac, we have a new workflow trying to push forward already-approved PRs. Do you mind fixing the conflicts, if you are still interested in working on this? Then I will apply a "waiting for second review" label, hoping this will motivate the reviewers.
Thanks for your work so far!

@glemaitre glemaitre removed their assignment Nov 24, 2022
Member

@glemaitre glemaitre left a comment

LGTM

@glemaitre glemaitre merged commit 3bc36e0 into scikit-learn:main Jan 12, 2023
@glemaitre
Member

Thanks @mariokostelac, merging.

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Jan 12, 2023
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
Successfully merging this pull request may close these issues.

OneHotEncoder creates non unique column names, when values None (NoneType) and "None" (str) are both present