ENH Adds feature_name_combiner to OneHotEncoder #22506
Conversation
This change makes sure that OneHotEncoder.get_feature_names_out returns unique feature names, as every transformer should do. The approach is different from the one discussed in the issue, erring on the side of working by default rather than forcing the user to provide a custom function that transforms (feature_name, value) into the feature name of the encoded column. It fixes scikit-learn#22488. I'm happy to change the approach here, but I'd expect this to be good enough for the majority of use cases.

This also changes the behaviour of the case explained in scikit-learn#16593. Instead of raising by accident on an impossible operation between str and int, we cast the feature name to str, because we're producing strings at the end after all. The exception was accidental. I'm not sure whether we should close that other issue, because there's a discussion on whether we should raise if column names are integers.

One could argue that we should remove the str(feature_name) piece from this PR. I think we shouldn't, because this change enables users to use OneHotEncoder without renaming fields. I suspect they have integer feature names because they don't care about interpreting them after the pipeline, at least until the expectation of feature names is clearly set.
There's a failing changelog check. I'm not sure I fully understand the process.
Thanks for the PR. I still prefer the callable. Having custom suffixing logic feels like a lot to maintain for the edge case of [None, "None"].
sklearn/preprocessing/_encoders.py
Outdated
name = original_name + "__" + str(suffix_count)
while name in name_counts:
    suffix_count += 1
    name = original_name + "__" + str(suffix_count)
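For context, a minimal self-contained sketch of the deduplication this fragment implements; make_unique and its internals are illustrative names, not the PR's actual helper:

def make_unique(names):
    # Probe upwards with "__<n>" suffixes until the candidate name is free.
    seen = set()
    unique_names = []
    for original_name in names:
        name = original_name
        suffix_count = 0
        while name in seen:
            suffix_count += 1
            name = original_name + "__" + str(suffix_count)
        seen.add(name)
        unique_names.append(name)
    return unique_names

print(make_unique(["x0_None", "x0_None__1", "x0_None__2", "x0_None"]))
# ['x0_None', 'x0_None__1', 'x0_None__2', 'x0_None__3']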
As with any automatic solution, there will be weird edge cases. For example, with this PR:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
df = pd.DataFrame([["None"], [None], ["None__1"], ["None__2"]])
enc = OneHotEncoder().fit(df)
print(enc.get_feature_names_out())
# ['x0_None' 'x0_None__1' 'x0_None__2' 'x0_None__3']
It becomes difficult to figure out the correspondence between the categories and the feature names.
If we go with this, we would need to explain the suffixing behavior and how the feature names match up with the categories.
Totally agree about the edge cases; it'd be very difficult to figure out where a given name came from.
The question that comes to mind is which option is preferable:
- (more forgiving) this implementation as a default solution (always working, but sometimes confusing) + ability to change the naming function
- (more explicit) raising in case of non-unique feature names + ability to change the naming function.
It also might make sense to change the default function from using str to using repr in some future version.
Which kind of interface did you have in mind for that naming function? One possible signature is (feature: str, value: str) -> str.
It would be (feature: Any, category: Any) -> str. Sometimes the category can be ints or None. And the feature can be ints.
The nice thing about the callable is that we can have options denoted as strings. Let's say the parameter is called feature_name_combiner (for the lack of a better name). It can accept a callable to allow users to combine the feature and category themselves. In the future, we can add an option like feature_name_combiner="auto_suffix" to do auto suffixing, which would be a more forgiving setting.
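To make the proposed interface concrete, here is a sketch of how the callable could look from the user's side (hypothetical usage of the parameter discussed above, not yet merged at this point):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# The callable receives the raw feature name and category (either may be
# a non-str such as an int or None) and must return a str.
def combiner(feature, category):
    return f"{feature}={category}"

df = pd.DataFrame({"color": ["red", "blue", "green"]})
enc = OneHotEncoder(feature_name_combiner=combiner).fit(df)
print(enc.get_feature_names_out())
# ['color=blue' 'color=green' 'color=red']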
It would be (feature: Any, category: Any) -> str. Sometimes the category can be ints or None. And the feature can be ints.
You're right. I figured out I had written the wrong signature, but didn't have access to my computer after that. Hope that didn't confuse you too much.
The nice thing about the callable is that we can have options denoted as strings. Let's say the parameter is called feature_name_combiner (for the lack of a better name). It can accept a callable to allow users to combine the feature and category themselves. In the future, we can add an option like feature_name_combiner="auto_suffix" to do auto suffixing, which would be a more forgiving setting.
That's a good point, that interface would make the configuration easier, and is already present in many transformers.
sklearn/preprocessing/_encoders.py
Outdated
@@ -692,12 +693,14 @@ def get_feature_names(self, input_features=None):
     feature_names = []
     for i in range(len(cats)):
-        names = [input_features[i] + "_" + str(t) for t in cats[i]]
+        names = [str(input_features[i]) + "_" + str(t) for t in cats[i]]
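To illustrate the one-line change: the old code concatenated the feature name directly, which raises for non-str names, while the new code casts it first (a trivial sketch of the behaviour, not library output):

feature_name, category = 0, "a"  # integer column name, as in #16593
# old: feature_name + "_" + str(category)
#      -> TypeError: unsupported operand type(s) for +: 'int' and 'str'
new = str(feature_name) + "_" + str(category)  # -> "0_a"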
This is another big issue, where we have not decided what to do. In #21754 (review) we elected to error and give a better error message. I wouldn't include this fix in this PR, as it increases the scope; better to keep the focus on the None issue.
The changelog check is asking for a changelog entry.
@thomasjpfan I think I'll need help running the failing tests in a local env. How do I run these doctests locally?
@thomasjpfan this is ready for another review. I think I've included all the changes we've discussed.
Thanks for the update!

I am not sold on the feature_name_combiner name yet. I am open to exploring better names.
sklearn/preprocessing/_encoders.py
Outdated
if self.drop_idx_ is not None and self.drop_idx_[i] is not None:
    names.pop(self.drop_idx_[i])
feature_names.extend(names)
feature_names = self._check_feature_names_uniqueness(feature_names)
I do not think we currently enforce uniqueness. For example, StandardScaler is basically a noop returning the input features without checking for uniqueness. We also add a little more overhead for uniqueness checking.

In this case, OneHotEncoder will not have unique names for the edge case of [None, "None"]. OneHotEncoder does not usually accept multiple dtypes:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
# Fails
enc.fit_transform([[1], ["1"]])
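By contrast, a single object-dtype column holding both None and "None" does fit, so that is the one case where the default naming collides (a hedged illustration, assuming the default str-based naming without deduplication):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder().fit(pd.DataFrame([["None"], [None]]))
print(enc.get_feature_names_out())
# ['x0_None' 'x0_None']  <- two categories, one duplicated feature name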
I think we can reduce the scope of this PR by only including the callable support. We can follow up with a PR on uniqueness checking.
I do not think we currently enforce uniqueness. For example, StandardScaler is basically a noop returning the input features without checking for uniqueness.

You're right, uniqueness is not currently enforced. I think it should be, though. Otherwise some later part of the pipeline might fail (like it did for me in the linked issue) and users have to go through a debugging process to figure out that the real cause is actually OneHotEncoder.
We also add a little more overhead for uniqueness checking.
This is a valid point. How latency-sensitive do you think get_feature_names_out is?
I think we can reduce the scope of this PR by only including the callable support. We can follow up with a PR on uniqueness checking.
I can certainly do that if you think that another discussion is needed around how we implement the uniqueness check.
This is a valid point. How latency-sensitive do you think get_feature_names_out is?

It will matter a bit when we want to support pandas output in the future, which would call get_feature_names_out in all transformers.
I can certainly do that if you think that there is another discussion needed around how we implement uniqueness check.
It's worth opening an issue and discussing. SLEP007 does not say anything about uniqueness checking right now. I think there are valid arguments for uniqueness checking, but I would not want that discussion to block this PR. Supporting a callable is already a net improvement.
It's worth opening an issue and discussing. SLEP007 does not say anything about uniqueness checking right now. I think there are valid arguments for uniqueness checking, but I would not want that discussion to block this PR. Supporting a callable is already a net improvement.
Thanks for the additional context, I wasn't aware of SLEPs. If there's no policy around unique feature names, the question is really why ColumnTransformer checks that, but I guess that's a discussion for the original issue #22488.
An alternative name I had in mind was "feature_names_schema". It looks OK for string references like "append-string", but less good when you're passing a lambda in. Can we consider a function a schema? Maybe yes 🤷
@thomasjpfan I've applied the changes you've suggested. What do you say about changing feature_name_combiner to the alternative name above? Also, it'd be good to know why you're not sold on feature_name_combiner.
I am usually not happy with the first name I come up with. Edit: Combiner is a word.
Let's stick with combiner for now. It's likely good enough.
Is there anything else I need to do before it's ready to get merged?
sklearn/preprocessing/_encoders.py
Outdated
elif callable(self.feature_name_combiner):
    return self.feature_name_combiner
else:
    raise ValueError(
Also we need to check that this error is raised correctly for invalid feature_name_combiners.
sklearn/preprocessing/_encoders.py
Outdated
if isinstance(
    self.feature_name_combiner, str
) and self.feature_name_combiner not in ("concat_string"):
We can call _get_feature_name_combiner here to do the error checking and not use the return.
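In other words, something like the following (a hypothetical, self-contained sketch of the suggested pattern; EncoderSketch is a stand-in, not the merged OneHotEncoder code):

class EncoderSketch:
    def __init__(self, feature_name_combiner="concat_string"):
        self.feature_name_combiner = feature_name_combiner

    def _get_feature_name_combiner(self):
        if self.feature_name_combiner == "concat_string":
            return lambda feature, category: str(feature) + "_" + str(category)
        elif callable(self.feature_name_combiner):
            return self.feature_name_combiner
        else:
            raise ValueError(
                f"Invalid feature_name_combiner: {self.feature_name_combiner!r}"
            )

    def fit(self):
        # Validate eagerly in fit, discarding the return value.
        self._get_feature_name_combiner()
        return self

EncoderSketch(feature_name_combiner=42).fit()  # -> ValueError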
sklearn/preprocessing/_encoders.py
Outdated
@@ -476,6 +496,34 @@ def infrequent_categories_(self):
            for category, indices in zip(self.categories_, infrequent_indices)
        ]

    def _validate_keywords(self):
@glemaitre My understanding is that we don't need this one anymore since the introduction of parameter constraints. Should I just remove it?
Yes, this is taken care of by the _parameter_constraints indeed.
I think I've got the code in a fine place, but I don't really understand the errors around documentation. I'd appreciate it if somebody could help with it. Update: I got it working.
sklearn/preprocessing/_encoders.py
Outdated
@@ -322,6 +322,16 @@ class OneHotEncoder(_BaseEncoder):
        .. versionadded:: 1.1
        Read more in the :ref:`User Guide <one_hot_encoder_infrequent_categories>`.

    feature_name_combiner : "concat_string" or callable, default="concat_string"
The error is for the word "string". I think the semantics are clear enough to use "concat" here instead of "concat_string".
When you say error, do you mean the literal error CI is raising?
When you say error, do you mean the literal error CI is raising?

Yes, I am literally speaking about the CI error. The CI error is a false positive. The numpydoc test sees the word "string" and fails, thinking it should be "str".
sklearn/preprocessing/_encoders.py
Outdated
:meth:`get_feature_names_out`.

`"concat_string"` concatenates encoded feature name and category separated by
'_'. E.g. feature X with values 1, 6, 7 create feature names `X_1, X_6, X_7`.
I think we can extend this out with a code snippet and state that "concat" does: feature + "_" + str(category). This way a reader can see that "feature" does not call str.
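The suggested docstring snippet might look something like this (a sketch of the proposed wording, not the merged text):

# "concat" behaves like this combiner; note that only the category goes
# through str(), while the feature name is used as-is:
def concat_combiner(feature, category):
    return feature + "_" + str(category)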
@thomasjpfan It's finally green 😁
Hey @glemaitre, I think I've addressed all the concerns from your review, and got the PR to the green state.
Hi @mariokostelac, we have a new workflow trying to push forward already-approved PRs. Do you mind fixing the conflicts, if you are still interested in working on this? Then I will apply a "waiting for second review" label, hoping this will motivate the reviewers.
LGTM
Thanks @mariokostelac, merging.
Reference Issues/PRs
Fixes #22488
See also #16593
Forces a rebase for #21754