Pass optional column names to parallel tree function in using the self.feature_names_in attribute. #26208

wesselhuising · Apr 18, 2023

Reference Issues/PRs

Fixes #26140

What does this implement/fix? Explain your changes.

When X is a pandas DataFrame, every tree inside the forest sets its feature_names_in_ attribute to the list of that DataFrame's columns.

Any other comments?


thomasjpfan

Thank you for the PR!

Based on the discussion in #26140, we do not have an agreed solution yet.

thomasjpfan · Apr 18, 2023

sklearn/ensemble/_forest.py

+
+        feature_names_in_ = None
+        if isinstance(X, pd.DataFrame):
+            feature_names_in_ = X.columns.to_list()


There is no need to extract the feature names here. When _validate_data is called, self.feature_names_in_ will be set. This can be passed into _parallel_build_trees directly:

diff --git a/sklearn/ensemble/_forest.py b/sklearn/ensemble/_forest.py
index de729264e3..44be556ca4 100644
--- a/sklearn/ensemble/_forest.py
+++ b/sklearn/ensemble/_forest.py
@@ -467,6 +467,7 @@ class BaseForest(MultiOutputMixin, BaseEnsemble, metaclass=ABCMeta):
                     verbose=self.verbose,
                     class_weight=self.class_weight,
                     n_samples_bootstrap=n_samples_bootstrap,
+                    feature_names_in_=self.feature_names_in_,
                 )
                 for i, t in enumerate(trees)
             )

Good one, let me try this.

I tried your solution, but a lot of the tests are failing with:

AttributeError: 'RandomForestClassifier' object has no attribute 'feature_names_in_'

With the updated version of my branch, all tests pass for ensemble/tests.

feature_names_in_ = []
if hasattr(X, "columns"):
    feature_names_in_ = X.columns.to_list()
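The check above relies on duck typing rather than importing pandas. A minimal sketch of that idea follows; `extract_feature_names`, `FakeColumns`, and `frame_like` are hypothetical names used only for illustration and are not part of scikit-learn or of this PR.

```python
from types import SimpleNamespace


def extract_feature_names(X):
    """Hypothetical helper: return column names if X looks like a DataFrame."""
    # Duck typing: any object exposing `.columns` is treated as frame-like,
    # so no pandas import is needed here.
    if hasattr(X, "columns"):
        return X.columns.to_list()
    return []


class FakeColumns(list):
    """Stand-in for a pandas Index; only provides `to_list()`."""

    def to_list(self):
        return list(self)


# A frame-like object (assumption: real code would receive a DataFrame).
frame_like = SimpleNamespace(columns=FakeColumns(["age", "height"]))

print(extract_feature_names(frame_like))        # ['age', 'height']
print(extract_feature_names([[1, 2], [3, 4]]))  # []
```

A plain nested list has no `columns` attribute, so the helper falls back to an empty list, mirroring the default in the snippet above.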

Ah yes, feature_names_in_ is not always set. Here is an update suggestion:

diff --git a/sklearn/ensemble/_forest.py b/sklearn/ensemble/_forest.py
index 117e2e6016..e28b117e07 100644
--- a/sklearn/ensemble/_forest.py
+++ b/sklearn/ensemble/_forest.py
@@ -481,7 +481,7 @@ class BaseForest(MultiOutputMixin, BaseEnsemble, metaclass=ABCMeta):
                     verbose=self.verbose,
                     class_weight=self.class_weight,
                     n_samples_bootstrap=n_samples_bootstrap,
-                    feature_names_in_=feature_names_in_,
+                    feature_names_in_=getattr(self, "feature_names_in_", None),
                 )
                 for i, t in enumerate(trees)
             )
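To illustrate why `getattr` with a default avoids the AttributeError reported earlier in the thread, here is a minimal sketch. `SketchEstimator` and `names_for_trees` are hypothetical stand-ins, not scikit-learn code; the sketch assumes, as the thread suggests, that the attribute is only set when the input carries column names.

```python
from types import SimpleNamespace


class SketchEstimator:
    """Hypothetical stand-in for an estimator; not scikit-learn code."""

    def fit(self, X):
        # Assumption for illustration: the attribute is set only when the
        # input exposes column names, and is otherwise left undefined.
        if hasattr(X, "columns"):
            self.feature_names_in_ = list(X.columns)
        return self

    def names_for_trees(self):
        # getattr with a default returns None instead of raising
        # AttributeError when the attribute was never set.
        return getattr(self, "feature_names_in_", None)


with_names = SketchEstimator().fit(SimpleNamespace(columns=["a", "b"]))
without_names = SketchEstimator().fit([[1, 2], [3, 4]])

print(with_names.names_for_trees())     # ['a', 'b']
print(without_names.names_for_trees())  # None
```

Accessing `without_names.feature_names_in_` directly would raise AttributeError, which is the failure mode the `getattr` suggestion sidesteps.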

Cheers, nice one. I implemented your suggestion and all tests pass for sklearn/ensemble/tests.

adrinjalali · Apr 18, 2023

Based on the discussion in #26140, we do not have an agreed solution yet.

Yeah, I'd hold off until we find a consensus there. But thanks for the PR.

pass column names to parallel fucntion in creating trees, adding the …

77fc5d6

…list if known to the attributes of the tree

github-actions bot added module:ensemble module:tree labels Apr 18, 2023

remove pandas import from code and use hasattr instead

da6555a

thomasjpfan reviewed Apr 18, 2023

improve code with suggestion from PR thread

c7ca397

wesselhuising changed the title ~~Pass column names to parallel function in creating trees in case X is a pandas DataFrame.~~ Pass optional column names to parallel tree function in using the self.feature_names_in attribute. Apr 18, 2023

thomasjpfan mentioned this pull request May 30, 2023

RandomForest not passing feature names to trees and creating warnings. #26140

Open
