Pass optional column names to parallel tree function in using the self.feature_names_in attribute. #26208


Open: wesselhuising wants to merge 3 commits into base: main

Conversation

wesselhuising
Reference Issues/PRs

Example: Fixes #26140

What does this implement/fix? Explain your changes.

When X is a pandas DataFrame, every tree inside the forest sets its feature_names_in_ attribute to the list of that DataFrame's columns.

Any other comments?

…list if known to the attributes of the tree
Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR!

Based on the discussion in #26140, we do not have an agreed solution yet.


feature_names_in_ = None
if isinstance(X, pd.DataFrame):
    feature_names_in_ = X.columns.to_list()
Member

There is no need to extract the feature names here. When _validate_data is called, self.feature_names_in_ will be set. This can be passed into _parallel_build_trees directly:

diff --git a/sklearn/ensemble/_forest.py b/sklearn/ensemble/_forest.py
index de729264e3..44be556ca4 100644
--- a/sklearn/ensemble/_forest.py
+++ b/sklearn/ensemble/_forest.py
@@ -467,6 +467,7 @@ class BaseForest(MultiOutputMixin, BaseEnsemble, metaclass=ABCMeta):
                     verbose=self.verbose,
                     class_weight=self.class_weight,
                     n_samples_bootstrap=n_samples_bootstrap,
+                    feature_names_in_=self.feature_names_in_,
                 )
                 for i, t in enumerate(trees)
             )

Author

Good one, let me try this.

Author

@wesselhuising wesselhuising Apr 18, 2023

I tried your solution, but a lot of the tests are failing:

AttributeError: 'RandomForestClassifier' object has no attribute 'feature_names_in_'

With the updated version of my branch, all the tests in ensemble/tests are green.

feature_names_in_ = []
if hasattr(X, "columns"):
    feature_names_in_ = X.columns.to_list()

Member

Ah yes, feature_names_in_ is not always set. Here is an updated suggestion:

diff --git a/sklearn/ensemble/_forest.py b/sklearn/ensemble/_forest.py
index 117e2e6016..e28b117e07 100644
--- a/sklearn/ensemble/_forest.py
+++ b/sklearn/ensemble/_forest.py
@@ -481,7 +481,7 @@ class BaseForest(MultiOutputMixin, BaseEnsemble, metaclass=ABCMeta):
                     verbose=self.verbose,
                     class_weight=self.class_weight,
                     n_samples_bootstrap=n_samples_bootstrap,
-                    feature_names_in_=feature_names_in_,
+                    feature_names_in_=getattr(self, "feature_names_in_", None),
                 )
                 for i, t in enumerate(trees)
             )
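
To illustrate why the getattr default matters, here is a dependency-free sketch. ToyEstimator and ToyFrame are hypothetical stand-ins invented for this example, not scikit-learn classes; they only mimic the behaviour discussed in this thread, where the attribute exists only if the input carried column names.

```python
class ToyFrame:
    """Hypothetical container exposing a `columns` attribute."""

    def __init__(self, columns):
        self.columns = columns


class ToyEstimator:
    """Hypothetical stand-in for an estimator; not scikit-learn code."""

    def fit(self, X):
        # Mimic the behaviour described above: the attribute is only
        # created when the input carries column names.
        if hasattr(X, "columns"):
            self.feature_names_in_ = list(X.columns)
        return self


with_names = ToyEstimator().fit(ToyFrame(["a", "b"]))
without_names = ToyEstimator().fit([[1, 2], [3, 4]])

# getattr with a default works whether or not the attribute was set.
names_present = getattr(with_names, "feature_names_in_", None)
names_missing = getattr(without_names, "feature_names_in_", None)

# Plain attribute access fails when the attribute was never created.
try:
    without_names.feature_names_in_
    raised = False
except AttributeError:
    raised = True
```

Plain attribute access reproduces the kind of AttributeError reported in the failing tests, while the getattr form falls back to None.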

Author

Cheers, nice one. I implemented your suggestion and all tests in sklearn/ensemble/tests are green.

@adrinjalali
Member

Based on the discussion in #26140, we do not have an agreed solution yet.

yeah, I'd hold until we find a consensus there. But thanks for the PR.

@wesselhuising wesselhuising changed the title from "Pass column names to parallel function in creating trees in case X is a pandas DataFrame." to "Pass optional column names to parallel tree function in using the self.feature_names_in attribute." Apr 18, 2023
Development

Successfully merging this pull request may close these issues.

RandomForest not passing feature names to trees and creating warnings.
3 participants