Pass optional column names to parallel tree function in using the self.feature_names_in attribute. #26208
…list if known to the attributes of the tree
Thank you for the PR!
Based on the discussion in #26140, we do not have an agreed solution yet.
sklearn/ensemble/_forest.py
Outdated
feature_names_in_ = None
if isinstance(X, pd.DataFrame):
    feature_names_in_ = X.columns.to_list()
There is no need to extract the feature names here. When _validate_data is called, self.feature_names_in_ will be set. This can be passed into _parallel_build_trees directly:
diff --git a/sklearn/ensemble/_forest.py b/sklearn/ensemble/_forest.py
index de729264e3..44be556ca4 100644
--- a/sklearn/ensemble/_forest.py
+++ b/sklearn/ensemble/_forest.py
@@ -467,6 +467,7 @@ class BaseForest(MultiOutputMixin, BaseEnsemble, metaclass=ABCMeta):
                     verbose=self.verbose,
                     class_weight=self.class_weight,
                     n_samples_bootstrap=n_samples_bootstrap,
+                    feature_names_in_=self.feature_names_in_,
                 )
                 for i, t in enumerate(trees)
             )
Good one, let me try this.
I tried your solution, but a lot of the tests are failing with:
AttributeError: 'RandomForestClassifier' object has no attribute 'feature_names_in_'
With the updated version of my branch, all the tests are green for ensemble/tests:
feature_names_in_ = []
if hasattr(X, "columns"):
    feature_names_in_ = X.columns.to_list()
Ah yes, feature_names_in_ is not always set. Here is an update suggestion:
diff --git a/sklearn/ensemble/_forest.py b/sklearn/ensemble/_forest.py
index 117e2e6016..e28b117e07 100644
--- a/sklearn/ensemble/_forest.py
+++ b/sklearn/ensemble/_forest.py
@@ -481,7 +481,7 @@ class BaseForest(MultiOutputMixin, BaseEnsemble, metaclass=ABCMeta):
                     verbose=self.verbose,
                     class_weight=self.class_weight,
                     n_samples_bootstrap=n_samples_bootstrap,
-                    feature_names_in_=feature_names_in_,
+                    feature_names_in_=getattr(self, "feature_names_in_", None),
                 )
                 for i, t in enumerate(trees)
             )
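To make the difference between the two suggestions concrete, here is a minimal sketch in plain Python. The `Estimator` class and the column names are hypothetical stand-ins, not scikit-learn code; the sketch only illustrates why direct attribute access raised the AttributeError reported above while `getattr` with a default does not.

```python
class Estimator:
    """Hypothetical stand-in for a fitted estimator; not scikit-learn code."""


# Assumption: fitting on input without column names never creates the attribute.
fitted_on_array = Estimator()

# Assumption: fitting on input with column names stores them on the estimator.
fitted_on_frame = Estimator()
fitted_on_frame.feature_names_in_ = ["age", "income"]

# Direct access fails when the attribute is missing.
try:
    fitted_on_array.feature_names_in_
    direct_access_failed = False
except AttributeError:
    direct_access_failed = True

# getattr with a default never raises; it falls back to None instead.
names_array = getattr(fitted_on_array, "feature_names_in_", None)
names_frame = getattr(fitted_on_frame, "feature_names_in_", None)

print(direct_access_failed, names_array, names_frame)
```

Passing `None` rather than an empty list lets the receiving function distinguish "no names known" from "zero features".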
Cheers, nice one. I implemented your suggestion and all tests are green for sklearn/ensemble/tests.
Yeah, I'd hold off until we find a consensus there. But thanks for the PR.
Reference Issues/PRs
Example: Fixes #26140
What does this implement/fix? Explain your changes.
When X is a pandas DataFrame, every tree inside the forest sets its feature_names_in_ attribute to the list of that DataFrame's column names.
Any other comments?
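The behaviour described in this PR can be sketched roughly as follows. This is an illustration only: `TinyFrame`, `TinyTree`, and `TinyForest` are hypothetical stand-ins written in plain Python, not the classes in sklearn/ensemble/_forest.py, and the sketch assumes the forest forwards its own names (or None) to each tree.

```python
class TinyFrame:
    """Hypothetical stand-in for a pandas DataFrame: rows plus column names."""

    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = columns


class TinyTree:
    """Hypothetical stand-in for a single tree in the forest."""

    def fit(self, rows, feature_names_in_=None):
        # Only record names when the caller supplied them.
        if feature_names_in_ is not None:
            self.feature_names_in_ = list(feature_names_in_)
        return self


class TinyForest:
    """Hypothetical stand-in for a forest that forwards names to its trees."""

    def __init__(self, n_estimators=3):
        self.n_estimators = n_estimators

    def fit(self, X):
        columns = getattr(X, "columns", None)
        if columns is not None:
            self.feature_names_in_ = list(columns)
        elif hasattr(self, "feature_names_in_"):
            # Refit on input without names: drop the stale attribute.
            del self.feature_names_in_
        rows = getattr(X, "rows", X)
        names = getattr(self, "feature_names_in_", None)
        self.estimators_ = [
            TinyTree().fit(rows, feature_names_in_=names)
            for _ in range(self.n_estimators)
        ]
        return self


frame_forest = TinyForest().fit(TinyFrame([[1, 2], [3, 4]], ["a", "b"]))
array_forest = TinyForest().fit([[1, 2], [3, 4]])

print([tree.feature_names_in_ for tree in frame_forest.estimators_])
print([hasattr(tree, "feature_names_in_") for tree in array_forest.estimators_])
```

With named input every tree carries the same column list; with unnamed input no tree gains the attribute.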