Fix/handle categorical features #30798 #30799
The average_precision_score function was returning misleading results (1.0 or 0.0) when given a single sample. Added validation to require at least 2 samples and provide a clear error message. Fixes scikit-learn#30615
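For reference, the kind of guard this commit message describes could be sketched as below. This is only an illustration, not the code from the commit: `checked_average_precision` is a hypothetical wrapper name, and the error text is made up for the example.

```python
import numpy as np
from sklearn.metrics import average_precision_score


def checked_average_precision(y_true, y_score):
    """Hypothetical wrapper: refuse inputs with fewer than 2 samples."""
    y_true = np.asarray(y_true)
    if y_true.shape[0] < 2:
        # Illustrative message only; the real wording in the commit is not shown.
        raise ValueError(
            "average precision needs at least 2 samples, "
            f"got {y_true.shape[0]}."
        )
    return average_precision_score(y_true, y_score)


# Two perfectly ranked samples give an average precision of 1.0.
score = checked_average_precision([0, 1], [0.1, 0.9])

# A single sample is rejected instead of silently returning 1.0 or 0.0.
try:
    checked_average_precision([1], [0.5])
    single_sample_rejected = False
except ValueError:
    single_sample_rejected = True
```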
…-sample-30615 Fix:Fix/average precision score single sample 30615
❌ Linting issues
This PR is introducing linting issues.
The original issue shows a stacktrace where the error is coming from xgboost, not scikit-learn, so I'm not sure what's to fix here.
if isinstance(X, pd.DataFrame):
    X_new = X.iloc[:, candidate_mask]
else:
    X_new = X[:, candidate_mask]
You should use the function _safe_indexing instead.
However, it is already too late at this point, because X has already been converted. We should avoid the call to validate_data, or at least not validate X, so that it is not converted to a NumPy array.
We also need to use n_features = _num_features(X) to compute the number of features.
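A minimal sketch of that suggestion, assuming X has not been converted to a NumPy array beforehand. `_safe_indexing` and `_num_features` are private scikit-learn helpers, so their import paths may change between releases; the column names and data below are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.utils import _safe_indexing
from sklearn.utils.validation import _num_features

# Toy inputs (hypothetical data): one DataFrame with a categorical column,
# one plain NumPy array with the same number of columns.
X_df = pd.DataFrame(
    {
        "age": [25, 32, 47],
        "city": pd.Categorical(["Paris", "Lyon", "Paris"]),
        "income": [30.0, 45.5, 52.1],
    }
)
X_np = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# Works for DataFrames and arrays alike, without converting X.
n_features = _num_features(X_df)

# Boolean mask selecting the first two columns.
candidate_mask = np.zeros(n_features, dtype=bool)
candidate_mask[[0, 1]] = True

# One code path for both container types, no isinstance check needed.
X_new_df = _safe_indexing(X_df, candidate_mask, axis=1)
X_new_np = _safe_indexing(X_np, candidate_mask, axis=1)
```

With this, the DataFrame result keeps its column names and the categorical dtype of `city`, while the array result is an ordinary column slice.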
@glemaitre Should I modify the earlier validation step to prevent X from being converted to a NumPy array? If so, should I change how validate_data is called, or should we handle it differently?
validate_data has a skip_check_array arg which would skip conversion to numpy.
Ok @adrinjalali
Fix: Handle Categorical Features in SequentialFeatureSelector ([#30785](#30785))

Overview:
This update resolves an issue in the SequentialFeatureSelector class, where text and categorical features were not handled properly even when the estimator supports them. The issue is tracked under [#30785](#30785).

Problem:
SequentialFeatureSelector did not correctly handle pandas DataFrame inputs that included categorical or text-based features, leading to errors during feature selection. Although some estimators (e.g., XGBRegressor) support categorical features, the feature selector failed to process them as expected.

Solution:
To address this issue, the following change was made to the sklearn/feature_selection/_sequential.py file:

Added Handling for Pandas DataFrames:
The code was updated to ensure compatibility with pandas DataFrames. When the input X is a DataFrame, it uses .iloc[] to slice the columns based on the candidate_mask.

Impact:
SequentialFeatureSelector now handles both numeric and categorical features in DataFrames.
Estimators such as XGBRegressor, which support categorical data, now work with the feature selector without raising errors.

Testing: