Fix/handle categorical features #30798 #30799

Shyanil · Feb 9, 2025

Fix: Handle Categorical Features in `SequentialFeatureSelector` ([#30785](#30785))

Overview:
This update resolves an issue in the SequentialFeatureSelector class, where text and categorical features were not handled properly despite the estimator supporting them. The issue was tracked under [#30785](#30785).

Problem:
The SequentialFeatureSelector was not correctly handling pandas DataFrame inputs that included categorical or text-based features, leading to errors during feature selection. Although some estimators (e.g., XGBRegressor) support categorical features, the feature selector was failing to process them as expected.

Solution:
To address this issue, the following change was made to the sklearn/feature_selection/_sequential.py file:

Added Handling for Pandas DataFrames:
The code was updated to ensure compatibility with pandas DataFrames. When the input X is a DataFrame, it uses .iloc[] to correctly slice the columns based on the candidate_mask.
```
if isinstance(X, pd.DataFrame):
    X_new = X.iloc[:, candidate_mask]
else:
    X_new = X[:, candidate_mask]
```
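As a rough, self-contained sketch of the branching shown above: the toy data and the `select_columns` helper name are illustrative assumptions, not code from the PR. It suggests that boolean-mask selection through `.iloc` keeps each column's dtype, including `category`, whereas a NumPy array is sliced directly.

```python
import numpy as np
import pandas as pd


def select_columns(X, candidate_mask):
    """Return the columns of X flagged True in candidate_mask (hypothetical helper)."""
    if isinstance(X, pd.DataFrame):
        # Positional boolean indexing; per-column dtypes are preserved.
        return X.iloc[:, candidate_mask]
    # Plain NumPy boolean indexing along the column axis.
    return X[:, candidate_mask]


# Toy data (assumption): one categorical column between two numeric ones.
df = pd.DataFrame(
    {
        "age": [25, 32, 47],
        "city": pd.Categorical(["Paris", "Lyon", "Paris"]),
        "income": [30.0, 45.5, 52.0],
    }
)
mask = np.array([False, True, True])

subset = select_columns(df, mask)
array_subset = select_columns(
    df[["age", "income"]].to_numpy(), np.array([True, False])
)
```

Note that this only covers the slicing step; it does not help if `X` has already been converted to a NumPy array earlier in `fit`.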

Impact:

The fix ensures that the SequentialFeatureSelector now properly handles both numeric and categorical features in DataFrames.
Estimators like XGBRegressor, which support categorical data, now function correctly with the feature selector without raising errors.

Testing:

Verified the solution with datasets containing numeric, categorical, and text features. The feature selector now works seamlessly across all data types.

The average_precision_score function was returning misleading results (1.0 or 0.0) when given a single sample. Added validation to require at least 2 samples and provide a clear error message. Fixes scikit-learn#30615
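The kind of guard described here can be sketched in plain Python. This is a standalone illustration, not scikit-learn's code: the `check_min_samples` name and its `min_samples` parameter are hypothetical, and only the first sentence of the error message follows the wording used in the PR.

```python
def check_min_samples(y_true, min_samples=2):
    """Raise ValueError if fewer than min_samples labels are given (hypothetical helper)."""
    n_samples = len(y_true)
    if n_samples < min_samples:
        raise ValueError(
            f"Average precision requires at least {min_samples} samples. "
            f"Got {n_samples}."
        )


check_min_samples([0, 1, 1])  # three samples: passes silently

try:
    check_min_samples([1])  # a single sample is rejected
except ValueError as exc:
    message = str(exc)
```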

github-actions · Feb 9, 2025

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here

`black`

black detected issues. Please run black . locally and push the changes. Here you can see the detected issues. Note that running black might also fix some of the issues which might be detected by ruff. Note that the installed black version is black=24.3.0.


--- /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/_ranking.py	2025-02-10 09:36:15.086966+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/_ranking.py	2025-02-10 09:36:32.897032+00:00
@@ -218,11 +218,10 @@
     0.77...
     """
 
     def _binary_uninterpolated_average_precision(
         y_true, y_score, pos_label=1, sample_weight=None
-
     ):
         if len(y_true) < 2:
             raise ValueError(
                 f"Average precision requires at least 2 samples. Got {len(y_true)}."
                 " A single sample cannot form a precision-recall curve."
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/metrics/_ranking.py

Oh no! 💥 💔 💥
1 file would be reformatted, 920 files would be left unchanged.

_Generated for commit: acb30ab. Link to the linter CI: here_

adrinjalali · Feb 11, 2025

The original issue shows a stacktrace where the error is coming from xgboost, not scikit-learn, so I'm not sure what's to fix here.

glemaitre · Feb 11, 2025

sklearn/feature_selection/_sequential.py

+            if isinstance(X, pd.DataFrame):
+                X_new = X.iloc[:, candidate_mask]
+            else:
+                X_new = X[:, candidate_mask]


You should use the function _safe_indexing instead.

However, at this point it is already too late, because X has already been converted to a NumPy array. We should avoid the call to validate_data, or at least not validate X, so that it is not converted.

We also need to use n_features = _num_features(X) to compute the number of features.
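A minimal sketch of what this suggestion could look like in isolation, assuming a scikit-learn version that exposes the private helpers `_safe_indexing` (in `sklearn.utils`) and `_num_features` (in `sklearn.utils.validation`). Both are private API and may change; the toy DataFrame is an assumption for illustration, and this is not the code that was merged.

```python
import numpy as np
import pandas as pd
from sklearn.utils import _safe_indexing
from sklearn.utils.validation import _num_features

# Toy data (assumption): a categorical column alongside numeric ones.
X = pd.DataFrame(
    {
        "age": [25, 32, 47],
        "city": pd.Categorical(["Paris", "Lyon", "Paris"]),
        "income": [30.0, 45.5, 52.0],
    }
)

# Count features without converting X to a NumPy array.
n_features = _num_features(X)

# Boolean mask over columns, as in the candidate_mask from the diff.
candidate_mask = np.zeros(n_features, dtype=bool)
candidate_mask[[1, 2]] = True

# One call handles both DataFrames and arrays, so no isinstance branch is needed.
X_new = _safe_indexing(X, candidate_mask, axis=1)
X_new_array = _safe_indexing(
    X[["age", "income"]].to_numpy(), np.array([True, False]), axis=1
)
```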

Shyanil · Feb 12, 2025

@glemaitre Should I modify the earlier validation step to prevent X from being converted to a NumPy array? If so, should I change how validate_data is called, or should we handle it differently?

adrinjalali · Feb 12, 2025

validate_data has a skip_check_array arg which would skip conversion to numpy.

Shyanil · Feb 13, 2025

Ok @adrinjalali

Shyanil and others added 8 commits January 12, 2025 23:34

FIX Fix average_precision_score for single sample case

2f224d2

The average_precision_score function was returning misleading results (1.0 or 0.0) when given a single sample. Added validation to require at least 2 samples and provide a clear error message. Fixes scikit-learn#30615

FIX Fix average_precision_score for single sample case

f43553f

The average_precision_score function was returning misleading results (1.0 or 0.0) when given a single sample. Added validation to require at least 2 samples and provide a clear error message. Fixes scikit-learn#30615

FIX Fix average_precision_score for single sample case

e114d15

The average_precision_score function was returning misleading results (1.0 or 0.0) when given a single sample. Added validation to require at least 2 samples and provide a clear error message. Fixes scikit-learn#30615

FIX Fix average_precision_score for single sample case

e58a18f

The average_precision_score function was returning misleading results (1.0 or 0.0) when given a single sample. Added validation to require at least 2 samples and provide a clear error message. Fixes scikit-learn#30615

Merge pull request #1 from Shyanil/fix/average-precision-score-single…

7778785

…-sample-30615 Fix:Fix/average precision score single sample 30615

Merge branch 'scikit-learn:main' into main

11ee502

Fix: Handle categorical features properly in SequentialFeatureSelector

a2735b4

Fix: Handle categorical features properly in SequentialFeatureSelector

cf42700

github-actions bot added module:feature_selection module:metrics labels Feb 9, 2025

Merge branch 'main' into fix/handle-categorical-features-1

acb30ab

glemaitre reviewed Feb 11, 2025

Shyanil Feb 12, 2025

adrinjalali Feb 12, 2025

Shyanil Feb 13, 2025
