MNT n_features_in_ consistency in decomposition #18557


Merged

Conversation

ogrisel (Member) commented on Oct 7, 2020:

Early PR to let @thomasjpfan (and others) know that I started working on this module.

Builds on #18514.

@@ -324,4 +323,5 @@ def test_strict_mode_parametrize_with_checks(estimator, check):
@pytest.mark.parametrize("estimator", N_FEATURES_IN_AFTER_FIT_ESTIMATORS,
                         ids=_get_check_estimator_ids)
def test_check_n_features_in_after_fitting(estimator):
    _set_checking_parameters(estimator)
ogrisel (Member, author) commented on the diff:

@thomasjpfan if you work on another module you might need this, as long as this check is not part of the list of standard checks.
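For context, a minimal sketch of how a check that is not yet in the standard suite can be run for a single estimator, assuming the check function is importable from sklearn.utils.estimator_checks under the name used in the test above:

from sklearn.decomposition import PCA
from sklearn.utils.estimator_checks import check_n_features_in_after_fitting

# Run the n_features_in_ consistency check for one estimator while the
# check is not yet part of the standard check suite.
check_n_features_in_after_fitting("PCA", PCA())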

@thomasjpfan changed the title from "feature_names_in_ consistency with _validate_data in sklearn.decomposition module" to "n_features_in_ consistency with _validate_data in sklearn.decomposition module" on Oct 7, 2020.
ogrisel (Member, author) commented on Oct 7, 2020:

Thanks for fixing the title of this PR :)

@ogrisel force-pushed the features_in_consistency_decomposition branch from 108c7a4 to 2fb837d on October 7, 2020 at 16:39.
@@ -1347,6 +1347,9 @@ def transform(self, X):
             Transformed data.
         """
         check_is_fitted(self)
+        X = self._validate_data(X, accept_sparse=('csr', 'csc'),
+                                dtype=[np.float64, np.float32],
+                                reset=False)

         W, _, n_iter_ = non_negative_factorization(
ogrisel (Member, author) commented:

non_negative_factorization also calls check_array internally, so calling _validate_data here adds some performance overhead. For simplicity's sake I don't want to optimize this as part of this PR, but we might want to add a kwarg to non_negative_factorization to skip input validation.
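A hypothetical sketch of that kwarg idea (validate_input is a made-up name, not part of this PR or of scikit-learn): callers that have already gone through _validate_data could opt out of the second validation pass.

import numpy as np
from sklearn.utils import check_array

def non_negative_factorization(X, W=None, H=None, n_components=None, *,
                               validate_input=True, **params):
    # Hypothetical kwarg: estimators that already validated X via
    # _validate_data would pass validate_input=False to avoid a second
    # check_array pass over the data.
    if validate_input:
        X = check_array(X, accept_sparse=('csr', 'csc'),
                        dtype=[np.float64, np.float32])
    ...  # factorization itself, unchanged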

NicolasHug (Member) commented:

maybe write that as a TODO comment?

Another member commented:

This is most likely going to appear in other places. Another solution would be to have a private _non_negative_factorization that does not call check_array but still checks for non-negative values.
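A hypothetical sketch of that alternative (names and layout are illustrative, not the actual scikit-learn code): the public function keeps full validation and delegates to a private variant that only enforces non-negativity.

import numpy as np
from sklearn.utils import check_array
from sklearn.utils.validation import check_non_negative

def _non_negative_factorization(X, **params):
    # Assumes X is already a validated dense array or sparse matrix, but
    # still guards the NMF-specific non-negativity requirement.
    check_non_negative(X, "NMF (input X)")
    ...  # the factorization itself

def non_negative_factorization(X, **params):
    # Public entry point: full input validation, then delegate.
    X = check_array(X, accept_sparse=('csr', 'csc'),
                    dtype=[np.float64, np.float32])
    return _non_negative_factorization(X, **params)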

ogrisel (Member, author) replied:

I added a comment.
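The comment itself is not shown in this excerpt; a plausible wording (hypothetical reconstruction, not the actual committed text) would be something like:

# (hypothetical wording)
# TODO: non_negative_factorization calls check_array internally, so X is
# validated twice here; add a way to skip the inner validation.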

@ogrisel marked this pull request as ready for review on October 7, 2020 at 16:46.
NicolasHug (Member) left a review:

Thanks @ogrisel, a few questions but looks good

@@ -124,7 +123,7 @@ def transform(self, X):
         """
         check_is_fitted(self)

-        X = check_array(X)
+        X = self._validate_data(X, dtype=[np.float64, np.float32], reset=False)
NicolasHug (Member) commented:

do we need to pass the dtype here?

ogrisel (Member, author) replied:

I see no reason why this code would not work properly in float32, so this is a slight performance improvement.

ogrisel (Member, author) added:

Note that the fit of PCA accepts float32 without upcasting, so this change makes transform consistent with fit.
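A quick illustration of the dtype argument using check_array, which _validate_data wraps: dtypes in the list are preserved, and anything else is converted to the first entry.

import numpy as np
from sklearn.utils import check_array

X32 = np.ones((4, 2), dtype=np.float32)
Xint = np.ones((4, 2), dtype=np.int64)

# float32 is in the allowed list, so it is kept (no upcast to float64):
print(check_array(X32, dtype=[np.float64, np.float32]).dtype)   # float32
# dtypes outside the list are converted to the first entry, float64:
print(check_array(Xint, dtype=[np.float64, np.float32]).dtype)  # float64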

Resolved review thread (outdated): sklearn/decomposition/_dict_learning.py

        Returns
        -------
        X_new : ndarray of shape (n_samples, n_components)
        """
        check_is_fitted(self)

-        X = check_array(X, copy=copy, dtype=FLOAT_DTYPES)
+        X = self._validate_data(X, copy=(copy and self.whiten),
NicolasHug (Member) commented:

Isn't this a change of behavior?

ogrisel (Member, author) replied:

This is a small memory-efficiency optimization :)
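An abridged sketch of the method around this diff (based on FastICA.transform, body shortened; not the full implementation): X is only mutated in place on the whitening path, which is why the defensive copy is only needed when self.whiten is True.

def transform(self, X, copy=True):
    check_is_fitted(self)
    # Copy only when whitening: that is the only branch that writes to X.
    X = self._validate_data(X, copy=(copy and self.whiten), reset=False)
    if self.whiten:
        X -= self.mean_  # in-place subtraction mutates X
    return np.dot(X, self.components_.T)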

Resolved review thread (outdated): sklearn/decomposition/_incremental_pca.py
Resolved review threads (outdated): sklearn/decomposition/_lda.py, sklearn/decomposition/tests/test_online_lda.py
NicolasHug (Member) left a review:

thanks @ogrisel, LGTM when green!

Resolved review thread: sklearn/decomposition/_lda.py
thomasjpfan (Member) left a review:

LGTM

@thomasjpfan changed the title from "n_features_in_ consistency with _validate_data in sklearn.decomposition module" to "MNT n_features_in_ consistency in decomposition" on Oct 8, 2020.
@thomasjpfan merged commit 548a452 into scikit-learn:master on Oct 8, 2020.
@ogrisel deleted the features_in_consistency_decomposition branch on October 8, 2020 at 14:53.
amrcode pushed a commit to amrcode/scikit-learn that referenced this pull request on Oct 19, 2020 (Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>).
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request on Oct 22, 2020 (Co-authored-by: Nicolas Hug <contact@nicolas-hug.com>).