Describe the bug
Locally I see intermittent failures of the KMeans()-check_transformer_data_not_an_array test. I don't see these failures on 0.24.2.
One additional weird thing is that this is not happening in the CI and I seem to be the first to complain about it (at least I could not find it in the issues).
❯ pytest sklearn/tests/test_common.py -k 'KMeans and data_not_an_array'
================================================================================================================================== test session starts ===================================================================================================================================
platform linux -- Python 3.7.7, pytest-6.2.3, py-1.10.0, pluggy-0.13.1
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/lesteve/dev/scikit-learn/.hypothesis/examples')
rootdir: /home/lesteve/dev/scikit-learn, configfile: setup.cfg
plugins: hypothesis-4.36.2, asyncio-0.10.0, cov-2.7.1
collected 7785 items / 7783 deselected / 2 selected
sklearn/tests/test_common.py F. [100%]
======================================================================================================================================== FAILURES ========================================================================================================================================
_____________________________________________________________________________________________________________ test_estimators[KMeans()-check_transformer_data_not_an_array] ______________________________________________________________________________________________________________
estimator = KMeans(max_iter=5, n_clusters=2, n_init=2), check = functools.partial(<function check_transformer_data_not_an_array at 0x7fec5ceb7050>, 'KMeans'), request = <FixtureRequest for <Function test_estimators[KMeans()-check_transformer_data_not_an_array]>>
    @parametrize_with_checks(list(_tested_estimators()))
    def test_estimators(estimator, check, request):
        # Common tests for estimator instances
        with ignore_warnings(category=(FutureWarning,
                                       ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            _set_checking_parameters(estimator)
>           check(estimator)
sklearn/tests/test_common.py:90:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/utils/_testing.py:308: in wrapper
    return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:1289: in check_transformer_data_not_an_array
    _check_transformer(name, transformer, X, y)
sklearn/utils/estimator_checks.py:1366: in _check_transformer
    % transformer, atol=1e-2)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
x = array([[0.20255037, 3.57452256],
       [3.22900935, 0.23278803],
       [3.2526988 , 0.33507256],
       [0.34546063,...79101, 0.53812952],
       [0.32877832, 3.45131422],
       [0.19314137, 3.21278957],
       [3.79826855, 0.41606372]])
y = array([[3.57452256, 0.20255037],
       [0.23278803, 3.22900935],
       [0.33507256, 3.2526988 ],
       [3.23476868,...12952, 3.75679101],
       [3.45131422, 0.32877832],
       [3.21278957, 0.19314137],
       [0.41606372, 3.79826855]]), rtol = 1e-07, atol = 0.01
err_msg = 'fit_transform and transform outcomes not consistent in KMeans(max_iter=5, n_clusters=2, n_init=2, random_state=0)'
    def assert_allclose_dense_sparse(x, y, rtol=1e-07, atol=1e-9, err_msg=''):
        """Assert allclose for sparse and dense data.

        Both x and y need to be either sparse or dense, they
        can't be mixed.

        Parameters
        ----------
        x : {array-like, sparse matrix}
            First array to compare.
        y : {array-like, sparse matrix}
            Second array to compare.
        rtol : float, default=1e-07
            relative tolerance; see numpy.allclose.
        atol : float, default=1e-9
            absolute tolerance; see numpy.allclose. Note that the default here is
            more tolerant than the default for numpy.testing.assert_allclose, where
            atol=0.
        err_msg : str, default=''
            Error message to raise.
        """
        if sp.sparse.issparse(x) and sp.sparse.issparse(y):
            x = x.tocsr()
            y = y.tocsr()
            x.sum_duplicates()
            y.sum_duplicates()
            assert_array_equal(x.indices, y.indices, err_msg=err_msg)
            assert_array_equal(x.indptr, y.indptr, err_msg=err_msg)
            assert_allclose(x.data, y.data, rtol=rtol, atol=atol, err_msg=err_msg)
        elif not sp.sparse.issparse(x) and not sp.sparse.issparse(y):
            # both dense
>           assert_allclose(x, y, rtol=rtol, atol=atol, err_msg=err_msg)
E AssertionError:
E Not equal to tolerance rtol=1e-07, atol=0.01
E fit_transform and transform outcomes not consistent in KMeans(max_iter=5, n_clusters=2, n_init=2, random_state=0)
E Mismatched elements: 60 / 60 (100%)
E Max absolute difference: 3.38712923
E Max relative difference: 24.28104678
E x: array([[0.20255 , 3.574523],
E [3.229009, 0.232788],
E [3.252699, 0.335073],...
E y: array([[3.574523, 0.20255 ],
E [0.232788, 3.229009],
E [0.335073, 3.252699],...
sklearn/utils/_testing.py:415: AssertionError
=============================================================================================================== 1 failed, 1 passed, 7783 deselected, 32 warnings in 2.69s ================================================================================================================
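As an aside (my own minimal illustration, not part of the failing check): KMeans.transform returns each sample's distance to every cluster center, one column per center. The x and y arrays in the failure above differ by a column swap, which is exactly what you get if two fits find the same centers in the opposite order:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustration only: reordering cluster_centers_ swaps the columns of
# transform's output, the same pattern as the x/y arrays in the failure above.
X, _ = make_blobs(n_samples=30, centers=[[0, 0], [4, 4]], cluster_std=0.1,
                  random_state=0)
km = KMeans(n_clusters=2, n_init=2, random_state=0).fit(X)
Xt = km.transform(X)                      # distances, one column per center
km.cluster_centers_ = km.cluster_centers_[::-1]
Xt_swapped = km.transform(X)
assert np.allclose(Xt, Xt_swapped[:, ::-1])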
Steps/Code to Reproduce
I can reproduce this failure consistently with:
Create a test-kmeans.sh file:
#!/bin/bash
set -e
conda create -n test python scipy cython pytest joblib threadpoolctl -y
conda activate test
pip install --no-build-isolation --editable .
# make sure to run it a few times to trigger the test failure
for i in $(seq 1 50); do
    pytest sklearn/tests/test_common.py -k 'KMeans and data_not_an_array'
done
Run test-kmeans.sh:
source test-kmeans.sh
Expected Results
No test failure
Actual Results
Test failure
Other comments
Looking a bit more into it, it seems that when calling KMeans.fit the cluster centers can end up in a different order on main, whereas the order is consistent on 0.24.2. Wild guess: maybe this is due to some use of low-level parallelism in KMeans?
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=30, centers=[[0, 0, 0], [1, 1, 1]],
                  random_state=0, n_features=2, cluster_std=0.1)
kmeans = KMeans(max_iter=5, n_clusters=2, n_init=2, random_state=0)
ref_cluster_centers = kmeans.fit(X, y).cluster_centers_
print('reference cluster centers:\n', ref_cluster_centers)
print('-'*80)
for i in range(100):
    cluster_centers = kmeans.fit(X, y).cluster_centers_
    if not np.allclose(ref_cluster_centers, cluster_centers):
        print('differing cluster centers:\n', cluster_centers)
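To double-check that it is really only the ordering that changes, here is a rough extension of the snippet above (a sketch, not part of the failing test; sorting the centers by their first coordinate is enough here because the two blobs are well separated):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=30, centers=[[0, 0, 0], [1, 1, 1]],
                  random_state=0, n_features=2, cluster_std=0.1)
kmeans = KMeans(max_iter=5, n_clusters=2, n_init=2, random_state=0)

def sorted_centers():
    # Sort the centers by their first coordinate so that a pure permutation
    # of the centers compares equal.
    centers = kmeans.fit(X, y).cluster_centers_
    return centers[np.argsort(centers[:, 0])]

ref_sorted = sorted_centers()
mismatches = sum(not np.allclose(ref_sorted, sorted_centers())
                 for _ in range(100))
# 0 would mean the runs only differ by the ordering of the cluster centers.
print('runs still differing after sorting the centers:', mismatches)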
Versions
System:
    python: 3.9.5 (default, May 18 2021, 19:34:48) [GCC 7.3.0]
executable: /home/lesteve/miniconda3/envs/test/bin/python
   machine: Linux-5.4.0-73-generic-x86_64-with-glibc2.31

Python dependencies:
          pip: 21.1.1
   setuptools: 52.0.0.post20210125
      sklearn: 1.0.dev0
        numpy: 1.20.2
        scipy: 1.6.2
       Cython: 0.29.23
       pandas: None
   matplotlib: None
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True