KMeans()-check_transformer_data_not_an_array test failing locally from time to time in main #20199

Closed
@lesteve

Description


Describe the bug

Locally I see intermittent failures of the KMeans()-check_transformer_data_not_an_array test. I don't see these failures on 0.24.2.

One additional weird thing is that this is not happening in the CI and I seem to be the first to complain about it (at least I could not find it in the issues).

❯ pytest sklearn/tests/test_common.py  -k 'KMeans and data_not_an_array'
================================================================================================================================== test session starts ===================================================================================================================================
platform linux -- Python 3.7.7, pytest-6.2.3, py-1.10.0, pluggy-0.13.1
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/lesteve/dev/scikit-learn/.hypothesis/examples')
rootdir: /home/lesteve/dev/scikit-learn, configfile: setup.cfg
plugins: hypothesis-4.36.2, asyncio-0.10.0, cov-2.7.1
collected 7785 items / 7783 deselected / 2 selected                                                                                                                                                                                                                                      

sklearn/tests/test_common.py F.                                                                                                                                                                                                                                                    [100%]

======================================================================================================================================== FAILURES ========================================================================================================================================
_____________________________________________________________________________________________________________ test_estimators[KMeans()-check_transformer_data_not_an_array] ______________________________________________________________________________________________________________

estimator = KMeans(max_iter=5, n_clusters=2, n_init=2), check = functools.partial(<function check_transformer_data_not_an_array at 0x7fec5ceb7050>, 'KMeans'), request = <FixtureRequest for <Function test_estimators[KMeans()-check_transformer_data_not_an_array]>>

    @parametrize_with_checks(list(_tested_estimators()))
    def test_estimators(estimator, check, request):
        # Common tests for estimator instances
        with ignore_warnings(category=(FutureWarning,
                                       ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            _set_checking_parameters(estimator)
>           check(estimator)

sklearn/tests/test_common.py:90: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sklearn/utils/_testing.py:308: in wrapper
    return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:1289: in check_transformer_data_not_an_array
    _check_transformer(name, transformer, X, y)
sklearn/utils/estimator_checks.py:1366: in _check_transformer
    % transformer, atol=1e-2)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

x = array([[0.20255037, 3.57452256],
       [3.22900935, 0.23278803],
       [3.2526988 , 0.33507256],
       [0.34546063,...79101, 0.53812952],
       [0.32877832, 3.45131422],
       [0.19314137, 3.21278957],
       [3.79826855, 0.41606372]])
y = array([[3.57452256, 0.20255037],
       [0.23278803, 3.22900935],
       [0.33507256, 3.2526988 ],
       [3.23476868,...12952, 3.75679101],
       [3.45131422, 0.32877832],
       [3.21278957, 0.19314137],
       [0.41606372, 3.79826855]]), rtol = 1e-07, atol = 0.01
err_msg = 'fit_transform and transform outcomes not consistent in KMeans(max_iter=5, n_clusters=2, n_init=2, random_state=0)'

    def assert_allclose_dense_sparse(x, y, rtol=1e-07, atol=1e-9, err_msg=''):
        """Assert allclose for sparse and dense data.
    
        Both x and y need to be either sparse or dense, they
        can't be mixed.
    
        Parameters
        ----------
        x : {array-like, sparse matrix}
            First array to compare.
    
        y : {array-like, sparse matrix}
            Second array to compare.
    
        rtol : float, default=1e-07
            relative tolerance; see numpy.allclose.
    
        atol : float, default=1e-9
            absolute tolerance; see numpy.allclose. Note that the default here is
            more tolerant than the default for numpy.testing.assert_allclose, where
            atol=0.
    
        err_msg : str, default=''
            Error message to raise.
        """
        if sp.sparse.issparse(x) and sp.sparse.issparse(y):
            x = x.tocsr()
            y = y.tocsr()
            x.sum_duplicates()
            y.sum_duplicates()
            assert_array_equal(x.indices, y.indices, err_msg=err_msg)
            assert_array_equal(x.indptr, y.indptr, err_msg=err_msg)
            assert_allclose(x.data, y.data, rtol=rtol, atol=atol, err_msg=err_msg)
        elif not sp.sparse.issparse(x) and not sp.sparse.issparse(y):
            # both dense
>           assert_allclose(x, y, rtol=rtol, atol=atol, err_msg=err_msg)
E           AssertionError: 
E           Not equal to tolerance rtol=1e-07, atol=0.01
E           fit_transform and transform outcomes not consistent in KMeans(max_iter=5, n_clusters=2, n_init=2, random_state=0)
E           Mismatched elements: 60 / 60 (100%)
E           Max absolute difference: 3.38712923
E           Max relative difference: 24.28104678
E            x: array([[0.20255 , 3.574523],
E                  [3.229009, 0.232788],
E                  [3.252699, 0.335073],...
E            y: array([[3.574523, 0.20255 ],
E                  [0.232788, 3.229009],
E                  [0.335073, 3.252699],...

sklearn/utils/_testing.py:415: AssertionError
=============================================================================================================== 1 failed, 1 passed, 7783 deselected, 32 warnings in 2.69s ================================================================================================================
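
A note on reading the assertion message: the visible rows of x (the fit_transform output) and y (the transform output) look like the same numbers with the two columns swapped. Since KMeans.transform returns the distance of each sample to each cluster center, swapped columns would be consistent with the two cluster centers coming out in a different order between the two fits. Here is a minimal sketch of that check using only the first three rows printed above (the other rows are elided in the message, so they are not included); the helper name rows_close is made up for illustration.

```python
import math

# First three rows of x (fit_transform) and y (transform), copied from the
# assertion message; the remaining rows are elided there.
x = [[0.20255037, 3.57452256],
     [3.22900935, 0.23278803],
     [3.2526988, 0.33507256]]
y = [[3.57452256, 0.20255037],
     [0.23278803, 3.22900935],
     [0.33507256, 3.2526988]]


def rows_close(a, b, atol=1e-2):
    """Element-wise comparison with an absolute tolerance (hypothetical helper)."""
    return len(a) == len(b) and all(
        len(ra) == len(rb)
        and all(math.isclose(u, v, rel_tol=0.0, abs_tol=atol)
                for u, v in zip(ra, rb))
        for ra, rb in zip(a, b)
    )


# Reverse the column order of y and compare again.
y_swapped = [row[::-1] for row in y]
same_as_is = rows_close(x, y)
same_after_swap = rows_close(x, y_swapped)
print(same_as_is, same_after_swap)  # prints: False True
```

The tolerance atol=1e-2 mirrors the one shown in the failing assertion.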

Steps/Code to Reproduce

I can reproduce this failure consistently with:

Create a test-kmeans.sh file:

#!/bin/bash
set -e

conda create -n test python scipy cython pytest joblib threadpoolctl -y
conda activate test
pip install --no-build-isolation --editable .

# make sure to run it a few times to trigger the test failure
for i in $(seq 1 50); do
    pytest sklearn/tests/test_common.py  -k 'KMeans and data_not_an_array'
done

Run test-kmeans.sh:

source test-kmeans.sh

Expected Results

No test failure

Actual Results

Test failure

Other comments

Looking a bit more into it, it seems that when calling KMeans.fit the cluster centers can come out in a different order in main, whereas the order is consistent in 0.24.2. Wild guess: maybe this is due to some use of low-level parallelism in KMeans?

import numpy as np

from sklearn.cluster import KMeans

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=30, centers=[[0, 0, 0], [1, 1, 1]],
                  random_state=0, n_features=2, cluster_std=0.1)

kmeans = KMeans(max_iter=5, n_clusters=2, n_init=2, random_state=0)

ref_cluster_centers = kmeans.fit(X, y).cluster_centers_
print('reference cluster centers:\n', ref_cluster_centers)
print('-'*80)

for i in range(100):
    cluster_centers = kmeans.fit(X, y).cluster_centers_
    if not np.allclose(ref_cluster_centers, cluster_centers):
        print('differing cluster centers:\n', cluster_centers)
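
To tell "same centers in a different order" apart from "genuinely different centers", the np.allclose check above could be complemented with an order-insensitive comparison. The following is only a sketch under assumptions: centers_match_up_to_order is a hypothetical helper (not part of scikit-learn), it works on plain nested lists (for example cluster_centers_.tolist()), and it tries every row permutation, which is only reasonable for a handful of clusters such as n_clusters=2 here. The values in the usage part are made up for illustration and are not output from an actual run.

```python
import math
from itertools import permutations


def centers_match_up_to_order(ref, other, atol=1e-8):
    """Return True if `other` has the same rows as `ref` in some order.

    Hypothetical helper: brute-forces all row permutations, so it is
    only meant for a small number of cluster centers.
    """
    if len(ref) != len(other):
        return False

    def row_close(a, b):
        return len(a) == len(b) and all(
            math.isclose(u, v, rel_tol=0.0, abs_tol=atol)
            for u, v in zip(a, b)
        )

    return any(
        all(row_close(r, o) for r, o in zip(ref, perm))
        for perm in permutations(other)
    )


# Made-up centers for illustration only.
ref = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
swapped = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
different = [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]

print(centers_match_up_to_order(ref, swapped))    # prints: True
print(centers_match_up_to_order(ref, different))  # prints: False
```

If the differing runs in the loop above all pass this check, that would support the reading that only the ordering of the centers changes between fits.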

Versions

System:
    python: 3.9.5 (default, May 18 2021, 19:34:48)  [GCC 7.3.0]
executable: /home/lesteve/miniconda3/envs/test/bin/python
   machine: Linux-5.4.0-73-generic-x86_64-with-glibc2.31

Python dependencies:
          pip: 21.1.1
   setuptools: 52.0.0.post20210125
      sklearn: 1.0.dev0
        numpy: 1.20.2
        scipy: 1.6.2
       Cython: 0.29.23
       pandas: None
   matplotlib: None
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True
None
