Describe the bug
Locally I see intermittent failures of the KMeans()-check_transformer_data_not_an_array test. I don't see these failures on 0.24.2.
One additional weird thing is that this is not happening in the CI and I seem to be the first to complain about it (at least I could not find it in the issues).
❯ pytest sklearn/tests/test_common.py -k 'KMeans and data_not_an_array'
================================================================================================================================== test session starts ===================================================================================================================================
platform linux -- Python 3.7.7, pytest-6.2.3, py-1.10.0, pluggy-0.13.1
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/lesteve/dev/scikit-learn/.hypothesis/examples')
rootdir: /home/lesteve/dev/scikit-learn, configfile: setup.cfg
plugins: hypothesis-4.36.2, asyncio-0.10.0, cov-2.7.1
collected 7785 items / 7783 deselected / 2 selected
sklearn/tests/test_common.py F. [100%]
======================================================================================================================================== FAILURES ========================================================================================================================================
_____________________________________________________________________________________________________________ test_estimators[KMeans()-check_transformer_data_not_an_array] ______________________________________________________________________________________________________________
estimator = KMeans(max_iter=5, n_clusters=2, n_init=2), check = functools.partial(<function check_transformer_data_not_an_array at 0x7fec5ceb7050>, 'KMeans'), request = <FixtureRequest for <Function test_estimators[KMeans()-check_transformer_data_not_an_array]>>
    @parametrize_with_checks(list(_tested_estimators()))
    def test_estimators(estimator, check, request):
        # Common tests for estimator instances
        with ignore_warnings(category=(FutureWarning,
                                       ConvergenceWarning,
                                       UserWarning, FutureWarning)):
            _set_checking_parameters(estimator)
>           check(estimator)
sklearn/tests/test_common.py:90:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sklearn/utils/_testing.py:308: in wrapper
    return fn(*args, **kwargs)
sklearn/utils/estimator_checks.py:1289: in check_transformer_data_not_an_array
    _check_transformer(name, transformer, X, y)
sklearn/utils/estimator_checks.py:1366: in _check_transformer
    % transformer, atol=1e-2)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
x = array([[0.20255037, 3.57452256],
       [3.22900935, 0.23278803],
       [3.2526988 , 0.33507256],
       [0.34546063,...79101, 0.53812952],
       [0.32877832, 3.45131422],
       [0.19314137, 3.21278957],
       [3.79826855, 0.41606372]])
y = array([[3.57452256, 0.20255037],
       [0.23278803, 3.22900935],
       [0.33507256, 3.2526988 ],
       [3.23476868,...12952, 3.75679101],
       [3.45131422, 0.32877832],
       [3.21278957, 0.19314137],
       [0.41606372, 3.79826855]]), rtol = 1e-07, atol = 0.01
err_msg = 'fit_transform and transform outcomes not consistent in KMeans(max_iter=5, n_clusters=2, n_init=2, random_state=0)'
    def assert_allclose_dense_sparse(x, y, rtol=1e-07, atol=1e-9, err_msg=''):
        """Assert allclose for sparse and dense data.

        Both x and y need to be either sparse or dense, they
        can't be mixed.

        Parameters
        ----------
        x : {array-like, sparse matrix}
            First array to compare.
        y : {array-like, sparse matrix}
            Second array to compare.
        rtol : float, default=1e-07
            relative tolerance; see numpy.allclose.
        atol : float, default=1e-9
            absolute tolerance; see numpy.allclose. Note that the default here is
            more tolerant than the default for numpy.testing.assert_allclose, where
            atol=0.
        err_msg : str, default=''
            Error message to raise.
        """
        if sp.sparse.issparse(x) and sp.sparse.issparse(y):
            x = x.tocsr()
            y = y.tocsr()
            x.sum_duplicates()
            y.sum_duplicates()
            assert_array_equal(x.indices, y.indices, err_msg=err_msg)
            assert_array_equal(x.indptr, y.indptr, err_msg=err_msg)
            assert_allclose(x.data, y.data, rtol=rtol, atol=atol, err_msg=err_msg)
        elif not sp.sparse.issparse(x) and not sp.sparse.issparse(y):
            # both dense
>           assert_allclose(x, y, rtol=rtol, atol=atol, err_msg=err_msg)
E AssertionError:
E Not equal to tolerance rtol=1e-07, atol=0.01
E fit_transform and transform outcomes not consistent in KMeans(max_iter=5, n_clusters=2, n_init=2, random_state=0)
E Mismatched elements: 60 / 60 (100%)
E Max absolute difference: 3.38712923
E Max relative difference: 24.28104678
E x: array([[0.20255 , 3.574523],
E [3.229009, 0.232788],
E [3.252699, 0.335073],...
E y: array([[3.574523, 0.20255 ],
E [0.232788, 3.229009],
E [0.335073, 3.252699],...
sklearn/utils/_testing.py:415: AssertionError
=============================================================================================================== 1 failed, 1 passed, 7783 deselected, 32 warnings in 2.69s ================================================================================================================
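As an aside (my own minimal illustration, not part of the failing check): KMeans.transform returns each sample's distance to every cluster center, one column per center. The x and y arrays in the failure above differ by a column swap, which is exactly what you get if two fits find the same centers in the opposite order:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustration only: reordering cluster_centers_ swaps the columns of
# transform's output, the same pattern as the x/y arrays in the failure above.
X, _ = make_blobs(n_samples=30, centers=[[0, 0], [4, 4]], cluster_std=0.1,
                  random_state=0)
km = KMeans(n_clusters=2, n_init=2, random_state=0).fit(X)
Xt = km.transform(X)                      # distances, one column per center
km.cluster_centers_ = km.cluster_centers_[::-1]
Xt_swapped = km.transform(X)
assert np.allclose(Xt, Xt_swapped[:, ::-1])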
Steps/Code to Reproduce
I can reproduce this failure consistently with:
Create a test-kmeans.sh file:
#!/bin/bash
set -e
conda create -n test python scipy cython pytest joblib threadpoolctl -y
conda activate test
pip install --no-build-isolation --editable .
# make sure to run it a few times to trigger the test failure
for i in $(seq 1 50); do
    pytest sklearn/tests/test_common.py -k 'KMeans and data_not_an_array'
done
Run test-kmeans.sh:
source test-kmeans.sh
Expected Results
No test failure
Actual Results
Test failure
Other comments
Looking a bit more into it, it seems that when calling KMeans.fit the cluster centers can end up in a different order on main, whereas the order is consistent on 0.24.2. Wild guess: maybe this is due to some use of low-level parallelism in KMeans?
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=30, centers=[[0, 0, 0], [1, 1, 1]],
                  random_state=0, n_features=2, cluster_std=0.1)
kmeans = KMeans(max_iter=5, n_clusters=2, n_init=2, random_state=0)
ref_cluster_centers = kmeans.fit(X, y).cluster_centers_
print('reference cluster centers:\n', ref_cluster_centers)
print('-'*80)
for i in range(100):
    cluster_centers = kmeans.fit(X, y).cluster_centers_
    if not np.allclose(ref_cluster_centers, cluster_centers):
        print('differing cluster centers:\n', cluster_centers)
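To double-check that it is really only the ordering that changes, here is a rough extension of the snippet above (a sketch, not part of the failing test; sorting the centers by their first coordinate is enough here because the two blobs are well separated):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=30, centers=[[0, 0, 0], [1, 1, 1]],
                  random_state=0, n_features=2, cluster_std=0.1)
kmeans = KMeans(max_iter=5, n_clusters=2, n_init=2, random_state=0)

def sorted_centers():
    # Sort the centers by their first coordinate so that a pure permutation
    # of the centers compares equal.
    centers = kmeans.fit(X, y).cluster_centers_
    return centers[np.argsort(centers[:, 0])]

ref_sorted = sorted_centers()
mismatches = sum(not np.allclose(ref_sorted, sorted_centers())
                 for _ in range(100))
# 0 would mean the runs only differ by the ordering of the cluster centers.
print('runs still differing after sorting the centers:', mismatches)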
Versions
System:
    python: 3.9.5 (default, May 18 2021, 19:34:48) [GCC 7.3.0]
executable: /home/lesteve/miniconda3/envs/test/bin/python
   machine: Linux-5.4.0-73-generic-x86_64-with-glibc2.31

Python dependencies:
          pip: 21.1.1
   setuptools: 52.0.0.post20210125
      sklearn: 1.0.dev0
        numpy: 1.20.2
        scipy: 1.6.2
       Cython: 0.29.23
       pandas: None
   matplotlib: None
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True