"The Python kernel is unresponsive" when fitting a reasonable sized sparse matrix into NearestNeighbors #31059

Open
@fabienarnaud


Describe the bug

Hi all,

I have Python code that has run every day for the past few years and uses NearestNeighbors to find best matches.
Suddenly, in both our TEST and PRD environments, the code has started crashing on the NearestNeighbors call with the following message: "The Python kernel is unresponsive". This began on Friday, 21 March 2025.

What puzzles me is that we have not modified our code, the data has not changed (at least in our TEST environment), and we did not change the scikit-learn version.
The exact statement that triggers the error is:

nbrs = NearestNeighbors(n_neighbors = 1, metric = 'cosine').fit(X)

where X is a sparse matrix in compressed sparse row (CSR) format with shape 38506 x 53709.

We run the code on Databricks (runtime 15.4 LTS, which ships scikit-learn 1.3.0).
I also tried scikit-learn 1.4.2 (preinstalled in Databricks runtime 16.2) and hit the same issue.

The error suggests a memory issue, but I struggle to understand why it would appear now when the setup is exactly the same as before. Furthermore, we run the same code on the same Databricks cluster for another data set that is at least 6x bigger, and that one completes successfully in just a few seconds.
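For a rough sense of scale, the sketch below estimates how much memory would be needed if anything in the pipeline produced a dense array of the sizes involved. It uses only the shape reported above; the float64 item size, the 32-bit index size, and the `nnz` argument are assumptions, since the number of stored non-zeros is not given in this report. It does not claim that scikit-learn actually allocates these arrays.

```python
def dense_nbytes(n_rows: int, n_cols: int, itemsize: int = 8) -> int:
    """Bytes needed for a dense n_rows x n_cols array (float64 by default)."""
    return n_rows * n_cols * itemsize


def csr_nbytes(n_rows: int, nnz: int, itemsize: int = 8, index_itemsize: int = 4) -> int:
    """Approximate bytes for a CSR matrix: data + column indices + row pointers."""
    return nnz * itemsize + nnz * index_itemsize + (n_rows + 1) * index_itemsize


n_rows, n_cols = 38506, 53709  # shape of X from this report

# A dense copy of X, and a full n_rows x n_rows distance matrix.
print(f"dense X:         {dense_nbytes(n_rows, n_cols) / 1e9:.1f} GB")
print(f"all-pairs dists: {dense_nbytes(n_rows, n_rows) / 1e9:.1f} GB")
```

On the real matrix, `csr_nbytes(X.shape[0], X.nnz)` can be compared with `X.data.nbytes + X.indices.nbytes + X.indptr.nbytes` to see how far the sparse footprint is from the dense figures.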

I'm not a data scientist, so I'm quite confused as to why this no longer runs. Since our environment didn't change, I was wondering whether anything could have changed with respect to scikit-learn v1.3.0 for some odd reason, or whether you have heard of anything similar recently from other users.
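One way to check the "environment didn't change" assumption is to record the installed package versions on each run and diff them between a working and a failing run. The following is a minimal sketch using only the standard library; the package list is an assumption taken from the Versions section of this report, and `snapshot_versions` is a hypothetical helper name.

```python
import json
import sys
from importlib import metadata

# Assumed list of relevant distributions (from the Versions section).
PACKAGES = ["scikit-learn", "numpy", "scipy", "pandas", "joblib", "threadpoolctl"]


def snapshot_versions(packages=PACKAGES):
    """Return {name: version or None} for the given distributions."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Store this output with each job run so two runs can be compared later.
print(json.dumps(snapshot_versions(), indent=2, sort_keys=True))
```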

Steps/Code to Reproduce

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

df_table = df_table.toPandas().add_prefix("b.")
vectorizer = TfidfVectorizer(analyzer = 'char', ngram_range = (1, 4))
X = vectorizer.fit_transform(df_table['b.concat_match_col'].values.astype('U'))
nbrs = NearestNeighbors(n_neighbors = 1, metric = 'cosine').fit(X)

# df_table starts as a Spark DataFrame (built earlier, not shown) and becomes
# a pandas DataFrame with 7 columns and 38506 rows; 'b.concat_match_col' is the
# column being vectorized. I can't include the code that builds the DataFrame
# here because it's quite long.
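Because the original data cannot be shared, a self-contained variant of the snippet above may help others try to reproduce the problem. The sketch below swaps in randomly generated strings for `df_table['b.concat_match_col']`; the alphabet, string lengths, seed, and row count are all assumptions, so the resulting matrix will not have the same sparsity pattern as the real one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
alphabet = np.array(list("abcdefghijklmnopqrstuvwxyz0123456789 "))


def random_strings(n, min_len=20, max_len=60):
    """Generate n random strings as a stand-in for the real match column."""
    lengths = rng.integers(min_len, max_len + 1, size=n)
    return ["".join(rng.choice(alphabet, size=k)) for k in lengths]


n_rows = 500  # hypothetical; raise toward 38506 to approach the reported scale
texts = random_strings(n_rows)

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 4))
X = vectorizer.fit_transform(texts)
print(type(X).__name__, X.shape, X.dtype, X.nnz)

nbrs = NearestNeighbors(n_neighbors=1, metric="cosine").fit(X)

# Query a few rows to confirm the fitted model is usable.
distances, indices = nbrs.kneighbors(X[:5])
print(indices.ravel(), distances.ravel())
```

Printing the type, shape, dtype, and `nnz` of the real `X` in the failing job would also make it easier to compare the failing data set with the larger one that still works.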

Expected Results

No error should be thrown

Actual Results

The NearestNeighbors call now produces this:

Image
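The notebook message carries no traceback, so one hedged way to get more information is to run the failing code in a separate Python process with `faulthandler` enabled and inspect its exit status. The sketch below uses only the standard library; `run_isolated` is a hypothetical helper, and the snippet passed to it is a placeholder for the code that builds X and calls `fit`.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 600.0) -> subprocess.CompletedProcess:
    """Run `code` in a fresh interpreter with faulthandler enabled."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


# Placeholder snippet; replace it with the code that builds X and calls fit().
result = run_isolated("print('fit finished')")
print("return code:", result.returncode)
print("stdout:", result.stdout.strip())
print("stderr:", result.stderr.strip())
```

On POSIX systems a negative return code means the child was terminated by that signal, for example -9 for SIGKILL (which the Linux out-of-memory killer sends) or -11 for SIGSEGV, in which case `faulthandler` usually writes the Python stack to stderr.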

Versions

System:
    python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
executable: /local_disk0/.ephemeral_nfs/envs/pythonEnv-de91d56e-aaea-4084-9eca-24ddcb3a19d4/bin/python
   machine: Linux-5.15.0-1078-azure-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.3.0
          pip: 23.2.1
   setuptools: 68.0.0
        numpy: 1.23.5
        scipy: 1.11.1
       Cython: 0.29.32
       pandas: 1.5.3
   matplotlib: 3.7.2
       joblib: 1.2.0
threadpoolctl: 2.2.0

Built with OpenMP: True
Exception ignored on calling ctypes callback function: <function _ThreadpoolInfo._find_modules_with_dl_iterate_phdr.<locals>.match_module_callback at 0x7f407fbc1b20>
Traceback (most recent call last):
  File "/databricks/python/lib/python3.11/site-packages/threadpoolctl.py", line 400, in match_module_callback
    self._make_module_from_path(filepath)
  File "/databricks/python/lib/python3.11/site-packages/threadpoolctl.py", line 515, in _make_module_from_path
    module = module_class(filepath, prefix, user_api, internal_api)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/threadpoolctl.py", line 606, in __init__
    self.version = self.get_version()
                   ^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/threadpoolctl.py", line 646, in get_version
    config = get_config().split()
             ^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'split'

threadpoolctl info:
       filepath: /databricks/python3/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
         prefix: libopenblas
       user_api: blas
   internal_api: openblas
        version: 0.3.21.dev
    num_threads: 8
threading_layer: pthreads
   architecture: Zen

       filepath: /databricks/python3/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
         prefix: libgomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8
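Since the output above lists two native thread pools (OpenBLAS and OpenMP) with 8 threads each, one possible diagnostic is to rerun the failing call with those pools limited to a single thread and see whether the behavior changes. This is only a sketch of a diagnostic step, not a known fix; it assumes `threadpoolctl` is importable in the environment, and the body of the `with` block is a placeholder.

```python
from threadpoolctl import threadpool_info, threadpool_limits

# List the native libraries threadpoolctl can see in this process.
for lib in threadpool_info():
    print(
        lib.get("user_api"),
        lib.get("internal_api"),
        lib.get("version"),
        lib.get("num_threads"),
    )

# Temporarily restrict BLAS/OpenMP pools to one thread.
with threadpool_limits(limits=1):
    # Placeholder: put the NearestNeighbors(...).fit(X) call here.
    pass
```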
