FIX Stop EfficiencyWarnings in DBSCAN #31337

Luis-Varona · May 8, 2025

What does this implement/fix? Explain your changes.

As per #31030, DBSCAN consistently triggers efficiency warnings due to explicitly setting the diagonal of the X matrix in fitting neighborhoods. This stems from not sorting the precomputed sparse matrix by row values. Here, we instead update the neighborhoods variable after the initial fitting to avoid this.

Any other comments?

It may also be possible to simply add X = sort_graph_by_row_values(X, warn_when_not_sorted=False) after the original code's X.setdiag(X.diagonal()), but (1) this way seems more efficient and (2) I am not sure if this reordering would potentially affect the data in an undesired manner.

github-actions · May 8, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9e337a0.

Luis-Varona · May 8, 2025

@adrinjalali @luispedro Thoughts on this potential fix to #31030? 🙂 (I will also add a comment to the changelog, but do let me know first if the actual code changes are up to par before I do so.)

Luis-Varona · May 9, 2025

@yuwei-1, I'd most certainly appreciate your feedback here if you'd like to grant it. Thanks again for the discussion in #31030. 🙂

adrinjalali · May 12, 2025

Someone needs to check if this diff is actually equivalent to the previous code, and I'm not sure it is, which might mean in edge cases we might be getting a different / wrong result.

adam2392

I think this overall looks in the right direction to me, but I could be wrong.

cc: @Micky774 who helped implement HDBScan. Can you comment on this fix that removes the efficiency warning?

One thing that would also support this change is some simulations showing w/ and w/o this fix. WDYT @Luis-Varona ?

adam2392 · May 14, 2025

sklearn/cluster/_dbscan.py

+        # Each point is its own neighbor, so update the neighborhoods
+        # accordingly after the initial fitting
+        if self.metric == "precomputed" and sparse.issparse(X):
+            for i, neighborhood in enumerate(neighborhoods):
+                if i not in neighborhoods[i]:
+                    neighborhoods[i] = np.append(neighborhood, i)


Originally, the distance of each point to itself is explicitly set in the sparse case to say "we have distance of 0 with myself".

Then L412 is run to compute the neighborhoods.

Since L412 now runs first, does this matter? I suspect not, but I think we want a few more eyes on this to prevent any form of regression.
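To make the ordering question concrete, here is a toy NumPy model of the two orders of operation. It is not scikit-learn's implementation. The `dist` and `stored` arrays and the `add_self_to_neighborhoods` helper are assumptions for illustration, with `stored` standing in for which entries a sparse precomputed graph actually holds.

```python
import numpy as np


def add_self_to_neighborhoods(neighborhoods):
    """Append index i to neighborhood i when it is missing (mirrors the diff)."""
    updated = []
    for i, neighborhood in enumerate(neighborhoods):
        if i not in neighborhood:
            neighborhood = np.append(neighborhood, i)
        updated.append(neighborhood)
    return updated


eps = 0.5
# Hypothetical symmetric distances between three points.
dist = np.array([[0.0, 0.3, 0.9],
                 [0.3, 0.0, 0.4],
                 [0.9, 0.4, 0.0]])
# Which entries the sparse graph stores; the diagonal is absent.
stored = np.array([[False, True, False],
                   [True, False, True],
                   [False, True, False]])
n = dist.shape[0]

# Old order: make the diagonal explicit first, then run the radius query.
stored_with_diag = stored | np.eye(n, dtype=bool)
old = [np.flatnonzero(stored_with_diag[i] & (dist[i] <= eps)) for i in range(n)]

# New order: run the radius query first, then add each point to its own row.
new = add_self_to_neighborhoods(
    [np.flatnonzero(stored[i] & (dist[i] <= eps)) for i in range(n)]
)

same_sets = all(set(a.tolist()) == set(b.tolist()) for a, b in zip(old, new))
print(same_sets)
print(old[0], new[0])
```

In this model both orders yield the same neighbor sets, but the position of the self index inside each array differs. The sketch does not settle whether that ordering matters to the rest of the algorithm.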

Luis-Varona · May 14, 2025

> I think this overall looks in the right direction to me, but I could be wrong.
>
> cc: @Micky774 who helped implement HDBScan. Can you comment on this fix that removes the efficiency warning?
>
> One thing that would also support this change is some simulations showing w/ and w/o this fix. WDYT @Luis-Varona ?

@adam2392 Sounds good! Yup, I'll try benchmarking (and also confirming that the results are all the same across a wide variety of inputs). I guess I should just use the results locally instead of pushing them to the codebase…
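One possible shape for such a local check is to compare DBSCAN labels from a dense precomputed distance matrix with labels from a sparse version of the same matrix whose diagonal is not stored. The `labels_match` helper, the data sizes, and the parameter values below are illustrative assumptions, not something proposed in this thread.

```python
import warnings

import numpy as np
from scipy import sparse
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances


def labels_match(seed, n_samples=60, eps=0.8, min_samples=5):
    """Compare dense and sparse precomputed DBSCAN labels on random 2D data."""
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(n_samples, 2))
    dense = pairwise_distances(X)

    # Keep only distances within a radius slightly above eps. Zeros, which
    # include the diagonal, are not stored in the resulting CSR matrix.
    radius = 1.1 * eps
    graph = sparse.csr_matrix(np.where(dense <= radius, dense, 0.0))

    with warnings.catch_warnings():
        # Silence warnings so the check behaves the same with or without the fix.
        warnings.simplefilter("ignore")
        dense_labels = DBSCAN(
            eps=eps, min_samples=min_samples, metric="precomputed"
        ).fit_predict(dense)
        sparse_labels = DBSCAN(
            eps=eps, min_samples=min_samples, metric="precomputed"
        ).fit_predict(graph)

    return np.array_equal(dense_labels, sparse_labels)


all_match = all(labels_match(seed) for seed in range(5))
print(all_match)
```

Running the same script on `main` and on this branch, over more seeds and parameter settings, would give a rough regression check. It is not a substitute for a proper test in the test suite.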

Luis-Varona · May 17, 2025

> > I think this overall looks in the right direction to me, but I could be wrong.
> > cc: @Micky774 who helped implement HDBScan. Can you comment on this fix that removes the efficiency warning?
> > One thing that would also support this change is some simulations showing w/ and w/o this fix. WDYT @Luis-Varona ?
>
> @adam2392 Sounds good! Yup, I'll try benchmarking (and also confirming that the results are all the same across a wide variety of inputs). I guess I should just use the results locally instead of pushing them to the codebase…

@adam2392 Re: this, I've just been a bit busy the past few days, but I hope to start on the weekend or early next week.

github-actions bot added the module:cluster label May 8, 2025

Luis-Varona force-pushed the 31030-dbscan-efficiency-warning branch from 5d58195 to ffd2c7c May 8, 2025 01:48

Apply ruff formatting

b133e54

Removed the unused warnings import, originally included due to the old neighborhoods-related code that was replaced.

Luis-Varona mentioned this pull request May 8, 2025

DBSCAN always triggers and EfficiencyWarning #31030

Open

Merge branch 'main' into 31030-dbscan-efficiency-warning

371dd6e

betatim changed the title ~~Fixes #31030 - Stop EfficiencyWarnings in DBSCAN~~ FIX Stop EfficiencyWarnings in DBSCAN May 9, 2025

Luis-Varona added 2 commits May 9, 2025 12:42

Merge branch 'main' into 31030-dbscan-efficiency-warning

b15fb5b

Merge branch 'scikit-learn:main' into 31030-dbscan-efficiency-warning

af67402

adam2392 reviewed May 14, 2025

Merge branch 'main' into 31030-dbscan-efficiency-warning

9e337a0
