FIX Stop EfficiencyWarnings in DBSCAN #31337

Open · wants to merge 6 commits into main
Conversation


@Luis-Varona Luis-Varona commented May 8, 2025

closes #31030

What does this implement/fix? Explain your changes.

As per #31030, DBSCAN consistently triggers efficiency warnings because the diagonal of the X matrix is explicitly set when fitting neighborhoods, which leaves the precomputed sparse matrix unsorted by row values. Here, we instead update the neighborhoods variable after the initial fitting to avoid this.
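To make the described change concrete, here is a minimal sketch of the append-self-neighbor logic on made-up neighborhoods (the arrays are illustrative, not taken from the PR):

```python
import numpy as np

# Hypothetical neighborhoods as returned by radius_neighbors on a
# precomputed sparse graph with no stored diagonal: a point can be
# missing from its own neighborhood.
neighborhoods = [np.array([1, 2]), np.array([0]), np.array([0, 2])]

# The updated approach: append each point to its own neighborhood
# after the initial fitting, instead of mutating the matrix diagonal.
for i, neighborhood in enumerate(neighborhoods):
    if i not in neighborhood:
        neighborhoods[i] = np.append(neighborhood, i)

print([sorted(nb.tolist()) for nb in neighborhoods])
# → [[0, 1, 2], [0, 1], [0, 2]]
```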

Any other comments?

It may also be possible to simply add X = sort_graph_by_row_values(X, warn_when_not_sorted=False) after the original code's X.setdiag(X.diagonal()), but (1) this way seems more efficient and (2) I am not sure if this reordering would potentially affect the data in an undesired manner.


github-actions bot commented May 8, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9e337a0.

@Luis-Varona Luis-Varona force-pushed the 31030-dbscan-efficiency-warning branch from 5d58195 to ffd2c7c Compare May 8, 2025 01:48
Removed the unused warnings import, originally included due to the old neighborhoods-related code that was replaced.
@Luis-Varona
Contributor Author

@adrinjalali @luispedro Thoughts on this potential fix to #31030? 🙂 (I will also add a comment to the changelog, but do let me know first if the actual code changes are up to par before I do so.)

@Luis-Varona
Contributor Author

@yuwei-1, I'd most certainly appreciate your feedback here if you'd like to grant it. Thanks again for the discussion in #31030. 🙂

@betatim betatim changed the title Fixes #31030 - Stop EfficiencyWarnings in DBSCAN FIX Stop EfficiencyWarnings in DBSCAN May 9, 2025
@adrinjalali
Member

Someone needs to check whether this diff is actually equivalent to the previous code, and I'm not sure it is; in edge cases we might be getting a different / wrong result.

Member

@adam2392 adam2392 left a comment


I think this overall looks in the right direction to me, but I could be wrong.

cc: @Micky774 who helped implement HDBScan. Can you comment on this fix that removes the efficiency warning?

One thing that would also support this change is some simulations showing w/ and w/o this fix. WDYT @Luis-Varona ?

Comment on lines +414 to +419

```python
# Each point is its own neighbor, so update the neighborhoods
# accordingly after the initial fitting
if self.metric == "precomputed" and sparse.issparse(X):
    for i, neighborhood in enumerate(neighborhoods):
        if i not in neighborhoods[i]:
            neighborhoods[i] = np.append(neighborhood, i)
```
Member


Originally, the distance of each point to itself is explicitly set in the sparse case to say "we have a distance of 0 with myself."

Then L412 is run to compute the neighborhoods.

Since L412 now runs first, does this matter? I suspect not, but I think we just want a few more eyes on this to prevent any form of regression.
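One way to sanity-check the ordering question on a toy case: compare "store the zero diagonal first, then query" against "query first, then append self." The helper below is a hypothetical stand-in for radius_neighbors on a precomputed graph, not the actual estimator code, and the distance matrix is made up:

```python
import numpy as np
from scipy import sparse

eps = 0.5

# Toy symmetric distance matrix; csr_matrix(dense) drops the zero
# diagonal, mimicking a precomputed graph without stored self-distances.
dense = np.array([[0.0, 0.3, 0.9],
                  [0.3, 0.0, 0.4],
                  [0.9, 0.4, 0.0]])
D = sparse.csr_matrix(dense)

def neighborhoods_from(graph):
    # For each row, keep stored entries with distance <= eps.
    out = []
    for i in range(graph.shape[0]):
        sl = slice(graph.indptr[i], graph.indptr[i + 1])
        out.append(graph.indices[sl][graph.data[sl] <= eps])
    return out

# Old order: store the zero diagonal explicitly, then query.
r, c = np.nonzero(dense + np.eye(3))  # covers off-diagonal + diagonal
D_diag = sparse.csr_matrix((dense[r, c], (r, c)), shape=(3, 3))
old = [set(nb) for nb in neighborhoods_from(D_diag)]

# New order: query first, then make each point its own neighbor.
new = []
for i, nb in enumerate(neighborhoods_from(D)):
    s = set(nb)
    s.add(i)
    new.append(s)

print(old == new)  # the two orders agree on this example
```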

@Luis-Varona
Contributor Author

> I think this overall looks in the right direction to me, but I could be wrong.
>
> cc: @Micky774 who helped implement HDBScan. Can you comment on this fix that removes the efficiency warning?
>
> One thing that would also support this change is some simulations showing w/ and w/o this fix. WDYT @Luis-Varona ?

@adam2392 Sounds good! Yup, I'll try benchmarking (and also confirming that the results are all the same across a wide variety of inputs). I guess I should just use the results locally instead of pushing them to the codebase…

@Luis-Varona
Contributor Author

> I think this overall looks in the right direction to me, but I could be wrong.
>
> cc: @Micky774 who helped implement HDBScan. Can you comment on this fix that removes the efficiency warning?
>
> One thing that would also support this change is some simulations showing w/ and w/o this fix. WDYT @Luis-Varona ?
>
> @adam2392 Sounds good! Yup, I'll try benchmarking (and also confirming that the results are all the same across a wide variety of inputs). I guess I should just use the results locally instead of pushing them to the codebase…

@adam2392 re: this, I've just been a bit busy the past few days but I hope to start on the weekend/early next week.
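For what it's worth, such a simulation could be sketched as follows. The eps value, the blob data, and the radius_neighbors_graph construction are my own choices, not from the PR, and the warning count observed depends on the installed scikit-learn version:

```python
import warnings
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.exceptions import EfficiencyWarning
from sklearn.neighbors import radius_neighbors_graph

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
eps = 1.0

# Precomputed sparse radius graph, the setup that triggered the warning.
graph = radius_neighbors_graph(X, radius=eps, mode="distance")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    labels_pre = DBSCAN(eps=eps, metric="precomputed").fit_predict(graph)
n_eff = sum(issubclass(w.category, EfficiencyWarning) for w in caught)

# Reference run on the raw coordinates; the labels should agree.
labels_raw = DBSCAN(eps=eps).fit_predict(X)
print(n_eff, np.array_equal(labels_pre, labels_raw))
```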

Successfully merging this pull request may close these issues.

DBSCAN always triggers an EfficiencyWarning