-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FIX Stop EfficiencyWarnings in DBSCAN #31337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
FIX Stop EfficiencyWarnings in DBSCAN #31337
Conversation
As per scikit-learn#31030, DBSCAN consistently triggers efficiency warnings due to explicitly setting the diagonal of the X matrix in fitting neighborhoods. This stems from not sorting the precomputed sparse matrix by row values. Here, we instead update the neighborhoods variable after the initial fitting to avoid this. It may also be possible to simply add X = sort_graph_by_row_values(X, warn_when_not_sorted=False) after the original code's X.setdiag(X.diagonal()), but (1) this way seems more efficient and (2) I am not sure if this reordering would potentially affect the data in an undesired manner.
5d58195
to
ffd2c7c
Compare
Removed the unused warnings import, originally included due to the old neighborhoods-related code that was replaced.
@adrinjalali @luispedro Thoughts on this potential fix to #31030? 🙂 (I will also add a comment to the changelog, but do let me know first if the actual code changes are up to par before I do so.) |
Someone needs to check if this diff is actually equivalent to the previous code, and I'm not sure it is, which might mean in edge cases we might be getting a different / wrong result. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this overall looks in the right direction to me, but I could be wrong.
cc: @Micky774 who helped implement HDBScan. Can you comment on this fix that removes the efficiency warning?
One thing that would also support this change is some simulations showing w/ and w/o this fix. WDYT @Luis-Varona ?
# Each point is its own neighbor, so update the neighborhoods | ||
# accordingly after the initial fitting | ||
if self.metric == "precomputed" and sparse.issparse(X): | ||
for i, neighborhood in enumerate(neighborhoods): | ||
if i not in neighborhoods[i]: | ||
neighborhoods[i] = np.append(neighborhood, i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally, the distance of each point to itself is explicitly set in the sparse case to say "we have distance of 0 with myself".
Then L412 is rand to compute the neighborhoods.
Since L412 is now ran first, does this matter? I suspect not, but think we just want a few more eyes on this to prevent any form of a regression.
@adam2392 Sounds good! Yup, I'll try benchmarking (and also confirming that the results are all the same across a wide variety of inputs). I guess I should just use the results locally instead of pushing them to the codebase… |
@adam2392 re: this, I've just been a bit busy the past few days but I hope to start on the weekend/early next week. |
closes #31030
What does this implement/fix? Explain your changes.
As per #31030, DBSCAN consistently triggers efficiency warnings due to explicitly setting the diagonal of the X matrix in fitting neighborhoods. This stems from not sorting the precomputed sparse matrix by row values. Here, we instead update the neighborhoods variable after the initial fitting to avoid this.
Any other comments?
It may also be possible to simply add
X = sort_graph_by_row_values(X, warn_when_not_sorted=False)
after the original code'sX.setdiag(X.diagonal())
, but (1) this way seems more efficient and (2) I am not sure if this reordering would potentially affect the data in an undesired manner.