DBSCAN too slow and consumes too much memory for large datasets: a simple tweak can fix this. #17650

Open
@jenniferjang

Description


Describe the workflow you want to enable

Currently, DBSCAN is very slow for large datasets and can use a lot of memory, especially in higher dimensions.

For example, running sklearn.cluster.DBSCAN on a 745k-point dataset with 411 features took over 6 days to finish on a powerful cloud machine, and a run on a ~1M-point dataset with a reasonable epsilon for that dataset ran out of memory on a machine with over 750GB of RAM.

In recent work with Google Research [1], we found that a minor tweak to the DBSCAN algorithm can yield over a 200x speedup and a 250x reduction in memory without sacrificing clustering accuracy.

[1] https://arxiv.org/abs/2006.06743

Describe your proposed solution

The solution is a simple modification of the brute-force version. Instead of computing the pairwise distances for all pairs of examples to obtain the nearest-neighbor graph, you compute them only for a randomly selected fraction s of the pairs, and cluster based on that subsampled graph. This makes the algorithm feasible for much larger datasets without hurting clustering quality (in some cases s = 0.001 suffices). The idea is that you don't actually need the full nearest-neighbor graph to recover the desired clustering.
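A minimal sketch of the idea, assuming NumPy and a plain brute-force implementation (the function name sng_dbscan and its exact signature are illustrative, not an existing sklearn API; the paper's algorithm may differ in details):

```python
import numpy as np

def sng_dbscan(X, eps, min_samples, s=0.1, seed=0):
    """Sketch: DBSCAN on a subsampled epsilon-neighborhood graph.

    Samples roughly s * n * (n - 1) / 2 random pairs instead of
    computing all pairwise distances, then applies the usual
    core-point / connected-component logic to the sparse graph.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    rng = np.random.default_rng(seed)

    # Sample a fraction s of the n*(n-1)/2 pairs (with replacement).
    n_pairs = int(s * n * (n - 1) / 2)
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j
    i, j = i[keep], j[keep]

    # Keep only sampled pairs within distance eps of each other.
    within = np.linalg.norm(X[i] - X[j], axis=1) <= eps
    adj = [set() for _ in range(n)]
    for a, b in zip(i[within], j[within]):
        adj[a].add(b)
        adj[b].add(a)

    # Core points: enough neighbors in the *subsampled* graph.
    core = [len(adj[v]) >= min_samples for v in range(n)]

    # Grow clusters from core points; -1 marks noise.
    labels = np.full(n, -1)
    cluster = 0
    for v in range(n):
        if core[v] and labels[v] == -1:
            labels[v] = cluster
            stack = [v]
            while stack:
                u = stack.pop()
                for w in adj[u]:
                    if labels[w] == -1:
                        labels[w] = cluster
                        if core[w]:
                            stack.append(w)
            cluster += 1
    return labels
```

With s = 1 this degenerates to (approximately) ordinary brute-force DBSCAN; the savings come from never materializing the full O(n^2) distance matrix.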

All comparisons in [1] were done against the algorithm='auto' setting of sklearn.cluster.DBSCAN.

The only new parameter needed would be the sampling ratio 0 < s <= 1, applicable only to the algorithm='brute' setting. (We might also want a random seed parameter for reproducible runs.)
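As a rough back-of-envelope check (numbers assumed, matching the ~1M-point example above), the distance-computation count drops linearly in s:

```python
n = 1_000_000                       # ~1M points, as in the OOM example
full_pairs = n * (n - 1) // 2       # all pairwise distances: ~5e11
sampled_pairs = full_pairs // 1000  # s = 0.001: ~5e8 distances
print(full_pairs, sampled_pairs)
```

A thousand-fold drop in pairs computed and edges stored is what makes the difference between exhausting 750GB of RAM and fitting comfortably in memory.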

Describe alternatives you've considered, if relevant

Additional context
