Description
Describe the workflow you want to enable
Currently, DBSCAN is very slow for large datasets and can use a lot of memory, especially in higher dimensions.
For example, running sklearn.cluster.DBSCAN on a dataset of 745k points with 411 features took over 6 days to finish on a powerful cloud machine, and running it on a ~1M-point dataset with a reasonable epsilon for that dataset ran out of memory on a machine with over 750 GB of RAM.
In recent work with Google Research [1], we found that a minor tweak to the DBSCAN algorithm can give over a 200x speedup and a 250x reduction in memory without sacrificing clustering accuracy.
[1] https://arxiv.org/abs/2006.06743
Describe your proposed solution
The solution is based on a simple modification to the brute-force version. Instead of calculating pairwise distances for all pairs of examples to obtain the nearest-neighbor graph, you calculate them only for a fraction s of the pairs (selected randomly) and cluster based on that subsampled graph, as sketched below. This makes the algorithm feasible for larger datasets without hurting clustering quality (in some cases s = 0.001 suffices). The idea is that you don't actually need the full nearest-neighbor graph to get the desired clustering.
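To make the idea concrete, here is a rough sketch of the subsampled-neighborhood approach (not the exact algorithm or implementation from [1]; the function name `sng_dbscan` and all parameter names are illustrative, and Euclidean distance is assumed): each point computes distances only to a random fraction s of the other points, and DBSCAN-style clustering (core points, connected components, border attachment) is run on the resulting sparse graph.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def sng_dbscan(X, eps, min_samples, s=0.01, random_state=None):
    """DBSCAN-style clustering on a randomly subsampled neighborhood graph (sketch)."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    m = max(1, int(s * n))  # candidate neighbors examined per point

    rows, cols = [], []
    for i in range(n):
        # compute distances from point i only to a random fraction s of points
        cand = rng.choice(n, size=m, replace=False)
        dist = np.linalg.norm(X[cand] - X[i], axis=1)
        hits = cand[dist <= eps]
        rows.extend([i] * len(hits))
        cols.extend(hits.tolist())

    # symmetrized, subsampled epsilon-neighborhood graph
    graph = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    graph = graph.maximum(graph.T)

    # core points: enough sampled neighbors within eps
    # (how this threshold interacts with s is a modeling choice)
    degree = np.asarray(graph.sum(axis=1)).ravel()
    core = degree >= min_samples

    # clusters are connected components of the graph restricted to core points
    _, core_labels = connected_components(graph[core][:, core], directed=False)
    labels = np.full(n, -1, dtype=int)
    labels[np.flatnonzero(core)] = core_labels

    # border points join the cluster of any adjacent core point; the rest stay noise (-1)
    for i in np.flatnonzero(~core):
        neighbors = graph[i].indices
        neighbors = neighbors[core[neighbors]]
        if neighbors.size:
            labels[i] = labels[neighbors[0]]
    return labels
```

The point of the sketch is that the distance computation drops from O(n^2) pairs to roughly s * n^2 pairs, and the neighborhood graph stored in memory shrinks by the same factor.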
All comparisons in [1] were done against the algorithm='auto' setting of sklearn.cluster.DBSCAN.
The only new parameter needed would be the sampling ratio 0 < s <= 1, which would only apply to the algorithm='brute' setting. (We might also want a random seed parameter so that runs can be made reproducible.)
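For illustration only, the API might look roughly like this; `sampling_ratio` is a placeholder name and not an existing scikit-learn parameter, so the proposed arguments are left commented out to keep the snippet runnable against the current release:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(1000, 50)

clustering = DBSCAN(
    eps=0.3,
    min_samples=10,
    algorithm='brute',
    # proposed: fraction of pairs whose distances are actually computed,
    # only honored when algorithm='brute'
    # sampling_ratio=0.001,
    # proposed: seed controlling which pairs are sampled, for reproducibility
    # random_state=0,
).fit(X)
print(np.unique(clustering.labels_))
```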