Description
Describe the workflow you want to enable
Currently, DBSCAN is very slow for large datasets and can use a lot of memory, especially in higher dimensions.
For example, running sklearn.cluster.DBSCAN on a dataset of 745k points with 411 features took over 6 days to finish on a powerful cloud machine, and running it on a ~1M-point dataset with a reasonable epsilon for that dataset ran out of memory on a machine with over 750 GB of RAM.
In recent work with Google Research [1], we found that a minor tweak to the DBSCAN algorithm can give over a 200x speedup and a 250x reduction in memory without sacrificing clustering accuracy.
[1] https://arxiv.org/abs/2006.06743
Describe your proposed solution
The solution is based on a simple modification to the brute-force version. Instead of calculating pairwise distances for all pairs of examples to obtain the nearest-neighbor graph, you calculate them only for a fraction s of the pairs (selected randomly) and cluster based on that subsampled graph, as sketched below. This makes the algorithm feasible for larger datasets without hurting clustering quality (in some cases s = 0.001 suffices). The idea is that you don't actually need the full nearest-neighbor graph to get the desired clustering.
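To make the idea concrete, here is a rough sketch of the subsampled-neighborhood approach (not the exact algorithm or implementation from [1]; the function name `sng_dbscan` and all parameter names are illustrative, and Euclidean distance is assumed): each point computes distances only to a random fraction s of the other points, and DBSCAN-style clustering (core points, connected components, border attachment) is run on the resulting sparse graph.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def sng_dbscan(X, eps, min_samples, s=0.01, random_state=None):
    """DBSCAN-style clustering on a randomly subsampled neighborhood graph (sketch)."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    m = max(1, int(s * n))  # candidate neighbors examined per point

    rows, cols = [], []
    for i in range(n):
        # compute distances from point i only to a random fraction s of points
        cand = rng.choice(n, size=m, replace=False)
        dist = np.linalg.norm(X[cand] - X[i], axis=1)
        hits = cand[dist <= eps]
        rows.extend([i] * len(hits))
        cols.extend(hits.tolist())

    # symmetrized, subsampled epsilon-neighborhood graph
    graph = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    graph = graph.maximum(graph.T)

    # core points: enough sampled neighbors within eps
    # (how this threshold interacts with s is a modeling choice)
    degree = np.asarray(graph.sum(axis=1)).ravel()
    core = degree >= min_samples

    # clusters are connected components of the graph restricted to core points
    _, core_labels = connected_components(graph[core][:, core], directed=False)
    labels = np.full(n, -1, dtype=int)
    labels[np.flatnonzero(core)] = core_labels

    # border points join the cluster of any adjacent core point; the rest stay noise (-1)
    for i in np.flatnonzero(~core):
        neighbors = graph[i].indices
        neighbors = neighbors[core[neighbors]]
        if neighbors.size:
            labels[i] = labels[neighbors[0]]
    return labels
```

The point of the sketch is that the distance computation drops from O(n^2) pairs to roughly s * n^2 pairs, and the neighborhood graph stored in memory shrinks by the same factor.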
All comparisons in [1] were done against the algorithm='auto' setting of sklearn.cluster.DBSCAN.
The only new parameter needed would be the sampling ratio 0 < s <= 1, which would only apply to the algorithm='brute' setting. (We might also want a random seed parameter so that runs can be made reproducible.)
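For illustration only, the API might look roughly like this; `sampling_ratio` is a placeholder name and not an existing scikit-learn parameter, so the proposed arguments are left commented out to keep the snippet runnable against the current release:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(1000, 50)

clustering = DBSCAN(
    eps=0.3,
    min_samples=10,
    algorithm='brute',
    # proposed: fraction of pairs whose distances are actually computed,
    # only honored when algorithm='brute'
    # sampling_ratio=0.001,
    # proposed: seed controlling which pairs are sampled, for reproducibility
    # random_state=0,
).fit(X)
print(np.unique(clustering.labels_))
```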