FEA CSR support for all `DistanceMetric` #23604

jjerphan · Jun 13, 2022

Reference Issues/PRs

Precedes support for fused sparse-dense datasets for PairwiseDistancesReductions (see #22587)

What does this implement/fix? Explain your changes.

This implements support for distance metric computation on CSR data.

Importantly, this:

defines DistanceMetric.{dist_csr,rdist_csr} (adapted versions of DistanceMetric.{dist,rdist}) for CSR data (see the pxd file for the definition).
implements DistanceMetric.{dist_csr,rdist_csr} for all DistanceMetric subclasses except PyFuncDistance.
uses index wrapping to support the sparse-dense and dense-sparse cases and to be robust to explicitly stored zeros, with a minimal memory footprint.
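For readers unfamiliar with this kind of computation, here is a minimal pure-Python sketch of what a `dist_csr`-style method has to do for the Euclidean metric. It is an illustration only, not the Cython code of this PR: the function name `euclidean_dist_csr` is hypothetical, the parameter names follow the signature quoted later in the review, and it assumes column indices are sorted within each row.

```python
from math import sqrt


def euclidean_dist_csr(x1_data, x1_indices, x2_data, x2_indices,
                       x1_start, x1_end, x2_start, x2_end):
    """Euclidean distance between two CSR rows (illustrative sketch).

    A row stores its values in data[start:end] and the matching column
    indices, assumed sorted increasingly, in indices[start:end].
    """
    i1, i2 = x1_start, x2_start
    d = 0.0
    # Walk both rows at once, always advancing the smaller column index.
    while i1 < x1_end and i2 < x2_end:
        ix1, ix2 = x1_indices[i1], x2_indices[i2]
        if ix1 == ix2:
            diff = x1_data[i1] - x2_data[i2]
            i1 += 1
            i2 += 1
        elif ix1 < ix2:
            diff = x1_data[i1]  # the other row has an implicit zero here
            i1 += 1
        else:
            diff = x2_data[i2]
            i2 += 1
        d += diff * diff
    # One row is exhausted: the remaining entries face implicit zeros.
    for i in range(i1, x1_end):
        d += x1_data[i] * x1_data[i]
    for i in range(i2, x2_end):
        d += x2_data[i] * x2_data[i]
    return sqrt(d)


# Two rows over 5 features: [1, 0, 2, 0, 0] and [0, 0, 2, 3, 0].
# The second row stores an explicit zero at column 0.
x1_data, x1_indices = [1.0, 2.0], [0, 2]
x2_data, x2_indices = [0.0, 2.0, 3.0], [0, 2, 3]
dist = euclidean_dist_csr(x1_data, x1_indices, x2_data, x2_indices, 0, 2, 0, 3)
```

Because the values are compared rather than the mere presence of a stored index, the explicit zero does not change the result, which is `sqrt(10)` here. Metrics that count non-zero entries (Jaccard, Hamming) need the same care.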

Any other comments?

Additional changes:

use np.float64 for extra data structures in all cases for best precision (namely precision matrices, weight vectors, work vectors)
minor reformatting in the implementation and minor renaming in tests for consistency
additional private Python method to allow testing those interfaces
mahalanobis is now tested using an adapted tolerance for its test cases

ℹ️
The +2000 diff really is due to logic duplication. It would be +600 if it were entirely factorised. Yet, I do not see how we can remove this duplication easily without losing performance and without additional costly indirection. Hence this choice favours performance at the cost of maintainability.

TODO:

implement sparse support for Haversine + dedicated tests
check that the float32 -> float64 upcasts are done at the right location to preserve numerical stability

jjerphan · Jun 14, 2022

Not sure why it fails for Linux_Docker debian_atlas_32bit. 🤔

ogrisel

Thanks @jjerphan! The design, docstrings, tests and code look good to me.

The only remaining question is the need to update the assertions in the dense test (see below) because we now do the upcasting before the element-wise differences (both in the dense and sparse cases), while previously we would do the upcasting to float64 right after the element-wise differences:

sklearn/metrics/tests/test_dist_metrics.py
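To make the effect of the upcasting order concrete, here is a small NumPy illustration. The values are chosen for the example and are not taken from the PR's tests; the variable names are hypothetical.

```python
import numpy as np

x = np.array([1e8], dtype=np.float32)
y = np.array([1.0], dtype=np.float32)

# Subtract in float32, then upcast: the exact result 99999999 is not
# representable in float32, so it is rounded before the upcast.
diff_then_upcast = (x - y).astype(np.float64)

# Upcast first, then subtract in float64: the difference is exact.
upcast_then_diff = x.astype(np.float64) - y.astype(np.float64)

print(diff_then_upcast[0], upcast_then_diff[0])
```

The two orders give different float64 values for the same float32 inputs, which is why assertions written against the old order may need an adjusted tolerance.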

ogrisel

LGTM!

ogrisel · Jun 22, 2022

@jeremiedbb @lorentzenchr I think this PR is ready for final review.

jeremiedbb

I made a quick pass. Looks good overall. Just a couple of comments for now

jjerphan · Jun 27, 2022

Follow-up on IRL discussions with @ogrisel and @jeremiedbb before I forget: currently, the dense-sparse case is handled as a special case of the sparse-sparse case (where dense vectors are viewed and iterated over like sparse vectors). This allows having just two interfaces (namely DistanceMetric.{dist_csr,rdist_csr}) for the upcoming support of distance computations on sparse-dense, dense-sparse and sparse-sparse pairs of vectors.

Still, @jeremiedbb mentioned that there might be more performant alternatives (avoiding jmp, IIRC) for handling the dense-sparse case and I think it makes sense to explore them.

Should we explore alternatives in this PR? I am +0 as I think alternatives for the dense-sparse case come as additional methods rather than as modifications of the new DistanceMetric.{dist_csr,rdist_csr}, which could in the meantime be used for the dense-sparse case.
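As a rough illustration of "dense vectors seen as sparse vectors", the sketch below feeds a dense row to a sparse-sparse routine by pairing it with the column indices `0..n_features-1`. This is a simplified pure-Python picture, not the PR's implementation: `manhattan_dist_csr` and `sparse_dense_dist` are hypothetical names, and the dense row is sliced here for readability rather than addressed through wrapped indices.

```python
def manhattan_dist_csr(x1_data, x1_indices, x2_data, x2_indices,
                       x1_start, x1_end, x2_start, x2_end):
    """Manhattan distance between two CSR-style rows (illustrative sketch)."""
    i1, i2 = x1_start, x2_start
    d = 0.0
    while i1 < x1_end and i2 < x2_end:
        ix1, ix2 = x1_indices[i1], x2_indices[i2]
        if ix1 == ix2:
            d += abs(x1_data[i1] - x2_data[i2])
            i1 += 1
            i2 += 1
        elif ix1 < ix2:
            d += abs(x1_data[i1])
            i1 += 1
        else:
            d += abs(x2_data[i2])
            i2 += 1
    # Remaining entries of either row face implicit zeros.
    d += sum(abs(v) for v in x1_data[i1:x1_end])
    d += sum(abs(v) for v in x2_data[i2:x2_end])
    return d


n_features = 4
# Sparse row [0, 5, 0, 1] in CSR form.
sp_data, sp_indices = [5.0, 1.0], [1, 3]
# Dense 2 x 4 array stored flat in C order:
# rows are [1, 2, 0, 4] and [0, 5, 0, 3].
dense_flat = [1.0, 2.0, 0.0, 4.0, 0.0, 5.0, 0.0, 3.0]
# Every column of a dense row is "stored", so all rows share these indices.
dense_indices = list(range(n_features))


def sparse_dense_dist(j):
    """Distance between the sparse row and dense row j (hypothetical helper)."""
    row = dense_flat[j * n_features:(j + 1) * n_features]
    return manhattan_dist_csr(sp_data, sp_indices, row, dense_indices,
                              0, len(sp_data), 0, n_features)


d0 = sparse_dense_dist(0)
d1 = sparse_dense_dist(1)
```

The cost of this approach is that every dense entry goes through the index comparison branches, which is presumably what a dedicated dense-sparse code path would avoid.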

lorentzenchr · Jun 27, 2022

I'm in favor of proceeding with the current implementation and improving it later, if we find good improvements.

lorentzenchr

I haven't reviewed all the code in _dist_metrics.pyx.tp in detail but I find the tests good enough to trust the correctness.

lorentzenchr · Jun 28, 2022

sklearn/metrics/_dist_metrics.pyx.tp

+        const SPARSE_INDEX_TYPE_t x1_start,
+        const SPARSE_INDEX_TYPE_t x1_end,
+        const SPARSE_INDEX_TYPE_t x2_start,
+        const SPARSE_INDEX_TYPE_t x2_end,


It would be nice to have a description of these somewhere in the docstrings, just once.
x_data and x_indices are the 2-d sparse array. So why do I need x_start and x_end?

jjerphan · Jun 29, 2022

Does 731370a clarify this?

ogrisel · Jun 29, 2022

Hum, this PR was merged before addressing this comment. It might be good to improve that in a follow-up PR @jjerphan.

lorentzenchr · Jun 29, 2022

A comment was added that explains those parameters. Maybe we can do a little better, but I considered it as resolved and therefore merged.

jjerphan · Jun 29, 2022

@ogrisel: did you mean the proposal of 731370a was insufficient, out of scope, or could be rephrased? :)
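For context on the question above, here is a minimal sketch of how start and end bounds relate to the standard CSR layout: `data` and `indices` are flat arrays shared by all rows, and `indptr` gives, for each row, the slice that belongs to it. The helpers `row_bounds` and `row_to_dense` are hypothetical and only meant to illustrate the layout, assuming the bounds are read from `indptr` by the caller.

```python
# CSR layout of the 3 x 4 matrix
# [[1, 0, 2, 0],
#  [0, 0, 0, 0],
#  [0, 3, 0, 4]]
data = [1.0, 2.0, 3.0, 4.0]
indices = [0, 2, 1, 3]
indptr = [0, 2, 2, 4]


def row_bounds(indptr, i):
    """Return (start, end) such that data[start:end] holds row i."""
    return indptr[i], indptr[i + 1]


def row_to_dense(data, indices, start, end, n_features):
    """Expand one CSR row to a dense list, reading the flat arrays in place."""
    row = [0.0] * n_features
    for k in range(start, end):
        row[indices[k]] = data[k]
    return row


rows = [row_to_dense(data, indices, *row_bounds(indptr, i), 4) for i in range(3)]
```

Passing the whole arrays plus two bounds lets a routine address any row without creating a slice per call; the empty second row simply has `start == end`.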

jeremiedbb · Jun 29, 2022

> Follow-up on IRL discussions with @ogrisel and @jeremiedbb before I forget: currently, the dense-sparse case is handled as a special case of the sparse-sparse case (where dense vectors are viewed and iterated over like sparse vectors). This allows having just two interfaces (namely DistanceMetric.{dist_csr,rdist_csr}) for the upcoming support of distance computations on sparse-dense, dense-sparse and sparse-sparse pairs of vectors.
>
> Still, @jeremiedbb mentioned that there might be more performant alternatives (avoiding jmp, IIRC) for handling the dense-sparse case and I think it makes sense to explore them.
>
> Should we explore alternatives in this PR? I am +0 as I think alternatives for the dense-sparse case come as additional methods rather than as modifications of the new DistanceMetric.{dist_csr,rdist_csr}, which could in the meantime be used for the dense-sparse case.

I quickly tried something but I was not convinced by what I was doing. I think we can keep it simple in this PR, and I'm in favor of exploring alternatives in a separate PR.

jeremiedbb

LGTM

MAINT Implement CSR support for all DistanceMetric

b8bd875

github-actions bot added module:metrics cython labels Jun 13, 2022

jjerphan added the No Changelog Needed label Jun 13, 2022

jjerphan changed the title ~~MAINT Implement CSR support for all DistanceMetric~~ MAINT CSR support for all DistanceMetric Jun 13, 2022

Merge branch 'main' into maint/dist-metrics-csr-support

7b07188

jjerphan mentioned this pull request Jun 14, 2022

FEA Fused sparse-dense support for PairwiseDistancesReduction #23585

Merged

4 tasks

jjerphan added 2 commits June 15, 2022 09:52

TST Remove useless guard

fb99680

TST Skip JaccardDistance on 32bit architecture

d39d2b2

jjerphan marked this pull request as ready for review June 15, 2022 09:44

jjerphan added the Waiting for Reviewer label Jun 15, 2022

jjerphan commented Jun 16, 2022

ogrisel reviewed Jun 16, 2022

ogrisel mentioned this pull request Jun 16, 2022

[RFC] Support for int64 indexed SciPy sparse matrices in Cython code #23653

Open

jjerphan added 4 commits June 16, 2022 16:49

MAINT Define dtype alias for sparse matrices indices

011e2a2

MAINT Do not shadow dtype names in Tempita templating

a579630

fixup! MAINT Define dtype alias for sparse matrices indices

98e9d21

TST Use cdist and pdist appropriately

8aa4e44

ogrisel reviewed Jun 16, 2022

jjerphan and others added 3 commits June 17, 2022 09:49

DOC Improve comments

9edfa11

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

Fixups

ee5c6bf

MAINT Wrap of indptr values to support sparse-dense

bf5eb59

This is kind of a hack for now. IMO, it would be better to use a flatiter on a view if possible. See discussions on: https://groups.google.com/g/cython-users/c/MR4xWCvUKHU

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

ogrisel reviewed Jun 17, 2022

ogrisel reviewed Jun 17, 2022

jjerphan and others added 2 commits June 17, 2022 14:23

Apply review comments

92b8a6c

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

More interesting boolean data for tests

dc6f8cf

ogrisel reviewed Jun 17, 2022

jjerphan and others added 2 commits June 17, 2022 16:46

FIX Various corrections

bb06f59

FIX Make Jaccard, Hamming and Hashing robust to explicit zeros

a5eb20d

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

jjerphan and others added 3 commits June 20, 2022 12:34

Rename methods and correctly format their signatures

b3759fe

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

fixup! TST Remove xfail for Jaccard on 32bit arch.

7f89236

FEA CSR support for HaversineDistance

01a0c33

ogrisel reviewed Jun 21, 2022

Fix typo

7d8a717

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

ogrisel reviewed Jun 22, 2022

Do not upcast to 64bit yet keep the same precision

563e359

jeremiedbb reviewed Jun 22, 2022

jjerphan and others added 2 commits June 22, 2022 13:29

Do use the default rtol

f863a51

Set rtol explicitly in test_distance_metrics_dtype_consistency

5ba0fbe

ogrisel approved these changes Jun 22, 2022

jjerphan mentioned this pull request Jun 22, 2022

PERF PairwiseDistancesReductions initial work #22587

Closed

jeremiedbb reviewed Jun 22, 2022

jjerphan and others added 2 commits June 23, 2022 11:38

Implement the sparse-dense and the dense-sparse case for c-contiguity

4f45839

Also do test for c-contiguity.

Add validation on X and Y, accepting CSR as inputs

3e3e888

Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>

ogrisel reviewed Jun 23, 2022

jjerphan and others added 2 commits June 23, 2022 12:51

Remove left-overs

ddc49d5

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

Merge branch 'main' into maint/dist-metrics-csr-support

a83887c

ogrisel mentioned this pull request Jun 24, 2022

ENH Improve performance of KNeighborsClassifier.predict #23721

Closed

1 task

lorentzenchr approved these changes Jun 28, 2022

jeremiedbb approved these changes Jun 29, 2022

DOC Motivate the signature for DistanceMetric.{dist_csr, rdist_csr}

731370a

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>

lorentzenchr changed the title ~~MAINT CSR support for all DistanceMetric~~ FEA CSR support for all DistanceMetric Jun 29, 2022

lorentzenchr merged commit b157ac7 into scikit-learn:main Jun 29, 2022

jjerphan deleted the maint/dist-metrics-csr-support branch June 29, 2022 13:42

ogrisel pushed a commit to ogrisel/scikit-learn that referenced this pull request Jul 11, 2022

FEA CSR support for all DistanceMetric (scikit-learn#23604)

3f556c0
