Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

FEA CSR support for all DistanceMetric #23604

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jjerphan
Copy link
Member

@jjerphan jjerphan commented Jun 13, 2022

Reference Issues/PRs

Precedes support for fused sparse-dense datasets for PairwiseDistancesReductions (see #22587)

What does this implement/fix? Explain your changes.

This implements supports distance metric computation for CSR data.

Importantly, this:

  • define DistanceMetric.{dist_csr,rdist_csr} (adapted versions of DistanceMetric.{dist,rdist}) for CSR data (see the pxd file for the definition).
  • implement DistanceMetric.{dist_csr,rdist_csr} for all the DistanceMetric excepted PyFuncDistance.
  • this uses a indices wrapping to be able to support the sparse-dense, dense-sparse cases and to be robust to explicit zeros representation with a minimal memory footprint

Any other comments?

Additional changes:

  • use np.float64 for extra datastructures in all the cases for best precisions (namely, precision matrices, weights vectors, work vectors)
  • minor reformatting in implementation and minor renaming in tests for consistency
  • additional private python method to allow testing those interfaces
  • mahalanobis now tested using an adapted tolerance for its tests cases

ℹ️
The +2000 diff really is due to logic duplication. It would be +600 if it were entirely factorised. Yet, I do not see how we can remove this easily without loosing performance and without additional costly indirection. Hence this choice comes for performance at the cost of maintainability.

TODO:

  • implement sparse support for Haversine + dedicated tests
  • check that the float32 -> float64 upcasts are done at the right location to preserve numerical stability

@jjerphan jjerphan changed the title MAINT Implement CSR support for all DistanceMetric MAINT CSR support for all DistanceMetric Jun 13, 2022
@jjerphan
Copy link
Member Author

Not sure of why is fails for Linux_Docker debian_atlas_32bit. 🤔

@jjerphan jjerphan marked this pull request as ready for review June 15, 2022 09:44
sklearn/metrics/tests/test_dist_metrics.py Outdated Show resolved Hide resolved
sklearn/metrics/_dist_metrics.pxd.tp Show resolved Hide resolved
sklearn/metrics/tests/test_dist_metrics.py Show resolved Hide resolved
sklearn/metrics/tests/test_dist_metrics.py Show resolved Hide resolved
sklearn/metrics/tests/test_dist_metrics.py Show resolved Hide resolved
sklearn/metrics/tests/test_dist_metrics.py Outdated Show resolved Hide resolved
sklearn/metrics/_dist_metrics.pyx.tp Outdated Show resolved Hide resolved
sklearn/metrics/_dist_metrics.pyx.tp Outdated Show resolved Hide resolved
sklearn/metrics/_dist_metrics.pyx.tp Outdated Show resolved Hide resolved
sklearn/metrics/_dist_metrics.pyx.tp Outdated Show resolved Hide resolved
jjerphan and others added 3 commits June 17, 2022 09:49
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
This is kind of an hack for now.

IMO, it would be better to use a flatiter on a view if possible.
See discussions on: https://groups.google.com/g/cython-users/c/MR4xWCvUKHU

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
sklearn/utils/_typedefs.pxd Show resolved Hide resolved
sklearn/metrics/_dist_metrics.pyx.tp Show resolved Hide resolved
sklearn/metrics/_dist_metrics.pyx.tp Outdated Show resolved Hide resolved
sklearn/metrics/_dist_metrics.pyx.tp Outdated Show resolved Hide resolved
jjerphan and others added 2 commits June 17, 2022 14:23
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
sklearn/metrics/_dist_metrics.pyx.tp Outdated Show resolved Hide resolved
sklearn/metrics/_dist_metrics.pyx.tp Outdated Show resolved Hide resolved
jjerphan and others added 2 commits June 17, 2022 16:46
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jjerphan! The design, docstrings, tests and code look good to me.

The only remaining question is the need to update the assertions in the dense test (see below) because we now to the upcasting before the element-wise differences (both in the dense and sparse cases) while previously we would do the the upcasting to float64 right after the element wise differences:

sklearn/metrics/tests/test_dist_metrics.py Outdated Show resolved Hide resolved
sklearn/metrics/_dist_metrics.pyx.tp Outdated Show resolved Hide resolved
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
sklearn/metrics/_dist_metrics.pyx.tp Show resolved Hide resolved
sklearn/metrics/tests/test_dist_metrics.py Outdated Show resolved Hide resolved
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ogrisel
Copy link
Member

ogrisel commented Jun 22, 2022

@jeremiedbb @lorentzenchr I think this PR is ready for final review.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a quick pass. Looks good overall. Just a couple of comments for now

sklearn/metrics/_dist_metrics.pyx.tp Outdated Show resolved Hide resolved
sklearn/metrics/_dist_metrics.pyx.tp Outdated Show resolved Hide resolved
jjerphan and others added 2 commits June 23, 2022 11:38
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
sklearn/metrics/_dist_metrics.pyx.tp Show resolved Hide resolved
jjerphan and others added 2 commits June 23, 2022 12:51
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@jjerphan
Copy link
Member Author

Follow-up on IRL discussions with @ogrisel and @jeremiedbb before I forget: currently, the dense-sparse case is handled as a special case of the sparse-sparse case (where dense vectors are seen and iterated against like sparse vectors). This allows having just two interfaces (namely DistanceMetric.{dist_csr,rdist_csr}) for the upcoming support of distance computations on sparse-dense, dense-sparse and sparse-sparse pair of vectors.

Still, @jeremiedbb mentioned that there might be a more performant alternatives (avoiding jmp, IIRC) for handling the dense-sparse case and I think it makes sense to explore it.

Should we explore alternatives in this PR? I am +0 as I think alternatives for the dense-sparse case come as additional methods rather than as modifications of the new DistanceMetric.{dist_csr,rdist_csr}, which could in the meantime be used for the dense-sparse case.

@lorentzenchr
Copy link
Member

I'm in favor of proceeding with the current implementation and improve later - if we find good improvements.

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed all the code in _dist_metrics.pyx.tp in detail but I find the tests good enough to trust the correctness.

Comment on lines +402 to +405
const SPARSE_INDEX_TYPE_t x1_start,
const SPARSE_INDEX_TYPE_t x1_end,
const SPARSE_INDEX_TYPE_t x2_start,
const SPARSE_INDEX_TYPE_t x2_end,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have a description of theses somewhere in the docstrings, just once.
x_data and x_indices is the 2-d sparse array. So why to I need x_start and x_end?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does 731370a clarifies this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, this PR was merged before addressing this comment. It might be good to improve that in a follow-up PR @jjerphan.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment was added that explains those parameters. Maybe we can do a little better, but I considered it as resolved and therefore merged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ogrisel: did you mean the proposal of 731370a was insufficient, out of scope, or could be rephrased? :)

@jeremiedbb
Copy link
Member

Follow-up on IRL discussions with @ogrisel and @jeremiedbb before I forget: currently, the dense-sparse case is handled as a special case of the sparse-sparse case (where dense vectors are seen and iterated against like sparse vectors). This allows having just two interfaces (namely DistanceMetric.{dist_csr,rdist_csr}) for the upcoming support of distance computations on sparse-dense, dense-sparse and sparse-sparse pair of vectors.

Still, @jeremiedbb mentioned that there might be a more performant alternatives (avoiding jmp, IIRC) for handling the dense-sparse case and I think it makes sense to explore it.

Should we explore alternatives in this PR? I am +0 as I think alternatives for the dense-sparse case come as additional methods rather than as modifications of the new DistanceMetric.{dist_csr,rdist_csr}, which could in the meantime be used for the dense-sparse case.

I quickly tried something but I was not convinced by what I was doing. I think we can keep it simple in this PR, and I'm in favor of exploring alternatives in a separate PR.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@lorentzenchr lorentzenchr changed the title MAINT CSR support for all DistanceMetric FEA CSR support for all DistanceMetric Jun 29, 2022
@lorentzenchr lorentzenchr merged commit b157ac7 into scikit-learn:main Jun 29, 2022
@jjerphan jjerphan deleted the maint/dist-metrics-csr-support branch June 29, 2022 13:42
ogrisel pushed a commit to ogrisel/scikit-learn that referenced this pull request Jul 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.