-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
FEA CSR support for all DistanceMetric
#23604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEA CSR support for all DistanceMetric
#23604
Conversation
DistanceMetric
Not sure of why is fails for |
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
This is kind of an hack for now. IMO, it would be better to use a flatiter on a view if possible. See discussions on: https://groups.google.com/g/cython-users/c/MR4xWCvUKHU Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jjerphan! The design, docstrings, tests and code look good to me.
The only remaining question is the need to update the assertions in the dense test (see below) because we now to the upcasting before the element-wise differences (both in the dense and sparse cases) while previously we would do the the upcasting to float64 right after the element wise differences:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@jeremiedbb @lorentzenchr I think this PR is ready for final review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a quick pass. Looks good overall. Just a couple of comments for now
Also do test for c-contiguity.
Co-authored-by: Jérémie du Boisberranger <jeremiedbb@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Follow-up on IRL discussions with @ogrisel and @jeremiedbb before I forget: currently, the dense-sparse case is handled as a special case of the sparse-sparse case (where dense vectors are seen and iterated against like sparse vectors). This allows having just two interfaces (namely Still, @jeremiedbb mentioned that there might be a more performant alternatives (avoiding Should we explore alternatives in this PR? I am +0 as I think alternatives for the dense-sparse case come as additional methods rather than as modifications of the new |
I'm in favor of proceeding with the current implementation and improve later - if we find good improvements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't reviewed all the code in _dist_metrics.pyx.tp in detail but I find the tests good enough to trust the correctness.
const SPARSE_INDEX_TYPE_t x1_start, | ||
const SPARSE_INDEX_TYPE_t x1_end, | ||
const SPARSE_INDEX_TYPE_t x2_start, | ||
const SPARSE_INDEX_TYPE_t x2_end, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have a description of theses somewhere in the docstrings, just once.
x_data and x_indices is the 2-d sparse array. So why to I need x_start and x_end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does 731370a clarifies this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum, this PR was merged before addressing this comment. It might be good to improve that in a follow-up PR @jjerphan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment was added that explains those parameters. Maybe we can do a little better, but I considered it as resolved and therefore merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ogrisel: did you mean the proposal of 731370a was insufficient, out of scope, or could be rephrased? :)
I quickly tried something but I was not convinced by what I was doing. I think we can keep it simple in this PR, and I'm in favor of exploring alternatives in a separate PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
DistanceMetric
DistanceMetric
Reference Issues/PRs
Precedes support for fused sparse-dense datasets for
PairwiseDistancesReductions
(see #22587)What does this implement/fix? Explain your changes.
This implements supports distance metric computation for CSR data.
Importantly, this:
DistanceMetric.{dist_csr,rdist_csr}
(adapted versions ofDistanceMetric.{dist,rdist}
) for CSR data (see thepxd
file for the definition).DistanceMetric.{dist_csr,rdist_csr}
for all theDistanceMetric
exceptedPyFuncDistance
.Any other comments?
Additional changes:
ℹ️
The +2000 diff really is due to logic duplication. It would be +600 if it were entirely factorised. Yet, I do not see how we can remove this easily without loosing performance and without additional costly indirection. Hence this choice comes for performance at the cost of maintainability.
TODO: