[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade #131493


Closed
wants to merge 6 commits

Conversation

eqy
Collaborator

@eqy eqy commented Jul 23, 2024

@eqy eqy added labels on Jul 23, 2024: module: sparse (Related to torch.sparse), module: cuda (Related to torch.cuda, and CUDA support in general), open source, ciflow/trunk (Trigger trunk jobs on your pull request), topic: not user facing (topic category)

pytorch-bot bot commented Jul 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131493

Note: Links to docs will display an error until the docs builds have been completed.

❌ 17 New Failures, 2 Unrelated Failures

As of commit 0c49853 with merge base efe21ee:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@alexsamardzic
Collaborator

LGTM, as far as the changes in SparseSemiStructured(Linear|Ops).cu are concerned.

@albanD albanD requested a review from ptrblck July 23, 2024 22:30
@albanD albanD added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jul 23, 2024
@@ -141,13 +141,13 @@ void f8f8bf16_rowwise_impl(
cute::Stride<cute::Int<1>, cute::Int<0>, cute::Int<0>>>;

using WScale = cutlass::epilogue::fusion::Sm90RowBroadcast<
PONG ? 2 : 1,
Contributor:
does this have perf impact?

@drisspg
Contributor

drisspg commented Jul 29, 2024

I had this PR: #131687; hopefully it will be able to autoclose after this lands :)

@Skylion007
Collaborator

@eqy Unfortunately, didn't fix the build issues.

@eqy
Collaborator Author

eqy commented Jul 29, 2024

I'll see if I can reproduce the build issues on 11.8/12.1, which I haven't tried locally yet...

@eqy eqy requested a review from syed-ahmed as a code owner July 30, 2024 21:30
@eqy
Collaborator Author

eqy commented Jul 30, 2024

Update: this is because hrsqrt isn't available on SM5.2, which is what the failing build/test CI runners are targeting...
I'll see if we can work around this with an arch guard on the PyTorch side, and file a bug with CUTLASS to address it if not

@eqy
Collaborator Author

eqy commented Aug 1, 2024

oof that windows failure doesn't look so fun

@Skylion007
Collaborator

@eqy Sigh, seems like the CUTLASS issue is open: NVIDIA/cutlass#1571

@Skylion007
Collaborator

@eqy The problematic kernels seem to be copy-pasted from xformers, and xformers has already updated to CUTLASS 3.5.0; are there any diffs from there that could be useful to fix the error?

@eqy
Collaborator Author

eqy commented Aug 9, 2024

@pytorchmergebot merge

@pytorchmergebot
Collaborator

This PR updates submodules third_party/cutlass

If those updates are intentional, please add "submodule" keyword to PR title/description.

@eqy
Collaborator Author

eqy commented Aug 9, 2024

@pytorchmergebot merge

@eqy eqy changed the title [CUDA][CUTLASS] Fixes for CUTLASS upgrade [CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade Aug 9, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

malfet pushed a commit to aditew01/pytorch that referenced this pull request Sep 13, 2024
pytorch-bot bot pushed a commit that referenced this pull request Sep 13, 2024
This reverts commit 4aa66f6.

Reverted #131493 on behalf of https://github.com/izaitsevfb because it breaks internal builds: identifier "std::numeric_limits< ::cutlass::half_t>::infinity" is undefined in device code (see #131493 (comment))
@Skylion007
Collaborator

@eqy Anything we need for CUTLASS 3.6?

@eqy
Collaborator Author

eqy commented Nov 1, 2024

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cutlass35 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cutlass35 && git pull --rebase)

@Skylion007
Collaborator

Skylion007 commented Nov 4, 2024

Looks like some annoying missing [[maybe_unused]] attributes in CUTLASS ;-; @eqy I don't see an issue opened for this; I assume it's still an issue on main?

@@ -174,7 +174,7 @@ void f8f8bf16_rowwise_impl(

// Implement rowwise scaling epilogue.
constexpr int ColBroadcastStages = 0;
- constexpr int RowBroadcastStages = PingPong::value ? 2 : 1;
+ constexpr int RowBroadcastStages = 0;
Contributor:
NOTE: This change is required to compile, but I get "misaligned address" errors after that when I use CUTLASS 3.6 with the PingPong kernels. I found a workaround by inverting the order of the scales, e.g.:

  using EpilogueEVT = cutlass::epilogue::fusion::Sm90EVT<Cast,
      cutlass::epilogue::fusion::Sm90EVT<Add, Bias,
        cutlass::epilogue::fusion::Sm90EVT<Multiply, WScale, 
          cutlass::epilogue::fusion::Sm90EVT<Multiply, XScale,
            Accum>>>>;

(this also requires changing how the arguments are supplied to the epilogue)
cc @drisspg

@@ -9,16 +9,6 @@
// sparsification, as a bitmask.
// NOTE: Algorithms might select LESS than 8 values in total in some cases.

namespace platform {
Collaborator:
@eqy If the improved 3.5.1+ fixes for CUTLASS are blocking the changes, can we at least merge the early PR that was passing tests and still upgrades CUTLASS a little bit? Until we can have the warnings fixed in CUTLASS, of course.

@Skylion007
Collaborator

@eqy @atalman Are we going to push to get some CUTLASS upgrade in 2.6 for the Flash Attention speed ups?

@nighting0le01

nighting0le01 commented Dec 12, 2024

> @eqy @atalman Are we going to push to get some CUTLASS upgrade in 2.6 for the Flash Attention speed ups?

hey @eqy do you mean FlashAttention-3 is supported in SDPA now??

@Skylion007
Collaborator

> @eqy @atalman Are we going to push to get some CUTLASS upgrade in 2.6 for the Flash Attention speed ups?
>
> hey @eqy do you mean FlashAttention-3 is supported in SDPA now??

No, there were some optimizations for faster FA2 in the CUTLASS upgrade changelog notes

@@ -264,13 +270,32 @@ void f8f8bf16_rowwise_impl(
stride_b},
{{{{bias.has_value() ? reinterpret_cast<DtypeBias*>(bias->data_ptr())
: nullptr},
{{reinterpret_cast<DtypeScale*>(x_scale.data_ptr())},
{{reinterpret_cast<DtypeScale*>(w_scale.data_ptr())}}}}},
{{reinterpret_cast<DtypeScale*>(w_scale.data_ptr())},
Collaborator:
These are flipped?

Contributor:
Yes, that's because the order of the epilogues got flipped (see just above) to work around a CUTLASS bug in this new version.

@Skylion007
Collaborator

Redundant given recent CUTLASS merges

@Skylion007 Skylion007 closed this Jan 29, 2025
Labels
ciflow/inductor, ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: cuda (Related to torch.cuda, and CUDA support in general), module: inductor, module: sparse (Related to torch.sparse), open source, Reverted, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)