[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade #131493


Closed
wants to merge 6 commits

Conversation

eqy
Collaborator

@eqy eqy commented Jul 23, 2024

@eqy eqy added labels on Jul 23, 2024: module: sparse (Related to torch.sparse), module: cuda (Related to torch.cuda, and CUDA support in general), open source, ciflow/trunk (Trigger trunk jobs on your pull request), topic: not user facing (topic category)

pytorch-bot bot commented Jul 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131493

Note: Links to docs will display an error until the docs builds have been completed.

❌ 17 New Failures, 2 Unrelated Failures

As of commit 0c49853 with merge base efe21ee:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@alexsamardzic
Collaborator

LGTM, as far as the changes in SparseSemiStructured(Linear|Ops).cu are concerned.

@albanD albanD requested a review from ptrblck July 23, 2024 22:30
@albanD albanD added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jul 23, 2024
@@ -141,13 +141,13 @@ void f8f8bf16_rowwise_impl(
cute::Stride<cute::Int<1>, cute::Int<0>, cute::Int<0>>>;

using WScale = cutlass::epilogue::fusion::Sm90RowBroadcast<
PONG ? 2 : 1,
Contributor:
does this have perf impact?

@drisspg
Contributor

drisspg commented Jul 29, 2024

I had this PR: #131687; hopefully it will be able to autoclose after this lands :)

@Skylion007
Collaborator

@eqy Unfortunately, didn't fix the build issues.

@eqy
Collaborator Author

eqy commented Jul 29, 2024

I'll see if I can reproduce the build issues on 11.8/12.1, which I haven't tried locally yet...

@eqy eqy requested a review from syed-ahmed as a code owner July 30, 2024 21:30
@eqy
Collaborator Author

eqy commented Jul 30, 2024

Update: this is because hrsqrt isn't available on SM5.2, which is what the failing build/test CI runners are targeting...
I'll see if we can work around this with an arch guard on the PyTorch side, and file a bug with CUTLASS to address it if not

@eqy
Collaborator Author

eqy commented Aug 1, 2024

oof that windows failure doesn't look so fun

@Skylion007
Collaborator

@eqy Sigh, seems like the CUTLASS issue is open: NVIDIA/cutlass#1571

@Skylion007
Collaborator

@eqy The problematic kernels seem to be copy-pasted from xformers, and xformers has already updated to CUTLASS 3.5.0; are there any diffs from there that could be useful to fix the error?

@eqy
Collaborator Author

eqy commented Aug 9, 2024

@pytorchmergebot merge

@pytorchmergebot
Collaborator

This PR updates submodules third_party/cutlass

If those updates are intentional, please add "submodule" keyword to PR title/description.

@eqy
Collaborator Author

eqy commented Aug 9, 2024

@pytorchmergebot merge

@eqy eqy changed the title [CUDA][CUTLASS] Fixes for CUTLASS upgrade [CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade Aug 9, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

malfet pushed a commit to aditew01/pytorch that referenced this pull request Sep 13, 2024
pytorch-bot bot pushed a commit that referenced this pull request Sep 13, 2024
This reverts commit 4aa66f6.

Reverted #131493 on behalf of https://github.com/izaitsevfb because it breaks internal builds: identifier "std::numeric_limits< ::cutlass::half_t>::infinity" is undefined in device code (see #131493 (comment))
@Skylion007
Collaborator

@eqy Anything we need for CUTLASS 3.6?

@eqy
Collaborator Author

eqy commented Nov 1, 2024

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cutlass35 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cutlass35 && git pull --rebase)

@Skylion007
Collaborator

Skylion007 commented Nov 4, 2024

Looks like some annoying missing [[maybe_unused]] attributes in CUTLASS ;-; @eqy I don't see an issue opened for this; I assume it's still an issue on main?

@@ -174,7 +174,7 @@ void f8f8bf16_rowwise_impl(

// Implement rowwise scaling epilogue.
constexpr int ColBroadcastStages = 0;
- constexpr int RowBroadcastStages = PingPong::value ? 2 : 1;
+ constexpr int RowBroadcastStages = 0;
Contributor:
NOTE: This change is required to compile, but I get "misaligned address" errors after that when I use CUTLASS 3.6 with the PingPong kernels. I found a workaround by inverting the order of the scales, e.g.:

  using EpilogueEVT = cutlass::epilogue::fusion::Sm90EVT<Cast,
      cutlass::epilogue::fusion::Sm90EVT<Add, Bias,
        cutlass::epilogue::fusion::Sm90EVT<Multiply, WScale, 
          cutlass::epilogue::fusion::Sm90EVT<Multiply, XScale,
            Accum>>>>;

(this also requires changing how the arguments are supplied to the epilogue)
cc @drisspg

@@ -9,16 +9,6 @@
// sparsification, as a bitmask.
// NOTE: Algorithms might select LESS than 8 values in total in some cases.

namespace platform {
Collaborator:
@eqy If the improved 3.5.1+ fixes for CUTLASS are blocking the changes, can we at least merge the early PR that was passing tests and still upgrades CUTLASS a little bit? Until we can have the warnings fixed in CUTLASS, of course.

@Skylion007
Collaborator

@eqy @atalman Are we going to push to get some CUTLASS upgrade in 2.6 for the Flash Attention speed ups?

@nighting0le01

nighting0le01 commented Dec 12, 2024

> @eqy @atalman Are we going to push to get some CUTLASS upgrade in 2.6 for the Flash Attention speed ups?

hey @eqy do you mean FlashAttention-3 is supported in SDPA now??

@Skylion007
Collaborator

> @eqy @atalman Are we going to push to get some CUTLASS upgrade in 2.6 for the Flash Attention speed ups?
>
> hey @eqy do you mean FlashAttention-3 is supported in SDPA now??

No, there were some optimizations for faster FA2 in the CUTLASS upgrade changelog notes

@@ -264,13 +270,32 @@ void f8f8bf16_rowwise_impl(
stride_b},
{{{{bias.has_value() ? reinterpret_cast<DtypeBias*>(bias->data_ptr())
: nullptr},
{{reinterpret_cast<DtypeScale*>(x_scale.data_ptr())},
{{reinterpret_cast<DtypeScale*>(w_scale.data_ptr())}}}}},
{{reinterpret_cast<DtypeScale*>(w_scale.data_ptr())},
Collaborator:
These are flipped?

Contributor:
Yes, that's because the order of the epilogues got flipped (see just above) to work around a CUTLASS bug in this new version.

@Skylion007
Collaborator

Redundant given recent CUTLASS merges

@Skylion007 Skylion007 closed this Jan 29, 2025
Labels
ciflow/inductor, ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: cuda (Related to torch.cuda, and CUDA support in general), module: inductor, module: sparse (Related to torch.sparse), open source, Reverted, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)