[Submodule] Bump Cutlass to 3.5.1 OSS (#144000)
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144000
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 10b0f6c with merge base 2409b49.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@drisspg We can suppress the unused-variable warning from the CUTLASS headers for now, and fix it upstream if it's not already fixed. I reported the issue a while ago.
```Python
import torch
import sys

M, K, N = 256, 32, 256

# Create FP8 tensors
a = torch.randn((M, K), device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn)
b = torch.randn((N, K), device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn)
scales_a = torch.randn(M, device="cuda", dtype=torch.float)
scales_b = torch.randn(N, device="cuda", dtype=torch.float)

print("Scales_b address: ", hex(scales_b.data_ptr()))
sys.stdout.flush()

# Perform scaled matrix multiply
result = torch._scaled_mm(
    a,
    b.t(),
    scale_a=scales_a[:, None],
    scale_b=scales_b[None, :],
    out_dtype=torch.bfloat16,
    use_fast_accum=True,
)
```

Running with:

```Shell
function sanitize() {
  CUDA_LAUNCH_BLOCKING=1 \
  PYTORCH_NO_CUDA_MEMORY_CACHING=1 \
  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 \
  compute-sanitizer --tool memcheck "$@"
}
```

For big M, K, N the nearest allocation is scales_b. When N % 256 == 0 there is no more memory error; however, even though the memory error disappears for the 256-aligned shapes, the kernel still appears to be misbehaving, since in unit tests the numerical deviation is much higher.
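The repro above would then be run under the sanitizer as, e.g., `sanitize python repro.py` (with the snippet saved to a hypothetical `repro.py`).

For reference, here is a minimal sketch of the kind of rowwise numerics check the failing unit tests perform. It is an illustration, not the actual test in `test/inductor/test_fp8.py`: the shapes, the `rowwise_quant` helper, and the e4m3 max constant are assumptions, and `torch._scaled_mm` rowwise scaling needs an SM90 GPU and a recent build. On a correct kernel the printed deviations should be small; a broken one shows the inf-sized gaps quoted in the PR description below.

```Python
import torch

M, K, N = 1024, 512, 2048  # assumed shapes; N is a multiple of 256

x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
w = torch.randn(N, K, device="cuda", dtype=torch.bfloat16)

E4M3_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

def rowwise_quant(t):
    # Hypothetical helper: pick a per-row scale that maps each row onto
    # the e4m3 range, then quantize.
    scale = (t.abs().amax(dim=1, keepdim=True).float() / E4M3_MAX).clamp_min(1e-12)
    return (t.float() / scale).to(torch.float8_e4m3fn), scale

xq, x_scale = rowwise_quant(x)
wq, w_scale = rowwise_quant(w)

out = torch._scaled_mm(
    xq,
    wq.t(),
    scale_a=x_scale,      # (M, 1) rowwise scale
    scale_b=w_scale.t(),  # (1, N) columnwise scale for the transposed operand
    out_dtype=torch.bfloat16,
    use_fast_accum=True,
)

# Compare against a plain bf16 matmul of the unquantized inputs, mirroring
# the "greatest absolute/relative difference" lines from the failing test.
ref = (x @ w.t()).float()
diff = (out.float() - ref).abs()
print("greatest absolute difference:", diff.max().item())
print("greatest relative difference:", (diff / ref.abs().clamp_min(1e-6)).max().item())
```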
ohhhh actually I just realized we flipped the xscale/w_scale because of a runtime error, but we didn't flip the EVT args.. trying now. Runtime error for provenance:
@Skylion007 This I will clean up and rebase after everything lands. That will bump the actual third-party submodule, and then I will undo all my ifdefs. After that lands I hope to bump to 3.6 as a fast follow.
## Summary

Upon bumping the 3.5.1 pin we get:

```Shell
[3093/3116] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/RowwiseScaledMM.cu.o
/home/drisspg/meta/pytorch/aten/src/ATen/native/cuda/RowwiseScaledMM.cu(17): warning #177-D: function "nvrtc_cuTensorMapEncodeTiled" was declared but never referenced
  static CUresult nvrtc_cuTensorMapEncodeTiled(
                  ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
```

And bad TMA Init

```Shell
TMA Desc Addr:   0x7ffd4ea7f3c0
format         0
dim            3
gmem_address   0x7f18db683000
globalDim      (512,2048,1,1,1)
globalStrides  (1,512,0,0,0)
boxDim         (128,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 999

TMA Desc Addr:   0x7ffd4ea7f3c0
format         0
dim            3
gmem_address   0x7f18db603000
globalDim      (512,1024,1,1,1)
globalStrides  (1,512,0,0,0)
boxDim         (128,128,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 999

TMA Desc Addr:   0x7ffd4ea7f380
format         9
dim            3
gmem_address   0x7f18dc800000
globalDim      (2048,1024,1,1,1)
globalStrides  (2,4096,0,0,0)
boxDim         (64,32,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 999

TMA Desc Addr:   0x7ffd4ea7f380
format         9
dim            3
gmem_address   0x7f18dc800000
globalDim      (2048,1024,1,1,1)
globalStrides  (2,4096,0,0,0)
boxDim         (64,32,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 999
```

If we remove the name rewrite hack we no longer get the above, but instead all our results are wrong. While this no longer throws the error, we get broken numeric testing: `pytest test/inductor/test_fp8.py -k "rowwise"`

```Shell
AssertionError: Tensor-likes are not close!

Mismatched elements: 16345 / 16384 (99.8%)
Greatest absolute difference: inf at index (513, 5) (up to 0.07 allowed)
Greatest relative difference: inf at index (513, 5) (up to 0.01 allowed)
```

### Update

We are getting IMA reads:

```Shell
========= Invalid __global__ read of size 4 bytes
=========     at void cutlass::device_kernel<cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<(int)8, cute::tuple<cute::C<(int)2>, cute::C<(int)1>, cute::C<(int)1>>, cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum>, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, cutlass::float_e4m3_t, cute::tuple<long, cute::C<(int)1>, long>, cutlass::float_e4m3_t, cute::tuple<long, cute::C<(int)1>, long>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<(cute::GMMA::ScaleIn)1, (cute::GMMA::ScaleIn)1>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)8>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)128>>, cute::tuple<cute::C<(int)128>, cute::C<(int)1>>>>, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)8>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)128>>, cute::tuple<cute::C<(int)128>, cute::C<(int)1>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<(int)4, (int)2, (int)16, (bool)1, (bool)0>, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, cute::tuple<cute::C<(int)64>, cute::C<(int)32>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, long>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, long>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::epilogue::thread::Identity, cutlass::bfloat16_t, float, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::plus, float, float, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90RowBroadcast<(int)0, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, cutlass::bfloat16_t, cute::tuple<cute::C<(int)0>, cute::C<(int)1>, cute::C<(int)0>>, (int)8, (bool)1>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90RowBroadcast<(int)0, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, float, cute::tuple<cute::C<(int)0>, cute::C<(int)1>, cute::C<(int)0>>, (int)4, (bool)1>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90ColBroadcast<(int)0, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, float, cute::tuple<cute::C<(int)1>, cute::C<(int)0>, cute::C<(int)0>>, (int)4, (bool)1>, cutlass::epilogue::fusion::Sm90AccFetch>>>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)2, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)32>>, cute::tuple<cute::C<(int)32>, cute::C<(int)1>>>>, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<(int)2, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)32>>, cute::tuple<cute::C<(int)32>, cute::C<(int)1>>>>, cute::SM90_U32x4_STSM_N, cute::Copy_Atom<cute::SM90_U32x4_STSM_N, cutlass::half_t>>, void, void>>(T1::Params)+0x1a80
=========     by thread (195,0,0) in block (4,0,0)
=========     Address 0x7f1e02608680 is out of bounds
=========     and is 896 bytes before the nearest allocation at 0x7f1e02608a00 of size 16,384 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2f19a3]
=========                 in /lib64/libcuda.so.1
=========     Host Frame: [0x15b79]
=========                 in /usr/local/cuda-12.4/lib64/libcudart.so.12
```

### Update 2

The IMA seems very similar to #133334, but now on scale_b instead of scale_a.

[ghstack-poisoned]
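Since the sanitizer shows the bad read landing 896 bytes before the nearest allocation (which for large shapes is scales_b), one quick thing to rule out from Python is the address and extent of each scale tensor. A hedged sketch, separate from the repro above: the tensors are stand-ins, and the 16-byte figure is the alignment `cuTensorMapEncodeTiled` requires of a non-interleaved global address, not a number taken from this PR.

```Python
import torch

# Stand-in scale tensors (not the failing ones).
scales_a = torch.randn(1024, device="cuda", dtype=torch.float)
scales_b = torch.randn(2048, device="cuda", dtype=torch.float)

for name, t in (("scales_a", scales_a), ("scales_b", scales_b)):
    ptr = t.data_ptr()
    nbytes = t.numel() * t.element_size()
    # A misaligned base address or an allocation shorter than the TMA box
    # would both show up as the kind of out-of-bounds read quoted above.
    print(f"{name}: addr={hex(ptr)} 16B-aligned={ptr % 16 == 0} size={nbytes} bytes")
```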
@pytorchbot merge
This PR updates the submodule third_party/cutlass. If those updates are intentional, please add the "submodule" keyword to the PR title/description.
```cpp
#include <c10/macros/Macros.h>

// Two warnings in Cutlass included header files
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wset-but-not-used")
```
Technically it's good to pair each `C10_DIAGNOSTIC_PUSH` with a `C10_DIAGNOSTIC_POP`.
...
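A minimal sketch of the paired form being suggested, assuming the stock c10 macros from `c10/macros/Macros.h` (`C10_DIAGNOSTIC_POP()` restores the previous diagnostic state); the second suppressed warning and the CUTLASS include shown are illustrative assumptions based on the unused-variable diagnostic quoted in the summary:

```cpp
#include <c10/macros/Macros.h>

// Suppress the two warnings only around the CUTLASS headers, then restore
// the diagnostic state so the rest of the translation unit is still checked.
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wset-but-not-used")
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-variable")  // assumed second warning
#include <cutlass/cutlass.h>  // illustrative CUTLASS include
C10_DIAGNOSTIC_POP()
C10_DIAGNOSTIC_POP()
```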
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
ghstack-source-id: ce4ebc0
Pull Request resolved: pytorch/pytorch#144000
Stack from ghstack (oldest at bottom):
## Summary
Follow-up PR to #143515. That PR added a bunch of macro switches to ensure both 3.4 and 3.5.1 built successfully. This PR actually bumps the Cutlass pin to 3.5.1.
I am going to do a stack on top to add conditional gates for 3.6, hijacking the 3.4 switches. We will leapfrog our way to the top :)
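For context, a minimal sketch of the version-gating pattern described above, assuming the `CUTLASS_MAJOR`/`CUTLASS_MINOR` macros from `cutlass/version.h`; the actual switches added in #143515 may be named and placed differently:

```cpp
#include <cutlass/version.h>  // defines CUTLASS_MAJOR, CUTLASS_MINOR, CUTLASS_PATCH

// Build the 3.5.1-specific path when available, otherwise fall back to 3.4;
// once 3.6 lands, the same gate can be re-pointed at the 3.5 code path.
#if CUTLASS_MAJOR > 3 || (CUTLASS_MAJOR == 3 && CUTLASS_MINOR >= 5)
  // CUTLASS 3.5.1 code path
#else
  // CUTLASS 3.4 fallback
#endif
```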
cc @ptrblck @msaroufim @eqy @yanbing-j @vkuzo @albanD @kadeng @penguinwu