[Submodule] Bump Cutlass to 3.5.1 OSS PR #144000


Closed
wants to merge 8 commits

Conversation

drisspg
Contributor

@drisspg drisspg commented Dec 30, 2024

Stack from ghstack (oldest at bottom):

Summary

Follow-up PR to #143515. That PR added a set of macro switches to ensure both 3.4 and 3.5.1 build successfully. This PR actually bumps the CUTLASS pin to 3.5.1.

I am going to stack a PR on top that adds conditional gates for 3.6, hijacking the 3.4 switches. We will leapfrog our way to the top :)
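
As a rough illustration of the gating pattern (a minimal sketch only; the switches actually added in #143515 may be named and structured differently), the idea is to key code paths off the version macros CUTLASS ships in `cutlass/version.h`:

```cpp
// Minimal sketch of version-gated code paths; assumes the CUTLASS_MAJOR /
// CUTLASS_MINOR macros from cutlass/version.h. The real switches in #143515
// may differ.
#include <cutlass/version.h>

#if (CUTLASS_MAJOR > 3) || (CUTLASS_MAJOR == 3 && CUTLASS_MINOR >= 5)
// Path compiled once the submodule pin is at 3.5.x (and later hijacked for 3.6).
constexpr bool kBuiltAgainstCutlass35 = true;
#else
// Fallback path that keeps the 3.4 pin building.
constexpr bool kBuiltAgainstCutlass35 = false;
#endif
```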

cc @ptrblck @msaroufim @eqy @yanbing-j @vkuzo @albanD @kadeng @penguinwu

[ghstack-poisoned]

pytorch-bot bot commented Dec 30, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144000

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 10b0f6c with merge base 2409b49:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

drisspg added a commit that referenced this pull request Dec 30, 2024
ghstack-source-id: 962588b
Pull Request resolved: #144000
@drisspg drisspg added module: cuda Related to torch.cuda, and CUDA support in general topic: not user facing topic category module: float8 For torch.float8_e5m2 and torch.float8_e4m3 labels Dec 30, 2024
@drisspg drisspg requested a review from lw December 30, 2024 19:49
@Skylion007
Collaborator

@drisspg We can suppress the warning for the CUTLASS headers' unused variable for now, and fix it upstream if it's not already fixed. I reported the issue a while ago.

@drisspg
Contributor Author

drisspg commented Dec 31, 2024

```Python
import sys
import torch

M, K, N = 256, 32, 256

# Create fp8 operands and rowwise scales
a = torch.randn((M, K), device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn)
b = torch.randn((N, K), device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn)
scales_a = torch.randn(M, device="cuda", dtype=torch.float)
scales_b = torch.randn(N, device="cuda", dtype=torch.float)

print("Scales_b address: ", hex(scales_b.data_ptr()))
sys.stdout.flush()

# Perform scaled matrix multiply
result = torch._scaled_mm(
    a,
    b.t(),
    scale_a=scales_a[:, None],
    scale_b=scales_b[None, :],
    out_dtype=torch.bfloat16,
    use_fast_accum=True,
)
```

Running with:

```Shell
function sanitize() {
    # Prefix the env vars onto the command so they reach the process launched
    # under compute-sanitizer (plain assignments would only set shell-local
    # variables).
    CUDA_LAUNCH_BLOCKING=1 \
    PYTORCH_NO_CUDA_MEMORY_CACHING=1 \
    TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 \
    compute-sanitizer --tool memcheck "$@"
}
```

For big M, K, N the nearest allocation is scales_b. When N % 256 == 0 the memory error goes away; however, even for those 256-aligned shapes the kernel still appears to be misbehaving, since the numerical deviation in the unit tests is much higher.
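
For intuition, a purely illustrative sketch (hypothetical tile width, not the real CUTLASS epilogue code): if a broadcast epilogue reads a fixed-width tile of the scale vector without masking the tail, accesses can land outside the scales_b allocation whenever N is not a multiple of the tile width, which would be consistent with the memory error disappearing only for 256-aligned shapes:

```cpp
#include <cstddef>

// Illustration only: an unmasked, fixed-width read of the scale vector covers
// indices [tile_start, tile_start + kTileN), which falls outside an n-element
// allocation for the tail tile whenever n is not a multiple of kTileN.
constexpr std::size_t kTileN = 256;  // assumed tile width, for illustration only

bool tile_read_in_bounds(std::size_t tile_start, std::size_t n) {
  return tile_start + kTileN <= n;
}
```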

@drisspg
Contributor Author

drisspg commented Dec 31, 2024

Ohhh, actually I just realized we flipped the x_scale/w_scale because of a runtime error, but we didn't flip the EVT args... trying now.

runtime error for provenance:

```Shell
test/test_matmul_cuda.py terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: misaligned address
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /home/drisspg/meta/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f01a4f92c88 in /home/drisspg/meta/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0x7f01a4f3cde6 in /home/drisspg/meta/pytorch/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3be (0x7f01a6521fde in /home/drisspg/meta/pytorch/torch/lib/libc10_cuda.so)
```
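
To make the realization above concrete (hypothetical names only, not the actual RowwiseScaledMM.cu argument structs): if the mainloop operands were swapped to dodge the misaligned-address error, the epilogue-visitor-tree (EVT) scale arguments have to be swapped to match, otherwise each output dimension is multiplied by the wrong scale vector:

```cpp
// Hypothetical sketch: not the real CUTLASS EVT argument types, just an
// illustration of keeping the scale operands consistent when the GEMM
// inputs are swapped.
struct EvtScaleArgs {
  const float* row_scale;  // broadcast along rows of the output
  const float* col_scale;  // broadcast along columns of the output
};

EvtScaleArgs make_evt_scale_args(const float* x_scale, const float* w_scale,
                                 bool operands_swapped) {
  // If A/B were flipped in the mainloop, the roles of the two scale vectors
  // flip as well; forgetting this can silence the error while the math stays wrong.
  return operands_swapped ? EvtScaleArgs{w_scale, x_scale}
                          : EvtScaleArgs{x_scale, w_scale};
}
```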

@Skylion007
Collaborator

@drisspg Isn't this missing the other CUTLASS fixes from #131493?

@drisspg
Contributor Author

drisspg commented Jan 2, 2025

@Skylion007
Due to how we need to get this to work internally, this is a multi-stage land:
#143551, and the main one here: #143515

I will clean up and rebase this one after everything lands; it will bump the actual third-party submodule, and then I will undo all my ifdefs.

After that lands I hope to bump to 3.6 as a fast follow.


## Summary
Upon bumping the pin to 3.5.1 we get:

```Shell
[3093/3116] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/RowwiseScaledMM.cu.o
/home/drisspg/meta/pytorch/aten/src/ATen/native/cuda/RowwiseScaledMM.cu(17): warning #177-D: function "nvrtc_cuTensorMapEncodeTiled" was declared but never referenced
  static CUresult nvrtc_cuTensorMapEncodeTiled(
                  ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
```
And a bad TMA init:
```Shell
TMA Desc Addr:   0x7ffd4ea7f3c0
format         0
dim            3
gmem_address   0x7f18db683000
globalDim      (512,2048,1,1,1)
globalStrides  (1,512,0,0,0)
boxDim         (128,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 999
TMA Desc Addr:   0x7ffd4ea7f3c0
format         0
dim            3
gmem_address   0x7f18db603000
globalDim      (512,1024,1,1,1)
globalStrides  (1,512,0,0,0)
boxDim         (128,128,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 999
TMA Desc Addr:   0x7ffd4ea7f380
format         9
dim            3
gmem_address   0x7f18dc800000
globalDim      (2048,1024,1,1,1)
globalStrides  (2,4096,0,0,0)
boxDim         (64,32,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 999
TMA Desc Addr:   0x7ffd4ea7f380
format         9
dim            3
gmem_address   0x7f18dc800000
globalDim      (2048,1024,1,1,1)
globalStrides  (2,4096,0,0,0)
boxDim         (64,32,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 999
```

If we remove the name rewrite hack we no longer get the above error; however, the results are now numerically wrong and the tests fail:

`pytest test/inductor/test_fp8.py -k "rowwise"`
```Shell
AssertionError: Tensor-likes are not close!

Mismatched elements: 16345 / 16384 (99.8%)
Greatest absolute difference: inf at index (513, 5) (up to 0.07 allowed)
Greatest relative difference: inf at index (513, 5) (up to 0.01 allowed)

```

### Update: we are getting IMA reads
```Shell
========= Invalid __global__ read of size 4 bytes
=========     at void cutlass::device_kernel<cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<(int)8, cute::tuple<cute::C<(int)2>, cute::C<(int)1>, cute::C<(int)1>>, cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum>, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, cutlass::float_e4m3_t, cute::tuple<long, cute::C<(int)1>, long>, cutlass::float_e4m3_t, cute::tuple<long, cute::C<(int)1>, long>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<(cute::GMMA::ScaleIn)1, (cute::GMMA::ScaleIn)1>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)8>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)128>>, cute::tuple<cute::C<(int)128>, cute::C<(int)1>>>>, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)8>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)128>>, cute::tuple<cute::C<(int)128>, cute::C<(int)1>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<(int)4, (int)2, (int)16, (bool)1, (bool)0>, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, cute::tuple<cute::C<(int)64>, cute::C<(int)32>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, long>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, long>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::epilogue::thread::Identity, cutlass::bfloat16_t, float, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::plus, float, float, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90RowBroadcast<(int)0, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, cutlass::bfloat16_t, cute::tuple<cute::C<(int)0>, cute::C<(int)1>, cute::C<(int)0>>, (int)8, (bool)1>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90RowBroadcast<(int)0, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, float, cute::tuple<cute::C<(int)0>, cute::C<(int)1>, cute::C<(int)0>>, (int)4, (bool)1>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90ColBroadcast<(int)0, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, float, cute::tuple<cute::C<(int)1>, cute::C<(int)0>, cute::C<(int)0>>, (int)4, (bool)1>, cutlass::epilogue::fusion::Sm90AccFetch>>>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)2, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)32>>, cute::tuple<cute::C<(int)32>, cute::C<(int)1>>>>, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<(int)2, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)32>>, cute::tuple<cute::C<(int)32>, cute::C<(int)1>>>>, 
cute::SM90_U32x4_STSM_N, cute::Copy_Atom<cute::SM90_U32x4_STSM_N, cutlass::half_t>>, void, void>>(T1::Params)+0x1a80
=========     by thread (195,0,0) in block (4,0,0)
=========     Address 0x7f1e02608680 is out of bounds
=========     and is 896 bytes before the nearest allocation at 0x7f1e02608a00 of size 16,384 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2f19a3]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x15b79]
=========                in /usr/local/cuda-12.4/lib64/libcudart.so.12
```

### Update 2

The IMA seems very similar to #133334, but now on scale_b instead of scale_a.

[ghstack-poisoned]
drisspg added a commit that referenced this pull request Jan 3, 2025
ghstack-source-id: 8c09856
Pull Request resolved: #144000
[ghstack-poisoned]
drisspg added a commit that referenced this pull request Jan 3, 2025
ghstack-source-id: 2948721
Pull Request resolved: #144000
[ghstack-poisoned]
drisspg added a commit that referenced this pull request Jan 3, 2025
ghstack-source-id: 5e8564e
Pull Request resolved: #144000
@drisspg drisspg added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 3, 2025
[ghstack-poisoned]
drisspg added a commit that referenced this pull request Jan 3, 2025
ghstack-source-id: 63b7f3b
Pull Request resolved: #144000
[ghstack-poisoned]
drisspg added a commit that referenced this pull request Jan 3, 2025
ghstack-source-id: 4fcaeb5
Pull Request resolved: #144000
[ghstack-poisoned]
drisspg added a commit that referenced this pull request Jan 3, 2025
ghstack-source-id: 9b80e35
Pull Request resolved: #144000
[ghstack-poisoned]
drisspg added a commit that referenced this pull request Jan 3, 2025
ghstack-source-id: ce4ebc0
Pull Request resolved: #144000
@drisspg
Contributor Author

drisspg commented Jan 3, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

This PR updates submodules third_party/cutlass

If those updates are intentional, please add "submodule" keyword to PR title/description.

```cpp
#include <c10/macros/Macros.h>

// Two warnings in Cutlass included header files
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wset-but-not-used")
```
Contributor


Technically it's good to pair each C10_DIAGNOSTIC_PUSH with C10_DIAGNOSTIC_POP...
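
A minimal sketch of the paired form the reviewer is suggesting (assuming the `C10_DIAGNOSTIC_POP()` macro from `c10/macros/Macros.h`; the include placement is a placeholder):

```cpp
#include <c10/macros/Macros.h>

// Suppress the "set but not used" warnings coming from the CUTLASS headers,
// then restore the previous diagnostic state once the includes are done.
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wset-but-not-used")
// #include <cutlass/...>  // CUTLASS includes would go here
C10_DIAGNOSTIC_POP()
```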

@drisspg drisspg changed the title Bump Cutlass to 3.5.1 OSS PR [Submodule] Bump Cutlass to 3.5.1 OSS PR Jan 4, 2025
@drisspg
Contributor Author

drisspg commented Jan 4, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Git-Hub-Chris pushed a commit to Git-Hub-Chris/PyTorch that referenced this pull request Jan 19, 2025
ghstack-source-id: ce4ebc0
Pull Request resolved: pytorch/pytorch#144000
@github-actions github-actions bot deleted the gh/drisspg/108/head branch February 4, 2025 02:03