[Submodule] Bump Cutlass to 3.5.1 OSS (#144000)
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144000
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 10b0f6c with merge base 2409b49.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@drisspg We can suppress the unused-variable warning from the CUTLASS headers for now, and fix it upstream if it's not already fixed. I reported the issue a while ago.
```Python
import torch
import sys

M, K, N = 256, 32, 256

# Create FP8 tensors
a = torch.randn((M, K), device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn)
b = torch.randn((N, K), device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn)
scales_a = torch.randn(M, device="cuda", dtype=torch.float)
scales_b = torch.randn(N, device="cuda", dtype=torch.float)

print("Scales_b address: ", hex(scales_b.data_ptr()))
sys.stdout.flush()

# Perform scaled matrix multiply
result = torch._scaled_mm(
    a,
    b.t(),
    scale_a=scales_a[:, None],
    scale_b=scales_b[None, :],
    out_dtype=torch.bfloat16,
    use_fast_accum=True,
)
```

Running with:

```Shell
function sanitize() {
  CUDA_LAUNCH_BLOCKING=1 \
  PYTORCH_NO_CUDA_MEMORY_CACHING=1 \
  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 \
  compute-sanitizer --tool memcheck "$@"
}
```

For big M, K, N the nearest allocation is scales_b. When N % 256 == 0 there is no more memory error; however, even though the memory error disappears for the 256-aligned shapes, the kernel still appears to be misbehaving, since in unit tests the numerical deviation is much higher.
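The repro above would then be run under the sanitizer as, e.g., `sanitize python repro.py` (with the snippet saved to a hypothetical `repro.py`).

For reference, here is a minimal sketch of the kind of rowwise numerics check the failing unit tests perform. It is an illustration, not the actual test in `test/inductor/test_fp8.py`: the shapes, the `rowwise_quant` helper, and the e4m3 max constant are assumptions, and `torch._scaled_mm` rowwise scaling needs an SM90 GPU and a recent build. On a correct kernel the printed deviations should be small; a broken one shows the inf-sized gaps quoted in the PR description below.

```Python
import torch

M, K, N = 1024, 512, 2048  # assumed shapes; N is a multiple of 256

x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
w = torch.randn(N, K, device="cuda", dtype=torch.bfloat16)

E4M3_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

def rowwise_quant(t):
    # Hypothetical helper: pick a per-row scale that maps each row onto
    # the e4m3 range, then quantize.
    scale = (t.abs().amax(dim=1, keepdim=True).float() / E4M3_MAX).clamp_min(1e-12)
    return (t.float() / scale).to(torch.float8_e4m3fn), scale

xq, x_scale = rowwise_quant(x)
wq, w_scale = rowwise_quant(w)

out = torch._scaled_mm(
    xq,
    wq.t(),
    scale_a=x_scale,      # (M, 1) rowwise scale
    scale_b=w_scale.t(),  # (1, N) columnwise scale for the transposed operand
    out_dtype=torch.bfloat16,
    use_fast_accum=True,
)

# Compare against a plain bf16 matmul of the unquantized inputs, mirroring
# the "greatest absolute/relative difference" lines from the failing test.
ref = (x @ w.t()).float()
diff = (out.float() - ref).abs()
print("greatest absolute difference:", diff.max().item())
print("greatest relative difference:", (diff / ref.abs().clamp_min(1e-6)).max().item())
```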
ohhhh actually I just realized we flipped the xscale/w_scale because of a runtime error, but we didn't flip the EVT args.. trying now. Runtime error for provenance:
@Skylion007 This I will clean up and rebase after everything lands. That will bump the actual third-party submodule, and then I will undo all my ifdefs. After that lands I hope to bump to 3.6 as a fast follow.
## Summary

Upon bumping the 3.5.1 pin we get:

```Shell
[3093/3116] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/RowwiseScaledMM.cu.o
/home/drisspg/meta/pytorch/aten/src/ATen/native/cuda/RowwiseScaledMM.cu(17): warning #177-D: function "nvrtc_cuTensorMapEncodeTiled" was declared but never referenced
  static CUresult nvrtc_cuTensorMapEncodeTiled(
                  ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
```

And bad TMA Init

```Shell
TMA Desc Addr:   0x7ffd4ea7f3c0
format         0
dim            3
gmem_address   0x7f18db683000
globalDim      (512,2048,1,1,1)
globalStrides  (1,512,0,0,0)
boxDim         (128,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 999

TMA Desc Addr:   0x7ffd4ea7f3c0
format         0
dim            3
gmem_address   0x7f18db603000
globalDim      (512,1024,1,1,1)
globalStrides  (1,512,0,0,0)
boxDim         (128,128,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 999

TMA Desc Addr:   0x7ffd4ea7f380
format         9
dim            3
gmem_address   0x7f18dc800000
globalDim      (2048,1024,1,1,1)
globalStrides  (2,4096,0,0,0)
boxDim         (64,32,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 999

TMA Desc Addr:   0x7ffd4ea7f380
format         9
dim            3
gmem_address   0x7f18dc800000
globalDim      (2048,1024,1,1,1)
globalStrides  (2,4096,0,0,0)
boxDim         (64,32,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 999
```

If we remove the name rewrite hack we no longer get the above, but instead all our results are wrong. While this no longer throws the error, we get broken numeric testing: `pytest test/inductor/test_fp8.py -k "rowwise"`

```Shell
AssertionError: Tensor-likes are not close!

Mismatched elements: 16345 / 16384 (99.8%)
Greatest absolute difference: inf at index (513, 5) (up to 0.07 allowed)
Greatest relative difference: inf at index (513, 5) (up to 0.01 allowed)
```

### Update

We are getting IMA reads:

```Shell
========= Invalid __global__ read of size 4 bytes
=========     at void cutlass::device_kernel<cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<(int)8, cute::tuple<cute::C<(int)2>, cute::C<(int)1>, cute::C<(int)1>>, cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccum>, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, cutlass::float_e4m3_t, cute::tuple<long, cute::C<(int)1>, long>, cutlass::float_e4m3_t, cute::tuple<long, cute::C<(int)1>, long>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x128x32_F32E4M3E4M3_SS_TN<(cute::GMMA::ScaleIn)1, (cute::GMMA::ScaleIn)1>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)8>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)128>>, cute::tuple<cute::C<(int)128>, cute::C<(int)1>>>>, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)8>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)128>>, cute::tuple<cute::C<(int)128>, cute::C<(int)1>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<(int)4, (int)2, (int)16, (bool)1, (bool)0>, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, cute::tuple<cute::C<(int)64>, cute::C<(int)32>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, long>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, long>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::epilogue::thread::Identity, cutlass::bfloat16_t, float, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::plus, float, float, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90RowBroadcast<(int)0, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, cutlass::bfloat16_t, cute::tuple<cute::C<(int)0>, cute::C<(int)1>, cute::C<(int)0>>, (int)8, (bool)1>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90RowBroadcast<(int)0, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, float, cute::tuple<cute::C<(int)0>, cute::C<(int)1>, cute::C<(int)0>>, (int)4, (bool)1>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::multiplies, float, float, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90ColBroadcast<(int)0, cute::tuple<cute::C<(int)64>, cute::C<(int)128>, cute::C<(int)128>>, float, cute::tuple<cute::C<(int)1>, cute::C<(int)0>, cute::C<(int)0>>, (int)4, (bool)1>, cutlass::epilogue::fusion::Sm90AccFetch>>>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)2, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)32>>, cute::tuple<cute::C<(int)32>, cute::C<(int)1>>>>, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<(int)2, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)32>>, cute::tuple<cute::C<(int)32>, cute::C<(int)1>>>>, cute::SM90_U32x4_STSM_N, cute::Copy_Atom<cute::SM90_U32x4_STSM_N, cutlass::half_t>>, void, void>>(T1::Params)+0x1a80
=========     by thread (195,0,0) in block (4,0,0)
=========     Address 0x7f1e02608680 is out of bounds
=========     and is 896 bytes before the nearest allocation at 0x7f1e02608a00 of size 16,384 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2f19a3]
=========                 in /lib64/libcuda.so.1
=========     Host Frame: [0x15b79]
=========                 in /usr/local/cuda-12.4/lib64/libcudart.so.12
```

### Update 2

The IMA seems very similar to #133334, but now on scale_b instead of scale_a.

[ghstack-poisoned]
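Since the sanitizer shows the bad read landing 896 bytes before the nearest allocation (which for large shapes is scales_b), one quick thing to rule out from Python is the address and extent of each scale tensor. A hedged sketch, separate from the repro above: the tensors are stand-ins, and the 16-byte figure is the alignment `cuTensorMapEncodeTiled` requires of a non-interleaved global address, not a number taken from this PR.

```Python
import torch

# Stand-in scale tensors (not the failing ones).
scales_a = torch.randn(1024, device="cuda", dtype=torch.float)
scales_b = torch.randn(2048, device="cuda", dtype=torch.float)

for name, t in (("scales_a", scales_a), ("scales_b", scales_b)):
    ptr = t.data_ptr()
    nbytes = t.numel() * t.element_size()
    # A misaligned base address or an allocation shorter than the TMA box
    # would both show up as the kind of out-of-bounds read quoted above.
    print(f"{name}: addr={hex(ptr)} 16B-aligned={ptr % 16 == 0} size={nbytes} bytes")
```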
@pytorchbot merge
This PR updates the submodule third_party/cutlass. If those updates are intentional, please add the "submodule" keyword to the PR title/description.
```cpp
#include <c10/macros/Macros.h>

// Two warnings in Cutlass included header files
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wset-but-not-used")
```
Technically it's good to pair each `C10_DIAGNOSTIC_PUSH` with a `C10_DIAGNOSTIC_POP`.
...
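A minimal sketch of the paired form being suggested, assuming the stock c10 macros from `c10/macros/Macros.h` (`C10_DIAGNOSTIC_POP()` restores the previous diagnostic state); the second suppressed warning and the CUTLASS include shown are illustrative assumptions based on the unused-variable diagnostic quoted in the summary:

```cpp
#include <c10/macros/Macros.h>

// Suppress the two warnings only around the CUTLASS headers, then restore
// the diagnostic state so the rest of the translation unit is still checked.
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wset-but-not-used")
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-variable")  // assumed second warning
#include <cutlass/cutlass.h>  // illustrative CUTLASS include
C10_DIAGNOSTIC_POP()
C10_DIAGNOSTIC_POP()
```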
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
ghstack-source-id: ce4ebc0
Pull Request resolved: pytorch/pytorch#144000
Stack from ghstack (oldest at bottom):
## Summary
Follow-up PR to #143515. That PR added a bunch of macro switches to ensure both 3.4 and 3.5.1 built successfully. This PR actually bumps the Cutlass pin to 3.5.1.
I am going to do a stack on top to add conditional gates for 3.6, hijacking the 3.4 switches. We will leapfrog our way to the top :)
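For context, a minimal sketch of the version-gating pattern described above, assuming the `CUTLASS_MAJOR`/`CUTLASS_MINOR` macros from `cutlass/version.h`; the actual switches added in #143515 may be named and placed differently:

```cpp
#include <cutlass/version.h>  // defines CUTLASS_MAJOR, CUTLASS_MINOR, CUTLASS_PATCH

// Build the 3.5.1-specific path when available, otherwise fall back to 3.4;
// once 3.6 lands, the same gate can be re-pointed at the 3.5 code path.
#if CUTLASS_MAJOR > 3 || (CUTLASS_MAJOR == 3 && CUTLASS_MINOR >= 5)
  // CUTLASS 3.5.1 code path
#else
  // CUTLASS 3.4 fallback
#endif
```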
cc @ptrblck @msaroufim @eqy @yanbing-j @vkuzo @albanD @kadeng @penguinwu