[mlir][AMDGPU] Implement gpu.subgroup_reduce with DPP intrinsics on AMD GPUs #133204

Merged Apr 24, 2025 (28 commits)

Changes from 1 commit. Commits (all by Muzammiluddin-Syed-ECE):
029b2cc  Creates AMDToGPUPass to house a subgroup reduce lowering pattern to DPP (Mar 25, 2025)
427c817  Fix for numerical issues in MatVec tests (Apr 2, 2025)
655251b  Rewrites pattern to be closer to device lib impl. (Apr 3, 2025)
081d6f7  Removes AMDToGPUPass, moving pattern into existing pass (Apr 3, 2025)
0d560c2  Adding permlanex16 and other dpp related ops to mlir dialect (Apr 10, 2025)
015e9b9  Fixing permlanex16 intrinsic failure (Apr 11, 2025)
945f0e8  simplify verbose typing (Apr 11, 2025)
1b356ed  testing numerics (Apr 12, 2025)
7fd30c0  fixing (Apr 12, 2025)
0c28b4d  fixing (Apr 12, 2025)
bfda712  fixing (Apr 12, 2025)
54c08ef  trying again (Apr 14, 2025)
6535bda  Fixing implementation (Apr 14, 2025)
85e3b62  Adding DPP test (Apr 14, 2025)
3392f08  Addressing PR comments (Apr 14, 2025)
b59922a  removing unnecessary header (Apr 14, 2025)
6431293  Addressing PR comments (Apr 16, 2025)
ae25fa0  moving permlanex16 changes to another commit (Apr 16, 2025)
9745098  fixing test (Apr 16, 2025)
a6c35b3  fixing code formatting (Apr 16, 2025)
8a9cefb  Updating implementation to support gfx 10+ (Apr 16, 2025)
c395203  Small formatting change (Apr 16, 2025)
ab15c44  Removing ReadlaneOps from test (Apr 16, 2025)
55f442e  Improve dpp implementation (Apr 16, 2025)
6442288  fixing formatting (Apr 17, 2025)
848c6ba  Fixing implementation of DPP subgroup reduce (Apr 22, 2025)
6da1653  Addressing PR comments (Apr 23, 2025)
e19a615  Fixing Typo in RUN command (Apr 23, 2025)
Commit 97450983127a0ce7ca43d4e120fe84215225ebbd: fixing test
Muzammiluddin-Syed-ECE committed Apr 16, 2025
Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
mlir/test/Dialect/GPU/subgroup-reduce-lowering.mlir: 15 changes (14 additions, 1 deletion)
@@ -32,11 +32,15 @@ gpu.module @kernels {
// CHECK-SUB: %[[R2:.+]] = gpu.subgroup_reduce add %[[E2]] : (f16) -> f16
// CHECK-SUB: %[[V2:.+]] = vector.insert %[[R2]], %[[V1]] [4] : f16 into vector<5xf16>
// CHECK-SUB: "test.consume"(%[[V2]]) : (vector<5xf16>) -> ()
+// CHECK-DPP-COUNT-6: amdgpu.dpp
+// CHECK-DPP: rocdl.readlane
%sum0 = gpu.subgroup_reduce add %arg0 : (vector<5xf16>) -> (vector<5xf16>)
"test.consume"(%sum0) : (vector<5xf16>) -> ()

// CHECK-SUB-COUNT-3: gpu.subgroup_reduce mul {{.+}} uniform
// CHECK-SUB: "test.consume"
+// CHECK-DPP-COUNT-6: amdgpu.dpp
+// CHECK-DPP: rocdl.readlane
%sum1 = gpu.subgroup_reduce mul %arg0 uniform : (vector<5xf16>) -> (vector<5xf16>)
"test.consume"(%sum1) : (vector<5xf16>) -> ()

@@ -66,11 +70,15 @@ gpu.module @kernels {
// CHECK-SUB: %[[R0:.+]] = gpu.subgroup_reduce add %[[E0]] : (f32) -> f32
// CHECK-SUB: %[[V0:.+]] = vector.broadcast %[[R0]] : f32 to vector<1xf32>
// CHECK-SUB: "test.consume"(%[[V0]]) : (vector<1xf32>) -> ()
+// CHECK-DPP-COUNT-6: amdgpu.dpp
+// CHECK-DPP: rocdl.readlane
%sum0 = gpu.subgroup_reduce add %arg0 : (vector<1xf32>) -> (vector<1xf32>)
"test.consume"(%sum0) : (vector<1xf32>) -> ()

// CHECK-SUB: gpu.subgroup_reduce add {{.+}} uniform : (f32) -> f32
// CHECK-SUB: "test.consume"
+// CHECK-DPP-COUNT-6: amdgpu.dpp
+// CHECK-DPP: rocdl.readlane
%sum1 = gpu.subgroup_reduce add %arg0 uniform : (vector<1xf32>) -> (vector<1xf32>)
"test.consume"(%sum1) : (vector<1xf32>) -> ()

@@ -84,6 +92,7 @@ gpu.module @kernels {

// CHECK-SUB: gpu.subgroup_reduce add {{.+}} uniform cluster(size = 8, stride = 4) : (f32) -> f32
// CHECK-SUB: "test.consume"
+// CHECK-DPP-NOT: amdgpu.dpp
%sum3 = gpu.subgroup_reduce add %arg0 uniform cluster(size = 8, stride = 4) : (vector<1xf32>) -> (vector<1xf32>)
"test.consume"(%sum3) : (vector<1xf32>) -> ()

@@ -137,6 +146,9 @@ gpu.module @kernels {
// CHECK-SHFL: %[[S4:.+]], %{{.+}} = gpu.shuffle xor %[[A3]], %[[C16]], %[[C32]] : i32
// CHECK-SHFL: %[[A4:.+]] = arith.addi %[[A3]], %[[S4]] : i32
// CHECK-SHFL: "test.consume"(%[[A4]]) : (i32) -> ()

+// CHECK-DPP-COUNT-6: amdgpu.dpp
+// CHECK-DPP: rocdl.readlane
%sum0 = gpu.subgroup_reduce add %arg0 : (i32) -> i32
"test.consume"(%sum0) : (i32) -> ()

@@ -258,7 +270,6 @@ gpu.module @kernels {
// CHECK-SHFL-LABEL: gpu.func @kernel5(
// CHECK-SHFL-SAME: %[[ARG0:.+]]: i16)
// CHECK-DPP-LABEL: gpu.func @kernel5(
-// CHECK-DPP-NOT: amdgpu.dpp
gpu.func @kernel5(%arg0: i16) kernel {
// CHECK-SHFL: %[[E0:.+]] = arith.extui %[[ARG0]] : i16 to i32
// CHECK-SHFL: %[[S0:.+]], %{{.+}} = gpu.shuffle xor %[[E0]], {{.+}} : i32
@@ -270,6 +281,8 @@ gpu.module @kernels {
// CHECK-SHFL: arith.trunci {{.+}} : i32 to i16
// CHECK-SHFL: %[[AL:.+]] = arith.addi {{.+}} : i16
// CHECK-SHFL: "test.consume"(%[[AL]]) : (i16) -> ()
+// CHECK-DPP-COUNT-6: amdgpu.dpp
+// CHECK-DPP: rocdl.readlane
%sum0 = gpu.subgroup_reduce add %arg0 : (i16) -> i16
"test.consume"(%sum0) : (i16) -> ()

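A note on what the new CHECK-DPP expectations encode: for a full-wavefront reduction the lowering is expected to emit a short chain of amdgpu.dpp permutations interleaved with combine ops, ending in a single rocdl.readlane; six DPP steps correspond to log2(64) for a 64-wide wave. Below is a minimal conceptual sketch in Python, not the pass's actual implementation or its MLIR output: the function name, the xor-partner pattern, and the choice of lane read at the end are illustrative assumptions standing in for the real DPP row/bank permutations.

# Conceptual model only: mimics the shape the CHECK-DPP lines expect, i.e.
# log2(64) = 6 combine steps across a 64-lane wavefront followed by a single
# readlane of the reduced value. The partner pattern and final lane are
# illustrative assumptions, not the pass's actual DPP permutations.
def subgroup_reduce_add(lanes):
    """Reduce one value per lane; `lanes` holds one element per lane of the wave."""
    width = len(lanes)              # e.g. 64; assumed to be a power of two
    vals = list(lanes)
    offset = 1
    while offset < width:           # 6 iterations when width == 64
        # One step combines each lane with a partner lane, standing in for one
        # amdgpu.dpp permutation plus the arithmetic combine op in the generated IR.
        vals = [vals[i] + vals[i ^ offset] for i in range(width)]
        offset *= 2
    # Stand-in for the final rocdl.readlane: read the result from one lane.
    return vals[width - 1]

assert subgroup_reduce_add(list(range(64))) == sum(range(64))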