[mlir][AMDGPU] Implement gpu.subgroup_reduce with DPP intrinsics on AMD GPUs #133204

Merged Apr 24, 2025 (28 commits)

Commits
029b2cc
Creates AMDToGPUPass to house a subgroup reduce lowering pattern to DPP
Muzammiluddin-Syed-ECE Mar 25, 2025
427c817
Fix for numerical issues in MatVec tests
Muzammiluddin-Syed-ECE Apr 2, 2025
655251b
Rewrites pattern to be closer to device lib impl.
Muzammiluddin-Syed-ECE Apr 3, 2025
081d6f7
Removes AMDToGPUPass, moving pattern into existing pass
Muzammiluddin-Syed-ECE Apr 3, 2025
0d560c2
Adding permlanex16 and other dpp related ops to mlir dialect
Muzammiluddin-Syed-ECE Apr 10, 2025
015e9b9
Fixing permlanex16 intrinsic failure
Muzammiluddin-Syed-ECE Apr 11, 2025
945f0e8
simplify verbose typing
Muzammiluddin-Syed-ECE Apr 11, 2025
1b356ed
testing numerics
Muzammiluddin-Syed-ECE Apr 12, 2025
7fd30c0
fixing
Muzammiluddin-Syed-ECE Apr 12, 2025
0c28b4d
fixing
Muzammiluddin-Syed-ECE Apr 12, 2025
bfda712
fixing
Muzammiluddin-Syed-ECE Apr 12, 2025
54c08ef
trying again
Muzammiluddin-Syed-ECE Apr 14, 2025
6535bda
Fixing implementation
Muzammiluddin-Syed-ECE Apr 14, 2025
85e3b62
Adding DPP test
Muzammiluddin-Syed-ECE Apr 14, 2025
3392f08
Addressing PR comments
Muzammiluddin-Syed-ECE Apr 14, 2025
b59922a
removing unnecessary header
Muzammiluddin-Syed-ECE Apr 14, 2025
6431293
Addressing PR comments
Muzammiluddin-Syed-ECE Apr 16, 2025
ae25fa0
moving permlanex16 changes to another commit
Muzammiluddin-Syed-ECE Apr 16, 2025
9745098
fixing test
Muzammiluddin-Syed-ECE Apr 16, 2025
a6c35b3
fixing code formatting
Muzammiluddin-Syed-ECE Apr 16, 2025
8a9cefb
Updating implementation to support gfx 10+
Muzammiluddin-Syed-ECE Apr 16, 2025
c395203
Small formatting change
Muzammiluddin-Syed-ECE Apr 16, 2025
ab15c44
Removing ReadlaneOps from test
Muzammiluddin-Syed-ECE Apr 16, 2025
55f442e
Improve dpp implementation
Muzammiluddin-Syed-ECE Apr 16, 2025
6442288
fixing formatting
Muzammiluddin-Syed-ECE Apr 17, 2025
848c6ba
Fixing implementation of DPP subgroup reduce
Muzammiluddin-Syed-ECE Apr 22, 2025
6da1653
Addressing PR comments
Muzammiluddin-Syed-ECE Apr 23, 2025
e19a615
Fixing Typo in RUN command
Muzammiluddin-Syed-ECE Apr 23, 2025
fixing code formatting
Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
Muzammiluddin-Syed-ECE committed Apr 16, 2025
commit a6c35b3a88cc22eb5f01447cdd69f5b1c017fd4a
mlir/include/mlir/Dialect/GPU/Transforms/Passes.h (14 changes: 8 additions & 6 deletions)
@@ -63,6 +63,12 @@ void populateGpuLowerSubgroupReduceToShufflePatterns(
RewritePatternSet &patterns, unsigned subgroupSize,
unsigned shuffleBitwidth = 32, PatternBenefit benefit = 1);

+/// Disjoint counterpart of `populateGpuLowerSubgroupReduceToShufflePatterns`
+/// that only matches `gpu.subgroup_reduce` ops with a `cluster_size`.
+void populateGpuLowerClusteredSubgroupReduceToShufflePatterns(
+RewritePatternSet &patterns, unsigned subgroupSize,
+unsigned shuffleBitwidth = 32, PatternBenefit benefit = 1);
+
/// Collect a set of patterns to lower `gpu.subgroup_reduce` into `amdgpu.dpp`
/// ops over scalar types. Assumes that the subgroup has
/// `subgroupSize` lanes. Applicable only to AMD GPUs.
@@ -71,16 +77,12 @@ void populateGpuLowerSubgroupReduceToDPPPatterns(RewritePatternSet &patterns,
amdgpu::Chipset chipset,
PatternBenefit benefit = 1);

+/// Disjoint counterpart of `populateGpuLowerSubgroupReduceToDPPPatterns`
+/// that only matches `gpu.subgroup_reduce` ops with a `cluster_size`.
void populateGpuLowerClusteredSubgroupReduceToDPPPatterns(
RewritePatternSet &patterns, unsigned subgroupSize, amdgpu::Chipset chipset,
PatternBenefit benefit = 1);

-/// Disjoint counterpart of `populateGpuLowerSubgroupReduceToShufflePatterns`
-/// that only matches `gpu.subgroup_reduce` ops with a `cluster_size`.
-void populateGpuLowerClusteredSubgroupReduceToShufflePatterns(
-RewritePatternSet &patterns, unsigned subgroupSize,
-unsigned shuffleBitwidth = 32, PatternBenefit benefit = 1);
-
/// Collect all patterns to rewrite ops within the GPU dialect.
inline void populateGpuRewritePatterns(RewritePatternSet &patterns) {
populateGpuAllReducePatterns(patterns);
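
For readers unfamiliar with the op these declarations target, the following is a minimal CPU sketch of what a subgroup reduction with an optional `cluster_size` computes. It is not the DPP instruction sequence this PR emits: it uses a generic XOR butterfly exchange as a stand-in, and `simulateSubgroupReduce` is a hypothetical name introduced only for illustration. It assumes the cluster size is a power of two that divides the lane count and that the combining function is associative and commutative.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical CPU model: one vector element per lane of a subgroup.
// Assumption: clusterSize is a power of two that divides lanes.size().
// Passing clusterSize == lanes.size() models a non-clustered reduction.
std::vector<int32_t> simulateSubgroupReduce(
    std::vector<int32_t> lanes, unsigned clusterSize,
    const std::function<int32_t(int32_t, int32_t)> &combine) {
  const unsigned n = static_cast<unsigned>(lanes.size());
  assert(clusterSize != 0 && (clusterSize & (clusterSize - 1)) == 0);
  assert(n % clusterSize == 0);
  // Butterfly exchange: at each step lane i combines with lane (i ^ offset).
  // Since offset < clusterSize, a lane's partner is always in its own cluster.
  for (unsigned offset = 1; offset < clusterSize; offset <<= 1) {
    std::vector<int32_t> next(n);
    for (unsigned i = 0; i < n; ++i)
      next[i] = combine(lanes[i], lanes[i ^ offset]);
    lanes = next;
  }
  // Every lane now holds the reduction over its own cluster.
  return lanes;
}
```

With eight lanes holding 1 through 8 and addition as the combiner, a cluster size of 8 leaves every lane holding the full sum, while a cluster size of 4 leaves the first four lanes with the sum of 1..4 and the last four with the sum of 5..8.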
mlir/lib/Dialect/GPU/Transforms/SubgroupReduceLowering.cpp (23 changes: 12 additions & 11 deletions)
@@ -10,13 +10,13 @@
//
//===----------------------------------------------------------------------===//

+#include "mlir/Dialect/AMDGPU/IR/AMDGPUDialect.h"
#include "mlir/Dialect/AMDGPU/Utils/Chipset.h"
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
-#include "mlir/Dialect/LLVMIR/ROCDLDialect.h"
-#include "mlir/Dialect/AMDGPU/IR/AMDGPUDialect.h"
#include "mlir/Dialect/GPU/Transforms/Passes.h"
#include "mlir/Dialect/GPU/Utils/GPUUtils.h"
+#include "mlir/Dialect/LLVMIR/ROCDLDialect.h"
#include "mlir/Dialect/Vector/IR/VectorOps.h"
#include "mlir/IR/BuiltinTypes.h"
#include "mlir/IR/Location.h"
@@ -366,10 +366,11 @@ struct VectorSubgroupReduceToShuffles final
bool matchClustered = false;
};

-std::optional<Value> createSubgroupDPPReduction(OpBuilder &b, Location loc, Value input,
-gpu::AllReduceOperation mode,
-const ClusterInfo &ci,
-amdgpu::Chipset chipset) {
+std::optional<Value> createSubgroupDPPReduction(OpBuilder &b, Location loc,
+Value input,
+gpu::AllReduceOperation mode,
+const ClusterInfo &ci,
+amdgpu::Chipset chipset) {
Value dppResult;
Value result = input;
constexpr int allRows = 0xf;
@@ -510,11 +511,11 @@ void mlir::populateGpuLowerSubgroupReduceToDPPPatterns(
}

void mlir::populateGpuLowerClusteredSubgroupReduceToDPPPatterns(
-RewritePatternSet &patterns, unsigned subgroupSize, amdgpu::Chipset chipset,
-PatternBenefit benefit) {
-patterns.add<ScalarSubgroupReduceToDPP>(patterns.getContext(), subgroupSize,
-/*matchClustered=*/true, chipset,
-benefit);
+RewritePatternSet &patterns, unsigned subgroupSize, amdgpu::Chipset chipset,
+PatternBenefit benefit) {
+patterns.add<ScalarSubgroupReduceToDPP>(patterns.getContext(), subgroupSize,
+/*matchClustered=*/true, chipset,
+benefit);
}

void mlir::populateGpuLowerSubgroupReduceToShufflePatterns(
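
The "disjoint counterpart" wording in Passes.h and the `/*matchClustered=*/true` argument above suggest that the same pattern class is registered twice with opposite flags, so that any given op is claimed by exactly one registration. The sketch below models that idea in standalone C++. All types and function names in it (`ReduceOp`, `ReducePattern`, `populateNonClustered`, `populateClustered`, `countMatches`) are hypothetical stand-ins, not MLIR APIs, and the matching rule is an assumption inferred from the doc comments rather than copied from the pattern's implementation.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for a gpu.subgroup_reduce op.
struct ReduceOp {
  std::optional<unsigned> clusterSize; // empty when no cluster_size is set
};

// Hypothetical stand-in for a rewrite pattern with a matchClustered switch.
struct ReducePattern {
  bool matchClustered;
  // Assumed rule: match only when the op's "clustered-ness" equals the flag.
  bool matches(const ReduceOp &op) const {
    return op.clusterSize.has_value() == matchClustered;
  }
};

using PatternSet = std::vector<ReducePattern>;

// Registers the pattern for ops without a cluster_size.
void populateNonClustered(PatternSet &patterns) {
  patterns.push_back({/*matchClustered=*/false});
}

// Registers the disjoint counterpart for ops with a cluster_size.
void populateClustered(PatternSet &patterns) {
  patterns.push_back({/*matchClustered=*/true});
}

// Counts how many registered patterns would claim the given op.
unsigned countMatches(const PatternSet &patterns, const ReduceOp &op) {
  unsigned n = 0;
  for (const ReducePattern &p : patterns)
    n += p.matches(op) ? 1 : 0;
  return n;
}
```

Under these assumptions, calling both populate functions yields exactly one match per op, and calling only one of them leaves the other kind of op untouched, which lets a caller choose a different lowering for clustered and non-clustered reductions.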
mlir/test/lib/Dialect/GPU/TestGpuRewrite.cpp (10 changes: 5 additions & 5 deletions)
@@ -31,9 +31,8 @@ struct TestGpuRewritePass
MLIR_DEFINE_EXPLICIT_INTERNAL_INLINE_TYPE_ID(TestGpuRewritePass)

void getDependentDialects(DialectRegistry &registry) const override {
-registry.insert<amdgpu::AMDGPUDialect, arith::ArithDialect,
-func::FuncDialect, index::IndexDialect,
-memref::MemRefDialect, ROCDL::ROCDLDialect>();
+registry.insert<arith::ArithDialect, func::FuncDialect, index::IndexDialect,
+memref::MemRefDialect>();
}
StringRef getArgument() const final { return "test-gpu-rewrite"; }
StringRef getDescription() const final {
@@ -58,8 +57,9 @@ struct TestGpuSubgroupReduceLoweringPass
: PassWrapper(pass) {}

void getDependentDialects(DialectRegistry &registry) const override {
-registry.insert<amdgpu::AMDGPUDialect, arith::ArithDialect, LLVM::LLVMDialect,
-ROCDL::ROCDLDialect, vector::VectorDialect>();
+registry
+.insert<amdgpu::AMDGPUDialect, arith::ArithDialect, LLVM::LLVMDialect,
+ROCDL::ROCDLDialect, vector::VectorDialect>();
}

StringRef getArgument() const final {