[AMDGPU] Run LowerLDS at the end of the fullLTO pipeline #75333
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes: This change allows us to use `--lto-partitions` in some cases (not at all guaranteed to work perfectly), as LDS is lowered before the module is split for parallel codegen.

LowerLDS doesn't support being run twice because it will treat the already-lowered LDS GVs as "absolute address" LDS, which isn't supported, so I just added a module flag to detect multiple runs.

We must run LowerLDS before splitting modules, as it needs to see all callers of functions with LDS to lower them properly.

Full diff: https://github.com/llvm/llvm-project/pull/75333.diff

4 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 1b75607e1dc32b..7c9dc68b296ea0 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -130,7 +130,10 @@ extern char &AMDGPULowerModuleLDSLegacyPassID;
struct AMDGPULowerModuleLDSPass : PassInfoMixin<AMDGPULowerModuleLDSPass> {
const AMDGPUTargetMachine &TM;
- AMDGPULowerModuleLDSPass(const AMDGPUTargetMachine &TM_) : TM(TM_) {}
+ bool IsEarlyRun;
+ AMDGPULowerModuleLDSPass(const AMDGPUTargetMachine &TM_,
+ bool IsEarlyRun = false)
+ : TM(TM_), IsEarlyRun(IsEarlyRun) {}
PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
};
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp b/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
index d2a02143e4e7f5..883607da878a7f 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
@@ -215,6 +215,12 @@ using namespace llvm;
namespace {
+cl::opt<bool>
+ ForceAddModuleFlag("amdgpu-lower-module-lds-force-add-moduleflag",
+ cl::desc("Always add the module flag that prevents "
+ "multiple runs of LowerModuleLDS."),
+ cl::init(false), cl::ReallyHidden);
+
cl::opt<bool> SuperAlignLDSGlobals(
"amdgpu-super-align-lds-globals",
cl::desc("Increase alignment of LDS if it is not on align boundary"),
@@ -254,6 +260,7 @@ template <typename T> std::vector<T> sortByName(std::vector<T> &&V) {
class AMDGPULowerModuleLDS {
const AMDGPUTargetMachine &TM;
+ bool IsEarlyRun;
static void
removeLocalVarsFromUsedLists(Module &M,
@@ -328,7 +335,8 @@ class AMDGPULowerModuleLDS {
}
public:
- AMDGPULowerModuleLDS(const AMDGPUTargetMachine &TM_) : TM(TM_) {}
+ AMDGPULowerModuleLDS(const AMDGPUTargetMachine &TM_, bool IsEarlyRun = false)
+ : TM(TM_), IsEarlyRun(IsEarlyRun) {}
using FunctionVariableMap = DenseMap<Function *, DenseSet<GlobalVariable *>>;
@@ -1088,6 +1096,15 @@ class AMDGPULowerModuleLDS {
}
bool runOnModule(Module &M) {
+ // This pass may run twice in a full LTO pipeline.
+ //
+ // If we ran it early, we'll have added metadata to skip next runs.
+ if (M.getModuleFlag("amdgcn.lowered_module_lds"))
+ return false;
+ if (IsEarlyRun || ForceAddModuleFlag)
+ M.addModuleFlag(Module::ModFlagBehavior::Warning,
+ "amdgcn.lowered_module_lds", 1);
+
CallGraph CG = CallGraph(M);
bool Changed = superAlignLDSGlobals(M);
@@ -1574,6 +1591,7 @@ llvm::createAMDGPULowerModuleLDSLegacyPass(const AMDGPUTargetMachine *TM) {
PreservedAnalyses AMDGPULowerModuleLDSPass::run(Module &M,
ModuleAnalysisManager &) {
- return AMDGPULowerModuleLDS(TM).runOnModule(M) ? PreservedAnalyses::none()
- : PreservedAnalyses::all();
+ return AMDGPULowerModuleLDS(TM, IsEarlyRun).runOnModule(M)
+ ? PreservedAnalyses::none()
+ : PreservedAnalyses::all();
}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index 97db19064f6fd1..d603bf84eefa46 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -770,6 +770,15 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
PM.addPass(createCGSCCToFunctionPassAdaptor(std::move(FPM)));
});
+
+ PB.registerFullLinkTimeOptimizationLastEPCallback(
+ [this](ModulePassManager &PM, OptimizationLevel Level) {
+ // We want to support the -lto-partitions=N option as "best effort".
+ // For that, we need to lower LDS earlier in the pipeline before the
+ // module is partitioned for codegen.
+ if (EnableLowerModuleLDS)
+ PM.addPass(AMDGPULowerModuleLDSPass(*this, /*IsEarlyRun*/ true));
+ });
}
int64_t AMDGPUTargetMachine::getNullPointerValue(unsigned AddrSpace) {
diff --git a/llvm/test/CodeGen/AMDGPU/lower-module-lds-reruns.ll b/llvm/test/CodeGen/AMDGPU/lower-module-lds-reruns.ll
new file mode 100644
index 00000000000000..f0a46b2d6ead80
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/lower-module-lds-reruns.ll
@@ -0,0 +1,39 @@
+; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module %s -o %t.ll
+; RUN: not --crash opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module %t.ll -o - 2>&1 | FileCheck %s --check-prefix=ERR
+
+; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module --amdgpu-lower-module-lds-force-add-moduleflag=1 %s -o %t.ll
+; RUN: opt -S -mtriple=amdgcn-- -passes=amdgpu-lower-module-lds --amdgpu-lower-module-lds-strategy=module %t.ll -o - | FileCheck %s
+
+; Check that a re-run of LowerModuleLDS doesn't crash when the module flag is used.
+;
+; We first check that this test still crashes when run twice. If it no longer crashes
+; at some point, we should update it to ensure the flag still does its job.
+;
+; This test just has the bare minimum checks to see if the pass ran.
+
+; ERR: LLVM ERROR: LDS variables with absolute addresses are unimplemented.
+
+; CHECK: %llvm.amdgcn.module.lds.t = type { float, [4 x i8], i32 }
+; CHECK: @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t poison, align 8
+
+; CHECK: attributes #0 = { "amdgpu-lds-size"="12" }
+
+@var0 = addrspace(3) global float poison, align 8
+@var1 = addrspace(3) global i32 poison, align 8
+@ptr = addrspace(1) global ptr addrspace(3) @var1, align 4
+@with_init = addrspace(3) global i64 0
+
+define void @func() {
+ %dec = atomicrmw fsub ptr addrspace(3) @var0, float 1.0 monotonic
+ %val0 = load i32, ptr addrspace(3) @var1, align 4
+ %val1 = add i32 %val0, 4
+ store i32 %val1, ptr addrspace(3) @var1, align 4
+ %unused0 = atomicrmw add ptr addrspace(3) @with_init, i64 1 monotonic
+ ret void
+}
+
+define amdgpu_kernel void @kern_call() {
+ call void @func()
+ %dec = atomicrmw fsub ptr addrspace(3) @var0, float 2.0 monotonic
+ ret void
+}
Just a thought: if we're not okay with changing the order of LowerModuleLDS for gpu-rdc (because this affects all full LTO builds), I can try adding a parameter to the callback to tell us whether we're doing module splitting, and if we're not, we just run things as usual.
Moving the pass is probably fine - it should be kept adjacent to promotealloca but otherwise doesn't matter much. Reasonable chance promotealloca will need fixes too. The flag is ugly, but it'll take some effort to make the pass properly composable, in the sense that it can be run repeatedly or on separate TUs before running on the whole module. Similarly hacky to the flag, we could look for the magic lookup table it builds (it has a known name) and skip the pass if it already exists.
BTW what happened to the patches to allow module LDS to report to promote-alloca the available LDS budget?
That would be nice
Is this okay to land?
ping |
How difficult would it be to support absolute addresses in the pass?
// This pass may run twice in a full LTO pipeline.
//
// If we ran it early, we'll have added metadata to skip next runs.
if (M.getModuleFlag("amdgcn.lowered_module_lds"))
Rather than communicating this with the flag, better to just not add it in the pipeline configuration?
I think there are different TargetMachine instances in this case; we have no way to tell whether we're running the backend on a full-LTO partitioned module (= LowerLDS already ran) or running it as usual (= we'd like to keep LowerLDS in its usual place). The module flag is the only way to reliably communicate that I can think of.
I'm not sure, that's a question for @JonChesterfield I believe :)
ping |
ping - can it land?
Ping
Spawning #81491 to track
I think the machinery landed and the promote alloca patch got lost somewhere around the phab -> github transition. There are some limitations around there that could be stamped out, e.g. there's no particular reason the alloca->lds translation has to be restricted to kernels. I'll push it back onto the work list.
The LDS lowering pass (currently) crawls the IR module, writes "allocates N bytes" on kernels and absolute-address metadata on variables. It's partly set up like that in order to later support application developers writing absolute-address constraints on globals, as that seems generally useful and someone thought it was a plausible feature request in one of the meetings.

As far as allocating goes, making it composable means treating the kernel as a bump allocator. That's definitely fine. There are some sketchy semantics and codegen implications that fall out. I think the conclusion of the design choices will be that a variable with an absolute address cannot be moved, kernels will not try to allocate extra memory to encompass it, and data layout will be generally compromised relative to the single-application case.

Given two modules which have independently had LDS lowering applied, llvm-link of the two is generally going to be unsound, for the same reasons that machine code linking of independently lowered LDS is not viable. So while "handle absolute addresses" is relatively straightforward from the perspective of the LDS lowering pass, it's going to be interpreted as meaning LDS lowering can be arbitrarily composed, as opposed to the very limited situations in which it's actually valid. In particular, a cleverer LDS pass will make the current error message go away, do something entirely locally correct, and the emergent behaviour will be broken nonsense as soon as someone steps slightly away from the thinlto workflow, probably without any diagnostics.

I would propose the following:
Other interesting ideas are to refuse to compile code containing external LDS variables, which is currently left to the backend to report, iirc. The right thing is to change the LDS lowering scheme such that it does work in the presence of machine code linking. That is difficult to do and probably involves patching the linker and/or the loaders. However, that will then work however code is spliced together, modulo kernels refusing to launch because there's only 64k or so of LDS available, and machine code linking is going to compromise data layout efficiency badly.

The above suggestion is better than trying to pass state around to disable subsequent calls to the pass. If you want to do the magic implicit disabling instead, do it in the pass pipeline control, not in the pass itself.

The thinlto workflow will hit case 3. on each piece, then hit case 2. if I'm understanding correctly. Hopefully it doesn't duplicate global variables while splitting the module, as that definitely won't work.
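For reference, the "absolute address" state the discussion keeps coming back to is the `!absolute_symbol` range metadata the pass attaches once it has assigned an LDS offset to a variable. A lowered variable might look roughly like this (variable name, offsets, and the exact integer width of the range are illustrative, not copied from real pass output):

```llvm
; Illustrative sketch: a variable the lowering pass has pinned to LDS
; offset 8. A later run sees the metadata and treats the variable as
; already placed; it must not move or re-allocate it.
@lowered_var = internal addrspace(3) global i32 poison, align 4, !absolute_symbol !0

!0 = !{i32 8, i32 12} ; [lo, hi) address range assigned to the variable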
If all variables in the module are absolute, this means we're running the pass again on an already-lowered module, and that works. If none of them are absolute, lowering can proceed as usual. Only diagnose cases where we have a mix of absolute and non-absolute GVs, which means we added LDS GVs after lowering, which is broken. See #81491. Split from #75333.
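Under that rule, the only rejected shape is a module mixing placed and unplaced LDS variables. A minimal sketch of the diagnosed case (names and the offset range are made up for illustration):

```llvm
; All-absolute: already lowered, re-running is a no-op.
; None-absolute: fresh input, lower as usual.
; Mixed, as below, gets diagnosed: @new appeared after lowering.
@old = internal addrspace(3) global i32 poison, !absolute_symbol !0
@new = addrspace(3) global i32 poison

!0 = !{i32 0, i32 4}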
This now just adds the pass to the FullLTO pipeline.
Can we get some tests for this? In particular, I would like to be sure that we still compile correctly with -O0 + LTO. Do we have tests that run the module lowering multiple times?
#81729 added one
That only covers the second part, not the pipeline change.
This change allows us to use `--lto-partitions` in some cases (not guaranteed it works perfectly), as LDS is lowered before the module is split for parallel codegen.
This is causing
Revert "[AMDGPU] Run LowerLDS at the end of the fullLTO pipeline" - This reverts commit 9b98692.
Running the pass at the end of full LTO doesn't work. There are lots of ways to break that; it was specifically thin LTO that carefully carved out an exception. This has just shown up in Fortran; @ergawy pinged me about it. I'll raise an issue to get better visibility on it.