[AMDGPU] Eliminate unnecessary packing in wider f16 vectors for sdwa/opsel-able instruction #137137

Open

vg0204 wants to merge 8 commits into main from vg0204/optimize-unnecessary-packing-wider-f16-vectors-sdwa

Conversation

Contributor

@vg0204 vg0204 commented Apr 24, 2025

The compiler has no dedicated packed instructions for many fp16 operations (e.g. log, exp), so isel scalarizes each such operation on a wide fp16 vector into separate individual fp16 results, which are later packed. Afterwards, in the post-isel SIPeepholeSDWA pass, instructions are opportunistically converted into their SDWA/OPSEL-able versions.

This patch gets rid of the unnecessary packing in wide fp16 vector operations for SDWA/OPSEL-able instructions by writing each partial fp16 result into the corresponding half of the input register while preserving the rest of its bits, using the SDWA dst_unused operand set to UNUSED_PRESERVE. Since it relies on the machinery for generating SDWA instructions, it is invoked at the end of the SIPeepholeSDWA pass.
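For intuition, the effect of dst_unused set to UNUSED_PRESERVE can be modeled as a read-modify-write of the 32-bit destination register. Below is a minimal illustrative sketch in plain C++ (not compiler code; the function names are made up):

#include <cstdint>

// Models dst_sel:WORD_0, dst_unused:UNUSED_PRESERVE: the 16-bit result
// lands in the low half of the destination, the high half is kept intact.
uint32_t writeWord0Preserve(uint32_t Dst, uint16_t Result) {
  return (Dst & 0xFFFF0000u) | Result;
}

// Models dst_sel:WORD_1, dst_unused:UNUSED_PRESERVE: the result lands in
// the high half, the low half is kept intact.
uint32_t writeWord1Preserve(uint32_t Dst, uint16_t Result) {
  return (Dst & 0x0000FFFFu) | (uint32_t(Result) << 16);
}

Chaining one such write per half produces the packed <2 x half> result in place, so the trailing v_pack_b32_f16 becomes dead.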

#137150 provides the pre-commit testcase!

@vg0204 vg0204 self-assigned this Apr 24, 2025

github-actions bot commented Apr 24, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

vg0204 added 2 commits April 24, 2025 08:44
operations for SDWA/OPSEL-able instruction

@vg0204 vg0204 force-pushed the vg0204/optimize-unnecessary-packing-wider-f16-vectors-sdwa branch from a9b7e60 to 98bdc1d Compare April 24, 2025 08:45
@vg0204 vg0204 requested a review from cdevadas April 24, 2025 08:47
vg0204 added a commit to vg0204/llvm-project that referenced this pull request Apr 24, 2025
This adds an llc LIT test for vector fp16 operations like log, exp, etc.
It acts as the pre-commit test for GitHub PR #137137.
@vg0204 vg0204 marked this pull request as ready for review April 24, 2025 10:23
Member

llvmbot commented Apr 24, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Vikash Gupta (vg0204)

Changes



Patch is 150.79 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137137.diff

20 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp (+514-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-sub-mul.ll (+40-48)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-sub-neg-mul.ll (+16-24)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fdiv.f16.ll (+48-52)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv.f16.ll (+8-12)
  • (modified) llvm/test/CodeGen/AMDGPU/fold-int-pow2-with-fmul-or-fdiv.ll (+21-30)
  • (modified) llvm/test/CodeGen/AMDGPU/fract-match.ll (+5-6)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.cos.f16.ll (+8-10)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp.ll (+6-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp10.ll (+8-10)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.exp2.ll (+14-21)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.frexp.ll (+11-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.ldexp.ll (+18-26)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log.ll (+150-110)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log10.ll (+150-110)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.log2.ll (+44-66)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.rint.f16.ll (+2-3)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.sin.f16.ll (+8-10)
  • (modified) llvm/test/CodeGen/AMDGPU/repeated-divisor.ll (+5-7)
  • (modified) llvm/test/CodeGen/AMDGPU/roundeven.ll (+32-48)
diff --git a/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp b/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
index 22f23e4c94e2d..b8bb27430707a 100644
--- a/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
+++ b/llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
@@ -27,6 +27,7 @@
 #include "llvm/ADT/Statistic.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include <optional>
+#include <queue>
 
 using namespace llvm;
 
@@ -35,6 +36,11 @@ using namespace llvm;
 STATISTIC(NumSDWAPatternsFound, "Number of SDWA patterns found.");
 STATISTIC(NumSDWAInstructionsPeepholed,
           "Number of instruction converted to SDWA.");
+STATISTIC(Num16BitPackedInstructionsEliminated,
+          "Number of packed instructions eliminated.");
+STATISTIC(NumSDWAInstructionsToEliminateFP16Pack,
+          "Number of instructions converted/modified into SDWA to eliminate "
+          "FP16 packing.");
 
 namespace {
 
@@ -66,6 +72,14 @@ class SIPeepholeSDWA {
   bool convertToSDWA(MachineInstr &MI, const SDWAOperandsVector &SDWAOperands);
   void legalizeScalarOperands(MachineInstr &MI, const GCNSubtarget &ST) const;
 
+  void eliminateFP16Packing(MachineBasicBlock &MBB, const GCNSubtarget &ST);
+  unsigned
+  computeMIChainsForPackedOps(MachineInstr *ParentMI,
+                              std::queue<MachineOperand *> &DefSrcQueue,
+                              const GCNSubtarget &ST);
+  void convertMIToSDWAWithOpsel(MachineInstr &MI, MachineOperand &SrcMO,
+                                AMDGPU::SDWA::SdwaSel OpSel);
+
 public:
   bool run(MachineFunction &MF);
 };
@@ -266,13 +280,17 @@ void SDWADstPreserveOperand::print(raw_ostream& OS) const {
 
 #endif
 
-static void copyRegOperand(MachineOperand &To, const MachineOperand &From) {
+static void copyRegOperand(MachineOperand &To, const MachineOperand &From,
+                           bool isKill = false) {
   assert(To.isReg() && From.isReg());
   To.setReg(From.getReg());
   To.setSubReg(From.getSubReg());
   To.setIsUndef(From.isUndef());
   if (To.isUse()) {
-    To.setIsKill(From.isKill());
+    if (isKill)
+      To.setIsKill(true);
+    else
+      To.setIsKill(From.isKill());
   } else {
     To.setIsDead(From.isDead());
   }
@@ -1361,6 +1379,494 @@ bool SIPeepholeSDWALegacy::runOnMachineFunction(MachineFunction &MF) {
   return SIPeepholeSDWA().run(MF);
 }
 
+static bool isSrcDestFP16Bits(MachineInstr *MI, const SIInstrInfo *TII) {
+  unsigned Opcode = MI->getOpcode();
+  if (TII->isSDWA(Opcode))
+    Opcode = AMDGPU::getBasicFromSDWAOp(Opcode);
+
+  switch (Opcode) {
+  case AMDGPU::V_CVT_F16_U16_e32:
+  case AMDGPU::V_CVT_F16_U16_e64:
+  case AMDGPU::V_CVT_F16_I16_e32:
+  case AMDGPU::V_CVT_F16_I16_e64:
+  case AMDGPU::V_RCP_F16_e64:
+  case AMDGPU::V_RCP_F16_e32:
+  case AMDGPU::V_RSQ_F16_e64:
+  case AMDGPU::V_RSQ_F16_e32:
+  case AMDGPU::V_SQRT_F16_e64:
+  case AMDGPU::V_SQRT_F16_e32:
+  case AMDGPU::V_LOG_F16_e64:
+  case AMDGPU::V_LOG_F16_e32:
+  case AMDGPU::V_EXP_F16_e64:
+  case AMDGPU::V_EXP_F16_e32:
+  case AMDGPU::V_SIN_F16_e64:
+  case AMDGPU::V_SIN_F16_e32:
+  case AMDGPU::V_COS_F16_e64:
+  case AMDGPU::V_COS_F16_e32:
+  case AMDGPU::V_FLOOR_F16_e64:
+  case AMDGPU::V_FLOOR_F16_e32:
+  case AMDGPU::V_CEIL_F16_e64:
+  case AMDGPU::V_CEIL_F16_e32:
+  case AMDGPU::V_TRUNC_F16_e64:
+  case AMDGPU::V_TRUNC_F16_e32:
+  case AMDGPU::V_RNDNE_F16_e64:
+  case AMDGPU::V_RNDNE_F16_e32:
+  case AMDGPU::V_FRACT_F16_e64:
+  case AMDGPU::V_FRACT_F16_e32:
+  case AMDGPU::V_FREXP_MANT_F16_e64:
+  case AMDGPU::V_FREXP_MANT_F16_e32:
+  case AMDGPU::V_FREXP_EXP_I16_F16_e64:
+  case AMDGPU::V_FREXP_EXP_I16_F16_e32:
+  case AMDGPU::V_LDEXP_F16_e64:
+  case AMDGPU::V_LDEXP_F16_e32:
+  case AMDGPU::V_ADD_F16_e64:
+  case AMDGPU::V_ADD_F16_e32:
+  case AMDGPU::V_SUB_F16_e64:
+  case AMDGPU::V_SUB_F16_e32:
+  case AMDGPU::V_SUBREV_F16_e64:
+  case AMDGPU::V_SUBREV_F16_e32:
+  case AMDGPU::V_MUL_F16_e64:
+  case AMDGPU::V_MUL_F16_e32:
+  case AMDGPU::V_MAX_F16_e64:
+  case AMDGPU::V_MAX_F16_e32:
+  case AMDGPU::V_MIN_F16_e64:
+  case AMDGPU::V_MIN_F16_e32:
+  case AMDGPU::V_MAD_F16_e64:
+  case AMDGPU::V_FMA_F16_e64:
+  case AMDGPU::V_DIV_FIXUP_F16_e64:
+    return true;
+  case AMDGPU::V_MADAK_F16:
+  case AMDGPU::V_MADMK_F16:
+  case AMDGPU::V_FMAMK_F16:
+  case AMDGPU::V_FMAAK_F16:
+    // NOTE: Unclear whether these are safe to handle; conservatively excluded.
+    return false;
+  case AMDGPU::V_FMAC_F16_e32:
+  case AMDGPU::V_FMAC_F16_e64:
+  case AMDGPU::V_MAC_F16_e32:
+  case AMDGPU::V_MAC_F16_e64:
+    // Their SDWA versions only allow dst_sel to be set to DWORD.
+  default:
+    return false;
+  }
+}
+
+static bool checkForRightSrcRootAccess(MachineInstr *Def0MI,
+                                       MachineInstr *Def1MI,
+                                       Register SrcRootReg,
+                                       const SIInstrInfo *TII) {
+  // If it could be, Def1MI would already have been converted to SDWA.
+  if (!TII->isSDWA(Def1MI->getOpcode()))
+    return false;
+
+  MachineOperand *Def1Src0 =
+      TII->getNamedOperand(*Def1MI, AMDGPU::OpName::src0);
+  MachineOperand *Def1Src1 =
+      TII->getNamedOperand(*Def1MI, AMDGPU::OpName::src1);
+  MachineOperand *Def0Src0 =
+      TII->getNamedOperand(*Def0MI, AMDGPU::OpName::src0);
+  MachineOperand *Def0Src1 =
+      TII->getNamedOperand(*Def0MI, AMDGPU::OpName::src1);
+
+  if (Def1Src0 && Def1Src0->isReg() && (Def1Src0->getReg() == SrcRootReg)) {
+    MachineOperand *Def1Src0Sel =
+        TII->getNamedOperand(*Def1MI, AMDGPU::OpName::src0_sel);
+    if (!Def1Src0Sel ||
+        (Def1Src0Sel->getImm() != AMDGPU::SDWA::SdwaSel::WORD_1))
+      return false;
+
+    if (Def0Src0 && Def0Src0->isReg() && (Def0Src0->getReg() == SrcRootReg)) {
+      MachineOperand *Def0Src0Sel =
+          TII->getNamedOperand(*Def0MI, AMDGPU::OpName::src0_sel);
+      if (!Def0Src0Sel ||
+          Def0Src0Sel->getImm() == AMDGPU::SDWA::SdwaSel::WORD_0)
+        return true;
+    }
+
+    if (Def0Src1 && Def0Src1->isReg() && (Def0Src1->getReg() == SrcRootReg)) {
+      MachineOperand *Def0Src1Sel =
+          TII->getNamedOperand(*Def0MI, AMDGPU::OpName::src1_sel);
+      if (!Def0Src1Sel ||
+          Def0Src1Sel->getImm() == AMDGPU::SDWA::SdwaSel::WORD_0)
+        return true;
+    }
+  }
+
+  if (Def1Src1 && Def1Src1->isReg() && (Def1Src1->getReg() == SrcRootReg)) {
+    MachineOperand *Def1Src1Sel =
+        TII->getNamedOperand(*Def1MI, AMDGPU::OpName::src1_sel);
+    if (!Def1Src1Sel ||
+        (Def1Src1Sel->getImm() != AMDGPU::SDWA::SdwaSel::WORD_1))
+      return false;
+
+    if (Def0Src0 && Def0Src0->isReg() && (Def0Src0->getReg() == SrcRootReg)) {
+      MachineOperand *Def0Src0Sel =
+          TII->getNamedOperand(*Def0MI, AMDGPU::OpName::src0_sel);
+      if (!Def0Src0Sel ||
+          Def0Src0Sel->getImm() == AMDGPU::SDWA::SdwaSel::WORD_0)
+        return true;
+    }
+
+    if (Def0Src1 && Def0Src1->isReg() && (Def0Src1->getReg() == SrcRootReg)) {
+      MachineOperand *Def0Src1Sel =
+          TII->getNamedOperand(*Def0MI, AMDGPU::OpName::src1_sel);
+      if (!Def0Src1Sel ||
+          Def0Src1Sel->getImm() == AMDGPU::SDWA::SdwaSel::WORD_0)
+        return true;
+    }
+  }
+
+  return false;
+}
+
+/// Given A and B are in the same MBB, returns true if A comes before B.
+static bool dominates(MachineBasicBlock::const_iterator A,
+                      MachineBasicBlock::const_iterator B) {
+  assert(A->getParent() == B->getParent());
+  const MachineBasicBlock *MBB = A->getParent();
+  auto MBBEnd = MBB->end();
+  if (B == MBBEnd)
+    return true;
+
+  MachineBasicBlock::const_iterator I = MBB->begin();
+  for (; &*I != A && &*I != B; ++I)
+    ;
+
+  return &*I == A;
+}
+
+// Convert MI into its SDWA version, setting its dst_sel and the sel of SrcMO
+// to OpSel while preserving the rest of Dst's bits.
+void SIPeepholeSDWA::convertMIToSDWAWithOpsel(MachineInstr &MI,
+                                              MachineOperand &SrcMO,
+                                              AMDGPU::SDWA::SdwaSel OpSel) {
+  LLVM_DEBUG(dbgs() << "Convert instruction:" << MI);
+
+  MachineInstr *SDWAInst;
+  if (TII->isSDWA(MI.getOpcode())) {
+    SDWAInst = &MI;
+  } else {
+    SDWAInst = createSDWAVersion(MI);
+    MI.eraseFromParent();
+  }
+
+  ConvertedInstructions.push_back(SDWAInst);
+  unsigned SDWAOpcode = SDWAInst->getOpcode();
+  ++NumSDWAInstructionsToEliminateFP16Pack;
+
+  MachineOperand *Dst = TII->getNamedOperand(*SDWAInst, AMDGPU::OpName::vdst);
+  assert(Dst && AMDGPU::hasNamedOperand(SDWAOpcode, AMDGPU::OpName::vdst));
+
+  MachineOperand *DstSel =
+      TII->getNamedOperand(*SDWAInst, AMDGPU::OpName::dst_sel);
+  assert(DstSel &&
+         AMDGPU::hasNamedOperand(SDWAOpcode, AMDGPU::OpName::dst_sel));
+  DstSel->setImm(OpSel);
+
+  MachineOperand *DstUnused =
+      TII->getNamedOperand(*SDWAInst, AMDGPU::OpName::dst_unused);
+  assert(DstUnused &&
+         AMDGPU::hasNamedOperand(SDWAOpcode, AMDGPU::OpName::dst_unused));
+  assert(DstUnused->getImm() != AMDGPU::SDWA::DstUnused::UNUSED_PRESERVE &&
+         "dst_unused should not already be UNUSED_PRESERVE");
+  DstUnused->setImm(AMDGPU::SDWA::DstUnused::UNUSED_PRESERVE);
+
+  auto PreserveDstIdx =
+      AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::vdst);
+  assert(PreserveDstIdx != -1);
+  auto NewSrcImplicitMO =
+      MachineOperand::CreateReg(SrcMO.getReg(), false, true);
+  copyRegOperand(NewSrcImplicitMO, SrcMO);
+  SDWAInst->addOperand(NewSrcImplicitMO);
+  SDWAInst->tieOperands(PreserveDstIdx, SDWAInst->getNumOperands() - 1);
+
+  MachineOperand *Src0 = TII->getNamedOperand(*SDWAInst, AMDGPU::OpName::src0);
+  assert(Src0 && AMDGPU::hasNamedOperand(SDWAOpcode, AMDGPU::OpName::src0));
+  if (Src0->isReg() && (Src0->getReg() == SrcMO.getReg())) {
+    MachineOperand *Src0Sel =
+        TII->getNamedOperand(*SDWAInst, AMDGPU::OpName::src0_sel);
+    assert(Src0Sel &&
+           AMDGPU::hasNamedOperand(SDWAOpcode, AMDGPU::OpName::src0_sel));
+    Src0Sel->setImm(OpSel);
+
+    LLVM_DEBUG(dbgs() << "\nInto:" << *SDWAInst << '\n');
+    return;
+  }
+
+  MachineOperand *Src1 = TII->getNamedOperand(*SDWAInst, AMDGPU::OpName::src1);
+  assert(Src1 && AMDGPU::hasNamedOperand(SDWAOpcode, AMDGPU::OpName::src1));
+  if (Src1->isReg() && (Src1->getReg() == SrcMO.getReg())) {
+    MachineOperand *Src1Sel =
+        TII->getNamedOperand(*SDWAInst, AMDGPU::OpName::src1_sel);
+    assert(Src1Sel &&
+           AMDGPU::hasNamedOperand(SDWAOpcode, AMDGPU::OpName::src1_sel));
+    Src1Sel->setImm(OpSel);
+
+    LLVM_DEBUG(dbgs() << "\nInto:" << *SDWAInst << '\n');
+    return;
+  }
+}
+
+// Walks up from ParentMI, looking for a use operand that is defined by an
+// FP16 (SDWA-able) instruction, following the single-use chain iteratively.
+unsigned SIPeepholeSDWA::computeMIChainsForPackedOps(
+    MachineInstr *ParentMI, std::queue<MachineOperand *> &DefSrcQueue,
+    const GCNSubtarget &ST) {
+  unsigned NumOfFP16Def;
+  do {
+    MachineInstr *NextMIInChain = nullptr;
+    NumOfFP16Def = 0;
+    for (MachineOperand &CurrentMO : ParentMI->uses()) {
+      if (!CurrentMO.isReg() || CurrentMO.getReg().isPhysical() ||
+          !MRI->hasOneUse(CurrentMO.getReg()))
+        continue;
+
+      MachineOperand *DefCurrMO = findSingleRegDef(&CurrentMO, MRI);
+      if (!DefCurrMO)
+        continue;
+
+      MachineInstr *DefCurrMI = DefCurrMO->getParent();
+      if (!isSrcDestFP16Bits(DefCurrMI, TII) ||
+          !isConvertibleToSDWA(*DefCurrMI, ST, TII))
+        continue;
+
+      NextMIInChain = DefCurrMI;
+      DefSrcQueue.push(DefCurrMO);
+      NumOfFP16Def++;
+    }
+
+    if (NumOfFP16Def > 1)
+      break;
+
+    ParentMI = NextMIInChain;
+  } while (ParentMI);
+
+  return NumOfFP16Def;
+}
+
+void SIPeepholeSDWA::eliminateFP16Packing(MachineBasicBlock &MBB,
+                                          const GCNSubtarget &ST) {
+  if (!ST.has16BitInsts())
+    return;
+
+  for (MachineInstr &MI : make_early_inc_range(MBB)) {
+    if (MI.getOpcode() == AMDGPU::V_PACK_B32_F16_e64) {
+      LLVM_DEBUG(dbgs() << "\nCandidate FP16 Packed MI : " << MI << '\n');
+      std::queue<MachineOperand *> DefSrc0Queue;
+      std::queue<MachineOperand *> DefSrc1Queue;
+      MachineOperand *Src0 = TII->getNamedOperand(MI, AMDGPU::OpName::src0);
+      MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
+
+      if (!Src0->isReg() || Src0->getReg().isPhysical() ||
+          !MRI->hasOneUse(Src0->getReg()) || !Src1->isReg() ||
+          Src1->getReg().isPhysical() || !MRI->hasOneUse(Src1->getReg()))
+        continue;
+
+      MachineOperand *Op0 = findSingleRegDef(Src0, MRI);
+      MachineOperand *Op1 = findSingleRegDef(Src1, MRI);
+
+      if (!Op0 || !Op1)
+        continue;
+
+      MachineInstr *ParentMIOp0 = Op0->getParent();
+      MachineInstr *ParentMIOp1 = Op1->getParent();
+
+      if (!isSrcDestFP16Bits(ParentMIOp0, TII) ||
+          !isSrcDestFP16Bits(ParentMIOp1, TII) ||
+          !isConvertibleToSDWA(*ParentMIOp0, ST, TII) ||
+          !isConvertibleToSDWA(*ParentMIOp1, ST, TII))
+        continue;
+
+      DefSrc0Queue.push(Op0);
+      DefSrc1Queue.push(Op1);
+
+      // Check that each MI in the chain has exactly one register MO use that
+      // is defined by a pure FP16 instruction (which is SDWA-able too).
+      unsigned NumOfFP16Def =
+          computeMIChainsForPackedOps(ParentMIOp0, DefSrc0Queue, ST);
+      if (NumOfFP16Def > 1)
+        continue;
+
+      NumOfFP16Def = computeMIChainsForPackedOps(ParentMIOp1, DefSrc1Queue, ST);
+      if (NumOfFP16Def > 1)
+        continue;
+
+      MachineInstr *Def0RootMI = (DefSrc0Queue.back())->getParent();
+      MachineInstr *Def1RootMI = (DefSrc1Queue.back())->getParent();
+      Register SrcRootMOReg = AMDGPU::NoRegister;
+
+      // Now check whether the last operation of each DefSrcQueue has a
+      // common MO; that would be the source root MO of the element-wise
+      // fp16 chains.
+      for (MachineOperand &Current0MO : Def0RootMI->uses()) {
+        if (!Current0MO.isReg() || Current0MO.getReg().isPhysical())
+          continue;
+
+        for (MachineOperand &Current1MO : Def1RootMI->uses()) {
+          if (!Current1MO.isReg() || Current1MO.getReg().isPhysical())
+            continue;
+
+          if (Current0MO.getReg() == Current1MO.getReg() &&
+              Current0MO.getSubReg() == Current1MO.getSubReg()) {
+            SrcRootMOReg = Current0MO.getReg();
+            break;
+          }
+        }
+        // Found it; no more checks needed.
+        if (SrcRootMOReg != AMDGPU::NoRegister)
+          break;
+      }
+
+      if (SrcRootMOReg == AMDGPU::NoRegister)
+        continue;
+
+      // Also ensure that Def0RootMI and Def1RootMI access the lower and
+      // upper half-words of SrcRootMOReg, respectively.
+      if (!checkForRightSrcRootAccess(Def0RootMI, Def1RootMI, SrcRootMOReg,
+                                      TII))
+        continue;
+
+      // The graph below represents the connection:
+      //       Op0Initial  -->  Op0x  --> ...  -->  Op0Final
+      //      /                                        \
+      // SrcRootMO                                     v_pack_b32_f16
+      //      \                                        /
+      //       Op1Initial  -->  Op1x  --> ...  -->  Op1Final
+      // The nomenclature below is based on this flow graph.
+      //
+      // For each DefSrcXQueue:
+      // OpXInitial is at the back and OpXFinal at the front.
+      auto Op0FinalMI = (DefSrc0Queue.front())->getParent();
+      auto Op1FinalMI = (DefSrc1Queue.front())->getParent();
+      auto Op0InitialMI = (DefSrc0Queue.back())->getParent();
+      auto Op1InitialMI = (DefSrc1Queue.back())->getParent();
+
+      MachineOperand *FinalOutMO = nullptr;
+      std::queue<MachineOperand *> ChainedDefOps;
+      AMDGPU::SDWA::SdwaSel OpSel = AMDGPU::SDWA::SdwaSel::DWORD;
+      int NumOfElemInSecondOpChain = 0;
+
+      // Now, if possible, rewire the flow according to the dominance of the
+      // MIs as follows, and store the chain in ChainedDefOps so it can later
+      // be converted into its SDWA version:
+      //
+      // If (dominates(Op0FinalMI, Op1InitialMI)) == TRUE
+      // SrcRootMO -> Op0Initial -> Op0x -> ... -> Op0Final
+      // -> Op1Initial -> Op1x -> ... -> Op1Final (FinalOutMO)
+      //
+      // If (dominates(Op1FinalMI, Op0InitialMI)) == TRUE
+      // SrcRootMO -> Op1Initial -> Op1x -> ... -> Op1Final
+      // -> Op0Initial -> Op0x -> ... -> Op0Final (FinalOutMO)
+      //
+      // TODO: The remaining cases are not handled!
+      // One such case arises when multiple fp16 instructions are chained
+      // on an fp16 vector input. For example:
+      //
+      // %1 = call <2 x half> @llvm.log.v2f16 (<2 x half> %0)
+      // %res = call <2 x half> @llvm.sin.v2f16 (<2 x half> %1)
+      // return <2 x half> %res
+      if (dominates(Op0FinalMI, Op1InitialMI)) {
+        int OpIdx = Op1InitialMI->findRegisterUseOperandIdx(SrcRootMOReg, TRI);
+        auto &MOTo = Op1InitialMI->getOperand(OpIdx);
+        auto MOFrom = DefSrc0Queue.front();
+        copyRegOperand(MOTo, *MOFrom, true);
+        FinalOutMO = DefSrc1Queue.front();
+
+        LLVM_DEBUG(dbgs() << "Updated Connecting MI : " << *Op1InitialMI
+                          << '\n');
+        OpIdx = Op0InitialMI->findRegisterUseOperandIdx(SrcRootMOReg, TRI);
+        auto &InitialInMO = Op0InitialMI->getOperand(OpIdx);
+
+        while (!DefSrc1Queue.empty()) {
+          ChainedDefOps.push(DefSrc1Queue.front());
+          DefSrc1Queue.pop();
+          NumOfElemInSecondOpChain++;
+        }
+        while (!DefSrc0Queue.empty()) {
+          ChainedDefOps.push(DefSrc0Queue.front());
+          DefSrc0Queue.pop();
+        }
+
+        ChainedDefOps.push(&InitialInMO);
+        OpSel = AMDGPU::SDWA::SdwaSel::WORD_1;
+      } else if (dominates(Op1FinalMI, Op0InitialMI)) {
+        int OpIdx = Op0InitialMI->findRegisterUseOperandIdx(SrcRootMOReg, TRI);
+        auto &MOTo = Op0InitialMI->getOperand(OpIdx);
+        auto MOFrom = DefSrc1Queue.front();
+        copyRegOperand(MOTo, *MOFrom, true);
+        FinalOutMO = DefSrc0Queue.front();
+
+        LLVM_DEBUG(dbgs() << "Updated Connecting MI : " << *Op0InitialMI
+                          << '\n');
+        OpIdx = Op1InitialMI->findRegisterUseOperandIdx(SrcRootMOReg, TRI);
+        auto &InitialInMO = Op1InitialMI->getOperand(OpIdx);
+
+        while (!DefSrc0Queue.empty()) {
+          ChainedDefOps.push(DefSrc0Queue.front());
+          DefSrc0Queue.pop();
+          NumOfElemInSecondOpChain++;
+        }
+        while (!DefSrc1Queue.empty()) {
+          ChainedDefOps.push(DefSrc1Queue.front());
+          DefSrc1Queue.pop();
+        }
+
+        ChainedDefOps.push(&InitialInMO);
+        OpSel = AMDGPU::SDWA::SdwaSel::WORD_0;
+      } else {
+        LLVM_DEBUG(dbgs() << "No Connecting MI exists" << '\n');
+        continue;
+      }
+
+      // Replace all uses of the v_pack MI's def MO with FinalOutMO.
+      MachineOperand &DefMO = MI.getOperand(0);
+      for (MachineOperand &MO :
+           make_early_inc_range(MRI->use_nodbg_operands(DefMO.getReg()))) {
+        if (!MO.isReg())
+          continue;
+
+        MO.setReg(FinalOutMO->getReg());
+        MO.setSubReg(FinalOutMO->getSubReg());
+      }
+      LLVM_DEBUG(dbgs() << "Replace uses of " << DefMO << " in " << MI
+                        << "With " << *FinalOutMO << '\n');
+
+      // Delete v_pack machine instruction
+      LLVM_DEBUG(dbgs() << "\nInstruction to be deleted : " << MI << "\n\n");
+      MI.eraseFromParent();
+      ++Num16BitPackedInstructionsEliminated;
+
+      // Convert machine instruction into SDWA-version
+      while (ChainedDefOps.size() != 1) {
+        if (NumOfElemInSecondOpChain == 0) {
+          if (OpSel == AMDGPU::SDWA::SdwaSel::WORD_0)
+            OpSel = AMDGPU::SDWA::SdwaSel::WORD_1;
+          else
+            OpSel = AMDGPU::SDWA::SdwaSel::WORD_0;
+        }
+
+        MachineInstr *DefMI = ChainedDefOps.front()->getParent();
+        ChainedDefOps.pop();
+        MachineOperand *SrcMO = ChainedDefOps.front();
+
+        // SrcMO is a def; switch to its single use in DefMI.
+        if (SrcMO->isDef()) {
+          assert(MRI->hasOneUse(SrcMO->getReg()));
+          SrcMO = findSingleRegUse(SrcMO, MRI);
+          assert(DefMI == S...
[truncated]

vg0204 added 2 commits April 24, 2025 10:35
This adds an llc LIT test for vector fp16 operations like log, exp, etc.
It acts as the pre-commit test for GitHub PR #137137.
@vg0204 vg0204 force-pushed the vg0204/optimize-unnecessary-packing-wider-f16-vectors-sdwa branch from c88fb17 to 6f2c15c Compare April 24, 2025 14:45
Contributor

@krzysz00 krzysz00 left a comment

Just one question, I'll let the compiler folks do the reviewing here

@@ -1361,6 +1379,499 @@ bool SIPeepholeSDWALegacy::runOnMachineFunction(MachineFunction &MF) {
return SIPeepholeSDWA().run(MF);
}

static bool isSrcDestFP16Bits(MachineInstr *MI, const SIInstrInfo *TII) {
Contributor

Is there a way to tablegen this switch-case?

(Also, are there relevant things that use the OPSEL scheme and not the SDWA one?)

Contributor

@arsenm arsenm Apr 24, 2025

This switch overlaps a lot with zeroesHigh16BitsOfDest, but I also suspect you don't actually need this information

This is required to identify those instructions for which both the src and dst implied types are fp16; only then can this optimization hold for that case!

Contributor

@vg0204 Please use replies instead of editing the original comment. I got very confused while reading this.

@arsenm arsenm changed the title [AMDGPU] [PeepholeOpt] Eliminate unnecessary packing in wider f16 vectors for sdwa/opsel-able instruction [AMDGPU] Eliminate unnecessary packing in wider f16 vectors for sdwa/opsel-able instruction Apr 24, 2025
@arsenm arsenm requested a review from frederik-h April 24, 2025 16:05
Contributor

@arsenm arsenm left a comment

This feels like about 50x more code than should be required for this. Can you write some MIR tests for the specific pattern you are trying to handle?

Comment on lines +1531 to +1533
/// Given A and B are in the same MBB, returns true if A comes before B.
static bool dominates(MachineBasicBlock::const_iterator A,
MachineBasicBlock::const_iterator B) {
Contributor

You shouldn't have to do a scan for dominance, this should be implied by your visitation order?

Contributor Author

Here, the dominance check is needed because the two inputs of v_pack are traced back independently to their accesses of the input register.

Contributor

Even if it's needed, this function could be written a lot more succinctly

  • Check if both iterators have the same parent
  • If they do, take A then iterate until the end of the basic block. If you see B while iterating, then A comes before B, else B comes before A if you reach the end first.
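
A minimal sketch of that suggestion (untested; assumes both iterators point to instructions in the same block):

/// Returns true if A comes before B in their common basic block.
static bool comesBefore(MachineBasicBlock::const_iterator A,
                        MachineBasicBlock::const_iterator B) {
  assert(A->getParent() == B->getParent());
  const MachineBasicBlock *MBB = A->getParent();
  if (B == MBB->end())
    return true;
  // Walk forward from A; reaching B means A comes first.
  for (auto I = A, E = MBB->end(); I != E; ++I)
    if (I == B)
      return true;
  return false;
}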

Contributor Author

For this, I took inspiration from MachineDominatorTree::dominates(), for the case where both MIs are in the same MBB!

Contributor Author

You shouldn't have to do a scan for dominance, this should be implied by your visitation order?

I am checking dominance between two independent instructions, each of which could access its half of a 32-bit register (two f16 values) and operate on it at any point before the packing!

if (!ST.has16BitInsts())
return;

for (MachineInstr &MI : make_early_inc_range(MBB)) {
Contributor

This should not require its own loop through the block

Contributor Author

My current implementation needs to take effect after all the SDWA peepholes are done, so we either need a new pass or have to operate on the MBB here itself.

The latter seems better, as a lot of the needed utilities already exist in the SIPeepholeSDWA pass.

Contributor

I am saying this is a flawed approach. This should be integrated with the main pass processing loop

Contributor Author

I believe my pattern search, as well as the transformation, is not straightforward to integrate with the existing pass. I studied the existing structure to accommodate my idea, found it difficult, and so eventually came up with this first attempt at a patch!

Contributor Author

The added MIR test may give a better idea!

Contributor Author

vg0204 commented Apr 25, 2025

This feels like about 50x more code than should be required for this. Can you write some MIR tests for the specific pattern you are trying to handle?

@arsenm, this PR corresponds to the https://ontrack-internal.amd.com/browse/SWDEV-523024 ticket, which contains the info and a sample test for the target pattern. The images attached there also describe the same in a generic way, along with a solution image based on this patch.

Contributor

@Pierre-vh Pierre-vh left a comment

The precommit test is merged into this patch, so it kind of defeats the purpose of precommitting the tests.
I'd also like to see MIR tests that test the specific patterns you're looking for, along with some edge cases that can break the code, to test for infinite loops, crashes and so on.

Contributor

arsenm commented Apr 25, 2025

The images attached there also describe the same in a generic way, along with a solution image based on this patch

This isn't a substitute for MIR tests for this patch

Contributor Author

vg0204 commented Apr 28, 2025

This feels like about 50x more code than should be required for this. Can you write some MIR tests for the specific pattern you are trying to handle?

Added a basic MIR test which covers the targeted specific pattern!

Contributor Author

vg0204 commented Apr 29, 2025

The precommit test is merged into this patch, so it kind of defeats the purpose of precommitting the tests. I'd also like to see MIR tests that test the specific patterns you're looking for, along with some edge cases that can break the code, to test for infinite loops, crashes and so on.

Added a basic MIR test specifying the targeted pattern (generated via isel just before the si-peephole-sdwa pass for intrinsics). Yet to add non-conservative and edge tests!

@vg0204 vg0204 requested review from Pierre-vh and arsenm April 30, 2025 06:03
Contributor Author

vg0204 commented May 5, 2025

@Pierre-vh, do you have any comments on the added MIR test showcasing the exact MIR pattern handled by this optimization?

Contributor Author

vg0204 commented May 7, 2025

Gentle reminder ping for review!

Contributor Author

vg0204 commented May 12, 2025

@jayfoad, @arsenm, @Pierre-vh, it would be appreciated if you could leave further comments on this, thanks!

@vg0204 vg0204 requested a review from krzysz00 May 13, 2025 06:51
Contributor

@arsenm arsenm left a comment

I think you need to step back and wholly reevaluate what you are trying to do here. You have a post-processing of the function in the SDWA pass that doesn't really have anything to do with SDWA or the rest of the pass. The problem you are solving appears to be avoiding the interference of v_pack_b32_f16 instructions, which only apply in a narrow range of cases.

It would be easier to solve this by avoiding introducing that problem in the first place. Whether that's by checking if the uses are SDWA candidates, or just not using v_pack_b32_f16 in the first place. It would be simpler to form v_pack_b32_f16 later than trying to look through it while also mixing in conversion to SDWA
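
For illustration, the first alternative could take roughly the following shape (a hypothetical, untested sketch; allUsersAreSDWACandidates is a made-up name, and reusing this patch's isSrcDestFP16Bits as the candidate check is an assumption):

// Decide up front whether forming v_pack_b32_f16 is worthwhile: if every
// user of the packed value is itself an FP16 SDWA candidate, the pack
// could be skipped (or formed later) instead of being looked through.
static bool allUsersAreSDWACandidates(Register PackedReg,
                                      MachineRegisterInfo &MRI,
                                      const SIInstrInfo *TII) {
  for (MachineInstr &UseMI : MRI.use_nodbg_instructions(PackedReg))
    if (!isSrcDestFP16Bits(&UseMI, TII))
      return false;
  return true;
}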

ret <4 x half> %res
}

define <4 x half> @log2_v4f16(<4 x half> %a) #0 {
Contributor

The point of a MIR test is to stress all of the structural edge cases and/or have minimal MIR instruction context. We're not getting that by just testing structurally identical opcodes in different functions

Contributor Author

vg0204 commented May 13, 2025

You have a post-processing of the function in the SDWA pass that doesn't really have anything to do with SDWA or the rest of the pass. The problem you are solving appears to be avoiding the interference of v_pack_b32_f16 instructions, which only apply in a narrow range of cases.

The point of doing it at this stage is the ready availability of the needed utilities such as SDWACandidateCheck, SDWAConvertOfMI & legalizeScalarOperands, while agreeing that this patch doesn't otherwise have anything to do with the SDWA peephole or the rest of the pass.

Contributor Author

vg0204 commented May 13, 2025

It would be easier to solve this by avoiding introducing that problem in the first place. Whether that's by checking if the uses are SDWA candidates, or just not using v_pack_b32_f16 in the first place. It would be simpler to form v_pack_b32_f16 later than trying to look through it while also mixing in conversion to SDWA

This problem arises right away in the isel phase: for unary ops like log or exp we don't have a dedicated packed instruction, so isel always scalarizes and generates separate element-wise instructions, followed by different strategies to pack them, in our case v_pack_b32_f16 (GFX8- vs GFX9+). So checking for SDWA candidates at such an early stage, or preventing this at the isel phase, seems like a lot of work.

@jayfoad, you could add to this if there is anything I missed or am unaware of!
