[Intrinsics][AArch64] Add intrinsic to mask off aliasing vector lanes #117007

Open

wants to merge 20 commits into main

Conversation

SamTebbs33
Collaborator

It can be unsafe to load a vector from an address and write a vector to an address if those two addresses have overlapping lanes within a vectorised loop iteration.

This PR adds an intrinsic designed to create a mask with lanes disabled if they overlap between the two pointer arguments, so that only safe lanes are loaded, operated on and stored.

Along with the two pointer parameters, the intrinsic also takes an immediate that represents the size in bytes of the vector element type, as well as an immediate i1 that is true if there is a write-after-read hazard or false if there is a read-after-write hazard.

This will be used by #100579 and replaces the existing lowering for whilewr, since that isn't needed now that we have the intrinsic.
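
As an illustration (a minimal sketch only; the pointer names, the 4-byte element size and the fixed vector factor of 4 are assumptions, not taken from the patch), a vectorised loop body would use the mask to guard its memory accesses and advance the induction variable by the number of active lanes:

%ptrA.int = ptrtoint ptr %ptrA to i64
%ptrB.int = ptrtoint ptr %ptrB to i64
%alias.lane.mask = call <4 x i1> @llvm.experimental.get.alias.lane.mask.v4i1.i64(i64 %ptrA.int, i64 %ptrB.int, i32 4, i1 1)
%vecA = call <4 x i32> @llvm.masked.load.v4i32.p0(ptr %ptrA, i32 4, <4 x i1> %alias.lane.mask, <4 x i32> poison)
; ... operate on the safe lanes ...
call void @llvm.masked.store.v4i32.p0(<4 x i32> %vecA, ptr %ptrB, i32 4, <4 x i1> %alias.lane.mask)
; Only the unmasked lanes were processed, so step the induction variable by the
; popcount of the mask rather than by the vector factor.
%mask.ext = zext <4 x i1> %alias.lane.mask to <4 x i64>
%active.lanes = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %mask.ext)
%index.next = add i64 %index, %active.lanes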

@llvmbot
Member

llvmbot commented Nov 20, 2024

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-llvm-selectiondag

Author: Sam Tebbs (SamTebbs33)

Changes

It can be unsafe to load a vector from an address and write a vector to an address if those two addresses have overlapping lanes within a vectorised loop iteration.

This PR adds an intrinsic designed to create a mask with lanes disabled if they overlap between the two pointer arguments, so that only safe lanes are loaded, operated on and stored.

Along with the two pointer parameters, the intrinsic also takes an immediate that represents the size in bytes of the vector element type, as well as an immediate i1 that is true if there is a write-after-read hazard or false if there is a read-after-write hazard.

This will be used by #100579 and replaces the existing lowering for whilewr, since that isn't needed now that we have the intrinsic.


Patch is 93.77 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/117007.diff

11 Files Affected:

  • (modified) llvm/docs/LangRef.rst (+80)
  • (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+5)
  • (modified) llvm/include/llvm/IR/Intrinsics.td (+5)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+44)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+59-123)
  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.h (+6)
  • (modified) llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td (+7-2)
  • (modified) llvm/lib/Target/AArch64/SVEInstrFormats.td (+5-5)
  • (added) llvm/test/CodeGen/AArch64/alias_mask.ll (+421)
  • (added) llvm/test/CodeGen/AArch64/alias_mask_scalable.ll (+82)
  • (removed) llvm/test/CodeGen/AArch64/whilewr.ll (-1086)
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 9f4c90ba82a419..c9589d5af8ebbe 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -23475,6 +23475,86 @@ Examples:
       %active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 %elem0, i64 429)
       %wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %3, i32 4, <4 x i1> %active.lane.mask, <4 x i32> poison)
 
+.. _int_experimental_get_alias_lane_mask:
+
+'``llvm.experimental.get.alias.lane.mask.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+      declare <4 x i1> @llvm.experimental.get.alias.lane.mask.v4i1.i64(i64 %ptrA, i64 %ptrB, i32 immarg %elementSize, i1 immarg %writeAfterRead)
+      declare <8 x i1> @llvm.experimental.get.alias.lane.mask.v8i1.i64(i64 %ptrA, i64 %ptrB, i32 immarg %elementSize, i1 immarg %writeAfterRead)
+      declare <16 x i1> @llvm.experimental.get.alias.lane.mask.v16i1.i64(i64 %ptrA, i64 %ptrB, i32 immarg %elementSize, i1 immarg %writeAfterRead)
+      declare <vscale x 16 x i1> @llvm.experimental.get.alias.lane.mask.nxv16i1.i64(i64 %ptrA, i64 %ptrB, i32 immarg %elementSize, i1 immarg %writeAfterRead)
+
+
+Overview:
+"""""""""
+
+Create a mask representing lanes that do or do not overlap between two pointers across one vector loop iteration.
+
+
+Arguments:
+""""""""""
+
+The first two arguments have the same scalar integer type.
+The third argument is an immediate giving the element size in bytes, the fourth is an
+immediate i1 selecting the hazard kind, and the result is a vector with the i1 element type.
+
+Semantics:
+""""""""""
+
+In the case that ``%writeAfterRead`` is true, the '``llvm.experimental.get.alias.lane.mask.*``' intrinsics are semantically equivalent
+to:
+
+::
+
+      %diff = (%ptrB - %ptrA) / %elementSize
+      %m[i] = (icmp ult i, %diff) || (%diff <= 0)
+
+Otherwise they are semantically equivalent to:
+
+::
+
+      %diff = abs(%ptrB - %ptrA) / %elementSize
+      %m[i] = (icmp ult i, %diff) || (%diff == 0)
+
+where ``%m`` is a vector (mask) of active/inactive lanes with its elements
+indexed by ``i``, and ``%ptrA``, ``%ptrB`` are the two i64 arguments to
+``llvm.experimental.get.alias.lane.mask.*``, ``%elementSize`` is the i32 immediate argument,
+``abs`` is the absolute value operation, ``icmp`` is an integer compare and ``ult`` is
+the unsigned less-than comparison operator. The subtraction between ``%ptrA`` and
+``%ptrB`` could be negative. The ``%writeAfterRead`` argument is expected to be true if
+``%ptrB`` is stored to after ``%ptrA`` is read from.
+The above is equivalent to:
+
+::
+
+      %m = @llvm.experimental.get.alias.lane.mask(%ptrA, %ptrB, %elementSize, %writeAfterRead)
+
+This can, for example, be emitted by the loop vectorizer, in which case
+``%ptrA`` is a pointer that is read from within the loop and ``%ptrB`` is a pointer that is stored to within the loop.
+If the difference between these pointers is less than the vector factor, then they overlap (alias) within a loop iteration.
+For example, if ``%ptrA`` is 20 and ``%ptrB`` is 23 with a vector factor of 8, then lanes 3, 4, 5, 6 and 7 of the vector loaded from ``%ptrA``
+share addresses with lanes 0, 1, 2, 3 and 4 of the vector stored to at ``%ptrB``.
+An alias mask of these two pointers should be <1, 1, 1, 0, 0, 0, 0, 0> so that only the non-overlapping lanes are loaded and stored.
+This operation allows many loops to be vectorised when it would otherwise be unsafe to do so.
+
+To account for the fact that only a subset of lanes have been operated on in an iteration,
+the loop's induction variable should be incremented by the popcount of the mask rather than the vector factor.
+
+This mask ``%m`` can e.g. be used in masked load/store instructions.
+
+
+Examples:
+"""""""""
+
+.. code-block:: llvm
+
+      %alias.lane.mask = call <4 x i1> @llvm.experimental.get.alias.lane.mask.v4i1.i64(i64 %ptrA, i64 %ptrB, i32 4, i1 1)
+      %vecA = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %ptrA, i32 4, <4 x i1> %alias.lane.mask, <4 x i32> poison)
+      [...]
+      call @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecA, <4 x i32>* %ptrB, i32 4, <4 x i1> %alias.lane.mask)
 
 .. _int_experimental_vp_splice:
 
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 6a41094ff933b0..0338310fd936df 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -468,6 +468,11 @@ class TargetLoweringBase {
     return true;
   }
 
+  /// Return true if the @llvm.experimental.get.alias.lane.mask intrinsic should be expanded using generic code in SelectionDAGBuilder.
+  virtual bool shouldExpandGetAliasLaneMask(EVT VT, EVT PtrVT, unsigned EltSize) const {
+    return true;
+  }
+
   virtual bool shouldExpandGetVectorLength(EVT CountVT, unsigned VF,
                                            bool IsScalable) const {
     return true;
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 1ca8c2565ab0b6..5f7073a531283e 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2363,6 +2363,11 @@ let IntrProperties = [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<1>>
                                     llvm_i32_ty]>;
 }
 
+def int_experimental_get_alias_lane_mask:
+  DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+            [llvm_anyint_ty, LLVMMatchType<1>, llvm_anyint_ty, llvm_i1_ty],
+            [IntrNoMem, IntrNoSync, IntrWillReturn, ImmArg<ArgIndex<2>>, ImmArg<ArgIndex<3>>]>;
+
 def int_get_active_lane_mask:
   DefaultAttrsIntrinsic<[llvm_anyvector_ty],
             [llvm_anyint_ty, LLVMMatchType<1>],
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 9d729d448502d8..39e84e06a8de60 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -8284,6 +8284,50 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
     visitVectorExtractLastActive(I, Intrinsic);
     return;
   }
+  case Intrinsic::experimental_get_alias_lane_mask: {
+    SDValue SourceValue = getValue(I.getOperand(0));
+    SDValue SinkValue = getValue(I.getOperand(1));
+    SDValue EltSize = getValue(I.getOperand(2));
+    bool IsWriteAfterRead = cast<ConstantSDNode>(getValue(I.getOperand(3)))->getZExtValue() != 0;
+    auto IntrinsicVT = EVT::getEVT(I.getType());
+    auto PtrVT = SourceValue->getValueType(0);
+
+    if (!TLI.shouldExpandGetAliasLaneMask(IntrinsicVT, PtrVT, cast<ConstantSDNode>(EltSize)->getSExtValue())) {
+      visitTargetIntrinsic(I, Intrinsic);
+      return;
+    }
+
+    SDValue Diff = DAG.getNode(ISD::SUB, sdl,
+                              PtrVT, SinkValue, SourceValue);
+    if (!IsWriteAfterRead)
+      Diff = DAG.getNode(ISD::ABS, sdl, PtrVT, Diff);
+
+    Diff = DAG.getNode(ISD::SDIV, sdl, PtrVT, Diff, EltSize);
+    SDValue Zero = DAG.getTargetConstant(0, sdl, PtrVT);
+
+    // If the difference is positive then some elements may alias
+    auto CmpVT = TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(),
+                                   PtrVT);
+    SDValue Cmp = DAG.getSetCC(sdl, CmpVT, Diff, Zero, IsWriteAfterRead ? ISD::SETLE : ISD::SETEQ);
+
+    // Splat the compare result then OR it with a lane mask
+    SDValue Splat = DAG.getSplat(IntrinsicVT, sdl, Cmp);
+
+    SDValue DiffMask;
+    // Don't emit an active lane mask if the target doesn't support it
+    if (TLI.shouldExpandGetActiveLaneMask(IntrinsicVT, PtrVT)) {
+        EVT VecTy = EVT::getVectorVT(*DAG.getContext(), PtrVT,
+                                     IntrinsicVT.getVectorElementCount());
+        SDValue DiffSplat = DAG.getSplat(VecTy, sdl, Diff);
+        SDValue VectorStep = DAG.getStepVector(sdl, VecTy);
+        DiffMask = DAG.getSetCC(sdl, IntrinsicVT, VectorStep,
+                                     DiffSplat, ISD::CondCode::SETULT);
+    } else {
+      DiffMask = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, sdl, IntrinsicVT, DAG.getTargetConstant(Intrinsic::get_active_lane_mask, sdl, MVT::i64), Zero, Diff);
+    }
+    SDValue Or = DAG.getNode(ISD::OR, sdl, IntrinsicVT, DiffMask, Splat);
+    setValue(&I, Or);
+  }
   }
 }
 
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 7ab3fc06715ec8..66eaec0d5ae6c9 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -2033,6 +2033,24 @@ bool AArch64TargetLowering::shouldExpandGetActiveLaneMask(EVT ResVT,
   return false;
 }
 
+bool AArch64TargetLowering::shouldExpandGetAliasLaneMask(EVT VT, EVT PtrVT, unsigned EltSize) const {
+  if (!Subtarget->hasSVE2())
+    return true;
+
+  if (PtrVT != MVT::i64)
+    return true;
+
+  if (VT == MVT::v2i1 || VT == MVT::nxv2i1)
+    return EltSize != 8;
+  if (VT == MVT::v4i1 || VT == MVT::nxv4i1)
+    return EltSize != 4;
+  if (VT == MVT::v8i1 || VT == MVT::nxv8i1)
+    return EltSize != 2;
+  if (VT == MVT::v16i1 || VT == MVT::nxv16i1)
+    return EltSize != 1;
+  return true;
+}
+
 bool AArch64TargetLowering::shouldExpandPartialReductionIntrinsic(
     const IntrinsicInst *I) const {
   if (I->getIntrinsicID() != Intrinsic::experimental_vector_partial_reduce_add)
@@ -2796,6 +2814,8 @@ const char *AArch64TargetLowering::getTargetNodeName(unsigned Opcode) const {
     MAKE_CASE(AArch64ISD::LS64_BUILD)
     MAKE_CASE(AArch64ISD::LS64_EXTRACT)
     MAKE_CASE(AArch64ISD::TBL)
+    MAKE_CASE(AArch64ISD::WHILEWR)
+    MAKE_CASE(AArch64ISD::WHILERW)
     MAKE_CASE(AArch64ISD::FADD_PRED)
     MAKE_CASE(AArch64ISD::FADDA_PRED)
     MAKE_CASE(AArch64ISD::FADDV_PRED)
@@ -5881,6 +5901,16 @@ SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
     EVT PtrVT = getPointerTy(DAG.getDataLayout());
     return DAG.getNode(AArch64ISD::THREAD_POINTER, dl, PtrVT);
   }
+  case Intrinsic::aarch64_sve_whilewr_b:
+  case Intrinsic::aarch64_sve_whilewr_h:
+  case Intrinsic::aarch64_sve_whilewr_s:
+  case Intrinsic::aarch64_sve_whilewr_d:
+    return DAG.getNode(AArch64ISD::WHILEWR, dl, Op.getValueType(), Op.getOperand(1), Op.getOperand(2));
+  case Intrinsic::aarch64_sve_whilerw_b:
+  case Intrinsic::aarch64_sve_whilerw_h:
+  case Intrinsic::aarch64_sve_whilerw_s:
+  case Intrinsic::aarch64_sve_whilerw_d:
+    return DAG.getNode(AArch64ISD::WHILERW, dl, Op.getValueType(), Op.getOperand(1), Op.getOperand(2));
   case Intrinsic::aarch64_neon_abs: {
     EVT Ty = Op.getValueType();
     if (Ty == MVT::i64) {
@@ -6340,16 +6370,39 @@ SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
     return DAG.getNode(AArch64ISD::USDOT, dl, Op.getValueType(),
                        Op.getOperand(1), Op.getOperand(2), Op.getOperand(3));
   }
+  case Intrinsic::experimental_get_alias_lane_mask:
   case Intrinsic::get_active_lane_mask: {
+    unsigned IntrinsicID = Intrinsic::aarch64_sve_whilelo;
+    if (IntNo == Intrinsic::experimental_get_alias_lane_mask) {
+      uint64_t EltSize = Op.getOperand(3)->getAsZExtVal();
+      bool IsWriteAfterRead = Op.getOperand(4)->getAsZExtVal() == 1;
+      switch (EltSize) {
+        case 1:
+          IntrinsicID = IsWriteAfterRead ? Intrinsic::aarch64_sve_whilewr_b : Intrinsic::aarch64_sve_whilerw_b;
+          break;
+        case 2:
+          IntrinsicID = IsWriteAfterRead ? Intrinsic::aarch64_sve_whilewr_h : Intrinsic::aarch64_sve_whilerw_h;
+          break;
+        case 4:
+          IntrinsicID = IsWriteAfterRead ? Intrinsic::aarch64_sve_whilewr_s : Intrinsic::aarch64_sve_whilerw_s;
+          break;
+        case 8:
+          IntrinsicID = IsWriteAfterRead ? Intrinsic::aarch64_sve_whilewr_d : Intrinsic::aarch64_sve_whilerw_d;
+          break;
+        default:
+          llvm_unreachable("Unexpected element size for get.alias.lane.mask");
+          break;
+      }
+    }
     SDValue ID =
-        DAG.getTargetConstant(Intrinsic::aarch64_sve_whilelo, dl, MVT::i64);
+        DAG.getTargetConstant(IntrinsicID, dl, MVT::i64);
 
     EVT VT = Op.getValueType();
     if (VT.isScalableVector())
       return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, VT, ID, Op.getOperand(1),
                          Op.getOperand(2));
 
-    // We can use the SVE whilelo instruction to lower this intrinsic by
+    // We can use the SVE whilelo/whilewr/whilerw instruction to lower this intrinsic by
     // creating the appropriate sequence of scalable vector operations and
     // then extracting a fixed-width subvector from the scalable vector.
 
@@ -14241,128 +14294,8 @@ static SDValue tryLowerToSLI(SDNode *N, SelectionDAG &DAG) {
   return ResultSLI;
 }
 
-/// Try to lower the construction of a pointer alias mask to a WHILEWR.
-/// The mask's enabled lanes represent the elements that will not overlap across
-/// one loop iteration. This tries to match:
-/// or (splat (setcc_lt (sub ptrA, ptrB), -(element_size - 1))),
-///    (get_active_lane_mask 0, (div (sub ptrA, ptrB), element_size))
-SDValue tryWhileWRFromOR(SDValue Op, SelectionDAG &DAG,
-                         const AArch64Subtarget &Subtarget) {
-  if (!Subtarget.hasSVE2())
-    return SDValue();
-  SDValue LaneMask = Op.getOperand(0);
-  SDValue Splat = Op.getOperand(1);
-
-  if (Splat.getOpcode() != ISD::SPLAT_VECTOR)
-    std::swap(LaneMask, Splat);
-
-  if (LaneMask.getOpcode() != ISD::INTRINSIC_WO_CHAIN ||
-      LaneMask.getConstantOperandVal(0) != Intrinsic::get_active_lane_mask ||
-      Splat.getOpcode() != ISD::SPLAT_VECTOR)
-    return SDValue();
-
-  SDValue Cmp = Splat.getOperand(0);
-  if (Cmp.getOpcode() != ISD::SETCC)
-    return SDValue();
-
-  CondCodeSDNode *Cond = cast<CondCodeSDNode>(Cmp.getOperand(2));
-
-  auto ComparatorConst = dyn_cast<ConstantSDNode>(Cmp.getOperand(1));
-  if (!ComparatorConst || ComparatorConst->getSExtValue() > 0 ||
-      Cond->get() != ISD::CondCode::SETLT)
-    return SDValue();
-  unsigned CompValue = std::abs(ComparatorConst->getSExtValue());
-  unsigned EltSize = CompValue + 1;
-  if (!isPowerOf2_64(EltSize) || EltSize > 8)
-    return SDValue();
-
-  SDValue Diff = Cmp.getOperand(0);
-  if (Diff.getOpcode() != ISD::SUB || Diff.getValueType() != MVT::i64)
-    return SDValue();
-
-  if (!isNullConstant(LaneMask.getOperand(1)) ||
-      (EltSize != 1 && LaneMask.getOperand(2).getOpcode() != ISD::SRA))
-    return SDValue();
-
-  // The number of elements that alias is calculated by dividing the positive
-  // difference between the pointers by the element size. An alias mask for i8
-  // elements omits the division because it would just divide by 1
-  if (EltSize > 1) {
-    SDValue DiffDiv = LaneMask.getOperand(2);
-    auto DiffDivConst = dyn_cast<ConstantSDNode>(DiffDiv.getOperand(1));
-    if (!DiffDivConst || DiffDivConst->getZExtValue() != Log2_64(EltSize))
-      return SDValue();
-    if (EltSize > 2) {
-      // When masking i32 or i64 elements, the positive value of the
-      // possibly-negative difference comes from a select of the difference if
-      // it's positive, otherwise the difference plus the element size if it's
-      // negative: pos_diff = diff < 0 ? (diff + 7) : diff
-      SDValue Select = DiffDiv.getOperand(0);
-      // Make sure the difference is being compared by the select
-      if (Select.getOpcode() != ISD::SELECT_CC || Select.getOperand(3) != Diff)
-        return SDValue();
-      // Make sure it's checking if the difference is less than 0
-      if (!isNullConstant(Select.getOperand(1)) ||
-          cast<CondCodeSDNode>(Select.getOperand(4))->get() !=
-              ISD::CondCode::SETLT)
-        return SDValue();
-      // An add creates a positive value from the negative difference
-      SDValue Add = Select.getOperand(2);
-      if (Add.getOpcode() != ISD::ADD || Add.getOperand(0) != Diff)
-        return SDValue();
-      if (auto *AddConst = dyn_cast<ConstantSDNode>(Add.getOperand(1));
-          !AddConst || AddConst->getZExtValue() != EltSize - 1)
-        return SDValue();
-    } else {
-      // When masking i16 elements, this positive value comes from adding the
-      // difference's sign bit to the difference itself. This is equivalent to
-      // the 32 bit and 64 bit case: pos_diff = diff + sign_bit (diff)
-      SDValue Add = DiffDiv.getOperand(0);
-      if (Add.getOpcode() != ISD::ADD || Add.getOperand(0) != Diff)
-        return SDValue();
-      // A logical right shift by 63 extracts the sign bit from the difference
-      SDValue Shift = Add.getOperand(1);
-      if (Shift.getOpcode() != ISD::SRL || Shift.getOperand(0) != Diff)
-        return SDValue();
-      if (auto *ShiftConst = dyn_cast<ConstantSDNode>(Shift.getOperand(1));
-          !ShiftConst || ShiftConst->getZExtValue() != 63)
-        return SDValue();
-    }
-  } else if (LaneMask.getOperand(2) != Diff)
-    return SDValue();
-
-  SDValue StorePtr = Diff.getOperand(0);
-  SDValue ReadPtr = Diff.getOperand(1);
-
-  unsigned IntrinsicID = 0;
-  switch (EltSize) {
-  case 1:
-    IntrinsicID = Intrinsic::aarch64_sve_whilewr_b;
-    break;
-  case 2:
-    IntrinsicID = Intrinsic::aarch64_sve_whilewr_h;
-    break;
-  case 4:
-    IntrinsicID = Intrinsic::aarch64_sve_whilewr_s;
-    break;
-  case 8:
-    IntrinsicID = Intrinsic::aarch64_sve_whilewr_d;
-    break;
-  default:
-    return SDValue();
-  }
-  SDLoc DL(Op);
-  SDValue ID = DAG.getConstant(IntrinsicID, DL, MVT::i32);
-  return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, Op.getValueType(), ID,
-                     StorePtr, ReadPtr);
-}
-
 SDValue AArch64TargetLowering::LowerVectorOR(SDValue Op,
                                              SelectionDAG &DAG) const {
-  if (SDValue SV =
-          tryWhileWRFromOR(Op, DAG, DAG.getSubtarget<AArch64Subtarget>()))
-    return SV;
-
   if (useSVEForFixedLengthVectorVT(Op.getValueType(),
                                    !Subtarget->isNeonAvailable()))
     return LowerToScalableOp(Op, DAG);
@@ -19609,7 +19542,9 @@ static bool isPredicateCCSettingOp(SDValue N) {
         N.getConstantOperandVal(0) == Intrinsic::aarch64_sve_whilels ||
         N.getConstantOperandVal(0) == Intrinsic::aarch64_sve_whilelt ||
         // get_active_lane_mask is lowered to a whilelo instruction.
-        N.getConstantOperandVal(0) == Intrinsic::get_active_lane_mask)))
+        N.getConstantOperandVal(0) == Intrinsic::get_active_lane_mask ||
+        // get_alias_lane_mask is lowered to a whilewr/rw instruction.
+        N.getConstantOperandVal(0) == Intrinsic::experimental_get_alias_lane_mask)))
     return true;
 
   return false;
@@ -27175,6 +27110,7 @@ void AArch64TargetLowering::ReplaceNodeResults(
       return;
     }
     case Intrinsic::experimental_vector_match:
+    case Intrinsic::experimental_get_alias_lane_mask:
     case Intrinsic::get_active_lane_mask: {
       if (!VT.isFixedLengthVector() || VT.getVectorElementType() != MVT::i1)
         return;
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index cb0b9e965277aa..b2f766b22911ff 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -297,6 +297,10 @@ enum NodeType : unsigned {
   SMAXV,
   UMAXV,
 
+  // Alias lane masks
+  WHILEWR,
+  WHILERW,
+
   SADDV_PRED,
   UADDV_PRED,
   SMAXV_PRED,
@@ -980,6 +984,8 @@ class AArch64TargetLowering : public TargetLowering {
 
   bool shouldExpandGetActiveLaneMask(EVT VT, EVT OpVT) const override;
 
+  bool shouldExpandGetAliasLaneMask(EVT VT, EVT PtrVT, unsigned EltSize) const override;
+
   bool
   shouldExpandPartialReductionIntrinsic(const IntrinsicInst *I) const override;
 
diff --git a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
index 564fb33758ad57..99b1e0618ab34b 100644
--- a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
@@ -140,6 +140,11 @@ def AArch64st1q_scatter : SDNode<"AArch64ISD::SST1Q_PRED", SDT_AArch64_SCATTER_V
 // AArch64 SVE/SVE2 - the remaining node definitions
 //
 
+// Alias masks
+def SDT_AArch64Mask : SDTypeProfile<1, 2, [SDTCisVec<0>, SDTCisInt<1>, SDTCisSameAs<2, 1>, SDTCVecEltisVT<0,i1>]>;
+def AArch64whilewr : SDNode<"AArch64ISD::WHILEWR", SDT_AArch64Mask>;
+def AArch64whilerw : SDNode<"AArch64ISD::WHILERW", SDT_AArch64Mask>;
+
 // SVE CNT/INC/RDVL
 def sve_rdvl_imm : Co...
[truncated]

@llvmbot
Member

llvmbot commented Nov 20, 2024

@llvm/pr-subscribers-backend-aarch64



github-actions bot commented Nov 20, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

if (!IsWriteAfterRead)
  Diff = DAG.getNode(ISD::ABS, sdl, PtrVT, Diff);

Diff = DAG.getNode(ISD::SDIV, sdl, PtrVT, Diff, EltSize);
Contributor

@JamesChesterman JamesChesterman Nov 20, 2024

Sorry if I'm wrong, but wouldn't this line always be executed, even if the !IsWriteAfterRead condition is met? So if the if statement above is entered, Diff is set to the ABS node and is then overwritten and set to the SDIV node?
Maybe you forgot a return in the if statement?

Contributor

@NickGuy-Arm NickGuy-Arm Nov 20, 2024

From my understanding: !IsWriteAfterRead would imply that Diff is likely to be negative, so this inserts the ISD::ABS immediately before using Diff as an operand to the SDIV node, ensuring that it is positive.

Arguably the if isn't needed there, as ABS-ing a positive number just returns the same number, but then we're adding nodes that we know don't do anything.

Contributor

Oh yes sorry I didn't notice the Diff as an argument in the second assignment.
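
For reference, roughly equivalent IR to what this generic expansion builds for the read-after-write case (the 4-byte element size and fixed VF of 4 here are assumptions for illustration); the abs result feeds straight into the division being discussed:

%sub = sub i64 %ptrB, %ptrA
%diff.abs = call i64 @llvm.abs.i64(i64 %sub, i1 false) ; only emitted when !IsWriteAfterRead
%diff = sdiv i64 %diff.abs, 4                          ; divide by the element size in bytes
%none.alias = icmp eq i64 %diff, 0                     ; %diff == 0 means every lane is safe
%none.ins = insertelement <4 x i1> poison, i1 %none.alias, i64 0
%none.splat = shufflevector <4 x i1> %none.ins, <4 x i1> poison, <4 x i32> zeroinitializer
%diff.ins = insertelement <4 x i64> poison, i64 %diff, i64 0
%diff.splat = shufflevector <4 x i64> %diff.ins, <4 x i64> poison, <4 x i32> zeroinitializer
%lane.lt = icmp ult <4 x i64> <i64 0, i64 1, i64 2, i64 3>, %diff.splat
%mask = or <4 x i1> %lane.lt, %none.splat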

@@ -0,0 +1,82 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=aarch64 -mattr=+sve2 %s -o - | FileCheck %s
Collaborator

Is it worth having a +sve and a +sve2 run line?

Collaborator Author

Done.

Comment on lines 23530 to 23531
immediate argument, ``%abs`` is the absolute difference operation, ``%icmp`` is
an integer compare and ``ult`` the unsigned less-than comparison operator. The
Collaborator

Remove the explanation of abs, icmp and ult.

Collaborator Author

Done.

``llvm.experimental.get.alias.lane.mask.*``, ``%elementSize`` is the first
immediate argument, ``%abs`` is the absolute difference operation, ``%icmp`` is
an integer compare and ``ult`` the unsigned less-than comparison operator. The
subtraction between ``%ptrA`` and ``%ptrB`` could be negative. The
Collaborator

I would remove this line about the result being negative too.

Collaborator Author

Done.

Collaborator Author

Done.

Comment on lines 23551 to 23552
The intrinsic will return poison if ``%ptrA`` and ``%ptrB`` are within
VF * ``%elementSize`` of each other and ``%ptrA`` + VF * ``%elementSize`` wraps.
Collaborator

Move this up above the other explanation to make it more prominent, and explain how the (%ptrB - %ptrA) / %elementSize doesn't always apply.

Collaborator Author

@SamTebbs33 SamTebbs33 Dec 2, 2024

Done. Let me know if it needs changing more.

Collaborator

@davemgreen davemgreen left a comment

Thanks, I'm not sure about these shouldExpand functions but I can see that it is used elsewhere, and in general this LGTM. It would be good to use these to generate runtime alias checks using the last lane.
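
As an aside, one hypothetical shape such a runtime check could take (illustrative only, not part of this patch) is to branch on the last lane of the mask: if the final lane is active then no lanes alias in this iteration and an unpredicated fast path can be used:

%alias.lane.mask = call <4 x i1> @llvm.experimental.get.alias.lane.mask.v4i1.i64(i64 %ptrA, i64 %ptrB, i32 4, i1 1)
; If the last lane is active, none of the VF lanes alias, so the unmasked vector
; body can run; otherwise fall back to the masked body driven by the mask.
%last.lane = extractelement <4 x i1> %alias.lane.mask, i64 3
br i1 %last.lane, label %vector.body, label %masked.vector.body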

SamTebbs33 added a commit that referenced this pull request Dec 20, 2024
Reverts #100769

A bug in the lowering (the subtraction should be reversed) was found
after merging and it will all be replaced by #117007 anyway.
@SamTebbs33
Collaborator Author

SamTebbs33 commented Jan 13, 2025

I've made some changes to relocate the default lowering for the intrinsic so that SelectionDAGBuilder.cpp doesn't call any TTI hooks.

The AArch64 WHILEWR/RW instructions accept scalable vector output types so I mimicked the methods used for active lane mask lowering when the output is fixed instead. This originally produced an extra xtn instruction but making the whilewr/rw intrinsics generic (i.e. without separate b, h, s and d variants) removed the need for this extra instruction.

Collaborator

@davemgreen davemgreen left a comment

If you want to upgrade the whilewr intrinsics (which I think sounds OK to me), then it will need auto-update code something like in https://github.com/llvm/llvm-project/pull/120363/files#diff-0c0305d510a076cef711c006c1d9fd78c95cade1f597d21ee46fd753e6982316.
It might be good to separate that out into a separate patch too, to keep things manageable.

@@ -567,6 +567,9 @@ std::string SDNode::getOperationName(const SelectionDAG *G) const {
case ISD::EXPERIMENTAL_VECTOR_HISTOGRAM:
return "histogram";

case ISD::EXPERIMENTAL_ALIAS_LANE_MASK:
return "alias_mask";
Collaborator

alias_lane_mask

Collaborator Author

Done.

@@ -2033,6 +2041,25 @@ bool AArch64TargetLowering::shouldExpandGetActiveLaneMask(EVT ResVT,
return false;
}

bool AArch64TargetLowering::shouldExpandGetAliasLaneMask(
Collaborator

Can this be removed now?

Collaborator Author

It certainly can. Done.

@SamTebbs33
Collaborator Author

If you want to upgrade the whilewr intrinsics (which I think sounds OK to me), then it will need auto-update code something like in https://github.com/llvm/llvm-project/pull/120363/files#diff-0c0305d510a076cef711c006c1d9fd78c95cade1f597d21ee46fd753e6982316. It might be good to separate that out into a separate patch too, to keep things manageable.

Thanks for that. I've removed them and am no longer seeing the extra xtn instruction that I was before so it looks like those changes are actually unnecessary anyway!

@SamTebbs33 SamTebbs33 force-pushed the alias-intrinsic branch 2 times, most recently from d093339 to 7124e2c on January 22, 2025
} // End HasSVE2_or_SME
defm WHILEWR_PXX : sve2_int_while_rr<0b0, "whilewr", AArch64whilewr>;
defm WHILERW_PXX : sve2_int_while_rr<0b1, "whilerw", AArch64whilerw>;
} // End HasSVE2orSME
Collaborator

Undo this. (It looks like a merge-conflict went the wrong way).

Collaborator Author

Thanks for spotting that, fixed.

@@ -19861,7 +19946,8 @@ static SDValue getPTest(SelectionDAG &DAG, EVT VT, SDValue Pg, SDValue Op,
AArch64CC::CondCode Cond);

static bool isPredicateCCSettingOp(SDValue N) {
if ((N.getOpcode() == ISD::SETCC) ||
if ((N.getOpcode() == ISD::SETCC ||
N.getOpcode() == ISD::EXPERIMENTAL_ALIAS_LANE_MASK) ||
Collaborator

Does adding this mean we need to always lower this to a while?

Collaborator Author

Perhaps it does. Do you think it's a problem that not all vector types are marked as legal?

Collaborator

Does it do anything at the moment, and do you have a test for it? I think this is only used in performFirstTrueTestVectorCombine, and that has a test for !isBeforeLegalize so it protects against the wrong types. Maybe it is good to separate that part out into a new commit and make sure it has plenty of tests if it is not needed already.

Collaborator Author

I was just following what had been implemented for get.active.lane.mask so I can remove this and revisit it if needed later.

Collaborator

@davemgreen davemgreen left a comment

Thanks. LGTM

""""""""""

The first two arguments have the same scalar integer type.
The final two are immediates and the result is a vector with the i1 element type.
Collaborator

Is elementSize the element size in bits or in bytes?
What is the meaning of %writeAfterRead = 0? Does that mean ReadAfterWrite?

Collaborator Author

I'll clarify those 👍

Comment on lines 23708 to 23710
%vecA = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %ptrA, i32 4, <4 x i1> %alias.lane.mask, <4 x i32> poison)
[...]
call @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecA, <4 x i32>* %ptrB, i32 4, <4 x i1> %alias.lane.mask)

Collaborator:

nit:

Suggested change
%vecA = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %ptrA, i32 4, <4 x i1> %alias.lane.mask, <4 x i32> poison)
[...]
call @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecA, <4 x i32>* %ptrB, i32 4, <4 x i1> %alias.lane.mask)
%vecA = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(ptr %ptrA, i32 4, <4 x i1> %alias.lane.mask, <4 x i32> poison)
[...]
call @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecA, ptr %ptrB, i32 4, <4 x i1> %alias.lane.mask)

Collaborator Author:

Done.

Resolved review thread: llvm/docs/LangRef.rst
EVT ContainerVT = getContainerForFixedLengthVector(DAG, VT);
EVT WhileVT = ContainerVT.changeElementType(MVT::i1);

SDValue Mask = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, WhileVT, ID,

Collaborator:

What's the reason for the indirection through an intrinsic call, rather than creating a AArch64ISD::WHILERW/WHILEWR node directly?

Collaborator Author:

I think it's because I used to share code with the get.active.lane.mask lowering which uses intrinsics. Changed to emit the ISD node now.

@@ -6475,18 +6556,20 @@ SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
return DAG.getNode(AArch64ISD::USDOT, dl, Op.getValueType(),
Op.getOperand(1), Op.getOperand(2), Op.getOperand(3));
}
case Intrinsic::experimental_get_alias_lane_mask:

Collaborator:

This is no longer relevant, because the intrinsic is always lowered to an EXPERIMENTAL_ALIAS_LANE_MASK node by SelectionDAGBuilder?

Collaborator Author:

That's correct, removed now.


'``llvm.experimental.get.alias.lane.mask.*``' Intrinsics
'``llvm.experimental.get.nonalias.lane.mask.*``' Intrinsics

Member:

(Optional): What about noalias? Currently there are 0 hits for "nonalias" in the LangRef, but "noalias" is a thing.

Suggested change
'``llvm.experimental.get.nonalias.lane.mask.*``' Intrinsics
'``llvm.experimental.get.noalias.lane.mask.*``' Intrinsics

Collaborator Author:

I like that, changed.

::

%diff = abs(%ptrB - %ptrA) / %elementSize
%m[i] = (icmp ult i, %diff) || (%diff == 0)

Collaborator:

|| %diff == 0
^^ Ironically this says that if the accesses fully alias, the noalias intrinsic returns all true.
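
A minimal C++ model of the pseudocode above, assuming a fixed VF of 4 and an illustrative function name (this is not the actual lowering), makes the %diff == 0 corner case easy to see:

#include <array>
#include <cstdint>
#include <cstdlib>

// Scalar model of: %diff = abs(%ptrB - %ptrA) / %elementSize
//                  %m[i] = (icmp ult i, %diff) || (%diff == 0)
std::array<bool, 4> noaliasLaneMask(const char *PtrA, const char *PtrB,
                                    uint64_t ElementSize) {
  uint64_t Diff =
      static_cast<uint64_t>(std::llabs(PtrB - PtrA)) / ElementSize;
  std::array<bool, 4> Mask;
  for (uint64_t I = 0; I < Mask.size(); ++I)
    // Note: a distance of 0 (full alias) yields an all-true mask.
    Mask[I] = (I < Diff) || (Diff == 0);
  return Mask;
}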


::

declare <4 x i1> @llvm.experimental.get.noalias.lane.mask.v4i1(ptr %ptrA, ptr %ptrB, i64 immarg %elementSize, i1 immarg %writeAfterRead)

Collaborator:

I had to dig a bit into the semantics of this intrinsic and the corresponding whilerw/whilewr instructions for AArch64 to get a better understanding. It seems there are 6 different cases to distinguish:

  1. read-after-write, positive distance
  2. write-after-read, positive distance
  3. read-after-write, negative distance
  4. write-after-read, negative distance
  5. write-after-read, zero distance
  6. read-after-write, zero distance

(1) and (2) would change the original order of read/write operations when vectorising with a sufficiently large VF, hence the reason for wanting a mask to know how many lanes are safe to vectorise.
(3) would not change the order of read/write operations, but may result in a performance penalty, because the write must have completed fully before being able to read.
(4), (5) and (6) are no problem for vectorization, because none of those would result in read/write operations being reordered when being vectorized.

I can see that %writeAfterRead was added to handle case (3), but to me the semantics of it are not very intuitive, particularly because the parameter only makes sense depending on the context in which it is used (i.e. whether it is used in the context of a read-write, or write-read sequence) rather than an intuitive operation on a vector of values that returns a result vector.

Rather than encoding that with an extra immediate parameter, I think it's worth considering a slightly different intrinsic that returns an 'alias dependence vector', rather than returning a boolean mask. The dependence can either be a negative value (for negative distance), a positive value (for positive distance) or 0 (which would either mean a dependence distance of 0 or no dependence). It's then possible to use an icmp instruction to generate a no-alias mask for the case you're interested in, i.e. icmp le ... zeroinitializer for cases (1) and (2), or icmp eq ... zeroinitializer for case (3).

We could then also remove the %elementSize parameter and instead encode this information in the return type. For example:

%intra.vector.deps = call <4 x i32> @llvm.experimental.alias.vector.v4i32(ptr %a, ptr %b)
%mask = icmp eq <4 x i32> %intra.vector.deps, zeroinitializer

This would generate a no-alias mask for an element size of 4 (i32), and for AArch64 would map to a whilerw instruction.

I would suggest not prescribing what exactly the 'positive' and 'negative' values must be, but only that they must be values that describe the dependence direction and fit the element type (which must be bigger than 1 bit, in order to describe both a positive, negative and zero-value). If for example we say that it must describe the dependence distance (signed value), that might put unnecessary restrictions on the minimum size of the element for targets that have no architectural maximum VF.

Collaborator:

Can you explain why this is better than the simpler approach of having the intrinsic represent the operation that we want?

Collaborator:

These whilerw/wr instructions always do my head in, specifically the part that the operation is not symmetrical (i.e. positive RAW/WAR = cannot vectorize, negative RAW = may not want to vectorize, negative WAR = fine).

Two things I don't particularly like about the current intrinsic:

  • A noalias mask of all-ones can mean "it doesn't alias" or "it fully aliases" (although the latter is no issue for vectorization).
  • The semantics of the 'readAfterWrite' operand isn't intuitive with regards to 'noalias'. The description explains what this means for the bits in the result vector, but it doesn't really describe the rationale for it.

My reason for suggesting an alternative approach was to simplify the semantics and signature of the intrinsic. But that only makes sense if it has no downsides; in particular I hoped that the direction vector could be a side-product of generating the diff and that folding this would come practically for free. But the former doesn't seem to be the case (truncating the diff value, even taking the architectural maximum into account, would give issues for i8 elements, because for SVE we couldn't represent a value in the range -256 to 255 in an i8). So just ignore this suggestion!

To address my concerns, maybe we can rename the intrinsic to something other than '[no]alias', and split it up into separate intrinsics rather than specifying two distinct operations via an immediate.

I see the following combinations:

  1. llvm.vector.no.loop.carried.backward.raw.dep.mask(ptr, ptr, i32)
  2. llvm.vector.no.loop.carried.forward.raw.dep.mask(ptr, ptr, i32)
  3. llvm.vector.no.loop.carried.backward.war.dep.mask(ptr, ptr, i32)
  4. llvm.vector.no.loop.carried.forward.war.dep.mask(ptr, ptr, i32)

The SVE instruction whilerw implements the AND'ed result of (1) and (2), whereas the whilewr instruction implements (3). It is probably not worth adding an intrinsic for (4).

We could choose to keep (1) and (2) separate and add a pattern to match the combined case, or we could merge them into a single llvm.vector.no.loop.carried.raw.dep.mask(ptr, ptr, i32) to be equivalent to the current intrinsic with readAfterWrite = true. In this case the whole thing would be a rename.

Any thoughts?

Collaborator:

One of the places whilewr might be most useful would be in checking that there is no aliasing in the standard way that we check for aliasing in vector loops, i.e. what gcc uses them for in https://godbolt.org/z/b5o3KEaod, where it replaces a sub+cmp so long as the whilewr is quick enough. If we could just pattern-recognise those, that would be ideal. Unfortunately it doesn't mathematically work for arbitrary i64 values because of the overflow issue. (The instruction uses an i65 internal subtract.)

I'm not sure about the general idea of using these whiles to update a mask that is used in the loop, if we try to do that for any loop where we don't know whether the accesses alias. If it adds extra instructions to the loop body and the actual chance of the accesses overlapping is very low, it might not be worth it compared to just having alias checks outside the loop. I may be mistaken about how useful it is, but perhaps it is worth investigating further before we decide anything here.

For the intrinsic used for the simpler alias checks, whether that uses these instructions or something else, it is generally useful to have fewer instructions to make cost-modelling simpler, and it would be good if, after they are introduced (during LTO for example), we can constant-fold "noalias" pointers, i.e. if we later find that the two pointers are noalias, that should imply that the mask is all-true or all-false.
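
For context, a sketch of the kind of check being discussed, assuming the gcc example has roughly this shape (names and parameters here are illustrative, not taken from the compiler): the vectoriser proves no overlap for a whole vector iteration with a pointer subtract-and-compare, and a whilewr-style mask could stand in for it when it is cheap enough.

#include <cstddef>

// Illustrative only: "safe to process VF lanes at once" if the store pointer
// is behind the load pointer, or at least VF * EltSize bytes ahead of it.
bool noOverlapForVF(const char *Dst, const char *Src, unsigned VF,
                    unsigned EltSize) {
  std::ptrdiff_t D = Dst - Src;
  return D <= 0 || static_cast<std::size_t>(D) >=
                       static_cast<std::size_t>(VF) * EltSize;
}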

Comment on lines 23736 to 23739
.. _int_experimental_get_noalias_lane_mask:

'``llvm.experimental.get.noalias.lane.mask.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Member:

Looks like you need to update the LangRef with the new names.

@@ -23760,27 +23762,25 @@ across one vector loop iteration.
Arguments:
""""""""""

The first two arguments have the same scalar integer type.
The final two are immediates and the result is a vector with the i1 element type.
The first two arguments have the same pointer type.

Member:

nit: There's only one pointer type.

Suggested change
The first two arguments have the same pointer type.
The first two arguments are pointers.

Comment on lines 23765 to 23766
The first two arguments have the same pointer type.
The final one is an immediate and the result is a vector with the i1 element type.

Member:

nit: I think it reads a little better to mention all the arguments in the first sentence, then the result separately.

Suggested change
The first two arguments have the same pointer type.
The final one is an immediate and the result is a vector with the i1 element type.
The first two arguments are pointers and the last argument is an immediate.
The result is a vector with the i1 element type.

Collaborator Author:

That's much better, thank you.

SamTebbs33 added 20 commits May 6, 2025 11:21

@sdesmalen-arm (Collaborator) left a comment:

Sorry for the delay reviewing this. I mostly left some comments on the wording in the LangRef, but it looks mostly fine to me.

Overview:
"""""""""

Create a mask enabling lanes that do not overlap between two pointers

Collaborator:

Perhaps it's useful to give a brief summary of read-after-write/write-after-read hazards, e.g.

A read-after-write hazard occurs when a read-after-write sequence for a given lane in a vector ends up being executed as a write-after-read sequence due to the aliasing of pointers. Conversely, a write-after-read hazard occurs when a write-after-read sequence for a given lane in a vector ends up being executed as a read-after-write sequence due to the aliasing of pointers. These intrinsics return a mask that can be used to safely vectorize without any hazards.

I also think it would help to be a bit more specific in the description, because this is about more than 'no overlap', e.g.

For write-after-read:

Given a scalar load from %ptrA, followed by a scalar store to %ptrB, this instruction generates a mask where an active lane indicates that there is no write-after-read hazard for this lane.

For read-after-write:

Given a scalar store to %ptrA, followed by a scalar load from %ptrB, this instruction generates a mask where an active lane indicates that there is no read-after-write hazard for this lane and that this lane does not introduce any new store-to-load forwarding hazard.
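
To make those hazard definitions concrete, two assumed scalar loops (not taken from the patch) are sketched below; when either loop is vectorised and the pointers overlap at a small positive distance, the scalar load/store order within a vector iteration is no longer respected, which is what the proposed mask guards against.

// Write-after-read in scalar order: load a[i], then store to b[i]. A wide
// vector store to b could clobber elements that later lanes still need to
// load from a.
void war_loop(int *b, const int *a, int n) {
  for (int i = 0; i < n; ++i)
    b[i] = a[i] + 1;
}

// Read-after-write in scalar order: store to a[i], then load from b[i]. A
// wide vector load from b could read elements before the overlapping store
// to a has been committed.
int raw_loop(int *a, const int *b, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    a[i] = i;     // store first ...
    sum += b[i];  // ... then load; b may alias a at a small offset
  }
  return sum;
}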

Comment on lines +23949 to +23950
The intrinsic will return poison if ``%ptrA`` and ``%ptrB`` are within
VF * ``%elementSize`` of each other and ``%ptrA`` + VF * ``%elementSize`` wraps.

Collaborator:

Should this include wrapping of ptrB + VF * elementSize as well? i.e.

Suggested change
The intrinsic will return poison if ``%ptrA`` and ``%ptrB`` are within
VF * ``%elementSize`` of each other and ``%ptrA`` + VF * ``%elementSize`` wraps.
The intrinsic returns ``poison`` if the distance between ``ptrA`` and ``ptrB`` is smaller than ``VF * elementSize`` and either ``ptrA + VF * elementSize`` or ``ptrB + VF * elementSize`` wraps.

?

.. _int_experimental_loop_dependence_war_mask:
.. _int_experimental_loop_dependence_raw_mask:

'``llvm.experimental.loop.dependence.raw.mask.*``' and '``llvm.experimental.loop.dependence.war.mask.*``' Intrinsics

Collaborator:

Can you separate these out into two separate sections?

Comment on lines +23956 to +23957
%diff = (%ptrB - %ptrA) / %elementSize
%m[i] = (icmp ult i, %diff) || (%diff <= 0)

Collaborator:

nit: I find the pseudocode a bit difficult to follow. What about describing this as:

[write-after-read]
The element of the result mask is active when no write-after-read hazard occurs, meaning that:

  • (ptrB - ptrA) <= 0 (guarantees that all lanes are loaded before any stores are committed)
    or
  • (ptrB - ptrA) >= elementSize * lane (guarantees that this lane is loaded before the store to the same address is committed)

[read-after-write]
The element of the result mask is active when no read-after-write hazard occurs, meaning that:

  • abs(ptrB - ptrA) >= elementSize * lane (guarantees that the store of this lane is committed before loading from this address)

Note that the case where (ptrB - ptrA) < 0 does not result in any read-after-write hazards, but may introduce new store-to-load-forwarding stalls where both the store and load partially access the same addresses.

?

For me that would be sufficient and no longer require the description below (lines 23967 to 23981).
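
A small C++ model of the two conditions above, as a cross-check. This is a sketch only: the fixed VF of 4 and the function names are assumptions, and it uses signed byte distances directly rather than the instruction's wider internal arithmetic.

#include <array>
#include <cstdint>
#include <cstdlib>

// Write-after-read: a lane is active if all lanes load before any store
// commits (D <= 0), or this lane's load happens before the store to the
// same address (D >= EltSize * Lane).
std::array<bool, 4> warMask(const char *PtrA, const char *PtrB,
                            int64_t EltSize) {
  int64_t D = PtrB - PtrA;
  std::array<bool, 4> M;
  for (int64_t Lane = 0; Lane < 4; ++Lane)
    M[Lane] = D <= 0 || D >= EltSize * Lane;
  return M;
}

// Read-after-write: a lane is active if the store of this lane is committed
// before the load from the same address (abs(D) >= EltSize * Lane).
std::array<bool, 4> rawMask(const char *PtrA, const char *PtrB,
                            int64_t EltSize) {
  int64_t D = PtrB - PtrA;
  std::array<bool, 4> M;
  for (int64_t Lane = 0; Lane < 4; ++Lane)
    M[Lane] = std::llabs(D) >= EltSize * Lane;
  return M;
}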

auto PtrVT = SourceValue->getValueType(0);

SDValue Diff = DAG.getNode(ISD::SUB, DL, PtrVT, SinkValue, SourceValue);
if (!IsWriteAfterRead)

Collaborator:

nit: change IsWriteAfterRead -> IsReadAfterWrite to avoid having to negate this expression.

DAG.getSetCC(DL, VT, VectorStep, DiffSplat, ISD::CondCode::SETULT);

// Splat the compare result then OR it with the lane mask
auto VTElementTy = VT.getVectorElementType();

Collaborator:

nit:

Suggested change
auto VTElementTy = VT.getVectorElementType();
EVT EltVT = VT.getVectorElementType();

Comment on lines +8240 to +8251
case Intrinsic::experimental_loop_dependence_war_mask:
case Intrinsic::experimental_loop_dependence_raw_mask: {
auto IntrinsicVT = EVT::getEVT(I.getType());
SmallVector<SDValue, 4> Ops;
for (auto &Op : I.operands())
Ops.push_back(getValue(Op));
unsigned ID = Intrinsic == Intrinsic::experimental_loop_dependence_war_mask
? ISD::EXPERIMENTAL_LOOP_DEPENDENCE_WAR_MASK
: ISD::EXPERIMENTAL_LOOP_DEPENDENCE_RAW_MASK;
SDValue Mask = DAG.getNode(ID, sdl, IntrinsicVT, Ops);
setValue(&I, Mask);
}

Collaborator:

nit: The number of operands is known, so there's no need for Ops.

Suggested change
case Intrinsic::experimental_loop_dependence_war_mask:
case Intrinsic::experimental_loop_dependence_raw_mask: {
auto IntrinsicVT = EVT::getEVT(I.getType());
SmallVector<SDValue, 4> Ops;
for (auto &Op : I.operands())
Ops.push_back(getValue(Op));
unsigned ID = Intrinsic == Intrinsic::experimental_loop_dependence_war_mask
? ISD::EXPERIMENTAL_LOOP_DEPENDENCE_WAR_MASK
: ISD::EXPERIMENTAL_LOOP_DEPENDENCE_RAW_MASK;
SDValue Mask = DAG.getNode(ID, sdl, IntrinsicVT, Ops);
setValue(&I, Mask);
}
case Intrinsic::experimental_loop_dependence_raw_mask:
case Intrinsic::experimental_loop_dependence_war_mask: {
unsigned ID = Intrinsic == Intrinsic::experimental_loop_dependence_war_mask
? ISD::EXPERIMENTAL_LOOP_DEPENDENCE_WAR_MASK
: ISD::EXPERIMENTAL_LOOP_DEPENDENCE_RAW_MASK;
setValue(&I, DAG.getNode(ID, sdl, EVT::getEVT(I.getType()),
getValue(I.getArgOperand(0)),
getValue(I.getArgOperand(1)),
getValue(I.getArgOperand(2))));
break;
}

@paulwalker-arm (Collaborator):

FYI: I recall there was a recent discussion where the conclusion was to do away with the experimental naming for new intrinsics because it doesn't really mean anything and just causes more work down the line.
