[LV] Mask off possibly aliasing vector lanes #100579
base: users/SamTebbs33/alias-intrinsic
Conversation
@llvm/pr-subscribers-llvm-ir @llvm/pr-subscribers-llvm-analysis

Author: Sam Tebbs (SamTebbs33)

Changes

When vectorising a loop that uses loads and stores, those pointers with unknown bounds could overlap if their difference is less than the vector factor. For example, if address 20 is being stored to and address 23 is being loaded from, they overlap when the vector factor is 4 or higher. Currently LoopVectorize branches to a scalar loop in these cases with a runtime check. However, if we construct a mask that disables the overlapping (aliasing) lanes, then the vectorised loop can be safely entered, as long as the loads and stores are masked off. This PR modifies the LoopVectorizer and VPlan to create such a mask and always branch to the vector loop. Currently this is only done if we're tail-predicating, but more work will come in the future to do this in other cases as well.

Patch is 103.32 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/100579.diff

11 Files Affected:
diff --git a/clang/test/CodeGen/loop-alias-mask.c b/clang/test/CodeGen/loop-alias-mask.c
new file mode 100644
index 0000000000000..76c3b5deddfa0
--- /dev/null
+++ b/clang/test/CodeGen/loop-alias-mask.c
@@ -0,0 +1,404 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 4
+// RUN: %clang --target=aarch64-linux-gnu -march=armv9+sme2 -emit-llvm -S -g0 -O3 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize %s -o - | FileCheck %s
+#include <stdint.h>
+
+// CHECK-LABEL: define dso_local void @alias_mask_8(
+// CHECK-SAME: ptr noalias nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT: entry:
+// CHECK-NEXT: [[CMP11:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT: br i1 [[CMP11]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK: for.body.preheader:
+// CHECK-NEXT: [[C14:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT: [[B15:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT: [[SUB_DIFF:%.*]] = sub i64 [[B15]], [[C14]]
+// CHECK-NEXT: [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], 0
+// CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 16 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 16 x i1> [[DOTSPLATINSERT]], <vscale x 16 x i1> poison, <vscale x 16 x i32> zeroinitializer
+// CHECK-NEXT: [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[SUB_DIFF]])
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 16 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT: [[TMP0:%.*]] = zext <vscale x 16 x i1> [[ACTIVE_LANE_MASK_ALIAS]] to <vscale x 16 x i8>
+// CHECK-NEXT: [[TMP1:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv16i8(<vscale x 16 x i8> [[TMP0]])
+// CHECK-NEXT: [[TMP2:%.*]] = zext i8 [[TMP1]] to i64
+// CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+// CHECK: vector.body:
+// CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 16 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT: [[TMP3:%.*]] = and <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], [[ACTIVE_LANE_MASK_ALIAS]]
+// CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP4]], i32 1, <vscale x 16 x i1> [[TMP3]], <vscale x 16 x i8> poison), !tbaa [[TBAA6:![0-9]+]]
+// CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT: [[WIDE_MASKED_LOAD16:%.*]] = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP5]], i32 1, <vscale x 16 x i1> [[TMP3]], <vscale x 16 x i8> poison), !tbaa [[TBAA6]]
+// CHECK-NEXT: [[TMP6:%.*]] = add <vscale x 16 x i8> [[WIDE_MASKED_LOAD16]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT: tail call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP6]], ptr [[TMP7]], i32 1, <vscale x 16 x i1> [[TMP3]]), !tbaa [[TBAA6]]
+// CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP2]]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT: [[TMP8:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT: br i1 [[TMP8]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP9:![0-9]+]]
+// CHECK: for.cond.cleanup:
+// CHECK-NEXT: ret void
+//
+void alias_mask_8(uint8_t *restrict a, uint8_t * b, uint8_t * c, int n) {
+ #pragma clang loop vectorize(enable)
+ for (int i = 0; i < n; i++) {
+ c[i] = a[i] + b[i];
+ }
+}
+
+// CHECK-LABEL: define dso_local void @alias_mask_16(
+// CHECK-SAME: ptr noalias nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0]] {
+// CHECK-NEXT: entry:
+// CHECK-NEXT: [[CMP11:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT: br i1 [[CMP11]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK: for.body.preheader:
+// CHECK-NEXT: [[C14:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT: [[B15:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT: [[SUB_DIFF:%.*]] = sub i64 [[B15]], [[C14]]
+// CHECK-NEXT: [[DIFF:%.*]] = sdiv i64 [[SUB_DIFF]], 2
+// CHECK-NEXT: [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], -1
+// CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 8 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 8 x i1> [[DOTSPLATINSERT]], <vscale x 8 x i1> poison, <vscale x 8 x i32> zeroinitializer
+// CHECK-NEXT: [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[DIFF]])
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 8 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT: [[TMP0:%.*]] = zext <vscale x 8 x i1> [[ACTIVE_LANE_MASK_ALIAS]] to <vscale x 8 x i8>
+// CHECK-NEXT: [[TMP1:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv8i8(<vscale x 8 x i8> [[TMP0]])
+// CHECK-NEXT: [[TMP2:%.*]] = zext i8 [[TMP1]] to i64
+// CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+// CHECK: vector.body:
+// CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 8 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT: [[TMP3:%.*]] = and <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], [[ACTIVE_LANE_MASK_ALIAS]]
+// CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i16, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 8 x i16> @llvm.masked.load.nxv8i16.p0(ptr [[TMP4]], i32 2, <vscale x 8 x i1> [[TMP3]], <vscale x 8 x i16> poison), !tbaa [[TBAA13:![0-9]+]]
+// CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i16, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT: [[WIDE_MASKED_LOAD16:%.*]] = tail call <vscale x 8 x i16> @llvm.masked.load.nxv8i16.p0(ptr [[TMP5]], i32 2, <vscale x 8 x i1> [[TMP3]], <vscale x 8 x i16> poison), !tbaa [[TBAA13]]
+// CHECK-NEXT: [[TMP6:%.*]] = add <vscale x 8 x i16> [[WIDE_MASKED_LOAD16]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i16, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT: tail call void @llvm.masked.store.nxv8i16.p0(<vscale x 8 x i16> [[TMP6]], ptr [[TMP7]], i32 2, <vscale x 8 x i1> [[TMP3]]), !tbaa [[TBAA13]]
+// CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP2]]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT: [[TMP8:%.*]] = extractelement <vscale x 8 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT: br i1 [[TMP8]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP15:![0-9]+]]
+// CHECK: for.cond.cleanup:
+// CHECK-NEXT: ret void
+//
+void alias_mask_16(uint16_t *restrict a, uint16_t * b, uint16_t * c, int n) {
+ #pragma clang loop vectorize(enable)
+ for (int i = 0; i < n; i++) {
+ c[i] = a[i] + b[i];
+ }
+}
+
+// CHECK-LABEL: define dso_local void @alias_mask_32(
+// CHECK-SAME: ptr noalias nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0]] {
+// CHECK-NEXT: entry:
+// CHECK-NEXT: [[CMP9:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT: br i1 [[CMP9]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK: for.body.preheader:
+// CHECK-NEXT: [[C12:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT: [[B13:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT: [[SUB_DIFF:%.*]] = sub i64 [[B13]], [[C12]]
+// CHECK-NEXT: [[DIFF:%.*]] = sdiv i64 [[SUB_DIFF]], 4
+// CHECK-NEXT: [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], -3
+// CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i1> [[DOTSPLATINSERT]], <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer
+// CHECK-NEXT: [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[DIFF]])
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 4 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT: [[TMP0:%.*]] = zext <vscale x 4 x i1> [[ACTIVE_LANE_MASK_ALIAS]] to <vscale x 4 x i8>
+// CHECK-NEXT: [[TMP1:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv4i8(<vscale x 4 x i8> [[TMP0]])
+// CHECK-NEXT: [[TMP2:%.*]] = zext i8 [[TMP1]] to i64
+// CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+// CHECK: vector.body:
+// CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 4 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT: [[TMP3:%.*]] = and <vscale x 4 x i1> [[ACTIVE_LANE_MASK]], [[ACTIVE_LANE_MASK_ALIAS]]
+// CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr [[TMP4]], i32 4, <vscale x 4 x i1> [[TMP3]], <vscale x 4 x i32> poison), !tbaa [[TBAA16:![0-9]+]]
+// CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT: [[WIDE_MASKED_LOAD14:%.*]] = tail call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr [[TMP5]], i32 4, <vscale x 4 x i1> [[TMP3]], <vscale x 4 x i32> poison), !tbaa [[TBAA16]]
+// CHECK-NEXT: [[TMP6:%.*]] = add <vscale x 4 x i32> [[WIDE_MASKED_LOAD14]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT: tail call void @llvm.masked.store.nxv4i32.p0(<vscale x 4 x i32> [[TMP6]], ptr [[TMP7]], i32 4, <vscale x 4 x i1> [[TMP3]]), !tbaa [[TBAA16]]
+// CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP2]]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT: [[TMP8:%.*]] = extractelement <vscale x 4 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT: br i1 [[TMP8]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP18:![0-9]+]]
+// CHECK: for.cond.cleanup:
+// CHECK-NEXT: ret void
+//
+void alias_mask_32(uint32_t *restrict a, uint32_t * b, uint32_t * c, int n) {
+ #pragma clang loop vectorize(enable)
+ for (int i = 0; i < n; i++) {
+ c[i] = a[i] + b[i];
+ }
+}
+
+// CHECK-LABEL: define dso_local void @alias_mask_64(
+// CHECK-SAME: ptr noalias nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0]] {
+// CHECK-NEXT: entry:
+// CHECK-NEXT: [[CMP9:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT: br i1 [[CMP9]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK: for.body.preheader:
+// CHECK-NEXT: [[C12:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT: [[B13:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT: [[SUB_DIFF:%.*]] = sub i64 [[B13]], [[C12]]
+// CHECK-NEXT: [[DIFF:%.*]] = sdiv i64 [[SUB_DIFF]], 8
+// CHECK-NEXT: [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], -7
+// CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 2 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 2 x i1> [[DOTSPLATINSERT]], <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer
+// CHECK-NEXT: [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 0, i64 [[DIFF]])
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 2 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT: [[TMP0:%.*]] = zext <vscale x 2 x i1> [[ACTIVE_LANE_MASK_ALIAS]] to <vscale x 2 x i8>
+// CHECK-NEXT: [[TMP1:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv2i8(<vscale x 2 x i8> [[TMP0]])
+// CHECK-NEXT: [[TMP2:%.*]] = zext i8 [[TMP1]] to i64
+// CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+// CHECK: vector.body:
+// CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT: [[TMP3:%.*]] = and <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], [[ACTIVE_LANE_MASK_ALIAS]]
+// CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP4]], i32 8, <vscale x 2 x i1> [[TMP3]], <vscale x 2 x i64> poison), !tbaa [[TBAA19:![0-9]+]]
+// CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT: [[WIDE_MASKED_LOAD14:%.*]] = tail call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0(ptr [[TMP5]], i32 8, <vscale x 2 x i1> [[TMP3]], <vscale x 2 x i64> poison), !tbaa [[TBAA19]]
+// CHECK-NEXT: [[TMP6:%.*]] = add <vscale x 2 x i64> [[WIDE_MASKED_LOAD14]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i64, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT: tail call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP6]], ptr [[TMP7]], i32 8, <vscale x 2 x i1> [[TMP3]]), !tbaa [[TBAA19]]
+// CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP2]]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT: [[TMP8:%.*]] = extractelement <vscale x 2 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT: br i1 [[TMP8]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP21:![0-9]+]]
+// CHECK: for.cond.cleanup:
+// CHECK-NEXT: ret void
+//
+void alias_mask_64(uint64_t *restrict a, uint64_t * b, uint64_t * c, int n) {
+ #pragma clang loop vectorize(enable)
+ for (int i = 0; i < n; i++) {
+ c[i] = a[i] + b[i];
+ }
+}
+
+// CHECK-LABEL: define dso_local void @alias_mask_multiple_8(
+// CHECK-SAME: ptr nocapture noundef readonly [[A:%.*]], ptr nocapture noundef readonly [[B:%.*]], ptr nocapture noundef writeonly [[C:%.*]], i32 noundef [[N:%.*]]) local_unnamed_addr #[[ATTR0]] {
+// CHECK-NEXT: entry:
+// CHECK-NEXT: [[CMP11:%.*]] = icmp sgt i32 [[N]], 0
+// CHECK-NEXT: br i1 [[CMP11]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
+// CHECK: for.body.preheader:
+// CHECK-NEXT: [[C14:%.*]] = ptrtoint ptr [[C]] to i64
+// CHECK-NEXT: [[A15:%.*]] = ptrtoint ptr [[A]] to i64
+// CHECK-NEXT: [[B16:%.*]] = ptrtoint ptr [[B]] to i64
+// CHECK-NEXT: [[WIDE_TRIP_COUNT:%.*]] = zext nneg i32 [[N]] to i64
+// CHECK-NEXT: [[SUB_DIFF:%.*]] = sub i64 [[A15]], [[C14]]
+// CHECK-NEXT: [[NEG_COMPARE:%.*]] = icmp slt i64 [[SUB_DIFF]], 0
+// CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 16 x i1> poison, i1 [[NEG_COMPARE]], i64 0
+// CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 16 x i1> [[DOTSPLATINSERT]], <vscale x 16 x i1> poison, <vscale x 16 x i32> zeroinitializer
+// CHECK-NEXT: [[PTR_DIFF_LANE_MASK:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[SUB_DIFF]])
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_ALIAS:%.*]] = or <vscale x 16 x i1> [[PTR_DIFF_LANE_MASK]], [[DOTSPLAT]]
+// CHECK-NEXT: [[SUB_DIFF18:%.*]] = sub i64 [[B16]], [[C14]]
+// CHECK-NEXT: [[NEG_COMPARE20:%.*]] = icmp slt i64 [[SUB_DIFF18]], 0
+// CHECK-NEXT: [[DOTSPLATINSERT21:%.*]] = insertelement <vscale x 16 x i1> poison, i1 [[NEG_COMPARE20]], i64 0
+// CHECK-NEXT: [[DOTSPLAT22:%.*]] = shufflevector <vscale x 16 x i1> [[DOTSPLATINSERT21]], <vscale x 16 x i1> poison, <vscale x 16 x i32> zeroinitializer
+// CHECK-NEXT: [[PTR_DIFF_LANE_MASK23:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[SUB_DIFF18]])
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_ALIAS24:%.*]] = or <vscale x 16 x i1> [[PTR_DIFF_LANE_MASK23]], [[DOTSPLAT22]]
+// CHECK-NEXT: [[TMP0:%.*]] = and <vscale x 16 x i1> [[ACTIVE_LANE_MASK_ALIAS]], [[ACTIVE_LANE_MASK_ALIAS24]]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT: [[TMP1:%.*]] = zext <vscale x 16 x i1> [[TMP0]] to <vscale x 16 x i8>
+// CHECK-NEXT: [[TMP2:%.*]] = tail call i8 @llvm.vector.reduce.add.nxv16i8(<vscale x 16 x i8> [[TMP1]])
+// CHECK-NEXT: [[TMP3:%.*]] = zext i8 [[TMP2]] to i64
+// CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+// CHECK: vector.body:
+// CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 16 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], [[FOR_BODY_PREHEADER]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], [[VECTOR_BODY]] ]
+// CHECK-NEXT: [[TMP4:%.*]] = and <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], [[TMP0]]
+// CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, ptr [[A]], i64 [[INDEX]]
+// CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP5]], i32 1, <vscale x 16 x i1> [[TMP4]], <vscale x 16 x i8> poison), !tbaa [[TBAA6]]
+// CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i8, ptr [[B]], i64 [[INDEX]]
+// CHECK-NEXT: [[WIDE_MASKED_LOAD25:%.*]] = tail call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr [[TMP6]], i32 1, <vscale x 16 x i1> [[TMP4]], <vscale x 16 x i8> poison), !tbaa [[TBAA6]]
+// CHECK-NEXT: [[TMP7:%.*]] = add <vscale x 16 x i8> [[WIDE_MASKED_LOAD25]], [[WIDE_MASKED_LOAD]]
+// CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[C]], i64 [[INDEX]]
+// CHECK-NEXT: tail call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> [[TMP7]], ptr [[TMP8]], i32 1, <vscale x 16 x i1> [[TMP4]]), !tbaa [[TBAA6]]
+// CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP3]]
+// CHECK-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 [[INDEX_NEXT]], i64 [[WIDE_TRIP_COUNT]])
+// CHECK-NEXT: [[TMP9:%.*]] = extractelement <vscale x 16 x i1> [[ACTIVE_LANE_MASK_NEXT]], i64 0
+// CHECK-NEXT: br i1 [[TMP9]], label [[VECTOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP22:![0-9]+]]
+// CHECK: for.cond.cleanup:
+// CHECK-NEXT: ret void
+//
+void alias_mask_multiple_8(ui...
[truncated]
llvm#100579 emits IR that creates a mask disabling lanes that could alias within a loop iteration, based on a pair of pointers. This PR lowers that IR to a WHILEWR instruction for AArch64.
clang/test/CodeGen/loop-alias-mask.c
@@ -0,0 +1,404 @@
// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 4
We generally don't add source level tests for llvm optimizations; aiui clang's CodeGen test directory is mainly to test lowering the AST to LLVM IR, not end-to-end.
We might want a different way of testing the feature from source that we could add to a buildbot.
That's true, I've removed this file and have started work on adding a test to the llvm-test-suite to replace it. A statistic variable will help make sure the alias masking is happening.
@@ -448,6 +448,20 @@ struct PointerDiffInfo {
NeedsFreeze(NeedsFreeze) {}
};

/// A pair of pointers that could overlap across a loop iteration.
If the analysis has been removed from LAA, then I suggest the struct can also be moved elsewhere. Maybe VPlan.h ?
That sounds good to me. I originally put it here so it was next to the related PointerDiffInfo
#100579 emits IR that creates a mask disabling lanes that could alias within a loop iteration, based on a pair of pointers. This PR lowers that IR to the WHILEWR instruction for AArch64.
#include "llvm/IR/DiagnosticInfo.h" | ||
#include "llvm/Transforms/Utils/ScalarEvolutionExpander.h" |
Unrelated changes?
Yep this is left over from previous changes, thanks. Removed now.
@@ -8520,7 +8561,29 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
// When not folding the tail, we know that the induction increment will not
// overflow.
bool HasNUW = Style == TailFoldingStyle::None;
addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(), HasNUW, DL);

VPValue *AliasMask = nullptr;
Does this need to be done at initial construction? IIUC this could be framed as an optimization of the original VPlan, done as a separate VPlan-to-VPlan transform?
At the moment this requires threading a bunch of information to a lot of places; it would be good to limit it to a single dedicated transform.
Currently the mask is only generated when tail predication is enabled, in which case it perhaps could be done with a transform since all the vector loads and stores will be masked anyway. However the goal is to generate the mask even when tail predication is disabled, to capture more cases, and so the transform would have to handle swapping the non-masked loads and stores to masked variants, which I don't think is best done with a transform since the original VPlan does a good job of that anyway.
I'd also argue that it's a good idea to get the masking as we want it to be from the beginning rather than setting up an active lane mask and building on top of it later on.
If possible to do so in a separate transform, that would be preferable IMO, as now the general VPlan construction has a bunch of added complexity for something that is quite specific.
A similar approach was taken for adding various options to control tail-folding (addActiveLaneMask, tryAddExplicitVectorLength). This also has the advantage of keeping the code responsible for generating the new construct in a single place, rather than spreading it across multiple places?
Having it as a separate transform also allows generating VPlans with different strategies (e.g. with and without this new mask) and picking the best option based on cost.
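To make the suggestion above concrete, a dedicated entry point could sit alongside the existing tail-folding hooks. The sketch below is a hypothetical declaration only: the name addAliasLaneMask and its parameters are illustrative assumptions, not code from this PR.

```cpp
// Hypothetical sketch: the function name and signature are assumptions for
// illustration. In-tree this would presumably be a static member of
// VPlanTransforms, next to addActiveLaneMask/tryAddExplicitVectorLength.
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Analysis/LoopAccessAnalysis.h" // PointerDiffInfo

namespace llvm {
class VPlan;

/// Build an alias lane mask for each pointer pair in \p DiffChecks and AND it
/// into the masks of the plan's memory recipes, keeping the logic in a single
/// dedicated VPlan-to-VPlan transform rather than threading it through
/// initial VPlan construction.
void addAliasLaneMask(VPlan &Plan, ArrayRef<PointerDiffInfo> DiffChecks);

} // namespace llvm
```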
GeneratedRTChecks Checks(*PSE.getSE(), DT, LI, TTI, F->getDataLayout(),
AddBranchWeights);

// VPlan needs the aliasing pointers as Values and not SCEVs, so expand them
If expanded SCEVs are needed in VPlan, they should be expanded during VPlan execution using VPExpandSCEVRecipe if possible.
Oh perfect, I didn't know that existed, thanks.
Done.
EltVT->getArrayNumElements();
else
ElementSize =
GEP->getSourceElementType()->getScalarSizeInBits() / 8;
Not sure if using the GEP source element type here is correct; should this use the accessed type from the load/store, which could be different?
I've managed to figure it out now and am using the AccessSize provided by PointerDiffInfo. Thanks for pointing out the flakiness here!
@@ -9,11 +9,11 @@ define void @same_step_and_size(ptr %a, ptr %b, i64 %n) {
; CHECK-NEXT: [[A2:%.*]] = ptrtoint ptr [[A:%.*]] to i64
; CHECK-NEXT: [[B1:%.*]] = ptrtoint ptr [[B:%.*]] to i64
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N:%.*]], 4
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label %scalar.ph, label %vector.memcheck
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_MEMCHECK:%.*]]
Unrelated changes?
Indeed, the update script did this. I've removed it now.
%cmp11 = icmp sgt i32 %n, 0
br i1 %cmp11, label %for.body.preheader, label %for.cond.cleanup

for.body.preheader:
I might be missing something but the input is already vectorized?
You're right, thanks for pointing that out. I've improved the test so that it shows the loop vectoriser's work.
✅ With the latest revision this PR passed the C/C++ code formatter.
Ping.
}
}
}
assert(ElementSize > 0 && "Couldn't get element size from pointer");
This assert fires when compiling MicroBenchmarks/ImageProcessing/Dilate/CMakeFiles/Dilate.dir/dilateKernel.c from the test suite, so that needs fixing at least.
But I think we shouldn't be trying to figure out the element size at recipe execution time; it should be determined when trying to build the recipe and the value stored as part of the class. If we can't find the size then we stop trying to build the recipe and fall back to normal RT checks. Does that sound sensible to you?
Thanks Graham, I've had a deeper look and PointerDiffInfo provides an AccessSize which corresponds to what we want, so I'm using that now.
Force-pushed 7a87bf4 to 96b8b65.
I've moved the read-after-write lowering to #114028
I've now fixed the comparison that's emitted since it should have been comparing to 0. See the Arm WHILERW instruction.
I've rebased this on top of my PR that adds an intrinsic since that's less fragile to match in the backend. So this should now be ready to have a look at. |
/// RTChecks is a list of pointer pairs that should be checked for aliasing,
/// setting HasAliasMask to true in the case that an alias mask is generated
Outdated comment? Is this DiffChecks now?
Done, thanks.
A bunch of little comments (mostly just nitpicks from a pass over the PR) 🙂
/// Branch to scalar loop if checks fails at runtime.
ScalarFallback,
/// Form a mask based on elements which won't be a WAR or RAW hazard
ultra nit: One of these comments ends with a full-stop and the other does not.
Suggested change:
/// Branch to scalar loop if checks fails at runtime.
ScalarFallback,
/// Form a mask based on elements which won't be a WAR or RAW hazard.
Done.
// Count the number of bits set in each lane and reduce the result to a scalar
case VPInstruction::PopCount: {
Value *Op = State.get(getOperand(0));
auto *VT = Op->getType();
nit: Spell out type if it's not present on the RHS.
Suggested change:
Type *VT = Op->getType();
Done.
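As context for the PopCount discussion above, the recipe corresponds to the zext-then-vector.reduce.add pattern visible in the generated IR: the number of set lanes in the alias mask becomes the induction step. Below is a minimal plain-C++ sketch of that computation; the VF value and mask contents are made up purely for illustration.

```cpp
// Plain C++ sketch (not LLVM code) of what the PopCount pattern computes:
// zero-extend the i1 alias mask, add-reduce it, and use the result as the
// amount by which the induction variable advances per vector iteration.
#include <array>
#include <cstdint>
#include <iostream>
#include <numeric>

int main() {
  constexpr unsigned VF = 8;                                  // illustrative VF
  std::array<uint8_t, VF> MaskAsInt{1, 1, 1, 0, 0, 0, 0, 0};  // zext of the i1 mask
  // vector.reduce.add over the zero-extended mask == popcount of the mask.
  uint64_t Step =
      std::accumulate(MaskAsInt.begin(), MaskAsInt.end(), uint64_t{0});
  std::cout << "induction step = " << Step << "\n";  // 3 safe lanes -> step 3
  return 0;
}
```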
Value *SinkValue = State.get(getSinkValue(), true);
Value *SourceValue = State.get(getSourceValue(), true);

auto *Type = SinkValue->getType();
nit:
Suggested change:
Type *PtrType = SinkValue->getType();
Not needed thanks to rebase.
VPSlotTracker &SlotTracker) const {
O << Indent << "EMIT ";
getVPSingleValue()->printAsOperand(O, SlotTracker);
O << " = alias lane mask ";
nit: These seem more commonly printed in all caps with hyphens.
O << " = alias lane mask "; | |
O << " = ALIAS-LANE-MASK "; |
Done.
@@ -952,7 +952,6 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,

IRBuilder<> Builder(State.CFG.PrevBB->getTerminator());
// FIXME: Model VF * UF computation completely in VPlan.
assert(VFxUF.getNumUsers() && "VFxUF expected to always have users");
How does removing this assert relate to these changes?
@@ -1416,14 +1466,14 @@ void VPlanTransforms::addActiveLaneMask(
auto *FoundWidenCanonicalIVUser =
find_if(Plan.getCanonicalIV()->users(),
[](VPUser *U) { return isa<VPWidenCanonicalIVRecipe>(U); });
assert(FoundWidenCanonicalIVUser &&
assert(FoundWidenCanonicalIVUser && *FoundWidenCanonicalIVUser &&
This looks a little odd. Doesn't find_if return an iterator?
Suggested change:
auto IVUsers = Plan.getCanonicalIV()->users();
/// ...
assert(FoundWidenCanonicalIVUser != IVUsers.end() && "Must have widened canonical IV when tail folding!");
Done, thanks.
/// RTChecks refers to the pointer pairs that need aliasing elements to be
/// masked off each loop iteration.
No docs for PSE?
Added, let me know if anything about it should change.
/// Print the recipe.
void print(raw_ostream &O, const Twine &Indent,
Would it be possible to add a test that prints the VPlan recipe with the alias mask?
Done.
LaneMask = Builder.createNaryOp(Instruction::BinaryOps::And,
{LaneMaskPhi, AliasMask}, DL);
Do we know this AND won't be all-false?
We don't, and there's actually a case in the test suite that hangs because the mask is all-false. I'll start looking into a solution for that.
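A small plain-C++ illustration of the problem being described, assuming (as in the generated IR above) that the per-iteration step is the popcount of the combined mask: with an all-false mask the step is zero, so the index never advances and the loop never exits.

```cpp
// Illustration only: a zero step (popcount of an all-false alias mask) means
// the induction variable never advances, so the latch condition never flips.
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t TripCount = 16;
  const uint64_t Step = 0;  // popcount of an all-false alias mask
  uint64_t Index = 0, Guard = 0;

  while (Index < TripCount) {
    Index += Step;        // index.next = index + popcount(mask) == index
    if (++Guard > 100) {  // guard added only so this demo terminates
      std::cout << "with a zero step the loop would never terminate\n";
      break;
    }
  }
  return 0;
}
```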
// is X and B is X + 2 with VF being 4, only the final two elements of the
// loaded vector can be stored since they don't overlap with the stored
// vector. %b.vec = load %b ; = [s, t, u, v]
// [...]
// store %a, %b.vec ; only u and v can be stored as their addresses don't
// overlap with %a + (VF - 1)
This is specifically RAW? Of something like:
store A[x]
load A[x + 2]
Perhaps I'm muddled on what "final two elements" means, but isn't it the first two elements of the store that are valid (so they won't overwrite the elements for the load)?
Yes you're right, this should say that the first two are valid. Thanks for spotting that. I've re-worded the comment to make it more clear.
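A short plain-C++ sketch of the corrected example from this thread: with a pointer difference of 2 elements and a VF of 4, the first two lanes are hazard-free and the rest would be masked off. The per-lane rule used here (lane < diff) only illustrates this specific case; the sign convention of the real mask depends on whether the hazard is WAR or RAW.

```cpp
// Sketch of the example discussed above (diff of 2 elements, VF = 4): lanes 0
// and 1 are safe, lanes 2 and 3 are masked off. Illustrative only; the actual
// mask construction depends on the hazard direction.
#include <cstdint>
#include <iostream>

int main() {
  constexpr unsigned VF = 4;
  const int64_t DiffInElements = 2;  // e.g. A at X and B at X + 2
  for (unsigned Lane = 0; Lane < VF; ++Lane) {
    bool Safe = static_cast<int64_t>(Lane) < DiffInElements;
    std::cout << "lane " << Lane << (Safe ? ": safe\n" : ": masked off\n");
  }
  return 0;  // prints lanes 0 and 1 as safe, lanes 2 and 3 as masked off
}
```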
It can be unsafe to load a vector from an address and write a vector to an address if those two addresses have overlapping lanes within a vectorised loop iteration. This PR adds an intrinsic designed to create a mask with lanes disabled if they overlap between the two pointer arguments, so that only safe lanes are loaded, operated on and stored. Along with the two pointer parameters, the intrinsic also takes an immediate that represents the size in bytes of the vector element types, as well as an immediate i1 that is true if there is a write-after-read hazard or false if there is a read-after-write hazard. This will be used by llvm#100579 and replaces the existing lowering for whilewr since that isn't needed now that we have the intrinsic.
When vectorising a loop that uses loads and stores, those pointers could overlap if their difference is less than the vector factor. For example, if address 20 is being stored to and address 23 is being loaded from, they overlap when the vector factor is 4 or higher. Currently LoopVectorize branches to a scalar loop in these cases with a runtime check. However, if we construct a mask that disables the overlapping (aliasing) lanes then the vectorised loop can be safely entered, as long as the loads and stores are masked off.
Force-pushed 4c4ab8a to 8822b8f.
When vectorising a loop that uses loads and stores, those pointers with unknown bounds could overlap if their difference is less than the vector factor. For example, if address 20 is being stored to and address 23 is being loaded from, they overlap when the vector factor is 4 or higher. Currently LoopVectorize branches to a scalar loop in these cases with a runtime check. However if we construct a mask that disables the overlapping (aliasing) lanes then the vectorised loop can be safely entered, as long as the loads and stores are masked off.
This PR modifies the LoopVectorizer and VPlan to create such a mask and always branch to the vector loop. Currently this is only done if we're tail-predicating, but more work will come in the future to do this in other cases as well.
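A standalone plain-C++ check of the example in the description, assuming byte-sized elements: the VF-wide store window starting at address 20 and the VF-wide load window starting at address 23 first overlap once VF reaches 4.

```cpp
// Overlap check for the description's example (byte-sized elements assumed):
// half-open windows [addr, addr + VF) are touched by one vector iteration.
#include <algorithm>
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t StoreAddr = 20, LoadAddr = 23;
  for (uint64_t VF : {2, 3, 4, 8}) {
    bool Overlap = std::max(StoreAddr, LoadAddr) <
                   std::min(StoreAddr + VF, LoadAddr + VF);
    std::cout << "VF=" << VF << (Overlap ? ": overlap\n" : ": no overlap\n");
  }
  return 0;  // VF=2,3: no overlap; VF=4,8: overlap, matching "4 or higher"
}
```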