[InstCombine] Canonicalize xor with disjoint ops to or disjoint #133139


Closed (wanted to merge 1 commit)

Conversation

jrbyrnes (Contributor)

`or disjoint` works better with other optimizations (e.g. SeparateConstOffsetFromGEP). This feels like the right place for this canonicalization, but perhaps not.

Change-Id: Ib278c8c6a2893b0f9f75a0a6fb0430437b9cff14
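The identity behind this canonicalization is easy to sanity-check outside of LLVM: when two values share no set bits, xor, or, and even add all produce the same result. A minimal illustrative Python sketch (not part of the patch):

```python
def check_disjoint_identity(a: int, b: int) -> bool:
    # If a & b == 0 (the operands are "disjoint"), then xor, or, and add
    # agree, so `xor` can be rewritten as `or disjoint` without changing
    # the result.
    assert (a & b) == 0, "operands must share no set bits"
    return (a ^ b) == (a | b) == (a + b)

# Mirror of the first test case in the diff: a value known to be in
# [0, 2048) xor'ed with 4096 -- the constant's bit lies above the range.
assert all(check_disjoint_identity(v, 4096) for v in range(2048))
```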
@llvmbot (Member)

llvmbot commented Mar 26, 2025

@llvm/pr-subscribers-llvm-transforms

Author: Jeffrey Byrnes (jrbyrnes)

Changes

`or disjoint` works better with other optimizations (e.g. SeparateConstOffsetFromGEP). This feels like the right place for this canonicalization, but perhaps not.


Full diff: https://github.com/llvm/llvm-project/pull/133139.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp (+5)
  • (modified) llvm/test/Transforms/InstCombine/xor.ll (+49)
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp b/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
index 6cc241781d112..c8d2e5d960ec8 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
@@ -4993,6 +4993,11 @@ Instruction *InstCombinerImpl::visitXor(BinaryOperator &I) {
   if (Instruction *Abs = canonicalizeAbs(I, Builder))
     return Abs;
 
+  if (KnownBits::haveNoCommonBitsSet(
+          computeKnownBits(I.getOperand(0), /*Depth=*/0, &I),
+          computeKnownBits(I.getOperand(1), /*Depth=*/0, &I)))
+    return BinaryOperator::CreateDisjointOr(I.getOperand(0), I.getOperand(1));
+
   // Otherwise, if all else failed, try to hoist the xor-by-constant:
   //   (X ^ C) ^ Y --> (X ^ Y) ^ C
   // Just like we do in other places, we completely avoid the fold
diff --git a/llvm/test/Transforms/InstCombine/xor.ll b/llvm/test/Transforms/InstCombine/xor.ll
index 3abaf74285cc0..0cfbb2c47e21c 100644
--- a/llvm/test/Transforms/InstCombine/xor.ll
+++ b/llvm/test/Transforms/InstCombine/xor.ll
@@ -1664,3 +1664,52 @@ entry:
   %or = or <2 x i32> %add, %c
   ret <2 x i32> %or
 }
+
+declare i32 @callee()
+
+define i32 @xor_disjoint() {
+; CHECK-LABEL: @xor_disjoint(
+; CHECK-NEXT:    [[CALL1:%.*]] = call i32 @callee(), !range [[RNG0:![0-9]+]]
+; CHECK-NEXT:    [[XOR:%.*]] = or disjoint i32 [[CALL1]], 4096
+; CHECK-NEXT:    ret i32 [[XOR]]
+;
+  %call1 = call i32 @callee(), !range !0
+  %xor = xor i32 %call1, 4096
+  ret i32 %xor
+}
+
+define i32 @xor_disjoint2() {
+; CHECK-LABEL: @xor_disjoint2(
+; CHECK-NEXT:    [[CALL1:%.*]] = call i32 @callee(), !range [[RNG1:![0-9]+]]
+; CHECK-NEXT:    [[XOR:%.*]] = or disjoint i32 [[CALL1]], 512
+; CHECK-NEXT:    ret i32 [[XOR]]
+;
+  %call1 = call i32 @callee(), !range !1
+  %xor = xor i32 %call1, 512
+  ret i32 %xor
+}
+
+define i32 @xor_non_disjoint() {
+; CHECK-LABEL: @xor_non_disjoint(
+; CHECK-NEXT:    [[CALL1:%.*]] = call i32 @callee(), !range [[RNG0]]
+; CHECK-NEXT:    [[XOR:%.*]] = xor i32 [[CALL1]], 1024
+; CHECK-NEXT:    ret i32 [[XOR]]
+;
+  %call1 = call i32 @callee(), !range !0
+  %xor = xor i32 %call1, 1024
+  ret i32 %xor
+}
+
+define i32 @xor_non_disjoint2() {
+; CHECK-LABEL: @xor_non_disjoint2(
+; CHECK-NEXT:    [[CALL1:%.*]] = call i32 @callee(), !range [[RNG1]]
+; CHECK-NEXT:    [[XOR:%.*]] = and i32 [[CALL1]], 511
+; CHECK-NEXT:    ret i32 [[XOR]]
+;
+  %call1 = call i32 @callee(), !range !1
+  %xor = xor i32 %call1, 1024
+  ret i32 %xor
+}
+
+!0 = !{ i32 0, i32 2048 }
+!1 = !{ i32 1024, i32 1536 }
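The `KnownBits::haveNoCommonBitsSet` query used in the patch can be modeled with a toy known-bits pair (a hedged sketch; the real API lives in llvm/Support/KnownBits.h and this mirrors only its disjointness check):

```python
from dataclasses import dataclass

@dataclass
class KnownBits:
    zero: int  # bits proven zero
    one: int   # bits proven one (a bit in neither mask is unknown)

def have_no_common_bits_set(lhs: KnownBits, rhs: KnownBits,
                            width: int = 32) -> bool:
    # Two values are provably disjoint when, at every bit position,
    # at least one side is known zero.
    mask = (1 << width) - 1
    return (lhs.zero | rhs.zero) & mask == mask

# !0 = !{ i32 0, i32 2048 } proves bits 11 and up are zero;
# the constant 4096 has only bit 12 set, so the xor is disjoint.
call = KnownBits(zero=~0x7FF & 0xFFFFFFFF, one=0)
const = KnownBits(zero=~4096 & 0xFFFFFFFF, one=4096)
assert have_no_common_bits_set(call, const)
```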

@jrbyrnes jrbyrnes requested a review from arsenm March 27, 2025 03:01
%xor = xor i32 %call1, 1024
ret i32 %xor
}

Contributor
Test a vector version. Also, this could use an alive2 proof link.

@@ -4993,6 +4993,11 @@ Instruction *InstCombinerImpl::visitXor(BinaryOperator &I) {
if (Instruction *Abs = canonicalizeAbs(I, Builder))
return Abs;

if (KnownBits::haveNoCommonBitsSet(
computeKnownBits(I.getOperand(0), /*Depth=*/0, &I),
computeKnownBits(I.getOperand(1), /*Depth=*/0, &I)))
Contributor
Can try to short-circuit the first operand query if the second is unknown.

@jrbyrnes (Contributor, Author)

jrbyrnes commented Mar 27, 2025

> alive2 proof link

xor_disjoint
https://alive2.llvm.org/ce/z/fQm9-T

xor_disjoint2
https://alive2.llvm.org/ce/z/AMPAib

xor_non_disjoint
https://alive2.llvm.org/ce/z/Vdw5W9

xor_non_disjoint2
https://alive2.llvm.org/ce/z/R-_RWG

@jrbyrnes (Contributor, Author)

Going to have to rethink this -- it's not that we're missing this canonicalization; rather, my test case does not canonicalize because of the recursion limit.

@arsenm (Contributor)

arsenm commented Mar 28, 2025

Disjoint is set on or in InstCombineSimplifyDemanded. Move this there with the opcode switch?

@jrbyrnes (Contributor, Author)

jrbyrnes commented Mar 28, 2025

> Disjoint is set on or in InstCombineSimplifyDemanded. Move this there with the opcode switch?

Yeah, that handling actually already covers the easy xor -> or disjoint cases:

https://godbolt.org/z/5b95qvhqo

So this PR, as it is, is redundant.

However, we also care about the non-easy cases, which aren't covered due to the recursion limit:

https://godbolt.org/z/Edr4brT1j

`%17 = xor i32 %16, 4096` should be canonicalized to `or disjoint`.

My case was accidentally covered since I did computeKnownBits on the operands instead of the instruction. I need to investigate how we can canonicalize these complex cases, and will update the PR with that implementation.
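The failure mode can be illustrated with a toy depth-limited bit analysis (illustrative Python; MAX_DEPTH is a stand-in for InstCombine's MaxAnalysisRecursionDepth, and the expression encoding is made up for this sketch):

```python
MAX_DEPTH = 6  # stand-in for LLVM's analysis recursion limit

def known_zero_bits(expr, depth=0):
    """Toy known-bits query: mask of bits proven zero, bailing past the limit."""
    if depth > MAX_DEPTH:
        return 0  # analysis gives up: nothing is known
    kind = expr[0]
    if kind == 'range':
        # Leaf known to be < expr[1] (a power of two), e.g. from !range metadata.
        return ~(expr[1] - 1) & 0xFFFFFFFF
    if kind == 'wrap':
        # An op whose known bits come only from recursing into its operand.
        return known_zero_bits(expr[1], depth + 1)
    return 0  # opaque value

leaf = ('range', 2048)               # value < 2048: bits 11 and up are zero
assert known_zero_bits(leaf) & 4096  # so the constant's bit 12 is provably free

# Bury the same fact under more layers than the limit allows and it is lost,
# so a later `xor ..., 4096` can no longer be proven disjoint.
deep = leaf
for _ in range(MAX_DEPTH + 2):
    deep = ('wrap', deep)
assert known_zero_bits(deep) == 0
```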

@jrbyrnes (Contributor, Author)

jrbyrnes commented Apr 1, 2025

@nikic -- do you have any insight on how to resolve issues of this type?

We are missing the xor -> or disjoint canonicalization because we hit the recursion depth limit in SimplifyDemandedBits; the result of this is bad addressing schemes.

@arsenm (Contributor)

arsenm commented Apr 2, 2025

What was the original sequence that exceeds the limit?

@jrbyrnes (Contributor, Author)

jrbyrnes commented Apr 2, 2025

> What was the original sequence that exceeds the limit?

See below for a representation of the original sequence:

https://godbolt.org/z/9rE9dneT9

There is an opportunity for another instcombine in this sequence -- we can transform

%i48 = and i32 %i28, 1
%i.not2 = icmp eq i32 %i48, 0
%i144 = select i1 %i.not2, i32 0, i32 72

into

%i48 = and i32 %i28, 1
%i144 = mul i32 %i48, 72

There are a couple of similar patterns combined together with an or disjoint -- these can basically be transformed into a single and->mul. There are a couple of other, more complex opportunities, but I haven't been able to find a way to reduce the depth enough to transform the final xor -> or disjoint and extract the constant. Nor am I sure that, if I were to find the right set of combines for this specific case, the same set would reduce the depth enough in similar cases.
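The single-bit select-to-mul rewrite sketched above can be checked exhaustively over the low bits (a plain Python re-statement of the IR for verification, not a claim about what InstCombine emits):

```python
def before(i28: int) -> int:
    # %i48 = and i32 %i28, 1 ; %i.not2 = icmp eq i32 %i48, 0
    # %i144 = select i1 %i.not2, i32 0, i32 72
    return 0 if (i28 & 1) == 0 else 72

def after(i28: int) -> int:
    # %i48 = and i32 %i28, 1 ; %i144 = mul i32 %i48, 72
    return (i28 & 1) * 72

# Only the low bit matters, so a 16-bit sweep is an exhaustive check.
assert all(before(x) == after(x) for x in range(1 << 16))
```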

I'm leaning towards providing an API to computeKnownBits to override the depth limit, for cases where getting it right is worth a bit more compile time (e.g. SeparateConstOffsetFromGEP).

@dtcxzyw (Member)

dtcxzyw commented Apr 3, 2025

> > What was the original sequence that exceeds the limit?
>
> See below for a representation of the original sequence:
>
> https://godbolt.org/z/9rE9dneT9
>
> There is an opportunity for another instcombine in this sequence -- we can transform
>
> %i48 = and i32 %i28, 1
> %i.not2 = icmp eq i32 %i48, 0
> %i144 = select i1 %i.not2, i32 0, i32 72
>
> into
>
> %i48 = and i32 %i28, 1
> %i144 = mul i32 %i48, 72

I don't think this transformation is profitable. Generally, cmov is cheaper than mul. InstCombiner performs similar transformations in foldSelectICmpAndBinOp for add/xor/shl/lshr/ashr/..., but not for mul.

@arsenm (Contributor)

arsenm commented Apr 3, 2025

> I don't think this transformation is profitable. Generally cmov is cheaper than mul. InstCombiner performs similar transformations in foldSelectICmpAndBinOp for add/xor/shl/lshr/ashr/..., but not for mul.

It's fewer IR instructions, so it's better canonical form. Undoing it for profitability is a backend issue. Though I guess mul is harder to analyze.

But you don't need mul, you just need 0-or-nonzero. Can you use `shl 72, %i48` equivalently?

@dtcxzyw (Member)

dtcxzyw commented Apr 3, 2025

> It's fewer IR instructions, so it's better canonical form.

I think the canonical form should be:

%i.not2.not = trunc i32 %i28 to i1
%i144 = select i1 %i.not2.not, i32 72, i32 0

cc @andjo403
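The proposed trunc-to-i1 canonical form can be checked against the original icmp/select pattern the same way (an illustrative Python re-statement of both IR snippets):

```python
def original(i28: int) -> int:
    # %i48 = and i32 %i28, 1 ; %i.not2 = icmp eq i32 %i48, 0
    # %i144 = select i1 %i.not2, i32 0, i32 72
    return 0 if (i28 & 1) == 0 else 72

def canonical(i28: int) -> int:
    # %i.not2.not = trunc i32 %i28 to i1  (trunc keeps only the low bit)
    # %i144 = select i1 %i.not2.not, i32 72, i32 0
    return 72 if (i28 & 1) else 0

assert all(original(x) == canonical(x) for x in range(256))
```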

@jrbyrnes (Contributor, Author)

jrbyrnes commented Apr 3, 2025

For my purposes, what is most beneficial is:

%i48 = and i32 %i28, 1
%i.not2 = icmp eq i32 %i48, 0
%i49 = and i32 %i28, 2
%i50 = icmp eq i32 %i49, 0
%i144 = select i1 %i.not2, i32 0, i32 72
%i145 = select i1 %i50, i32 0, i32 144
%i146 = or disjoint i32 %i144, %i145

->

%temp = and i32 %i28, 3
%i146 = mul i32 %temp, 72

Which I imagine produces better code even for hardware that prefers icmp->select over mul.
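The arithmetic behind this two-select fusion checks out: 72*b0 + 144*b1 = 72*(b0 + 2*b1) = 72*(x & 3). A quick exhaustive Python check of both forms (illustrative, mirroring the IR above):

```python
def selects(i28: int) -> int:
    # The two icmp/select chains combined by the `or disjoint`.
    i144 = 0 if (i28 & 1) == 0 else 72
    i145 = 0 if (i28 & 2) == 0 else 144
    assert (i144 & i145) == 0  # the `or` really is disjoint (72 & 144 == 0)
    return i144 | i145

def fused(i28: int) -> int:
    # Single and->mul replacement.
    return (i28 & 3) * 72

# Only the low two bits matter, so a 16-bit sweep is exhaustive.
assert all(selects(x) == fused(x) for x in range(1 << 16))
```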

@dtcxzyw (Member)

dtcxzyw commented Apr 9, 2025

> For my purposes, what is most beneficial is:
>
> %i48 = and i32 %i28, 1
> %i.not2 = icmp eq i32 %i48, 0
> %i49 = and i32 %i28, 2
> %i50 = icmp eq i32 %i49, 0
> %i144 = select i1 %i.not2, i32 0, i32 72
> %i145 = select i1 %i50, i32 0, i32 144
> %i146 = or disjoint i32 %i144, %i145
>
> ->
>
> %temp = and i32 %i28, 3
> %i146 = mul i32 %temp, 72
>
> Which I imagine produces better code even for hardware that prefers icmp->select over mul.

https://alive2.llvm.org/ce/z/1kmtuR

@jrbyrnes (Contributor, Author)

> For my purposes, what is most beneficial is: [...]

Added: #135274

@jrbyrnes (Contributor, Author)

This canonicalization already exists.
