-
Notifications
You must be signed in to change notification settings - Fork 2.7k
[stdlib][docs] Clarify the Cost of Raising #5349
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Given that we've seen a lot of confusion around the cost of raising in Mojo, add clarification to the Mojo manual that `raises` is not the same as C++ exceptions and include some reasoning as to why it's better than Rust's `Result`. Signed-off-by: Owen Hilyard <hilyard.owen@gmail.com>
@arthurevans fyi |
you raise and returning a boolean, meaning that it has similar overhead to errors as | ||
values (C and Rust), but still enables "placement new" and other optimizations that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth adding a short code snippet here on how the compiler would transform a raises
function into this. Visual/code example are often a bit easier to understand than walls of text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now my local Mojo install and compiler explorer are having a fairly strong disagreement about this.
Here's what compiler explorer says Mojo emits, which, while not great (see the Rust version), is a reasonable lowering given the constraints of raises
.
But when I do it locally using this:
from compile import compile_info
from utils import StaticTuple
@export(ABI="C")
fn bounds_check(read l: List[UInt], var index: UInt) raises:
"""
Checks that an index is in-bounds for a list, raising if it is not.
"""
if index >= len(l):
raise "Index too large"
def main():
print(compile_info[bounds_check, emission_kind="asm"]())
and run it I get the following:
bounds_check:
.Lbounds_check$local:
.type .Lbounds_check$local,@function
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
pushq %r15
.cfi_def_cfa_offset 24
pushq %r14
.cfi_def_cfa_offset 32
pushq %r13
.cfi_def_cfa_offset 40
pushq %r12
.cfi_def_cfa_offset 48
pushq %rbx
.cfi_def_cfa_offset 56
subq $24, %rsp
.cfi_def_cfa_offset 80
.cfi_offset %rbx, -56
.cfi_offset %r12, -48
.cfi_offset %r13, -40
.cfi_offset %r14, -32
.cfi_offset %r15, -24
.cfi_offset %rbp, -16
movq 8(%rdi), %r14
movq %r8, %rbx
movq %rdx, %r15
movq %rsi, %r12
cmpq %r14, %rsi
jae .LBB0_2
jmp .LBB0_19
.LBB0_2:
movq %rcx, 16(%rsp)
movq $0, (%rsp)
movq %rsp, %rdi
xorl %esi, %esi
callq KGEN_CompilerRT_GetStackTrace@PLT
testq %rax, %rax
je .LBB0_3
movl $1, %edi
movq %rax, %rsi
movq %rax, %rbp
movq %rbx, 8(%rsp)
callq KGEN_CompilerRT_AlignedAlloc@PLT
movl $8, %edi
movl $16, %esi
movq %rax, %rbx
callq KGEN_CompilerRT_AlignedAlloc@PLT
movq (%rsp), %rdi
movq $1, (%rax)
movq %rax, %r13
movq %rbx, 8(%rax)
cmpq $4, %rbp
jg .LBB0_7
movzbl (%rdi), %eax
movb %al, (%rbx)
movzbl -1(%rdi,%rbp), %eax
movb %al, -1(%rbx,%rbp)
cmpq $3, %rbp
jl .LBB0_17
movzbl 1(%rdi), %eax
movb %al, 1(%rbx)
movzbl -2(%rdi,%rbp), %eax
movb %al, -2(%rbx,%rbp)
jmp .LBB0_17
.LBB0_3:
movl $8, %edi
movl $16, %esi
callq KGEN_CompilerRT_AlignedAlloc@PLT
movq $1, (%rax)
movq %rax, %r13
movq $0, 8(%rax)
jmp .LBB0_18
.LBB0_7:
cmpq $16, %rbp
jg .LBB0_11
cmpq $8, %rbp
jl .LBB0_10
movq (%rdi), %rax
movq %rax, (%rbx)
movq -8(%rdi,%rbp), %rax
movq %rax, -8(%rbx,%rbp)
jmp .LBB0_17
.LBB0_11:
movabsq $9223372036854775776, %rax
andq %rbp, %rax
je .LBB0_14
xorl %ecx, %ecx
.p2align 4
.LBB0_13:
vmovups (%rdi,%rcx), %ymm0
vmovups %ymm0, (%rbx,%rcx)
addq $32, %rcx
cmpq %rax, %rcx
jb .LBB0_13
.LBB0_14:
testb $31, %bpl
je .LBB0_17
subq %rax, %rbp
.p2align 4
.LBB0_16:
movzbl (%rdi,%rax), %ecx
decq %rbp
movb %cl, (%rbx,%rax)
incq %rax
testq %rbp, %rbp
jg .LBB0_16
jmp .LBB0_17
.LBB0_10:
movl (%rdi), %eax
movl %eax, (%rbx)
movl -4(%rdi,%rbp), %eax
movl %eax, -4(%rbx,%rbp)
.LBB0_17:
vzeroupper
callq free@PLT
movq 8(%rsp), %rbx
.LBB0_18:
movq 16(%rsp), %rcx
movl $15, %eax
leaq static_string_d0af4fb5fae76f45(%rip), %rdi
.LBB0_19:
vmovdqa64 .LCPI0_0(%rip), %zmm1
vpbroadcastq %r13, %zmm0
vpbroadcastq %rax, %zmm2
cmpq %r14, %r12
sbbl %edx, %edx
xorl %esi, %esi
cmpq %r14, %r12
kmovd %edx, %k0
cmovaeq %rdi, %rsi
knotw %k0, %k1
movq %rsi, (%r15)
vpsrlvq %zmm1, %zmm0, %zmm0
vpsrlvq %zmm1, %zmm2, %zmm1
vpmovqb %zmm1, %xmm1
vpmovqb %zmm0, %xmm0
vpunpcklqdq %xmm0, %xmm1, %xmm0
vmovdqu8 %xmm0, %xmm0 {%k1} {z}
vmovdqu %xmm0, (%rcx)
setb (%rbx)
addq $24, %rsp
.cfi_def_cfa_offset 56
popq %rbx
.cfi_def_cfa_offset 48
popq %r12
.cfi_def_cfa_offset 40
popq %r13
.cfi_def_cfa_offset 32
popq %r14
.cfi_def_cfa_offset 24
popq %r15
.cfi_def_cfa_offset 16
popq %rbp
.cfi_def_cfa_offset 8
vzeroupper
retq
Is there some kind of work happening around raises
that would cause a massive performance regression like this since the last time compiler explorer updated their mojo nightly? I don't have the ability to look at how raises
is actually lowered so I was planning to work upwards from assembly and/or llvm IR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if having code here is a great idea given this issue. As-is, what I've said in the docs is apparently untrue and raises
is actually quite expensive. Compiler explorer seems to have an out of date version of Mojo where this was cheaper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree about not having code like this here. I don't think it would be helpful in the context of a basic section explaining how to raise and handle errors.
If we do end up including a note like this, it might be more helpful to include pseudocode showing the rough equivalent (in Mojo syntax).
For now, it seems like there's an open question about the performance of raising in Mojo and whether there's been a recent regression, which we should follow up on before updating the docs. @owenhilyard did you open a ticket about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be more helpful to include pseudocode showing the rough equivalent (in Mojo syntax).
Yep, this is what I was referring to. Not actual assembly or llvm codegen. But mojo pseudo code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arthurevans Chris seemed to be well aware of the issue so I had assumed it was already tracked internally. I can open an issue if you aren't aware of one which already exists.
I was planning to have some roughly equivalent Mojo code, but wanted to verify it had a vaguely equivalent cost, which is why I was looking at assembly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some kind of work happening around raises that would cause a massive performance regression like this since the last time compiler explorer updated their mojo nightly?
@owenhilyard do you have a benchmark to share so that I can measure performance regression you see?
Details: The change in asm comes from ~2m old commit 6d87e52 that added ability to get a stack trace of the raised Error
. What I don't understand is why nightly in compiler explorer doesn't show this.
That functionality, however, is disabled by default due to massive perf drop it can cause: KGEN_CompilerRT_GetStackTrace internally checks if MOJO_ENABLE_STACK_TRACE_ON_ERROR
env var is set. If it's not set, nothing is collected and no extra logic later to do memcpy is done.
The benchmark and data I used:
from time import perf_counter_ns
@no_inline
def throw_exception():
raise Error("dummy")
def main():
alias iters = 1000
var total_time: Int = 0
var start_time = perf_counter_ns()
for _ in range(iters):
try:
throw_exception()
except e:
_ = e
pass
total_time += perf_counter_ns() - start_time
print(total_time)
mojo build test.mojo -o test.out
Time reported by hyperfine
Before 6d87e52 | 6d87e52 + no MOJO_ENABLE_STACK_TRACE_ON_ERROR |
6d87e52 + MOJO_ENABLE_STACK_TRACE_ON_ERROR |
---|---|---|
20.6 ms +- 1.0 ms | 20.4 ms + 0.8 ms | 47.02 s +- 0.1 ms |
That is, I don't see perf change before and after 6d87e52 commit with no MOJO_ENABLE_STACK_TRACE_ON_ERROR
env var set. However, if that variable IS set, perf drops by more than 2000x times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regression I was talking about was in the generated code. That being stack trace handling code makes sense, but that's a lot of code to have inlined and gets rid of the "it's syntactic sugar for returning a boolean and the actual return value" argument I was trying to make since it now does quite a bit more. It looks like compiler explorer has since updated and I forgot to permalink, so the outputs are gone, but previously it was a about a dozen instructions. Even in the new generated code, there's still a bunch of vector stuff in what should be a fully scalar happy path (notice the single retq
in the function being at the end with no labels to jump past the vector code. Vector ops on zen 4 aren't as bad as skylake X, but AVX512 still isn't cheap. This means that even assuming the branch predictor works, you still have a fairly substantial body of vector ops which makes using this in kernel code not a great idea since you can't raise for "free" while OoO takes care of other vector ops.
What would probably work as a quick fix is to have a way to disable stack trace gathering statically. Once we have parametric raises, that can be opt-in behavior based on your error type.
Given that |
I'm sorry I am just getting caught up on this now. Indeed, the introduction of stack trace stuff was a major regression, unnecessarily so. This is deeply unfortunately, I'd love help improving this. I filed some notes here: #5423 |
Let's put this PR on hold for a moment. The raising code indeed has some low hanging fruit improvements, we should do those first, and then take another loop at the generated code and decide how to explain the raising in the doc. |
I agree @dgurchenkov. I want it to actually be a zero-overhead solution before we claim it is one. If that means waiting for parametric raises so be it. |
Given that we've seen a lot of confusion around the cost of raising in Mojo, add clarification to the Mojo manual that
raises
is not the same as C++ exceptions and include some reasoning as to why it's better than Rust'sResult
.