Hot-cold splitting for JIT stencils #143158

@Fidget-Spinner

Description


Feature or enhancement

Proposal:

We have a textual assembly parser for the stencils, and it already knows which blocks are hot and which are cold. With that, it is not too hard to teach it to split the blocks into separate hot and cold sections.

Currently, this is the stencil for _BINARY_OP_ADD_INT:

    // _BINARY_OP_ADD_INT_r23.o:      file format elf64-x86-64
    // 
    // Disassembly of section .text:
    // 
    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 55                            pushq   %rbp
    // 1: 48 83 ec 10                   subq    $0x10, %rsp
    // 5: 48 89 74 24 08                movq    %rsi, 0x8(%rsp)
    // a: 48 89 fb                      movq    %rdi, %rbx
    // d: 4c 89 fd                      movq    %r15, %rbp
    // 10: 4c 89 ff                      movq    %r15, %rdi
    // 13: 48 83 e7 fe                   andq    $-0x2, %rdi
    // 17: 48 89 de                      movq    %rbx, %rsi
    // 1a: 48 83 e6 fe                   andq    $-0x2, %rsi
    // 1e: ff 15 00 00 00 00             callq   *(%rip)                 # 0x24 <_JIT_ENTRY+0x24>
    // 0000000000000020:  R_X86_64_GOTPCRELX   _PyCompactLong_Add-0x4
    // 24: 48 83 f8 01                   cmpq    $0x1, %rax
    // 28: 75 15                         jne     0x3f <_JIT_ENTRY+0x3f>
    // 2a: 49 89 ef                      movq    %rbp, %r15
    // 2d: 48 89 df                      movq    %rbx, %rdi
    // 30: 48 8b 74 24 08                movq    0x8(%rsp), %rsi
    // 35: 48 83 c4 10                   addq    $0x10, %rsp
    // 39: 5d                            popq    %rbp
    // 3a: e9 00 00 00 00                jmp     0x3f <_JIT_ENTRY+0x3f>
    // 000000000000003b:  R_X86_64_PLT32       _JIT_JUMP_TARGET-0x4
    // 3f: 49 89 c7                      movq    %rax, %r15
    // 42: 48 89 ef                      movq    %rbp, %rdi
    // 45: 48 89 de                      movq    %rbx, %rsi
    // 48: 48 83 c4 10                   addq    $0x10, %rsp
    // 4c: 5d                            popq    %rbp

With hot-cold splitting, it will be split into:

_BINARY_OP_ADD_INT_r23.HOT:
    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 55                            pushq   %rbp
    // 1: 48 83 ec 10                   subq    $0x10, %rsp
    // 5: 48 89 74 24 08                movq    %rsi, 0x8(%rsp)
    // a: 48 89 fb                      movq    %rdi, %rbx
    // d: 4c 89 fd                      movq    %r15, %rbp
    // 10: 4c 89 ff                      movq    %r15, %rdi
    // 13: 48 83 e7 fe                   andq    $-0x2, %rdi
    // 17: 48 89 de                      movq    %rbx, %rsi
    // 1a: 48 83 e6 fe                   andq    $-0x2, %rsi
    // 1e: ff 15 00 00 00 00             callq   *(%rip)                 # 0x24 <_JIT_ENTRY+0x24>
    // 0000000000000020:  R_X86_64_GOTPCRELX   _PyCompactLong_Add-0x4
    // 24: 48 83 f8 01                   cmpq    $0x1, %rax
    // 28: 75 15                         jne     0x3f <_JIT_ENTRY+0x3f>
    // 3f: 49 89 c7                      movq    %rax, %r15
    // 42: 48 89 ef                      movq    %rbp, %rdi
    // 45: 48 89 de                      movq    %rbx, %rsi
    // 48: 48 83 c4 10                   addq    $0x10, %rsp
    // 4c: 5d                            popq    %rbp

_BINARY_OP_ADD_INT_r23.COLD:
    // 2a: 49 89 ef                      movq    %rbp, %r15
    // 2d: 48 89 df                      movq    %rbx, %rdi
    // 30: 48 8b 74 24 08                movq    0x8(%rsp), %rsi
    // 35: 48 83 c4 10                   addq    $0x10, %rsp
    // 39: 5d                            popq    %rbp
    // 3a: e9 00 00 00 00                jmp     0x3f <_JIT_ENTRY+0x3f>
    // 000000000000003b:  R_X86_64_PLT32       _JIT_JUMP_TARGET-0x4
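The split above amounts to partitioning the parsed blocks by their hot/cold flag while keeping their relative order. A minimal sketch, assuming a hypothetical `Block` type with a `hot` attribute (these names are illustrative and are not taken from the actual stencil tooling):

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical basic block: a label, its instructions, and a hot flag."""
    label: str
    instructions: list[str] = field(default_factory=list)
    hot: bool = True


def split_hot_cold(blocks: list[Block]) -> tuple[list[Block], list[Block]]:
    """Partition blocks into (hot, cold), preserving original order in each."""
    hot = [b for b in blocks if b.hot]
    cold = [b for b in blocks if not b.hot]
    return hot, cold


# Block labels below are made up to mirror the listing above.
blocks = [
    Block("entry", ["cmpq $0x1, %rax", "jne continue"]),
    Block("side_exit", ["popq %rbp", "jmp _JIT_JUMP_TARGET"], hot=False),
    Block("continue", ["movq %rax, %r15", "popq %rbp"]),
]
hot, cold = split_hot_cold(blocks)
```

After the split, the hot list no longer contains the block that `entry` used to fall through to, so the branch at the end of `entry` has to be rewritten before the code is correct again.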

Running the existing jump inversion and zero-length jump removal passes then gives us:

_BINARY_OP_ADD_INT_r23.HOT:
    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 55                            pushq   %rbp
    // 1: 48 83 ec 10                   subq    $0x10, %rsp
    // 5: 48 89 74 24 08                movq    %rsi, 0x8(%rsp)
    // a: 48 89 fb                      movq    %rdi, %rbx
    // d: 4c 89 fd                      movq    %r15, %rbp
    // 10: 4c 89 ff                      movq    %r15, %rdi
    // 13: 48 83 e7 fe                   andq    $-0x2, %rdi
    // 17: 48 89 de                      movq    %rbx, %rsi
    // 1a: 48 83 e6 fe                   andq    $-0x2, %rsi
    // 1e: ff 15 00 00 00 00             callq   *(%rip)                 # 0x24 <_JIT_ENTRY+0x24>
    // 0000000000000020:  R_X86_64_GOTPCRELX   _PyCompactLong_Add-0x4
    // 24: 48 83 f8 01                   cmpq    $0x1, %rax
    // 28: 75 15                         je    _BINARY_OP_ADD_INT_r23.COLD
    // 3f: 49 89 c7                      movq    %rax, %r15
    // 42: 48 89 ef                      movq    %rbp, %rdi
    // 45: 48 89 de                      movq    %rbx, %rsi
    // 48: 48 83 c4 10                   addq    $0x10, %rsp
    // 4c: 5d                            popq    %rbp

_BINARY_OP_ADD_INT_r23.COLD:
    // 2a: 49 89 ef                      movq    %rbp, %r15
    // 2d: 48 89 df                      movq    %rbx, %rdi
    // 30: 48 8b 74 24 08                movq    0x8(%rsp), %rsi
    // 35: 48 83 c4 10                   addq    $0x10, %rsp
    // 39: 5d                            popq    %rbp
    // 3a: e9 00 00 00 00                jmp     0x3f <_JIT_ENTRY+0x3f>
    // 000000000000003b:  R_X86_64_PLT32       _JIT_JUMP_TARGET-0x4
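The rewrite shown above can be sketched roughly as follows. This is an illustration under assumptions, not the existing optimizer: the `Block` fields, the `relink` helper, and the small `INVERSE` table are all hypothetical. The idea is that when a conditional branch targets the block that now physically follows it, the condition is inverted so the branch goes to the old fallthrough, and an unconditional `jmp` to the next block is dropped.

```python
from dataclasses import dataclass
from typing import Optional

# Inverse condition codes for a few x86-64 conditional jumps (not exhaustive).
INVERSE = {"je": "jne", "jne": "je", "jl": "jge", "jge": "jl"}


@dataclass
class Block:
    label: str
    body: list[str]
    branch: Optional[tuple[str, str]] = None  # (mnemonic, target label)
    fallthrough: Optional[str] = None  # successor when the branch is not taken


def relink(order: list[Block]) -> list[str]:
    """Emit assembly lines for blocks in the given physical order."""
    lines: list[str] = []
    for i, block in enumerate(order):
        following = order[i + 1].label if i + 1 < len(order) else None
        lines.append(f"{block.label}:")
        lines.extend(f"    {insn}" for insn in block.body)
        branch, fallthrough = block.branch, block.fallthrough
        # Jump inversion: the branch targets the next block, so branch to the
        # old fallthrough instead and fall through to the old target.
        if (branch and branch[0] in INVERSE and branch[1] == following
                and fallthrough and fallthrough != following):
            branch, fallthrough = (INVERSE[branch[0]], fallthrough), branch[1]
        # Zero-length jump removal: a `jmp` to the next block does nothing.
        if branch and branch[0] == "jmp" and branch[1] == following:
            branch = None
        if branch:
            lines.append(f"    {branch[0]} {branch[1]}")
        # The logical successor was moved away, so reach it explicitly.
        if fallthrough and fallthrough != following:
            lines.append(f"    jmp {fallthrough}")
    return lines


hot = [
    Block("entry", ["cmpq $0x1, %rax"], ("jne", "cont"), "cold_exit"),
    Block("cont", ["movq %rax, %r15"]),
]
cold = [Block("cold_exit", ["popq %rbp"], ("jmp", "_JIT_JUMP_TARGET"))]
asm = relink(hot + cold)
```

With this ordering, `entry` ends in `je cold_exit` and falls straight through into `cont`, which matches the shape of the listing above.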

We then lay out each trace with the HOT sections placed back to back and the COLD sections left at the end. I think this is as good as it gets for machine code flow/layout unless we start writing things by hand.
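As a rough sketch of that layout, assuming each stencil has already been compiled into a pair of hot and cold byte strings (the `layout_trace` helper and the placeholder bytes are hypothetical, and relocation patching is left out entirely):

```python
def layout_trace(
    stencils: list[tuple[bytes, bytes]],
) -> tuple[bytes, list[int], list[int]]:
    """Concatenate all hot parts in trace order, then all cold parts.

    Returns the code plus the start offset of each hot and each cold part,
    which a real emitter would need in order to patch jumps between them.
    """
    code = bytearray()
    hot_offsets: list[int] = []
    cold_offsets: list[int] = []
    for hot, _ in stencils:
        hot_offsets.append(len(code))
        code += hot
    for _, cold in stencils:
        cold_offsets.append(len(code))
        code += cold
    return bytes(code), hot_offsets, cold_offsets


# Placeholder bytes only: 0x90 stands in for hot code, 0xcc for cold code.
trace = [(b"\x90" * 4, b"\xcc" * 2), (b"\x90" * 3, b""), (b"\x90" * 5, b"\xcc")]
code, hot_offsets, cold_offsets = layout_trace(trace)
```

Keeping the hot parts contiguous means the common path runs through the trace without stepping over cold code.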

This builds on #142228.

In the future, to reduce the jitted memory even further, we can de-duplicate common cold stencil fragments. E.g. if we see multiple _BINARY_OP_ADD_INT_r23 in a trace, they can all jump to a single shared _BINARY_OP_ADD_INT_r23.COLD instead of each stencil having its own copy. That should be a separate PR from this one, however.
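One way the de-duplication might look, as a sketch only: it assumes cold fragments can be compared by stencil name and final bytes, which holds only if their relocations resolve to the same values. The `dedupe_cold` helper and the `_OTHER_STENCIL` name are hypothetical.

```python
def dedupe_cold(fragments: list[tuple[str, bytes]]) -> tuple[bytes, list[int]]:
    """Build a shared cold region, emitting each distinct fragment once.

    Returns the region and, for each input fragment, the offset of the copy
    it should jump to.
    """
    region = bytearray()
    seen: dict[tuple[str, bytes], int] = {}
    offsets: list[int] = []
    for name, code in fragments:
        key = (name, code)
        if key not in seen:
            seen[key] = len(region)
            region += code
        offsets.append(seen[key])
    return bytes(region), offsets


# Placeholder bytes; only the first stencil name comes from the text above.
fragments = [
    ("_BINARY_OP_ADD_INT_r23", b"\x5d\xe9"),
    ("_OTHER_STENCIL", b"\xc3"),
    ("_BINARY_OP_ADD_INT_r23", b"\x5d\xe9"),
]
region, offsets = dedupe_cold(fragments)
```

Here the first and third fragments share one copy, so the cold region holds three bytes instead of five.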

I will work on this.

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), topic-JIT, type-feature (A feature request or enhancement)