Add huge pages support for pymalloc #144319

@pablogsal

pymalloc allocates small objects from contiguous regions called arenas. On 64-bit platforms each arena is 1 MiB, obtained via mmap(MAP_PRIVATE|MAP_ANONYMOUS) and backed by 256 standard 4 KiB pages. Each page needs its own TLB entry, and the x86_64 dTLB holds only 64-128 entries for 4 KiB pages, so a single arena already exceeds TLB capacity. Any non-trivial Python program touches many arenas.

Most modern operating systems support "huge pages": memory pages much larger than the default 4 KiB. On x86_64 Linux the standard huge page size is 2 MiB. A single 2 MiB huge page is covered by one TLB entry instead of 512 entries for the equivalent range of 4 KiB pages. This dramatically reduces TLB pressure for workloads that touch large contiguous allocations. On Linux, explicit huge pages are allocated via mmap with the MAP_HUGETLB flag (available since kernel 2.6.32) from a pre-reserved pool configured through /proc/sys/vm/nr_hugepages. On Windows, the equivalent is VirtualAlloc with MEM_LARGE_PAGES.
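The page and TLB-entry counts above can be checked with a few lines of arithmetic. This is an illustrative sketch only: the helper names `pages_needed` and `tlb_reach_kib` are made up for this example, and the 64 and 128 entry counts are the dTLB figures quoted above, not measured values.

```python
KIB = 1024
MIB = 1024 * KIB

def pages_needed(region_bytes, page_bytes):
    """Pages (and therefore TLB entries) needed to map a region."""
    return -(-region_bytes // page_bytes)  # ceiling division

def tlb_reach_kib(entries, page_bytes):
    """Amount of memory, in KiB, that `entries` TLB entries can cover."""
    return entries * page_bytes // KIB

print(pages_needed(1 * MIB, 4 * KIB))   # 256: one 1 MiB arena on 4 KiB pages
print(pages_needed(2 * MIB, 4 * KIB))   # 512: a 2 MiB range on 4 KiB pages
print(pages_needed(2 * MIB, 2 * MIB))   # 1: the same range on one huge page
print(tlb_reach_kib(64, 4 * KIB))       # 256 KiB, less than one 1 MiB arena
print(tlb_reach_kib(128, 4 * KIB))      # 512 KiB, still less than one arena
```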

I'd like to propose adding a ./configure --with-pymalloc-hugepages option that increases ARENA_BITS from 20 to 21 (1 MiB -> 2 MiB) and makes _PyMem_ArenaAlloc() try mmap(MAP_HUGETLB) first, falling back to regular mmap if the huge page pool is exhausted. On Windows the equivalent would be VirtualAlloc(MEM_LARGE_PAGES) with fallback. _PyMem_ArenaFree() needs no changes since munmap handles huge pages identically. All derived constants (ARENA_SIZE, MAX_POOLS_IN_ARENA, radix tree bit widths, nfp2lasta sizing) adjust automatically from ARENA_BITS.
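The try-then-fall-back allocation can be sketched with Python's own mmap module as a stand-in for the C code. This is not the proposed _PyMem_ArenaAlloc() implementation: `arena_alloc` is a hypothetical name, the sketch is Unix-only, and the numeric value 0x40000 for MAP_HUGETLB is an assumption that holds on x86_64 and most other Linux architectures but not all of them.

```python
import mmap
import sys

ARENA_BITS = 21                # proposed value; the current value is 20
ARENA_SIZE = 1 << ARENA_BITS   # 2 MiB, one x86_64 huge page

# Assumption: 0x40000 is MAP_HUGETLB on this architecture. Use the
# mmap module's own constant instead if it happens to expose one.
MAP_HUGETLB = getattr(mmap, "MAP_HUGETLB", 0x40000)

def arena_alloc(size=ARENA_SIZE):
    """Return (mapping, used_huge_pages) for a private anonymous mapping."""
    base_flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    if sys.platform.startswith("linux"):
        try:
            # Fails with OSError if the huge page pool is empty or exhausted.
            return mmap.mmap(-1, size, flags=base_flags | MAP_HUGETLB), True
        except OSError:
            pass
    # Fallback: the same mapping backed by regular pages.
    return mmap.mmap(-1, size, flags=base_flags), False

arena, used_huge_pages = arena_alloc()
print(len(arena), used_huge_pages)
arena.close()  # unmapping is the same call in both cases
```

The caller never needs to know which path was taken, which is why the free side needs no changes.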

The flag is opt-in and off by default. MAP_HUGETLB requires the kernel to have huge pages pre-allocated; without them, the fallback path produces the same behavior as a non-hugepages build. On Linux, the size of the pool is set through /proc/sys/vm/nr_hugepages. To allocate 128 huge pages (256 MiB on x86_64, where the default huge page size is 2 MiB):

# Allocate (requires root)
echo 128 | sudo tee /proc/sys/vm/nr_hugepages

# Verify
grep HugePages /proc/meminfo
# HugePages_Total:     128
# HugePages_Free:      128

# Make persistent across reboots by adding to /etc/sysctl.conf:
# vm.nr_hugepages = 128

Each arena consumes one huge page. If the pool runs out, obmalloc transparently falls back to regular 4 KiB pages.
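Since each arena takes one huge page, the number of arenas the pool can back is simply the number of free huge pages. A small sketch that reads those counters from /proc/meminfo text follows. `parse_hugepage_info` is a hypothetical helper, and the sample text is illustrative: only the Total and Free values come from the example above, the other lines are assumed.

```python
import re

def parse_hugepage_info(meminfo_text):
    """Extract HugePages_* counters and Hugepagesize (KiB) from meminfo text."""
    info = {}
    for line in meminfo_text.splitlines():
        m = re.match(r"(HugePages_\w+|Hugepagesize):\s+(\d+)", line)
        if m:
            info[m.group(1)] = int(m.group(2))
    return info

# On a real system: text = open("/proc/meminfo").read()
sample = """\
HugePages_Total:     128
HugePages_Free:      128
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""

info = parse_hugepage_info(sample)
arenas_from_pool = info["HugePages_Free"]  # one arena per huge page
pool_mib = info["HugePages_Total"] * info["Hugepagesize"] // 1024
print(arenas_from_pool, pool_mib)  # 128 256
```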

I benchmarked on an i9-14900KS with Linux 6.18.3 and GCC 15.2.1, building from main, with nr_hugepages=128. Measurements used perf stat -r 100 with the cpu_core counters, and GC was disabled during the benchmarks.

Wall-clock results:

| Benchmark | Default | Hugepages | Change |
|---|---|---|---|
| list_of_tuples (1M 3-tuples) | 0.172s | 0.121s | -29.5% |
| fragmentation (500K alloc/free/realloc) | 0.162s | 0.119s | -26.5% |
| mixed_sizes (500K, 12 size classes) | 0.141s | 0.106s | -25.1% |
| bulk_small_alloc (1M bytearrays) | 0.205s | 0.160s | -22.1% |
| class_instances (500K __slots__) | 0.120s | 0.096s | -20.0% |
| arena_pressure (10x200K objects) | 0.509s | 0.448s | -12.1% |
| random_walk (1M, shuffled access) | 0.822s | 0.759s | -7.6% |
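The Change column is the relative difference in wall-clock time, (hugepages - default) / default. The sketch below recomputes it from the rounded times shown above, so the results differ from the reported figures by up to a few tenths of a percentage point; the reported values were presumably derived from unrounded measurements.

```python
# (default seconds, hugepages seconds, reported change in percent)
timings = {
    "list_of_tuples":   (0.172, 0.121, -29.5),
    "fragmentation":    (0.162, 0.119, -26.5),
    "mixed_sizes":      (0.141, 0.106, -25.1),
    "bulk_small_alloc": (0.205, 0.160, -22.1),
    "class_instances":  (0.120, 0.096, -20.0),
    "arena_pressure":   (0.509, 0.448, -12.1),
    "random_walk":      (0.822, 0.759, -7.6),
}

def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

for name, (default_s, huge_s, reported) in timings.items():
    computed = percent_change(default_s, huge_s)
    print(f"{name:18s} computed {computed:6.1f}%   reported {reported:6.1f}%")
```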

dTLB miss and page fault reductions:

| Benchmark | dTLB Load Misses | dTLB Store Misses | Page Faults |
|---|---|---|---|
| fragmentation | -95.9% | -94.7% | -94.5% |
| random_walk | -93.1% | -98.9% | -91.6% |
| bulk_small_alloc | -91.4% | -94.5% | -93.5% |
| list_of_tuples | -88.0% | -93.7% | -94.1% |
| class_instances | -84.3% | -91.8% | -92.1% |
| mixed_sizes | -80.8% | -76.5% | -78.2% |

The perf command used per benchmark:

EVENTS="dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-load-misses,cache-misses,cache-references,instructions,cycles,page-faults"
perf stat -r 10 -e "$EVENTS" ./python bench_obmalloc.py fragmentation

bench_obmalloc.py:

import gc
import sys

def bench_small_object_churn():
    # Fill a list, then append one object and drop the oldest on each step.
    objs = []
    for _ in range(200_000):
        objs.append(bytearray(64))
    for _ in range(200_000):
        objs.append(bytearray(64))
        objs.pop(0)

def bench_bulk_small_alloc():
    # Allocate 1M small bytearrays, then write to each one.
    objs = [bytearray(48) for _ in range(1_000_000)]
    for o in objs:
        o[0] = 1

def bench_dict_churn():
    # Repeatedly create and immediately drop a small dict.
    for _ in range(500_000):
        d = {"a": 1, "b": 2, "c": 3, "d": 4}
        del d

def bench_mixed_sizes():
    # Cycle through 12 buffer sizes from 8 to 512 bytes.
    sizes = [8, 16, 24, 32, 48, 64, 96, 128, 192, 256, 384, 512]
    objs = [bytearray(sizes[i % 12]) for i in range(500_000)]

def bench_fragmentation():
    # Free every other object, then refill the holes.
    objs = [bytearray(128) for _ in range(500_000)]
    for i in range(0, len(objs), 2):
        objs[i] = None
    for i in range(0, len(objs), 2):
        objs[i] = bytearray(128)

def bench_list_of_tuples():
    # Build 1M 3-tuples of ints.
    objs = [(i, i+1, i+2) for i in range(1_000_000)]

def bench_class_instances():
    # Build 500K instances of a __slots__ class.
    class Pt:
        __slots__ = ('x', 'y', 'z')
        def __init__(self, x, y, z):
            self.x = x
            self.y = y
            self.z = z
    objs = [Pt(i, i+1, i+2) for i in range(500_000)]

def bench_arena_pressure():
    # Keep 10 layers of 200K bytearrays alive at once.
    layers = [[bytearray(256) for _ in range(200_000)] for _ in range(10)]

def bench_random_walk():
    # Write to 1M objects in shuffled order.
    import random
    random.seed(42)
    objs = [bytearray(64) for _ in range(1_000_000)]
    idx = list(range(len(objs)))
    random.shuffle(idx)
    for i in idx:
        objs[i][0] = i & 0xff

BENCHMARKS = dict(
    small_object_churn=bench_small_object_churn,
    bulk_small_alloc=bench_bulk_small_alloc,
    dict_churn=bench_dict_churn,
    mixed_sizes=bench_mixed_sizes,
    fragmentation=bench_fragmentation,
    list_of_tuples=bench_list_of_tuples,
    class_instances=bench_class_instances,
    arena_pressure=bench_arena_pressure,
    random_walk=bench_random_walk,
)

if __name__ == "__main__":
    # Run the benchmark named on the command line with the cyclic GC disabled.
    gc.collect()
    gc.disable()
    BENCHMARKS[sys.argv[1]]()
    gc.enable()

Full reproduction:

./configure && make -j$(nproc) && cp python python_default
./configure --with-pymalloc-hugepages && make -j$(nproc) && cp python python_hugepages
echo 128 | sudo tee /proc/sys/vm/nr_hugepages

EVENTS="dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-load-misses,cache-misses,cache-references,instructions,cycles,page-faults"
for b in bulk_small_alloc mixed_sizes fragmentation list_of_tuples class_instances arena_pressure random_walk; do
    echo "=== $b ==="
    perf stat -r 10 -e "$EVENTS" ./python_default bench_obmalloc.py "$b"
    perf stat -r 10 -e "$EVENTS" ./python_hugepages bench_obmalloc.py "$b"
done

Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), type-feature (A feature request or enhancement)