Add huge pages support for pymalloc

pymalloc allocates small objects from contiguous regions called arenas. On 64-bit platforms each arena is 1 MiB, obtained via mmap(MAP_PRIVATE|MAP_ANONYMOUS) and backed by 256 standard 4 KiB pages. Each page needs its own TLB entry, and the x86_64 dTLB typically holds only 64-128 entries for 4 KiB pages, so the 256 pages of a single arena already exceed TLB capacity, and any non-trivial Python program touches many arenas.

Most modern operating systems support "huge pages": memory pages much larger than the default 4 KiB. On x86_64 Linux the standard huge page size is 2 MiB. A single 2 MiB huge page is covered by one TLB entry instead of 512 entries for the equivalent range of 4 KiB pages. This dramatically reduces TLB pressure for workloads that touch large contiguous allocations. On Linux, explicit huge pages are allocated via mmap with the MAP_HUGETLB flag (available since kernel 2.6.32) from a pre-reserved pool configured through /proc/sys/vm/nr_hugepages. On Windows, the equivalent is VirtualAlloc with MEM_LARGE_PAGES.
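The page counts quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper names are made up for this example, and the 64 and 128 entry counts are the dTLB range mentioned earlier, not measurements of any particular CPU.

```python
# Back-of-the-envelope TLB arithmetic for the figures quoted above.
KIB = 1024
MIB = 1024 * KIB

BASE_PAGE = 4 * KIB   # standard x86_64 page
HUGE_PAGE = 2 * MIB   # standard x86_64 huge page
ARENA_1M = 1 * MIB    # current 64-bit pymalloc arena (ARENA_BITS = 20)
ARENA_2M = 2 * MIB    # proposed arena (ARENA_BITS = 21)


def tlb_entries(region_size, page_size):
    """Number of TLB entries needed to map region_size bytes (hypothetical helper)."""
    return -(-region_size // page_size)  # ceiling division


def tlb_reach(entries, page_size):
    """Bytes of address space covered by the given number of TLB entries."""
    return entries * page_size


if __name__ == "__main__":
    print(tlb_entries(ARENA_1M, BASE_PAGE))   # 256 entries for one 1 MiB arena
    print(tlb_entries(HUGE_PAGE, BASE_PAGE))  # 512 entries for 2 MiB of 4 KiB pages
    print(tlb_entries(ARENA_2M, HUGE_PAGE))   # 1 entry for a 2 MiB arena on a huge page
    print(tlb_reach(64, BASE_PAGE) // KIB)    # 256 KiB covered by 64 entries
    print(tlb_reach(128, BASE_PAGE) // KIB)   # 512 KiB covered by 128 entries
```

Even at the upper end of that range, 128 entries of 4 KiB pages cover only half of one current arena.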

I'd like to propose adding a ./configure --with-pymalloc-hugepages option that increases ARENA_BITS from 20 to 21 (1 MiB -> 2 MiB) and makes _PyMem_ArenaAlloc() try mmap(MAP_HUGETLB) first, falling back to regular mmap if the huge page pool is exhausted. On Windows the equivalent would be VirtualAlloc(MEM_LARGE_PAGES) with fallback. _PyMem_ArenaFree() needs no changes since munmap handles huge pages identically. All derived constants (ARENA_SIZE, MAX_POOLS_IN_ARENA, radix tree bit widths, nfp2lasta sizing) adjust automatically from ARENA_BITS.
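To make the try-then-fall-back behaviour concrete, here is a rough Python model of it. It is not the proposed C code: arena_alloc and arena_free are hypothetical stand-ins for _PyMem_ArenaAlloc() and _PyMem_ArenaFree(), the sketch is Linux-only, and the 0x40000 fallback value for MAP_HUGETLB is an assumption that holds on x86_64 Linux but differs on some other architectures.

```python
import mmap

ARENA_BITS = 21               # proposed value (currently 20 on 64-bit builds)
ARENA_SIZE = 1 << ARENA_BITS  # 2 MiB, i.e. one x86_64 huge page

# Assumption: 0x40000 is MAP_HUGETLB on x86_64 Linux. It is used only if
# this Python's mmap module does not export the constant itself.
MAP_HUGETLB = getattr(mmap, "MAP_HUGETLB", 0x40000)


def arena_alloc(size=ARENA_SIZE):
    """Hypothetical model of the proposed arena allocation path.

    Returns (mapping, used_huge_pages).
    """
    base_flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    try:
        # First attempt: explicit huge page from the pre-reserved pool.
        return mmap.mmap(-1, size, flags=base_flags | MAP_HUGETLB), True
    except OSError:
        # Pool empty or huge pages unavailable: regular anonymous mapping.
        return mmap.mmap(-1, size, flags=base_flags), False


def arena_free(mapping):
    """Unmapping is the same for both kinds of mapping."""
    mapping.close()


if __name__ == "__main__":
    arena, huge = arena_alloc()
    print(len(arena), "bytes,", "huge page" if huge else "regular pages")
    arena_free(arena)
```

On a machine with nr_hugepages=0 the first mmap call fails and the second path is taken, which mirrors the claim that the fallback behaves like a non-hugepages build.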

The flag is opt-in and off by default. MAP_HUGETLB requires the kernel to have huge pages pre-allocated; without them the fallback path produces identical behavior to a non-hugepages build. On Linux, huge pages are managed through /proc/sys/vm/nr_hugepages. To allocate 128 huge pages (256 MiB on x86_64 where the default huge page size is 2 MiB):

# Allocate (requires root)
echo 128 | sudo tee /proc/sys/vm/nr_hugepages

# Verify
grep HugePages /proc/meminfo
# HugePages_Total:     128
# HugePages_Free:      128

# Make persistent across reboots by adding to /etc/sysctl.conf:
# vm.nr_hugepages = 128

Each arena consumes one huge page. If the pool runs out, obmalloc transparently falls back to regular 4 KiB pages.
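As a rough way to reason about pool sizing, the counters shown above can be parsed and compared with the number of arenas a workload needs. The function names below are hypothetical, the sample text just repeats the two lines from the grep output, and the estimate ignores reserved pages and other processes drawing from the same pool.

```python
def parse_hugepage_counters(meminfo_text):
    """Return {name: int} for the HugePages_* lines of /proc/meminfo-style text."""
    counters = {}
    for line in meminfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.startswith("HugePages_"):
            counters[name] = int(value.split()[0])
    return counters


def arenas_on_huge_pages(counters, arenas_needed):
    """Estimate how many arenas get a huge page, at one huge page per arena."""
    return min(arenas_needed, counters.get("HugePages_Free", 0))


# Sample taken from the grep output above; on a real system, read
# /proc/meminfo instead.
SAMPLE = """\
HugePages_Total:     128
HugePages_Free:      128
"""

if __name__ == "__main__":
    counters = parse_hugepage_counters(SAMPLE)
    needed = 200  # illustrative arena count, not a measured value
    backed = arenas_on_huge_pages(counters, needed)
    print(f"{backed} arenas on huge pages, {needed - backed} on regular pages")
```

With 128 free huge pages and 200 arenas requested, this estimate puts 128 arenas on huge pages and leaves 72 on the fallback path.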

I benchmarked CPython main on an i9-14900KS (Linux 6.18.3, GCC 15.2.1) with nr_hugepages=128. Measurements were taken with perf stat -r 10 using cpu_core counters. GC was disabled during the benchmarks.

Wall-clock results:

Benchmark	Default	Hugepages	Change
list_of_tuples (1M 3-tuples)	0.172s	0.121s	-29.5%
fragmentation (500K alloc/free/realloc)	0.162s	0.119s	-26.5%
mixed_sizes (500K, 12 size classes)	0.141s	0.106s	-25.1%
bulk_small_alloc (1M bytearrays)	0.205s	0.160s	-22.1%
class_instances (500K __slots__)	0.120s	0.096s	-20.0%
arena_pressure (10x200K objects)	0.509s	0.448s	-12.1%
random_walk (1M, shuffled access)	0.822s	0.759s	-7.6%

dTLB miss and page fault reductions (hugepages build relative to default):

Benchmark	dTLB Load Miss	dTLB Store Miss	Page Faults
fragmentation	-95.9%	-94.7%	-94.5%
random_walk	-93.1%	-98.9%	-91.6%
bulk_small_alloc	-91.4%	-94.5%	-93.5%
list_of_tuples	-88.0%	-93.7%	-94.1%
class_instances	-84.3%	-91.8%	-92.1%
mixed_sizes	-80.8%	-76.5%	-78.2%

The perf command used per benchmark:

EVENTS="dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-load-misses,cache-misses,cache-references,instructions,cycles,page-faults"
perf stat -r 10 -e "$EVENTS" ./python bench_obmalloc.py fragmentation

bench_obmalloc.py

import sys, gc

def bench_small_object_churn():
    objs = []
    for _ in range(200_000): objs.append(bytearray(64))
    for _ in range(200_000): objs.append(bytearray(64)); objs.pop(0)

def bench_bulk_small_alloc():
    objs = [bytearray(48) for _ in range(1_000_000)]
    for o in objs: o[0] = 1

def bench_dict_churn():
    for _ in range(500_000): d = {"a": 1, "b": 2, "c": 3, "d": 4}; del d

def bench_mixed_sizes():
    sizes = [8, 16, 24, 32, 48, 64, 96, 128, 192, 256, 384, 512]
    objs = [bytearray(sizes[i % 12]) for i in range(500_000)]

def bench_fragmentation():
    objs = [bytearray(128) for _ in range(500_000)]
    for i in range(0, len(objs), 2): objs[i] = None
    for i in range(0, len(objs), 2): objs[i] = bytearray(128)

def bench_list_of_tuples():
    objs = [(i, i+1, i+2) for i in range(1_000_000)]

def bench_class_instances():
    class Pt:
        __slots__ = ('x', 'y', 'z')
        def __init__(s, x, y, z): s.x = x; s.y = y; s.z = z
    objs = [Pt(i, i+1, i+2) for i in range(500_000)]

def bench_arena_pressure():
    layers = [[bytearray(256) for _ in range(200_000)] for _ in range(10)]

def bench_random_walk():
    import random; random.seed(42)
    objs = [bytearray(64) for _ in range(1_000_000)]
    idx = list(range(len(objs))); random.shuffle(idx)
    for i in idx: objs[i][0] = i & 0xff

BENCHMARKS = dict(small_object_churn=bench_small_object_churn,
    bulk_small_alloc=bench_bulk_small_alloc, dict_churn=bench_dict_churn,
    mixed_sizes=bench_mixed_sizes, fragmentation=bench_fragmentation,
    list_of_tuples=bench_list_of_tuples, class_instances=bench_class_instances,
    arena_pressure=bench_arena_pressure, random_walk=bench_random_walk)

if __name__ == "__main__":
    gc.collect(); gc.disable(); BENCHMARKS[sys.argv[1]](); gc.enable()

Full reproduction:

./configure && make -j$(nproc) && cp python python_default
./configure --with-pymalloc-hugepages && make -j$(nproc) && cp python python_hugepages
echo 128 | sudo tee /proc/sys/vm/nr_hugepages

EVENTS="dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-load-misses,cache-misses,cache-references,instructions,cycles,page-faults"
for b in bulk_small_alloc mixed_sizes fragmentation list_of_tuples class_instances arena_pressure random_walk; do
    echo "=== $b ==="
    perf stat -r 10 -e "$EVENTS" ./python_default bench_obmalloc.py "$b"
    perf stat -r 10 -e "$EVENTS" ./python_hugepages bench_obmalloc.py "$b"
done

Add huge pages support for pymalloc #144319