Reduce copies when reading files in pyio, match behavior of _io #129005

Open
@cmaloney

Description

Feature or enhancement

Proposal:

Currently _pyio uses ~2x as much memory to read all data from a file compared to _io. This is because it makes more than one copy of the data.

Details from test_largefile run

$ ./python -m test -M8g -uall test_largefile -m test_large_read -vvv
== CPython 3.14.0a4+ (heads/main-dirty:3829104ab41, Jan 17 2025, 21:40:47) [Clang 19.1.6 ]
== Linux-6.12.9-arch1-1-x86_64-with-glibc2.40 little-endian
== Python build: debug
== cwd: <$HOME>/python/build/build/test_python_worker_32392æ
== CPU count: 32
== encodings: locale=UTF-8 FS=utf-8
== resources: all

Using random seed: 1740056613
0:00:00 load avg: 0.53 Run 1 test sequentially in a single process
0:00:00 load avg: 0.53 [1/1] test_largefile
test_large_read (test.test_largefile.CLargeFileTest.test_large_read) ... 
 ... expected peak memory use: 4.7G
 ... process data size: 2.3G
ok
test_large_read (test.test_largefile.PyLargeFileTest.test_large_read) ... 
 ... expected peak memory use: 4.7G
 ... process data size: 2.3G
 ... process data size: 4.3G
 ... process data size: 4.7G
ok

----------------------------------------------------------------------
Ran 2 tests in 3.711s

OK

== Tests result: SUCCESS ==

1 test OK.

Total duration: 3.7 sec
Total tests: run=2 (filtered)
Total test files: run=1/1 (filtered)
Result: SUCCESS

Plan:

  1. Switch from os.read() to os.readinto() (originally proposed as os.readv()) so that reads go into a caller-allocated buffer, as the C _Py_read used by _io does. os.read() can't take a buffer to use. This aligns behavior between _io.FileIO.readall and _pyio.FileIO.readall. The original rationale for os.readv() was that it works well today and takes a caller-allocated buffer, so no new os API would be needed. readv(2) mirrors the behavior and errors of read(2), so this should keep the same end behavior.
  2. Update _pyio.BufferedIO to not force a copy of the buffer for readall when its internal buffer is empty. Currently it always slices its internal buffer then adds the result of _pyio.FileIO.readall to it.
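The two steps above can be sketched roughly as follows. This is a minimal illustration, assuming a POSIX platform where os.readv is available. The names readall_readv and join_buffered are hypothetical helpers, not the actual _pyio code, and the size-hint and growth policy are simplifications.

```python
import os


def readall_readv(fd, default_chunk=8192):
    """Read from fd until EOF into a single bytearray via os.readv().

    Hypothetical helper: it returns the bytearray itself, because converting
    it to bytes would add one more copy of the data.
    """
    try:
        # Size the buffer from the file size, plus one spare byte so that
        # EOF is detected by a final zero-length read without growing.
        pos = os.lseek(fd, 0, os.SEEK_CUR)
        size_hint = max(os.fstat(fd).st_size - pos, 0) + 1
    except OSError:
        # Not seekable (for example a pipe): start with a fixed chunk.
        size_hint = default_chunk

    buf = bytearray(size_hint)
    filled = 0
    while True:
        if filled == len(buf):
            # Buffer is full: grow it before reading again.
            buf.extend(bytes(default_chunk))
        # The views are released at the end of the with block, so the
        # bytearray can be resized on the next iteration.
        with memoryview(buf) as whole, whole[filled:] as target:
            n = os.readv(fd, [target])
        if n == 0:
            break
        filled += n
    del buf[filled:]  # trim the unused tail in place
    return buf


def join_buffered(leftover, raw_data):
    """Combine already-buffered bytes with the result of a raw readall.

    When nothing is buffered, return raw_data itself instead of building a
    new object by concatenation.
    """
    if not leftover:
        return raw_data
    return leftover + raw_data
```

The with block around the memoryview matters: a bytearray cannot be resized while a view of it is still exported, so each view is released before the buffer grows.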

For iterating, I'm using a small tracemalloc script to find where copies are:

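import tracemalloc

from _pyio import open  # pure-Python io implementation under investigation

with open("README.rst", 'rb') as file:
    # Start tracing just before the read so only its allocations are recorded.
    tracemalloc.start()
    data = file.read()
    snap = tracemalloc.take_snapshot()

# Group live allocations by source line to see where each copy is made.
stats = snap.statistics('lineno')
for stat in stats:
    print(stat)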
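A snapshot lists only the blocks still alive when it is taken, so a temporary copy that has already been freed will not appear in it. As a complementary sketch, the peak traced size can be compared between the two implementations. The helpers peak_traced_bytes and compare are hypothetical names, and the path passed to compare is whatever file is convenient.

```python
import io
import tracemalloc

import _pyio


def peak_traced_bytes(func):
    """Run func() and return (result, peak traced bytes during the call)."""
    tracemalloc.start()
    try:
        tracemalloc.reset_peak()
        before, _ = tracemalloc.get_traced_memory()
        result = func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak - before


def read_all(opener, path):
    """Read a whole file in binary mode with the given open() function."""
    with opener(path, "rb") as file:
        return file.read()


def compare(path):
    """Print the peak traced memory for the _io and _pyio read paths."""
    for name, opener in (("_io", io.open), ("_pyio", _pyio.open)):
        data, peak = peak_traced_bytes(lambda: read_all(opener, path))
        print(f"{name}: read {len(data)} bytes, peak traced {peak} bytes")
```

Calling compare("README.rst") from a CPython checkout prints one line per implementation. A peak well above the file size points to extra copies.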

Loose Ends

  • os.readv seems to be well supported but is currently guarded by a configure check. I'd like to just make pyio require readv, but can do conditional code if needed. If making readv non-optional generally is feasible, happy to work on that.
    • os.readv is not supported on WASI, so conditional code is needed.
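One way the conditional code could look is sketched below. The helper name read_into is hypothetical, and the selection order is an assumption: it prefers os.readinto where the running Python provides it, then os.readv, and finally falls back to os.read with one extra copy.

```python
import os

if hasattr(os, "readinto"):
    def read_into(fd, buf):
        """Fill buf from fd with os.readinto(); return the byte count."""
        return os.readinto(fd, buf)
elif hasattr(os, "readv"):
    def read_into(fd, buf):
        """Fill buf from fd with os.readv(); return the byte count."""
        return os.readv(fd, [buf])
else:
    def read_into(fd, buf):
        """Fallback for platforms without readv: costs one extra copy."""
        data = os.read(fd, len(buf))
        n = len(data)
        buf[:n] = data
        return n
```

Selecting the implementation once at import time keeps the per-read path free of feature checks. buf is expected to be a bytearray or a writable memoryview.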

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Metadata

Assignees

No one assigned

    Labels

    performance (Performance or resource usage), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)
