Reduce copies when reading files in pyio, match behavior of _io

Feature or enhancement

Proposal:

Currently _pyio uses ~2x as much memory to read all data from a file compared to _io. This is because it makes more than one copy of the data.

Details from a test_largefile run:

$ ./python -m test -M8g -uall test_largefile -m test_large_read -vvv
== CPython 3.14.0a4+ (heads/main-dirty:3829104ab41, Jan 17 2025, 21:40:47) [Clang 19.1.6 ]
== Linux-6.12.9-arch1-1-x86_64-with-glibc2.40 little-endian
== Python build: debug
== cwd: <$HOME>/python/build/build/test_python_worker_32392æ
== CPU count: 32
== encodings: locale=UTF-8 FS=utf-8
== resources: all

Using random seed: 1740056613
0:00:00 load avg: 0.53 Run 1 test sequentially in a single process
0:00:00 load avg: 0.53 [1/1] test_largefile
test_large_read (test.test_largefile.CLargeFileTest.test_large_read) ... 
 ... expected peak memory use: 4.7G
 ... process data size: 2.3G
ok
test_large_read (test.test_largefile.PyLargeFileTest.test_large_read) ... 
 ... expected peak memory use: 4.7G
 ... process data size: 2.3G
 ... process data size: 4.3G
 ... process data size: 4.7G
ok

----------------------------------------------------------------------
Ran 2 tests in 3.711s

OK

== Tests result: SUCCESS ==

1 test OK.

Total duration: 3.7 sec
Total tests: run=2 (filtered)
Total test files: run=1/1 (filtered)
Result: SUCCESS

Plan:

- Switch to ~~os.readv()~~ os.readinto() so that _pyio reads directly into a caller-allocated buffer, as the C _Py_read used by _io does. os.read() cannot take a buffer to fill, so every call returns a newly allocated bytes object. This aligns the behavior of _io.FileIO.readall and _pyio.FileIO.readall. os.readv, the option considered first, works well today and takes a caller-allocated buffer without needing a new os API. readv(2) mirrors the behavior and errors of read(2), so the end behavior should stay the same.
- Update _pyio.BufferedIO so that readall does not force a copy of the buffer when its internal buffer is empty. Currently it always slices its internal buffer and then appends the result of _pyio.FileIO.readall to it.
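
A rough sketch of both plan items, not the actual _pyio code. It assumes a POSIX platform where os.readv exists (on 3.14+, os.readinto(fd, tail) would be the direct equivalent), and readall_into and buffered_readall are hypothetical names used only for illustration:

```python
import os


def readall_into(fd, default_size=8192):
    """Read fd to EOF into one bytearray, without per-chunk bytes objects."""
    try:
        # Size the buffer from fstat; the extra byte lets the final read
        # hit EOF without growing the buffer first.
        remaining = os.fstat(fd).st_size - os.lseek(fd, 0, os.SEEK_CUR)
        bufsize = max(remaining, 0) + 1
    except OSError:
        # Not seekable (pipe, socket, ...): start from a default size.
        bufsize = default_size

    result = bytearray(bufsize)
    filled = 0
    while True:
        if filled == len(result):
            # Buffer is full: double it. No memoryview is alive here,
            # so the bytearray is allowed to resize.
            result += bytes(len(result))
        with memoryview(result) as whole, whole[filled:] as tail:
            n = os.readv(fd, [tail])  # the kernel writes into our buffer
        if n == 0:  # EOF
            break
        filled += n
    del result[filled:]  # trim the unused tail in place
    return result


def buffered_readall(pending, raw_readall):
    """Combine already-buffered bytes with the rest of the raw stream."""
    chunk = raw_readall()
    if not pending:
        return chunk  # nothing buffered: hand back the raw result as-is
    if not chunk:
        return pending
    return pending + chunk  # concatenation (a copy) only when needed
```

The sketch returns the bytearray itself; converting it with bytes(result) would add back one copy, and non-blocking file descriptors are not handled.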

For iterating, I'm using a small tracemalloc script to find where copies are:

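from _pyio import open

import tracemalloc

with open("README.rst", 'rb') as file:
    # Start tracing after the file is open so only the read is measured.
    tracemalloc.start()
    data = file.read()
    snap = tracemalloc.take_snapshot()

# Group the allocations still alive at the snapshot by source line.
stats = snap.statistics('lineno')
for stat in stats:
    print(stat)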

Loose Ends

os.readv seems to be well supported but is currently guarded by a configure check. I'd like to just make _pyio require readv, but can add conditional code if needed. If making readv non-optional in general is feasible, I'm happy to work on that.
- os.readv is not supported on WASI, so conditional code is needed.
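
One way the conditional code could look, as a sketch under assumptions rather than what _pyio does: pick the best available primitive once at import time and fall back to os.read plus a copy where neither os.readinto nor os.readv exists. The names read_into_copy and select_read_into are hypothetical:

```python
import os


def read_into_copy(fd, buf):
    """Fallback: os.read allocates a bytes object, then we copy it over."""
    data = os.read(fd, len(buf))
    buf[:len(data)] = data
    return len(data)


def select_read_into():
    """Return a callable (fd, buffer) -> number of bytes read."""
    if hasattr(os, "readinto"):  # added in Python 3.14
        return os.readinto
    if hasattr(os, "readv"):  # guarded by a configure check; absent on WASI
        return lambda fd, buf: os.readv(fd, [buf])
    return read_into_copy


read_into = select_read_into()
```

Callers then use read_into(fd, buffer) everywhere, and only platforms without a read-into primitive pay for the extra copy.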

Reduce copies when reading files in pyio, match behavior of _io #129005
