Open
Feature or enhancement
Proposal:
Currently _pyio uses ~2x as much memory as _io to read all data from a file. This is because it makes more than one copy of the data.
Details from a test_largefile run:
$ ./python -m test -M8g -uall test_largefile -m test_large_read -vvv
== CPython 3.14.0a4+ (heads/main-dirty:3829104ab41, Jan 17 2025, 21:40:47) [Clang 19.1.6 ]
== Linux-6.12.9-arch1-1-x86_64-with-glibc2.40 little-endian
== Python build: debug
== cwd: <$HOME>/python/build/build/test_python_worker_32392æ
== CPU count: 32
== encodings: locale=UTF-8 FS=utf-8
== resources: all
Using random seed: 1740056613
0:00:00 load avg: 0.53 Run 1 test sequentially in a single process
0:00:00 load avg: 0.53 [1/1] test_largefile
test_large_read (test.test_largefile.CLargeFileTest.test_large_read) ...
... expected peak memory use: 4.7G
... process data size: 2.3G
ok
test_large_read (test.test_largefile.PyLargeFileTest.test_large_read) ...
... expected peak memory use: 4.7G
... process data size: 2.3G
... process data size: 4.3G
... process data size: 4.7G
ok
----------------------------------------------------------------------
Ran 2 tests in 3.711s
OK
== Tests result: SUCCESS ==
1 test OK.
Total duration: 3.7 sec
Total tests: run=2 (filtered)
Total test files: run=1/1 (filtered)
Result: SUCCESS
Plan:
- Switch to os.readv() / os.readinto() to read into a caller-provided buffer, as the C _Py_read used by _io does. os.read() can't take a buffer to use. This aligns behavior between _io.FileIO.readall and _pyio.FileIO.readall. os.readv works well today and takes a caller-allocated buffer rather than needing to add a new os API. readv(2) mirrors the behavior and errors of read(2), so this should keep the same end behavior.
- Update _pyio.BufferedIO to not force a copy of the buffer for readall when its internal buffer is empty. Currently it always slices its internal buffer then adds the result of _pyio.FileIO.readall to it.
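The first plan item could look roughly like the sketch below. It is only an illustration, not the actual _pyio code: the helper name readall_into and its growth policy are hypothetical, and it assumes a platform where os.readv is available.

```python
import os
import tempfile

def readall_into(fd):
    """Hypothetical helper: read everything from fd into a single bytearray.

    os.readv() fills caller-allocated buffers, so no intermediate bytes
    object is created per read, unlike os.read().
    """
    # Size the buffer from fstat, plus one byte so EOF is usually
    # detected without having to grow the buffer.
    buf = bytearray(os.fstat(fd).st_size + 1)
    filled = 0
    while True:
        if filled == len(buf):
            # Buffer is full but EOF not seen yet: grow it.
            # (The temporary zero block is a simplification of this sketch.)
            buf.extend(bytes(max(len(buf), 8192)))
        # Release both views before the bytearray is resized again.
        with memoryview(buf) as view, view[filled:] as target:
            n = os.readv(fd, [target])
        if n == 0:  # EOF
            break
        filled += n
    del buf[filled:]  # drop the unused tail
    return buf

# Usage: round-trip a small temporary file.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 20000)
    os.lseek(fd, 0, os.SEEK_SET)
    assert readall_into(fd) == b"x" * 20000
finally:
    os.close(fd)
    os.unlink(path)
```

The data lands directly in the buffer that is eventually returned, which is the property the plan is after. Converting the result to bytes afterwards would still copy it once.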
For iterating, I'm using a small tracemalloc script to find where copies are:
from _pyio import open
import tracemalloc
with open("README.rst", 'rb') as file:
    tracemalloc.start()
    data = file.read()
    snap = tracemalloc.take_snapshot()
stats = snap.statistics('lineno')
for stat in stats:
    print(stat)
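Besides locating copies by line, tracemalloc can also quantify them through its peak counter. The following is a minimal sketch of that idea; the helper peak_traced_bytes is hypothetical and not part of the script above. It contrasts an operation that copies a buffer with one that only wraps it.

```python
import tracemalloc

def peak_traced_bytes(func):
    """Hypothetical helper: run func() and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        result = func()  # keep the result alive until the peak is sampled
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

size = 1024 * 1024
data = bytearray(size)  # allocated before tracing starts, so not counted

copy_peak = peak_traced_bytes(lambda: bytes(data))       # makes a full copy
view_peak = peak_traced_bytes(lambda: memoryview(data))  # wraps without copying

print(f"bytes(data): {copy_peak} bytes, memoryview(data): {view_peak} bytes")
```

A peak close to the size of the data indicates a full extra copy, which is the kind of allocation this issue aims to remove.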
Loose Ends
- os.readv seems to be well supported but is currently guarded by a configure check. I'd like to just make _pyio require readv, but can do conditional code if needed. If making readv non-optional generally is feasible, happy to work on that.
- os.readv is not supported on WASI, so need to add conditional code.
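The conditional code mentioned above might take a shape like the following. This is a sketch under assumptions, not the code from any of the linked PRs: the helper name read_into is hypothetical, and the feature checks use hasattr because os.readv depends on the platform and os.readinto depends on the Python version.

```python
import os

def read_into(fd, buffer):
    """Hypothetical helper: fill a writable buffer from fd, return the byte count."""
    if hasattr(os, "readinto"):
        # Available in newer Python versions; reads straight into the buffer.
        return os.readinto(fd, buffer)
    if hasattr(os, "readv"):
        # Guarded by a configure check; not available on every platform.
        return os.readv(fd, [buffer])
    # Fallback: os.read() returns a new bytes object, so one copy is unavoidable.
    data = os.read(fd, len(buffer))
    buffer[:len(data)] = data
    return len(data)

# Usage: read a short message from a pipe into a preallocated bytearray.
r, w = os.pipe()
try:
    os.write(w, b"hello")
    buf = bytearray(16)
    n = read_into(r, buf)
    print(n, bytes(buf[:n]))
finally:
    os.close(r)
    os.close(w)
```

Only the last branch allocates an intermediate bytes object, so platforms that lack both calls keep today's memory behavior while the others avoid the copy.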
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response
Linked PRs
- gh-129005: Avoid copy in _pyio.FileIO.readinto #129006
- gh-129005: Avoid copy in _pyio.FileIO.readinto #129324
- gh-129005: Align FileIO.readall allocation #129424
- gh-129005: _pyio.BufferedIO remove copy on readall #129454
- gh-129005: Align FileIO.readall allocation #129458
- gh-129005: Remove copy in _pyio.FileIO.readall() #129496
- Revert "gh-129005: _pyio.BufferedIO remove copy on readall (#129454)" #129500
- gh-129005: Fix buffer expansion in _pyio.FileIO.readall #129541
- Revert "gh-129005: Align FileIO.readall() allocation (#129458)" #129572
- gh-129005: Update _pyio.BytesIO to use bytearray.resize on write #129702
- gh-129005: Align FileIO.readall between _pyio and _io #129705
- gh-129005: Move bytearray to use bytes as a buffer #130563
Metadata
Labels: Performance or resource usage, Python modules in the Lib dir, A feature request or enhancement