Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

BUG: Allow empty memmaps in most situations #27723

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Nov 18, 2024
4 changes: 4 additions & 0 deletions 4 doc/release/upcoming_changes/27723.improvement.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
* Improved support for empty `memmap`. Previously an empty `memmap` would fail
unless a non-zero ``offset`` was set. Now a zero-size `memmap` is supported
even if ``offset=0``. To achieve this, if a `memmap` is mapped to an empty
file that file is padded with a single byte.
17 changes: 13 additions & 4 deletions 17 numpy/_core/memmap.py
Original file line number Diff line number Diff line change
Expand Up @@ -262,10 +262,14 @@ def __new__(subtype, filename, dtype=uint8, mode='r+', offset=0,

bytes = int(offset + size*_dbytes)

if mode in ('w+', 'r+') and flen < bytes:
fid.seek(bytes - 1, 0)
fid.write(b'\0')
fid.flush()
if mode in ('w+', 'r+'):
# gh-27723
# if bytes == 0, we write out 1 byte to allow empty memmap.
bytes = max(bytes, 1)
if flen < bytes:
fid.seek(bytes - 1, 0)
fid.write(b'\0')
fid.flush()

if mode == 'c':
acc = mmap.ACCESS_COPY
Expand All @@ -276,6 +280,11 @@ def __new__(subtype, filename, dtype=uint8, mode='r+', offset=0,

start = offset - offset % mmap.ALLOCATIONGRANULARITY
bytes -= start
# bytes == 0 is problematic as in mmap length=0 maps the full file.
# See PR gh-27723 for a more detailed explanation.
if bytes == 0 and start > 0:
bytes += mmap.ALLOCATIONGRANULARITY
start -= mmap.ALLOCATIONGRANULARITY
Copy link
Member

@seberg seberg Nov 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs more explanation, I think. To me it also looks like this will simply fail if you pass offset=0. Even if I am missing something, that means it needs testing.
I.e. it looks a bit like it fixes your specific issue, but may not improve things generally.

It seems like a 0 size memmap is quite fraught to begin with, so I am not sure what is right. Setting the bytes to be at least 1 may make sense, but on linux may also fail if the size is actually 0 it seems (so at the very least it would also lead to a regressions in some places).
Of course sure, if an offset is used, that offset can be adjusted so that we don't need to "read" a byte beyond the end.

EDIT: Ack, sorry I guess the start > 0 prevents things, but it still would be nice to be clear that this only works with offsets. And it would be good to think about what to do about the other case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I could have been more clear.

So just to clarify, on Linux (haven't tested on Windows), in the current numpy implementation:

  • Works:
    • 0 size memmap with offset > 0 and offset % mmap.ALLOCATIONGRANULARITY != 0
  • Fails:
    • 0 size memmap with offset = 0
    • 0 size memmap with offset > 0 and offset % mmap.ALLOCATIONGRANULARITY == 0

With the suggested fix:

  • Works:
    • 0 size memmap with offset > 0
  • Fails:
    • 0 size memmap with offset = 0

I agree that a 0 size memmap is quite fraught, but it would be nice if the behaviour of a 0 size memmap was independent of whether the offset was a multiple of the allocation granularity or not.

An alternative solution would be to not allow 0 size memmaps at all.

I do see some logic as to why it makes sense to allow a 0 size memmap with offset > 0 as this offset typically implies that the file contains some sort of metadata besides for the 0 size array. For a 0 size memmap with offset = 0 the file would be completely empty and thus there is nothing to map at all (mmap.mmap with length=0 even fails on Windows for an empty file (see here)).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeay, thanks. I have two asks:

  1. Please add a test for arr.offset being correct, even though you are messing around with the start. (I suspect it already is.)
  2. Can you add (ideally not too long) code comment about bytes=0 being problematic but we can avoid it easily if offset!=0. It can reference to this PR gh-27723.

But, maybe we can rethink the bytes=0 part as well? It seems to me that it is impossible to memory map an empty file on all systems. In other words, mapping 0 bytes works on linux probably, but only because it maps the full file and then reads nothing.

I think that means that we can:

  • Use bytes=1 if offset=0
  • Use your trick here if offset > 0 in order to not grow the file unnecessarily. Although, I don't think it would matter too much, that seems pretty safe to me.

Or am I missing some path where if bytes == 0 and offset == 0: bytes = 1 might backfire?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a test for arr.offset == offset, please confirm that this is what you had in mind as the test seems a bit trivial to me (but it could very well be that I misunderstood what you were asking).

I've also added a code comment as asked.

I also had a think about the 0 size memmap with offset=0 and your suggestion to use bytes=1 makes sense. I've implemented it in such a way that it requires mode in ('w+', 'r+') since we will be writing a byte to the file. There was an existing test for an empty memmap which I have adjusted accordingly.

Have a look at the changes and let me know what you think.

array_offset = offset - start
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)

Expand Down
15 changes: 12 additions & 3 deletions 15 numpy/_core/tests/test_memmap.py
Original file line number Diff line number Diff line change
Expand Up @@ -199,6 +199,13 @@ def test_mmap_offset_greater_than_allocation_granularity(self):
fp = memmap(self.tmpfp, shape=size, mode='w+', offset=offset)
assert_(fp.offset == offset)

def test_empty_array_with_offset_multiple_of_allocation_granularity(self):
self.tmpfp.write(b'a'*mmap.ALLOCATIONGRANULARITY)
size = 0
offset = mmap.ALLOCATIONGRANULARITY
fp = memmap(self.tmpfp, shape=size, mode='w+', offset=offset)
assert_equal(fp.offset, offset)

def test_no_shape(self):
self.tmpfp.write(b'a'*16)
mm = memmap(self.tmpfp, dtype='float64')
Expand All @@ -207,12 +214,14 @@ def test_no_shape(self):
def test_empty_array(self):
# gh-12653
with pytest.raises(ValueError, match='empty file'):
memmap(self.tmpfp, shape=(0,4), mode='w+')
memmap(self.tmpfp, shape=(0, 4), mode='r')

self.tmpfp.write(b'\0')
# gh-27723
# empty memmap works with mode in ('w+','r+')
memmap(self.tmpfp, shape=(0, 4), mode='w+')

# ok now the file is not empty
memmap(self.tmpfp, shape=(0,4), mode='w+')
memmap(self.tmpfp, shape=(0, 4), mode='w+')

def test_shape_type(self):
memmap(self.tmpfp, shape=3, mode='w+')
Expand Down
Loading
Morty Proxy This is a proxified and sanitized view of the page, visit original site.