Description
The following test case will fail on main
:
import numpy as np
# strings long enough to live in the arena
string_list = ["abc", "def", "ghi" * 10, "A¢☃€ 😊" * 100, "Abc" * 1000, "DEF"]
arr = np.array(string_list, dtype=np.dtypes.StringDType())
arr_rev = np.array(string_list[::-1], dtype=np.dtypes.StringDType())
arr_view = np.array(arr, copy=None, dtype=arr_rev.dtype)
np.testing.assert_array_equal(arr, arr_view)
This is happening because of the code path here:
numpy/numpy/_core/src/multiarray/multiarraymodule.c
Lines 1612 to 1636 in 6d4a0f7
In the test case, oldtype != dtype
, so we enter the path that calls NewFromDescrAndBase
with the user-provided dtype. The problem is the user-provided dtype doesn't know anything about the arena that lives on the original dtype, so blindly accepting the user-provided dtype and passing it to NewFromDescrAndBase
corrupts the arena.
I think a way to fix this would be to call finalize_descr
on the user-provided DType to get a fresh descriptor if its already owned by the array, and then force the user-provided dtype to share allocators with the original dtype. We could also copy the arena, but that goes against the user desire to not have a copy.
Another way to handle this is to make stringdtype always copy in this situation, and error if someone passes copy=False
.
Originally posted by @ngoldbaum in #26082 (comment)