Description
Context
Using cnp.ndarray
is not encouragged and using typed memoryviews must be preferred. More precisely const
-qualified memoryview must be used when the buffer of the NumPy array is readonly (some Cython interfaces previously have had support for readonly data using cnp.ndarray
as a workaround).
On a side-note, better uniform canonical support of readonly buffers has notable value: some frameworks like Ray or libraries like joblib make use of memory mapping which change the writability of buffers. Yet it might crash in some context if scikit-learn's implementations have no clear support for readonly data and this is fatal to users' workflows.
In overall, we have a poor overview for such support but we must. Efforts in this meta-issue will improve the overview for this support.
As an example, the following pattern:
cdef cnp.ndarray[T, ndim=1, mode="c"] readonly_data
must be changed to:
cdef const T[::1] readonly_data
where T
is a concrete type or a fused type.
See occurences of cnp.ndarray
in the codebase.
💡 Do note, as mentioned by @thomasjpfan in #25484 (comment), that not all occurences of cnp.ndarray
must be removed.
💡 People interested starting working with Cython might be interested in reading Cython Best Practices, Conventions and Knowledge section of scikit-learn documentation.
Proposed solution: treat one file per PR
- Only perform changes necessary change to use memoryviews
- If the file is big, you can have several PR to treat it
- Create PRs for each warning with the following title:
where:
MAINT Use memoryviews in `source_file.ext`
source_file.ext
is the file being treated
The following command allows to see which files needs to be treated:
ag "cnp.ndarray" -c
As of b69abf5, the following files need to be treated:
- sklearn/metrics/cluster/_expected_mutual_info_fast.pyx:4
- sklearn/metrics/_dist_metrics.pyx.tp:6
- sklearn/neighbors/_quad_tree.pyx:1
- sklearn/preprocessing/_csr_polynomial_expansion.pyx:9
- sklearn/utils/_seq_dataset.pyx.tp:14
- sklearn/utils/_fast_dict.pyx:4
- sklearn/utils/_seq_dataset.pxd.tp:10
- sklearn/utils/arrayfuncs.pyx:2
- sklearn/utils/sparsefuncs_fast.pyx:50
- sklearn/linear_model/_cd_fast.pyx:15
- sklearn/tree/_tree.pxd:10
- sklearn/tree/_tree.pyx:38
- sklearn/neighbors/_binary_tree.pxi:1
- sklearn/tree/_utils.pyx:1
- sklearn/tree/_utils.pxd:1
- sklearn/tree/_criterion.pyx:3
- sklearn/utils/_random.pyx:4
- sklearn/_isotonic.pyx:6
Miscellaneous
#24875 is also linked to this PR regarding using the newest NumPy C API.