Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

MAINT Improve scikit-learn reliability by replacing cnp.ndarrays with typed memoryviews #25484

Copy link
Copy link
Closed
@jjerphan

Description

@jjerphan
Issue body actions

Context

Using cnp.ndarray is not encouragged and using typed memoryviews must be preferred. More precisely const-qualified memoryview must be used when the buffer of the NumPy array is readonly (some Cython interfaces previously have had support for readonly data using cnp.ndarray as a workaround).

On a side-note, better uniform canonical support of readonly buffers has notable value: some frameworks like Ray or libraries like joblib make use of memory mapping which change the writability of buffers. Yet it might crash in some context if scikit-learn's implementations have no clear support for readonly data and this is fatal to users' workflows.

In overall, we have a poor overview for such support but we must. Efforts in this meta-issue will improve the overview for this support.


As an example, the following pattern:

cdef cnp.ndarray[T, ndim=1, mode="c"] readonly_data 

must be changed to:

cdef const T[::1] readonly_data

where T is a concrete type or a fused type.

See occurences of cnp.ndarray in the codebase.

💡 Do note, as mentioned by @thomasjpfan in #25484 (comment), that not all occurences of cnp.ndarray must be removed.

💡 People interested starting working with Cython might be interested in reading Cython Best Practices, Conventions and Knowledge section of scikit-learn documentation.

Proposed solution: treat one file per PR

  • Only perform changes necessary change to use memoryviews
  • If the file is big, you can have several PR to treat it
  • Create PRs for each warning with the following title:
    MAINT Use memoryviews in `source_file.ext`
    
    where:
    • source_file.ext is the file being treated

The following command allows to see which files needs to be treated:

ag "cnp.ndarray" -c

As of b69abf5, the following files need to be treated:

  • sklearn/metrics/cluster/_expected_mutual_info_fast.pyx:4
  • sklearn/metrics/_dist_metrics.pyx.tp:6
  • sklearn/neighbors/_quad_tree.pyx:1
  • sklearn/preprocessing/_csr_polynomial_expansion.pyx:9
  • sklearn/utils/_seq_dataset.pyx.tp:14
  • sklearn/utils/_fast_dict.pyx:4
  • sklearn/utils/_seq_dataset.pxd.tp:10
  • sklearn/utils/arrayfuncs.pyx:2
  • sklearn/utils/sparsefuncs_fast.pyx:50
  • sklearn/linear_model/_cd_fast.pyx:15
  • sklearn/tree/_tree.pxd:10
  • sklearn/tree/_tree.pyx:38
  • sklearn/neighbors/_binary_tree.pxi:1
  • sklearn/tree/_utils.pyx:1
  • sklearn/tree/_utils.pxd:1
  • sklearn/tree/_criterion.pyx:3
  • sklearn/utils/_random.pyx:4
  • sklearn/_isotonic.pyx:6

Miscellaneous

#24875 is also linked to this PR regarding using the newest NumPy C API.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      Morty Proxy This is a proxified and sanitized view of the page, visit original site.