Make more of the "tools" of scikit-learn Array API compatible #26024

Open · @betatim
🚨 🚧 This issue requires a bit of patience and experience to contribute to 🚧 🚨

Please mention this issue when you create a PR, but please don't write "closes #26024" or "fixes #26024".

scikit-learn contains lots of useful tools in addition to its many estimators: for example metrics, pipelines, preprocessing, and model selection. These are useful to, and used by, people who do not necessarily use an estimator from scikit-learn. This is great.

The fact that many users install scikit-learn "just" to use train_test_split is a testament to how useful it is to provide easy-to-use tools that do the right(!) thing, instead of everyone reimplementing them from scratch because it is "easy" and making mistakes along the way.

In this issue I'd like to collect and track work related to making it easier to use all these "tools" from scikit-learn even if you are not using NumPy arrays for your data. In particular, thanks to the Array API standard, it should be "not too much work" to make things usable with data stored in any array that conforms to the standard.
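For background, the dispatch mechanism of the standard can be sketched with toy classes (nothing below is scikit-learn or real Array API code): a conforming array exposes its namespace via the standard's `__array_namespace__` hook, so library code can fetch functions from whatever library the array came from instead of importing numpy directly.

```python
# Hedged sketch with toy stand-ins: a conforming array advertises its
# namespace, and array-agnostic library code asks for it at runtime.

class ToyNamespace:
    """Stand-in for an Array API namespace module (e.g. numpy, torch)."""
    @staticmethod
    def asarray(obj):
        return ToyArray(list(obj))

class ToyArray:
    """Minimal array implementing the standard's discovery hook."""
    def __init__(self, data):
        self.data = data

    def __array_namespace__(self, api_version=None):
        return ToyNamespace

def get_namespace(x):
    # How array-agnostic library code finds the right set of functions.
    return x.__array_namespace__()

a = ToyArray([1.0, 2.0])
xp = get_namespace(a)        # the namespace `a` came from
b = xp.asarray([3.0, 4.0])   # new array created in the *same* namespace
```

This is the pattern that lets one code path serve NumPy arrays, PyTorch tensors, and other conforming array types alike.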

There is work in #25956 and #22554 that adds the basic infrastructure needed to use "array API arrays".

The goal of this issue is to make code like the following work:

```python
>>> from sklearn.preprocessing import MinMaxScaler
>>> from sklearn import config_context
>>> from sklearn.datasets import make_classification
>>> import torch
>>> X_np, y_np = make_classification(random_state=0)
>>> X_torch = torch.asarray(X_np, device="cuda", dtype=torch.float32)
>>> y_torch = torch.asarray(y_np, device="cuda", dtype=torch.float32)

>>> with config_context(array_api_dispatch=True):
...     # For example using MinMaxScaler on PyTorch tensors
...     scale = MinMaxScaler()
...     X_trans = scale.fit_transform(X_torch, y_torch)
...     assert type(X_trans) == type(X_torch)
...     assert X_trans.device == X_torch.device
```

The first step (or maybe part of it) is to check which of these tools already "just work". After that is done we can start making changes (one PR per class/function).
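That inventory step can be sketched as a small probing loop (everything below is a hypothetical helper with toy stand-ins, not scikit-learn code; real usage would pass e.g. PyTorch tensors to actual scikit-learn tools):

```python
# Hedged sketch: call each candidate tool on non-numpy inputs and record
# whether it raises, to build the "does it already just work?" inventory.
def probe(functions, *inputs):
    """Map each function name to 'works' or the exception type it raised."""
    results = {}
    for name, fn in functions.items():
        try:
            fn(*inputs)
            results[name] = "works"
        except Exception as exc:
            results[name] = f"fails: {type(exc).__name__}"
    return results

# Toy stand-ins for scikit-learn tools:
tools = {
    "mean": lambda x: sum(x) / len(x),       # pure-Python, input-agnostic
    "needs_numpy": lambda x: x.mean(),       # lists have no .mean() -> fails
}
report = probe(tools, [1.0, 2.0, 3.0])
# report == {"mean": "works", "needs_numpy": "fails: AttributeError"}
```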

Guidelines for testing

General comment: most of the time, when we add array API support to a function in scikit-learn, we do not touch the existing (numpy-only) tests; this makes sure that the PR does not change the default behavior of scikit-learn on traditional inputs when array API dispatch is not enabled.

In the case of an estimator, it can be enough to add the array_api_support=True estimator tag in a method named __sklearn_tags__. For metric functions, just register them in array_api_metric_checkers in sklearn/metrics/tests/test_common.py to include them in the common tests.
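The estimator-tag pattern looks roughly like this (a hedged sketch: `BaseEstimatorStub` is a made-up stand-in for the real scikit-learn base class, whose `__sklearn_tags__` returns a richer Tags object):

```python
# Hedged sketch of opting an estimator into the common array API tests.
# BaseEstimatorStub is a toy stand-in, not scikit-learn's BaseEstimator.
from types import SimpleNamespace

class BaseEstimatorStub:
    def __sklearn_tags__(self):
        # Stand-in for the real Tags object, which defaults to no support.
        return SimpleNamespace(array_api_support=False)

class MyScaler(BaseEstimatorStub):
    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        tags.array_api_support = True  # opt in to the common array API tests
        return tags
```

The key point is to start from `super().__sklearn_tags__()` and flip only the relevant flag, so other tag defaults are preserved.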

For other kinds of functions not covered by the existing common tests, or when the array API support depends heavily on non-default parameter values, it might be necessary to add one or more new test functions to the related module-level test file. The general testing scheme is the following:

  • generate some random test data with numpy or sklearn.datasets.make_*;
  • call the function once on the numpy inputs without enabling array API dispatch;
  • convert the inputs to the namespace / device combination passed as a parameter to the test;
  • call the function with array API dispatching enabled (inside a with sklearn.config_context(array_api_dispatch=True) block);
  • check that the results are in the same namespace and on the same device as the inputs;
  • convert the output back to a numpy array using _convert_to_numpy;
  • compare the original/reference numpy results with the xp computation results converted back to numpy, using assert_allclose or similar.
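The steps above can be sketched end to end with toy stand-ins (plain lists and tuples play the roles of numpy arrays and arrays from another namespace, and `scale_to_unit` is a made-up function under test; a real test would use numpy, sklearn.config_context, and _convert_to_numpy):

```python
# Hedged, dependency-free sketch of the testing recipe described above.
import math

def scale_to_unit(values):
    """Made-up function under test: rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# 1. generate reference data and compute the reference ("numpy") result
data = [1.0, 3.0, 5.0]
reference = scale_to_unit(data)

# 2. convert the inputs to the alternative namespace (toy: tuple)
data_xp = tuple(data)

# 3. call again with "dispatch enabled"; the result should live in the
#    same namespace as the inputs
result_xp = tuple(scale_to_unit(data_xp))
assert type(result_xp) is type(data_xp)

# 4. convert the output back (toy _convert_to_numpy) and compare it
#    against the reference, up to floating-point tolerance
result_back = list(result_xp)
assert all(math.isclose(a, b) for a, b in zip(reference, result_back))
```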

Those tests should have array_api somewhere in their name to make sure that we can run all the array API compliance tests with a keyword search on the pytest command line, e.g.:

```
pytest -k array_api sklearn/some/subpackage
```

In particular, for cost reasons, our CUDA GPU CI only runs pytest -k array_api sklearn. It is therefore very important to respect this naming convention; otherwise we will not test everything we are supposed to test on CUDA.

More generally, look at merged array API pull requests to see how testing is typically handled.

Metadata

Labels: API, Array API, Meta-issue (general issue associated to an identified list of tasks)
