🚨 🚧 This issue requires a bit of patience and experience to contribute to 🚧 🚨
- Original issue introducing array API in scikit-learn: Path for Adopting the Array API spec #22352
- array API official doc/spec: https://data-apis.org/array-api/
- scikit-learn doc: https://scikit-learn.org/dev/modules/array_api.html
Please mention this issue when you create a PR, but please don't write "closes #26024" or "fixes #26024", so that this tracking issue is not closed automatically.
scikit-learn contains lots of useful tools in addition to its many estimators: for example metrics, pipelines, pre-processing and model selection. These are useful to, and used by, people who do not necessarily use an estimator from scikit-learn. This is great.
The fact that many users install scikit-learn "just" to use `train_test_split` is a testament to how useful it is to provide easy-to-use tools that do the right(!) thing, instead of everyone implementing them from scratch because it is "easy" and making mistakes along the way.
In this issue I'd like to collect and track work related to making it easier to use all these "tools" from scikit-learn even if you are not using NumPy arrays for your data. In particular, thanks to the Array API standard, it should be "not too much work" to make them usable with any array that conforms to the standard.
There is work in #25956 and #22554 that adds the basic infrastructure needed to use "array API arrays".
The goal of this issue is to make code like the following work:
```python
>>> from sklearn.preprocessing import MinMaxScaler
>>> from sklearn import config_context
>>> from sklearn.datasets import make_classification
>>> import torch

>>> X_np, y_np = make_classification(random_state=0)
>>> X_torch = torch.asarray(X_np, device="cuda", dtype=torch.float32)
>>> y_torch = torch.asarray(y_np, device="cuda", dtype=torch.float32)

>>> with config_context(array_api_dispatch=True):
...     # For example using MinMaxScaler on PyTorch tensors
...     scale = MinMaxScaler()
...     X_trans = scale.fit_transform(X_torch, y_torch)
...     assert type(X_trans) == type(X_torch)
...     assert X_trans.device == X_torch.device
```
The first step (or maybe part of the first) is to check which of these tools already "just work". After that is done, we can start the work of making changes, one PR per class/function.
## Guidelines for testing
General comment: most of the time, when we add array API support to a function in scikit-learn, we do not touch the existing (NumPy-only) tests, so as to make sure that the PR does not change the default behavior of scikit-learn on traditional inputs when array API dispatch is not enabled.
In the case of an estimator, it can be enough to add the `array_api_support=True` estimator tag in a method named `__sklearn_tags__`. For metric functions, just register them in `array_api_metric_checkers` in `sklearn/metrics/tests/test_common.py` to include them in the common tests.
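For illustration, here is a minimal sketch of the estimator side, assuming the estimator already behaves correctly on array API inputs (the tags object is obtained from `super().__sklearn_tags__()`; check recent merged PRs for the exact idiom):

```python
from sklearn.base import BaseEstimator


class MyEstimator(BaseEstimator):
    def __sklearn_tags__(self):
        # Inherit the default tags and only flip the array API flag on.
        tags = super().__sklearn_tags__()
        tags.array_api_support = True
        return tags
```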
For other kinds of functions not covered by existing common tests, or when the array API support depends heavily on non-default parameter values, it might be required to add one or more new test functions to the related module-level test file. The general testing scheme is the following (a sketch follows the list):
- generate some random test data with numpy or `sklearn.datasets.make_*`;
- call the function once on the numpy inputs without enabling array API dispatch;
- convert the inputs to the namespace / device combo passed as a parameter to the test;
- call the function with array API dispatching enabled (under a `with sklearn.config_context(array_api_dispatch=True)` block);
- check that the results are in the same namespace and on the same device as the inputs;
- convert the output back to a numpy array using `_convert_to_numpy`;
- compare the original / reference numpy results to the `xp` computation results converted back to numpy, using `assert_allclose` or similar.
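Putting the steps together, here is a minimal sketch of such a test. `some_function` is a hypothetical placeholder for the function under test; the helpers `yield_namespace_device_dtype_combinations`, `_array_api_for_tests` and `_convert_to_numpy` are the ones used by existing array API tests, but check `sklearn.utils._array_api` and `sklearn.utils._testing` for their current locations and signatures:

```python
import numpy
import pytest

from sklearn import config_context
from sklearn.utils._array_api import (
    _convert_to_numpy,
    yield_namespace_device_dtype_combinations,
)
from sklearn.utils._testing import _array_api_for_tests, assert_allclose


@pytest.mark.parametrize(
    "array_namespace, device, dtype_name",
    yield_namespace_device_dtype_combinations(),
)
def test_some_function_array_api(array_namespace, device, dtype_name):
    # Skips when the namespace / device combo is not available in the test env.
    xp = _array_api_for_tests(array_namespace, device)

    # Reference run: numpy inputs, array API dispatch disabled.
    X_np = numpy.random.RandomState(0).normal(size=(20, 5)).astype(dtype_name)
    expected = some_function(X_np)  # hypothetical function under test

    # Same computation on xp arrays with array API dispatch enabled.
    X_xp = xp.asarray(X_np, device=device)
    with config_context(array_api_dispatch=True):
        result = some_function(X_xp)

    # The result should stay in the input namespace and on the input device...
    assert type(result) == type(X_xp)
    assert result.device == X_xp.device

    # ...and match the numpy reference once converted back.
    assert_allclose(_convert_to_numpy(result, xp=xp), expected)
```

Note the `array_api` substring in the test name; this matters for the naming convention described below.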
Those tests should have `array_api` somewhere in their name to make sure that we can run all the array API compliance tests with a keyword search on the pytest command line, e.g.:

```
pytest -k array_api sklearn/some/subpackage
```

In particular, for cost reasons, our CUDA GPU CI only runs `pytest -k array_api sklearn`. So it's very important to respect this naming convention, otherwise we will not test everything we are supposed to test on CUDA.
More generally, look at merged array API pull requests to see how testing is typically handled.