🚨 🚧 This issue requires a bit of patience and experience to contribute to 🚧 🚨
- Original issue introducing array API in scikit-learn: Path for Adopting the Array API spec #22352
- array API official doc/spec: https://data-apis.org/array-api/
- scikit-learn doc: https://scikit-learn.org/dev/modules/array_api.html
Please mention this issue when you create a PR, but please don't write "closes #26024" or "fixes #26024", so that this tracking issue is not closed automatically.
scikit-learn contains lots of useful tools in addition to its many estimators: for example metrics, pipelines, pre-processing and model selection. These are useful to, and used by, people who do not necessarily use an estimator from scikit-learn. This is great.
The fact that many users install scikit-learn "just" to use `train_test_split` is a testament to how useful it is to provide easy-to-use tools that do the right(!) thing, instead of everyone implementing them from scratch because it is "easy" and making mistakes along the way.
In this issue I'd like to collect and track work related to making it easier to use all these "tools" from scikit-learn even if you are not using NumPy arrays for your data. In particular, thanks to the Array API standard, it should be "not too much work" to make them usable with any array that conforms to the standard.
There is work in #25956 and #22554 that adds the basic infrastructure needed to use "array API arrays".
The goal of this issue is to make code like the following work:
```python
>>> from sklearn.preprocessing import MinMaxScaler
>>> from sklearn import config_context
>>> from sklearn.datasets import make_classification
>>> import torch

>>> X_np, y_np = make_classification(random_state=0)
>>> X_torch = torch.asarray(X_np, device="cuda", dtype=torch.float32)
>>> y_torch = torch.asarray(y_np, device="cuda", dtype=torch.float32)

>>> with config_context(array_api_dispatch=True):
...     # For example using MinMaxScaler on PyTorch tensors
...     scale = MinMaxScaler()
...     X_trans = scale.fit_transform(X_torch, y_torch)
...     assert type(X_trans) == type(X_torch)
...     assert X_trans.device == X_torch.device
```
The first step (or maybe part of the first) is to check which of these tools already "just work". After that is done, we can start the work of making changes, one PR per class/function.
## Guidelines for testing
General comment: most of the time, when we add array API support to a function in scikit-learn, we do not touch the existing (NumPy-only) tests, so as to make sure that the PR does not change the default behavior of scikit-learn on traditional inputs when array API dispatch is not enabled.
In the case of an estimator, it can be enough to add the `array_api_support=True` estimator tag in a method named `__sklearn_tags__`. For metric functions, just register them in `array_api_metric_checkers` in `sklearn/metrics/tests/test_common.py` to include them in the common tests.
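For illustration, here is a minimal sketch of the estimator side, assuming the estimator already behaves correctly on array API inputs (the tags object is obtained from `super().__sklearn_tags__()`; check recent merged PRs for the exact idiom):

```python
from sklearn.base import BaseEstimator


class MyEstimator(BaseEstimator):
    def __sklearn_tags__(self):
        # Inherit the default tags and only flip the array API flag on.
        tags = super().__sklearn_tags__()
        tags.array_api_support = True
        return tags
```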
For other kinds of functions not covered by existing common tests, or when the array API support depends heavily on non-default parameter values, it might be required to add one or more new test functions to the related module-level test file. The general testing scheme is the following (a sketch follows the list):
- generate some random test data with numpy or `sklearn.datasets.make_*`;
- call the function once on the numpy inputs without enabling array API dispatch;
- convert the inputs to the namespace / device combo passed as a parameter to the test;
- call the function with array API dispatching enabled (under a `with sklearn.config_context(array_api_dispatch=True)` block);
- check that the results are in the same namespace and on the same device as the inputs;
- convert the output back to a numpy array using `_convert_to_numpy`;
- compare the original / reference numpy results to the `xp` computation results converted back to numpy, using `assert_allclose` or similar.
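Putting the steps together, here is a minimal sketch of such a test. `some_function` is a hypothetical placeholder for the function under test; the helpers `yield_namespace_device_dtype_combinations`, `_array_api_for_tests` and `_convert_to_numpy` are the ones used by existing array API tests, but check `sklearn.utils._array_api` and `sklearn.utils._testing` for their current locations and signatures:

```python
import numpy
import pytest

from sklearn import config_context
from sklearn.utils._array_api import (
    _convert_to_numpy,
    yield_namespace_device_dtype_combinations,
)
from sklearn.utils._testing import _array_api_for_tests, assert_allclose


@pytest.mark.parametrize(
    "array_namespace, device, dtype_name",
    yield_namespace_device_dtype_combinations(),
)
def test_some_function_array_api(array_namespace, device, dtype_name):
    # Skips when the namespace / device combo is not available in the test env.
    xp = _array_api_for_tests(array_namespace, device)

    # Reference run: numpy inputs, array API dispatch disabled.
    X_np = numpy.random.RandomState(0).normal(size=(20, 5)).astype(dtype_name)
    expected = some_function(X_np)  # hypothetical function under test

    # Same computation on xp arrays with array API dispatch enabled.
    X_xp = xp.asarray(X_np, device=device)
    with config_context(array_api_dispatch=True):
        result = some_function(X_xp)

    # The result should stay in the input namespace and on the input device...
    assert type(result) == type(X_xp)
    assert result.device == X_xp.device

    # ...and match the numpy reference once converted back.
    assert_allclose(_convert_to_numpy(result, xp=xp), expected)
```

Note the `array_api` substring in the test name; this matters for the naming convention described below.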
Those tests should have `array_api` somewhere in their name to make sure that we can run all the array API compliance tests with a keyword search on the pytest command line, e.g.:

```
pytest -k array_api sklearn/some/subpackage
```

In particular, for cost reasons, our CUDA GPU CI only runs `pytest -k array_api sklearn`. So it's very important to respect this naming convention, otherwise we will not test everything we are supposed to test on CUDA.
More generally, look at merged array API pull requests to see how testing is typically handled.