FEA Add array API support for GaussianMixture #30777

Open. Wants to merge 84 commits into base: main.
Conversation

lesteve (Member) commented Feb 6, 2025

Working on it with @StefanieSenger.

Link to TODO

@lesteve lesteve marked this pull request as draft February 6, 2025 14:26
github-actions bot commented Feb 6, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: de1e575. Link to the linter CI: here

@StefanieSenger StefanieSenger self-requested a review February 14, 2025 09:28
lesteve (Member, Author) commented May 9, 2025

Quick benchmark on a VM with an NVIDIA GeForce RTX 3070:

import torch
import numpy as np
from time import perf_counter
from sklearn import set_config
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

set_config(array_api_dispatch=True)

n_samples, n_features = int(5e4), int(1e3)
n_components = 10
print(f"Generating data with shape {(n_samples, n_features)}...")
X_np, _ = make_blobs(
    n_samples=n_samples, n_features=n_features, centers=n_components, random_state=0
)
print(f"Data size: {X_np.nbytes / 1e6:.1f} MB")

gmm = GaussianMixture(
    n_components=n_components,
    covariance_type="diag",
    init_params="random",
    random_state=0,
)

X_torch_cpu = torch.asarray(X_np)
print("PyTorch CPU GMM")
%timeit gmm.fit(X_torch_cpu)

print("PyTorch GPU GMM")
X_torch_cuda = torch.asarray(X_np, device="cuda")
# .means_[0, 0].item() forces a device synchronization so that the CUDA
# computation is measured faithfully, following the guideline from
# https://github.com/scikit-learn/scikit-learn/pull/27961#issuecomment-2506259528
%timeit gmm.fit(X_torch_cuda).means_[0, 0].item()

print("NumPy GMM")
%timeit gmm.fit(X_np)

Output:

Generating data with shape (50000, 1000)...
Data size: 400.0 MB
PyTorch CPU GMM
1.28 s ± 3.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
PyTorch GPU GMM
221 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
NumPy GMM
1.6 s ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that in this case PyTorch GPU vs NumPy is a 7x speed-up; I have seen other cases where it is more like 3-4x (e.g. n_samples, n_features = int(5e4), int(5e3), n_components = 10).
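For intuition on where the GPU speed-up comes from: with covariance_type="diag", the dominant per-iteration cost is the E-step log-density, which vectorizes into a few large dense matrix products. Here is a hypothetical pure-NumPy sketch of that computation (not scikit-learn's actual code; shapes and names are illustrative):

```python
# Hypothetical sketch of the diagonal-covariance E-step log-density.
# The two matmuls over (n_samples, n_features) dominate the fit time,
# which is why dispatching them to a GPU via the array API pays off.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_components = 1000, 50, 10

X = rng.standard_normal((n_samples, n_features))
means = rng.standard_normal((n_components, n_features))
precisions = rng.uniform(0.5, 2.0, size=(n_components, n_features))  # 1 / variances

# log N(x | mu_k, diag(1 / prec_k)), expanded so it becomes matrix products:
# sum_j prec_kj (x_j - mu_kj)^2
#   = sum_j prec_kj x_j^2 - 2 sum_j prec_kj x_j mu_kj + sum_j prec_kj mu_kj^2
log_det = np.sum(np.log(precisions), axis=1)        # (n_components,)
quad = (
    (X ** 2) @ precisions.T                          # (n_samples, n_components)
    - 2.0 * X @ (means * precisions).T
    + np.sum(means ** 2 * precisions, axis=1)        # broadcasts over samples
)
log_prob = -0.5 * (n_features * np.log(2 * np.pi) + quad) + 0.5 * log_det
print(log_prob.shape)  # (1000, 10)
```

Under array API dispatch the same expression runs unchanged on torch tensors, so the matmuls execute on the GPU.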

lesteve (Member, Author) commented May 9, 2025

I think the confusing error about torch not being defined is an array-api-compat bug with torch 2.7; I opened data-apis/array-api-compat#320.

From the build log, the error was:

x1 = tensor([[4.8761, 0.0000],
        [2.7456, 5.9371]], dtype=torch.float64)
x2 = tensor([[1., 0.],
        [0., 1.]], dtype=torch.float64), kwargs = {}
        x1, x2 = _fix_promotion(x1, x2, only_scalar=False)
        # Torch tries to emulate NumPy 1 solve behavior by using batched 1-D solve
        # whenever
        # 1. x1.ndim - 1 == x2.ndim
        # 2. x1.shape[:-1] == x2.shape
        #
        # See linalg_solve_is_vector_rhs in
        # aten/src/ATen/native/LinearAlgebraUtils.h and
        # TORCH_META_FUNC(_linalg_solve_ex) in
        # aten/src/ATen/native/BatchLinearAlgebra.cpp in the PyTorch source code.
        #
        # The easiest way to work around this is to prepend a size 1 dimension to
        # x2, since x2 is already one dimension less than x1.
        #
        # See https://github.com/pytorch/pytorch/issues/52915
        if x2.ndim != 1 and x1.ndim - 1 == x2.ndim and x1.shape[:-1] == x2.shape:
            x2 = x2[None]
>       return torch.linalg.solve(x1, x2, **kwargs)
E       NameError: name 'torch' is not defined

kwargs     = {}
x1         = tensor([[4.8761, 0.0000],
        [2.7456, 5.9371]], dtype=torch.float64)
x2         = tensor([[1., 0.],
        [0., 1.]], dtype=torch.float64)

@lesteve lesteve marked this pull request as ready for review May 16, 2025 13:44
@lesteve lesteve changed the title Investigate GaussianMixture array API support ENH Add array API support for GaussianMixture May 16, 2025
@lesteve lesteve changed the title ENH Add array API support for GaussianMixture FEA Add array API support for GaussianMixture May 16, 2025
lesteve (Member, Author) commented May 16, 2025

GaussianMixture is ready for a first round of review 🎉 !
