Performance Regression in scikit-learn 1.5.0: Execution Time for ColumnTransformer Scales Quadratically with the Number of Transformers when n_jobs > 1 #29229

Closed
@0xbe7a

Description


Describe the bug

After upgrading to scikit-learn 1.5.0, we observed a significant performance regression in ColumnTransformer when using n_jobs > 1. The issue appears to be inter-process serialization overhead, which scales quadratically with the number of transformers and is particularly noticeable when processing Series holding Python objects such as lists or strings.

Below are benchmarks for running a pipeline with a varying number of columns (n_col) and n_jobs = {1, 2}, on scikit-learn versions 1.4.2 and 1.5.0:

sklearn version: 1.4.2 and n_jobs = 1
5: Per col: 0.019380s / total 0.10 s
10: Per col: 0.018936s / total 0.19 s
15: Per col: 0.019192s / total 0.29 s
20: Per col: 0.019223s / total 0.38 s
25: Per col: 0.019718s / total 0.49 s
30: Per col: 0.019141s / total 0.57 s
35: Per col: 0.019265s / total 0.67 s
40: Per col: 0.019065s / total 0.76 s
45: Per col: 0.019170s / total 0.86 s

sklearn version: 1.5.0 and n_jobs = 1
5: Per col: 0.025390s / total 0.13 s
10: Per col: 0.020016s / total 0.20 s
15: Per col: 0.021841s / total 0.33 s
20: Per col: 0.020817s / total 0.42 s
25: Per col: 0.021067s / total 0.53 s
30: Per col: 0.021997s / total 0.66 s
35: Per col: 0.021080s / total 0.74 s
40: Per col: 0.020629s / total 0.83 s
45: Per col: 0.020796s / total 0.94 s

sklearn version: 1.4.2 and n_jobs = 2
5: Per col: 0.243821s / total 1.22 s
10: Per col: 0.028045s / total 0.28 s
15: Per col: 0.026836s / total 0.40 s
20: Per col: 0.028144s / total 0.56 s
25: Per col: 0.026041s / total 0.65 s
30: Per col: 0.025631s / total 0.77 s
35: Per col: 0.025608s / total 0.90 s
40: Per col: 0.025547s / total 1.02 s
45: Per col: 0.025084s / total 1.13 s

sklearn version: 1.5.0 and n_jobs = 2
5: Per col: 0.119883s / total 0.60 s
10: Per col: 0.226338s / total 2.26 s
15: Per col: 0.399880s / total 6.00 s
20: Per col: 0.513848s / total 10.28 s
25: Per col: 0.673867s / total 16.85 s
30: Per col: 0.923152s / total 27.69 s
35: Per col: 1.080279s / total 37.81 s
40: Per col: 1.280597s / total 51.22 s
45: Per col: 1.468622s / total 66.09 s

From the data, the per-column / per-transformer processing time increases with the total number of transformers, contrary to the expectation of constant per-transformer cost. I bisected the regression to PR #28822, which appears to cause the entire DataFrame to be sent to each worker rather than only the columns selected by that transformer.
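To see why shipping the full DataFrame makes the total cost quadratic: each of the n single-column transformer tasks then serializes all n columns, so the total transfer grows as n² columns rather than n. A minimal arithmetic sketch of this (not scikit-learn code; the function name is purely illustrative):

```python
def columns_serialized(n_transformers: int, full_frame: bool) -> int:
    # Each transformer task selects exactly one column.
    # If the whole DataFrame is shipped to every task, each task carries
    # n_transformers columns; otherwise only its own single column.
    per_task = n_transformers if full_frame else 1
    return n_transformers * per_task

# Shipping only the selected columns: linear growth.
linear = [columns_serialized(n, full_frame=False) for n in (5, 10, 20, 40)]
# Shipping the full DataFrame to every task: quadratic growth.
quadratic = [columns_serialized(n, full_frame=True) for n in (5, 10, 20, 40)]
print(linear)     # [5, 10, 20, 40]
print(quadratic)  # [25, 100, 400, 1600]
```

This matches the benchmark shape above: per-column time is roughly flat on 1.4.2 but grows roughly linearly with n_col on 1.5.0 with n_jobs = 2, making the total quadratic.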

Steps/Code to Reproduce

import pandas as pd
import random
import time
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def list_sum(col):
    # Sum each list in an object-dtype Series; the work stays in Python.
    return col.map(sum)

def profile(n_col: int):
    df = pd.DataFrame({
        f"{i}": [
            [random.random() for _ in range(random.randint(1, 5))]
            for _ in range(100_000)
        ] for i in range(n_col)
    })

    pipeline = Pipeline([
        ("transformer", ColumnTransformer([
            (f"{i}", FunctionTransformer(list_sum), [f"{i}"])
            for i in range(n_col)
        ], n_jobs=2))
    ])

    start = time.time()
    with joblib.parallel_backend(backend="loky", mmap_mode="r+"):
        pipeline.fit_transform(df)
    return time.time() - start

from sklearn import __version__ as sklearn_version
print(f"sklearn version: {sklearn_version}")

for n in range(5, 50, 5):
    run_time = profile(n)
    print(f"{n}: Per col: {(run_time / n):.4f}s / total {run_time:.2f} s")

Expected Results

The execution time scales linearly with the number of transformers.

Actual Results

The execution time scales quadratically with the number of transformers.

Versions

System:
    python: 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:35:20) [Clang 16.0.6 ]
executable: /Users/belastoyan/micromamba/envs/sk-issue/bin/python
   machine: macOS-14.4-arm64-arm-64bit

Python dependencies:
      sklearn: 1.5.0
          pip: 24.0
   setuptools: 70.0.0
        numpy: 1.26.4
        scipy: 1.13.1
       Cython: None
       pandas: 2.2.2
   matplotlib: None
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /Users/belastoyan/micromamba/envs/sk-issue/lib/libopenblas.0.dylib
        version: 0.3.27
threading_layer: openmp
   architecture: VORTEX

       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: libomp
       filepath: /Users/belastoyan/micromamba/envs/sk-issue/lib/libomp.dylib
        version: None
