Performance Regression in scikit-learn 1.5.0: Execution Time for ColumnTransformer Scales Quadratically with the Number of Transformers when n_jobs > 1 #29229

Describe the bug

After upgrading to scikit-learn 1.5.0, we observed a significant performance regression in ColumnTransformer when using n_jobs > 1. The issue appears to be related to I/O overhead, which grows quadratically with the number of transformers and is particularly noticeable when processing Series that hold Python objects such as lists or strings.

Below are benchmarks for a pipeline with a varying number of columns (n_col, one transformer per column), run with n_jobs = 1 and n_jobs = 2 on scikit-learn 1.4.2 and 1.5.0:

sklearn version: 1.4.2 and n_jobs = 1
5: Per col: 0.019380s / total 0.10 s
10: Per col: 0.018936s / total 0.19 s
15: Per col: 0.019192s / total 0.29 s
20: Per col: 0.019223s / total 0.38 s
25: Per col: 0.019718s / total 0.49 s
30: Per col: 0.019141s / total 0.57 s
35: Per col: 0.019265s / total 0.67 s
40: Per col: 0.019065s / total 0.76 s
45: Per col: 0.019170s / total 0.86 s

sklearn version: 1.5.0 and n_jobs = 1
5: Per col: 0.025390s / total 0.13 s
10: Per col: 0.020016s / total 0.20 s
15: Per col: 0.021841s / total 0.33 s
20: Per col: 0.020817s / total 0.42 s
25: Per col: 0.021067s / total 0.53 s
30: Per col: 0.021997s / total 0.66 s
35: Per col: 0.021080s / total 0.74 s
40: Per col: 0.020629s / total 0.83 s
45: Per col: 0.020796s / total 0.94 s

sklearn version: 1.4.2 and n_jobs = 2
5: Per col: 0.243821s / total 1.22 s
10: Per col: 0.028045s / total 0.28 s
15: Per col: 0.026836s / total 0.40 s
20: Per col: 0.028144s / total 0.56 s
25: Per col: 0.026041s / total 0.65 s
30: Per col: 0.025631s / total 0.77 s
35: Per col: 0.025608s / total 0.90 s
40: Per col: 0.025547s / total 1.02 s
45: Per col: 0.025084s / total 1.13 s

sklearn version: 1.5.0 and n_jobs = 2
5: Per col: 0.119883s / total 0.60 s
10: Per col: 0.226338s / total 2.26 s
15: Per col: 0.399880s / total 6.00 s
20: Per col: 0.513848s / total 10.28 s
25: Per col: 0.673867s / total 16.85 s
30: Per col: 0.923152s / total 27.69 s
35: Per col: 1.080279s / total 37.81 s
40: Per col: 1.280597s / total 51.22 s
45: Per col: 1.468622s / total 66.09 s

The data show that with scikit-learn 1.5.0 and n_jobs = 2, the processing time per column (that is, per transformer) increases with the total number of transformers, whereas it should stay roughly constant. I bisected this issue to PR #28822, which seems to cause the entire DataFrame to be sent to each worker rather than just the columns selected by the transformer.
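The suspected mechanism can be illustrated without scikit-learn or joblib. The sketch below is only a model, not the library's code: a plain dict of lists stands in for the DataFrame, `pickle` stands in for whatever serialization the worker dispatch uses, and `make_table` and `payload_bytes` are hypothetical helper names. It compares the total bytes serialized when each per-column task receives only its own column with the total when each task receives the whole table.

```python
import pickle
import random


def make_table(n_col, n_rows=1_000):
    """Build a dict of columns; each cell is a short list of floats."""
    rng = random.Random(0)
    return {
        str(i): [
            [rng.random() for _ in range(rng.randint(1, 5))]
            for _ in range(n_rows)
        ]
        for i in range(n_col)
    }


def payload_bytes(table, send_whole_table):
    """Total bytes pickled when dispatching one task per column."""
    total = 0
    for name in table:
        # Either ship everything to the task, or only the column it needs.
        payload = table if send_whole_table else {name: table[name]}
        total += len(pickle.dumps(payload))
    return total


for n_col in (5, 10, 20):
    table = make_table(n_col)
    selected = payload_bytes(table, send_whole_table=False)
    whole = payload_bytes(table, send_whole_table=True)
    print(f"{n_col} columns: selected={selected} B, whole={whole} B, "
          f"ratio={whole / selected:.1f}")
```

In this model the ratio grows roughly in step with the number of columns: n_col tasks each carrying n_col columns gives a total transfer proportional to n_col squared, which would be consistent with the timings above if serialization dominates the run time.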

Steps/Code to Reproduce

import pandas as pd
import random
import time
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def list_sum(col):
    # col is a one-column DataFrame; DataFrame.map (pandas >= 2.1) applies sum to each cell.
    return col.map(sum)

def profile(n_col: int):
    df = pd.DataFrame({
        f"{i}": [
            [random.random() for _ in range(random.randint(1, 5))]
            for _ in range(100_000)
        ] for i in range(n_col)
    })

    pipeline = Pipeline([
        ("transformer", ColumnTransformer([
            (f"{i}", FunctionTransformer(list_sum), [f"{i}"])
            for i in range(n_col)
        ], n_jobs=2))
    ])

    # perf_counter is a monotonic clock intended for measuring elapsed time.
    start = time.perf_counter()
    with joblib.parallel_backend(backend="loky", mmap_mode="r+"):
        pipeline.fit_transform(df)
    return time.perf_counter() - start

from sklearn import __version__ as sklearn_version
print(f"sklearn version: {sklearn_version}")

for n in range(5, 50, 5):
    run_time = profile(n)
    print(f"{n}: Per col: {(run_time / n):.6f}s / total {run_time:.2f} s")

Expected Results

The execution time scales linearly with the number of transformers.

Actual Results

The execution time scales quadratically with the number of transformers.
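One rough way to check the scaling claim against the posted numbers is to fit a straight line to log(total time) versus log(n_col); the slope estimates the exponent. The sketch below uses only the totals copied from the benchmark output in this report; `loglog_slope` is a hypothetical helper name, and a nine-point fit is only indicative.

```python
import math

# Total run times in seconds, copied from the benchmark output in this report.
N_COLS = [5, 10, 15, 20, 25, 30, 35, 40, 45]
TOTALS = {
    "1.4.2, n_jobs=1": [0.10, 0.19, 0.29, 0.38, 0.49, 0.57, 0.67, 0.76, 0.86],
    "1.5.0, n_jobs=2": [0.60, 2.26, 6.00, 10.28, 16.85, 27.69, 37.81, 51.22, 66.09],
}


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


for label, totals in TOTALS.items():
    print(f"{label}: estimated exponent {loglog_slope(N_COLS, totals):.2f}")
```

An exponent close to 1 indicates linear scaling and an exponent close to 2 indicates quadratic scaling. The 1.4.2 series comes out near 1 and the 1.5.0 series with n_jobs = 2 comes out near 2.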

Versions

System:
    python: 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:35:20) [Clang 16.0.6 ]
executable: /Users/belastoyan/micromamba/envs/sk-issue/bin/python
   machine: macOS-14.4-arm64-arm-64bit

Python dependencies:
      sklearn: 1.5.0
          pip: 24.0
   setuptools: 70.0.0
        numpy: 1.26.4
        scipy: 1.13.1
       Cython: None
       pandas: 2.2.2
   matplotlib: None
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /Users/belastoyan/micromamba/envs/sk-issue/lib/libopenblas.0.dylib
        version: 0.3.27
threading_layer: openmp
   architecture: VORTEX

       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: libomp
       filepath: /Users/belastoyan/micromamba/envs/sk-issue/lib/libomp.dylib
        version: None
