Description
Describe the bug
After upgrading to scikit-learn 1.5.0, we observed a significant performance regression in ColumnTransformer when using n_jobs > 1. The issue seems related to IO overhead, which grows quadratically with the number of transformers and is particularly noticeable when processing Series that hold Python objects such as lists or strings.
Below are benchmarks for running a pipeline with varying numbers of columns (n_col) with n_jobs = {1, 2} across scikit-learn versions 1.4.2 and 1.5.0:
sklearn version: 1.4.2 and n_jobs = 1
5: Per col: 0.019380s / total 0.10 s
10: Per col: 0.018936s / total 0.19 s
15: Per col: 0.019192s / total 0.29 s
20: Per col: 0.019223s / total 0.38 s
25: Per col: 0.019718s / total 0.49 s
30: Per col: 0.019141s / total 0.57 s
35: Per col: 0.019265s / total 0.67 s
40: Per col: 0.019065s / total 0.76 s
45: Per col: 0.019170s / total 0.86 s
sklearn version: 1.5.0 and n_jobs = 1
5: Per col: 0.025390s / total 0.13 s
10: Per col: 0.020016s / total 0.20 s
15: Per col: 0.021841s / total 0.33 s
20: Per col: 0.020817s / total 0.42 s
25: Per col: 0.021067s / total 0.53 s
30: Per col: 0.021997s / total 0.66 s
35: Per col: 0.021080s / total 0.74 s
40: Per col: 0.020629s / total 0.83 s
45: Per col: 0.020796s / total 0.94 s
sklearn version: 1.4.2 and n_jobs = 2
5: Per col: 0.243821s / total 1.22 s
10: Per col: 0.028045s / total 0.28 s
15: Per col: 0.026836s / total 0.40 s
20: Per col: 0.028144s / total 0.56 s
25: Per col: 0.026041s / total 0.65 s
30: Per col: 0.025631s / total 0.77 s
35: Per col: 0.025608s / total 0.90 s
40: Per col: 0.025547s / total 1.02 s
45: Per col: 0.025084s / total 1.13 s
sklearn version: 1.5.0 and n_jobs = 2
5: Per col: 0.119883s / total 0.60 s
10: Per col: 0.226338s / total 2.26 s
15: Per col: 0.399880s / total 6.00 s
20: Per col: 0.513848s / total 10.28 s
25: Per col: 0.673867s / total 16.85 s
30: Per col: 0.923152s / total 27.69 s
35: Per col: 1.080279s / total 37.81 s
40: Per col: 1.280597s / total 51.22 s
45: Per col: 1.468622s / total 66.09 s
The data show that with scikit-learn 1.5.0 and n_jobs = 2, the processing time per column (that is, per transformer) grows with the total number of transformers, whereas a roughly constant time per transformer would be expected. I bisected this issue to PR #28822, which seems to cause the entire DataFrame to be sent to each worker rather than just the columns selected by the transformer.
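To illustrate why sending the whole DataFrame to every worker would lead to quadratic cost, the sketch below compares the total pickled payload under the two dispatch strategies. It is a rough model only: it uses the standard pickle module rather than joblib's own serialization path, and payload_bytes, n_rows, and the fixed list length of 3 are hypothetical choices made for this illustration.

```python
import pickle
import random

import pandas as pd


def payload_bytes(n_col: int, n_rows: int = 1_000) -> tuple[int, int]:
    """Return (bytes if each task gets the whole frame, bytes if each gets its own column)."""
    # Object columns holding short lists of floats, similar to the reproducer.
    df = pd.DataFrame({
        f"{i}": [[random.random() for _ in range(3)] for _ in range(n_rows)]
        for i in range(n_col)
    })
    # Strategy A (assumed 1.5.0 behaviour): one full-frame pickle per transformer.
    whole = n_col * len(pickle.dumps(df))
    # Strategy B (assumed 1.4.2 behaviour): each transformer receives only its column.
    subset = sum(len(pickle.dumps(df[[c]])) for c in df.columns)
    return whole, subset


for n in (2, 4, 8):
    whole, subset = payload_bytes(n)
    print(f"{n} columns: whole-frame {whole} B, per-column {subset} B")
```

Under this model the per-column total grows in proportion to the number of columns, while the whole-frame total grows with its square, which is consistent with the timings reported above.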
Steps/Code to Reproduce
import random
import time

import joblib
import pandas as pd
from sklearn import __version__ as sklearn_version
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def list_sum(col):
    # Sum the list stored in each cell.
    return col.map(lambda x: sum(x))


def profile(n_col: int):
    # Build n_col object columns, each holding 100,000 short lists of floats.
    df = pd.DataFrame({
        f"{i}": [
            [random.random() for _ in range(random.randint(1, 5))]
            for _ in range(100_000)
        ] for i in range(n_col)
    })
    # One FunctionTransformer per column, dispatched to two workers.
    pipeline = Pipeline([
        ("transformer", ColumnTransformer([
            (f"{i}", FunctionTransformer(list_sum), [f"{i}"])
            for i in range(n_col)
        ], n_jobs=2))
    ])
    start = time.time()
    with joblib.parallel_backend(backend="loky", mmap_mode="r+"):
        pipeline.fit_transform(df)
    return time.time() - start


print(f"sklearn version: {sklearn_version}")
for n in range(5, 50, 5):
    run_time = profile(n)
    print(f"{n}: Per col: {(run_time / n):.4f}s / total {run_time:.2f} s")
Expected Results
The execution time scales linearly with the number of transformers
Actual Results
The execution time scales quadratically with the number of transformers
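As a quick check of the scaling claim, the total times from the benchmark output above can be fitted on a log-log scale: a slope near 1 indicates linear growth and a slope near 2 indicates quadratic growth. This is a minimal sketch; loglog_slope and the constant names are hypothetical helpers, and the numbers are copied from the 1.4.2 (n_jobs = 1) and 1.5.0 (n_jobs = 2) runs.

```python
import math

# Total run times in seconds, copied from the benchmark output in this report.
N_COLS = [5, 10, 15, 20, 25, 30, 35, 40, 45]
TOTAL_142_JOBS1 = [0.10, 0.19, 0.29, 0.38, 0.49, 0.57, 0.67, 0.76, 0.86]
TOTAL_150_JOBS2 = [0.60, 2.26, 6.00, 10.28, 16.85, 27.69, 37.81, 51.22, 66.09]


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. the fitted power-law exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


print(f"1.4.2, n_jobs=1: exponent {loglog_slope(N_COLS, TOTAL_142_JOBS1):.2f}")
print(f"1.5.0, n_jobs=2: exponent {loglog_slope(N_COLS, TOTAL_150_JOBS2):.2f}")
```

The fitted exponent comes out close to 1 for the 1.4.2 run and close to 2 for the 1.5.0 run with n_jobs = 2.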
Versions
System:
    python: 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:35:20) [Clang 16.0.6 ]
    executable: /Users/belastoyan/micromamba/envs/sk-issue/bin/python
    machine: macOS-14.4-arm64-arm-64bit

Python dependencies:
    sklearn: 1.5.0
    pip: 24.0
    setuptools: 70.0.0
    numpy: 1.26.4
    scipy: 1.13.1
    Cython: None
    pandas: 2.2.2
    matplotlib: None
    joblib: 1.4.2
    threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
    user_api: blas
    internal_api: openblas
    num_threads: 12
    prefix: libopenblas
    filepath: /Users/belastoyan/micromamba/envs/sk-issue/lib/libopenblas.0.dylib
    version: 0.3.27
    threading_layer: openmp
    architecture: VORTEX

    user_api: openmp
    internal_api: openmp
    num_threads: 12
    prefix: libomp
    filepath: /Users/belastoyan/micromamba/envs/sk-issue/lib/libomp.dylib
    version: None