Closed
Description
Describe the bug
Original issue: kedro-org/kedro#3674
Relates to #28781
We use multiprocessing managers to work with shared memory for pipeline parallelisation. Since this validation step was added, we get a "ValueError: cannot set WRITEABLE flag to True of this array" error when objects are retrieved from shared memory and passed to scikit-learn functions that include this validation step, for example fit.
The only workaround that works for us so far is making a deep copy of the objects before passing them to those methods, which is not a desirable solution.
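For illustration, the deep-copy workaround can be sketched with plain NumPy. This is a minimal sketch, not our pipeline code: the bytes-backed array is an assumption that stands in for data received from another process, and the names shared_like and private are hypothetical.

```python
import copy

import numpy as np

# Assumption: a read-only, bytes-backed array stands in for data that was
# received from another process and does not own writeable memory.
raw = np.arange(5, dtype=np.float64).tobytes()
shared_like = np.frombuffer(raw, dtype=np.float64)

# Deep copy allocates fresh memory owned by the new array, so it is writeable.
private = copy.deepcopy(shared_like)

print(shared_like.flags.writeable, private.flags.writeable)
```

The copy doubles memory use for the copied objects, which is why we would prefer not to rely on it.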
Steps/Code to Reproduce
Some findings:
- The result depends on n_samples. When n_samples is relatively small (~100), the error does not happen, so it may be related to "ColumnTransformer throws error with n_jobs > 1 input dataframes and joblib auto-memmapping (regression in 1.4.1.post1)" #28781 (comment).
- Replacing pd.Series with pd.DataFrame solves the issue, but we do not know why.
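To narrow down findings like these, it may help to look at the memory flags of the arrays involved. The helper below is a hypothetical diagnostic (describe_flags is not part of any library); it could be called on X_train and y_train inside the worker right after load(). The two arrays here are only stand-ins.

```python
import numpy as np


def describe_flags(obj):
    """Hypothetical helper: report memory flags of the array behind obj."""
    arr = np.asarray(obj)
    return {
        "shape": arr.shape,
        "writeable": bool(arr.flags.writeable),
        "owndata": bool(arr.flags.owndata),
    }


# Stand-ins: an array that owns its memory and a read-only view over bytes.
owned = np.zeros(3)
view = np.frombuffer(bytes(24), dtype=np.float64)

print(describe_flags(owned))
print(describe_flags(view))
```

Comparing the output for the pd.Series and pd.DataFrame variants of y_train might show whether they differ in the writeable or owndata flag.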
from concurrent.futures import ProcessPoolExecutor
from multiprocessing.managers import BaseManager
import traceback

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


class MemoryDataset:
    """In-memory container that lives in the manager process."""

    def __init__(self):
        self._ds = None

    def save(self, ds):
        self._ds = ds

    def load(self):
        return self._ds


def train_model(dataset: MemoryDataset) -> LinearRegression:
    # Runs in the worker process; load() fetches the data through the manager proxy.
    regressor = LinearRegression()
    X_train, y_train = dataset.load()
    try:
        regressor.fit(X_train, y_train)
    except Exception:
        print(traceback.format_exc())
    return regressor


class MyManager(BaseManager):
    pass


MyManager.register("MemoryDataset", MemoryDataset, exposed=("save", "load"))


def main():
    rng = np.random.default_rng()
    n_samples = 1000
    X_train = pd.DataFrame(rng.random((n_samples, 4)), columns=list('ABCD'))
    y_train = pd.Series(rng.random(n_samples))
    # Replacing pd.Series with pd.DataFrame solves the issue
    # y_train = pd.DataFrame(rng.random((n_samples, 1)), columns=list('E'))

    futures = set()
    manager = MyManager()
    manager.start()
    dataset = manager.MemoryDataset()
    dataset.save((X_train, y_train))

    with ProcessPoolExecutor(max_workers=1) as pool:
        futures.add(pool.submit(train_model, dataset))


# Entry-point guard so that worker processes can import this module safely.
if __name__ == "__main__":
    main()
Expected Results
No error is thrown.
Actual Results
Traceback (most recent call last):
  File "/pr-scikit-learn/main.py", line 48, in train_model
    regressor.fit(X_train, y_train)
  File "/lib/python3.11/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 609, in fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/base.py", line 650, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1282, in check_X_y
    y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1292, in _check_y
    y = check_array(
        ^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1100, in check_array
    array.flags.writeable = True
    ^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set WRITEABLE flag to True of this array
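The failing statement in the traceback, array.flags.writeable = True, can be reproduced outside scikit-learn with an array whose memory cannot be made writeable. The sketch below uses a bytes-backed array as an assumed stand-in for whatever the worker receives, and shows one possible guard that falls back to a copy. ensure_writeable is a hypothetical helper, not scikit-learn's implementation.

```python
import numpy as np


def ensure_writeable(arr):
    """Hypothetical helper: return a writeable array, copying only if needed."""
    if arr.flags.writeable:
        return arr
    try:
        # Works when the underlying memory allows writing.
        arr.flags.writeable = True
        return arr
    except ValueError:
        # The memory cannot be made writeable, so fall back to a copy.
        return arr.copy()


# Assumption: a read-only view over immutable bytes triggers the same error.
read_only = np.frombuffer(bytes(16), dtype=np.float64)
try:
    read_only.flags.writeable = True
    message = ""
except ValueError as exc:
    message = str(exc)

safe = ensure_writeable(read_only)
print(message)
print(safe.flags.writeable)
```

A guard of this kind copies only when flipping the flag fails, so arrays that are already writeable are passed through untouched.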
Versions
System:
    python: 3.11.9 (main, Apr 19 2024, 11:44:45) [Clang 14.0.6 ]
    executable: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/bin/python
    machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
    sklearn: 1.5.dev0
    pip: 23.3.1
    setuptools: 68.2.2
    numpy: 1.26.4
    scipy: 1.13.0
    Cython: None
    pandas: 2.2.2
    matplotlib: None
    joblib: 1.4.0
    threadpoolctl: 3.4.0

Built with OpenMP: False

threadpoolctl info:
    user_api: blas
    internal_api: openblas
    num_threads: 10
    prefix: libopenblas
    filepath: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
    version: 0.3.23.dev
    threading_layer: pthreads
    architecture: Nehalem

    user_api: blas
    internal_api: openblas
    num_threads: 10
    prefix: libopenblas
    filepath: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
    version: 0.3.26.dev
    threading_layer: pthreads
    architecture: Nehalem