Validation step fails when using shared memory with multiprocessing.managers.BaseManager #28899

Closed
@ElenaKhaustova

Description

Describe the bug

Original issue: kedro-org/kedro#3674

Relates to #28781

We use multiprocessing managers to work with shared memory for pipeline parallelisation. Since this validation step was added, we get "ValueError: cannot set WRITEABLE flag to True of this array" whenever objects are retrieved from shared memory and passed to scikit-learn methods that run the validation step, for example fit.

The only workaround that works for us so far is to make a deep copy of the objects before passing them to those methods, which is undesirable.
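A minimal sketch of that copy workaround, reduced to plain NumPy so it runs without a manager. The helper name as_writeable is hypothetical, and the read-only array built with np.frombuffer is only a stand-in for data loaded from shared memory. Unlike an unconditional deep copy, this variant copies only when the array is not writeable.

```python
import numpy as np


def as_writeable(arr: np.ndarray) -> np.ndarray:
    """Return arr unchanged if it is writeable, otherwise a writeable copy.

    Hypothetical helper for illustration; it is not part of scikit-learn.
    """
    return arr if arr.flags.writeable else arr.copy()


# Stand-in for data coming back from the manager: an array that views an
# immutable bytes object is read-only.
read_only = np.frombuffer(np.arange(4.0).tobytes(), dtype=np.float64)

# The copy owns freshly allocated memory, so it is writeable again.
fixed = as_writeable(read_only)
```

For pandas objects the equivalent would be calling .copy() or copy.deepcopy on the loaded objects, as described above, at the cost of duplicating the data in the worker process.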

Steps/Code to Reproduce

Some findings:

from concurrent.futures import ProcessPoolExecutor
from multiprocessing.managers import BaseManager
import traceback

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


class MemoryDataset:
    def __init__(self):
        self._ds = None

    def save(self, ds):
        self._ds = ds

    def load(self):
        return self._ds


def train_model(dataset: MemoryDataset) -> LinearRegression:
    regressor = LinearRegression()
    X_train, y_train = dataset.load()
    try:
        regressor.fit(X_train, y_train)
    except Exception:
        print(traceback.format_exc())
    return regressor


class MyManager(BaseManager):
    pass


MyManager.register("MemoryDataset", MemoryDataset, exposed=("save", "load"))


def main():
    rng = np.random.default_rng()
    n_samples = 1000
    X_train = pd.DataFrame(rng.random((n_samples, 4)), columns=list('ABCD'))
    y_train = pd.Series(rng.random(n_samples))
    # Replacing pd.Series with pd.DataFrame solves the issue
    # y_train = pd.DataFrame(rng.random((n_samples, 1)), columns=list('E'))

    futures = set()

    manager = MyManager()
    manager.start()
    dataset = manager.MemoryDataset()
    dataset.save((X_train, y_train))

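    with ProcessPoolExecutor(max_workers=1) as pool:
        futures.add(pool.submit(train_model, dataset))


# The guard is required when worker processes are started with "spawn".
if __name__ == "__main__":
    main()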

Expected Results

No error is thrown.

Actual Results

Traceback (most recent call last):
  File "/pr-scikit-learn/main.py", line 48, in train_model
    regressor.fit(X_train, y_train)
  File "/lib/python3.11/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 609, in fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/base.py", line 650, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1282, in check_X_y
    y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1292, in _check_y
    y = check_array(
        ^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/sklearn/utils/validation.py", line 1100, in check_array
    array.flags.writeable = True
    ^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set WRITEABLE flag to True of this array
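The same ValueError can be reproduced with NumPy alone, which may help narrow down the cause. The sketch below assumes that the array reaching check_array is a view over a buffer NumPy cannot make writeable; the issue does not confirm that this is what happens to the Series sent through the manager. Here an immutable bytes object plays that role.

```python
import numpy as np

# An array that views an immutable bytes object does not own its memory
# and is created read-only.
buf = bytes(8 * 3)
arr = np.frombuffer(buf, dtype=np.float64)

try:
    # Same assignment as the one shown in the traceback above.
    arr.flags.writeable = True
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Checking arr.flags.writeable and arr.flags.owndata on the values returned by dataset.load() inside the worker would show whether the loaded data is in the same state.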

Versions

System:
    python: 3.11.9 (main, Apr 19 2024, 11:44:45) [Clang 14.0.6 ]
executable: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.5.dev0
          pip: 23.3.1
   setuptools: 68.2.2
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: None
       pandas: 2.2.2
   matplotlib: None
       joblib: 1.4.0
threadpoolctl: 3.4.0

Built with OpenMP: False

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 10
         prefix: libopenblas
       filepath: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Nehalem

       user_api: blas
   internal_api: openblas
    num_threads: 10
         prefix: libopenblas
       filepath: /opt/miniconda3/envs/paraller-runner-scikit-learn-env/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.26.dev
threading_layer: pthreads
   architecture: Nehalem
