Description
When passing a Criterion object to RandomForest or ExtraTrees instead of a criterion string, I've observed segfaults during fitting when n_jobs > 1. In my case I've written a custom Criterion, but the problem can also be reproduced with one of sklearn's built-in criteria by passing the Criterion object instead of the string.
I believe the problem is that the parameters aren't copied when the list of estimators for the ensemble is created, so the same Criterion object is used for all the trees. When n_jobs=1 this is fine, because the criterion is re-initialized at each split. When n_jobs>1, however, the same criterion is modified by multiple threads, which leads to pointers being freed and then accessed.
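To illustrate the sharing I suspect, here is a minimal pure-Python sketch. ToyCriterion, ToyTree and make_trees are hypothetical stand-ins that I made up for this example; this is not sklearn's actual code, only a model of "parameters passed by reference, not copied".

```python
class ToyCriterion:
    """Hypothetical stand-in for a stateful Criterion object."""

    def __init__(self):
        self.state = None


class ToyTree:
    """Hypothetical stand-in for a single tree in the ensemble."""

    def __init__(self, criterion):
        self.criterion = criterion


def make_trees(criterion, n_estimators):
    # Each tree receives a reference to the same criterion; nothing is copied.
    return [ToyTree(criterion) for _ in range(n_estimators)]


shared = ToyCriterion()
trees = make_trees(shared, 3)

# Every tree points at the very same object.
all_shared = all(tree.criterion is shared for tree in trees)

# A change made through one tree is visible through all the others.
trees[0].criterion.state = "reset by tree 0"
seen_by_tree_1 = trees[1].criterion.state
```

With a single worker the trees take turns with the shared object, so each one can reset it safely. With several threads the resets and reads can interleave, which would be consistent with the freed-then-accessed pointers I'm seeing.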
Steps/Code to Reproduce
The following code reproduces the segfault:
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.tree.tree import CRITERIA_REG
X = np.random.random((1000, 3))
y = np.random.random((1000, 1))
n_samples, n_outputs = y.shape
# Build the Criterion object directly instead of passing the string 'mse'.
mse_criterion = CRITERIA_REG['mse'](n_outputs, n_samples)
# n_jobs=-1 fits the trees in parallel, which is when the segfault occurs.
rf = ExtraTreesRegressor(n_estimators=400, n_jobs=-1, criterion=mse_criterion)
rf.fit(X, y)
Versions
System
python: 3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 21:41:56) [GCC 7.3.0]
Python deps
sklearn: 0.20.0
setuptools: 40.2.0
pip: 10.0.1
Cython: 0.28.5
numpy: 1.13.3
pandas: 0.23.4
scipy: 1.1.0
Discussion
I've tried wrapping the getattr call in copy.deepcopy() for all the parameters accessed when making the estimators to fit, which seems to fix the problem. Would that be an acceptable fix, or are you interested in a deeper fix?
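A rough sketch of the idea, again in pure Python rather than as a patch against sklearn. ToyEnsemble, ToyTree, ToyCriterion and the estimator_params tuple here are hypothetical names for illustration only, and the sketch assumes the criterion object can be deep-copied.

```python
import copy


class ToyCriterion:
    """Hypothetical stand-in for a stateful Criterion object."""

    def __init__(self):
        self.state = None


class ToyTree:
    """Hypothetical stand-in for a single tree in the ensemble."""

    def __init__(self, criterion):
        self.criterion = criterion


class ToyEnsemble:
    """Hypothetical ensemble that forwards selected parameters to its trees."""

    # Names of the attributes handed to each sub-estimator (illustrative).
    estimator_params = ("criterion",)

    def __init__(self, criterion, n_estimators):
        self.criterion = criterion
        self.n_estimators = n_estimators

    def make_estimators(self):
        estimators = []
        for _ in range(self.n_estimators):
            # deepcopy around getattr: every tree gets its own private copy.
            params = {
                name: copy.deepcopy(getattr(self, name))
                for name in self.estimator_params
            }
            estimators.append(ToyTree(**params))
        return estimators


ensemble = ToyEnsemble(ToyCriterion(), n_estimators=3)
trees = ensemble.make_estimators()

# No two trees share a criterion, and none shares the ensemble's original.
distinct_ids = {id(tree.criterion) for tree in trees}
trees[0].criterion.state = "reset by tree 0"
```

The cost is one extra copy of each forwarded parameter per tree, which is negligible for strings but could matter for large parameter objects.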