Describe the bug
When fitting DecisionTreeClassifier on a duplicated sample set (i.e. each sample repeated twice), the result differs from the one obtained by fitting on the original sample set. This only happens when 'min_weight_fraction_leaf' is specified as <0.5. This also affects ExtraTreesClassifier and ExtraTreeClassifier.
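For comparison, here is a minimal sketch of the invariance this report relies on. It assumes that repeating every sample twice should be equivalent to fitting the original samples with a uniform `sample_weight` of 2. The 5-feature `X` and the default hyperparameters are illustrative choices, not the failing configuration. With default settings both trees are grown until they fit the training data, so their predictions on `X` should agree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n_samples = 20
X = rng.rand(n_samples, 5)  # illustrative shape, smaller than in the report
y = rng.randint(0, 3, size=n_samples)

# Variant 1: every sample physically repeated twice.
X_repeated = np.repeat(X, 2, axis=0)
y_repeated = np.repeat(y, 2)
est_dup = DecisionTreeClassifier(random_state=0).fit(X_repeated, y_repeated)

# Variant 2: original samples, each carrying weight 2.
est_weighted = DecisionTreeClassifier(random_state=0).fit(
    X, y, sample_weight=np.full(n_samples, 2.0)
)

# Both trees are fully grown, so they should agree on the training points.
same = np.array_equal(est_dup.predict(X), est_weighted.predict(X))
print("identical predictions on X:", same)
```

The same comparison can be repeated with `min_weight_fraction_leaf` and `max_features` set as in the reproduction script. The outcome there is what this report is about, so no expected output is claimed for that case.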
Steps/Code to Reproduce
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import kstest
import numpy as np
rng = np.random.RandomState(0)
n_samples = 20
X = rng.rand(n_samples, n_samples * 2)
y = rng.randint(0, 3, size=n_samples)
X_repeated = np.repeat(X,2,axis=0)
y_repeated = np.repeat(y,2)
predictions = []
predictions_dup = []
# Fit estimators on the original and on the duplicated data
for seed in range(100):
    est = DecisionTreeClassifier(random_state=seed, max_features=0.5, min_weight_fraction_leaf=0.5).fit(X, y)
    est_dup = DecisionTreeClassifier(random_state=seed, max_features=0.5, min_weight_fraction_leaf=0.5).fit(X_repeated, y_repeated)
    # Get predictions (last class column dropped)
    predictions.append(est.predict_proba(X)[:, :-1])
    predictions_dup.append(est_dup.predict_proba(X)[:, :-1])
predictions = np.vstack(predictions)
predictions_dup = np.vstack(predictions_dup)
# Compare the two prediction distributions class by class
for pred, pred_dup in zip(predictions.T, predictions_dup.T):
    print(kstest(pred, pred_dup).pvalue)
Expected Results
p-values are greater than ~0.05
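To illustrate why a large p-value is the expected outcome, here is a small sketch of how the two-sample check behaves when `scipy.stats.kstest` is given two arrays. The arrays `a` and `a + 1.0` are synthetic stand-ins, not predictions from any estimator. Two samples with the same empirical distribution give a p-value well above 0.05, while two samples that do not overlap give a p-value close to zero.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.RandomState(0)
a = rng.rand(200)

# Same values in a different order: identical empirical distributions.
p_same = kstest(a, rng.permutation(a)).pvalue

# Same values shifted by 1: the two samples do not overlap at all.
p_shifted = kstest(a, a + 1.0).pvalue

print("same distribution:", p_same)
print("shifted distribution:", p_shifted)
```

If duplicating the samples left the fitted trees unchanged, the predictions would fall in the first situation.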
Actual Results
p-values = 2.0064970441275627e-69
Versions
System:
python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:13:44) [Clang 16.0.6 ]
executable: /Users/shrutinath/micromamba/envs/scikit-learn/bin/python
machine: macOS-14.3-arm64-arm-64bit
Python dependencies:
sklearn: 1.7.dev0
pip: 24.0
setuptools: 75.8.0
numpy: 2.0.0
scipy: 1.14.0
Cython: 3.0.10
pandas: 2.2.2
matplotlib: 3.9.0
joblib: 1.4.2
threadpoolctl: 3.5.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
num_threads: 8
prefix: libopenblas
...
num_threads: 8
prefix: libomp
filepath: /Users/shrutinath/micromamba/envs/scikit-learn/lib/libomp.dylib
version: None