Description
I'm using scikit-learn version 0.24.1.
I suspect some form of memoization or extremely low sensitivity in the SVC predict_proba(...) implementation: once an SVC model is built from a Pipeline (with input-scaling preprocessing and probability calibration) and the random_state argument is set, I get the exact same predicted probability for wildly different inputs. I also checked the decision_function(...) result, and it returns the exact same value for two wildly different x_test inputs.
This issue only happens when the random_state argument is set, e.g. to zero. It is hard to create an MRE here because of the NDA and the confidentiality of the dataset and the work, but I propose to export the pipeline like this:
from joblib import dump, load
dump(model, '/some/where/model.joblib')
and provide the two wildly different inputs that lead to the exact same predict_proba(...) and decision_function(...) results. Would that be a viable way to reproduce and fix the possible bug?
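For what it's worth, a minimal sketch of the export I have in mind. The toy model, data shapes, and file path below are placeholders standing in for the NDA-covered originals:

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.svm import SVC

# toy stand-ins for the confidential model and inputs
rng = np.random.RandomState(0)
X_train = rng.randn(30, 4)
y_train = (X_train.sum(axis=1) > 0).astype(int)
model = SVC(kernel='rbf', probability=True, random_state=0).fit(X_train, y_train)
x_test1 = rng.randn(1, 4)

# bundle the fitted estimator together with the offending inputs in one artifact
path = os.path.join(tempfile.mkdtemp(), 'model_bundle.joblib')
dump({'model': model, 'x_test1': x_test1}, path)
bundle = load(path)

# the reloaded model reproduces the original probabilities bit-for-bit
assert np.array_equal(bundle['model'].predict_proba(x_test1),
                      model.predict_proba(x_test1))
```

Shipping the inputs inside the same joblib bundle as the estimator would let a maintainer replay predict_proba(...) and decision_function(...) without ever seeing the training data.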
A simplified relevant example of my pipeline is the following:
pipeline = Pipeline(steps=[('preprocess', MaxAbsScaler()),
                           ('svm', SVC(kernel='rbf', probability=True,
                                       random_state=0,
                                       decision_function_shape='ovr',
                                       break_ties=False, ...))])
params = [{...}]
model = GridSearchCV(pipeline, params, ...).fit(x_train, y_train).best_estimator_
# and now I get
prob1 = model.predict_proba(x_test1)
prob2 = model.predict_proba(x_test2_wildly_diff_from_test1)
assert np.all(np.abs(prob1 - prob2) < 1e-10)
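A self-contained sketch on synthetic data (every shape, magnitude, and the parameter grid below is a placeholder, not my real setup) that shows the same symptom whenever both test points land far outside the training range:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
x_train = rng.randn(60, 5)
y_train = (x_train[:, 0] + x_train[:, 1] > 0).astype(int)

pipeline = Pipeline(steps=[('preprocess', MaxAbsScaler()),
                           ('svm', SVC(kernel='rbf', probability=True,
                                       random_state=0,
                                       decision_function_shape='ovr',
                                       break_ties=False))])
params = [{'svm__C': [0.1, 1.0]}]
model = GridSearchCV(pipeline, params, cv=3).fit(x_train, y_train).best_estimator_

# two wildly different inputs, both far outside the training distribution
x_test1 = rng.randn(1, 5) * 1e4
x_test2 = -x_test1
prob1 = model.predict_proba(x_test1)
prob2 = model.predict_proba(x_test2)
# far from every support vector, the RBF kernel underflows to 0 for both
# points, so decision_function collapses to the intercept and the
# probabilities coincide exactly
assert np.all(np.abs(prob1 - prob2) < 1e-10)
```

If the real x_test vectors also sit far from every support vector after MaxAbsScaler, this kind of kernel saturation (rather than memoization) would be consistent with what I'm seeing.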
For example, using random_state=0:
>>> import numpy as np
>>> from scipy.spatial import distance
>>> # how far are the two x_test vectors from each other?
>>> distance.cosine(x_test1, x_test2) # EDIT: very far apart angle-wise e.g. 90°
1.0280449512858494
>>> distance.euclidean(x_test1, x_test2) # very far in euclidean distance
30675.221284568033
>>> model.predict_proba(x_test1)
array([[0.86879653, 0.13120347]])
>>> model.predict_proba(x_test2)
array([[0.86879653, 0.13120347]])
>>> model.decision_function(x_test1)
array([-0.03474242])
>>> model.decision_function(x_test2)
array([-0.03474242])
UPDATE: what bothers me is not that the two results are merely close; what spooks me is that two completely different vectors end up at the exact same distance (to the 1e-10 decimal place) from the decision boundary according to decision_function(...), and therefore at the exact same probability from predict_proba(...). I'm still thinking about how to further validate and scrutinize this case ...
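One validation I plan to try, sketched below on a toy SVC with an explicit gamma (my real model is under NDA): reconstruct decision_function from the kernel expansion over the fitted support vectors. If both x_test vectors are far from every support vector after scaling, all RBF kernel values underflow to zero and the decision function degenerates to the intercept, which would yield identical outputs for arbitrarily different inputs:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train = rng.randn(40, 3)
y_train = (X_train[:, 0] > 0).astype(int)
gamma = 0.5
clf = SVC(kernel='rbf', gamma=gamma, random_state=0).fit(X_train, y_train)

def manual_decision(clf, X, gamma):
    # binary SVC: f(x) = sum_i dual_coef_[0, i] * exp(-gamma * ||sv_i - x||^2)
    #                    + intercept_[0]
    d2 = ((clf.support_vectors_[None, :, :] - X[:, None, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * d2)          # shape (n_samples, n_support_vectors)
    return K @ clf.dual_coef_[0] + clf.intercept_[0]

# sanity check: the reconstruction matches sklearn on in-distribution points
assert np.allclose(manual_decision(clf, X_train, gamma),
                   clf.decision_function(X_train))

X_far = rng.randn(2, 3) * 1e3        # two points far outside the training cloud
d2_far = ((clf.support_vectors_[None] - X_far[:, None]) ** 2).sum(axis=-1)
K_far = np.exp(-gamma * d2_far)

# the kernel saturates: every K value is numerically zero, so both far-away
# points sit at exactly the intercept
assert K_far.max() == 0.0
assert np.allclose(clf.decision_function(X_far), clf.intercept_[0])
```

If the same check on my real model shows the kernel rows for both x_test vectors at zero, the identical decision_function(...) and predict_proba(...) outputs would be explained by saturation of the RBF kernel rather than by memoization.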