Description
I'm using scikit-learn version 0.24.1.
I suspect some form of memoization or extremely low sensitivity in the SVC predict_proba(...) implementation: once an SVC model is built from a Pipeline (with input-scaling preprocessing and probability calibration) and the random_state argument is set, I get the exact same predicted probability for wildly different inputs. I also checked the decision_function(...) result, and it returns the exact same value for two wildly different x_test inputs.
This issue only happens when the random_state argument is set, e.g. to zero. It is hard to create an MRE here because of the NDA and the confidentiality of the dataset and the work, but I propose to export the pipeline like this:
from joblib import dump, load
dump(model, '/some/where/model.joblib')
and provide the two wildly different inputs that lead to the exact same predict_proba(...) and decision_function(...) results. Would that be a viable way to reproduce and fix the possible bug?
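For what it's worth, a minimal sketch of the export I have in mind. The toy model, data shapes, and file path below are placeholders standing in for the NDA-covered originals:

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.svm import SVC

# toy stand-ins for the confidential model and inputs
rng = np.random.RandomState(0)
X_train = rng.randn(30, 4)
y_train = (X_train.sum(axis=1) > 0).astype(int)
model = SVC(kernel='rbf', probability=True, random_state=0).fit(X_train, y_train)
x_test1 = rng.randn(1, 4)

# bundle the fitted estimator together with the offending inputs in one artifact
path = os.path.join(tempfile.mkdtemp(), 'model_bundle.joblib')
dump({'model': model, 'x_test1': x_test1}, path)
bundle = load(path)

# the reloaded model reproduces the original probabilities bit-for-bit
assert np.array_equal(bundle['model'].predict_proba(x_test1),
                      model.predict_proba(x_test1))
```

Shipping the inputs inside the same joblib bundle as the estimator would let a maintainer replay predict_proba(...) and decision_function(...) without ever seeing the training data.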
A simplified relevant example of my pipeline is the following:
pipeline = Pipeline(steps=[('preprocess', MaxAbsScaler()),
                           ('svm', SVC(kernel='rbf', probability=True,
                                       random_state=0,
                                       decision_function_shape='ovr',
                                       break_ties=False, ...))])
params = [{...}]
model = GridSearchCV(pipeline, params, ...).fit(x_train, y_train).best_estimator_
# and now I get
prob1 = model.predict_proba(x_test1)
prob2 = model.predict_proba(x_test2_wildly_diff_from_test1)
assert np.all(np.abs(prob1 - prob2) < 1e-10)
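A self-contained sketch on synthetic data (every shape, magnitude, and the parameter grid below is a placeholder, not my real setup) that shows the same symptom whenever both test points land far outside the training range:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
x_train = rng.randn(60, 5)
y_train = (x_train[:, 0] + x_train[:, 1] > 0).astype(int)

pipeline = Pipeline(steps=[('preprocess', MaxAbsScaler()),
                           ('svm', SVC(kernel='rbf', probability=True,
                                       random_state=0,
                                       decision_function_shape='ovr',
                                       break_ties=False))])
params = [{'svm__C': [0.1, 1.0]}]
model = GridSearchCV(pipeline, params, cv=3).fit(x_train, y_train).best_estimator_

# two wildly different inputs, both far outside the training distribution
x_test1 = rng.randn(1, 5) * 1e4
x_test2 = -x_test1
prob1 = model.predict_proba(x_test1)
prob2 = model.predict_proba(x_test2)
# far from every support vector, the RBF kernel underflows to 0 for both
# points, so decision_function collapses to the intercept and the
# probabilities coincide exactly
assert np.all(np.abs(prob1 - prob2) < 1e-10)
```

If the real x_test vectors also sit far from every support vector after MaxAbsScaler, this kind of kernel saturation (rather than memoization) would be consistent with what I'm seeing.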
For example, using random_state=0:
>>> import numpy as np
>>> from scipy.spatial import distance
>>> # how far are the two x_test vectors from each other?
>>> distance.cosine(x_test1, x_test2) # EDIT: very far apart angle-wise e.g. 90°
1.0280449512858494
>>> distance.euclidean(x_test1, x_test2) # very far in euclidean distance
30675.221284568033
>>> model.predict_proba(x_test1)
array([[0.86879653, 0.13120347]])
>>> model.predict_proba(x_test2)
array([[0.86879653, 0.13120347]])
>>> model.decision_function(x_test1)
array([-0.03474242])
>>> model.decision_function(x_test2)
array([-0.03474242])
UPDATE: what bothers me is not that the two results are merely close; what spooks me is that two completely different vectors end up at the exact same distance (to the 1e-10 decimal place) from the decision boundary according to decision_function(...), and therefore at the exact same probability from predict_proba(...). I'm still thinking about how to further validate and scrutinize this case ...
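One validation I plan to try, sketched below on a toy SVC with an explicit gamma (my real model is under NDA): reconstruct decision_function from the kernel expansion over the fitted support vectors. If both x_test vectors are far from every support vector after scaling, all RBF kernel values underflow to zero and the decision function degenerates to the intercept, which would yield identical outputs for arbitrarily different inputs:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train = rng.randn(40, 3)
y_train = (X_train[:, 0] > 0).astype(int)
gamma = 0.5
clf = SVC(kernel='rbf', gamma=gamma, random_state=0).fit(X_train, y_train)

def manual_decision(clf, X, gamma):
    # binary SVC: f(x) = sum_i dual_coef_[0, i] * exp(-gamma * ||sv_i - x||^2)
    #                    + intercept_[0]
    d2 = ((clf.support_vectors_[None, :, :] - X[:, None, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * d2)          # shape (n_samples, n_support_vectors)
    return K @ clf.dual_coef_[0] + clf.intercept_[0]

# sanity check: the reconstruction matches sklearn on in-distribution points
assert np.allclose(manual_decision(clf, X_train, gamma),
                   clf.decision_function(X_train))

X_far = rng.randn(2, 3) * 1e3        # two points far outside the training cloud
d2_far = ((clf.support_vectors_[None] - X_far[:, None]) ** 2).sum(axis=-1)
K_far = np.exp(-gamma * d2_far)

# the kernel saturates: every K value is numerically zero, so both far-away
# points sit at exactly the intercept
assert K_far.max() == 0.0
assert np.allclose(clf.decision_function(X_far), clf.intercept_[0])
```

If the same check on my real model shows the kernel rows for both x_test vectors at zero, the identical decision_function(...) and predict_proba(...) outputs would be explained by saturation of the RBF kernel rather than by memoization.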