SVC's predict_proba(...) predicts the exact same probability for wildly different inputs #19447

Open

@bravegag

Description
I'm using scikit-learn version 0.24.1.

I suspect some form of memoization or extremely low sensitivity in the SVC predict_proba(...) implementation: once an SVC model is built from a Pipeline with input-scaling preprocessing and calibration, and with the random_state argument set, I get the exact same predicted probability for wildly different inputs.

I also checked the decision_function(...) result, and it returns the exact same value for two wildly different x_test inputs.

This issue only happens when setting the argument random_state, e.g. to zero. It is hard to create an MRE here because of the NDA and confidentiality of the dataset and work, but I propose to export the pipeline like this:

from joblib import dump, load
dump(model, '/some/where/model.joblib')

and provide the two wildly different inputs that lead to the exact same predict_proba(...) and decision_function(...) results. Would that be a viable way to reproduce and fix the possible bug?
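The round-trip could be sketched like this (a toy model on synthetic data stands in for the confidential pipeline, and the path is illustrative):

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.svm import SVC

# Toy stand-in for the confidential model: a small RBF SVC on synthetic data.
rng = np.random.RandomState(0)
x = rng.randn(50, 3)
y = (x[:, 0] > 0).astype(int)
model = SVC(kernel='rbf', probability=True, random_state=0).fit(x, y)

# Serialize the fitted estimator and restore it.
path = os.path.join(tempfile.mkdtemp(), 'model.joblib')
dump(model, path)
restored = load(path)

# The restored estimator reproduces the original predictions exactly, so the
# dumped file plus the two offending inputs should suffice to reproduce the report.
assert np.array_equal(model.predict_proba(x), restored.predict_proba(x))
```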

A simplified relevant example of my pipeline is the following:

pipeline = Pipeline(steps=[('preprocess', MaxAbsScaler()),
                           ('svm', SVC(kernel='rbf', probability=True,
                                       random_state=0, decision_function_shape='ovr',
                                       break_ties=False, ...))])
params = [{...}]
model = GridSearchCV(pipeline, params, ...).fit(x_train, y_train).best_estimator_

# and now I get
prob1 = model.predict_proba(x_test1)
prob2 = model.predict_proba(x_test2_wildly_diff_from_test1)
assert np.abs(prob1 - prob2).max() < 1e-10

For example, using random_state=0:

>>> import numpy as np
>>> from scipy.spatial import distance
# how far are the two x_test vectors from each other?
>>> distance.cosine(x_test1, x_test2)  # EDIT: very far apart angle-wise e.g. 90°
1.0280449512858494
>>> distance.euclidean(x_test1, x_test2) # very far in euclidean distance
30675.221284568033
>>> model.predict_proba(x_test1)
array([[0.86879653, 0.13120347]])
>>> model.predict_proba(x_test2)
array([[0.86879653, 0.13120347]])
>>> model.decision_function(x_test1)
array([-0.03474242])
>>> model.decision_function(x_test2)
array([-0.03474242])
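For what it's worth, identical outputs like these can arise legitimately when the test points land far outside the training range: after MaxAbsScaler, the RBF kernel values exp(-gamma * ||x - sv||**2) underflow to zero for every support vector, so the decision function collapses to the constant intercept term for any sufficiently distant input. A self-contained sketch on synthetic data (default hyperparameters, not the original pipeline) showing that effect:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC

# Synthetic two-class problem.
rng = np.random.RandomState(0)
x_train = rng.randn(200, 5)
y_train = (x_train[:, 0] > 0).astype(int)

model = Pipeline(steps=[('preprocess', MaxAbsScaler()),
                        ('svm', SVC(kernel='rbf', probability=True,
                                    random_state=0))]).fit(x_train, y_train)

# Two wildly different points, both far outside the training range: the RBF
# kernel underflows to 0 against every support vector, leaving only the
# constant intercept term in the decision function.
x_far1 = np.full((1, 5), 1e4)
x_far2 = np.full((1, 5), -1e4)

print(model.decision_function(x_far1))  # same constant value...
print(model.decision_function(x_far2))  # ...for both distant points
print(model.predict_proba(x_far1))
print(model.predict_proba(x_far2))
```

Whether this is what happens with the real dataset depends on how the offending x_test vectors compare to the training range, which the dumped model would reveal.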

UPDATE: what bothers me is not that the two probabilities are close; what spooks me is that two completely different vectors end up at the exact same distance (to within 1e-10) from the decision boundary according to decision_function(...), and therefore get the exact same probability from predict_proba(...) as well. I'm still thinking about how to further validate and scrutinize this case ...
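One way to scrutinize it further: check whether the RBF kernel has saturated for the offending inputs. If the kernel row against all support vectors underflows to zero, decision_function(...) reduces to the constant intercept for any such input, which would explain identical values down to 1e-10. A diagnostic sketch (a toy model stands in for the real one, and svm._gamma is a private scikit-learn attribute that may change between versions):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC

# Toy stand-in for the real fitted pipeline.
rng = np.random.RandomState(0)
x_train = rng.randn(200, 5)
y_train = (x_train[:, 0] > 0).astype(int)
model = Pipeline(steps=[('preprocess', MaxAbsScaler()),
                        ('svm', SVC(kernel='rbf', probability=True,
                                    random_state=0))]).fit(x_train, y_train)

x_test = np.full((1, 5), 1e4)  # stand-in for one of the offending inputs

svm = model.named_steps['svm']
x_scaled = model.named_steps['preprocess'].transform(x_test)

# Kernel values against every support vector; svm._gamma holds the resolved
# numeric gamma (private attribute -- an assumption about sklearn internals).
k = rbf_kernel(x_scaled, svm.support_vectors_, gamma=svm._gamma)
print('max kernel value:', k.max())        # ~0.0 when the kernel saturates
print('intercept:       ', svm.intercept_)
print('decision value:  ', model.decision_function(x_test))
# When k.max() is 0, the decision value collapses to the intercept
# (up to libsvm's sign convention in the binary case).
```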
