### Describe the workflow you want to enable
I wrote some code for a demo that looks like this:
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import FixedThresholdClassifier, train_test_split
from tqdm import trange

X, y = make_classification(
    n_samples=10_000, weights=[0.9, 0.1], class_sep=0.8, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
classifier = LogisticRegression(random_state=0).fit(X_train, y_train)

n_steps = 200
metrics = []
for i in trange(1, n_steps):
    classifier_other_threshold = FixedThresholdClassifier(
        classifier, threshold=i / n_steps, response_method="predict_proba"
    ).fit(X_train, y_train)
    y_pred = classifier_other_threshold.predict(X_train)
    metrics.append({
        'threshold': i / n_steps,
        'f1': f1_score(y_train, y_pred),
        'precision': precision_score(y_train, y_pred),
        'recall': recall_score(y_train, y_pred),
        'accuracy': accuracy_score(y_train, y_pred)
    })
```
The goal here is to log some statistics, but I was surprised to see that this took over 2 minutes to run. Granted, I am not doing anything in parallel, but it's only 10,000 datapoints that need to be predicted/thresholded, so it felt like something was up.
I figured I'd rewrite the code a bit and was able to confirm that, in all likelihood, the `FixedThresholdClassifier` is refitting the underlying classifier on every call to `fit`.
```python
n_steps = 200
metrics = []
for i in trange(1, n_steps):
    # classifier_other_threshold = FixedThresholdClassifier(
    #     classifier, threshold=i / n_steps, response_method="predict_proba"
    # ).fit(X_train, y_train)
    y_pred = classifier.predict_proba(X_train)[:, 1] > (i / n_steps)
    metrics.append({
        'threshold': i / n_steps,
        'f1': f1_score(y_train, y_pred),
        'precision': precision_score(y_train, y_pred),
        'recall': recall_score(y_train, y_pred),
        'accuracy': accuracy_score(y_train, y_pred)
    })
```
This was a whole lot faster: it only took 16s on my machine. I also had a brief look at the current implementation, and it indeed seems to always refit right now.
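For context, here is a simplified sketch of what I believe `fit` is doing, based on my quick read of the source (the exact names may differ; this is not the verbatim scikit-learn code). The point is that the wrapped estimator gets cloned and refit on every call:

```python
# Simplified sketch of FixedThresholdClassifier.fit as I understand it;
# not the verbatim scikit-learn source.
from sklearn.base import clone

def fit(self, X, y, **params):
    # The wrapped estimator is cloned and refit on every call to fit(),
    # even when only the threshold changed between calls (as in my loop).
    self.estimator_ = clone(self.estimator).fit(X, y, **params)
    return self
```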
### Describe your proposed solution
I have had a similar observation on the scikit-lego side of things (link a, link b) with the `Thresholder` meta-estimator we have there, and we addressed it by adding a `refit` parameter. If that is set to `False`, the estimator won't refit the underlying estimator.

I understand that this use case is only relevant outside of a pipeline, but it may be nice for folks who want to use this component manually.
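To sketch what I mean (hypothetical code, not the actual scikit-learn API; the `refit` name is borrowed from scikit-lego's `Thresholder`), the behavior could look something like:

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils.validation import check_is_fitted


class FixedThresholdClassifierWithRefit(BaseEstimator, ClassifierMixin):
    """Hypothetical variant of FixedThresholdClassifier with a refit flag."""

    def __init__(self, estimator, threshold=0.5, refit=True):
        self.estimator = estimator
        self.threshold = threshold
        self.refit = refit

    def fit(self, X, y):
        if self.refit:
            # Current behavior: clone and refit the wrapped estimator.
            self.estimator_ = clone(self.estimator).fit(X, y)
        else:
            # Proposed behavior: reuse the already-fitted estimator as-is.
            check_is_fitted(self.estimator)
            self.estimator_ = self.estimator
        return self

    def predict(self, X):
        # Apply the fixed threshold to the positive-class probabilities.
        return (self.estimator_.predict_proba(X)[:, 1] >= self.threshold).astype(int)
```

With `refit=False`, my loop above would only pay for `predict_proba` per threshold, which matches the fast rewritten version.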
### Describe alternatives you've considered, if relevant
I think the new `TunedThresholdClassifierCV` could also allow for the use case that I have in mind, but that will only work if it allows for extra metrics. There is currently an open discussion on that here.
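For reference, this is roughly how I'd use it today (a sketch, assuming the API stays as currently documented). As far as I can tell it optimizes a single `scoring` metric, which is why it doesn't cover logging several metrics per threshold like my loop above:

```python
# TunedThresholdClassifierCV tunes the threshold against one scoring metric;
# store_cv_results=True keeps the per-threshold scores for that metric only.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV

tuned = TunedThresholdClassifierCV(
    LogisticRegression(random_state=0),
    scoring="f1",
    store_cv_results=True,
).fit(X_train, y_train)
print(tuned.best_threshold_, tuned.best_score_)
```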
### Additional context

_No response_