Don't refit in FixedThresholdClassifier when original model is already trained.  #29062

Closed
@koaning

Description


Describe the workflow you want to enable

I wrote some code for a demo that looks like this:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import FixedThresholdClassifier, train_test_split
from tqdm import trange

X, y = make_classification(
    n_samples=10_000, weights=[0.9, 0.1], class_sep=0.8, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

classifier = LogisticRegression(random_state=0).fit(X_train, y_train)

n_steps = 200
metrics = []
for i in trange(1, n_steps):
    # calling .fit() here retrains the wrapped classifier on every iteration
    classifier_other_threshold = FixedThresholdClassifier(
        classifier, threshold=i/n_steps, response_method="predict_proba"
    ).fit(X_train, y_train)
    
    y_pred = classifier_other_threshold.predict(X_train)
    metrics.append({
        'threshold': i/n_steps,
        'f1': f1_score(y_train, y_pred),
        'precision': precision_score(y_train, y_pred),
        'recall': recall_score(y_train, y_pred),
        'accuracy': accuracy_score(y_train, y_pred)
    })

The goal here is to log some statistics, but I was surprised to see that this took over two minutes to run. Granted, I am not doing anything in parallel, but it's only 10,000 datapoints that need to be predicted/thresholded. So it felt like something was up.

I figured I'd rewrite the code a bit and was able to confirm that, most likely, FixedThresholdClassifier is refitting the underlying classifier on every call to fit.

n_steps = 200
metrics = []
for i in trange(1, n_steps):
    # classifier_other_threshold = FixedThresholdClassifier(
    #     classifier, threshold=i/n_steps, response_method="predict_proba"
    # ).fit(X_train, y_train)
    
    # apply the threshold manually instead of fitting a meta-estimator
    y_pred = classifier.predict_proba(X_train)[:, 1] > (i / n_steps)
    metrics.append({
        'threshold': i/n_steps,
        'f1': f1_score(y_train, y_pred),
        'precision': precision_score(y_train, y_pred),
        'recall': recall_score(y_train, y_pred),
        'accuracy': accuracy_score(y_train, y_pred)
    })

This was a whole lot faster; it only took 16 seconds on my machine. I also had a brief look at the current implementation, and it does indeed seem to always refit right now.
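
For reference, here is a minimal sketch of the behavior I believe I'm seeing. This is illustrative, not the actual scikit-learn source; the class and attribute names are made up for the example:

from sklearn.base import clone

# Hypothetical sketch of the suspected behavior: every call to fit() clones
# the wrapped estimator and retrains it from scratch, even when it was
# already fitted before being passed in.
class SketchFixedThresholdClassifier:
    def __init__(self, estimator, threshold=0.5):
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y):
        # this clone + fit is the expensive step when looping over thresholds
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def predict(self, X):
        return (self.estimator_.predict_proba(X)[:, 1] >= self.threshold).astype(int)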

Describe your proposed solution

I have had a similar observation on the scikit-lego side of things (link a, link b) with the Thresholder meta-estimator we have there, and we addressed it by adding a refit parameter. If that is set to False, the estimator won't refit the underlying estimator.

I understand that this use case is only relevant outside of a pipeline, but it may be nice for folks who want to use this component manually. A rough sketch of what that could look like is shown below.
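
The refit name here mirrors the scikit-lego Thresholder parameter and is only meant to illustrate the proposed behavior, not a concrete API:

from sklearn.base import clone

# Illustrative only: how a refit flag could short-circuit the retraining step.
def _fit_underlying(estimator, X, y, refit=True):
    if refit:
        # current behavior: clone and retrain from scratch
        return clone(estimator).fit(X, y)
    # proposed behavior: reuse the already-fitted estimator as-is
    return estimator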

Describe alternatives you've considered, if relevant

I think the new TunedThresholdClassifierCV could also allow for the use case that I have in mind, but that will only work if it allows for extra metrics. There is currently an open discussion on that here.
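
For context, here is a minimal example of how I understand TunedThresholdClassifierCV works today: it tunes the threshold against a single scoring function, rather than logging several metrics per threshold the way my loop above does:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV

X, y = make_classification(
    n_samples=10_000, weights=[0.9, 0.1], class_sep=0.8, random_state=42
)

# tunes the decision threshold for one metric (f1 here); per-threshold scores
# for that single metric are kept in cv_results_ when store_cv_results=True
tuned = TunedThresholdClassifierCV(
    LogisticRegression(random_state=0),
    scoring="f1",
    store_cv_results=True,
).fit(X, y)

print(tuned.best_threshold_, tuned.best_score_)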

Additional context

No response
