### Describe the workflow you want to enable
I wrote some code for a demo that looks like this:
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import FixedThresholdClassifier, train_test_split
from tqdm import trange

X, y = make_classification(
    n_samples=10_000, weights=[0.9, 0.1], class_sep=0.8, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
classifier = LogisticRegression(random_state=0).fit(X_train, y_train)

n_steps = 200
metrics = []
for i in trange(1, n_steps):
    classifier_other_threshold = FixedThresholdClassifier(
        classifier, threshold=i / n_steps, response_method="predict_proba"
    ).fit(X_train, y_train)
    y_pred = classifier_other_threshold.predict(X_train)
    metrics.append({
        'threshold': i / n_steps,
        'f1': f1_score(y_train, y_pred),
        'precision': precision_score(y_train, y_pred),
        'recall': recall_score(y_train, y_pred),
        'accuracy': accuracy_score(y_train, y_pred)
    })
```
The goal here is to log some statistics, but I was surprised to see that this took over 2 minutes to run. Granted, I am not doing anything in parallel, but it's only 10,000 datapoints that need to be predicted/thresholded, so it felt like something was up.
I figured I'd rewrite the code a bit and was able to confirm that, in all likelihood, the `FixedThresholdClassifier` is refitting the underlying classifier on every call to `fit`.
```python
n_steps = 200
metrics = []
for i in trange(1, n_steps):
    # classifier_other_threshold = FixedThresholdClassifier(
    #     classifier, threshold=i / n_steps, response_method="predict_proba"
    # ).fit(X_train, y_train)
    y_pred = classifier.predict_proba(X_train)[:, 1] > (i / n_steps)
    metrics.append({
        'threshold': i / n_steps,
        'f1': f1_score(y_train, y_pred),
        'precision': precision_score(y_train, y_pred),
        'recall': recall_score(y_train, y_pred),
        'accuracy': accuracy_score(y_train, y_pred)
    })
```
This was a whole lot faster: it only took 16s on my machine. I also had a brief look at the current implementation, and it indeed seems to always refit right now.
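For context, here is a simplified sketch of what I believe `fit` is doing, based on my quick read of the source (the exact names may differ; this is not the verbatim scikit-learn code). The point is that the wrapped estimator gets cloned and refit on every call:

```python
# Simplified sketch of FixedThresholdClassifier.fit as I understand it;
# not the verbatim scikit-learn source.
from sklearn.base import clone

def fit(self, X, y, **params):
    # The wrapped estimator is cloned and refit on every call to fit(),
    # even when only the threshold changed between calls (as in my loop).
    self.estimator_ = clone(self.estimator).fit(X, y, **params)
    return self
```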
### Describe your proposed solution
I have had a similar observation on the scikit-lego side of things (link a, link b) with the `Thresholder` meta-estimator we have there, and we addressed it by adding a `refit` parameter. If that is set to `False`, the estimator won't refit the underlying estimator.

I understand that this use case is only relevant outside of a pipeline, but it may be nice for folks who want to use this component manually.
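To sketch what I mean (hypothetical code, not the actual scikit-learn API; the `refit` name is borrowed from scikit-lego's `Thresholder`), the behavior could look something like:

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils.validation import check_is_fitted


class FixedThresholdClassifierWithRefit(BaseEstimator, ClassifierMixin):
    """Hypothetical variant of FixedThresholdClassifier with a refit flag."""

    def __init__(self, estimator, threshold=0.5, refit=True):
        self.estimator = estimator
        self.threshold = threshold
        self.refit = refit

    def fit(self, X, y):
        if self.refit:
            # Current behavior: clone and refit the wrapped estimator.
            self.estimator_ = clone(self.estimator).fit(X, y)
        else:
            # Proposed behavior: reuse the already-fitted estimator as-is.
            check_is_fitted(self.estimator)
            self.estimator_ = self.estimator
        return self

    def predict(self, X):
        # Apply the fixed threshold to the positive-class probabilities.
        return (self.estimator_.predict_proba(X)[:, 1] >= self.threshold).astype(int)
```

With `refit=False`, my loop above would only pay for `predict_proba` per threshold, which matches the fast rewritten version.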
### Describe alternatives you've considered, if relevant
I think the new `TunedThresholdClassifierCV` could also allow for the use case that I have in mind, but that will only work if it allows for extra metrics. There is currently an open discussion on that here.
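For reference, this is roughly how I'd use it today (a sketch, assuming the API stays as currently documented). As far as I can tell it optimizes a single `scoring` metric, which is why it doesn't cover logging several metrics per threshold like my loop above:

```python
# TunedThresholdClassifierCV tunes the threshold against one scoring metric;
# store_cv_results=True keeps the per-threshold scores for that metric only.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV

tuned = TunedThresholdClassifierCV(
    LogisticRegression(random_state=0),
    scoring="f1",
    store_cv_results=True,
).fit(X_train, y_train)
print(tuned.best_threshold_, tuned.best_score_)
```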
### Additional context

_No response_