DOC attempt to fix lorenz_curve in plot tweedie regression example #30198
ping @ogrisel @antoinebaker @snath-xoc for reviews here.
Thanks for the PR. I updated the snippet to compare the two strategies on the synthetic repeated/reweighted data and to make one plot for each:

# %%
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)
def lorenz_curve_linspace(
    frequency, exposure, label=None, use_cumulated_exposure=True
):
    ranking = np.argsort(frequency)
    ranked_frequencies = frequency[ranking]
    ranked_exposure = exposure[ranking]
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_claims = cumulated_claims / cumulated_claims[-1]
    if use_cumulated_exposure:
        cumulated_exposure = np.cumsum(ranked_exposure).astype(np.float64)
        cumulated_exposure /= cumulated_exposure[-1]
    else:
        cumulated_exposure = np.linspace(0, 1, len(frequency))
    plt.scatter(
        cumulated_exposure,
        cumulated_claims,
        marker=".",
        alpha=0.5,
        label=label,
    )
    return cumulated_exposure, cumulated_claims
y_true_repeated = y_true.repeat(exposure)
y_pred_repeated = y_pred.repeat(exposure)
sample_weight_repeated = np.ones_like(y_pred_repeated)
res_repeated = lorenz_curve_linspace(
    y_pred_repeated, sample_weight_repeated, label="repeated"
)
y_true_weighted = y_true
y_pred_weighted = y_pred
sample_weight_weighted = exposure
res_weighted = lorenz_curve_linspace(
    y_pred_weighted, sample_weight_weighted, label="weighted"
)
plt.legend()
plt.title("Lorenz curve using cumulated exposure as x-axis");
# %%
res_repeated = lorenz_curve_linspace(
    y_pred_repeated,
    sample_weight_repeated,
    label="repeated",
    use_cumulated_exposure=False,
)
res_weighted = lorenz_curve_linspace(
    y_pred_weighted,
    sample_weight_weighted,
    label="weighted",
    use_cumulated_exposure=False,
)
plt.legend()
plt.title("Lorenz curve using linear exposure as x-axis");

The resulting plots confirm that using the cumulated exposure as the x-axis is the correct way to plot the Lorenz curve; otherwise, the expected repetition/reweighting equivalence does not hold.
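The repetition/reweighting equivalence can also be checked numerically, without plotting. The sketch below is only an illustration: `lorenz_points` and `max_gap` are hypothetical helper names, the curve is assumed to start at the origin when the cumulated exposure is used, and the data are the same synthetic draws as in the snippet above.

```python
import numpy as np


def lorenz_points(frequency, exposure, use_cumulated_exposure=True):
    """Return the (x, y) points of the Lorenz curve, without plotting."""
    ranking = np.argsort(frequency)
    ranked_frequencies = frequency[ranking]
    ranked_exposure = exposure[ranking]
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_claims = cumulated_claims / cumulated_claims[-1]
    if use_cumulated_exposure:
        x = np.cumsum(ranked_exposure).astype(np.float64)
        x /= x[-1]
    else:
        x = np.linspace(0, 1, len(frequency))
    return x, cumulated_claims


def max_gap(y_pred, exposure, use_cumulated_exposure):
    """Largest vertical distance between the repeated and weighted curves."""
    y_pred_repeated = y_pred.repeat(exposure)
    unit_weights = np.ones_like(y_pred_repeated)
    x_rep, y_rep = lorenz_points(
        y_pred_repeated, unit_weights, use_cumulated_exposure
    )
    x_w, y_w = lorenz_points(y_pred, exposure, use_cumulated_exposure)
    if use_cumulated_exposure:
        # Assumption: the curve starts at the origin (0, 0).
        x_rep = np.concatenate([[0.0], x_rep])
        y_rep = np.concatenate([[0.0], y_rep])
    # Evaluate the repeated curve at the x positions of the weighted curve.
    return np.max(np.abs(np.interp(x_w, x_rep, y_rep) - y_w))


rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)

gap_cumulated = max_gap(y_pred, exposure, use_cumulated_exposure=True)
gap_linspace = max_gap(y_pred, exposure, use_cumulated_exposure=False)
print(f"max gap, cumulated exposure x-axis: {gap_cumulated:.2e}")
print(f"max gap, linspace x-axis: {gap_linspace:.2e}")
```

With the cumulated exposure, the weighted points should fall on the repeated curve up to floating-point error, while the linspace x-axis should leave a visible gap.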
I also took a look at the impact on the example figure in the notebook. The Gini score values are a bit larger than what we had in main (the Lorenz curves lie slightly further away from the diagonal), but otherwise the curves are qualitatively quite similar to what we had before. In particular, the text of the example does not need to be updated.
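To make the link between the Gini score and the distance to the diagonal concrete, here is a minimal sketch. It assumes the Gini index is taken as one minus twice the area under the Lorenz curve, with the curve starting at the origin and the area computed by the trapezoidal rule; `lorenz_points` and `gini_index` are hypothetical helper names, not functions from the example.

```python
import numpy as np


def lorenz_points(frequency, exposure):
    """Lorenz curve points using the cumulated exposure as x-axis."""
    ranking = np.argsort(frequency)
    ranked_frequencies = frequency[ranking]
    ranked_exposure = exposure[ranking]
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_exposure = np.cumsum(ranked_exposure).astype(np.float64)
    # Assumption: prepend the origin so that the curve starts at (0, 0).
    x = np.concatenate([[0.0], cumulated_exposure / cumulated_exposure[-1]])
    y = np.concatenate([[0.0], cumulated_claims / cumulated_claims[-1]])
    return x, y


def gini_index(frequency, exposure):
    """1 - 2 * (area under the Lorenz curve), via the trapezoidal rule."""
    x, y = lorenz_points(frequency, exposure)
    area = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0)
    return 1.0 - 2.0 * area


rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)

gini_weighted = gini_index(y_pred, exposure)
y_pred_repeated = y_pred.repeat(exposure)
gini_repeated = gini_index(y_pred_repeated, np.ones_like(y_pred_repeated))
print(gini_weighted, gini_repeated)
```

Under these assumptions the weighted and repeated data give the same Gini index, and a curve lying further below the diagonal gives a larger value.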
+1 for merge. Thanks for the follow-up @m-maggi.
Sorry I was too late to give feedback, but I think the …
@antoinebaker I agree. Please feel free to open a PR with that fix, and I apologize for merging too early :)
Reference Issues/PRs
Fix attempt of #28534
What does this implement/fix? Explain your changes.
Take the definition of the Lorenz curve from the "Poisson regression and non-normal loss" example and use it in the "Tweedie regression on insurance claims" example.
Any other comments?
Following the discussion in #28534, it seems to me that the Lorenz curve should not use a linspace for the x values of the curve if the data is weighted.
Example snippet to test behaviour when linspace is used:
Results in
Snippet for results using the version from the "Poisson regression and non-normal loss" example, also implemented in this PR: