DOC attempt to fix lorenz_curve in plot tweedie regression example #30198

Merged 1 commit on Nov 25, 2024


m-maggi
Contributor

@m-maggi m-maggi commented Nov 2, 2024

Reference Issues/PRs

Fix attempt of #28534

What does this implement/fix? Explain your changes.

Take the definition of the Lorenz curve from the "Poisson regression and non-normal loss" example and use it in the "Tweedie regression on insurance claims" example.

Any other comments?

Following the discussion in #28534, it seems to me that the Lorenz curve should not use a linspace for the x values of the curve if the data is weighted.
Example snippet to test the behaviour when linspace is used:

import matplotlib.pyplot as plt
import numpy as np


rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)

def lorenz_curve_linspace(frequency, exposure):
    ranking = np.argsort(frequency)
    ranked_frequencies = frequency[ranking]
    ranked_exposure = exposure[ranking]
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_claims = cumulated_claims / cumulated_claims[-1]
    cumulated_exposure = np.linspace(0, 1, len(frequency))
    plt.scatter(
        cumulated_exposure,
        cumulated_claims,
        marker=".",
        alpha=0.5,
    )
    return cumulated_exposure, cumulated_claims

y_true_repeated = y_true.repeat(exposure)
y_pred_repeated = y_pred.repeat(exposure)
sample_weight = np.ones_like(y_pred_repeated)
res = lorenz_curve_linspace(y_pred_repeated, sample_weight)

y_true_weighted = y_true
y_pred_weighted = y_pred
sample_weight = exposure
res = lorenz_curve_linspace(y_pred_weighted, sample_weight)

Results in

image
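To illustrate why a linspace x-axis ignores the weights, here is a minimal sketch on a toy case. The three exposures are hypothetical numbers chosen for illustration, not data from the PR.

```python
import numpy as np

# Hypothetical exposures for three policies, already sorted by predicted risk.
toy_exposure = np.array([1, 1, 8])

# linspace ignores the weights: the points are evenly spaced on the x-axis.
x_linspace = np.linspace(0, 1, len(toy_exposure))

# Cumulated exposure gives each policy a share proportional to its exposure.
x_cumulated = np.cumsum(toy_exposure) / toy_exposure.sum()

print(x_linspace)   # [0.  0.5 1. ]
print(x_cumulated)  # [0.1 0.2 1. ]
```

Repeating the third policy 8 times would put 10 evenly spaced points on the x-axis, which matches the cumulated version but not the 3-point linspace.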

Snippet for the results obtained with the Lorenz curve definition from the Poisson regression and non-normal loss example, which this PR also adopts:

import matplotlib.pyplot as plt
import numpy as np


rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)

def lorenz_curve(frequency, exposure, weighted=True):
    ranking = np.argsort(frequency)
    ranked_frequencies = frequency[ranking]
    ranked_exposure = exposure[ranking]
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_claims = cumulated_claims / cumulated_claims[-1]
    if weighted:
        cumulated_exposure = np.cumsum(ranked_exposure)
        cumulated_exposure = cumulated_exposure / cumulated_exposure[-1]
        plt.scatter(
            cumulated_exposure,
            cumulated_claims,
            marker=".",
            alpha=0.5,
            label="weighted",
        )
    else:
        cumulated_exposure = np.linspace(0, 1, len(frequency))
        plt.scatter(
            cumulated_exposure,
            cumulated_claims,
            marker=".",
            alpha=0.5,
            label="unweighted",
        )
    return cumulated_exposure, cumulated_claims

y_true_repeated = y_true.repeat(exposure)
y_pred_repeated = y_pred.repeat(exposure)
sample_weight = np.ones_like(y_pred_repeated)
res = lorenz_curve(y_pred_repeated, sample_weight, False)

y_true_weighted = y_true
y_pred_weighted = y_pred
sample_weight = exposure
res = lorenz_curve(y_pred_weighted, sample_weight, True)
plt.legend();

image


github-actions bot commented Nov 2, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 21eed35. Link to the linter CI: here

@adrinjalali
Member

ping @ogrisel @antoinebaker @snath-xoc for reviews here.

Contributor

@OmarManzoor OmarManzoor left a comment

LGTM. Thanks @m-maggi

@ogrisel Could you have a look and merge?

@OmarManzoor added the "Waiting for Second Reviewer" label (first reviewer is done, need a second one) on Nov 15, 2024
@ogrisel
Member

ogrisel commented Nov 25, 2024

Thanks for the PR. I updated the snippet to compare the 2 strategies on the synthetic repeated/reweighted data and make one plot for each:

# %%
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)


def lorenz_curve_linspace(frequency, exposure, label=None, use_cumulated_exposure=True):
    ranking = np.argsort(frequency)
    ranked_frequencies = frequency[ranking]
    ranked_exposure = exposure[ranking]
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_claims = cumulated_claims / cumulated_claims[-1]

    if use_cumulated_exposure:
        cumulated_exposure = np.cumsum(ranked_exposure).astype(np.float64)
        cumulated_exposure /= cumulated_exposure[-1]
    else:
        cumulated_exposure = np.linspace(0, 1, len(frequency))

    plt.scatter(
        cumulated_exposure,
        cumulated_claims,
        marker=".",
        alpha=0.5,
        label=label,
    )
    return cumulated_exposure, cumulated_claims


y_true_repeated = y_true.repeat(exposure)
y_pred_repeated = y_pred.repeat(exposure)
sample_weight_repeated = np.ones_like(y_pred_repeated)
res_repeated = lorenz_curve_linspace(
    y_pred_repeated, sample_weight_repeated, label="repeated"
)

y_true_weighted = y_true
y_pred_weighted = y_pred
sample_weight_weighted = exposure
res_weighted = lorenz_curve_linspace(
    y_pred_weighted, sample_weight_weighted, label="weighted"
)

plt.legend()
plt.title("Lorenz curve using cumulated exposure as x-axis");
# %%
res_repeated = lorenz_curve_linspace(
    y_pred_repeated,
    sample_weight_repeated,
    label="repeated",
    use_cumulated_exposure=False,
)
res_weighted = lorenz_curve_linspace(
    y_pred_weighted,
    sample_weight_weighted,
    label="weighted",
    use_cumulated_exposure=False,
)

plt.legend()
plt.title("Lorenz curve using linear exposure as x-axis");

Here are the resulting plots:

image
image

This confirms that using the cumulated exposure as the x-axis of the Lorenz curve is the correct solution. Otherwise, the expected equivalence between repeating samples and reweighting them does not hold.
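The repetition/reweighting equivalence can also be checked numerically rather than visually. The following is a minimal sketch, assuming a hypothetical helper `lorenz_points` that mirrors the cumulated-exposure branch of the snippet above without plotting and prepends the origin (0, 0) to both curves. It evaluates the "repeated" curve at the x positions of the "weighted" curve with `np.interp` and compares the y values.

```python
import numpy as np


def lorenz_points(frequency, exposure):
    """Hypothetical helper: Lorenz curve with cumulated exposure on the x-axis."""
    ranking = np.argsort(frequency)
    ranked_frequencies = frequency[ranking]
    ranked_exposure = exposure[ranking]
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_claims = cumulated_claims / cumulated_claims[-1]
    cumulated_exposure = np.cumsum(ranked_exposure) / ranked_exposure.sum()
    # Prepend the origin so that both curves start at (0, 0).
    return (
        np.insert(cumulated_exposure, 0, 0.0),
        np.insert(cumulated_claims, 0, 0.0),
    )


rng = np.random.default_rng(0)
n = 30
y_pred = rng.uniform(low=0, high=10, size=n)
exposure = rng.integers(low=0, high=10, size=n)

x_weighted, y_weighted = lorenz_points(y_pred, exposure)
x_repeated, y_repeated = lorenz_points(
    y_pred.repeat(exposure), np.ones(exposure.sum())
)

# Evaluate the repeated curve at the x positions of the weighted curve.
y_repeated_at_weighted = np.interp(x_weighted, x_repeated, y_repeated)
curves_match = np.allclose(y_repeated_at_weighted, y_weighted)
print(curves_match)
```

Under these assumptions, every point of the weighted curve should lie on the repeated curve, so the check is expected to succeed.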

Member

@ogrisel ogrisel left a comment

I also took a look at the impact on the example figure in the notebook. The Gini score values are a bit larger than what we had in main (the Lorenz curves lie slightly further away from the diagonal), but otherwise they are qualitatively quite similar to what we had before. In particular, the text of the example does not need to be updated.
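For reference, here is a minimal sketch of how a Gini score can be derived from a Lorenz curve, assuming the common definition of one minus twice the area under the curve. The helpers `lorenz_points` and `gini_from_curve` are hypothetical names used for illustration, and the data is synthetic rather than the insurance dataset of the example.

```python
import numpy as np


def lorenz_points(frequency, exposure):
    """Hypothetical helper: Lorenz curve with cumulated exposure on the x-axis."""
    ranking = np.argsort(frequency)
    ranked_frequencies = frequency[ranking]
    ranked_exposure = exposure[ranking]
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_claims = cumulated_claims / cumulated_claims[-1]
    cumulated_exposure = np.cumsum(ranked_exposure) / ranked_exposure.sum()
    # Prepend the origin so that the area is integrated from x = 0.
    return (
        np.insert(cumulated_exposure, 0, 0.0),
        np.insert(cumulated_claims, 0, 0.0),
    )


def gini_from_curve(x, y):
    """Hypothetical helper: 1 - 2 * area under the curve (trapezoidal rule)."""
    area = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)
    return 1 - 2 * area


rng = np.random.default_rng(0)
n = 30
y_pred = rng.uniform(low=0, high=10, size=n)
exposure = rng.integers(low=0, high=10, size=n)

gini_weighted = gini_from_curve(*lorenz_points(y_pred, exposure))
gini_repeated = gini_from_curve(
    *lorenz_points(y_pred.repeat(exposure), np.ones(exposure.sum()))
)
print(gini_weighted, gini_repeated)
```

With the cumulated exposure on the x-axis, the weighted and repeated versions of the data should yield the same Gini score up to floating-point error.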

+1 for merge. Thanks for the follow-up @m-maggi.

@ogrisel ogrisel merged commit fa5d727 into scikit-learn:main Nov 25, 2024
38 checks passed
@ogrisel ogrisel mentioned this pull request Nov 25, 2024
17 tasks
@antoinebaker
Contributor

Sorry I was too late to give feedback, but I think the xlabel should be changed accordingly, to something like "Cumulative proportion of exposure (from safest to riskiest)" as in the Poisson tutorial, or "Fraction of total exposure\n(ordered by model from safest to riskiest)". The tutorial text introducing the Lorenz curve should now also state that we are plotting against the cumulative exposure on the x-axis.
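A minimal sketch of what the suggested label change could look like on a matplotlib axis follows. The data is synthetic, and the title, y-axis label and legend entries are assumptions made for illustration. Only the x-axis label comes from the suggestion above.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n = 30
y_pred = rng.uniform(low=0, high=10, size=n)
exposure = rng.integers(low=1, high=10, size=n)  # synthetic, strictly positive

# Lorenz curve with cumulated exposure on the x-axis.
ranking = np.argsort(y_pred)
cumulated_claims = np.cumsum(y_pred[ranking] * exposure[ranking])
cumulated_claims = cumulated_claims / cumulated_claims[-1]
cumulated_exposure = np.cumsum(exposure[ranking]) / exposure.sum()

fig, ax = plt.subplots()
ax.plot(cumulated_exposure, cumulated_claims, label="model")
ax.plot([0, 1], [0, 1], linestyle="--", color="black", label="random baseline")
ax.set(
    title="Lorenz curve",  # assumed title
    xlabel="Cumulative proportion of exposure\n(from safest to riskiest)",
    ylabel="Cumulative proportion of claims",  # assumed label
)
ax.legend()
```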

@ogrisel
Member

ogrisel commented Nov 26, 2024

@antoinebaker I agree, please feel free to open a PR with that fix and I apologize for merging too early :)

Labels: Documentation, Waiting for Second Reviewer