[MRG+1] TST Move roc_auc_score from METRIC_UNDEFINED_BINARY to METRIC_UNDEFINED_MULTICLASS #9786
Conversation
What kinds of discrepancy between expected and actual do we get if we leave the scaling as it was?
We may just need a higher tolerance in
@jnothman Thanks.

n_samples = 50
random_state = check_random_state(0)
y_true = random_state.randint(0, 2, size=(n_samples, ))
y_pred = random_state.randint(0, 2, size=(n_samples, ))
y_score = random_state.random_sample(size=(n_samples,))
rng = np.random.RandomState(0)
sample_weight = rng.randint(1, 10, size=len(y_true))

roc_auc_score(y_true, y_score, sample_weight=sample_weight)
# 0.38036523593708349
roc_auc_score(y_true, y_score, sample_weight=sample_weight*2)
# 0.38036523593708349
roc_auc_score(y_true, y_score, sample_weight=sample_weight*0.3)
# 0.38733004532124787

I checked the implementation and it seems right. The difference seems to be introduced by floating-point arithmetic.
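A minimal standalone sketch of the effect described above (plain NumPy, my own example, not the PR's code): scaling float weights does not commute bit-for-bit with the cumulative sums that roc_curve relies on, which is enough to perturb a subsequent sort.

import numpy as np

rng = np.random.RandomState(0)
w = rng.randint(1, 10, size=50).astype(float)

# cumsum(w * 0.3) and cumsum(w) * 0.3 are mathematically identical but need
# not be bit-identical, so a later sort on such values can flip near-ties.
a = np.cumsum(w * 0.3)
b = np.cumsum(w) * 0.3
print(np.allclose(a, b))   # True: the differences are tiny
print((a == b).all())      # often False: not bit-for-bit equal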
As long as we don't have any estimators that require scores in (0, 1), we could use integer scores to encourage more stability...?
Integer scores, where scores are likely to be equal, would also be a more challenging test to pass.
@jnothman Sorry but I don't quite understand. What do you mean by 'integer scores'? Could you please provide more details? Thanks :)
Currently we have
@jnothman Do you mean something like this?

y_true = random_state.randint(0, 2, size=(n_samples, ))
y_score = random_state.randint(0, 10, size=(n_samples, ))
sample_weight = rng.randint(1, 10, size=len(y_true))
roc_auc_score(y_true, y_score, sample_weight=sample_weight)

Such a method works for roc_auc_score, but brier_score_loss requires scores in (0, 1), so the common test cannot pass this way.
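A self-contained version of that point (my sketch; the imports and the /9.0 rescaling are mine, not from the PR):

import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=50)
int_scores = rng.randint(0, 10, size=50)

print(roc_auc_score(y_true, int_scores))           # fine: only the ranking of the scores matters
print(brier_score_loss(y_true, int_scores / 9.0))  # brier_score_loss needs probabilities in [0, 1]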
okay. decimals=2 seems appropriate for 50 samples in any case.
LGTM
@jnothman
Yes, on the face of it that looks weird. assert_allclose is a variant which gives more explicit control of tolerances, if you'd rather.
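For reference, a small sketch of the two assertion styles being discussed, using the AUC values quoted earlier in the thread:

from numpy.testing import assert_almost_equal, assert_allclose

a, b = 0.38036523593708349, 0.38733004532124787  # weighted vs. rescaled-weight AUC from above

# assert_almost_equal(a, b, decimal=2) checks abs(a - b) < 1.5 * 10**-2
assert_almost_equal(a, b, decimal=2)

# assert_allclose exposes the tolerances explicitly
assert_allclose(a, b, rtol=0, atol=1e-2)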
@jnothman Thanks.
Yes, but we'll wait for another review to be sure I'm not missing something!
With
Okay... Perhaps weighted roc needs some more investigation?
@lesteve Thanks for your review.

x1 = [0.6198347107438017, 0.6776859504132231, 0.6776859504132231,
      0.6776859504132231, 0.6776859504132231, 0.6776859504132231,
      0.74380165289256195, 0.74380165289256195, 0.8925619834710744,
      0.92561983471074383, 1.0, 1.0,
      1.0, 1.0, 1.0]
x2 = [0.61983471074380181, 0.6776859504132231, 0.67768595041322344,
      0.67768595041322321, 0.67768595041322321, 0.67768595041322321,
      0.74380165289256217, 0.74380165289256217, 0.89256198347107418,
      0.92561983471074349, 0.99999999999999956, 0.99999999999999967,
      0.99999999999999989, 0.99999999999999967, 1.0]
y1 = [0.62096774193548387, 0.62096774193548387, 0.62903225806451613,
      0.66129032258064513, 0.717741935483871, 0.7338709677419355,
      0.7338709677419355, 0.75806451612903225, 0.75806451612903225,
      0.75806451612903225, 0.75806451612903225, 0.80645161290322576,
      0.81451612903225812, 0.84677419354838712, 0.94354838709677424]
y2 = [0.62096774193548399, 0.62096774193548399, 0.62903225806451624,
      0.66129032258064524, 0.717741935483871, 0.7338709677419355,
      0.7338709677419355, 0.75806451612903225, 0.75806451612903225,
      0.75806451612903225, 0.75806451612903225, 0.80645161290322587,
      0.81451612903225812, 0.84677419354838712, 0.94354838709677424]
x1 = np.array(x1)
x2 = np.array(x2)
y1 = np.array(y1)
y2 = np.array(y2)
np.testing.assert_allclose(x1, x2)  # pass
np.testing.assert_allclose(y1, y2)  # pass
order = np.lexsort((y1, x1))
x1, y1 = x1[order], y1[order]
order = np.lexsort((y2, x2))
x2, y2 = x1[order], y1[order]
print(np.trapz(y1, x1))
# 0.27865902426
print(np.trapz(y2, x2))
# 0.275193281792

If we want to solve the problem, we will need to write a more robust sorting function (e.g., regard two floats as equal if |x1 - x2| < 1e-7) instead of simply calling

optimal_idxs = np.where(np.r_[True,
                              np.logical_or(np.diff(fps, 2), np.diff(tps, 2)), True])[0]

We also have tricky ways to achieve our goal, that is, rounding x and y before sorting:

x = np.round(x, 10)
y = np.round(y, 10)
order = np.lexsort((y, x))

We get a very similar, stable result in this way:

roc_auc_score(y_true, y_score, sample_weight=sample_weight)
# before (unstable): 0.38036523593708349
# after (stable): 0.38036523591966814
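A sketch of the "more robust sorting" idea mentioned above (a hypothetical helper of my own, not what was eventually merged): treat values that agree to ~10 decimal places as ties before lexsorting.

import numpy as np

def lexsort_with_rounding(x, y, decimals=10):
    # Round first so that values differing only in the last float bits
    # are treated as equal ties and sorted consistently.
    x = np.round(np.asarray(x), decimals)
    y = np.round(np.asarray(y), decimals)
    return np.lexsort((y, x))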
We previously attempted to make it more tolerant to small variation in threshold, but it raised too many complaints.
A better option might be to use cleaner thresholds and/or weights. For example, sampling from linspace(0, 1, 32) instead of random floats.
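A sketch of what "cleaner" scores could look like in the common test (my interpretation of the suggestion, not the merged code):

import numpy as np

rng = np.random.RandomState(0)
# Scores drawn from a fixed grid instead of arbitrary random floats;
# exact ties between samples become common.
y_score = rng.choice(np.linspace(0, 1, 32), size=50)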
@jnothman I trust you on this one. If you think that atol=1e-2 is fine, then let's do that. Maybe have a special case for auc in the test with a FIXME to show that the higher tolerance is only needed for roc_auc scores?
This reverts commit 6b2cf79.
@jnothman So it seems the reason for this problem is clear (np.lexsort and np.trapz in the auc function), and I have come back to the original solution you proposed. Do you think we need a separate test for roc_auc_score here, as proposed by @lesteve?
Just for clarity, I was not advocating a completely separate test but more a special case, i.e. something like this:

# FIXME: roc_auc scores are more unstable than other scores
kwargs = {'atol': 1e-2} if 'roc_auc' in name else {}
for scaling in [2, 0.3]:
    np.testing.assert_allclose(
        weighted_score,
        metric(y1, y2, sample_weight=sample_weight * scaling),
        err_msg=("%s sample_weight is not invariant "
                 "under scaling" % name),
        **kwargs)
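(With no kwargs, np.testing.assert_allclose keeps its tight defaults of rtol=1e-7 and atol=0, so only the roc_auc metrics get the looser tolerance.)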
Apart from nitpicks, LGTM
doc/whats_new/v0.20.rst
Outdated

@@ -17,6 +17,8 @@ random sampling procedures.

- :class:`decomposition.IncrementalPCA` in Python 2 (bug fix)
- :class:`isotonic.IsotonicRegression` (bug fix)
- :class:`metrics.roc_auc_score` (enhancement)
Do you really think this belongs here? Do we believe this will change user results often enough to caution them here? This makes it seem like their ROC scores will have suddenly changed... I think "bug fix" is more appropriate in any case.
@@ -371,6 +371,20 @@ def test_roc_curve_drop_intermediate():
                        [1.0, 0.9, 0.7, 0.6, 0.])


def test_roc_curve_fpr_tpr_increasing():
I feel like the fact that elements are sorted for one random sample isn't a very strong assurance. There are edge cases that could be further tested (such as having repeated thresholds), too, but I'm not sure what reasonable edge cases for this test are.
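One hypothetical repeated-threshold input of the kind mentioned (my illustration, not from the thread):

import numpy as np
from sklearn.metrics import roc_curve

# Tied scores collapse to a single threshold inside roc_curve, and
# drop_intermediate may remove further points.
y_true = [0, 1, 0, 1, 1, 0]
y_score = [0.3, 0.3, 0.5, 0.5, 0.8, 0.8]
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(thresholds)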
Basically the edge cases are when the two definitions of fps are not equal because of floating point errors:

tps = stable_cumsum(y_true * weight)[threshold_idxs]
fps = stable_cumsum(weight)[threshold_idxs] - tps           # old definition
fps = stable_cumsum((1 - y_true) * weight)[threshold_idxs]  # new definition

It is not obvious to me how to construct a simple example that does not work, but maybe with a little bit of thought there is a way to put a simpler one together.
For full details the best is to look at the definition of _binary_clf_curve, especially how the other variables are defined.
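A self-contained sketch of that drift (plain np.cumsum standing in for stable_cumsum; the variable names are mine):

import numpy as np

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=50)
weight = rng.randint(1, 10, size=50) * 0.2   # non-integer weights, as in the PR discussion

tps = np.cumsum(y_true * weight)
fps_by_subtraction = np.cumsum(weight) - tps       # old definition
fps_by_cumsum = np.cumsum((1 - y_true) * weight)   # new definition

print(np.abs(fps_by_subtraction - fps_by_cumsum).max())  # typically a tiny nonzero value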
OK, I found a simpler example:

def test_roc_curve_fpr_tpr_increasing():
    # Ensure that fpr and tpr returned by roc_curve are increasing
    # Construct an edge case with float y_score and sample_weight
    # when some adjacent values of fpr and tpr are the same.
    y_true = [0, 0, 1, 1, 1]
    y_score = [0.1, 0.7, 0.3, 0.4, 0.5]
    sample_weight = np.repeat(0.2, 5)
    fpr, tpr, _ = roc_curve(y_true, y_score,
                            sample_weight=sample_weight)
    assert_equal((np.diff(fpr) < 0).sum(), 0)
    assert_equal((np.diff(tpr) < 0).sum(), 0)

Are you happier with this one @jnothman?
(Pdb) np.diff(fpr)
array([ 5.00000000e-01, 0.00000000e+00, 2.22044605e-16,
-3.33066907e-16, 5.00000000e-01])
@@ -371,6 +371,20 @@ def test_roc_curve_drop_intermediate():
                        [1.0, 0.9, 0.7, 0.6, 0.])


def test_roc_curve_fpr_tpr_increasing():
    # Ensure that fpr and tpr returned by roc_curve are increasing
    # Regression test for issue #9786
It's not really a regression test
y_score = rng.random_sample(size=(n_samples,))
sample_weight = rng.randint(1, 10, size=(n_samples, ))
fpr, tpr, _ = roc_curve(y_true, y_score,
                        sample_weight=sample_weight * 0.2)
The * 0.2 on sample weight is mysterious and deserves comment.
@jnothman Thanks for your precious time :) Comments addressed.
After consideration, I removed the statement about roc_curve (the change in the result is really small, most of the time much less than 1e-7) and moved the statement about roc_auc_score to the bug fix section.
I removed the comment since I now think it is unnecessary. But from my perspective, this is a regression test (it fails on master): in master we cannot ensure that fpr and tpr are increasing, because fps is computed by subtraction instead of accumulation. Though the error is really small, this is actually the core reason for the issue (it causes the wrong sort and the wrong roc_auc_score).
The test is constructed based on @lesteve's suggestion. I think it hits the edge case (float y_score and sample_weight where some adjacent values of fpr and tpr are the same). Now, for adjacent equal values, we no longer get something non-increasing (e.g., [a+1e-10, a-1e-10]). I have added a comment. WDYT? Thanks :)
Will merge when @lesteve approves. Might even be reasonable to include in 0.19.1.
@lesteve Thanks for the review and the improvement. I have slightly updated the comment of the test.
doc/whats_new/v0.20.rst
Outdated

@@ -108,6 +107,11 @@ Decomposition, manifold learning and clustering

- Fixed a bug in :func:`datasets.fetch_kddcup99`, where data were not properly
  shuffled. :issue:`9731` by `Nicolas Goix`_.

Model evaluation and meta-estimators
This is not the right subsection; there should be one with "metrics" in it. Look in an older version's whats_new if you cannot find it in this file.
doc/whats_new/v0.20.rst
Outdated

@@ -108,6 +107,11 @@ Decomposition, manifold learning and clustering

- Fixed a bug in :func:`datasets.fetch_kddcup99`, where data were not properly
  shuffled. :issue:`9731` by `Nicolas Goix`_.

Model evaluation and meta-estimators

- Fixed a bug in :func:`metrics.roc_auc_score`, where float calculations sometimes
I think it can be written a bit better, maybe something like:

Fixed bug due to floating point error in :func:`metrics.roc_auc_score`
with non-integer sample weights. :issue:`9786` by :user:`Hanmin Qin
<qinhanmin2014>`.
@lesteve Thanks a lot for your help :) I totally agree with all your suggestions and have learnt a lot.
I think this is good to go, merging, thanks a lot! Another piece of advice while I am at it: choose better names for your branches, you can do better than test-feature-3, surely.
Follow-up PRs, @qinhanmin2014, if you are up to it (I would rather have two separate PRs in this case):
I'm okay with having tolerance in threshold differences as long as it defaults to zero.
* ensure fpr and tpr are increasing in roc_curve with non-integer sample weights
* add tests and move roc_auc_score from METRIC_UNDEFINED_BINARY to METRIC_UNDEFINED_MULTICLASS
Reference Issue
Proposed in #9567 by @jnothman
What does this implement/fix? Explain your changes.
METRIC_UNDEFINED_BINARY contains metrics that don't support binary inputs and METRIC_UNDEFINED_MULTICLASS contains metrics that don't support multiclass inputs, so it seems that roc_auc_score belongs in METRIC_UNDEFINED_MULTICLASS.
In order to pass the tests in test_common.py, I have to:
(1) Add a check to ensure that the shape of sample_weight is [n_samples] (a regression test is already in test_common.py).
(2) Carefully choose the scaling value in the test to reduce minor errors introduced by floating-point arithmetic (e.g., 4.2 + 2.1 == 6.3 is False; see the small illustration below).

Any other comments?
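The floating-point pitfall referenced in (2), for completeness (plain Python):

print(4.2 + 2.1 == 6.3)        # False
print(abs((4.2 + 2.1) - 6.3))  # on the order of 1e-15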
cc @jnothman