-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
FIX Isolation forest path length for small samples #19087
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
FIX Isolation forest path length for small samples #19087
Conversation
…#15724) An improved approximation for the average path length in isolation forests is introduced, based on more a accurate harmonic number calculation. The routine _average_path_length() is mostly replaced.
Since the decision function is subject to numerical fluctuations, an anomaly threshold of -1e-15 is defined as suggested in issues
Thank you. Was the paper you cite published? Also, there were a couple tests that didn't pass. |
Hi @kyrajeep, yes, both references are published, the latter as mentioned above, the former here: Villarino, M.B.: Ramanujan’s harmonic number expansion into negative powers of a triangular number. JIPAM. J. Inequal. Pure Appl. Math. 9(3), 89 (2008). (https://www.emis.de/journals/JIPAM/article1026.html?sid=1026). The failing tests seem to be unrelated to these changes as far as I can see, maybe you could take a look? Thanks! |
Hi @ Konrad0, Thank you for providing the links to the publications. I am not familiar with the testing framework here yet but I may be able to review in the next few days after learning it. Perhaps could someone more experienced with testing for scikit learn take a look? Thanks!
|
Hi @albertcthomas, @jnothman and all, could one of you review this PR or maybe suggest someone else who might be able to do it? Thanks! |
Hi @Konrad0 thanks for your patience! The linting issue is related to the renaming of the main branch, I can't tell about the other one as the build is no longer available. I'm going to close and reopen the PR in order to trig the builds. |
Thanks @cmarmo, now all checks passed 😄 |
Hi @david-cortes, |
@Konrad0 I'm not a member of scikit-learn's team so I'm afraid I wouldn't qualify as a reviewer. That being said, I took a look at the PR and it looks fine, but would wonder: why stop at 50? why not for example use a non-hard-coded harmonic calculation for higher up numbers up to a bigger threshold? At 50 observations the gap between real and approximate is roughly 1e-2, which is the same difference as when having 1 more observation in a node. As a suggestion, this technique is quite fast and precise (algorithm 4): You can check an implementation of it here in this alternative isolation forest package that I maintain: |
Hi @david-cortes, |
@Konrad0 Apologies, hadn't noticed the part in which you added an extra term on the approximation. I did some testing against actual harmonic numbers up to 512 and can indeed confirm that the difference is at most 1e-15. I'm now looking at the PR in more detail, and would like to comment:
|
Hi @david-cortes,
You're right, changed. BTW, I expect some PEP8 problems here since I wanted to keep the list compact, so I didn't indent it as far as suggested.
That was kept from the previously existing code, so I'm also not sure why it was originally done this way.
Resolved in first point. @cmarmo: Can anything be done about the failing lint tests? |
HI @Konrad0 it is a renaming problem again... have you synchronized with upstream/main? This should solve the issue. |
For the failing tests, you can use the command line utility flake8 to auto-correct the code formatting. Also you'd probably need to add an entry in the what's new for the next version. |
Thanks @cmarmo for the hint, now the lint tests pass. @david-cortes The entry in what's new was added. For PEP8 compliance I had just missed a line break at the start of the array... |
Any chance of getting the review going? Do you have an idea, @cmarmo? |
Hi @Konrad0, thanks for your patience and your work so far! Do you mind synchronizing with upstream? Then perhaps @jjerphan and @albertcthomas might want to have a look? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review @Konrad0. Please see my first comments. Just to make sure I understand, the new implementation (including the lookup table) is based on the 2 references you mentionned in the PR description and the code? You are solving the numerical issue at the same time so please add it to the PR description with the already opened PRs (I think you already added this information in a comment of the associated issue but it would be better to find it here as well). Thanks a lot!
sklearn/ensemble/_iforest.py
Outdated
@@ -311,7 +312,8 @@ def predict(self, X): | ||
check_is_fitted(self) | ||
X = check_array(X, accept_sparse='csr') | ||
is_inlier = np.ones(X.shape[0], dtype=int) | ||
is_inlier[self.decision_function(X) < 0] = -1 | ||
# is_inlier[self.decision_function(X) < 0] = -1 | ||
is_inlier[self.decision_function(X) < -1.0e-15] = -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use -np.finfo(float).eps
instead of -1.0e-15
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but I had to double it to -2*np.finfo(float).eps
. Anyway, I have a proper fix for the numerical instability lined up, but in order for it to work, the path length has to be fixed first.
result_one = 2.0 * (np.log(4.0) + np.euler_gamma) - 2.0 * 4.0 / 5.0 | ||
result_two = 2.0 * (np.log(998.0) + np.euler_gamma) - 2.0 * 998.0 / 999.0 | ||
result_5 = 77.0/30.0 | ||
result_999 = 12.9689417211006898253130364 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is this number coming from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the exact values for the test cases, can be calculated using e.g. SymPy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is a number with a finite number of decimals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's a rounded rational number. It can be written as an exact fraction, but then the numerator as well as the denominator are both several hundred digits long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then maybe replace the exact value comment by saying that it is a rounded rational number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, fixed
sklearn/ensemble/_iforest.py
Outdated
@@ -463,6 +465,27 @@ def _more_tags(self): | ||
} | ||
|
||
|
||
_average_path_length_small = np.array(( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add a small comment above to say that this is a lookup table used in the _average_path_length
function below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
average_path_length[mask_small] = _average_path_length_small[ | ||
n_samples_leaf[mask_small]] | ||
|
||
# Average path length equals 2*(H(n)-1), with H(n) the nth harmonic number. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe put this in a Notes
section of the function dosctring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for precising it with a comment @Konrad0.
I would also recommend adding a reference with Sphinx for the resource provided bellow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion, I tried to put in a note and a reference. All tests pass, but maybe one of you could check the doocumentation anyway, since for some strange reason Sphinx is crashing on my machine, so I couldn't look at the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think sphinx render the docstrings of private functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you suggesting we move it to the top level description? To me, this is a rather minute detail, so I wouldn't expect someone looking to use the isolation forest to be interested in the implementation of the harmonic number calculation...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this is fine here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Konrad0 for your fix proposal!
Here are a few suggestions.
Can you prefix the title of this PR with FIX
, please?
average_path_length[mask_small] = _average_path_length_small[ | ||
n_samples_leaf[mask_small]] | ||
|
||
# Average path length equals 2*(H(n)-1), with H(n) the nth harmonic number. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for precising it with a comment @Konrad0.
I would also recommend adding a reference with Sphinx for the resource provided bellow.
sklearn/ensemble/_iforest.py
Outdated
mask_1 = n_samples_leaf <= 1 | ||
mask_2 = n_samples_leaf == 2 | ||
not_mask = ~np.logical_or(mask_1, mask_2) | ||
mask_small = n_samples_leaf < 52 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mask_small = n_samples_leaf < 52 | |
mask_small = n_samples_leaf < len(_average_path_length_small) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More elegant, done.
sklearn/ensemble/_iforest.py
Outdated
# Powers Of A Triangular Number. JIPAM. J. Inequal. Pure Appl. Math. 9(3), | ||
# 89 (2008). https://www.emis.de/journals/JIPAM/article1026.html?sid=1026. | ||
# Preprint at https://arxiv.org/abs/0707.3950. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter can be removed.
# Powers Of A Triangular Number. JIPAM. J. Inequal. Pure Appl. Math. 9(3), | |
# 89 (2008). https://www.emis.de/journals/JIPAM/article1026.html?sid=1026. | |
# Preprint at https://arxiv.org/abs/0707.3950. | |
# Powers Of A Triangular Number. JIPAM. J. Inequal. Pure Appl. Math. 9(3), | |
# 89 (2008). https://www.emis.de/journals/JIPAM/article1026.html. | |
# Preprint at https://arxiv.org/abs/0707.3950. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
sklearn/ensemble/_iforest.py
Outdated
average_path_length[mask_2] = 1. | ||
tmp = 1.0/np.square(n_samples_leaf[not_mask]) | ||
average_path_length[not_mask] = ( | ||
2.0 * (np.log(n_samples_leaf[not_mask] - 1.0) + np.euler_gamma) | ||
- 2.0 * (n_samples_leaf[not_mask] - 1.0) / n_samples_leaf[not_mask] | ||
2.0 * (np.log(n_samples_leaf[not_mask]) - 1.0 + np.euler_gamma) | ||
+ 1.0/n_samples_leaf[not_mask] | ||
- tmp*(1.0/6.0 - tmp*(1.0/60.0 - tmp/126.0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give a better name to tmp
? Also how can one find those constants here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, renamed to n2_inv
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the second question, the coefficients are from the Euler asymptotic expansion of the harmonic numbers H(n)
as found on Wikipedia (https://en.wikipedia.org/wiki/Harmonic_number#Calculation) or in the references provided above. The coefficients just have to be doubled since the path length is 2*(H(n)-1)
.
@@ -322,6 +322,7 @@ def test_iforest_with_uniform_data(): | ||
|
||
rng = np.random.RandomState(0) | ||
|
||
assert all(np.abs(iforest.decision_function(X)) < 1.0e-15) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To complete @albertcthomas's remark.
assert all(np.abs(iforest.decision_function(X)) < 1.0e-15) | |
assert all(np.abs(iforest.decision_function(X)) < np.finfo(float).eps) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, I had to double it to 2*np.finfo(float).eps
, but hopefully the numerical instability issue will be resolved altogether soon.
Merge branch 'main' of https://github.com/scikit-learn/scikit-learn into iForest_pathlength_small_samples
Thanks @albertcthomas and @jjerphan for your reviews! |
Thanks @Konrad0, I'll have a look soon. One important thing: please change the last commit message where you put |
Note that you can also add co-authors to commits if relevant. Edit: regarding the mention in the commit title, notification-wise it's fine for me, @Konrad0. |
782225c
to
5b9edfb
Compare
Sorry about that, didn't mean to inundate you with useless notifications. Commit message has been changed. |
Hi @albertcthomas and @jjerphan, are there any more adjustments you'd suggest? The previous ones should be implemented. Thanks! |
_average_path_length_small = np.array( | ||
( | ||
0.0, | ||
0.0, | ||
1.0, | ||
1.6666666666666666667, | ||
2.1666666666666666667, | ||
2.5666666666666666667, | ||
2.9000000000000000000, | ||
3.1857142857142857143, | ||
3.4357142857142857143, | ||
3.6579365079365079365, | ||
3.8579365079365079365, | ||
4.0397546897546897547, | ||
4.2064213564213564214, | ||
4.3602675102675102675, | ||
4.5031246531246531247, | ||
4.6364579864579864580, | ||
4.7614579864579864580, | ||
4.8791050452815158698, | ||
4.9902161563926269809, | ||
5.0954793142873638230, | ||
5.1954793142873638230, | ||
5.2907174095254590611, | ||
5.3816265004345499702, | ||
5.4685830221736804049, | ||
5.5519163555070137383, | ||
5.6319163555070137383, | ||
5.7088394324300906613, | ||
5.7829135065041647354, | ||
5.8543420779327361640, | ||
5.9233075951741154743, | ||
5.9899742618407821410, | ||
6.0544903908730402055, | ||
6.1169903908730402055, | ||
6.1775964514791008116, | ||
6.2364199808908655175, | ||
6.2935628380337226603, | ||
6.3491183935892782159, | ||
6.4031724476433322699, | ||
6.4558040265907006910, | ||
6.5070860778727519730, | ||
6.5570860778727519730, | ||
6.6058665656776300218, | ||
6.6534856132966776409, | ||
6.6999972412036543850, | ||
6.7454517866581998396, | ||
6.7898962311026442840, | ||
6.8333744919722095014, | ||
6.8759276834615712036, | ||
6.9175943501282378702, | ||
6.9584106766588501151, | ||
6.9984106766588501151, | ||
7.0376263629333599190, | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this table coming from? Could you generate it instead of hard-coding it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a lookup table here is useful because it's fast and accurate. The corresponding value could be calculated each time from the definition of the harmonic numbers, but especially for the later values, this isn't very efficient. It is significantly slower than the asymptotic expansion, which can be used for larger values, where it becomes sufficiently accurate.
BTW, I'm not really happy about the formatting, but that's what the utility Black generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree with your point but I think having it definition be explicitly analytic than having it hard-coded is better for readers and for maintenance.
Probably we can find a common ground combining efficiency and explicitness.
What is the snippet that you have used to generate this table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion. The values were generated using SymPy to get correct results. One could do something like np.sum(2/np.arange(2, n))
, but as you can easily see this would be much slower and may include roundoff errors. I understand your point about readability, so how about we keep the list but document it explicitly with definition of the harmonic numbers and a code snippet explaining how the list can be generated? This way it will be clear to anyone reading the code where the list came from and we also keep the benefit of superior performance and accuracy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great including the snippet generating this list. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I added to documentation to the code. The main line to generate the list is [Sum(2/k, (k, 2, max(1, i))).evalf(22).round(19) for i in range(52)]
, it is mentioned just above the list. Let me know what you think of it, I also added an explanation how the calculation is split for large and small samples.
…nto iForest_pathlength_small_samples
sklearn/ensemble/_iforest.py
Outdated
# Lookup table used below in _average_path_length() for small samples. | ||
# Since the average path length equals 2*(H(n)-1), with H(n) the nth | ||
# harmonic number, it can be calculated as Sum(2/k, k=2...n) for n>=2 | ||
# and is zero for n<=1. | ||
# To achieve correct results to full rounding precision, the list below | ||
# can be generated using SymPy (https://www.sympy.org) as follows (r is | ||
# the result) | ||
# | ||
# from sympy import Sum, Symbol | ||
# k = Symbol('k') | ||
# r = [Sum(2/k, (k, 2, max(1, i))).evalf(22).round(19) for i in range(52)] | ||
# | ||
_average_path_length_small = np.array( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the snippet!
This seems to have a quadratic complexity, using a cumulative can make it linear. This can be done with numpy easily with a similar runtime.
In [1]: import warnings, numpy as np
In [2]: def f(n):
...: # Catch the RuntimeWarning due to the
...: # division by 0 for the first term
...: with warnings.catch_warnings():
...: warnings.simplefilter("ignore")
...: terms = 2 / np.arange(n)
...: terms[0] = terms[1] = 0.
...: # Compute the series of harmonic number
...: _average_path_length_small = np.cumsum(terms)
In [3]: %timeit f(52)
8.57 µs ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: %timeit f(100000)
594 µs ± 2.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: def a():
...: # insert this inline definition of the array
In [6]: %timeit a()
3.34 µs ± 7.51 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each
Importing warnings
at the top of the file:
# Lookup table used below in _average_path_length() for small samples. | |
# Since the average path length equals 2*(H(n)-1), with H(n) the nth | |
# harmonic number, it can be calculated as Sum(2/k, k=2...n) for n>=2 | |
# and is zero for n<=1. | |
# To achieve correct results to full rounding precision, the list below | |
# can be generated using SymPy (https://www.sympy.org) as follows (r is | |
# the result) | |
# | |
# from sympy import Sum, Symbol | |
# k = Symbol('k') | |
# r = [Sum(2/k, (k, 2, max(1, i))).evalf(22).round(19) for i in range(52)] | |
# | |
_average_path_length_small = np.array( | |
# Lookup table used below in _average_path_length() for small samples. | |
# Since the average path length equals 2*(H(n)-1), with H(n) the nth | |
# harmonic number, it can be calculated as Sum(2/k, k=2...n) for n>=2 | |
# and is zero for n<=1. | |
# Catch the RuntimeWarning due to the division by 0 for the first term | |
with warnings.catch_warnings(): | |
warnings.simplefilter("ignore") | |
terms = 2 / np.arange(n) | |
terms[0] = terms[1] = 0. | |
# Compute the series of harmonic number | |
_average_path_length_small = np.cumsum(terms) |
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jjerphan,
thanks for the suggestion, I appreciate it. Sorry for the misunderstanding, of course I'm aware that the list can be generated in linear time, the code snippet in the comment was written to be as easily understandable as possible, not as an actual implementation. IMO we should keep the list, it's faster as you've shown (can't beat a lookup table on that ;-)) and I honestly don't think your code is more comprehensible to an uninitiated reader. Let me know what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need someone else to express their opinion on this point (cc @albertcthomas?). 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, both solutions would work, we just have different preferences 😄
# The path length is determined in different ways depending on | ||
# n_samples_leaf. For small values, a lookup table is used, see above for a | ||
# more detailed explanation. For large values, an asymptotic expansion is | ||
# used as described below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Hi @Konrad0 if you are still interested in working on this do you mind fixing conflicts and synchronizing with main? |
Reference Issues/PRs
Fixes #15724: iForest average path length for small samples (sklearn.ensemble.IsolationForest)
Also, to mitigate the numerical instability issue outlined in #16721 and #16967, a threshold value is implemented as suggested in the aforementioned PRs.
What does this implement/fix?
An improved approximation for the average path length in isolation forests is introduced, based on more a accurate harmonic number calculation. The routine _average_path_length() is mostly replaced. This leads to more accurate anomaly scores.
Any other comments?
For the harmonic number calculation, see Wikipedia (https://en.wikipedia.org/wiki/Harmonic_number#Calculation) or the following publications and references therein. Further improvement regarding accuracy and/or performance may be feasible using e.g. a Ramanujan expansion.
Villarino, M. Ramanujan's Harmonic Number Expansion into Negative Powers of a Triangular Number. https://arxiv.org/abs/0707.3950
or
Wang, W. Harmonic Number Expansions of the Ramanujan Type. Results Math 73, 161 (2018). https://doi.org/10.1007/s00025-018-0920-8