Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

FIX Isolation forest path length for small samples #19087

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: main
Choose a base branch
Loading
from

Conversation

Konrad0
Copy link

@Konrad0 Konrad0 commented Jan 1, 2021

Reference Issues/PRs

Fixes #15724: iForest average path length for small samples (sklearn.ensemble.IsolationForest)
Also, to mitigate the numerical instability issue outlined in #16721 and #16967, a threshold value is implemented as suggested in the aforementioned PRs.

What does this implement/fix?

An improved approximation for the average path length in isolation forests is introduced, based on more a accurate harmonic number calculation. The routine _average_path_length() is mostly replaced. This leads to more accurate anomaly scores.

Any other comments?

For the harmonic number calculation, see Wikipedia (https://en.wikipedia.org/wiki/Harmonic_number#Calculation) or the following publications and references therein. Further improvement regarding accuracy and/or performance may be feasible using e.g. a Ramanujan expansion.
Villarino, M. Ramanujan's Harmonic Number Expansion into Negative Powers of a Triangular Number. https://arxiv.org/abs/0707.3950
or
Wang, W. Harmonic Number Expansions of the Ramanujan Type. Results Math 73, 161 (2018). https://doi.org/10.1007/s00025-018-0920-8

…#15724)

An improved approximation for the average path length in isolation
forests is introduced, based on more a accurate harmonic number
calculation. The routine _average_path_length() is mostly replaced.
@Konrad0 Konrad0 changed the title I forest pathlength small samples Isolation forest path length for small samples Jan 1, 2021
Since the decision function is subject to numerical fluctuations,
an anomaly threshold of -1e-15 is defined as suggested in issues
@kyrajeep
Copy link

kyrajeep commented Jan 4, 2021

Thank you. Was the paper you cite published? Also, there were a couple tests that didn't pass.

@Konrad0
Copy link
Author

Konrad0 commented Jan 4, 2021

Hi @kyrajeep,

yes, both references are published, the latter as mentioned above, the former here: Villarino, M.B.: Ramanujan’s harmonic number expansion into negative powers of a triangular number. JIPAM. J. Inequal. Pure Appl. Math. 9(3), 89 (2008). (https://www.emis.de/journals/JIPAM/article1026.html?sid=1026).
Both references contain the asymptotic expansion applied here.

The failing tests seem to be unrelated to these changes as far as I can see, maybe you could take a look? Thanks!

@kyrajeep
Copy link

kyrajeep commented Jan 5, 2021

Hi @ Konrad0,

Thank you for providing the links to the publications. I am not familiar with the testing framework here yet but I may be able to review in the next few days after learning it. Perhaps could someone more experienced with testing for scikit learn take a look? Thanks!

Hi @kyrajeep,

yes, both references are published, the latter as mentioned above, the former here: Villarino, M.B.: Ramanujan’s harmonic number expansion into negative powers of a triangular number. JIPAM. J. Inequal. Pure Appl. Math. 9(3), 89 (2008). (https://www.emis.de/journals/JIPAM/article1026.html?sid=1026).
Both references contain the asymptotic expansion applied here.

The failing tests seem to be unrelated to these changes as far as I can see, maybe you could take a look? Thanks!

@Konrad0
Copy link
Author

Konrad0 commented Jan 13, 2021

Hi @albertcthomas, @jnothman and all,

could one of you review this PR or maybe suggest someone else who might be able to do it? Thanks!
(As mentioned above I suspect at least some of the failed tests are unrelated to the changes implemented here.)

Base automatically changed from master to main January 22, 2021 10:53
@Konrad0
Copy link
Author

Konrad0 commented Jan 24, 2021

Hi @cmarmo,

since you commented on the related PR #16967, could you maybe take a look at this PR? As mentioned above I suspect the test failures are unrelated to the changes implemented here. Thanks!

@cmarmo
Copy link
Contributor

cmarmo commented Jan 25, 2021

Hi @Konrad0 thanks for your patience! The linting issue is related to the renaming of the main branch, I can't tell about the other one as the build is no longer available. I'm going to close and reopen the PR in order to trig the builds.

@Konrad0
Copy link
Author

Konrad0 commented Jan 25, 2021

Thanks @cmarmo, now all checks passed 😄

@Konrad0
Copy link
Author

Konrad0 commented Jan 27, 2021

Hi @david-cortes,
I saw that you were working on another issue regarding isolation forests, maybe you'd like to review this one? It's a small fix and all checks pass already.

@david-cortes
Copy link
Contributor

@Konrad0 I'm not a member of scikit-learn's team so I'm afraid I wouldn't qualify as a reviewer.

That being said, I took a look at the PR and it looks fine, but would wonder: why stop at 50? why not for example use a non-hard-coded harmonic calculation for higher up numbers up to a bigger threshold? At 50 observations the gap between real and approximate is roughly 1e-2, which is the same difference as when having 1 more observation in a node.

As a suggestion, this technique is quite fast and precise (algorithm 4):
http://fredrik-j.blogspot.com/2009/02/how-not-to-compute-harmonic-numbers.html

You can check an implementation of it here in this alternative isolation forest package that I maintain:
https://github.com/david-cortes/isotree/blob/b5f5688bf59e2c6efebd84d6e0ccb33081d65ad6/src/utils.cpp#L114
(hard-codes first couple numbers, then uses that formula up to some threshold, then uses approximation)

@Konrad0
Copy link
Author

Konrad0 commented Jan 27, 2021

Hi @david-cortes,
thanks for your input. I'm not sure where you get the number 1e-2 for 50 observations. Besides the lookup table, an improved approximation was implemented which is accurate at the level of the roundoff error for double precision (~2e-16) for more than 50 observations. Please let me know if I've misunderstood your point, I'd be glad to improve the implemented approximation!

@david-cortes
Copy link
Contributor

@Konrad0 Apologies, hadn't noticed the part in which you added an extra term on the approximation. I did some testing against actual harmonic numbers up to 512 and can indeed confirm that the difference is at most 1e-15.

I'm now looking at the PR in more detail, and would like to comment:

  • In function _average_path_length_small, I think it'd be faster to put the hard-coded array outside of the function instead of re-generating it every time the function is called.
  • Why do you reshape n_samples_leaf? Why not better reshape average_path_length at the end?
  • Would be better to delete the commented-out line in _average_path_length_small.

@Konrad0
Copy link
Author

Konrad0 commented Jan 27, 2021

Hi @david-cortes,
thanks for your suggestions, they are implemented in the latest commit.

* In function `_average_path_length_small`, I think it'd be faster to put the hard-coded array outside of the function instead of re-generating it every time the function is called.

You're right, changed. BTW, I expect some PEP8 problems here since I wanted to keep the list compact, so I didn't indent it as far as suggested.

* Why do you reshape `n_samples_leaf`? Why not better reshape `average_path_length` at the end?

That was kept from the previously existing code, so I'm also not sure why it was originally done this way.

* Would be better to delete the commented-out line in `_average_path_length_small`.

Resolved in first point.

@cmarmo: Can anything be done about the failing lint tests?

@cmarmo
Copy link
Contributor

cmarmo commented Jan 28, 2021

HI @Konrad0 it is a renaming problem again... have you synchronized with upstream/main? This should solve the issue.

@david-cortes
Copy link
Contributor

For the failing tests, you can use the command line utility flake8 to auto-correct the code formatting. Also you'd probably need to add an entry in the what's new for the next version.

@Konrad0
Copy link
Author

Konrad0 commented Jan 28, 2021

Thanks @cmarmo for the hint, now the lint tests pass.

@david-cortes The entry in what's new was added. For PEP8 compliance I had just missed a line break at the start of the array...

@Konrad0
Copy link
Author

Konrad0 commented Jul 4, 2021

Any chance of getting the review going? Do you have an idea, @cmarmo?

@cmarmo
Copy link
Contributor

cmarmo commented Jul 5, 2021

Hi @Konrad0, thanks for your patience and your work so far! Do you mind synchronizing with upstream? Then perhaps @jjerphan and @albertcthomas might want to have a look? Thanks!

Copy link
Contributor

@albertcthomas albertcthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review @Konrad0. Please see my first comments. Just to make sure I understand, the new implementation (including the lookup table) is based on the 2 references you mentionned in the PR description and the code? You are solving the numerical issue at the same time so please add it to the PR description with the already opened PRs (I think you already added this information in a comment of the associated issue but it would be better to find it here as well). Thanks a lot!

@@ -311,7 +312,8 @@ def predict(self, X):
check_is_fitted(self)
X = check_array(X, accept_sparse='csr')
is_inlier = np.ones(X.shape[0], dtype=int)
is_inlier[self.decision_function(X) < 0] = -1
# is_inlier[self.decision_function(X) < 0] = -1
is_inlier[self.decision_function(X) < -1.0e-15] = -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use -np.finfo(float).eps instead of -1.0e-15?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but I had to double it to -2*np.finfo(float).eps. Anyway, I have a proper fix for the numerical instability lined up, but in order for it to work, the path length has to be fixed first.

result_one = 2.0 * (np.log(4.0) + np.euler_gamma) - 2.0 * 4.0 / 5.0
result_two = 2.0 * (np.log(998.0) + np.euler_gamma) - 2.0 * 998.0 / 999.0
result_5 = 77.0/30.0
result_999 = 12.9689417211006898253130364
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this number coming from?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the exact values for the test cases, can be calculated using e.g. SymPy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is a number with a finite number of decimals?

Copy link
Author

@Konrad0 Konrad0 Jul 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's a rounded rational number. It can be written as an exact fraction, but then the numerator as well as the denominator are both several hundred digits long.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then maybe replace the exact value comment by saying that it is a rounded rational number

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, fixed

@@ -463,6 +465,27 @@ def _more_tags(self):
}


_average_path_length_small = np.array((
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a small comment above to say that this is a lookup table used in the _average_path_length function below?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

average_path_length[mask_small] = _average_path_length_small[
n_samples_leaf[mask_small]]

# Average path length equals 2*(H(n)-1), with H(n) the nth harmonic number.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe put this in a Notes section of the function dosctring?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for precising it with a comment @Konrad0.

I would also recommend adding a reference with Sphinx for the resource provided bellow.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion, I tried to put in a note and a reference. All tests pass, but maybe one of you could check the doocumentation anyway, since for some strange reason Sphinx is crashing on my machine, so I couldn't look at the result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think sphinx render the docstrings of private functions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you suggesting we move it to the top level description? To me, this is a rather minute detail, so I wouldn't expect someone looking to use the isolation forest to be interested in the implementation of the harmonic number calculation...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is fine here.

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Konrad0 for your fix proposal!

Here are a few suggestions.

Can you prefix the title of this PR with FIX, please?

average_path_length[mask_small] = _average_path_length_small[
n_samples_leaf[mask_small]]

# Average path length equals 2*(H(n)-1), with H(n) the nth harmonic number.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for precising it with a comment @Konrad0.

I would also recommend adding a reference with Sphinx for the resource provided bellow.

mask_1 = n_samples_leaf <= 1
mask_2 = n_samples_leaf == 2
not_mask = ~np.logical_or(mask_1, mask_2)
mask_small = n_samples_leaf < 52
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mask_small = n_samples_leaf < 52
mask_small = n_samples_leaf < len(_average_path_length_small)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More elegant, done.

Comment on lines 521 to 523
# Powers Of A Triangular Number. JIPAM. J. Inequal. Pure Appl. Math. 9(3),
# 89 (2008). https://www.emis.de/journals/JIPAM/article1026.html?sid=1026.
# Preprint at https://arxiv.org/abs/0707.3950.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter can be removed.

Suggested change
# Powers Of A Triangular Number. JIPAM. J. Inequal. Pure Appl. Math. 9(3),
# 89 (2008). https://www.emis.de/journals/JIPAM/article1026.html?sid=1026.
# Preprint at https://arxiv.org/abs/0707.3950.
# Powers Of A Triangular Number. JIPAM. J. Inequal. Pure Appl. Math. 9(3),
# 89 (2008). https://www.emis.de/journals/JIPAM/article1026.html.
# Preprint at https://arxiv.org/abs/0707.3950.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 493 to 532
average_path_length[mask_2] = 1.
tmp = 1.0/np.square(n_samples_leaf[not_mask])
average_path_length[not_mask] = (
2.0 * (np.log(n_samples_leaf[not_mask] - 1.0) + np.euler_gamma)
- 2.0 * (n_samples_leaf[not_mask] - 1.0) / n_samples_leaf[not_mask]
2.0 * (np.log(n_samples_leaf[not_mask]) - 1.0 + np.euler_gamma)
+ 1.0/n_samples_leaf[not_mask]
- tmp*(1.0/6.0 - tmp*(1.0/60.0 - tmp/126.0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give a better name to tmp? Also how can one find those constants here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, renamed to n2_inv.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the second question, the coefficients are from the Euler asymptotic expansion of the harmonic numbers H(n) as found on Wikipedia (https://en.wikipedia.org/wiki/Harmonic_number#Calculation) or in the references provided above. The coefficients just have to be doubled since the path length is 2*(H(n)-1).

@@ -322,6 +322,7 @@ def test_iforest_with_uniform_data():

rng = np.random.RandomState(0)

assert all(np.abs(iforest.decision_function(X)) < 1.0e-15)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To complete @albertcthomas's remark.

Suggested change
assert all(np.abs(iforest.decision_function(X)) < 1.0e-15)
assert all(np.abs(iforest.decision_function(X)) < np.finfo(float).eps)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I had to double it to 2*np.finfo(float).eps, but hopefully the numerical instability issue will be resolved altogether soon.

@Konrad0 Konrad0 changed the title Isolation forest path length for small samples FIX Isolation forest path length for small samples Jul 6, 2021
@Konrad0
Copy link
Author

Konrad0 commented Jul 6, 2021

Thanks @albertcthomas and @jjerphan for your reviews!
To answer a couple of questions: the implementation is based on an expansion that can be found in either of the two references or on Wikipedia (link added to comment and notes). The lookup table was calculated separately, it cannot be found in the references.
As mentioned above, a workaround for the numerical instability issue (PRs are now mentioned in the description) was applied here out of necessity, but I'll submit a PR for a proper fix without the need for thresholds like2*np.finfo(float).eps after the path length is working correctly.

@albertcthomas
Copy link
Contributor

Thanks @Konrad0, I'll have a look soon.

One important thing: please change the last commit message where you put @albertcthomas and @jjerphan otherwise we'll get intempestive notifications. As this is your last commit this can easily be done with git commit --amend to change the commit message and then force push.

@jjerphan
Copy link
Member

jjerphan commented Jul 7, 2021

Note that you can also add co-authors to commits if relevant.

Edit: regarding the mention in the commit title, notification-wise it's fine for me, @Konrad0.

@Konrad0 Konrad0 force-pushed the iForest_pathlength_small_samples branch from 782225c to 5b9edfb Compare July 7, 2021 21:05
@Konrad0
Copy link
Author

Konrad0 commented Jul 7, 2021

Sorry about that, didn't mean to inundate you with useless notifications. Commit message has been changed.

@Konrad0
Copy link
Author

Konrad0 commented Aug 2, 2021

Hi @albertcthomas and @jjerphan,

are there any more adjustments you'd suggest? The previous ones should be implemented. Thanks!

Comment on lines +490 to +545
_average_path_length_small = np.array(
(
0.0,
0.0,
1.0,
1.6666666666666666667,
2.1666666666666666667,
2.5666666666666666667,
2.9000000000000000000,
3.1857142857142857143,
3.4357142857142857143,
3.6579365079365079365,
3.8579365079365079365,
4.0397546897546897547,
4.2064213564213564214,
4.3602675102675102675,
4.5031246531246531247,
4.6364579864579864580,
4.7614579864579864580,
4.8791050452815158698,
4.9902161563926269809,
5.0954793142873638230,
5.1954793142873638230,
5.2907174095254590611,
5.3816265004345499702,
5.4685830221736804049,
5.5519163555070137383,
5.6319163555070137383,
5.7088394324300906613,
5.7829135065041647354,
5.8543420779327361640,
5.9233075951741154743,
5.9899742618407821410,
6.0544903908730402055,
6.1169903908730402055,
6.1775964514791008116,
6.2364199808908655175,
6.2935628380337226603,
6.3491183935892782159,
6.4031724476433322699,
6.4558040265907006910,
6.5070860778727519730,
6.5570860778727519730,
6.6058665656776300218,
6.6534856132966776409,
6.6999972412036543850,
6.7454517866581998396,
6.7898962311026442840,
6.8333744919722095014,
6.8759276834615712036,
6.9175943501282378702,
6.9584106766588501151,
6.9984106766588501151,
7.0376263629333599190,
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this table coming from? Could you generate it instead of hard-coding it?

Copy link
Author

@Konrad0 Konrad0 Aug 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a lookup table here is useful because it's fast and accurate. The corresponding value could be calculated each time from the definition of the harmonic numbers, but especially for the later values, this isn't very efficient. It is significantly slower than the asymptotic expansion, which can be used for larger values, where it becomes sufficiently accurate.
BTW, I'm not really happy about the formatting, but that's what the utility Black generated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree with your point but I think having it definition be explicitly analytic than having it hard-coded is better for readers and for maintenance.

Probably we can find a common ground combining efficiency and explicitness.

What is the snippet that you have used to generate this table?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your suggestion. The values were generated using SymPy to get correct results. One could do something like np.sum(2/np.arange(2, n)), but as you can easily see this would be much slower and may include roundoff errors. I understand your point about readability, so how about we keep the list but document it explicitly with definition of the harmonic numbers and a code snippet explaining how the list can be generated? This way it will be clear to anyone reading the code where the list came from and we also keep the benefit of superior performance and accuracy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great including the snippet generating this list. 👍

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I added to documentation to the code. The main line to generate the list is [Sum(2/k, (k, 2, max(1, i))).evalf(22).round(19) for i in range(52)], it is mentioned just above the list. Let me know what you think of it, I also added an explanation how the calculation is split for large and small samples.

Comment on lines 489 to 501
# Lookup table used below in _average_path_length() for small samples.
# Since the average path length equals 2*(H(n)-1), with H(n) the nth
# harmonic number, it can be calculated as Sum(2/k, k=2...n) for n>=2
# and is zero for n<=1.
# To achieve correct results to full rounding precision, the list below
# can be generated using SymPy (https://www.sympy.org) as follows (r is
# the result)
#
# from sympy import Sum, Symbol
# k = Symbol('k')
# r = [Sum(2/k, (k, 2, max(1, i))).evalf(22).round(19) for i in range(52)]
#
_average_path_length_small = np.array(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the snippet!

This seems to have a quadratic complexity, using a cumulative can make it linear. This can be done with numpy easily with a similar runtime.

In [1]: import warnings, numpy as np
In [2]: def f(n):
   ...:     # Catch the RuntimeWarning due to the
   ...:     # division by 0 for the first term
   ...:     with warnings.catch_warnings():
   ...:         warnings.simplefilter("ignore")
   ...:         terms = 2 / np.arange(n)
   ...:     terms[0] = terms[1] = 0.
   ...:     # Compute the series of harmonic number
   ...:     _average_path_length_small = np.cumsum(terms)
In [3]: %timeit f(52)
8.57 µs ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [4]: %timeit f(100000)
594 µs ± 2.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: def a(): 
   ...:     # insert this inline definition of the array 
In [6]: %timeit a()
3.34 µs ± 7.51 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each

Importing warnings at the top of the file:

Suggested change
# Lookup table used below in _average_path_length() for small samples.
# Since the average path length equals 2*(H(n)-1), with H(n) the nth
# harmonic number, it can be calculated as Sum(2/k, k=2...n) for n>=2
# and is zero for n<=1.
# To achieve correct results to full rounding precision, the list below
# can be generated using SymPy (https://www.sympy.org) as follows (r is
# the result)
#
# from sympy import Sum, Symbol
# k = Symbol('k')
# r = [Sum(2/k, (k, 2, max(1, i))).evalf(22).round(19) for i in range(52)]
#
_average_path_length_small = np.array(
# Lookup table used below in _average_path_length() for small samples.
# Since the average path length equals 2*(H(n)-1), with H(n) the nth
# harmonic number, it can be calculated as Sum(2/k, k=2...n) for n>=2
# and is zero for n<=1.
# Catch the RuntimeWarning due to the division by 0 for the first term
with warnings.catch_warnings():
warnings.simplefilter("ignore")
terms = 2 / np.arange(n)
terms[0] = terms[1] = 0.
# Compute the series of harmonic number
_average_path_length_small = np.cumsum(terms)

What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jjerphan,
thanks for the suggestion, I appreciate it. Sorry for the misunderstanding, of course I'm aware that the list can be generated in linear time, the code snippet in the comment was written to be as easily understandable as possible, not as an actual implementation. IMO we should keep the list, it's faster as you've shown (can't beat a lookup table on that ;-)) and I honestly don't think your code is more comprehensible to an uninitiated reader. Let me know what you think

Copy link
Member

@jjerphan jjerphan Aug 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need someone else to express their opinion on this point (cc @albertcthomas?). 🙂

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, both solutions would work, we just have different preferences 😄

Comment on lines +596 to +599
# The path length is determined in different ways depending on
# n_samples_leaf. For small values, a lookup table is used, see above for a
# more detailed explanation. For large values, an asymptotic expansion is
# used as described below.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@cmarmo
Copy link
Contributor

cmarmo commented Dec 13, 2022

Hi @Konrad0 if you are still interested in working on this do you mind fixing conflicts and synchronizing with main?
Thank you so much for your patience and your work so far.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Average path length in iForest is inaccurate for small sizes
6 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.