Description
The routine _average_path_length() for isolation forests (sklearn.ensemble.IsolationForest) gives quite inaccurate results for small sizes (small n_samples_leaf), and the error propagates to the anomaly scores. For n_samples_leaf=3 the error is around 30%. The cause is the very rough approximation taken from the original paper, which only reaches reasonable accuracy for very large sample sizes. I would therefore suggest replacing the current routine with the combination below of a lookup table for small values and an improved asymptotic series for large values, which should be accurate to double precision. Please let me know what you think. (This might be considered an improvement on #13251.)
Steps/Code to Fix
import numpy as np

from sklearn.utils import check_array


def _average_path_length_small(n_samples_leaf):
    # Lookup table of exact average path lengths for n_samples_leaf = 0, ..., 51
    # (the first two entries are 0 by convention).
    c = (0.0, 0.0, 1.0, 1.6666666666666666667,
         2.1666666666666666667, 2.5666666666666666667, 2.9000000000000000000, 3.1857142857142857143,
         3.4357142857142857143, 3.6579365079365079365, 3.8579365079365079365, 4.0397546897546897547,
         4.2064213564213564214, 4.3602675102675102675, 4.5031246531246531247, 4.6364579864579864580,
         4.7614579864579864580, 4.8791050452815158698, 4.9902161563926269809, 5.0954793142873638230,
         5.1954793142873638230, 5.2907174095254590611, 5.3816265004345499702, 5.4685830221736804049,
         5.5519163555070137383, 5.6319163555070137383, 5.7088394324300906613, 5.7829135065041647354,
         5.8543420779327361640, 5.9233075951741154743, 5.9899742618407821410, 6.0544903908730402055,
         6.1169903908730402055, 6.1775964514791008116, 6.2364199808908655175, 6.2935628380337226603,
         6.3491183935892782159, 6.4031724476433322699, 6.4558040265907006910, 6.5070860778727519730,
         6.5570860778727519730, 6.6058665656776300218, 6.6534856132966776409, 6.6999972412036543850,
         6.7454517866581998396, 6.7898962311026442840, 6.8333744919722095014, 6.8759276834615712036,
         6.9175943501282378702, 6.9584106766588501151, 6.9984106766588501151, 7.0376263629333599190)
    # Clamp at 0 so that negative inputs cannot index from the end of the table.
    return np.array(c)[np.maximum(0, n_samples_leaf)]
def _average_path_length(n_samples_leaf):
    """
    The average path length in an n_samples iTree, which is equal to
    the average path length of an unsuccessful BST search, since the
    latter has the same structure as an isolation tree.

    Parameters
    ----------
    n_samples_leaf : array-like, shape (n_samples,)
        The number of training samples in each test sample leaf, for
        each estimator.

    Returns
    -------
    average_path_length : array, same shape as n_samples_leaf
    """
    n_samples_leaf = check_array(n_samples_leaf, ensure_2d=False)
    n_samples_leaf_shape = n_samples_leaf.shape
    n_samples_leaf = n_samples_leaf.reshape((1, -1))
    average_path_length = np.zeros(n_samples_leaf.shape)

    # Exact values from the lookup table for small sizes.
    mask_small = n_samples_leaf < 52
    not_mask = ~mask_small
    average_path_length[mask_small] = _average_path_length_small(
        n_samples_leaf[mask_small])

    # Asymptotic series for large sizes, accurate to double precision
    # for n_samples_leaf >= 52.
    tmp = 1.0 / np.square(n_samples_leaf[not_mask])
    average_path_length[not_mask] = (
        2.0 * (np.log(n_samples_leaf[not_mask]) - 1.0 + np.euler_gamma)
        + 1.0 / n_samples_leaf[not_mask]
        - tmp * (1.0 / 6 - tmp * (1.0 / 60 - tmp / 126))
    )

    return average_path_length.reshape(n_samples_leaf_shape)
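As a quick sanity check (a sketch, not part of the proposed patch; the helper _exact_average_path_length is only introduced here for the comparison), both branches can be compared against the exact expression 2*H(n-1) - 2*(n-1)/n:

def _exact_average_path_length(n):
    # exact expected depth of an unsuccessful BST search: 2*H(n-1) - 2*(n-1)/n
    return 2.0 * sum(1.0 / k for k in range(1, n)) - 2.0 * (n - 1) / n

for n in (3, 10, 51, 52, 100, 1000):
    exact = _exact_average_path_length(n)
    proposed = _average_path_length([n])[0]
    print(n, exact, proposed, abs(proposed - exact))

The differences should stay at the level of a few ulps on both sides of the n_samples_leaf=52 cutoff, i.e. the asymptotic series has already converged to double precision where the table ends.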
Versions
System:
python: 3.8.0 (default, Oct 23 2019, 18:51:26) [GCC 9.2.0]
executable: /usr/bin/python
machine: Linux-5.3.13-arch1-1-x86_64-with-glibc2.2.5
Python deps:
pip: 19.2.3
setuptools: 41.6.0
sklearn: 0.21.3
numpy: 1.17.4
scipy: 1.3.1
Cython: 0.29.14
pandas: 0.25.3