Description
The routine _average_path_length() for isolation forests (sklearn.ensemble.IsolationForest) gives quite inaccurate results for small sizes (small n_samples_leaf), and the error propagates to the anomaly scores. For n_samples_leaf=3 the error is around 30%. The cause is the very rough approximation taken from the original paper, which only reaches reasonable accuracy for very large sample sizes. I would therefore suggest replacing the current routine with the combination below of a lookup table for small values and an improved asymptotic series for large values, which should be accurate to double precision. Please let me know what you think. (This might be considered an improvement on #13251.)
Steps/Code to Fix
import numpy as np

from sklearn.utils import check_array


def _average_path_length_small(n_samples_leaf):
    # Lookup table of exact average path lengths for n_samples_leaf = 0, ..., 51
    # (the first two entries are 0 by convention).
    c = (0.0, 0.0, 1.0, 1.6666666666666666667,
         2.1666666666666666667, 2.5666666666666666667, 2.9000000000000000000, 3.1857142857142857143,
         3.4357142857142857143, 3.6579365079365079365, 3.8579365079365079365, 4.0397546897546897547,
         4.2064213564213564214, 4.3602675102675102675, 4.5031246531246531247, 4.6364579864579864580,
         4.7614579864579864580, 4.8791050452815158698, 4.9902161563926269809, 5.0954793142873638230,
         5.1954793142873638230, 5.2907174095254590611, 5.3816265004345499702, 5.4685830221736804049,
         5.5519163555070137383, 5.6319163555070137383, 5.7088394324300906613, 5.7829135065041647354,
         5.8543420779327361640, 5.9233075951741154743, 5.9899742618407821410, 6.0544903908730402055,
         6.1169903908730402055, 6.1775964514791008116, 6.2364199808908655175, 6.2935628380337226603,
         6.3491183935892782159, 6.4031724476433322699, 6.4558040265907006910, 6.5070860778727519730,
         6.5570860778727519730, 6.6058665656776300218, 6.6534856132966776409, 6.6999972412036543850,
         6.7454517866581998396, 6.7898962311026442840, 6.8333744919722095014, 6.8759276834615712036,
         6.9175943501282378702, 6.9584106766588501151, 6.9984106766588501151, 7.0376263629333599190)
    # Clamp at 0 so that negative inputs cannot index from the end of the table.
    return np.array(c)[np.maximum(0, n_samples_leaf)]
def _average_path_length(n_samples_leaf):
    """
    The average path length in an n_samples iTree, which is equal to
    the average path length of an unsuccessful BST search, since the
    latter has the same structure as an isolation tree.

    Parameters
    ----------
    n_samples_leaf : array-like, shape (n_samples,)
        The number of training samples in each test sample leaf, for
        each estimator.

    Returns
    -------
    average_path_length : array, same shape as n_samples_leaf
    """
    n_samples_leaf = check_array(n_samples_leaf, ensure_2d=False)
    n_samples_leaf_shape = n_samples_leaf.shape
    n_samples_leaf = n_samples_leaf.reshape((1, -1))
    average_path_length = np.zeros(n_samples_leaf.shape)

    # Exact values from the lookup table for small sizes.
    mask_small = n_samples_leaf < 52
    not_mask = ~mask_small
    average_path_length[mask_small] = _average_path_length_small(
        n_samples_leaf[mask_small])

    # Asymptotic series for large sizes, accurate to double precision
    # for n_samples_leaf >= 52.
    tmp = 1.0 / np.square(n_samples_leaf[not_mask])
    average_path_length[not_mask] = (
        2.0 * (np.log(n_samples_leaf[not_mask]) - 1.0 + np.euler_gamma)
        + 1.0 / n_samples_leaf[not_mask]
        - tmp * (1.0 / 6 - tmp * (1.0 / 60 - tmp / 126))
    )

    return average_path_length.reshape(n_samples_leaf_shape)
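As a quick sanity check (a sketch, not part of the proposed patch; the helper _exact_average_path_length is only introduced here for the comparison), both branches can be compared against the exact expression 2*H(n-1) - 2*(n-1)/n:

def _exact_average_path_length(n):
    # exact expected depth of an unsuccessful BST search: 2*H(n-1) - 2*(n-1)/n
    return 2.0 * sum(1.0 / k for k in range(1, n)) - 2.0 * (n - 1) / n

for n in (3, 10, 51, 52, 100, 1000):
    exact = _exact_average_path_length(n)
    proposed = _average_path_length([n])[0]
    print(n, exact, proposed, abs(proposed - exact))

The differences should stay at the level of a few ulps on both sides of the n_samples_leaf=52 cutoff, i.e. the asymptotic series has already converged to double precision where the table ends.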
Versions
System:
python: 3.8.0 (default, Oct 23 2019, 18:51:26) [GCC 9.2.0]
executable: /usr/bin/python
machine: Linux-5.3.13-arch1-1-x86_64-with-glibc2.2.5
Python deps:
pip: 19.2.3
setuptools: 41.6.0
sklearn: 0.21.3
numpy: 1.17.4
scipy: 1.3.1
Cython: 0.29.14
pandas: 0.25.3