Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Average path length in iForest is inaccurate for small sizes #15724

Copy link
Copy link
Open
@Konrad0

Description

@Konrad0
Issue body actions

Description

The routine _average_path_length() for isolation forests (sklearn.ensemble.IsolationForest) gives quite inaccurate results for small sizes (n_samples_leaf), which propagates to the anomaly scores. For n_samples_leaf=3 the error is around 30%. This is due to the fact that a very rough approximation mentioned in the original paper is used, which only reaches reasonable accuracy for very large numbers. Therefore I would suggest to modify the current routine as follows with a combination of a lookup table for small values and an improved asymptotic series for large values, which should be accurate to double precision. Please let me know what you think. (This might be considered an improvement on #13251.)

Steps/Code to Fix

def _average_path_length_small(n_samples_leaf):
    c = (0.0, 0.0, 1.0, 1.6666666666666666667, 
         2.1666666666666666667, 2.5666666666666666667, 2.9000000000000000000, 3.1857142857142857143, 
         3.4357142857142857143, 3.6579365079365079365, 3.8579365079365079365, 4.0397546897546897547, 
         4.2064213564213564214, 4.3602675102675102675, 4.5031246531246531247, 4.6364579864579864580, 
         4.7614579864579864580, 4.8791050452815158698, 4.9902161563926269809, 5.0954793142873638230, 
         5.1954793142873638230, 5.2907174095254590611, 5.3816265004345499702, 5.4685830221736804049, 
         5.5519163555070137383, 5.6319163555070137383, 5.7088394324300906613, 5.7829135065041647354, 
         5.8543420779327361640, 5.9233075951741154743, 5.9899742618407821410, 6.0544903908730402055, 
         6.1169903908730402055, 6.1775964514791008116, 6.2364199808908655175, 6.2935628380337226603, 
         6.3491183935892782159, 6.4031724476433322699, 6.4558040265907006910, 6.5070860778727519730, 
         6.5570860778727519730, 6.6058665656776300218, 6.6534856132966776409, 6.6999972412036543850, 
         6.7454517866581998396, 6.7898962311026442840, 6.8333744919722095014, 6.8759276834615712036, 
         6.9175943501282378702, 6.9584106766588501151, 6.9984106766588501151, 7.0376263629333599190)
    #return np.array(c)[n_samples_leaf]
    return np.array(c)[np.maximum(0, n_samples_leaf)]

def _average_path_length(n_samples_leaf):
    """
    The average path length in a n_samples iTree, which is equal to
    the average path length of an unsuccessful BST search since the
    latter has the same structure as an isolation tree.
    Parameters
    ----------
    n_samples_leaf : array-like, shape (n_samples,).
        The number of training samples in each test sample leaf, for
        each estimators.

    Returns
    -------
    average_path_length : array, same shape as n_samples_leaf
    """

    n_samples_leaf = check_array(n_samples_leaf, ensure_2d=False)

    n_samples_leaf_shape = n_samples_leaf.shape
    n_samples_leaf = n_samples_leaf.reshape((1, -1))
    average_path_length = np.zeros(n_samples_leaf.shape)

    mask_small = n_samples_leaf < 52
    not_mask = ~mask_small

    average_path_length[mask_small] = _average_path_length_small(n_samples_leaf[mask_small])

    tmp = 1.0/np.square(n_samples_leaf[not_mask])
    average_path_length[not_mask] = (
        2.0 * (np.log(n_samples_leaf[not_mask]) - 1.0 + np.euler_gamma)
        + 1.0/n_samples_leaf[not_mask] - tmp*(1.0/6 - tmp*(1.0/60 - tmp/126))
    )

    return average_path_length.reshape(n_samples_leaf_shape)

Versions

System:
python: 3.8.0 (default, Oct 23 2019, 18:51:26) [GCC 9.2.0]
executable: /usr/bin/python
machine: Linux-5.3.13-arch1-1-x86_64-with-glibc2.2.5

Python deps:
pip: 19.2.3
setuptools: 41.6.0
sklearn: 0.21.3
numpy: 1.17.4
scipy: 1.3.1
Cython: 0.29.14
pandas: 0.25.3

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      Morty Proxy This is a proxified and sanitized view of the page, visit original site.