[MRG] Improved parallel execution of plot_partial_dependence / partial_dependence #19392

Open: wants to merge 3 commits into main

Conversation


@timo-stoettner timo-stoettner commented Feb 7, 2021

What does this implement/fix? Explain your changes.

Currently, when n_jobs is specified for plot_partial_dependence, the work is parallelized by computing the partial dependences per feature concurrently (one call to the function partial_dependence per feature). When only one feature is to be plotted, specifying n_jobs yields no performance boost at best. More generally, this holds whenever n_jobs > number of features.

Apart from the fact that this limitation does not seem to be documented anywhere, the parallelization can be implemented differently so that it also provides a speedup when the number of features is smaller than n_jobs.

I moved the parallelization into partial_dependence instead: it divides up the grid for which the values are computed and parallelizes over those parts of the grid (a rough sketch follows the list below). This has two benefits:

  • Specifying n_jobs now also leads to a performance boost when n_jobs > number of features
  • partial_dependence itself now has a parameter n_jobs and can therefore also be parallelized
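To make the idea concrete, here is a minimal, hypothetical sketch of grid-chunk parallelization. The helper and function names, the use of np.array_split, and the chunking rule are illustrative assumptions standing in for the PR's actual `_split_data_for_parallel_execution` helper, not the real implementation:

```python
import numpy as np
from joblib import Parallel, delayed, effective_n_jobs


def split_grid_into_chunks(grid, n_jobs):
    # Hypothetical stand-in for the PR's _split_data_for_parallel_execution:
    # split the grid rows into roughly equal chunks, at most one per worker.
    n_chunks = min(effective_n_jobs(n_jobs), grid.shape[0])
    return np.array_split(grid, n_chunks)


def partial_dependence_chunked(evaluate_grid_chunk, grid, n_jobs=None):
    # Each worker evaluates the model on its own slice of the grid values;
    # the per-chunk results are concatenated back into one array.
    results = Parallel(n_jobs=n_jobs)(
        delayed(evaluate_grid_chunk)(chunk)
        for chunk in split_grid_into_chunks(grid, n_jobs)
    )
    return np.concatenate(results)
```

With a split like this, all workers can be kept busy even when only a single feature is requested, because the work is divided along the grid instead of across features.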

Benchmarks

Some timeit results before (main) and after (new) the changes with different parameter settings, sorted by relative speedup and executed on my 4-core MacBook. All timings are the mean ± std. dev. of 2 runs, 1 loop each; speedup_seconds is best_main - best_new and speedup_relative is that difference divided by best_main:

| # | n_features | n_samples | n_jobs | kind | grid_resolution | timeit main | timeit new | best_main | best_new | speedup_seconds | speedup_relative |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | 1 | 10000 | 2 | individual | 200 | 24 s ± 146 ms | 13.4 s ± 6.96 ms | 23.8451 | 13.344 | 10.5011 | 0.440388 |
| 5 | 1 | 10000 | 2 | individual | 100 | 13.1 s ± 3.97 ms | 7.78 s ± 29 ms | 13.047 | 7.74754 | 5.2995 | 0.406184 |
| 5 | 1 | 10000 | 2 | individual | 100 | 12.9 s ± 267 ms | 7.78 s ± 29 ms | 12.6534 | 7.74754 | 4.90582 | 0.387709 |
| 7 | 4 | 100000 | 2 | individual | 10 | 19.2 s ± 1.03 s | 13.4 s ± 109 ms | 18.1699 | 13.331 | 4.83897 | 0.266317 |
| 1 | 1 | 100000 | 4 | individual | 100 | 24.6 s ± 388 ms | 18.3 s ± 119 ms | 24.2554 | 18.1511 | 6.10435 | 0.251669 |
| 28 | 1 | 100000 | 4 | individual | 50 | 14.1 s ± 932 ms | 10.2 s ± 60.3 ms | 13.1666 | 10.1478 | 3.0188 | 0.229277 |
| 17 | 1 | 10000 | 2 | average | 100 | 1.96 s ± 311 ms | 1.4 s ± 2.64 ms | 1.65249 | 1.39596 | 0.256534 | 0.15524 |
| 16 | 1 | 10000 | 2 | average | 100 | 1.96 s ± 311 ms | 1.45 s ± 41.7 ms | 1.65249 | 1.40574 | 0.246758 | 0.149325 |
| 15 | 1 | 10000 | 2 | average | 100 | 1.51 s ± 11 ms | 1.4 s ± 2.64 ms | 1.50335 | 1.39596 | 0.107393 | 0.0714358 |
| 2 | 4 | 10000 | 4 | average | 50 | 2.43 s ± 117 ms | 2.15 s ± 1.56 ms | 2.31589 | 2.15133 | 0.164566 | 0.0710596 |
| 7 | 4 | 100000 | 2 | individual | 10 | 14.3 s ± 47.7 ms | 13.4 s ± 109 ms | 14.2779 | 13.331 | 0.94689 | 0.0663187 |
| 14 | 1 | 10000 | 2 | average | 100 | 1.51 s ± 11 ms | 1.45 s ± 41.7 ms | 1.50335 | 1.40574 | 0.097618 | 0.0649335 |
| 2 | 4 | 10000 | 4 | average | 50 | 2.34 s ± 42.6 ms | 2.15 s ± 1.56 ms | 2.29967 | 2.15133 | 0.148345 | 0.064507 |
| 0 | 5 | 100000 | 2 | average | 200 | 7.56 s ± 209 ms | 6.98 s ± 81.7 ms | 7.35367 | 6.89909 | 0.45458 | 0.0618168 |
| 33 | 4 | 100000 | 4 | average | 100 | 3.35 s ± 8.13 ms | 3.22 s ± 42.7 ms | 3.34491 | 3.1804 | 0.164512 | 0.0491827 |
| 30 | 4 | 10000 | 2 | average | 100 | 4.08 s ± 250 ms | 3.67 s ± 19.7 ms | 3.82567 | 3.65248 | 0.173185 | 0.0452691 |
| 3 | 5 | 100000 | 1 | average | 10 | 691 ms ± 14.5 ms | 649 ms ± 757 µs | 0.676198 | 0.648693 | 0.0275046 | 0.0406754 |
| 4 | 4 | 100000 | 2 | individual | 200 | 2min 16s ± 718 ms | 2min 11s ± 583 ms | 135.885 | 130.866 | 5.01914 | 0.0369366 |
| 8 | 4 | 100000 | 2 | individual | 10 | 14.3 s ± 47.7 ms | 14.3 s ± 521 ms | 14.2779 | 13.7621 | 0.515808 | 0.0361265 |
| 4 | 4 | 100000 | 2 | individual | 200 | 2min 16s ± 386 ms | 2min 11s ± 583 ms | 135.739 | 130.866 | 4.87275 | 0.035898 |
| 32 | 4 | 100000 | 1 | average | 10 | 556 ms ± 5.98 ms | 534 ms ± 3.14 ms | 0.549602 | 0.531212 | 0.0183903 | 0.0334611 |
| 23 | 4 | 10000 | 1 | average | 50 | 1.85 s ± 353 µs | 1.81 s ± 1.64 ms | 1.84987 | 1.80488 | 0.0449915 | 0.0243215 |
| 13 | 5 | 10000 | 2 | individual | 200 | 1min 15s ± 116 ms | 1min 13s ± 48.3 ms | 75.5593 | 73.7815 | 1.77786 | 0.0235293 |
| 1 | 4 | 10000 | 2 | individual | 50 | 19.6 s ± 1.22 s | 18 s ± 71.3 ms | 18.3876 | 17.9554 | 0.432212 | 0.0235056 |
| 1 | 4 | 10000 | 2 | individual | 50 | 18.3 s ± 10.1 ms | 18 s ± 71.3 ms | 18.2911 | 17.9554 | 0.335784 | 0.0183577 |
| 0 | 5 | 100000 | 2 | average | 200 | 7.24 s ± 259 ms | 6.98 s ± 81.7 ms | 6.98345 | 6.89909 | 0.0843604 | 0.0120801 |
| 8 | 4 | 10000 | 1 | average | 10 | 473 ms ± 20.4 ms | 448 ms ± 311 µs | 0.452602 | 0.447662 | 0.00493988 | 0.0109144 |
| 24 | 2 | 100000 | 2 | average | 100 | 2 s ± 13.8 ms | 1.99 s ± 20.8 ms | 1.98503 | 1.97201 | 0.013015 | 0.0065566 |
| 3 | 5 | 100000 | 1 | average | 10 | 653 ms ± 3.02 ms | 649 ms ± 757 µs | 0.650019 | 0.648693 | 0.0013255 | 0.00203917 |
| 12 | 4 | 10000 | 1 | individual | 200 | 1min 33s ± 160 ms | 1min 33s ± 77.4 ms | 93.8372 | 93.8985 | -0.0613526 | -0.00065382 |
| 6 | 5 | 10000 | 1 | individual | 200 | 1min 57s ± 36.2 ms | 1min 57s ± 105 ms | 117.526 | 117.677 | -0.151023 | -0.00128502 |
| 19 | 5 | 10000 | 4 | average | 10 | 1.83 s ± 168 ms | 1.71 s ± 43.7 ms | 1.66691 | 1.67113 | -0.00421947 | -0.00253131 |
| 18 | 2 | 100000 | 2 | average | 50 | 1.49 s ± 168 ms | 1.38 s ± 47.7 ms | 1.32687 | 1.33121 | -0.00433733 | -0.00326884 |
| 20 | 2 | 10000 | 2 | individual | 10 | 4.94 s ± 17.2 ms | 5.05 s ± 86.7 ms | 4.92447 | 4.96486 | -0.040398 | -0.00820353 |
| 22 | 5 | 100000 | 2 | individual | 100 | 1min 35s ± 9.64 s | 1min 28s ± 961 ms | 86.3375 | 87.1939 | -0.856435 | -0.00991961 |
| 27 | 1 | 10000 | 1 | individual | 100 | 12.3 s ± 2.39 ms | 12.6 s ± 100 ms | 12.3309 | 12.4598 | -0.1289 | -0.0104534 |
| 26 | 2 | 10000 | 2 | individual | 100 | 14.8 s ± 18.8 ms | 15 s ± 5.16 ms | 14.8146 | 14.984 | -0.169479 | -0.01144 |
| 9 | 4 | 100000 | 2 | individual | 10 | 13.1 s ± 35.9 ms | 13.4 s ± 109 ms | 13.058 | 13.331 | -0.272955 | -0.0209032 |
| 9 | 4 | 10000 | 1 | individual | 200 | 1min 32s ± 405 ms | 1min 33s ± 77.4 ms | 91.6971 | 93.8985 | -2.20142 | -0.0240075 |
| 6 | 5 | 10000 | 1 | individual | 200 | 1min 56s ± 1.71 s | 1min 57s ± 105 ms | 114.774 | 117.677 | -2.9032 | -0.025295 |
| 25 | 2 | 10000 | 2 | individual | 200 | 25.5 s ± 181 ms | 26 s ± 2.47 ms | 25.3149 | 25.9944 | -0.679459 | -0.0268402 |
| 10 | 4 | 100000 | 2 | individual | 10 | 13.1 s ± 35.9 ms | 14.3 s ± 521 ms | 13.058 | 13.7621 | -0.704036 | -0.053916 |
| 11 | 4 | 10000 | 1 | average | 10 | 427 ms ± 2.34 ms | 448 ms ± 311 µs | 0.42461 | 0.447662 | -0.0230521 | -0.05429 |
| 29 | 2 | 10000 | 4 | individual | 10 | 6.24 s ± 815 ms | 6.28 s ± 187 ms | 5.42025 | 6.09757 | -0.677328 | -0.124963 |

So the modification seems to lead to a speedup in cases when effective_n_jobs > n_features and to no significant change in most other cases.

The code I used to come up with these benchmarks (run in an IPython/Jupyter session, since it uses the %timeit magic):

```python
from sklearn.inspection import plot_partial_dependence
from sklearn.inspection import partial_dependence
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

import pandas as pd
from matplotlib import pyplot
import random


timing_results = []

def timeit_and_save_result(X, **kwargs):
    print(f"Executing timeit with {kwargs}")
    n_features = kwargs.pop('n_features')
    n_samples = kwargs.pop('n_samples')
    
    pyplot.close("all") 
    timeit_res = %timeit -r 2 -o plot_partial_dependence(est, X, range(n_features), random_state=0, **kwargs)
    
    timing_results.append({
        'timeit_result': timeit_res,
        'n_features': n_features,
        'n_samples': n_samples,
        **kwargs
    })
    

X, y = make_regression(n_samples=int(1e5), n_features=10, random_state=0)

est = RandomForestRegressor(n_estimators=10, n_jobs=4, random_state=0)
est.fit(X, y)


param_grid = {
    'n_samples': [int(1e4), int(1e5)],
    'n_features': [1, 2, 4, 5],
    'n_jobs': [1, 2, 4],
    'kind': ['average', 'individual'],
    'grid_resolution': [10, 50, 100, 200]
}


def sample_params(grid):
    return {k: random.choice(v) for k,v in grid.items()}


for _ in range(30):
    params = sample_params(param_grid)
    X_sample = X[:params['n_samples']]
    timeit_and_save_result(X_sample, **params)
```
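For reference, the pandas import above is presumably there to collect timing_results into the comparison table; a minimal sketch of that post-processing step (not shown in the PR, so the column handling is an assumption):

```python
# Hypothetical post-processing: run the loop above once on main and once on
# this branch, build a DataFrame per run, then merge them on the parameter
# columns to compare best times.
results_df = pd.DataFrame(timing_results)
# IPython's TimeitResult objects expose the best run time in seconds.
results_df["best"] = results_df["timeit_result"].apply(lambda r: r.best)
print(results_df.sort_values("best"))
```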

@timo-stoettner timo-stoettner changed the title [MRG ] Improved parallel execution of plot_partial_dependence / partial_dependence [MRG] Improved parallel execution of plot_partial_dependence / partial_dependence Feb 12, 2021
@jjerphan jjerphan (Member) left a comment

Thanks @timo-stoettner. This indeed looks like an interesting proposition for parallelisation.

To better conclude on the performance of your solution, I would suggest performing a benchmark between main and your branch on a varying number of features and jobs.
You can report results in a table and/or graphically here; as an example, see this comment from @thomasjpfan.

Comment on lines +725 to +736
```python
input_shape = (20, 3, 3)
data = np.ones(input_shape)

def shapes_after_split(n_jobs):
    return [x.shape for x in
            _split_data_for_parallel_execution(data, n_jobs)]

assert shapes_after_split(None)[0] == input_shape
assert shapes_after_split(2)[0] == (10, 3, 3)
assert shapes_after_split(2)[-1] == (10, 3, 3)
assert shapes_after_split(32)[0] == (1, 3, 3)
assert len(shapes_after_split(32)) == 20
```
Maybe also test with an odd input_shape. In that case:

Suggested change

```diff
-input_shape = (20, 3, 3)
-data = np.ones(input_shape)
-
-def shapes_after_split(n_jobs):
-    return [x.shape for x in
-            _split_data_for_parallel_execution(data, n_jobs)]
-
-assert shapes_after_split(None)[0] == input_shape
-assert shapes_after_split(2)[0] == (10, 3, 3)
-assert shapes_after_split(2)[-1] == (10, 3, 3)
-assert shapes_after_split(32)[0] == (1, 3, 3)
-assert len(shapes_after_split(32)) == 20
+input_shape = (21, 3, 3)
+data = np.ones(input_shape)
+
+def shapes_after_split(n_jobs):
+    return [x.shape for x in
+            _split_data_for_parallel_execution(data, n_jobs)]
+
+assert shapes_after_split(None)[0] == input_shape
+assert shapes_after_split(2)[0] == (10, 3, 3)
+assert shapes_after_split(2)[-1] == (11, 3, 3)
+assert len(shapes_after_split(32)) == 21
```

sklearn/inspection/_partial_dependence.py (outdated review comment, resolved)
@timo-stoettner timo-stoettner changed the title [MRG] Improved parallel execution of plot_partial_dependence / partial_dependence [WIP] Improved parallel execution of plot_partial_dependence / partial_dependence Feb 13, 2021
@timo-stoettner (Author)

Thanks @jjerphan! I followed your suggestion and ran some more extensive benchmarks. In some cases that I haven't tested before, main indeed appears to be faster. I'll have to figure out why and adapt the code accordingly. I've changed the PR's title back to WIP for now.

@NicolasHug (Member)

Thanks for the proposal @timo-stoettner

> In some cases that I haven't tested before, main indeed appears to be faster. I'll have to figure out why and adapt the code accordingly.

As far as I understand, this PR splits the data into n_jobs chunks so each job works on about grid_resolution // effective_n_jobs data points. When there aren't too many data points and the computation is pretty fast (e.g. with the recursion method), spawning the jobs can induce a significant overhead. This is even more true when there are many plots to make, as this will spawn effective_n_jobs * n_features jobs.

Parallelizing over samples might be a good idea, but indeed we need to figure out in which cases this makes sense. Typically, if effective_n_jobs <= n_features I think we should stick to the current parallelization.
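A minimal, purely illustrative sketch of such a strategy switch (the function and strategy names are made up for this comment, not scikit-learn API):

```python
from joblib import effective_n_jobs


def choose_parallel_strategy(n_jobs, n_features):
    # If there are at least as many features as workers, one job per feature
    # already keeps every worker busy, so keep the existing behaviour.
    if effective_n_jobs(n_jobs) <= n_features:
        return "per_feature"
    # Otherwise split each feature's grid/samples into chunks so the extra
    # workers are not idle (at the cost of more job-spawning overhead).
    return "per_grid_chunk"
```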

> I would suggest performing a benchmark between main and your branch on a varying number of features and jobs

and grid_resolution ;)

@timo-stoettner (Author)

Thanks @NicolasHug! Good points; I agree that we should stick to the current parallelization if there are more features/plots than jobs. I've tried that already, and it seems to catch at least most of the cases where my proposal led to a decrease in performance. I'll run some more extensive benchmarks and update the PR in the next few days when I get to it.

timo-stoettner and others added 2 commits February 20, 2021 10:58
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
… based on ratio of n_features / effective_n_jobs
@timo-stoettner (Author)

@NicolasHug As suggested, I've added different parallelization strategies based on the comparison of effective_n_jobs and n_features. I've also added the results of some more extensive benchmarks to the PR description.

@timo-stoettner timo-stoettner changed the title [WIP] Improved parallel execution of plot_partial_dependence / partial_dependence [MRG] Improved parallel execution of plot_partial_dependence / partial_dependence Feb 20, 2021