Iteration over a Pandas DataFrame [duplicate]

Question

What is the (best practice) correct way to iterate over DataFrames?

I am using:

for i in range(working.shape[0]):
    for j in range(1, working.shape[1]):
        working.iloc[i,j] = (100 - working.iloc[i,j])*100

The above is correct but does not line up with other Stack Overflow answers. I was hoping that someone could explain why the above is not optimal and suggest a superior implementation.

I am very much a novice in programming in general and pandas in particular. Apologies for asking a question that has already been addressed on Stack Overflow; I didn't really understand the existing answers. This is a possible duplicate, but this answer is easy to understand for a novice, if less comprehensive.

Fantastic, thank you very much! However, my code omits the first column - can I use applymap more selectively? — Tikhon, Commented Aug 29, 2019 at 23:11
see this answer for more information about how to NOT iterate over a dataframe — Ben.T, Commented Aug 29, 2019 at 23:18

DeepSpace · Accepted Answer · 2019-08-30 09:12:44Z

What is the (best practice) correct way to iterate over DataFrames?

There are several ways (for example iterrows), but in general you should avoid iterating whenever you can. pandas offers several tools for vectorized operations, which will almost always be faster than an iterative solution.
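To make the contrast concrete, here is a minimal sketch that computes the same column once with iterrows and once with a vectorized expression. The frame and its column names (price, qty) are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration.
df = pd.DataFrame({"price": [2.0, 3.5, 4.0], "qty": [3, 2, 5]})

# Row-by-row: iterrows yields (index, row) pairs, where each row is a Series.
totals_loop = []
for idx, row in df.iterrows():
    totals_loop.append(row["price"] * row["qty"])

# Vectorized: one expression that operates on whole columns at once.
totals_vec = df["price"] * df["qty"]

# Both approaches give the same numbers.
assert totals_loop == totals_vec.tolist()
```

The vectorized line states the whole computation in one expression and leaves the looping to pandas internals, which is where the speed difference typically comes from.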

The example you provided can be vectorized in the following way using iloc:

working.iloc[:, 1:] = (100 - working.iloc[:, 1:]) * 100
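As a quick sanity check, the sketch below runs the nested loop from the question and the iloc one-liner on copies of a small frame and confirms they agree. The sample values and the column names (id, a, b) are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data; the first column is left untouched, as in the question.
working = pd.DataFrame(
    {"id": [1, 2, 3], "a": [10.0, 20.0, 30.0], "b": [99.0, 98.5, 0.0]}
)

# The nested loop from the question, applied to a copy.
looped = working.copy()
for i in range(looped.shape[0]):
    for j in range(1, looped.shape[1]):
        looped.iloc[i, j] = (100 - looped.iloc[i, j]) * 100

# The vectorized form, applied to another copy.
vectorized = working.copy()
vectorized.iloc[:, 1:] = (100 - vectorized.iloc[:, 1:]) * 100

# Same result either way; the first column is unchanged.
assert looped.equals(vectorized)
assert vectorized["id"].tolist() == [1, 2, 3]
```

The slice `1:` in the column position is what skips the first column, mirroring `range(1, working.shape[1])` in the loop.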

Some timings:

from timeit import Timer

import pandas as pd

working = pd.DataFrame({'a': range(50), 'b': range(50)})


def iteration():
    for i in range(working.shape[0]):
        for j in range(1, working.shape[1]):
            working.iloc[i, j] = (100 - working.iloc[i, j]) * 100


def direct():
    # in actual code you will have to assign back to working.iloc[:, 1:]
    (100 - working.iloc[:, 1:]) * 100


print(min(Timer(iteration).repeat(50, 50)))
print(min(Timer(direct).repeat(50, 50)))

Output (in seconds):

0.38473859999999993
0.05334049999999735

That is roughly a 7x difference, and that's with only 50 rows.
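If you want to see how the gap changes with the number of rows, one possible way to rerun the comparison is sketched below. The helpers make_frame and compare are hypothetical names introduced here. Unlike the benchmark above, direct assigns the result back, and the data is float so that repeated application does not overflow an integer column. Timings depend on the machine and the pandas version, so no expected numbers are shown.

```python
from timeit import Timer

import pandas as pd


def make_frame(n_rows):
    # Hypothetical helper: two columns, like the benchmark frame, but with floats.
    values = [float(x) for x in range(n_rows)]
    return pd.DataFrame({"a": values, "b": values})


def iteration(frame):
    # Cell-by-cell update, skipping the first column.
    for i in range(frame.shape[0]):
        for j in range(1, frame.shape[1]):
            frame.iloc[i, j] = (100 - frame.iloc[i, j]) * 100


def direct(frame):
    # Vectorized update of the same columns, assigned back in place.
    frame.iloc[:, 1:] = (100 - frame.iloc[:, 1:]) * 100


def compare(n_rows, repeat=3, number=3):
    # Each approach gets its own frame so neither sees the other's changes.
    loop_frame, vec_frame = make_frame(n_rows), make_frame(n_rows)
    loop_time = min(Timer(lambda: iteration(loop_frame)).repeat(repeat, number))
    vec_time = min(Timer(lambda: direct(vec_frame)).repeat(repeat, number))
    return loop_time, vec_time


results = {}
for n in (10, 100):
    results[n] = compare(n)
    print(n, results[n])
```

Each printed line shows the row count followed by the best loop time and the best vectorized time in seconds.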
