Viewed 364 times
1

What is the (best practice) correct way to iterate over DataFrames?

I am using:

for i in range(working.shape[0]):
    for j in range(1, working.shape[1]):
        working.iloc[i,j] = (100 - working.iloc[i,j])*100

The above is correct but does not line up with other Stack Overflow answers. I was hoping that someone could explain why the above is not optimal and suggest a superior implementation.

I am very much a novice in programming in general and pandas in particular. Apologies for asking a question that has already been addressed on SO; I didn't really understand the existing answers. Possible duplicate, but this answer is easy to understand for a novice, if less comprehensive.

2
  • Fantastic, thank you very much! However, my code omits the first column - can I use applymap more selectively?
    – Tikhon, Aug 29, 2019 at 23:11
  • See this answer for more information about how NOT to iterate over a DataFrame.
    – Ben.T, Aug 29, 2019 at 23:18

1 Answer

4

What is the (best practice) correct way to iterate over DataFrames?

There are several ways (for example, iterrows), but in general you should avoid iterating whenever you can. pandas offers several tools for vectorized operations, which will almost always be faster than an iterative solution.

The example you provided can be vectorized in the following way using iloc:

working.iloc[:, 1:] = (100 - working.iloc[:, 1:]) * 100
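As a sanity check, here is a minimal sketch showing that the vectorized assignment gives the same result as the nested loop from the question. The function names and the small sample frame are made up for illustration; only the two transformations themselves come from the code above.

```python
import pandas as pd


def transform_loop(df):
    """Cell-by-cell version from the question (skips the first column)."""
    out = df.copy()
    for i in range(out.shape[0]):
        for j in range(1, out.shape[1]):
            out.iloc[i, j] = (100 - out.iloc[i, j]) * 100
    return out


def transform_vectorized(df):
    """Same arithmetic applied to every column except the first in one step."""
    out = df.copy()
    out.iloc[:, 1:] = (100 - out.iloc[:, 1:]) * 100
    return out


# Hypothetical sample data: 'id' stands in for the untouched first column.
sample = pd.DataFrame({"id": [1, 2, 3], "a": [0, 50, 100], "b": [99, 98, 97]})

looped = transform_loop(sample)
vectorized = transform_vectorized(sample)

# Compare the raw values of both results.
print(looped.values.tolist() == vectorized.values.tolist())
```

Both functions work on a copy, so the original frame is left unchanged and the two results can be compared directly.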

Some timings:

from timeit import Timer

import pandas as pd

working = pd.DataFrame({'a': range(50), 'b': range(50)})


def iteration():
    for i in range(working.shape[0]):
        for j in range(1, working.shape[1]):
            working.iloc[i, j] = (100 - working.iloc[i, j]) * 100


def direct():
    # in actual code you will have to assign back to working.iloc[:, 1:]
    (100 - working.iloc[:, 1:]) * 100


print(min(Timer(iteration).repeat(50, 50)))
print(min(Timer(direct).repeat(50, 50)))

Outputs

0.38473859999999993
0.05334049999999735

That is roughly a 7x difference, and that's with only 50 rows.

Comment

I'm very grateful, guys - thank you.
