PDEP-11: Change default of dropna to False #53094

Open
wants to merge 4 commits into
base: main
PDEP-11: Change default of dropna to False
rhshadrach committed May 5, 2023
commit 7c5f8c7b821178633575a893ff1691521fe0a7ed
78 changes: 78 additions & 0 deletions 78 web/pandas/pdeps/0011-dropna-default.md
@@ -0,0 +1,78 @@
# PDEP-11: dropna default in pandas

- Created: 4 May 2023
- Status: Under discussion
- Discussion: [PR ??](https://github.com/pandas-dev/pandas/pull/??)
- Authors: [Richard Shadrach](https://github.com/rhshadrach)
- Revision: 1

## Abstract

Throughout pandas, almost all of the methods that have a `dropna` argument default
to `True`. Because this is the default, NA values can be silently dropped.
This PDEP proposes to deprecate the current default value of `True` and change it
to `False` in the next major release of pandas.

## Motivation and Scope

Upon seeing the output for a Series `ser`:

```python
print(ser.value_counts())

1 3
2 1
dtype: Int64
```

users may not realize that the Series can contain NA values. Operating on the
data under the assumption that NA values are absent can then produce erroneous
results. The same issue can occur with `groupby`, which can also be used to produce
detailed summary statistics of data. We think it is not unreasonable that an
experienced pandas user seeing the code

    df[["a", "b"]].groupby("a").sum()

would describe this operation as something like the following.

> For each unique value in column `a`, compute the sum of corresponding values
> in column `b` and return the results in a DataFrame indexed by the unique
> values of `a`.

This is correct, except that NA values in the column `a` will be dropped from
the computation. That pandas is taking this additional step in the computation
is not apparent from the code, and can surprise users.
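
To make the two cases above concrete, here is a minimal sketch. The data is invented for illustration, and the comments assume a pandas version in which `dropna` still defaults to `True`.

```python
import pandas as pd

# Illustrative data only: a nullable-integer Series with one NA value.
ser = pd.Series([1, 1, 1, 2, pd.NA], dtype="Int64")

counts_default = ser.value_counts()              # NA is omitted from the result
counts_keep_na = ser.value_counts(dropna=False)  # NA gets its own row

# Illustrative frame: the last row has a missing key in column "a".
df = pd.DataFrame({"a": [1, 1, None], "b": [10, 20, 30]})

sum_default = df[["a", "b"]].groupby("a").sum()                # the b=30 row is dropped
sum_keep_na = df[["a", "b"]].groupby("a", dropna=False).sum()  # NA forms its own group
```

With the default, the totals computed from `sum_default` no longer account for every row of `df`, and nothing in the call itself signals that.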
mroeschke marked this conversation as resolved.

## Detailed Description

We propose to deprecate the current default of `dropna` and change it to
`False` across all applicable methods. The following methods have a `dropna`
argument; those marked with a `*` already default to `False`.

```python
Series.groupby
Series.mode
Series.nunique
*Series.to_hdf
Series.value_counts
DataFrame.groupby
DataFrame.mode
DataFrame.nunique
DataFrame.pivot_table
DataFrame.stack
*DataFrame.to_hdf
DataFrame.value_counts
SeriesGroupBy.nunique
SeriesGroupBy.value_counts
DataFrameGroupBy.nunique
DataFrameGroupBy.value_counts
```
Member

I think you might be missing a couple functions here.
This is the complete list (made using the keyword inspector script I posted on Slack a while back).

<class 'pandas.core.arrays.categorical.Categorical'>.value_counts
<class 'pandas.core.indexes.category.CategoricalIndex'>.nunique
<class 'pandas.core.indexes.category.CategoricalIndex'>.value_counts
<class 'pandas.core.frame.DataFrame'>.groupby
<class 'pandas.core.frame.DataFrame'>.mode
<class 'pandas.core.frame.DataFrame'>.nunique
<class 'pandas.core.frame.DataFrame'>.pivot_table
<class 'pandas.core.frame.DataFrame'>.stack
<class 'pandas.core.frame.DataFrame'>.to_hdf
<class 'pandas.core.frame.DataFrame'>.value_counts
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>.nunique
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>.value_counts
<class 'pandas.io.pytables.HDFStore'>.append
<class 'pandas.io.pytables.HDFStore'>.append_to_multiple
<class 'pandas.io.pytables.HDFStore'>.put
<class 'pandas.core.indexes.base.Index'>.nunique
<class 'pandas.core.indexes.base.Index'>.value_counts
<class 'pandas.core.indexes.interval.IntervalIndex'>.nunique
<class 'pandas.core.indexes.interval.IntervalIndex'>.value_counts
<class 'pandas.core.indexes.multi.MultiIndex'>.nunique
<class 'pandas.core.indexes.multi.MultiIndex'>.value_counts
<class 'pandas.core.indexes.period.PeriodIndex'>.nunique
<class 'pandas.core.indexes.period.PeriodIndex'>.value_counts
<class 'pandas.core.indexes.range.RangeIndex'>.nunique
<class 'pandas.core.indexes.range.RangeIndex'>.value_counts
<class 'pandas.core.series.Series'>.groupby
<class 'pandas.core.series.Series'>.mode
<class 'pandas.core.series.Series'>.nunique
<class 'pandas.core.series.Series'>.to_hdf
<class 'pandas.core.series.Series'>.value_counts
<class 'pandas.core.indexes.timedeltas.TimedeltaIndex'>.nunique
<class 'pandas.core.indexes.timedeltas.TimedeltaIndex'>.value_counts
crosstab
lreshape
pivot_table
value_counts

I think the missing ones are
crosstab, lreshape, HDFStore.put|append|append_to_multiple.
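
The inspector script itself is not shown in this thread. As a rough, hypothetical sketch of how such a list could be produced, the helper below uses the standard-library `inspect` module to find public callables on an object that accept a given parameter. The helper name and the `Toy` class are illustrative assumptions, not the script referred to above.

```python
import inspect


def callables_with_param(obj, param="dropna"):
    """Return sorted names of public callables on `obj` that accept `param`."""
    found = []
    for name in dir(obj):
        if name.startswith("_"):
            continue  # skip private and dunder attributes
        try:
            member = getattr(obj, name)
        except Exception:
            continue  # some attributes raise when accessed on the class
        if not callable(member):
            continue
        try:
            sig = inspect.signature(member)
        except (TypeError, ValueError):
            continue  # no introspectable signature (e.g. some builtins)
        if param in sig.parameters:
            found.append(name)
    return sorted(found)


# Toy stand-in so the sketch runs without pandas installed.
class Toy:
    def value_counts(self, dropna=True): ...
    def nunique(self, dropna=True): ...
    def head(self, n=5): ...


toy_hits = callables_with_param(Toy)
```

Under the same assumptions, passing a pandas class or the top-level `pandas` module as `obj` should give a comparable listing, though the result depends on the installed version.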


## Timeline

If accepted, the current `dropna` default would be deprecated as part of pandas
2.x and this deprecation would be enforced in pandas 3.0.
Contributor

How would users find out about this deprecation? I'm concerned it will create noisy messages. For example, if you were to do `df[["a", "b"]].groupby("a").sum()`, would you always get a deprecation message? Would you only get a message if the result would change because the column `"a"` had NA values?

So can you be more specific about how the deprecation would work?

Member Author

Added. A warning would only be emitted when `dropna` is unspecified and an NA value is encountered.
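
One way to read that rule is the sentinel-default pattern sketched below. This is a pure-Python illustration, not pandas code: `_NO_DEFAULT`, `_is_na`, and `count_values` are hypothetical names, and the warning text is invented.

```python
import math
import warnings

# Hypothetical sentinel that distinguishes "not passed" from an explicit True/False.
_NO_DEFAULT = object()


def _is_na(value):
    """Treat None and float NaN as NA for the purposes of this sketch."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_values(values, dropna=_NO_DEFAULT):
    """Count occurrences, warning only if the implicit default drops an NA."""
    if dropna is _NO_DEFAULT:
        if any(_is_na(v) for v in values):
            warnings.warn(
                "The default of dropna will change from True to False; "
                "pass dropna explicitly to silence this warning.",
                FutureWarning,
                stacklevel=2,
            )
        dropna = True  # keep the current behavior during the deprecation period
    counts = {}
    for v in values:
        if _is_na(v):
            if dropna:
                continue
            v = None  # normalize all NA values to a single key
        counts[v] = counts.get(v, 0) + 1
    return counts
```

Calls on data without NA values, and calls that pass `dropna` explicitly, stay silent, which addresses the concern about noisy messages.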


## PDEP History

- 4 May 2023: Initial draft