Closed
Description
Feature Type
-
Adding new functionality to pandas
-
Changing existing functionality in pandas
-
Removing existing functionality in pandas
Problem Description
I wish I had a function that would inspect a DataFrame that has duplicate values and yield, per each group of rows that have a duplicate value, a subset of the input DataFrame featuring only the columns that vary.
Feature Description
from typing import Union
import pandas as pd
class ExtendedDF(pd.DataFrame):
@property
def _constructor(self):
return ExtendedDF
def inspect_duplicates(self, key_col: str) -> Union[pd.DataFrame, 'ExtendedDF']:
"""Inspects a DataFrame that has duplicate values in the `key_col` column,
and yields, per each group of rows that have same `key_col` value, a subset
of the input DataFrame featuring only the columns that vary.
Args:
key_col (str): name of the column with duplicate values
Yields:
pd.DataFrame: per each group of rows that have same `key_col` value,
yields a subset of the input DataFrame featuring only the columns that
vary.
Examples:
Consider a dataset containing ramen ratings with duplicates:
```py
df = pd.DataFrame({
'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
'style': ['cup', 'pack', 'cup', 'cup', 'pack'],
'rating': [4, 3.5, 4, 15, 5],
'col_that_doesnt_change': ['so yummy' for _ in range(5)],
'another_col_that_doesnt_change': ['mmm love it' for _ in range(5)],
})
df = ExtendedDF(df)
```
```
df
brand style rating col_that_doesnt_change another_col_that_doesnt_change
0 Yum Yum cup 4.0 so yummy mmm love it
1 Yum Yum pack 3.5 so yummy mmm love it
2 Indomie cup 4.0 so yummy mmm love it
3 Indomie cup 15.0 so yummy mmm love it
4 Indomie pack 5.0 so yummy mmm love it
```
Inspect the duplicates using 'brand' column as the key:
```py
print(
*df.inspect_duplicates('brand')
)
```
```
brand rating
2 Indomie 4.0
3 Indomie 15.0
4 Indomie 5.0
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum pack 3.5
```
Inspect the duplicates using 'style' column as the key:
```py
print(
*df.inspect_duplicates('style')
)
```
```
style brand
0 cup Yum Yum
2 cup Indomie
3 cup Indomie
style brand rating
1 pack Yum Yum 3.5
4 pack Indomie 5.0
```
Inspect the duplicates using 'rating' column as the key:
```py
print(
*df.inspect_duplicates('rating')
)
```
```
rating brand
0 4.0 Yum Yum
2 4.0 Indomie
```
You can also concatenate everything that is yielded into a single DataFrame:
```py
print(
pd.concat([
*df.inspect_duplicates('brand')
])
)
```
```
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum pack 3.5
2 Indomie NaN 4.0
3 Indomie NaN 15.0
4 Indomie NaN 5.0
```
"""
mark_all_dupl_mask = self.duplicated(key_col, keep=False)
df_dupl = self.loc[mark_all_dupl_mask]
for k in set(df_dupl[key_col].values):
sub_df = self.loc[self[key_col] == k]
mask_eq = sub_df.iloc[0] != sub_df.iloc[1]
diff_cols = mask_eq.loc[mask_eq].index.values
yield sub_df.loc[:, [key_col] + list(diff_cols)]
Alternative Solutions
None
Additional Context
Authors: