Closed
Description
Feature Type
-
Adding new functionality to pandas
-
Changing existing functionality in pandas
-
Removing existing functionality in pandas
Problem Description
I wish I could grep-like select columns of a DataFrame by a part of their names, and return a subset of the original DataFrame containing only columns that match the substring.
Feature Description
from typing import List, Union
import pandas as pd
class ExtendedDF(pd.DataFrame):
@property
def _constructor(self):
return ExtendedDF
def select_by_substr(self, substr: Union[str, List[str]], *, ignore_case: bool = True) -> Union[pd.DataFrame, 'ExtendedDF']:
"""grep-like select columns of a DataFrame by a part of their names.
Args:
substr (Union[str, List[str]]): a string or a list of strings to be used as search patterns
ignore_case (bool): if True (default), ignore search pattern case
Returns:
pd.DataFrame: a subset of the original DataFrame containing only columns that match the substring
Usage:
Consider two DataFrame objects extracted from two different sources, and thus varying in their column names:
```py
df1 = pd.DataFrame({
'Distance': [105.0, 0.0, 4.0, 1.0, 1241.0],
'Distance_percent': [0.2, 0.0, 5.2, 11.1, 92.8],
'Mixed': [921.0, 0.0, 52.0, 5.0, 0.0],
'Mixed_percent': [1.9, 0.0, 67.5, 55.6, 0.0],
'avg_diff': [121146.9, 293246.3, 212169.9, 41299.8, 29438.3],
'med_diff': [17544.0, 1657.0, 55205.0, 95750.0, 2577.0],
})
df2 = pd.DataFrame({
'distance': [105.0, 0.0, 4.0, 1.0, 1241.0],
'distance_percent': [0.2, 0.0, 5.2, 11.1, 92.8],
'mixed': [921.0, 0.0, 52.0, 5.0, 0.0],
'mixed_percent': [1.9, 0.0, 67.5, 55.6, 0.0],
'diff_avg': [121146.9, 293246.3, 212169.9, 41299.8, 29438.3],
'diff_med': [17544.0, 1657.0, 55205.0, 95750.0, 2577.0],
})
df1 = ExtendedDF(df1)
df2 = ExtendedDF(df2)
```
```
df1
Distance Distance_percent Mixed Mixed_percent avg_diff med_diff
0 105.0 0.2 921.0 1.9 121146.9 17544.0
1 0.0 0.0 0.0 0.0 293246.3 1657.0
2 4.0 5.2 52.0 67.5 212169.9 55205.0
3 1.0 11.1 5.0 55.6 41299.8 95750.0
4 1241.0 92.8 0.0 0.0 29438.3 2577.0
df2
distance distance_percent mixed mixed_percent diff_avg diff_med
0 105.0 0.2 921.0 1.9 121146.9 17544.0
1 0.0 0.0 0.0 0.0 293246.3 1657.0
2 4.0 5.2 52.0 67.5 212169.9 55205.0
3 1.0 11.1 5.0 55.6 41299.8 95750.0
4 1241.0 92.8 0.0 0.0 29438.3 2577.0
```
As an analyst, I need to inspect which column is which between the two datasets:
(a) either by defining a single string search pattern (`ignore_case=True` by default):
```py
cols_to_select = 'diff'
print('df1:')
print(df1.select_by_substr(cols_to_select).T) # transposed for a better legibility
print()
print('df2:')
print(df2.select_by_substr(cols_to_select).T) # transposed for a better legibility
```
```
df1:
0 1 2 3 4
avg_diff 121146.9 293246.3 212169.9 41299.8 29438.3
med_diff 17544.0 1657.0 55205.0 95750.0 2577.0
df2:
0 1 2 3 4
diff_avg 121146.9 293246.3 212169.9 41299.8 29438.3
diff_med 17544.0 1657.0 55205.0 95750.0 2577.0
```
(b) or by defining a list of string search patterns (`ignore_case=True` by default):
```py
cols_to_select = ['dist', 'Mix']
print('df1:')
print(df1.select_by_substr(cols_to_select).T) # transposed for a better legibility
print()
print('df2:')
print(df2.select_by_substr(cols_to_select).T) # transposed for a better legibility
```
```
df1:
0 1 2 3 4
Mixed 921.0 0.0 52.0 5.0 0.0
Distance 105.0 0.0 4.0 1.0 1241.0
Mixed_percent 1.9 0.0 67.5 55.6 0.0
Distance_percent 0.2 0.0 5.2 11.1 92.8
df2:
0 1 2 3 4
mixed_percent 1.9 0.0 67.5 55.6 0.0
mixed 921.0 0.0 52.0 5.0 0.0
distance 105.0 0.0 4.0 1.0 1241.0
distance_percent 0.2 0.0 5.2 11.1 92.8
```
(c) or, same as (b) but with an explicit `ignore_case=False`:
```py
cols_to_select = ['dist', 'Mix']
print('df1:')
print(df1.select_by_substr(cols_to_select, ignore_case=False).T) # transposed for a better legibility
print()
print('df2:')
print(df2.select_by_substr(cols_to_select, ignore_case=False).T) # transposed for a better legibility
```
```
df1:
0 1 2 3 4
Mixed_percent 1.9 0.0 67.5 55.6 0.0
Mixed 921.0 0.0 52.0 5.0 0.0
df2:
0 1 2 3 4
distance_percent 0.2 0.0 5.2 11.1 92.8
distance 105.0 0.0 4.0 1.0 1241.0
```
"""
substr = [substr] if isinstance(substr, str) else substr
if ignore_case:
selected_cols = [col_name for col_name in self.columns for s in substr if s.casefold() in col_name.casefold()]
else:
selected_cols = [col_name for col_name in self.columns for s in substr if s in col_name]
selected_cols = list(set(selected_cols))
return self[selected_cols]
Alternative Solutions
Idk
Additional Context
No response
Metadata
Metadata
Assignees
Labels
Related to indexing on series/frames, not to indexes themselvesRelated to indexing on series/frames, not to indexes themselvesClarification about behavior needed to assess issueClarification about behavior needed to assess issue