ENH: grep-like select columns of a DataFrame by a part of their names

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

I wish I could grep-like select columns of a DataFrame by a part of their names, and return a subset of the original DataFrame containing only columns that match the substring.

Feature Description

from typing import List, Union
import pandas as pd


class ExtendedDF(pd.DataFrame):
    @property
    def _constructor(self):
        return ExtendedDF

    def select_by_substr(self, substr: Union[str, List[str]], *, ignore_case: bool = True) -> Union[pd.DataFrame, 'ExtendedDF']:
        """grep-like select columns of a DataFrame by a part of their names.

        Args:
            substr (Union[str, List[str]]): a string or a list of strings to be used as search patterns
            ignore_case (bool): if True (default), ignore search pattern case

        Returns:
            pd.DataFrame: a subset of the original DataFrame containing only columns that match the substring

        Usage:

        Consider two DataFrame objects extracted from two different sources, and thus varying in their column names:

        ```py
        df1 = pd.DataFrame({
            'Distance': [105.0, 0.0, 4.0, 1.0, 1241.0],
            'Distance_percent': [0.2, 0.0, 5.2, 11.1, 92.8],
            'Mixed': [921.0, 0.0, 52.0, 5.0, 0.0],
            'Mixed_percent': [1.9, 0.0, 67.5, 55.6, 0.0],
            'avg_diff': [121146.9, 293246.3, 212169.9, 41299.8, 29438.3],
            'med_diff': [17544.0, 1657.0, 55205.0, 95750.0, 2577.0],
        })
        df2 = pd.DataFrame({
            'distance': [105.0, 0.0, 4.0, 1.0, 1241.0],
            'distance_percent': [0.2, 0.0, 5.2, 11.1, 92.8],
            'mixed': [921.0, 0.0, 52.0, 5.0, 0.0],
            'mixed_percent': [1.9, 0.0, 67.5, 55.6, 0.0],
            'diff_avg': [121146.9, 293246.3, 212169.9, 41299.8, 29438.3],
            'diff_med': [17544.0, 1657.0, 55205.0, 95750.0, 2577.0],
        })
        df1 = ExtendedDF(df1)
        df2 = ExtendedDF(df2)
        ```
        ```
        df1
           Distance  Distance_percent  Mixed  Mixed_percent  avg_diff  med_diff
        0     105.0               0.2  921.0            1.9  121146.9   17544.0
        1       0.0               0.0    0.0            0.0  293246.3    1657.0
        2       4.0               5.2   52.0           67.5  212169.9   55205.0
        3       1.0              11.1    5.0           55.6   41299.8   95750.0
        4    1241.0              92.8    0.0            0.0   29438.3    2577.0

        df2
           distance  distance_percent  mixed  mixed_percent  diff_avg  diff_med
        0     105.0               0.2  921.0            1.9  121146.9   17544.0
        1       0.0               0.0    0.0            0.0  293246.3    1657.0
        2       4.0               5.2   52.0           67.5  212169.9   55205.0
        3       1.0              11.1    5.0           55.6   41299.8   95750.0
        4    1241.0              92.8    0.0            0.0   29438.3    2577.0
        ```

        As an analyst, I need to inspect which column is which between the two datasets:

        (a) either by defining a single string search pattern (`ignore_case=True` by default):

        ```py
        cols_to_select = 'diff'
        print('df1:')
        print(df1.select_by_substr(cols_to_select).T) # transposed for a better legibility
        print()
        print('df2:')
        print(df2.select_by_substr(cols_to_select).T) # transposed for a better legibility
        ```
        ```
        df1:
                         0         1         2        3        4
        avg_diff  121146.9  293246.3  212169.9  41299.8  29438.3
        med_diff   17544.0    1657.0   55205.0  95750.0   2577.0

        df2:
                         0         1         2        3        4
        diff_avg  121146.9  293246.3  212169.9  41299.8  29438.3
        diff_med   17544.0    1657.0   55205.0  95750.0   2577.0
        ```

        (b) or by defining a list of string search patterns (`ignore_case=True` by default):

        ```py
        cols_to_select = ['dist', 'Mix']
        print('df1:')
        print(df1.select_by_substr(cols_to_select).T) # transposed for a better legibility
        print()
        print('df2:')
        print(df2.select_by_substr(cols_to_select).T) # transposed for a better legibility
        ```
        ```
        df1:
                              0    1     2     3       4
        Mixed             921.0  0.0  52.0   5.0     0.0
        Distance          105.0  0.0   4.0   1.0  1241.0
        Mixed_percent       1.9  0.0  67.5  55.6     0.0
        Distance_percent    0.2  0.0   5.2  11.1    92.8

        df2:
                              0    1     2     3       4
        mixed_percent       1.9  0.0  67.5  55.6     0.0
        mixed             921.0  0.0  52.0   5.0     0.0
        distance          105.0  0.0   4.0   1.0  1241.0
        distance_percent    0.2  0.0   5.2  11.1    92.8
        ```

        (c) or, same as (b) but with an explicit `ignore_case=False`:

        ```py
        cols_to_select = ['dist', 'Mix']
        print('df1:')
        print(df1.select_by_substr(cols_to_select, ignore_case=False).T) # transposed for a better legibility
        print()
        print('df2:')
        print(df2.select_by_substr(cols_to_select, ignore_case=False).T) # transposed for a better legibility
        ```
        ```
        df1:
                           0    1     2     3    4
        Mixed_percent    1.9  0.0  67.5  55.6  0.0
        Mixed          921.0  0.0  52.0   5.0  0.0

        df2:
                              0    1    2     3       4
        distance_percent    0.2  0.0  5.2  11.1    92.8
        distance          105.0  0.0  4.0   1.0  1241.0
        ```
        """
        substr = [substr] if isinstance(substr, str) else substr
        if ignore_case:
            selected_cols = [col_name for col_name in self.columns for s in substr if s.casefold() in col_name.casefold()]
        else:
            selected_cols = [col_name for col_name in self.columns for s in substr if s in col_name]
        selected_cols = list(set(selected_cols))
        return self[selected_cols]

Alternative Solutions

Idk

Additional Context

No response

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: grep-like select columns of a DataFrame by a part of their names #61319

Feature Type

Problem Description

Feature Description

Alternative Solutions

Additional Context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Search code, repositories, users, issues, pull requests...

ENH: grep-like select columns of a DataFrame by a part of their names #61319

Description

Feature Type

Problem Description

Feature Description

Alternative Solutions

Additional Context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions