Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

ENH: grep-like select columns of a DataFrame by a part of their names #61319

Copy link
Copy link
Closed
@kirisakow

Description

@kirisakow
Issue body actions

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish I could grep-like select columns of a DataFrame by a part of their names, and return a subset of the original DataFrame containing only columns that match the substring.

Feature Description

from typing import List, Union
import pandas as pd


class ExtendedDF(pd.DataFrame):
    @property
    def _constructor(self):
        return ExtendedDF

    def select_by_substr(self, substr: Union[str, List[str]], *, ignore_case: bool = True) -> Union[pd.DataFrame, 'ExtendedDF']:
        """grep-like select columns of a DataFrame by a part of their names.

        Args:
            substr (Union[str, List[str]]): a string or a list of strings to be used as search patterns
            ignore_case (bool): if True (default), ignore search pattern case

        Returns:
            pd.DataFrame: a subset of the original DataFrame containing only columns that match the substring

        Usage:

        Consider two DataFrame objects extracted from two different sources, and thus varying in their column names:

        ```py
        df1 = pd.DataFrame({
            'Distance': [105.0, 0.0, 4.0, 1.0, 1241.0],
            'Distance_percent': [0.2, 0.0, 5.2, 11.1, 92.8],
            'Mixed': [921.0, 0.0, 52.0, 5.0, 0.0],
            'Mixed_percent': [1.9, 0.0, 67.5, 55.6, 0.0],
            'avg_diff': [121146.9, 293246.3, 212169.9, 41299.8, 29438.3],
            'med_diff': [17544.0, 1657.0, 55205.0, 95750.0, 2577.0],
        })
        df2 = pd.DataFrame({
            'distance': [105.0, 0.0, 4.0, 1.0, 1241.0],
            'distance_percent': [0.2, 0.0, 5.2, 11.1, 92.8],
            'mixed': [921.0, 0.0, 52.0, 5.0, 0.0],
            'mixed_percent': [1.9, 0.0, 67.5, 55.6, 0.0],
            'diff_avg': [121146.9, 293246.3, 212169.9, 41299.8, 29438.3],
            'diff_med': [17544.0, 1657.0, 55205.0, 95750.0, 2577.0],
        })
        df1 = ExtendedDF(df1)
        df2 = ExtendedDF(df2)
        ```
        ```
        df1
           Distance  Distance_percent  Mixed  Mixed_percent  avg_diff  med_diff
        0     105.0               0.2  921.0            1.9  121146.9   17544.0
        1       0.0               0.0    0.0            0.0  293246.3    1657.0
        2       4.0               5.2   52.0           67.5  212169.9   55205.0
        3       1.0              11.1    5.0           55.6   41299.8   95750.0
        4    1241.0              92.8    0.0            0.0   29438.3    2577.0

        df2
           distance  distance_percent  mixed  mixed_percent  diff_avg  diff_med
        0     105.0               0.2  921.0            1.9  121146.9   17544.0
        1       0.0               0.0    0.0            0.0  293246.3    1657.0
        2       4.0               5.2   52.0           67.5  212169.9   55205.0
        3       1.0              11.1    5.0           55.6   41299.8   95750.0
        4    1241.0              92.8    0.0            0.0   29438.3    2577.0
        ```

        As an analyst, I need to inspect which column is which between the two datasets:

        (a) either by defining a single string search pattern (`ignore_case=True` by default):

        ```py
        cols_to_select = 'diff'
        print('df1:')
        print(df1.select_by_substr(cols_to_select).T) # transposed for a better legibility
        print()
        print('df2:')
        print(df2.select_by_substr(cols_to_select).T) # transposed for a better legibility
        ```
        ```
        df1:
                         0         1         2        3        4
        avg_diff  121146.9  293246.3  212169.9  41299.8  29438.3
        med_diff   17544.0    1657.0   55205.0  95750.0   2577.0

        df2:
                         0         1         2        3        4
        diff_avg  121146.9  293246.3  212169.9  41299.8  29438.3
        diff_med   17544.0    1657.0   55205.0  95750.0   2577.0
        ```

        (b) or by defining a list of string search patterns (`ignore_case=True` by default):

        ```py
        cols_to_select = ['dist', 'Mix']
        print('df1:')
        print(df1.select_by_substr(cols_to_select).T) # transposed for a better legibility
        print()
        print('df2:')
        print(df2.select_by_substr(cols_to_select).T) # transposed for a better legibility
        ```
        ```
        df1:
                              0    1     2     3       4
        Mixed             921.0  0.0  52.0   5.0     0.0
        Distance          105.0  0.0   4.0   1.0  1241.0
        Mixed_percent       1.9  0.0  67.5  55.6     0.0
        Distance_percent    0.2  0.0   5.2  11.1    92.8

        df2:
                              0    1     2     3       4
        mixed_percent       1.9  0.0  67.5  55.6     0.0
        mixed             921.0  0.0  52.0   5.0     0.0
        distance          105.0  0.0   4.0   1.0  1241.0
        distance_percent    0.2  0.0   5.2  11.1    92.8
        ```

        (c) or, same as (b) but with an explicit `ignore_case=False`:

        ```py
        cols_to_select = ['dist', 'Mix']
        print('df1:')
        print(df1.select_by_substr(cols_to_select, ignore_case=False).T) # transposed for a better legibility
        print()
        print('df2:')
        print(df2.select_by_substr(cols_to_select, ignore_case=False).T) # transposed for a better legibility
        ```
        ```
        df1:
                           0    1     2     3    4
        Mixed_percent    1.9  0.0  67.5  55.6  0.0
        Mixed          921.0  0.0  52.0   5.0  0.0

        df2:
                              0    1    2     3       4
        distance_percent    0.2  0.0  5.2  11.1    92.8
        distance          105.0  0.0  4.0   1.0  1241.0
        ```
        """
        substr = [substr] if isinstance(substr, str) else substr
        if ignore_case:
            selected_cols = [col_name for col_name in self.columns for s in substr if s.casefold() in col_name.casefold()]
        else:
            selected_cols = [col_name for col_name in self.columns for s in substr if s in col_name]
        selected_cols = list(set(selected_cols))
        return self[selected_cols]

Alternative Solutions

Idk

Additional Context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    EnhancementIndexingRelated to indexing on series/frames, not to indexes themselvesRelated to indexing on series/frames, not to indexes themselvesNeeds InfoClarification about behavior needed to assess issueClarification about behavior needed to assess issue

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      Morty Proxy This is a proxified and sanitized view of the page, visit original site.