ENH: Need API support and __repr__ to discover the storage used for strings #59342

Open
@arnaudlegout

Description


Originally raised in #58551 (comment)

Problem Description

With PDEP-14, developers need to be aware of the storage used for strings. Indeed, the storage can have a large impact on performance. For instance:

  • pyarrow storage
    • pros: compact (optimal memory footprint), fast (vectorization)
    • cons: immutable (so any modification creates a new pyarrow ChunkedArray of strings)
  • python storage
    • pros: mutable
    • cons: highest memory footprint (each string is a different Python object), slow (no vectorization)
  • numpy 2.0 strings storage (I don't have good knowledge of these new strings and have never tested them)
    • pros: compact, vectorization, mutable (my understanding is that it takes more space and is slower than pyarrow strings)
    • cons: different representations depending on the string size, which makes understanding performance harder
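
To make the memory-footprint point above concrete, here is a minimal, standard-library-only sketch. It compares a list of separate Python string objects with a packed layout (one contiguous UTF-8 buffer plus an array of offsets), which is a simplified approximation of how a columnar string array can be stored. It is not the actual pyarrow or pandas implementation, and the helper names `pack` and `get` are hypothetical.

```python
import sys
from array import array


def pack(strings):
    """Pack strings into one UTF-8 buffer plus an array of end offsets."""
    data = bytearray()
    offsets = array("i", [0])  # offsets[i]..offsets[i + 1] delimits string i
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return bytes(data), offsets


def get(data, offsets, i):
    """Decode the i-th string back out of the packed buffer."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


# Distinct strings, so each one is a separate Python object.
values = [f"value-{i}" for i in range(10_000)]

# Per-object storage: the list of pointers plus every individual str object.
object_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Packed storage: raw character data plus one integer offset per boundary.
data, offsets = pack(values)
packed_bytes = len(data) + len(offsets) * offsets.itemsize

print(f"per-object storage: {object_bytes} bytes")
print(f"packed storage:     {packed_bytes} bytes")
```

The exact numbers depend on the Python version and platform, but the per-object layout pays a fixed header cost for every string, which the packed layout avoids. The packed buffer is also immutable here, so changing one string would mean rebuilding it, which mirrors the trade-off listed above.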

Feature Description

I would like to have two ways to discover the storage:

  • __repr__: its goal is to give information about the internals of an object. One option, suggested by @jorisvandenbossche, is to display <pandas.StringDtype(storage=...)> instead of string[storage].
  • .get_storage: returns the storage (I am not sure what is possible with the current implementation; a class would be best, otherwise a string). Such an API is useful to check that we have the correct storage before running time-consuming code.
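
The two requests above could look roughly like the following sketch. `MockStringDtype` and `require_storage` are hypothetical names used only for illustration, and the class is a stand-in, not the pandas implementation; `get_storage` here simply returns a string, which is one of the two options mentioned above.

```python
class MockStringDtype:
    """Stand-in for a string dtype (hypothetical, NOT pandas code)."""

    _VALID = {"python", "pyarrow"}

    def __init__(self, storage="python"):
        if storage not in self._VALID:
            raise ValueError(f"unknown storage: {storage!r}")
        self.storage = storage

    def get_storage(self):
        # Proposed accessor: returns the storage as a plain string.
        return self.storage

    def __repr__(self):
        # Proposed detailed form, exposing the storage explicitly.
        return f"<pandas.StringDtype(storage={self.storage!r})>"

    def __str__(self):
        # Short form, as in string[storage].
        return f"string[{self.storage}]"


def require_storage(dtype, expected):
    """Fail fast if the dtype does not use the expected storage."""
    actual = dtype.get_storage()
    if actual != expected:
        raise TypeError(f"expected {expected!r} string storage, got {actual!r}")


dtype = MockStringDtype("pyarrow")
print(repr(dtype))  # <pandas.StringDtype(storage='pyarrow')>
print(str(dtype))   # string[pyarrow]

# Guard placed before a time-consuming computation.
require_storage(dtype, "pyarrow")
```

Keeping the short form in `__str__` and the explicit form in `__repr__` is only one possible design; the point is that the storage becomes visible and checkable before expensive code runs.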

Alternative Solutions

.

Additional Context

No response

Metadata

Assignees

No one assigned

    Labels

    Enhancement, Needs Discussion (requires discussion from core team before further action), Strings (string extension data type and string data)
