Originally raised in #58551 (comment)
Problem Description
With PDEP-14, developers need to be aware of the storage used for strings. The storage can have a large impact on performance, for instance:
pyarrow storage
- pros: compact (optimal memory footprint), fast (vectorization)
- cons: immutable (so any modification creates a new pyarrow ChunkedArray of strings)
python storage
- pros: mutable
- cons: highest memory footprint (each string is a separate Python object), slow (no vectorization)
numpy 2.0 strings storage (I don't have good knowledge of these new strings, and have never tested them)
- pros: compact, vectorization, mutable (my understanding is that it takes more space and is slower than pyarrow strings)
- cons: different representations depending on the string size, which makes understanding performance harder
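To make the memory argument concrete, here is a rough stdlib-only sketch. It does not use pandas or pyarrow: it compares a list of separate Python str objects with a simplified Arrow-style layout (one contiguous UTF-8 buffer plus integer offsets). The variable names and the helper `get` are made up for illustration, and the real pyarrow layout also carries a validity bitmap that is left out here.

```python
import sys
from array import array

words = [f"value-{i:06d}" for i in range(10_000)]

# "python"-style storage: one str object per element, plus the list of pointers.
object_bytes = sys.getsizeof(words) + sum(sys.getsizeof(w) for w in words)

# Arrow-style storage (simplified): all characters in one contiguous buffer,
# with offsets[i]:offsets[i + 1] delimiting the bytes of element i.
data = "".join(words).encode("utf-8")
offsets = array("i", [0])
for w in words:
    offsets.append(offsets[-1] + len(w.encode("utf-8")))
compact_bytes = len(data) + len(offsets) * offsets.itemsize


def get(i: int) -> str:
    """Decode element i from the contiguous buffer (illustrative helper)."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


print(f"object storage : {object_bytes:>9,} bytes")
print(f"compact storage: {compact_bytes:>9,} bytes")
print(get(42))
```

The exact numbers depend on the Python version, but the per-object overhead of each str is what makes the python storage the heaviest option. The sketch also hints at why the compact layout is hard to mutate in place: changing the length of one element shifts every later offset.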
Feature Description
I would like to have two ways to discover the storage:
- __repr__: since its goal is to give information on the internals of an object, one option suggested by @jorisvandenbossche is to display <pandas.StringDtype(storage=...)> instead of string[storage]
- .get_storage: a method that returns the storage (not sure what is possible with the current implementation; it would be best to return a class, otherwise a string). This API is useful to check that we have the correct storage before running time-consuming code.
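As a minimal sketch of what the second option could look like from user code, the helper below reads the `storage` attribute that pandas.StringDtype instances already carry. Both `get_storage` and `require_storage` are hypothetical names (neither exists in pandas), and the example uses a stand-in object instead of a real Series so that it runs without pandas installed.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_storage(obj: Any) -> Optional[str]:
    """Return the string storage of a dtype, or of an object exposing .dtype.

    Hypothetical helper: it only relies on a `storage` attribute being
    present, and returns None for anything that does not have one.
    """
    dtype = getattr(obj, "dtype", obj)
    storage = getattr(dtype, "storage", None)
    return storage if isinstance(storage, str) else None


def require_storage(obj: Any, expected: str) -> None:
    """Fail fast before a long computation if the storage is not the expected one."""
    actual = get_storage(obj)
    if actual != expected:
        raise TypeError(f"expected {expected!r} string storage, got {actual!r}")


# Stand-in for a Series whose dtype is pandas.StringDtype(storage="python").
fake_series = SimpleNamespace(dtype=SimpleNamespace(storage="python"))

print(get_storage(fake_series))
require_storage(fake_series, "python")  # passes silently
```

With pandas installed, the same helper should accept a real Series or a StringDtype directly, since both expose the attributes used above. Returning a plain string keeps the check trivial; returning a class, as suggested above, would need support in pandas itself.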
Alternative Solutions
.
Additional Context
No response
Labels
- Requires discussion from core team before further action
- String extension data type and string data