BUG(string dtype): Arithmetic operations between Series with string dtype index #61425

Open
@rhshadrach

Description

Similar to #61099, but concerning lhs + rhs. Alignment in general is heavily involved here as well. One thing to note is that, unlike in comparison operations, arithmetic operations favor the lhs.index dtype, assuming no coercion is necessary.

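import numpy as np
import pandas as pd
import pyarrow as pa

# Candidate index dtypes; each pair can be compared by changing the positions used below.
dtypes = [
    np.dtype(object),
    pd.StringDtype("pyarrow", na_value=np.nan),
    pd.StringDtype("python", na_value=np.nan),
    pd.StringDtype("pyarrow", na_value=pd.NA),
    pd.StringDtype("python", na_value=pd.NA),
    pd.ArrowDtype(pa.string()),
]
# Positions 1 and 3 pair PyArrow storage with NaN against PyArrow storage with NA.
idx1 = pd.Series(["a", np.nan, "b"], dtype=dtypes[1])
idx2 = pd.Series(["a", np.nan, "b"], dtype=dtypes[3])
df1 = pd.DataFrame({"idx": idx1, "value": [1, 2, 3]}).set_index("idx")
df2 = pd.DataFrame({"idx": idx2, "value": [1, 2, 3]}).set_index("idx")
# Run the operation in both orders to see the effect of lhs vs rhs.
print(df1["value"] + df2["value"])
print(df2["value"] + df1["value"])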

Concerning string dtypes, I've observed the following:

  • NaN vs NA generally aligns; the value propagated is always NA
  • NaN vs NA does not align when the NA arises from ArrowExtensionArray
  • NaN vs None (object) aligns; the value propagated is from lhs
  • NA vs None does not align
  • PyArrow-NA + ArrowExtensionArray results in object dtype (NAs do align)
  • Python-NA + PyArrow-NA results in PyArrow-NA, contrary to the left being preferred
  • Python-NA + PyArrow-NA results in object dtype (NAs do align)
  • When lhs and rhs have indices that are both object dtype:
    • NaN vs None aligns and propagates the lhs value
    • NA vs None does not align
    • NA vs NaN does not align

I think the two main things we need to decide are:

  1. How should NA, NaN, and None align with one another?
  2. When they do align, which value should be propagated?

A few properties I think are crucial:

  • Alignment should depend only on the value and on the left-vs-right operand, not on storage.
  • Alignment should be transitive.
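
To make the transitivity property concrete, here is a minimal sketch that checks the observations listed above for violations. It is not pandas code: the labels "None", "NaN", and "NA" are plain strings, the tables simply transcribe the bullets (using the "generally aligns" case for NaN vs NA and ignoring the ArrowExtensionArray exception), and treating alignment as symmetric is an assumption of this sketch.

```python
from itertools import permutations

# Pairwise observations transcribed from the list above.
# True means the two missing-value labels were observed to align.
# Assumption of this sketch: alignment is symmetric, so pairs are unordered.
STRING_DTYPE_OBSERVATIONS = {
    frozenset({"NaN", "NA"}): True,     # "NaN vs NA generally aligns"
    frozenset({"NaN", "None"}): True,   # "NaN vs None (object) aligns"
    frozenset({"NA", "None"}): False,   # "NA vs None does not align"
}
OBJECT_DTYPE_OBSERVATIONS = {
    frozenset({"NaN", "None"}): True,
    frozenset({"NA", "None"}): False,
    frozenset({"NA", "NaN"}): False,
}


def aligns(table, a, b):
    """Return whether labels a and b align according to the given table."""
    if a == b:
        return True
    return table[frozenset({a, b})]


def transitivity_violations(table, labels=("None", "NaN", "NA")):
    """List triples (a, b, c) where a~b and b~c hold but a~c does not."""
    violations = []
    for a, b, c in permutations(labels, 3):
        if aligns(table, a, b) and aligns(table, b, c) and not aligns(table, a, c):
            violations.append((a, b, c))
    return violations


string_violations = transitivity_violations(STRING_DTYPE_OBSERVATIONS)
object_violations = transitivity_violations(OBJECT_DTYPE_OBSERVATIONS)
print(string_violations)
print(object_violations)
```

Under these assumptions, the string-dtype observations are not transitive (None aligns with NaN, and NaN with NA, but None does not align with NA), while the object-dtype observations are.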

If we do decide on aligning between different values, a natural order is None < NaN < NA. However, the most backwards-compatible option would be to make None vs NaN operand-dependent, with NA always propagating when present.
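
The two propagation options can be compared side by side in a small sketch. Everything here is hypothetical and independent of pandas: `NA` is a stand-in sentinel rather than `pd.NA`, the function names are made up for illustration, and reading "None < NaN < NA" as "the greater value is the one propagated" is my interpretation of the ordering.

```python
import math


class _NAType:
    """Hypothetical stand-in for pd.NA so the sketch needs no third-party imports."""

    def __repr__(self):
        return "NA"


NA = _NAType()


def missing_rank(value):
    """Rank a missing-value sentinel under the order None < NaN < NA."""
    if value is None:
        return 0
    if isinstance(value, float) and math.isnan(value):
        return 1
    if value is NA:
        return 2
    raise ValueError(f"not a missing-value sentinel: {value!r}")


def propagate_ordered(lhs, rhs):
    """Option 1: propagate whichever sentinel ranks higher, regardless of operand side."""
    return max(lhs, rhs, key=missing_rank)


def propagate_backwards_compatible(lhs, rhs):
    """Option 2: NA always wins when present; otherwise the lhs value is kept."""
    # Validate both inputs so non-sentinel values are rejected in either option.
    missing_rank(lhs)
    missing_rank(rhs)
    if lhs is NA or rhs is NA:
        return NA
    return lhs


nan = float("nan")
print(propagate_ordered(None, nan), propagate_ordered(nan, None))
print(propagate_backwards_compatible(None, nan), propagate_backwards_compatible(nan, None))
```

The only difference between the two options in this sketch is the None vs NaN case: the ordered rule returns NaN for either operand order, whereas the backwards-compatible rule returns whichever value is on the left.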

Metadata

Labels

API - Consistency, Bug, Needs Discussion, Strings
