Description
Is your feature request related to a problem?
As described in #26766, it would be good to have type annotations for Index.
Describe the solution you'd like
I would like the type of the publicly-exposed sub-objects to be part of the Index type. For example, these two Index instances contain Timestamp objects from the user's perspective, regardless of the internal implementation:
>>> import pandas as pd
>>> import numpy as np
>>> import datetime
>>> pd.Index([np.datetime64(100, "s")])[0]
Timestamp('1970-01-01 00:01:40')
>>> pd.Index([datetime.datetime(2020, 9, 28)])[0]
Timestamp('2020-09-28 00:00:00')
Because Index returns different subclasses of itself, getting the type checker to acknowledge that correctly is tricky. What's more, you'll notice that a naive "Index[S] because it was created from a List[S]" approach won't work: Index([np.datetime64(100, "s")]) contains Timestamp instances, at least as far as the user is concerned, and Timestamp is very much not an np.datetime64.
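To make the mismatch concrete, here is a minimal sketch of the naive approach, using only the standard library. `NaiveIndex` is a hypothetical class and `datetime64` is a stand-in for `np.datetime64`; neither is pandas code. The conversion inside `first` only imitates what the user observes with the real Index.

```python
from datetime import datetime, timezone
from typing import Generic, List, TypeVar

S = TypeVar("S")


class datetime64(int):
    """Hypothetical stand-in for np.datetime64 (seconds since the epoch)."""


class NaiveIndex(Generic[S]):
    """Hypothetical index whose element type is copied from the input list."""

    def __init__(self, values: List[S]) -> None:
        self.values = values

    def first(self) -> S:
        value = self.values[0]
        if isinstance(value, datetime64):
            # Imitates Index handing back a converted object on access.
            # A type checker is expected to reject this return, because
            # datetime is not S; the ignore comment silences that here.
            return datetime.fromtimestamp(value, tz=timezone.utc)  # type: ignore[return-value]
        return value


idx = NaiveIndex([datetime64(100)])  # inferred as NaiveIndex[datetime64]
element = idx.first()                # annotated as datetime64
print(type(element).__name__)        # but a datetime at runtime
```

The annotation and the runtime value disagree, which is why the element type cannot simply be copied from the constructor argument.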
Here is the only solution I've come up with that works. See also python/mypy#9482: there is no way at the moment to have this work without breaking up Index into a parent class that does __new__ and a subclass that does all the work.
The basic idea is that you have a protocol, IndexType. This is a sketch, because demonstrating this with real code would be that much harder:
from typing import TypeVar, Generic, List, Union, overload
from typing_extensions import Protocol
from datetime import datetime

T = TypeVar("T", covariant=True)  # need to look into why covariant is required, might not be, not fundamental
S = TypeVar("S")

class datetime64(int):
    """Stand-in for np.datetime64."""

class IndexType(Protocol[T]):
    def first(self) -> T: ...

class Index:
    @overload
    def __new__(cls, values: List[datetime64]) -> "Datetime64Index": ...
    @overload
    def __new__(cls, values: List[datetime]) -> "Datetime64Index": ...
    @overload
    def __new__(cls, values: List[S]) -> "DefaultIndex[S]": ...
    def __new__(cls, values):
        if type(values[0]) in (datetime, datetime64):
            cls = Datetime64Index
        else:
            cls = DefaultIndex
        return object.__new__(cls)

class DefaultIndex(Index, Generic[S]):
    def __init__(self, values: List[S]):
        self.values = values
    def first(self) -> S:
        return self.values[0]

class Datetime64Index(DefaultIndex):
    def __init__(self, values: Union[List[datetime], List[datetime64]]):
        self.values: List[datetime64] = [
            datetime64(o.timestamp()) if isinstance(o, datetime) else o
            for o in values
        ]
    def first(self) -> datetime:
        return datetime.fromtimestamp(self.values[0])

# Should work
a: IndexType[datetime] = Index([datetime64(100)])
b: IndexType[datetime] = Index([datetime(2000, 10, 20)])
c: IndexType[bool] = Index([True])

# Should complain
d: IndexType[datetime] = Index(["a"])
e: IndexType[bool] = Index(["a"])
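The runtime half of the sketch relies on a parent `__new__` returning an instance of a subclass, after which Python still runs that subclass's `__init__`. The stripped-down example below shows just that mechanism. `Parent`, `PlainChild` and `TextChild` are hypothetical names, and the `cls is Parent` guard is an addition that is not in the sketch above; it is one possible way to keep explicit subclass construction working.

```python
class Parent:
    """Hypothetical parent class whose only job is to pick a subclass."""

    def __new__(cls, values):
        # Dispatch only on direct Parent(...) calls, so that constructing
        # a subclass explicitly still yields that subclass.
        if cls is Parent:
            cls = TextChild if isinstance(values[0], str) else PlainChild
        return object.__new__(cls)


class PlainChild(Parent):
    def __init__(self, values):
        self.values = list(values)


class TextChild(PlainChild):
    def __init__(self, values):
        super().__init__(values)
        self.is_text = True


made = Parent(["a", "b"])
# The returned object is a Parent instance, so its own __init__ ran.
print(type(made).__name__, made.values)
```

The type-level half (the overloads on `__new__`) is what tells the checker which subclass comes back; this block only illustrates the runtime behaviour.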
API breaking implications
Hopefully nothing.
Describe alternatives you've considered
I tried lots and lots of other ways of structuring this. None of them worked, except this variant.
Additional context
Part of my motivation here is to use type checking so that users can check whether switching from Pandas to Pandas-alikes such as Modin or Dask works, by having everyone use matching type annotations.
As such, just saying "this API accepts an Index" is not good enough: some Pandas APIs have special cases for, e.g., Index[bool], so you really do need some way of indicating the Index type for annotations to be sufficiently helpful.
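As an illustration of why the element type matters in a signature, here is a small sketch of a function that only accepts boolean indexes. It assumes Python 3.8+ so that `Protocol` can come from `typing`; `BoolIndex`, `StrIndex` and `first_is_set` are hypothetical names, not part of Pandas, Modin or Dask.

```python
from typing import List, Protocol, TypeVar

T = TypeVar("T", covariant=True)


class IndexType(Protocol[T]):
    def first(self) -> T: ...


class BoolIndex:
    """Hypothetical stand-in for any library's boolean index."""

    def __init__(self, values: List[bool]) -> None:
        self.values = values

    def first(self) -> bool:
        return self.values[0]


class StrIndex:
    """Hypothetical stand-in for a string index."""

    def __init__(self, values: List[str]) -> None:
        self.values = values

    def first(self) -> str:
        return self.values[0]


def first_is_set(mask: IndexType[bool]) -> bool:
    """Hypothetical API that only makes sense for a boolean index."""
    return mask.first()


print(first_is_set(BoolIndex([True, False])))
# first_is_set(StrIndex(["a"]))  # a type checker should reject this call
```

Because the protocol is structural, any library whose index exposes a matching `first` would satisfy `IndexType[bool]` without inheriting from a shared base class.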
What I'd like
Some feedback on whether this approach is something you all would be OK with. If so, I can try to implement it for the real Index classes.