BUG: fix datetime64/timedelta64 hash and match Python #14622
Conversation
(Force-pushed from b8a0ff6 to f34afac, then from f34afac to db94d21.)
Can you add a test for this comment?
Sorry for the conflict, should be straightforward to resolve.
numpy/core/src/multiarray/datetime.c (Outdated)
 */
NPY_NO_EXPORT npy_int64
datetime_hash(PyArray_DatetimeMetaData *meta, npy_datetime dt)
{
Does this handle NaT?
(Force-pushed from db94d21 to 5e88e5a.)
That requirement for hash(datetime.datetime) == hash(np.datetime64) meant lots of changes.
Indeed it does. It might be worth discussing in that issue whether we care. I suppose we could merge this without resolving that, but I don't think we can consider this completely fixed until we address that too.
Oh, I see you already made those changes - thanks!
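The invariant behind that requirement comes from Python's data model: objects that compare equal must hash equal. A minimal sketch (the second check only holds once this fix is in place; values are illustrative):

```python
import datetime
import numpy as np

py_dt = datetime.datetime(2000, 1, 2, 3, 4, 5)
np_dt = np.datetime64("2000-01-02T03:04:05")

# The two objects compare equal across the type boundary...
print(np_dt == py_dt)
# ...so the data model requires their hashes to match too,
# which is exactly what this PR fixes:
print(hash(np_dt) == hash(py_dt))
```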
This patch is looking pretty good, thanks! Only small comments.
Some of the CI failures look real. @mattip, any idea what's up with PyPy?
(Force-pushed from 8f24882 to 16fdbcb.)
numpy/core/src/multiarray/datetime.c (Outdated)
if (meta->base == NPY_FR_GENERIC) {
    /* XXX generic compares equal to *every* other base, so no single hash works. */
    return (td == -1) ? -2 : td; /* avoid -1 (error) */
This is the path I am suggesting should raise.
Ah. Yes. I agree, but that breaks another test, "test_correct_hash_dict" (in test_regression.py). So I left it as an XXX comment for now.
I think that test is just not written well enough. It should be comparing hash(x) and x.__hash__(), which it had a close enough approximation to until this change.
Re: CI failures, it seems to be some funny business with PyPy. In PyPy: np.timedelta64 == datetime.timedelta, but datetime.timedelta != np.timedelta64. Very strange. This doesn't affect np.datetime64 in PyPy.
PyPy uses the pure-Python implementation of datetime, but it might have some issues. Let me take a look.
The timedelta.__eq__ issue was fixed for CPython 3.7, but not backported. PyPy is backporting that now.
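The asymmetry reported above is the classic symptom of an __eq__ that returns a hard False instead of NotImplemented. A hypothetical illustration (not the actual PyPy or numpy code): Python only falls back to the other operand's reflected comparison when NotImplemented is returned.

```python
class Broad:
    """Claims equality with everything."""
    def __eq__(self, other):
        return True

class Strict:
    """Only equal to itself; returns a hard False, never NotImplemented,
    so Python never tries the other operand's __eq__ as a fallback."""
    def __eq__(self, other):
        return other is self

b, s = Broad(), Strict()
print(b == s)  # Broad.__eq__ answers directly
print(s == b)  # Strict.__eq__ answers False; Broad never gets consulted
```

Returning NotImplemented from Strict.__eq__ for foreign types would restore symmetry, which is essentially what the CPython 3.7 timedelta fix did.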
(Force-pushed from 16fdbcb to 6317593.)
@mattip Cool. Thanks!
Sorry for this late reply -- I've been trying to find some time to make one more change.
@walshb ping.
(Force-pushed from c5414ae to 7387c2d.)
Hi. I've taken another look at the hash function and I'm much happier with the implementation now. Can you take another look, please? Oh, and sorry it's taken so long!
I had started a new thing once, because I wasn't sure I was happy with the way of hashing, but besides some smaller things, I think this is great.
And unlike my PR, it also attacks timedelta reasonably by refactoring the conversion which is nice.
I would prefer fixing those nits, and frankly also using the tuple-hash update, because I simply don't know whether the current code here has half-decent mixing...
But I don't care enough to stall the PR on that; some things like NaT and generic handling should be fixed, though (generic handling looks like a regression).
return x;
PyArray_Descr *dtype = PyArray_DescrFromScalar(obj);
PyArray_DatetimeMetaData *meta = get_datetime_metadata_from_dtype(dtype);
return @lname@_hash(meta, PyArrayScalar_VAL(obj, @name@));
The check for NaT in this function is fair, but the result value is incorrect. We must return the object identity for NaT values.
But I suppose one can argue that is yet a different bug...
Why must we return the object identity for NaT values? I must be missing something.
It's a hash-collision safety thing: same as Python float("nan") values.
Ah right, since NaNs don't compare "normally" (float('nan') != float('nan')) we can always return a different hash for each NaN. Interestingly, hash(float('inf')) == 314159, since comparisons work "normally" for infinity.
I'm pretty sure that historically numpy NaTs compared the "normal" way, unlike NaNs. But now they compare the "NaN way" (maybe commit c8fc0e8?).
I'll go make it return the object identity...
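The CPython behavior referenced above can be checked directly; the infinity hash is a fixed constant exposed via sys.hash_info:

```python
import sys

# NaNs never compare equal, even to themselves, so the hash/eq
# consistency rule places no constraint on their hashes:
print(float('nan') == float('nan'))   # False

# Infinity compares "normally", so its hash is a fixed constant:
print(hash(float('inf')))             # 314159 in CPython
print(sys.hash_info.inf)              # the same constant, by definition
```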
 */
if (td == NPY_DATETIME_NAT) {
    Py_RETURN_NONE;
if (dt == NPY_DATETIME_NAT || meta->base == NPY_FR_GENERIC) {
This looks wrong. For generic unit you need to return the value as if an int64.
I'm happy to do that, but is it even possible to construct a datetime64 with base == generic?
Hmmm, good point, maybe a comment? The timedelta error is also interesting, but I am happy to try it. Users shouldn't explicitly create them (if that is even easy), but rather implicitly/transiently when doing math with integers.
(And even there, we may stop using that path eventually.)
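For reference, a generic (unitless) timedelta is easy to create, both explicitly and through integer math; a quick sketch:

```python
import numpy as np

# Constructing with no unit yields the generic timedelta dtype:
td = np.timedelta64(5)
print(td.dtype)        # generic timedelta64, no unit attached

# Math with plain integers also stays in the generic unit:
td2 = np.timedelta64(5) * 2
print(td2 == np.timedelta64(10))
```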
I went with your suggestion anyway :-)
  * HASH_MULT + dts.hour)
  * HASH_MULT + dts.min)
  * HASH_MULT + dts.sec)
  * HASH_MULT + dts.us;
FWIW, I would much prefer to use the tuple update pattern as here: https://github.com/numpy/numpy/pull/24408/files#diff-c32b1134e6b616b5aca0e41218075be0ae2cb509a03e46b1716e1988a93fafd8R643-R656
I don't trust HASH_MULT to do a half-decent job, TBH. The code also doesn't take into account that the size of the hash could be 32-bit.
Haha, yep, actually in an earlier version of this patch I was converting the datetimestruct into a PyBytes and using Python's own bytes hash on it (see here).
[edit: the reason I went back to the "HASH_MULT = 1000003" version is that the rest of numpy seems more relaxed about hashing]
The version using PyBytes has less code than the "tuple update pattern". I'll resurrect the PyBytes version and push it...
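A rough Python-level sketch of the PyBytes idea (the function name and field layout here are illustrative, not the actual C code): pack the broken-down datetime fields into a fixed-width byte string and let Python's own bytes hash do the mixing.

```python
import struct

def datetimestruct_hash(year, month, day, hour, minute, sec, us):
    """Illustrative sketch: pack the fields as seven little-endian
    int64s and delegate mixing to Python's bytes hash (SipHash in
    CPython), instead of hand-rolling a multiply-add chain."""
    packed = struct.pack("<7q", year, month, day, hour, minute, sec, us)
    return hash(packed)

h1 = datetimestruct_hash(2000, 1, 2, 3, 4, 5, 0)
h2 = datetimestruct_hash(2000, 1, 2, 3, 4, 5, 0)
print(h1 == h2)  # deterministic within one process
```

Note that CPython's bytes hash is randomized per process (PYTHONHASHSEED), which is fine for dict use but means hashes are not stable across runs.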
I guess we usually go via datetime anyway, so going via a bytes object shouldn't hurt. The tuple update pattern is nice, but indeed a bit annoying unless we use it in more places.
There is a balance towards just fixing the problem at hand, and making sure the code converges to being proper. I like to do at least a bit of the second, because chances are nobody will look at it again in years.
(Force-pushed from fa643d6 to e000644.)
Build failing due to this.
Thanks, this looks nice to me! I don't want to nitpick unnecessarily; the small comments are really take-or-leave.
One thing that would be nice is if you could add a release note for it. I think it is a worthy improvement. It will also be a place where we can note the ValueError caveat for unitless/generic timedeltas (although, dunno if they should matter).
Ping @jbrockmendel for awareness. I don't think the choice you had started in pandas was important, right? (since this just uses a bytes detour to do the work, which I suspect is OK because we usually need the datetime path anyway)
if (dts.year < 1 || dts.year > 9999
    || dts.ps != 0 || dts.as != 0) {
/* NpyDatetime_ConvertDatetime64ToDatetimeStruct does memset,
 * so this is safe from loose struct packing. */
Thanks for pointing that out :).
numpy/_core/tests/test_datetime.py (Outdated)
@@ -2552,6 +2558,76 @@ def test_limit_str_roundtrip(self, time_unit, sign):
    limit_via_str = np.datetime64(str(limit), time_unit)
    assert limit_via_str == limit

def test_datetime_hash(self):
    assert hash(np.datetime64()) != -1  # NaT
Would be nice to test something like:
assert hash(np.datetime64()) != hash(np.datetime64())
Yep. Will do!
if ((x <= LONG_MAX)) {
    y = (npy_hash_t) x;
if (val == NPY_DATETIME_NAT) {
    return (*PyBaseObject_Type.tp_hash)(obj);
Suggested change:
-    return (*PyBaseObject_Type.tp_hash)(obj);
+    /* Use identity similar to NaN */
+    return PyBaseObject_Type.tp_hash(obj);
I don't think you need to dereference the function pointer?
That's right. It's a style issue. I am trying to fit in with the "numpy/cpython style". See references to bf_getbuffer and bf_releasebuffer in scalartypes.c.src, or references to "tp_hash" in the CPython source. They all use (*funcptr)(...).
For what it's worth, in my own personal projects I don't do this; I do what you suggest. :-)
Hmmm, very sure most of NumPy doesn't dereference the function pointer...
OK, yeah you are right, there are lots of places with the no-dereference style in numpy.
I'll change it. Not a big deal.
I’ll take a closer look tomorrow, no red flags on a first look.
(Force-pushed from e000644 to 6f4ec86.)
numpy/_core/tests/test_datetime.py (Outdated)
@@ -2552,6 +2558,80 @@ def test_limit_str_roundtrip(self, time_unit, sign):
    limit_via_str = np.datetime64(str(limit), time_unit)
    assert limit_via_str == limit

def test_datetime_hash(self):
Not sure if numpy does things differently from pandas, but I'd suggest splitting this into targeted tests with comments about which invariant is being tested in each place.
The numpy tests tend to be quite coarse-grained.
Old ones, we/I prefer fine grained for various reasons. But I don't care enough here to nitpick about it this time around.
I made the tests more fine-grained.
numpy/_core/tests/test_datetime.py (Outdated)
dt2 = np.datetime64(dt, unit)
_assert_equal_hash(dt, dt2)

dt = np.datetime64(3, 'ms')
Any reason not to start with "s" here? In pandas we rely on the invariant working for ['s', 'ms', 'us', 'ns'].
Only because we can't convert from seconds to attoseconds without integer overflow.
I split this test into two ranges: 1. minutes ... femtoseconds (which should keep Pandas happy) and 2. nanoseconds ... attoseconds.
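The overflow constraint is easy to see from the int64 range: one second is 10^18 attoseconds, and int64 tops out near 9.22 * 10^18, so only about nine seconds are representable at 'as' resolution. A quick check (the numbers come from the int64 range, not from the patch itself):

```python
import numpy as np

ATTOSECONDS_PER_SECOND = 10 ** 18
int64_max = np.iinfo(np.int64).max

print(int64_max)                            # 9223372036854775807
# Only ~9 whole seconds fit in an attosecond-resolution int64,
# which is why one test range cannot span seconds through attoseconds:
print(int64_max // ATTOSECONDS_PER_SECOND)  # 9
```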
(Force-pushed from 6f4ec86 to f2eff67, then from f2eff67 to 7ec91ca.)
Sorry, forgot about this a bit! I don't think there was anything left really and scrolling through it still looks good. Thanks for helping to finally make this sane, let's give it a try! (And also especially thanks for the many smaller followups.)
This just adds a release bullet, seemed probably good to mention the change in hashing.
DOC: Add a release fragment for gh-14622 (#3836)