BUG: fix datetime64/timedelta64 hash and match Python #14622
Conversation
(Force-pushed from b8a0ff6 to f34afac, then from f34afac to db94d21.)
Can you add a test for this comment?
Sorry for the conflict, should be straightforward to resolve.
numpy/core/src/multiarray/datetime.c (Outdated)
 */
NPY_NO_EXPORT npy_int64
datetime_hash(PyArray_DatetimeMetaData *meta, npy_datetime dt)
{
Does this handle NaT?
(Force-pushed from db94d21 to 5e88e5a.)
That requirement for hash(datetime.datetime) == hash(np.datetime64) meant lots of changes.
Indeed it does. It might be worth discussing in that issue whether we care. I suppose we could merge this without resolving that, but I don't think we can consider this completely fixed until we address that too.
Oh, I see you already made those changes - thanks!
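The invariant behind that requirement comes from Python's data model: objects that compare equal must hash equal. A minimal sketch (the second check only holds once this fix is in place; values are illustrative):

```python
import datetime
import numpy as np

py_dt = datetime.datetime(2000, 1, 2, 3, 4, 5)
np_dt = np.datetime64("2000-01-02T03:04:05")

# The two objects compare equal across the type boundary...
print(np_dt == py_dt)
# ...so the data model requires their hashes to match too,
# which is exactly what this PR fixes:
print(hash(np_dt) == hash(py_dt))
```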
This patch is looking pretty good, thanks! Only small comments.
Some of the CI failures look real. @mattip, any idea what's up with PyPy?
(Force-pushed from 8f24882 to 16fdbcb.)
numpy/core/src/multiarray/datetime.c (Outdated)
if (meta->base == NPY_FR_GENERIC) {
    /* XXX generic compares equal to *every* other base, so no single hash works. */
    return (td == -1) ? -2 : td; /* avoid -1 (error) */
This is the path I am suggesting should raise.
Ah. Yes. I agree, but that breaks another test, "test_correct_hash_dict" (in test_regression.py). So I left it as an XXX comment for now.
I think that test is just not written well enough. It should be comparing hash(x) and x.__hash__(), which it had a close enough approximation to until this change.
Re: CI failures, it seems to be some funny business with PyPy. In PyPy: np.timedelta64 == datetime.timedelta, but datetime.timedelta != np.timedelta64. Very strange. This doesn't affect np.datetime64 in PyPy.
PyPy uses the pure-Python implementation of datetime, but it might have some issues. Let me take a look.
The timedelta.__eq__ issue was fixed for CPython 3.7, but not backported. PyPy is backporting that now.
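The asymmetry reported above is the classic symptom of an __eq__ that returns a hard False instead of NotImplemented. A hypothetical illustration (not the actual PyPy or numpy code): Python only falls back to the other operand's reflected comparison when NotImplemented is returned.

```python
class Broad:
    """Claims equality with everything."""
    def __eq__(self, other):
        return True

class Strict:
    """Only equal to itself; returns a hard False, never NotImplemented,
    so Python never tries the other operand's __eq__ as a fallback."""
    def __eq__(self, other):
        return other is self

b, s = Broad(), Strict()
print(b == s)  # Broad.__eq__ answers directly
print(s == b)  # Strict.__eq__ answers False; Broad never gets consulted
```

Returning NotImplemented from Strict.__eq__ for foreign types would restore symmetry, which is essentially what the CPython 3.7 timedelta fix did.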
(Force-pushed from 16fdbcb to 6317593.)
@mattip Cool. Thanks!
Sorry for this late reply -- I've been trying to find some time to make one more change.
@walshb ping.
(Force-pushed from c5414ae to 7387c2d.)
Hi. I've taken another look at the hash function and I'm much happier with the implementation now. Can you take another look, please? Oh, and sorry it's taken so long!
I had started a new thing once, because I wasn't sure I was happy with the way of hashing, but besides some smaller things, I think this is great.
And unlike my PR, it also attacks timedelta reasonably by refactoring the conversion which is nice.
I would prefer fixing those nits, and frankly also using the tuple-hash update, because I simply don't know whether the current code here has half-decent mixing...
But I don't care enough to stall the PR on that; some things like NaT and generic handling should be fixed, though (generic handling looks like a regression).
return x;
PyArray_Descr *dtype = PyArray_DescrFromScalar(obj);
PyArray_DatetimeMetaData *meta = get_datetime_metadata_from_dtype(dtype);
return @lname@_hash(meta, PyArrayScalar_VAL(obj, @name@));
The check for NaT in this function is fair, but the result value is incorrect. We must return the object identity for NaT values.
But I suppose one can argue that is yet a different bug...
Why must we return the object identity for NaT values? I must be missing something.
It's a hash-collision safety thing: same as Python float("nan") values.
Ah right, since NaNs don't compare "normally" (float('nan') != float('nan')) we can always return a different hash for each NaN. Interestingly, hash(float('inf')) == 314159, since comparisons work "normally" for infinity.
I'm pretty sure that historically numpy NaTs compared the "normal" way, unlike NaNs. But now they compare the "NaN way" (maybe commit c8fc0e8?).
I'll go make it return the object identity...
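The CPython behavior referenced above can be checked directly; the infinity hash is a fixed constant exposed via sys.hash_info:

```python
import sys

# NaNs never compare equal, even to themselves, so the hash/eq
# consistency rule places no constraint on their hashes:
print(float('nan') == float('nan'))   # False

# Infinity compares "normally", so its hash is a fixed constant:
print(hash(float('inf')))             # 314159 in CPython
print(sys.hash_info.inf)              # the same constant, by definition
```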
 */
if (td == NPY_DATETIME_NAT) {
    Py_RETURN_NONE;
if (dt == NPY_DATETIME_NAT || meta->base == NPY_FR_GENERIC) {
This looks wrong. For generic unit you need to return the value as if an int64.
I'm happy to do that, but is it even possible to construct a datetime64 with base == generic?
Hmmm, good point, maybe a comment? The timedelta error is also interesting, but I am happy to try it. Users shouldn't explicitly create them (if that is even easy), but rather implicitly/transiently when doing math with integers.
(And even there, we may stop using that path eventually.)
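For reference, a generic (unitless) timedelta is easy to create, both explicitly and through integer math; a quick sketch:

```python
import numpy as np

# Constructing with no unit yields the generic timedelta dtype:
td = np.timedelta64(5)
print(td.dtype)        # generic timedelta64, no unit attached

# Math with plain integers also stays in the generic unit:
td2 = np.timedelta64(5) * 2
print(td2 == np.timedelta64(10))
```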
I went with your suggestion anyway :-)
  * HASH_MULT + dts.hour)
  * HASH_MULT + dts.min)
  * HASH_MULT + dts.sec)
  * HASH_MULT + dts.us;
FWIW, I would much prefer to use the tuple update pattern as here: https://github.com/numpy/numpy/pull/24408/files#diff-c32b1134e6b616b5aca0e41218075be0ae2cb509a03e46b1716e1988a93fafd8R643-R656
I don't trust HASH_MULT to do a half-decent job, TBH. The code also doesn't take into account that the size of the hash could be 32-bit.
Haha, yep, actually in an earlier version of this patch I was converting the datetimestruct into a PyBytes and using Python's own bytes hash on it (see here).
[edit: the reason I went back to the "HASH_MULT = 1000003" version is that the rest of numpy seems more relaxed about hashing]
The version using PyBytes has less code than the "tuple update pattern". I'll resurrect the PyBytes version and push it...
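A rough Python-level sketch of the PyBytes idea (the function name and field layout here are illustrative, not the actual C code): pack the broken-down datetime fields into a fixed-width byte string and let Python's own bytes hash do the mixing.

```python
import struct

def datetimestruct_hash(year, month, day, hour, minute, sec, us):
    """Illustrative sketch: pack the fields as seven little-endian
    int64s and delegate mixing to Python's bytes hash (SipHash in
    CPython), instead of hand-rolling a multiply-add chain."""
    packed = struct.pack("<7q", year, month, day, hour, minute, sec, us)
    return hash(packed)

h1 = datetimestruct_hash(2000, 1, 2, 3, 4, 5, 0)
h2 = datetimestruct_hash(2000, 1, 2, 3, 4, 5, 0)
print(h1 == h2)  # deterministic within one process
```

Note that CPython's bytes hash is randomized per process (PYTHONHASHSEED), which is fine for dict use but means hashes are not stable across runs.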
I guess we usually go via datetime anyway, so going via a bytes object shouldn't hurt. The tuple update pattern is nice, but indeed a bit annoying unless we use it in more places.
There is a balance towards just fixing the problem at hand, and making sure the code converges to being proper. I like to do at least a bit of the second, because chances are nobody will look at it again in years.
(Force-pushed from fa643d6 to e000644.)
Build failing due to this.
Thanks, this looks nice to me! I don't want to nitpick unnecessarily; the small comments are really take-or-leave.
One thing that would be nice is if you could add a release note for it. I think it is a worthy improvement. It will also be a place where we can note the ValueError caveat for unitless/generic timedeltas (although, dunno if they should matter).
Ping @jbrockmendel for awareness. I don't think the choice you had started in pandas was important, right? (since this just uses a bytes detour to do the work, which I suspect is OK because we usually need the datetime path anyway)
if (dts.year < 1 || dts.year > 9999
    || dts.ps != 0 || dts.as != 0) {
/* NpyDatetime_ConvertDatetime64ToDatetimeStruct does memset,
 * so this is safe from loose struct packing. */
Thanks for pointing that out :).
numpy/_core/tests/test_datetime.py (Outdated)
@@ -2552,6 +2558,76 @@ def test_limit_str_roundtrip(self, time_unit, sign):
    limit_via_str = np.datetime64(str(limit), time_unit)
    assert limit_via_str == limit

def test_datetime_hash(self):
    assert hash(np.datetime64()) != -1  # NaT
Would be nice to test something like:
assert hash(np.datetime64()) != hash(np.datetime64())
Yep. Will do!
if ((x <= LONG_MAX)) {
    y = (npy_hash_t) x;
if (val == NPY_DATETIME_NAT) {
    return (*PyBaseObject_Type.tp_hash)(obj);
Suggested change:
-    return (*PyBaseObject_Type.tp_hash)(obj);
+    /* Use identity similar to NaN */
+    return PyBaseObject_Type.tp_hash(obj);
I don't think you need to dereference the function pointer?
That's right. It's a style issue. I am trying to fit in with the "numpy/cpython style". See references to bf_getbuffer and bf_releasebuffer in scalartypes.c.src, or references to "tp_hash" in the CPython source. They all use (*funcptr)(...).
For what it's worth, in my own personal projects I don't do this; I do what you suggest. :-)
Hmmm, very sure most of NumPy doesn't dereference the function pointer...
OK, yeah you are right, there are lots of places with the no-dereference style in numpy.
I'll change it. Not a big deal.
I’ll take a closer look tomorrow, no red flags on a first look.
(Force-pushed from e000644 to 6f4ec86.)
numpy/_core/tests/test_datetime.py (Outdated)
@@ -2552,6 +2558,80 @@ def test_limit_str_roundtrip(self, time_unit, sign):
    limit_via_str = np.datetime64(str(limit), time_unit)
    assert limit_via_str == limit

def test_datetime_hash(self):
Not sure if numpy does things differently from pandas, but I'd suggest splitting this into targeted tests with comments about which invariant is being tested in each place.
The numpy tests tend to be quite coarse-grained.
Old ones, we/I prefer fine grained for various reasons. But I don't care enough here to nitpick about it this time around.
I made the tests more fine-grained.
numpy/_core/tests/test_datetime.py (Outdated)
dt2 = np.datetime64(dt, unit)
_assert_equal_hash(dt, dt2)

dt = np.datetime64(3, 'ms')
Any reason not to start with "s" here? In pandas we rely on the invariant working for ['s', 'ms', 'us', 'ns'].
Only because we can't convert from seconds to attoseconds without integer overflow.
I split this test into two ranges: 1. minutes ... femtoseconds (which should keep Pandas happy) and 2. nanoseconds ... attoseconds.
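The overflow constraint is easy to see from the int64 range: one second is 10^18 attoseconds, and int64 tops out near 9.22 * 10^18, so only about nine seconds are representable at 'as' resolution. A quick check (the numbers come from the int64 range, not from the patch itself):

```python
import numpy as np

ATTOSECONDS_PER_SECOND = 10 ** 18
int64_max = np.iinfo(np.int64).max

print(int64_max)                            # 9223372036854775807
# Only ~9 whole seconds fit in an attosecond-resolution int64,
# which is why one test range cannot span seconds through attoseconds:
print(int64_max // ATTOSECONDS_PER_SECOND)  # 9
```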
(Force-pushed from 6f4ec86 to f2eff67, then from f2eff67 to 7ec91ca.)
Sorry, forgot about this a bit! I don't think there was anything left really and scrolling through it still looks good. Thanks for helping to finally make this sane, let's give it a try! (And also especially thanks for the many smaller followups.)
This just adds a release bullet, seemed probably good to mention the change in hashing.
DOC: Add a release fragment for gh-14622 (#3836)