Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

BUG: fix datetime64/timedelta64 hash and match Python #14622

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 22, 2024

Conversation

walshb
Copy link
Contributor

@walshb walshb commented Oct 1, 2019

@walshb walshb force-pushed the bw-datetime-hash-3836 branch from b8a0ff6 to f34afac Compare October 6, 2019 17:38
@walshb walshb force-pushed the bw-datetime-hash-3836 branch from f34afac to db94d21 Compare October 7, 2019 18:04
@eric-wieser
Copy link
Member

Can you add a test for this comment?

@eric-wieser
Copy link
Member

Sorry for the conflict, should be straightforward to resolve

*/
NPY_NO_EXPORT npy_int64
datetime_hash(PyArray_DatetimeMetaData *meta, npy_datetime dt)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this handle NaT?

@walshb walshb force-pushed the bw-datetime-hash-3836 branch from db94d21 to 5e88e5a Compare October 19, 2019 19:14
@walshb
Copy link
Contributor Author

walshb commented Oct 19, 2019

That requirement for hash(datetime.datetime) == hash(np.datetime64) meant lots of changes.
EDIT: (for #3836 (comment))

@eric-wieser
Copy link
Member

Indeed it does. It might be worth discussing in that issue whether we care.

I suppose we could merge this without resolving that, but I don't think we can consider this completely fixed until we address that too.

@eric-wieser
Copy link
Member

Oh, I see you already made those changes - thanks!

Copy link
Member

@eric-wieser eric-wieser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch is looking pretty good, thanks! Only small comments.

@eric-wieser
Copy link
Member

eric-wieser commented Oct 20, 2019

Some of the CI failures look real. @mattip, any idea what's up with PyPy?

@walshb walshb force-pushed the bw-datetime-hash-3836 branch 2 times, most recently from 8f24882 to 16fdbcb Compare October 20, 2019 07:04

if (meta->base == NPY_FR_GENERIC) {
/* XXX generic compares equal to *every* other base, so no single hash works. */
return (td == -1) ? -2 : td; /* avoid -1 (error) */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the path I am suggesting should raise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Yes. I agree, but that breaks another test, "test_correct_hash_dict" (in test_regression.py). So I left it as a XXX comment for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that test is just not written well enough - it should be comparing hash(x) and x.__hash__(), which it had a close enough approximation to until this change

@walshb
Copy link
Contributor Author

walshb commented Oct 20, 2019

Re: CL failures, it seems to be some funny business with pypy. In pypy:

np.timedelta64 == datetime.timedelta, but datetime.timedelta != np.timedelta64.

Very strange. This doesn't affect np.datetime64 in pypy.

@mattip
Copy link
Member

mattip commented Oct 20, 2019

PyPy uses the pure-python implementation of datetime, but it might have some issues. Let me take a look

@mattip
Copy link
Member

mattip commented Oct 20, 2019

The timedelta._eq issue was fixed for CPython 3.7, but not backported. PyPy is backporting that now.

@walshb walshb force-pushed the bw-datetime-hash-3836 branch from 16fdbcb to 6317593 Compare October 20, 2019 11:54
@walshb
Copy link
Contributor Author

walshb commented Oct 20, 2019

@mattip Cool. Thanks!

@walshb
Copy link
Contributor Author

walshb commented Jun 9, 2022

@mattip Seems this is ready for final review.

Sorry for this late reply -- I've been trying to find some time to make one more change.
I'm not happy with the way that I hash the npy_datetimestruct via PyBytes_FromStringAndSize in some cases.
Please can I have a few more days to make this nicer?
Thanks.

@InessaPawson InessaPawson added the 52 - Inactive Pending author response label Apr 8, 2024
@mattip
Copy link
Member

mattip commented Sep 4, 2024

@walshb ping.

@walshb walshb force-pushed the bw-datetime-hash-3836 branch from c5414ae to 7387c2d Compare October 26, 2024 13:17
@walshb
Copy link
Contributor Author

walshb commented Oct 26, 2024

Hi. I've taken another look at the hash function and I'm much happier with the implementation now.

Please can you guys take another look? Oh, and sorry it's taken so long!

@mattip
Copy link
Member

mattip commented Oct 27, 2024

LGTM. @seberg, you commented last on the #3836 issue. Any thoughts?

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had started a new thing once, because I wasn't sure I was happy with the way of hashing, but besides some smaller things, I think this is great.
And unlike my PR, it also attacks timedelta reasonably by refactoring the conversion which is nice.

I would prefer fixing those nits, and frankly also to use the tuple hash updating, because I simply don't know that the current code here has half-decent mixing...

But, I don't care enough to stall the PR on that, some things like NaT and generic handling should be fixed, though (generic handling looks like a regression).

return x;
PyArray_Descr *dtype = PyArray_DescrFromScalar(obj);
PyArray_DatetimeMetaData *meta = get_datetime_metadata_from_dtype(dtype);
return @lname@_hash(meta, PyArrayScalar_VAL(obj, @name@));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check for NaT in this function is fair, but the result value is incorrect. We must return the object identity for NaT values.

But I suppose one can argue that is yet a different bug...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why must we return the object identity for NaT values? I must be missing something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a hash collision safety thing: Same as Python float("nan") values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, since NaNs don't compare "normally" ( float('nan') != float('nan') ) we can always return a different hash for each NaN.
Interestingly hash(float('inf')) == 314159, since comparisons work "normally" for infinity.
I'm pretty sure that historically numpy NaTs compared the "normal" way, unlike NaNs.
But now they compare the "NaN way" (maybe commit c8fc0e8 ? )
I'll go make it return the object identity...

*/
if (td == NPY_DATETIME_NAT) {
Py_RETURN_NONE;
if (dt == NPY_DATETIME_NAT || meta->base == NPY_FR_GENERIC) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks wrong. For generic unit you need to return the value as if an int64.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to do that, but is it even possible to construct a datetime64 with base == generic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, good point, maybe a comment? The timedelta error is also interesting, but I am happy to try it. Users shouldn't explicitly create them (if that is even easy), but rather implicitly/transiently when doing math with integers.
(And even there, we may stop using that path eventually.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with your suggestion anyway :-)

numpy/_core/src/multiarray/datetime.c Outdated Show resolved Hide resolved
* HASH_MULT + dts.hour)
* HASH_MULT + dts.min)
* HASH_MULT + dts.sec)
* HASH_MULT + dts.us;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I would much prefer to use the tuple update pattern as here: https://github.com/numpy/numpy/pull/24408/files#diff-c32b1134e6b616b5aca0e41218075be0ae2cb509a03e46b1716e1988a93fafd8R643-R656

I don't trust HAS_MULT to do a half-decent job, TBH. The code also doesn't take into account that that the size of the hash could be 32bit.

Copy link
Contributor Author

@walshb walshb Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, yep, actually in an earlier version of this patch I was converting the datetimestruct into a PyBytes, and using python's own bytes_hash on it. (see here ).

[edit: the reason I went back to the "HASH_MULT = 1000003" version is because the rest of numpy seems more relaxed about hashing]

The version using PyBytes has less code than the "tuple update pattern". I'll resurrect the PyBytes version and push it...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we usually go via datetime anyway, so going via a bytes object shouldn't hurt. The tuple update pattern is nice, but indeed a bit annoying unless we use it in more places.
There is a balance towards just fixing the problem at hand, and making sure the code converges to being proper. I like to do at least a bit of the second, because chances are nobody will look at it again in years.

@walshb walshb force-pushed the bw-datetime-hash-3836 branch 2 times, most recently from fa643d6 to e000644 Compare October 29, 2024 06:59
@walshb
Copy link
Contributor Author

walshb commented Oct 29, 2024

Build failing due to this.

@seberg seberg removed the 52 - Inactive Pending author response label Oct 29, 2024
Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks nice to me! I don't want to nitpick unnecessarily, the small comments are really take-or-leave.

One thing that would be nice if you could add a release note for it. I think it is a worthy improvement. It also will be a place where we can note the ValueError caveat for unitless/generic timedeltas (although, dunno if they should matter).

Ping @jbrockmendel for awareness. I don't think the choice you had started in pandas was important, right? (since this just uses a bytes detour to do the work, which I suspect is OK because we usually need the datetime path anyway)

if (dts.year < 1 || dts.year > 9999
|| dts.ps != 0 || dts.as != 0) {
/* NpyDatetime_ConvertDatetime64ToDatetimeStruct does memset,
* so this is safe from loose struct packing. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out :).

@@ -2552,6 +2558,76 @@ def test_limit_str_roundtrip(self, time_unit, sign):
limit_via_str = np.datetime64(str(limit), time_unit)
assert limit_via_str == limit

def test_datetime_hash(self):
assert hash(np.datetime64()) != -1 # NaT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to test something like:

assert hash(np.datetime64()) != hash(np.datetime64()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Will do!

if ((x <= LONG_MAX)) {
y = (npy_hash_t) x;
if (val == NPY_DATETIME_NAT) {
return (*PyBaseObject_Type.tp_hash)(obj);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return (*PyBaseObject_Type.tp_hash)(obj);
/* Use identity similar to NaN */
return PyBaseObject_Type.tp_hash(obj);

I don't think you need to dereference the function pointer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. It's a style issue. I am trying to fit in with the "numpy/cpython style". See references to bf_getbuffer and bf_releasebuffer in scalartypes.c.src. Or references to "tp_hash" in the cpython source. They all use (*funcptr)(...).
For what it's worth, in my own personal projects I don't do this, I do what you suggest. :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, very sure most of NumPy doesn't dereference the function pointer...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, yeah you are right, there are lots of places with the no-dereference style in numpy.
I'll change it. Not a big deal.

@jbrockmendel
Copy link
Contributor

I’ll take a closer look tomorrow, no red flags on a first look.

@walshb walshb force-pushed the bw-datetime-hash-3836 branch from e000644 to 6f4ec86 Compare October 31, 2024 07:36
@@ -2552,6 +2558,80 @@ def test_limit_str_roundtrip(self, time_unit, sign):
limit_via_str = np.datetime64(str(limit), time_unit)
assert limit_via_str == limit

def test_datetime_hash(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if numpy does things differently from pandas, but id suggest splitting this into targeted tests with comments about which invariance is being tested in each place

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The numpy tests tend to be quite coarse-grained.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old ones, we/I prefer fine grained for various reasons. But I don't care enough here to nitpick about it this time around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the tests more fine-grained.

dt2 = np.datetime64(dt, unit)
_assert_equal_hash(dt, dt2)

dt = np.datetime64(3, 'ms')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not to start with "s" here? in pandas we rely on the invariance working for ['s', 'ms', 'us', 'ns']

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only because we can't convert from seconds to attoseconds without integer overflow.
I split this test into two ranges: 1. minutes ... femtoseconds (which should keep Pandas happy) and 2. nanoseconds ... attoseconds.

@walshb walshb force-pushed the bw-datetime-hash-3836 branch from 6f4ec86 to f2eff67 Compare November 1, 2024 07:48
@walshb walshb force-pushed the bw-datetime-hash-3836 branch from f2eff67 to 7ec91ca Compare November 8, 2024 20:17
@seberg
Copy link
Member

seberg commented Nov 22, 2024

Sorry, forgot about this a bit! I don't think there was anything left really and scrolling through it still looks good.

Thanks for helping to finally make this sane, let's give it a try! (And also especially thanks for the many smaller followups.)

@seberg seberg changed the title BUG: datetime64 hash. BUG: make datetime64/timedelta64 hash match Pythons Nov 22, 2024
@seberg seberg merged commit 420438f into numpy:main Nov 22, 2024
69 checks passed
@seberg seberg changed the title BUG: make datetime64/timedelta64 hash match Pythons BUG: fix datetime64/timedelta64 hash and match Python Nov 22, 2024
seberg added a commit to seberg/numpy that referenced this pull request Nov 22, 2024
This just adds a release bullet, seemed probably good to mention
the change in hashing.
charris added a commit that referenced this pull request Nov 22, 2024
DOC: Add a release fragment for gh-14622
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.