ENH: Add property-based tests using Hypothesis #15189


Merged

merged 3 commits into numpy:master from Zac-HD:property-tests on Feb 13, 2020

Conversation

Zac-HD
Contributor

@Zac-HD Zac-HD commented Dec 27, 2019

This property-based testing demo focuses on tricky tests for numeric code - Hypothesis adds most value when the input or configuration space is large, but you can make clear statements about "always" or "never" (even very loose ones!).
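For readers new to the style, a minimal sketch of such a property test might look like the following (an illustration only, not one of the tests added in this PR; the strategy choices and the test name are hypothetical):

import numpy as np
from hypothesis import given, strategies as st
from hypothesis.extra import numpy as npst

finite_floats = st.floats(allow_nan=False, allow_infinity=False, width=64)


# An "always" statement: for finite inputs and ordered finite bounds,
# np.clip never returns a value outside [amin, amax].
@given(
    arr=npst.arrays(np.float64, npst.array_shapes(), elements=finite_floats),
    bounds=st.tuples(finite_floats, finite_floats).map(sorted),
)
def test_clip_stays_within_bounds(arr, bounds):
    amin, amax = bounds
    result = np.clip(arr, amin, amax)
    assert np.all(result >= amin) and np.all(result <= amax)

Hypothesis generates a batch of input arrays and bound pairs on every run, and shrinks any failure to a minimal example.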

@mattip @rgommers, I can spend a few days adding tests to this PR if that would be useful. Perhaps more importantly I'm happy to help out with code review / debugging / reproducing bugs or whatever else turns up as a result of property-based testing.

The PR currently consists of:

I would prefer to get this merged before working on further tests, both to allow others to contribute property tests and to ensure that I'm working on something valuable for Numpy. Proposed follow-up PRs will cover:

@Zac-HD
Contributor Author

Zac-HD commented Dec 27, 2019

(copying over #14440 (comment))

What does Hypothesis do that other tests don't?

Computer-generated test cases tend to expose edge cases that hand-written tests don't. Usually they could have been exposed by hand-written tests, but it would take far more effort to do so without tools.

For Numpy itself, #10930, #13089, and #14239 have all been found by downstream users of Hypothesis!

Working on this pull request, the first test I wrote rediscovered #15127 and highlighted that the bounds to np.clip may not be scalar nans - either a docs issue or a bug.

How slow are Hypothesis tests?

This mostly depends on how many test cases you run - the default is 100 cases but you can choose any other number on a global or per-test basis. Typically generating data takes between 1% and 20% of the runtime for that test function, though it could be higher if we generate large arrays and apply a fast operation.

An important point is that we do not suggest using Hypothesis for everything! Property-based tests tend to be more general but less numerous than traditional unit tests, for a given level of testing. You can easily skip them from certain CI jobs (pytest -m "not hypothesis") if that becomes an issue, or use a setting profile that runs fewer examples in CI.

Performance is rarely a problem in practice though, so I would ultimately suggest waiting to measure before trying to optimize!
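For concreteness, one way to tune example counts per environment is a settings profile in a conftest; this is a sketch, not Numpy's actual configuration, and the profile names, example counts, and environment variable are illustrative:

import os

from hypothesis import settings

# Run more examples in CI, fewer during local development.
settings.register_profile("ci", max_examples=1000, deadline=None)
settings.register_profile("dev", max_examples=10)
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))

(The pytest -m "not hypothesis" trick works because Hypothesis' pytest plugin applies a hypothesis marker to every @given test.)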

Reproducing test failures from CI, users, etc.

We have docs about this 😄. To summarize, the test may behave differently on different Python versions or operating systems (but that is true of non-Hypothesis tests too); short of that, Hypothesis will print the failing input example, and we can use the print_blob=True setting to be sure of reproducing the failure. Changes between Hypothesis versions are uncommon but not unknown, which is why the print_blob setting also dumps the exact Hypothesis version.

I would discourage using the @seed decorator or derandomize=True setting, as the net effect is to reduce the number of bugs found. We (Hypothesis developers) spend a lot of time ensuring that bugs can be reproduced once found, and I'd personally be happy to help if the Numpy project is having any trouble with Hypothesis.
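As a sketch of that workflow (the test below is a placeholder, not one of the tests in this PR):

from hypothesis import given, settings, strategies as st


# print_blob=True makes Hypothesis print a ready-to-paste
# @reproduce_failure(<version>, <blob>) decorator whenever a test fails.
@settings(print_blob=True)
@given(st.floats())
def test_placeholder(x):
    ...

Pasting the printed @reproduce_failure decorator onto the test temporarily replays the exact failing example.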

What parts of Numpy would Hypothesis be especially valuable for?

Generically, parts of your code which are particularly complex, where you have high-level invariants you can check, or both. I would suggest paying particular attention to IO and serialisation round-trips, the dtype system, and broadcasting - the previous issues mentioned above also provide some examples. @rsokl might also have some ideas!

Personally, I like to start simply by trying to call some tricky code using Hypothesis to generate inputs - without asserting anything at all about the result. This often uncovers a more detailed definition of what inputs should be considered valid than has been documented!
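A sketch of that "no assertions" smoke-test style (the target function and strategies here are just an example):

from hypothesis import given
from hypothesis.extra import numpy as npst


# Feed generated arrays into the code under test and assert nothing:
# any exception (or warning, with -W error) becomes a shrunk failing example.
@given(npst.arrays(dtype=npst.scalar_dtypes(), shape=npst.array_shapes()))
def test_repr_does_not_crash(arr):
    repr(arr)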

@mattip
Member

mattip commented Dec 30, 2019

It seems the PR created a test case that raises a DeprecationWarning:

...\numpy\core\tests\test_numeric.py:2063: in test_clip_property
    assert_array_equal(result, expected)
...\numpy\testing\_private\utils.py:728: in func_assert_same_pos
    x_id = func(x)

inf        = inf
xy         = array([nan, nan])

...\numpy\testing\_private\utils.py:774: DeprecationWarning
--------------------------------- Hypothesis ----------------------------------
Falsifying example: test_clip_property(
    self=<numpy.core.tests.test_numeric.TestClip at 0x222966645f8>,
    data=data(...),
)
Draw 1 (shape): (2,)
Draw 2: dtype('float64')
Draw 3: array([nan, nan])
Draw 4: BroadcastableShapes(input_shapes=((1,), (1,)), result_shape=(2,))
Draw 5: 0.0
Draw 6: 0.0

You can reproduce this example by temporarily adding \
@reproduce_failure('4.56.3', b'AXicY2RkYGRgYmBg+P+BAQQYGfACAEm1AfY=') \
as a decorator on your test case

@Zac-HD
Contributor Author

Zac-HD commented Dec 31, 2019

@mattip - yep!

  • Should I just skip inputs that raise this warning?
  • Is clip expected to work with arguments that are not of the same dtype as the array? What about arguments to which array elements could safely be cast?
  • Is clip expected to work with zero-dimensional arrays as arguments? If so, I need to file a bug; if not I'd like to explain why in a comment.

I'll be working on HypothesisWorks/hypothesis#2218 for the next day or two, and then back to this 😄

@mattip mattip mentioned this pull request Dec 31, 2019
@mattip
Member

mattip commented Dec 31, 2019

This np.clip test is far from "just add a decorator with a strategy and run the tests like normal". There is a lot going on here: drawing various dtypes and shapes for arr, amax, amin; rejecting edge cases that maybe should succeed.

The case that is tripping up the test should have succeeded since arr, result, and expected are all np.array([nan, nan]) (for the record amin=0.0, amax=0.0). It seems something is strange about the nan which trips up assert_array_equal:

>>> np.unpackbits(arr[:1].view('uint8'))
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
      dtype=uint8)
>>> np.unpackbits(np.array([np.nan]).view('uint8'))
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
      dtype=uint8)

I can reproduce the failure using this bit pattern:

import numpy as np

# Reconstruct the failing value from its bit pattern (on a little-endian
# machine this is 0x7ff0000000000001: exponent all ones, quiet bit clear,
# i.e. a signaling NaN).
a = np.packbits([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
b = a.view(np.float64)
np.isnan(b)
# array([True])
b == np.inf
# RuntimeWarning: invalid value encountered in equal
This is probably a bug, xref gh-15127

Edit: specify the np.clip test in the first few words

@mattip
Member

mattip commented Dec 31, 2019

On the one hand, you might want to claim that gh-15209 is a legitimate cleanup for an edge case. On the other, if using hypothesis means we need to pepper the code with performance-reducing fixes for edge cases no-one has come across in the many years numpy has been around, I am not so sure it is a clear win.

@Zac-HD
Contributor Author

Zac-HD commented Dec 31, 2019

I'd limit my claim to "it's probably better to discover such issues via Hypothesis than via users" - it's perfectly legitimate to just exclude such edge cases from the test (and ideally document somewhere that such edge cases exist, I guess)!

Personally I'd prefer fast code with a note that if I'm dealing with nans I might want such-and-such a recipe, but without this test we wouldn't even know we were making that tradeoff.

This np.clip test is far from "just add a decorator with a strategy and run the tests like normal". ... rejecting edge cases that maybe should succeed.

The clip test is quite a lot more complicated, because the arguments leading into the last few lines have some complicated relationships: we have an input array with a shape and dtype, and two bounds which are of the same dtype and either scalars or of a broadcastable shape! The tests I based it on are simpler, but don't explore the broadcasting or shape issues.

I rejected those edge cases precisely because they don't succeed, and I was hoping to demonstrate a passing test in this PR 😅... If they maybe should succeed though, I think that's a good example of the design feedback that naturally falls out of writing property-based tests!

@Zac-HD Zac-HD force-pushed the property-tests branch 2 times, most recently from cfb1ff6 to 707b984 on January 4, 2020 01:49
@charris charris changed the title Add property-based tests using Hypothesis ENH: Add property-based tests using Hypothesis Jan 8, 2020
@Zac-HD Zac-HD force-pushed the property-tests branch 2 times, most recently from e10d1ad to e76d769 on January 9, 2020 07:54
@Zac-HD
Contributor Author

Zac-HD commented Jan 9, 2020

@mattip - I've rebased on master (which includes #15230), and made the comments much more detailed.

The build appears to be failing due to an unrelated mingw issue; assuming that's fixed, is there anything else I need to do before this can be merged?

Getting the first test in and everything configured for CI will enable contributions from anyone else who has written property-based tests for Numpy, so I'd prefer to add more tests in a follow-up PR 🙂

Member

@mattip mattip left a comment


The np.clip test is much clearer now, along with the problems of using scalar nan values (it might be an issue, hard to decide).

Let's see if other reviewers have an opinion about the advantages of using hypothesis at all in numpy in general, and in these two tests in particular.

@Zac-HD Zac-HD force-pushed the property-tests branch 2 times, most recently from bebf05d to 866b0e2 on January 17, 2020 00:47
@Zac-HD
Contributor Author

Zac-HD commented Jan 17, 2020

@mattip - has anyone chimed in on the mailing list?

@tylerjereddy
Contributor

I do find the revised version of my property-based tests for clip a bit harder to read because of all the extra hypothesis machinery being used inside the test function proper. Perhaps that's because I like thinking about hypothesis conceptually as an intelligent version of pytest parametrization, where the case generation typically happens before the def test_something():.

I can reiterate that hypothesis was useful for evaluating Eric's clip ufunc PR, and indeed I'm pretty sure some documentation adjustments were made as a result of corner cases that, for example, Marten & Eric didn't want to slow things down for. I think it was as simple as adding "No check is performed to ensure a_min < a_max."

So, I guess I'm just agreeing with the discussion above that it is probably better to at least have a way to semi-automate the discovery of corner cases that we may want to document even if we don't want an outrageous input handling framework on every single function after it gets probed over a large input space by hypothesis. Realistically, I think we're going to find bugs that core devs will occasionally want to fix by this means though.

I'm still +1, and one of the lead maintainers of hypothesis seems open to helping review/maintain the tests. Are there substantial ecosystem concerns with a new testing dependency? IIRC, @rgommers didn't seem too worried the last time this came up, since it is just on the testing side?

The only person who answered the mailing list query about this in September 2019 was me.

Some other concerns that seem to have come up in related PRs:

  • documenting & perhaps discouraging usage outside of CI in routine cases so that we don't get increased volume of issue reports related to hypothesis "case variance" in different environments
  • extra work for core devs reviewing PRs (yet another testing "framework" to learn)---perhaps offset by Zac helping out, at least initially, with that

@Zac-HD
Contributor Author

Zac-HD commented Jan 23, 2020

I do find the revised version of my property-based tests for clip a bit harder to read because of all the extra hypothesis machinery being used inside the test function proper. Perhaps that's because I like thinking about hypothesis conceptually as an intelligent version of pytest parametrization, where the case generation typically happens before the def test_something():.

@tylerjereddy - that is how I'd usually write property-based tests, but in this case several of the inputs (dtypes and shapes) depend on each other.

It would be trivial to move that part to a custom strategy with @st.composite, but unless I know I need the same customised data for multiple tests, I prefer the inline st.data() style - it's trivial to go the other way if we add a second test later.
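To make the contrast concrete, a much-simplified sketch of the two styles (the strategies and test bodies here are illustrative, not the real clip test):

import numpy as np
from hypothesis import given, strategies as st
from hypothesis.extra import numpy as npst


# Inline style: dependent draws (a dtype, then an array of that dtype)
# happen inside the test body via st.data().
@given(data=st.data())
def test_inline_style(data):
    dtype = data.draw(npst.floating_dtypes(), label="dtype")
    arr = data.draw(npst.arrays(dtype, npst.array_shapes()), label="array")
    np.clip(arr, 0, 1)


# Composite style: the same dependent draws, factored out as a reusable strategy.
@st.composite
def float_arrays(draw):
    dtype = draw(npst.floating_dtypes())
    return draw(npst.arrays(dtype, npst.array_shapes()))


@given(arr=float_arrays())
def test_composite_style(arr):
    np.clip(arr, 0, 1)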

So, I guess I'm just agreeing with the discussion above that it is probably better to at least have a way to semi-automate the discovery of corner cases that we may want to document even if we don't want an outrageous input handling framework on every single function after it gets probed over a large input space by hypothesis.

^ exactly this: I'm not saying that Numpy should handle everything, just that it should document what it doesn't. And of course you can constrain Hypothesis' input space to only things which are expected to work; that's the whole point.

Realistically, I think we're going to find bugs that core devs will occasionally want to fix by this means though.

Two or three in this PR, and another half-a-dozen downstream that I know of already!

Are there substantial ecosystem concerns with a new testing dependency? ... perhaps discouraging usage outside of CI in routine cases so that we don't get increased volume of issue reports related to hypothesis "case variance" in different environments

As I've said on the other issues, I don't expect this to be a problem, but if it comes up we can make Hypothesis tests from an as-installed Numpy deterministic; or even take them out of the installed package so they only run in CI.

I'm still +1, and one of the lead maintainers of hypothesis seems open to helping review/maintain the tests.

Yep, I'm happy to help with review and general maintenance of Hypothesis-related open source, and as a foundational project Numpy gets a very high priority from me. I know I'm not the only person willing to help either, but there's some frustration about the slow and lukewarm response to issues and pull requests like this one 😕

@rgommers rgommers added this to the 1.19.0 release milestone Jan 24, 2020
@rgommers
Member

IIRC, @rgommers didn't seem too worried the last time this came up, since it is just on the testing side?

Yes, adding a pure-Python test-only dependency is perfectly fine, easy for devs and no issue for users. My initial concerns were answered in the other thread.

Yep, I'm happy to help with review and general maintenance of Hypothesis-related open source, and as a foundational project Numpy gets a very high priority from me. I know I'm not the only person willing to help either, but there's some frustration about the slow and lukewarm response to issues and pull requests like this one 😕

Much appreciated @Zac-HD. Apologies that things move slowly; the initial PR was harder to see the value of, and now it's mainly an issue of reviewer bandwidth. Personally, I'm all out of bandwidth for the next 2 weeks; happy for someone else to help get this to completion, and if not, I've added it to the milestone for the next release so we won't forget.

@mattip
Member

mattip commented Feb 13, 2020

After discussion we decided to put this in, with the caveat that we may choose to ignore edge cases that hypothesis testing may generate, and mark them as "known failures". This is to avoid a situation where

  • hypothesis comes up with a valid bug
  • fixing it would require significant effort or break other things
  • no user has reported the issue, so it probably does not come up in real-world use
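As an illustration only (the strategies, test names, and reasons below are placeholders, not tests in this PR), a "known failure" could be recorded either by carving the edge case out of the input space or by xfail-ing the test:

import numpy as np
import pytest
from hypothesis import assume, given
from hypothesis.extra import numpy as npst


# Option 1: exclude the problematic inputs and document them as unsupported.
@given(npst.arrays(np.float64, npst.array_shapes()))
def test_with_edge_case_excluded(arr):
    assume(not np.isnan(arr).any())  # edge case deliberately out of scope
    ...


# Option 2: keep generating the edge case but record it as a known failure.
@pytest.mark.xfail(reason="edge case found by Hypothesis; see tracking issue")
@given(npst.arrays(np.float64, npst.array_shapes()))
def test_recorded_as_known_failure(arr):
    ...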

@mattip mattip merged commit abdd996 into numpy:master Feb 13, 2020
@mattip
Member

mattip commented Feb 13, 2020

Thanks @Zac-HD

@Zac-HD Zac-HD deleted the property-tests branch February 13, 2020 07:42
@Zac-HD
Contributor Author

Zac-HD commented Feb 13, 2020

Sounds like a good plan to me - we've already got two of those in this PR 😄

And my sincere thanks to everyone who has spent valuable time reviewing and discussing this; I know adding a new dependency and style of testing are both tricky decisions and appreciate your patient engagement. But it should be much easier to add tests from now on, or co-develop them with new or refactored features!

danieldk added a commit to danieldk/nixpkgs that referenced this pull request Aug 13, 2020
numpy has added hypothesis as a dependency for running property-based
tests:

numpy/numpy#15189

However, hypothesis is not an input of the derivations. And thus tests
have been failing with:

ERROR .. - ModuleNotFoundError: No module named 'hypothesis'

https://hydra.nixos.org/build/124613306/nixlog/1

This change adds hypothesis to checkInputs, so that tests are run
again.
FRidh pushed a commit to NixOS/nixpkgs that referenced this pull request Aug 15, 2020