TST: Add the first test using hypothesis #14440


Closed · wants to merge 5 commits

Conversation

@kitchoi (Contributor) commented Sep 6, 2019

This pull request adds the first test that uses hypothesis and hence brings in hypothesis as an additional test dependency.

@mattip Could you take a look at this please?

This adds the first test that uses hypothesis and hence brings in hypothesis as an additional test dependency.

cf. EuroScipy 2019 Sprint
@rgommers (Member) commented Sep 6, 2019

Thanks @kitchoi!

This will need to be discussed on the mailing list, since it adds a new dependency.

@kitchoi (Contributor, Author) commented Sep 8, 2019

@rgommers Thanks! Yes I was wondering about that too, which is why I only added one test. The idea is that if the community agrees this is a good idea, this pull request will bring in the initial setup so that we can add more tests afterwards. I am also happy to add more tests in this pull request, to make the inclusion of a new test dependency more worthwhile.

@mattip (Member) commented Sep 8, 2019

I think @rgommers is simply asking that you post a mail to numpy-discussion@python.org pointing to this PR, with a few words about hypothesis and why you think it is a good idea to start using it. The questions people might ask, fwiw, would be "does it find anything our tests miss" (the current test seems to pass), and "how much does it slow things down" (it would be nice to check that).

@kitchoi (Contributor, Author) commented Sep 8, 2019

I see. Thanks @mattip

When I started working on this at the EuroScipy sprint, I was under the impression that there was already interest in the community in bringing hypothesis into numpy. I don't really have a strong opinion about bringing in hypothesis, and therefore I am sorry to say that I would rather let someone else who does start the discussion.

Shall we close this, or does one of you want to be that advocate?

@mattip (Member) commented Sep 8, 2019

Let's leave this open and someone else will write to the list. Thanks for working out what needs to change for this to go in.

@rgommers (Member) commented Sep 8, 2019

When I started working on this at the EuroScipy sprint, I was under the impression that there was already interest in the community in bringing hypothesis into numpy.

We discussed this at the SciPy'19 sprints as well. There was some interest there, also from one of the hypothesis maintainers (@Zac-HD), and the plan was to open a PR with some tests (like you did here) so we could have a discussion based on concrete examples and benefits rather than "hypothesis looks cool". Your PR just happens to be the first to be submitted.

@@ -399,6 +403,15 @@ def test_wide_element(self):
"[ 'xxxxx']"
)

@hypothesis.given(hypothesis.extra.numpy.from_dtype(np.dtype("U")))
def test_any_text(self, text):
Review comment (Member):

My main question here is: are the text strings that are generated by hypothesis 100% reproducible over time/versions? And if not, what's the workflow to deal with a bug report for some random string by random user on some OS/install method/etc.?

Reply (Contributor, Author):

Thanks. Commit 28887ac should address this.

I tested the reproducibility with the following approach:

  1. Before making the change in 28887ac, add a side-effect in the test to append the generated text to a file ~/text.txt (a sketch of this side-effect follows this list).
  2. Set PYTHONHASHSEED=random.
  3. Run the test once, then mv ~/text.txt ~/text0.txt.
  4. Run the test again and diff ~/text.txt ~/text0.txt -> the files are different.
  5. Now make the change in 28887ac.
  6. Repeat steps 3 and 4 -> the files are now identical.
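
For reference, a minimal sketch of the side-effect described in step 1, assuming the standalone-function form of the test from the diff above (the file path and the use of repr are illustrative):

import os

import numpy as np
import hypothesis
import hypothesis.extra.numpy

# Sketch only: append every generated string to ~/text.txt so that two
# consecutive runs can be compared with diff.
@hypothesis.given(hypothesis.extra.numpy.from_dtype(np.dtype("U")))
def test_any_text(text):
    with open(os.path.expanduser("~/text.txt"), "a", encoding="utf-8") as f:
        f.write(repr(text) + "\n")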

Reply (Member):

We can also use the @example directive to force a known failing case into the test once it is reported by a user or by the CI system.
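
For instance, a hedged sketch of that idea (the pinned inputs here are made up for illustration, not known failures):

import numpy as np
from hypothesis import example, given, strategies as st

# Sketch: @example forces specific inputs to run on every invocation, in
# addition to the cases randomly generated via @given.
@given(st.text())
@example("")         # hypothetical regression case
@example("\x00abc")  # hypothetical regression case
def test_text_array_repr_does_not_crash(text):
    repr(np.array([text], dtype="U"))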

@kitchoi (Contributor, Author) commented Sep 9, 2019

We discussed this at the SciPy'19 sprints as well. There was some interest there, also from one of the hypothesis maintainers (@Zac-HD), and the plan was to open a PR with some tests (like you did here) so we could have a discussion based on concrete examples and benefits rather than "hypothesis looks cool". Your PR just happens to be the first to be submitted.

Thank you for the clarification. This gives me a better understanding of where we are at.
If there is any part of the test suite you know could benefit from hypothesis, please let me know and I am happy to explore those areas. Perhaps a bit unconventional, but given that this requires some experimentation: if anyone wants to push directly to this branch, I am happy to add collaborators to my forked repo. Just let me know.

@Zac-HD (Contributor) commented Sep 9, 2019

Hi all! Hypothesis maintainer here and very happy to answer any questions you might have. To get started:

What does Hypothesis do that other tests don't?

Computer-generated test cases tend to expose edge cases that hand-written tests don't. Usually they could have been exposed by hand-written tests, but it would take far more effort to do so without tools.

For Numpy itself, #10930, #13089, and #14239 have all been found by downstream users of Hypothesis!

How slow are Hypothesis tests?

This mostly depends on how many test cases you run - the default is 100 cases but you can choose any other number on a global or per-test basis. Typically generating data takes between 1% and 20% of the runtime for that test function, though it could be higher if we generate large arrays and apply a fast operation.

An important point is that we do not suggest using Hypothesis for everything! Property-based tests tend to be more general but less numerous than traditional unit tests, for a given level of testing. You can easily skip them from certain CI jobs (pytest -m "not hypothesis") if that becomes an issue, or use a setting profile that runs fewer examples in CI.

Performance is rarely a problem in practice though, so I would ultimately suggest waiting to measure before trying to optimize!
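
As a concrete illustration of the settings-profile idea mentioned above (the profile names and example counts here are made up):

import os

from hypothesis import settings

# Sketch for a conftest.py: register profiles with different example counts
# and select one via an environment variable; when unset, fall back to the
# built-in "default" profile (100 examples per test).
settings.register_profile("ci", max_examples=25)
settings.register_profile("nightly", max_examples=1000)
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "default"))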

Reproducing test failures from CI, users, etc.

We have docs about this 😄. To summarize, the test may behave differently on different Python versions or operating systems (but that is true of non-Hypothesis tests too); short of that, Hypothesis will print the input example and we can use the print_blob=True setting to be sure of reproducing the failure. Changes between Hypothesis versions are uncommon but not unknown, which is why the print_blob setting dumps the exact Hypothesis version too.

I would discourage using the @seed decorator or derandomize=True setting, as the net effect is to reduce the number of bugs found. We (Hypothesis developers) spend a lot of time ensuring that bugs can be reproduced once found, and I'd personally be happy to help if the Numpy project is having any trouble with Hypothesis.
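
A hedged sketch of what that looks like in practice (the property being tested is just a stand-in):

import numpy as np
from numpy.testing import assert_array_equal
from hypothesis import given, settings, strategies as st

# Sketch: with print_blob=True, a failing run prints a @reproduce_failure(...)
# line recording the exact Hypothesis version and the failing input blob.
@settings(print_blob=True)
@given(st.lists(st.floats(allow_nan=False, allow_infinity=False)))
def test_sort_is_idempotent(values):
    arr = np.array(values, dtype=np.float64)
    assert_array_equal(np.sort(np.sort(arr)), np.sort(arr))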

What parts of Numpy would Hypothesis be especially valuable for?

Generically, parts of your code which are particularly complex, where you have high-level invariants you can check, or both. I would suggest paying particular attention to IO and serialisation round-trips, the dtype system, and broadcasting - the previous issues mentioned above also provide some examples. @rsokl might also have some ideas!

Personally, I like to start simply by trying to call some tricky code using Hypothesis to generate inputs - without asserting anything at all about the result. This often uncovers a more detailed definition of what inputs should be considered valid than has been documented!

@rgommers (Member) commented Sep 9, 2019

Thanks @Zac-HD!

It seems to me that the best time to run hypothesis tests is (1) when writing new functionality, and (2) when touching existing code (in CI). Perhaps less so by end users when they run np.test() - it feels like we're then going to end up with lots of corner case bug reports (I could be wrong there).

Perhaps there's also a nice way to add generic tests for, for example, nan, inf and empty array input - that's the kind of thing that often gets forgotten about.

@Zac-HD (Contributor) commented Sep 10, 2019

It seems to me that the best time to run hypothesis tests is (1) when writing new functionality, and (2) when touching existing code (in CI). Perhaps less so by end users when they run np.test() - it feels like we're then going to end up with lots of corner case bug reports (I could be wrong there).

Empirically it's rare for additional runs in CI or on user machines to find new bugs after the tests have passed on a pull request and been merged, but not unknown - Hypothesis self-tests pretty heavily and we see maybe 1-3 such cases per year.

On the other hand this is basically a perfect case for settings profiles: we can configure a profile for np.test() with derandomize=True. That would allow users to run the same tests even on very exotic platforms, while still in deterministic mode.
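
A minimal sketch of such a profile, assuming np.test() (or a conftest.py it uses) loads it; the profile name is illustrative:

from hypothesis import settings

# Sketch: a deterministic profile that an np.test() entry point could load so
# that end-user runs always see the same generated examples.
settings.register_profile("np-test", derandomize=True)
settings.load_profile("np-test")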

Perhaps there's also a nice way to add generic tests for, for example, nan, inf and empty array input - that's the kind of thing that often gets forgotten about.

Hypothesis may generate non-finite numbers wherever you ask for floats without specifying that they're disallowed. For non-Hypothesis tests, I'd just use a checklist - they're both useful reminders for maintainers and a great way of documenting and sharing project best-practice with new contributors.

I'm not quite sure what you mean by empty arrays - those with a dimension of size zero? Uninitialised values?

@tylerjereddy (Contributor) commented:

I'm +1 on using hypothesis for testing; I've already used it to evaluate NumPy & SciPy PRs in the past. Indeed, I maintain small property-based testing repos for NumPy and SciPy for this purpose.

On the NumPy side, this revealed some interesting details when reviewing Eric's clip ufunc PR, details that would have been difficult for a human to probe in a sane amount of time.

@rgommers (Member) commented:

On the other hand this is basically a perfect case for settings profiles: we can configure a profile for np.test() with derandomize=True. That would allow users to run the same tests even on very exotic platforms, while still in deterministic mode.

That makes perfect sense to me.

I'm not quite sure what you mean by empty arrays - those with a dimension of size zero? Uninitialised values?

One of the dimensions has size zero, so x.size == 0. x.shape still matters, and can be n-D (e.g. (5, 0, 3) or (0, 4)).

Empty arrays are quite easy to get in regular code, and often lead to unexpected behavior.
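
Concretely:

>>> import numpy as np
>>> x = np.ones((5, 0, 3))
>>> x.size
0
>>> x.shape
(5, 0, 3)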

For non-Hypothesis tests, I'd just use a checklist - they're both useful reminders for maintainers and a great way of documenting and sharing project best-practice with new contributors.

True. On the other hand, few people seem to read our existing docs on how to test/document in detail, so built-in testing is nice :)

@rsokl (Contributor) commented Sep 11, 2019

Perhaps there's also a nice way to add generic tests for, for example, nan, inf and empty array input - that's the kind of thing that often gets forgotten about.

@rgommers, the following hypothesis strategy:

import hypothesis.extra.numpy as hnp
hnp.arrays(shape=hnp.array_shapes(min_side=0), dtype=hnp.floating_dtypes())

will take care of that straight away. E.g.

>>> strat = hnp.arrays(
...     shape=hnp.array_shapes(min_side=0, min_dims=0), dtype=hnp.floating_dtypes()
... )
>>> [strat.example() for i in range(5)]
[array(nan, dtype=float16),
 array([], shape=(3, 0), dtype=float16),
 array([[-inf, -inf, -inf, -inf, -inf],
        [-inf, -inf, -inf, -inf, -inf],
        [-inf, -inf, -inf, -inf, -inf],
        [-inf, -inf, -inf, -inf, -inf]], dtype=float16),
 array(0., dtype=float16),
 array([           inf, -2.0721057e+16,  9.9999997e-06,           -inf,
                  -inf], dtype=float32)]

You could also construct strategies that only explore the edge cases that you are concerned with (although that is a little bit antithetical to property-based testing), and create a simple decorator that runs functions through these cases.
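
A hedged sketch of how such a strategy might be wired into an actual test; the property checked here (NaNs sorting to the end of a 1-D array) is only an illustration:

import numpy as np
import hypothesis.extra.numpy as hnp
from hypothesis import given

# Sketch: 1-D float arrays of any length (including empty) and any float
# width, with nan/inf allowed by default.
@given(hnp.arrays(dtype=hnp.floating_dtypes(),
                  shape=hnp.array_shapes(min_side=0, min_dims=1, max_dims=1)))
def test_sort_puts_nans_last(arr):
    out = np.sort(arr)
    n_nan = int(np.isnan(arr).sum())
    assert np.isnan(out[len(out) - n_nan:]).all()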

test_requirements.txt (outdated review thread, resolved)
@mattip (Member) commented Sep 16, 2019

Could we find a few more test cases for hypothesis: concatenation with various shapes, unicode round-tripping, or pickling/unpickling?

@Zac-HD (Contributor) commented Sep 17, 2019

Could we find a few more test cases for hypothesis: concatenation with various shapes, unicode round-tripping, or pickling/unpickling?

@mattip I'd prefer to make these a follow-up PR just so we don't push everything on to new contributors, but there is some prior art:

I could see a case for copying over the first, but IMO the rest of them should be in separate PRs, not least because they might find some bugs - and that always slows things down!

If it helps, I'd be happy to write a PR full of IO tests after this one is merged 😄
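
For what it's worth, a hedged sketch of what one such IO test might look like (dtype coverage is deliberately limited to integer and floating dtypes here):

import pickle

import hypothesis.extra.numpy as hnp
from hypothesis import given
from numpy.testing import assert_array_equal

# Sketch: pickling and unpickling should preserve dtype, shape, and contents
# for arbitrary shapes, including empty arrays.
@given(hnp.arrays(dtype=hnp.integer_dtypes() | hnp.floating_dtypes(),
                  shape=hnp.array_shapes(min_side=0)))
def test_pickle_roundtrip(arr):
    restored = pickle.loads(pickle.dumps(arr))
    assert restored.dtype == arr.dtype
    assert restored.shape == arr.shape
    assert_array_equal(restored, arr)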

@Zac-HD (Contributor) commented Dec 9, 2019

@mattip / @rgommers - is there anything I could do to help get this merged?

@mattip (Member) commented Dec 9, 2019

My hesitation is that the test showcased here is not a very good example of how I would like to see hypothesis used.

  • There are no comments explaining, for people who are not familiar with hypothesis, what is going on
  • Character-based ndarrays are not really our bread-and-butter. Are there examples where hypothesis will help with numerical code, and not just with __repr__?
  • Even better, can we use any of the issues/PRs/comments in the comment above to replace current tests, with comments to why hypothesis is better?

This PR should also extend the documentation somewhere with best practices around hypothesis. Has another project already done such documentation and we can reference it?

The mailing list discussion garnered one response, which was "let's discuss it on the PR", so I guess the general community has no real opinion.

@kitchoi (Contributor, Author) commented Dec 9, 2019

This PR should also extend the documentation somewhere with best practices around hypothesis. Has another project already done such documentation and we can reference it?

Agreed, if there exists a battle-tested set of best practices, it would be good to reference it, but I don't know where it is. On top of that, maybe there are sets of best practices that only apply to numpy? For example, assuming we want reproducible builds (yes, Zac discourages this), I am not sure whether we should use a seed per test, or use settings for the entire build.

@Zac-HD (Contributor) commented Dec 10, 2019

Character-based ndarrays are not really our bread-and-butter. Are there examples where hypothesis will help with numerical code, and not just with __repr__?

#10930 and #13089 are serious Numpy bugs which were detected using Hypothesis in downstream projects. I've spoken to @rsokl and he's happy to contribute his einsum test - it's a complicated one but covers a far wider variety of signatures and inputs than would be possible with unit tests.

Even better, can we use any of the issues/PRs/comments in the comment above to replace current tests, with comments to why hypothesis is better?

For example, this test would have failed for any array that did not contain a 1:

import numpy as np
from numpy.testing import assert_array_equal

p = np.ones((10, 2))
q = np.ones((1, 2))
assert_array_equal(np.einsum('ij,ij->j', p, q, optimize=True),
                   np.einsum('ij,ij->j', p, q, optimize=False))

Generating a variety of shapes, as well as element values, is even more powerful.
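
A hedged sketch of what that might look like (shape bounds and the use of assert_allclose are illustrative choices):

import numpy as np
import hypothesis.extra.numpy as hnp
from hypothesis import given, strategies as st
from numpy.testing import assert_allclose

finite = st.floats(min_value=-1e3, max_value=1e3)  # bounded, so no nan/inf

# Sketch: draw one shared 2-D shape so that 'ij,ij->j' is valid for both
# operands, then check that the optimized and unoptimized paths agree.
@given(data=st.data(),
       shape=hnp.array_shapes(min_dims=2, max_dims=2, min_side=0, max_side=5))
def test_einsum_optimize_paths_agree(data, shape):
    p = data.draw(hnp.arrays(np.float64, shape, elements=finite))
    q = data.draw(hnp.arrays(np.float64, shape, elements=finite))
    assert_allclose(np.einsum('ij,ij->j', p, q, optimize=True),
                    np.einsum('ij,ij->j', p, q, optimize=False))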

assuming we want reproducible builds (yes, Zac discourages this), I am not sure whether we should use a seed per test, or use settings for the entire build.

Use a settings profile with derandomize=True; this use-case is why the setting exists. I do want to clarify that Hypothesis is designed so that failures are always reproducible, even from randomly generated inputs.

This PR should also extend the documentation somewhere with best practices around hypothesis. Has another project already done such documentation and we can reference it?

I think this reduces to "when writing tests, use Hypothesis in the obvious way". If that doesn't work, we'd consider that a bug in Hypothesis... the only tweak I can think of is tuning the project-level settings profiles to run more or fewer examples for performance reasons.

@mattip (Member) commented Dec 10, 2019

#10930 and #13089 are serious Numpy bugs which were detected using Hypothesis

Please add the hypothesis tests that detected those bugs into this PR, or open a new one with the tests, and mention the relevant issues. Can the tests affected by those issues be converted to hypothesis tests? That would be more convincing.

when writing tests, use Hypothesis in the obvious way

That would be a great start to this part of the documentation, in a sentence about what kind of tests lend themselves to using hypothesis, with a link to what "obvious way" means for hypothesis.

@Zac-HD (Contributor) commented Dec 10, 2019

@mattip - I'm aiming to open a new PR with those tests and some docs later this week 😄

And re-reading my comment, it's not helpful at all 😓... I meant something like "I would use Hypothesis for almost all (>90%) tests, and in almost all cases do so by describing the input data with builtin strategies with essentially the same test body I would usually write".

People usually pick it up pretty quickly - the user experience is almost exactly that of @pytest.mark.parametrize where you haven't passed the expected output, but with a much more concise description of the inputs and wider variety of test cases.
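
A hedged side-by-side of that comparison (the toy property is purely illustrative):

import pytest
from hypothesis import given, strategies as st

# With parametrize, the inputs are enumerated by hand:
@pytest.mark.parametrize("values", [[], [0], [1, 2, 3]])
def test_negate_twice_parametrized(values):
    assert [-(-v) for v in values] == values

# With Hypothesis, the same test body runs over generated inputs instead:
@given(st.lists(st.integers()))
def test_negate_twice_hypothesis(values):
    assert [-(-v) for v in values] == values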
