TST: Add the first test using hypothesis #14440
Conversation
This adds the first test that uses hypothesis and hence brings in hypothesis as an additional test dependency. cf. EuroScipy 2019 Sprint
Thanks @kitchoi! This will need to be discussed on the mailing list, since it adds a new dependency.
@rgommers Thanks! Yes, I was wondering about that too, which is why I only added one test. The idea is that if the community agrees this is a good idea, this pull request will bring in the initial setup so that we can add more tests afterwards. I am also happy to add more tests in this pull request, to make the inclusion of a new test dependency more worthwhile.
I think @rgommers is simply asking that you post a mail to numpy-discussion@python.org pointing to this PR, with a few words about hypothesis and why you think it is a good idea to start using it. The questions people might ask, fwiw, would be "does it find anything our tests miss" (the current test seems to pass), and "how much does it slow things down" (it would be nice to check that).
I see, thanks @mattip. When I started working on this at the EuroScipy sprint, I was under the impression that there was already interest in the community in bringing hypothesis into numpy. I don't really have a strong opinion about bringing in hypothesis, and therefore I am sorry to say that I would rather let someone else who does start the discussion. Shall we close this, or does either of you want to be that advocate?
Let's leave this open and someone else will write to the list. Thanks for working out what needs to change for this to go in.
We discussed this at the SciPy'19 sprints as well. There was some interest there, also from one of the …
```diff
@@ -399,6 +403,15 @@ def test_wide_element(self):
             "[ 'xxxxx']"
         )

+    @hypothesis.given(hypothesis.extra.numpy.from_dtype(np.dtype("U")))
+    def test_any_text(self, text):
```
My main question here is: are the `text` strings that are generated by hypothesis 100% reproducible over time/versions? And if not, what's the workflow to deal with a bug report for some random string by a random user on some OS/install method/etc.?
Thanks. Commit 28887ac should address this.

I tested the reproducibility with the following approach. Before making the change in 28887ac:

1. Add a side effect in the test that appends `text` to a file `~/text.txt`.
2. Set `PYTHONHASHSEED=random`.
3. Run the test once, then `mv ~/text.txt ~/text0.txt`.
4. Run the test again, then `diff ~/text.txt ~/text0.txt` -> the files are different.
5. Make the change in 28887ac.
6. Perform steps 3 and 4 again -> the files are now identical.
We can also use the `@example` directive to force a known failing case into the test once it is reported by a user or by the CI system.
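For illustration, a minimal sketch (not taken from this PR; the property and the pinned string are invented) of how a reported failure can be locked in with `@example` so it is replayed on every run alongside the generated cases:

```python
import numpy as np
import hypothesis
import hypothesis.extra.numpy
from hypothesis import example


@hypothesis.given(hypothesis.extra.numpy.from_dtype(np.dtype("U")))
@example("some string from a bug report")  # hypothetical pinned regression input
def test_text_survives_an_object_array_roundtrip(text):
    # A deliberately simple property: storing a Python str in an object
    # array and reading it back should return the same string.
    a = np.array([text], dtype=object)
    assert a[0] == text
```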
Thank you for the clarification. This gives me a better understanding of where we are at.
Hi all! Hypothesis maintainer here, very happy to answer any questions you might have. To get started:

What does Hypothesis do that other tests don't?
Computer-generated test cases tend to expose edge cases that hand-written tests don't. Usually they could have been exposed by hand-written tests, but it would take far more effort to do so without tools. For Numpy itself, #10930, #13089, and #14239 have all been found by downstream users of Hypothesis!

How slow are Hypothesis tests?
This mostly depends on how many test cases you run - the default is 100 cases, but you can choose any other number on a global or per-test basis. Typically generating data takes between 1% and 20% of the runtime for that test function, though it could be higher if we generate large arrays and apply a fast operation. An important point is that we do not suggest using Hypothesis for everything! Property-based tests tend to be more general but less numerous than traditional unit tests, for a given level of testing, and you can easily skip them in certain CI jobs. Performance is rarely a problem in practice though, so I would ultimately suggest waiting to measure before trying to optimize!

Reproducing test failures from CI, users, etc.
We have docs about this 😄. To summarize, the test may behave differently on different Python versions or operating systems (but that is true of non-Hypothesis tests too); short of that, Hypothesis will print the failing input example and we can use the `@example` decorator. I would discourage using the …

What parts of Numpy would Hypothesis be especially valuable for?
Generically, parts of your code which are particularly complex, where you have high-level invariants you can check, or both. I would suggest paying particular attention to IO and serialisation round-trips, the dtype system, and broadcasting - the issues mentioned above also provide some examples. @rsokl might also have some ideas!

Personally, I like to start simply by trying to call some tricky code using Hypothesis to generate inputs - without asserting anything at all about the result. This often uncovers a more detailed definition of what inputs should be considered valid than has been documented!
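As an aside, a minimal sketch of the per-test knob mentioned above (the test itself is a toy chosen for illustration, not something from this PR):

```python
from hypothesis import given, settings, strategies as st


@settings(max_examples=25)  # run 25 generated cases instead of the default 100
@given(st.integers())
def test_abs_is_nonnegative(x):
    assert abs(x) >= 0
```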
Thanks @Zac-HD! It seems to me that the best time to run hypothesis tests is (1) when writing new functionality, and (2) when touching existing code (in CI). Perhaps less so by end users when they run the test suite themselves. Perhaps there's also a nice way to add generic tests for, for example, empty arrays and nan/inf handling?
Empirically it's rare for additional runs in CI or on user machines to find new bugs after the tests have passed on a pull request and been merged, but not unknown - Hypothesis self-tests pretty heavily and we see maybe 1-3 such cases per year. On the other hand, this is basically a perfect case for settings profiles: we can configure a profile for each of these situations.
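A minimal sketch of how such profiles are commonly wired up in a `conftest.py` (the profile names, example counts, and the `HYPOTHESIS_PROFILE` environment variable are conventions chosen for illustration, not Hypothesis defaults):

```python
# conftest.py
import os

from hypothesis import settings

# A thorough profile for CI and a fast one for local or end-user runs.
settings.register_profile("ci", max_examples=1000, deadline=None)
settings.register_profile("dev", max_examples=20)

# Select a profile via an environment variable; fall back to the fast one.
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))
```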
Hypothesis may generate non-finite numbers wherever you ask for floats without specifying that they're disallowed. For non-Hypothesis tests, I'd just use a checklist - they're both a useful reminder for maintainers and a great way of documenting and sharing project best practices with new contributors. I'm not quite sure what you mean by empty arrays - those with a dimension of size zero? Uninitialised values?
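For reference, a small sketch of how non-finite values can be excluded when a test only makes sense for finite input (the dtype and shape strategy are arbitrary choices for illustration):

```python
import numpy as np
import hypothesis.extra.numpy as hnp
from hypothesis import strategies as st

# nan and +/-inf are generated by default for floats unless explicitly excluded.
finite_float64_arrays = hnp.arrays(
    dtype=np.float64,
    shape=hnp.array_shapes(),
    elements=st.floats(allow_nan=False, allow_infinity=False, width=64),
)
```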
I'm +1 on using hypothesis for testing; I've already used it to evaluate NumPy & SciPy PRs in the past. Indeed, I maintain small property-based testing repos for NumPy and SciPy for this purpose. On the NumPy side, this revealed some interesting details when reviewing Eric's clip ufunc PR that would have been difficult for a human to probe in a sane amount of time.
That makes perfect sense to me.
One of the dimensions has size zero, so the array contains no elements. Empty arrays are quite easy to get in regular code, and often lead to unexpected behavior.
True. On the other hand, few people seem to read our existing docs on how to test/document in detail, so built-in testing is nice :)
@rgommers, the following hypothesis strategy:

```python
import hypothesis.extra.numpy as hnp

hnp.arrays(shape=hnp.array_shapes(min_side=0), dtype=hnp.floating_dtypes())
```

will take care of that straight away. E.g.

```python
>>> strat = hnp.arrays(
...     shape=hnp.array_shapes(min_side=0, min_dims=0), dtype=hnp.floating_dtypes()
... )
>>> [strat.example() for i in range(5)]
[array(nan, dtype=float16),
 array([], shape=(3, 0), dtype=float16),
 array([[-inf, -inf, -inf, -inf, -inf],
        [-inf, -inf, -inf, -inf, -inf],
        [-inf, -inf, -inf, -inf, -inf],
        [-inf, -inf, -inf, -inf, -inf]], dtype=float16),
 array(0., dtype=float16),
 array([ inf, -2.0721057e+16,  9.9999997e-06, -inf,
        -inf], dtype=float32)]
```

You could also construct strategies that only explore the edge cases that you are concerned with (although that is a little bit antithetical to property-based testing), and create a simple decorator that runs functions through these cases.
Could we find a few more test cases for hypothesis: concatenation with various shapes, unicode round-tripping, or pickling/unpickling?
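As one concrete illustration of the kind of round-trip test suggested here (a sketch, not part of this PR; the dtype and shape strategies are arbitrary choices):

```python
import pickle

from hypothesis import given
import hypothesis.extra.numpy as hnp
from numpy.testing import assert_array_equal


@given(hnp.arrays(dtype=hnp.floating_dtypes(),
                  shape=hnp.array_shapes(min_dims=0, min_side=0)))
def test_pickle_roundtrip(arr):
    # Pickling and unpickling should preserve dtype, shape, and values
    # (NaNs in matching positions compare equal under assert_array_equal).
    restored = pickle.loads(pickle.dumps(arr))
    assert restored.dtype == arr.dtype
    assert restored.shape == arr.shape
    assert_array_equal(restored, arr)
```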
@mattip I'd prefer to make these a follow-up PR just so we don't push everything on to new contributors, but there is some prior art:
I could see a case for copying over the first, but IMO the rest of them should be in separate PRs, not least because they might find some bugs - and that always slows things down! If it helps, I'd be happy to write a PR full of IO tests after this one is merged 😄
My hesitation is that the test showcased here is not a very good example of how I would like to see hypothesis used.
This PR should also extend the documentation somewhere with best practices around hypothesis. Has another project already written such documentation that we can reference? The mailing list discussion garnered one response, which was "let's discuss it on the PR", so I guess the general community has no real opinion.
Agreed - if there exists a battle-tested set of best practices, it would be good to reference it, but I don't know where it is. On top of that, maybe there are best practices that only apply to numpy? For example, assuming we want reproducible builds (yes, Zac discourages this), I am not sure whether we should use …
#10930 and #13089 are serious Numpy bugs which were detected using Hypothesis in downstream projects. I've spoken to @rsokl and he's happy to contribute his …
For example, this test would have failed for any array that did not contain a … (see numpy/numpy/core/tests/test_einsum.py, lines 490 to 493 at c010145).
Generating a variety of shapes, as well as element values, is even more powerful.
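To illustrate what generating over both shapes and element values can look like (a sketch with an invented property, not drawn from this PR):

```python
import hypothesis.extra.numpy as hnp
from hypothesis import given
from numpy.testing import assert_array_equal


@given(hnp.arrays(dtype=hnp.integer_dtypes(),
                  shape=hnp.array_shapes(min_dims=0, min_side=0)))
def test_transposing_twice_is_a_no_op(arr):
    # Holds for 0-d, empty, and arbitrarily shaped arrays alike.
    assert_array_equal(arr.T.T, arr)
```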
Use a settings profile with …
I think this reduces to "when writing tests, use Hypothesis in the obvious way". If that doesn't work, we'd consider that a bug in Hypothesis... the only tweak I can think of is tuning the project-level settings profiles to run more or fewer examples for performance reasons.
Please add the hypothesis tests that detected those bugs into this PR, or open a new one with the tests, and mention the relevant issues. Can the tests affected by those issues be converted to hypothesis tests? That would be more convincing.
That would be a great start to this part of the documentation: a sentence about what kinds of tests lend themselves to using hypothesis, with a link to what "obvious way" means for hypothesis.
@mattip - I'm aiming to open a new PR with those tests and some docs later this week 😄 And re-reading my comment, it's not helpful at all 😓... I meant something like "I would use Hypothesis for almost all (>90%) tests, and in almost all cases do so by describing the input data with built-in strategies and essentially the same test body I would usually write". People usually pick it up pretty quickly - the user experience is almost exactly that of …
This pull request adds the first test that uses hypothesis and hence brings in hypothesis as an additional test dependency.
@mattip Could you take a look at this please?