TST: Add the first test using hypothesis #14440


Closed · wants to merge 5 commits

Conversation

@kitchoi (Contributor) commented Sep 6, 2019

This pull request adds the first test that uses hypothesis and hence brings in hypothesis as an additional test dependency.

@mattip Could you take a look at this please?

This adds the first test that uses hypothesis and hence brings in hypothesis as an additional test dependency.

cf. EuroScipy 2019 Sprint
@rgommers (Member) commented Sep 6, 2019

Thanks @kitchoi!

This will need to be discussed on the mailing list, since it adds a new dependency.

@kitchoi (Contributor, Author) commented Sep 8, 2019

@rgommers Thanks! Yes I was wondering about that too, which is why I only added one test. The idea is that if the community agrees this is a good idea, this pull request will bring in the initial setup so that we can add more tests afterwards. I am also happy to add more tests in this pull request, to make the inclusion of a new test dependency more worthwhile.

@mattip (Member) commented Sep 8, 2019

I think @rgommers is simply asking that you post a mail to numpy-discussion@python.org pointing to this PR, with a few words about hypothesis and why you think it is a good idea to start using it. The questions people might ask, fwiw, would be "does it find anything our tests miss" (the current test seems to pass), and "how much does it slow things down" (it would be nice to check that).

@kitchoi (Contributor, Author) commented Sep 8, 2019

I see. Thanks @mattip

When I started working on this at the EuroScipy sprint, I was under the impression that there was already interest in the community in bringing hypothesis into numpy. I don't really have a strong opinion about bringing in hypothesis, and therefore I am sorry to say that I would rather let someone else who does start the discussion.

Shall we close this, or does one of you want to be that advocate?

@mattip (Member) commented Sep 8, 2019

Let's leave this open and someone else will write to the list. Thanks for working out what needs to change for this to go in.

@rgommers (Member) commented Sep 8, 2019

When I started working on this at the EuroScipy sprint, I was under the impression that there was already interest in the community in bringing hypothesis into numpy.

We discussed this at the SciPy'19 sprints as well. There was some interest there, also from one of the hypothesis maintainers (@Zac-HD), and the plan was to open a PR with some tests (like you did here) so we could have a discussion based on concrete examples and benefits rather than "hypothesis looks cool". Your PR just happens to be the first to be submitted.

@@ -399,6 +403,15 @@ def test_wide_element(self):
"[ 'xxxxx']"
)

@hypothesis.given(hypothesis.extra.numpy.from_dtype(np.dtype("U")))
def test_any_text(self, text):
Review comment (Member):

My main question here is: are the text strings that are generated by hypothesis 100% reproducible over time/versions? And if not, what's the workflow to deal with a bug report for some random string by random user on some OS/install method/etc.?

Reply (Contributor, Author):

Thanks. Commit 28887ac should address this.

I tested the reproducibility with the following approach:

  1. Before making the change in 28887ac, add a side-effect in the test to append the generated text to a file ~/text.txt (a sketch of this side-effect follows this list).
  2. Set PYTHONHASHSEED=random.
  3. Run the test once, then mv ~/text.txt ~/text0.txt.
  4. Run the test again and diff ~/text.txt ~/text0.txt -> the files are different.
  5. Now make the change in 28887ac.
  6. Repeat steps 3 and 4 -> the files are now identical.
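
For reference, a minimal sketch of the side-effect described in step 1, assuming the standalone-function form of the test from the diff above (the file path and the use of repr are illustrative):

import os

import numpy as np
import hypothesis
import hypothesis.extra.numpy

# Sketch only: append every generated string to ~/text.txt so that two
# consecutive runs can be compared with diff.
@hypothesis.given(hypothesis.extra.numpy.from_dtype(np.dtype("U")))
def test_any_text(text):
    with open(os.path.expanduser("~/text.txt"), "a", encoding="utf-8") as f:
        f.write(repr(text) + "\n")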

Reply (Member):

We can also use the @example directive to force a known failing case into the test once it is reported by a user or by the CI system.
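
For instance, a hedged sketch of that idea (the pinned inputs here are made up for illustration, not known failures):

import numpy as np
from hypothesis import example, given, strategies as st

# Sketch: @example forces specific inputs to run on every invocation, in
# addition to the cases randomly generated via @given.
@given(st.text())
@example("")         # hypothetical regression case
@example("\x00abc")  # hypothetical regression case
def test_text_array_repr_does_not_crash(text):
    repr(np.array([text], dtype="U"))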

@kitchoi (Contributor, Author) commented Sep 9, 2019

We discussed this at the SciPy'19 sprints as well. There was some interest there, also from one of the hypothesis maintainers (@Zac-HD), and the plan was to open a PR with some tests (like you did here) so we could have a discussion based on concrete examples and benefits rather than "hypothesis looks cool". Your PR just happens to be the first to be submitted.

Thank you for the clarification. This gives me a better understanding of where we are at.
If there is any part of the test suite you know could benefit from hypothesis, please let me know and I am happy to explore those areas. Perhaps a bit unconventional, but given that this requires some experimentation: if anyone wants to push directly to this branch, I am happy to add collaborators to my forked repo. Just let me know.

@Zac-HD (Contributor) commented Sep 9, 2019

Hi all! Hypothesis maintainer here and very happy to answer any questions you might have. To get started:

What does Hypothesis do that other tests don't?

Computer-generated test cases tend to expose edge cases that hand-written tests don't. Usually they could have been exposed by hand-written tests, but it would take far more effort to do so without tools.

For Numpy itself, #10930, #13089, and #14239 have all been found by downstream users of Hypothesis!

How slow are Hypothesis tests?

This mostly depends on how many test cases you run - the default is 100 cases but you can choose any other number on a global or per-test basis. Typically generating data takes between 1% and 20% of the runtime for that test function, though it could be higher if we generate large arrays and apply a fast operation.

An important point is that we do not suggest using Hypothesis for everything! Property-based tests tend to be more general but less numerous than traditional unit tests, for a given level of testing. You can easily skip them from certain CI jobs (pytest -m "not hypothesis") if that becomes an issue, or use a setting profile that runs fewer examples in CI.

Performance is rarely a problem in practice though, so I would ultimately suggest waiting to measure before trying to optimize!
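
As a concrete illustration of the settings-profile idea mentioned above (the profile names and example counts here are made up):

import os

from hypothesis import settings

# Sketch for a conftest.py: register profiles with different example counts
# and select one via an environment variable; when unset, fall back to the
# built-in "default" profile (100 examples per test).
settings.register_profile("ci", max_examples=25)
settings.register_profile("nightly", max_examples=1000)
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "default"))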

Reproducing test failures from CI, users, etc.

We have docs about this 😄. To summarize, the test may behave differently on different Python versions or operating systems (but that is true of non-Hypothesis tests too); short of that, Hypothesis will print the input example and we can use the print_blob=True setting to be sure of reproducing the failure. Changes between Hypothesis versions are uncommon but not unknown, which is why the print_blob setting dumps the exact Hypothesis version too.

I would discourage using the @seed decorator or derandomize=True setting, as the net effect is to reduce the number of bugs found. We (Hypothesis developers) spend a lot of time ensuring that bugs can be reproduced once found, and I'd personally be happy to help if the Numpy project is having any trouble with Hypothesis.
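
A hedged sketch of what that looks like in practice (the property being tested is just a stand-in):

import numpy as np
from numpy.testing import assert_array_equal
from hypothesis import given, settings, strategies as st

# Sketch: with print_blob=True, a failing run prints a @reproduce_failure(...)
# line recording the exact Hypothesis version and the failing input blob.
@settings(print_blob=True)
@given(st.lists(st.floats(allow_nan=False, allow_infinity=False)))
def test_sort_is_idempotent(values):
    arr = np.array(values, dtype=np.float64)
    assert_array_equal(np.sort(np.sort(arr)), np.sort(arr))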

What parts of Numpy would Hypothesis be especially valuable for?

Generically, parts of your code which are particularly complex, where you have high-level invariants you can check, or both. I would suggest paying particular attention to IO and serialisation round-trips, the dtype system, and broadcasting - the previous issues mentioned above also provide some examples. @rsokl might also have some ideas!

Personally, I like to start simply by trying to call some tricky code using Hypothesis to generate inputs - without asserting anything at all about the result. This often uncovers a more detailed definition of what inputs should be considered valid than has been documented!

@rgommers (Member) commented Sep 9, 2019

Thanks @Zac-HD!

It seems to me that the best time to run hypothesis tests is (1) when writing new functionality, and (2) when touching existing code (in CI). Perhaps less so by end users when they run np.test() - it feels like we're then going to end up with lots of corner case bug reports (I could be wrong there).

Perhaps there's also a nice way to add generic tests for, for example, nan, inf and empty array input - that's the kind of thing that often gets forgotten about.

@Zac-HD (Contributor) commented Sep 10, 2019

It seems to me that the best time to run hypothesis tests is (1) when writing new functionality, and (2) when touching existing code (in CI). Perhaps less so by end users when they run np.test() - it feels like we're then going to end up with lots of corner case bug reports (I could be wrong there).

Empirically it's rare for additional runs in CI or on user machines to find new bugs after the tests have passed on a pull request and been merged, but not unknown - Hypothesis self-tests pretty heavily and we see maybe 1-3 such cases per year.

On the other hand this is basically a perfect case for settings profiles: we can configure a profile for np.test() with derandomize=True. That would allow users to run the same tests even on very exotic platforms, while still in deterministic mode.
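
A minimal sketch of such a profile, assuming np.test() (or a conftest.py it uses) loads it; the profile name is illustrative:

from hypothesis import settings

# Sketch: a deterministic profile that an np.test() entry point could load so
# that end-user runs always see the same generated examples.
settings.register_profile("np-test", derandomize=True)
settings.load_profile("np-test")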

Perhaps there's also a nice way to add generic tests for, for example, nan, inf and empty array input - that's the kind of thing that often gets forgotten about.

Hypothesis may generate non-finite numbers wherever you ask for floats without specifying that they're disallowed. For non-Hypothesis tests, I'd just use a checklist - they're both useful reminders for maintainers and a great way of documenting and sharing project best-practice with new contributors.

I'm not quite sure what you mean by empty arrays - those with a dimension of size zero? Uninitialised values?

@tylerjereddy (Contributor) commented:

I'm +1 on using hypothesis for testing; I've already used it to evaluate NumPy & SciPy PRs in the past. Indeed, I maintain small property-based testing repos for NumPy and SciPy for this purpose.

On the NumPy side, this revealed some interesting details when reviewing Eric's clip ufunc PR, details that would have been difficult for a human to probe in a sane amount of time.

@rgommers (Member) commented:

On the other hand this is basically a perfect case for settings profiles: we can configure a profile for np.test() with derandomize=True. That would allow users to run the same tests even on very exotic platforms, while still in deterministic mode.

That makes perfect sense to me.

I'm not quite sure what you mean by empty arrays - those with a dimension of size zero? Uninitialised values?

One of the dimensions has size zero, so x.size == 0. x.shape still matters, and can be n-D (e.g. (5, 0, 3) or (0, 4)).

Empty arrays are quite easy to get in regular code, and often lead to unexpected behavior.
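
Concretely:

>>> import numpy as np
>>> x = np.ones((5, 0, 3))
>>> x.size
0
>>> x.shape
(5, 0, 3)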

For non-Hypothesis tests, I'd just use a checklist - they're both useful reminders for maintainers and a great way of documenting and sharing project best-practice with new contributors.

True. On the other hand, few people seem to read our existing docs on how to test/document in detail, so built-in testing is nice :)

@rsokl (Contributor) commented Sep 11, 2019

Perhaps there's also a nice way to add generic tests for, for example, nan, inf and empty array input - that's the kind of thing that often gets forgotten about.

@rgommers, the following hypothesis strategy:

import hypothesis.extra.numpy as hnp
hnp.arrays(shape=hnp.array_shapes(min_side=0), dtype=hnp.floating_dtypes())

will take care of that straight away. E.g.

>>> strat = hnp.arrays(
...     shape=hnp.array_shapes(min_side=0, min_dims=0), dtype=hnp.floating_dtypes()
... )
>>> [strat.example() for i in range(5)]
[array(nan, dtype=float16),
 array([], shape=(3, 0), dtype=float16),
 array([[-inf, -inf, -inf, -inf, -inf],
        [-inf, -inf, -inf, -inf, -inf],
        [-inf, -inf, -inf, -inf, -inf],
        [-inf, -inf, -inf, -inf, -inf]], dtype=float16),
 array(0., dtype=float16),
 array([           inf, -2.0721057e+16,  9.9999997e-06,           -inf,
                  -inf], dtype=float32)]

You could also construct strategies that only explore the edge cases that you are concerned with (although that is a little bit antithetical to property-based testing), and create a simple decorator that runs functions through these cases.
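
A hedged sketch of how such a strategy might be wired into an actual test; the property checked here (NaNs sorting to the end of a 1-D array) is only an illustration:

import numpy as np
import hypothesis.extra.numpy as hnp
from hypothesis import given

# Sketch: 1-D float arrays of any length (including empty) and any float
# width, with nan/inf allowed by default.
@given(hnp.arrays(dtype=hnp.floating_dtypes(),
                  shape=hnp.array_shapes(min_side=0, min_dims=1, max_dims=1)))
def test_sort_puts_nans_last(arr):
    out = np.sort(arr)
    n_nan = int(np.isnan(arr).sum())
    assert np.isnan(out[len(out) - n_nan:]).all()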

test_requirements.txt (outdated review thread, resolved)
@mattip (Member) commented Sep 16, 2019

Could we find a few more test cases for hypothesis: concatenation with various shapes, unicode round-tripping, or pickling/unpickling?

@Zac-HD (Contributor) commented Sep 17, 2019

Could we find a few more test cases for hypothesis: concatenation with various shapes, unicode round-tripping, or pickling/unpickling?

@mattip I'd prefer to make these a follow-up PR just so we don't push everything on to new contributors, but there is some prior art:

I could see a case for copying over the first, but IMO the rest of them should be in separate PRs, not least because they might find some bugs - and that always slows things down!

If it helps, I'd be happy to write a PR full of IO tests after this one is merged 😄
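
For what it's worth, a hedged sketch of what one such IO test might look like (dtype coverage is deliberately limited to integer and floating dtypes here):

import pickle

import hypothesis.extra.numpy as hnp
from hypothesis import given
from numpy.testing import assert_array_equal

# Sketch: pickling and unpickling should preserve dtype, shape, and contents
# for arbitrary shapes, including empty arrays.
@given(hnp.arrays(dtype=hnp.integer_dtypes() | hnp.floating_dtypes(),
                  shape=hnp.array_shapes(min_side=0)))
def test_pickle_roundtrip(arr):
    restored = pickle.loads(pickle.dumps(arr))
    assert restored.dtype == arr.dtype
    assert restored.shape == arr.shape
    assert_array_equal(restored, arr)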

@Zac-HD (Contributor) commented Dec 9, 2019

@mattip / @rgommers - is there anything I could do to help get this merged?

@mattip (Member) commented Dec 9, 2019

My hesitation is that the test showcased here is not a very good example of how I would like to see hypothesis used.

  • There are no comments explaining, for people who are not familiar with hypothesis, what is going on
  • Character-based ndarrays are not really our bread-and-butter. Are there examples where hypothesis will help with numerical code, and not just with __repr__?
  • Even better, can we use any of the issues/PRs/comments in the comment above to replace current tests, with comments to why hypothesis is better?

This PR should also extend the documentation somewhere with best practices around hypothesis. Has another project already done such documentation and we can reference it?

The mailing list discussion garnered one response, which was "let's discuss it on the PR", so I guess the general community has no real opinion.

@kitchoi (Contributor, Author) commented Dec 9, 2019

This PR should also extend the documentation somewhere with best practices around hypothesis. Has another project already done such documentation and we can reference it?

Agreed, if there exists a battle-tested set of best practices, it would be good to reference it, but I don't know where it is. On top of that, maybe there are sets of best practices that only apply to numpy? For example, assuming we want reproducible builds (yes, Zac discourages this), I am not sure whether we should use a seed per test, or use settings for the entire build.

@Zac-HD (Contributor) commented Dec 10, 2019

Character-based ndarrays are not really our bread-and-butter. Are there examples where hypothesis will help with numerical code, and not just with __repr__?

#10930 and #13089 are serious Numpy bugs which were detected using Hypothesis in downstream projects. I've spoken to @rsokl and he's happy to contribute his einsum test - it's a complicated one but covers a far wider variety of signatures and inputs than would be possible with unit tests.

Even better, can we use any of the issues/PRs/comments in the comment above to replace current tests, with comments to why hypothesis is better?

For example, this test would have failed for any array that did not contain a 1:

import numpy as np
from numpy.testing import assert_array_equal

p = np.ones((10, 2))
q = np.ones((1, 2))
assert_array_equal(np.einsum('ij,ij->j', p, q, optimize=True),
                   np.einsum('ij,ij->j', p, q, optimize=False))

Generating a variety of shapes, as well as element values, is even more powerful.
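
A hedged sketch of what that might look like (shape bounds and the use of assert_allclose are illustrative choices):

import numpy as np
import hypothesis.extra.numpy as hnp
from hypothesis import given, strategies as st
from numpy.testing import assert_allclose

finite = st.floats(min_value=-1e3, max_value=1e3)  # bounded, so no nan/inf

# Sketch: draw one shared 2-D shape so that 'ij,ij->j' is valid for both
# operands, then check that the optimized and unoptimized paths agree.
@given(data=st.data(),
       shape=hnp.array_shapes(min_dims=2, max_dims=2, min_side=0, max_side=5))
def test_einsum_optimize_paths_agree(data, shape):
    p = data.draw(hnp.arrays(np.float64, shape, elements=finite))
    q = data.draw(hnp.arrays(np.float64, shape, elements=finite))
    assert_allclose(np.einsum('ij,ij->j', p, q, optimize=True),
                    np.einsum('ij,ij->j', p, q, optimize=False))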

assuming we want reproducible builds (yes, Zac discourages this), I am not sure whether we should use a seed per test, or use settings for the entire build.

Use a settings profile with derandomize=True; this use-case is why the setting exists. I do want to clarify that Hypothesis is designed so that failures are always reproducible, even from randomly generated inputs.

This PR should also extend the documentation somewhere with best practices around hypothesis. Has another project already done such documentation and we can reference it?

I think this reduces to "when writing tests, use Hypothesis in the obvious way". If that doesn't work, we'd consider that a bug in Hypothesis... the only tweak I can think of is tuning the project-level settings profiles to run more or fewer examples for performance reasons.

@mattip (Member) commented Dec 10, 2019

#10930 and #13089 are serious Numpy bugs which were detected using Hypothesis

Please add the hypothesis tests that detected those bugs into this PR, or open a new one with the tests, and mention the relevant issues. Can the tests affected by those issues be converted to hypothesis tests? That would be more convincing.

when writing tests, use Hypothesis in the obvious way

That would be a great start to this part of the documentation, in a sentence about what kind of tests lend themselves to using hypothesis, with a link to what "obvious way" means for hypothesis.

@Zac-HD (Contributor) commented Dec 10, 2019

@mattip - I'm aiming to open a new PR with those tests and some docs later this week 😄

And re-reading my comment, it's not helpful at all 😓... I meant something like "I would use Hypothesis for almost all (>90%) tests, and in almost all cases do so by describing the input data with builtin strategies with essentially the same test body I would usually write".

People usually pick it up pretty quickly - the user experience is almost exactly that of @pytest.mark.parametrize where you haven't passed the expected output, but with a much more concise description of the inputs and wider variety of test cases.
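
A hedged side-by-side of that comparison (the toy property is purely illustrative):

import pytest
from hypothesis import given, strategies as st

# With parametrize, the inputs are enumerated by hand:
@pytest.mark.parametrize("values", [[], [0], [1, 2, 3]])
def test_negate_twice_parametrized(values):
    assert [-(-v) for v in values] == values

# With Hypothesis, the same test body runs over generated inputs instead:
@given(st.lists(st.integers()))
def test_negate_twice_hypothesis(values):
    assert [-(-v) for v in values] == values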
