Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

C++: Assume modelled functions always override buffers by default #15633

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Feb 22, 2024

Conversation

MathiasVP
Copy link
Contributor

@MathiasVP MathiasVP commented Feb 16, 2024

This PR supersedes #15528 and, instead of adding a new abstract class that one needs to implement to tell the dataflow library that a write by a modelled function always overrides the argument, this is now assumed by default, and one can override isPartialWrite to say that the write shouldn't block existing flow.

This PR is fairly large since I hit a couple of issues along the road:

  • Since this is now an "opt out" instead of an "opt in" feature, we have to fix up all our existing models since we'll lose a lot of flow otherwise (and indeed I saw that happening in earlier stages of this work). 7e9bf2a fixes all our existing models.

  • 499ab08 fixes an unexpected problem. That commit has a fairly long message that explains the problem, and why that commit fixes the problem.

    Sadly, this removes a couple of good results on DCA since we're now more sensitive to the infamous fieldFlowBranchLimit constant inside the dataflow library. Getting those results back means bumping the fieldFlowBranchLimit value for existing queries, although I don't think that's critical to do now.

    Finally, this also seems to cause a 17% analysis time regression on vim/vim. This is because we're making the dataflow library do less function summarization and more flow into/out of functions.

    I actually really like this direction, because this means that we may be able to completely get rid of the pruning SSA stage of dataflow (and subsequently delete a couple of hundred lines of code 🎉) because I don't think it really buys us anything anymore since we've been optimizing the useful SSA stage quite a lot since it was introduced when we initially implemented use-use flow. I'll create an issue to explain this more thoroughly.

    Sadly, this commit also introduces this FP in our tests:

    void write_to_param(int* x) {
      int s = source();
      x = &s;
    }
    
    void test_write_to_param() {
      int x = 0;
      write_to_param(&x);
      sink(x); // spurious flow now
    }

    which wasn't there before. This happens because the write x = &s; makes dataflow synthesize a write such as*x = *&s (which is equivalent to *x = s). But it really should't synthesize this write in this case. It's not trivial to remove this FP, so I think we should just leave this in for now, and I'll write an issue about how to fix this as a follow-up. I haven't seen any real-world FPs because of it yet.

Most of the query changes in be54a41 are simply changes to the subpath predicate (caused by 499ab08). There's one FP disappearing in there as well 🎉

Once we're happy with this PR I'll open an internal PR to accept the test changes in the internal repo.

```cpp
char* source();
void sink(const char*);
int sprintf(char *, const char *, ...);

void call_sprintf(char* path, char* data) {
        sprintf(path, "%s", "abc"); // (1)
        sprintf(path, "%s", data); // (2)
}

void foo() {
        char path[10];
        call_sprintf(path, source()); // (3)
        sink(path);
}
```
we identify that the `*path [post update]` node at `// (2)` is a
`ReturnNodeExt` and since `*data` flows to that node flow will be carried
out to `*path [post update]` at // (3) and thus reach `sink(path)`.

The reason `*path [post update]` at `// 2` is recognized as a `ReturnNodeExt`
is because it satisfies the following condition (which is identified by the
shared dataflow library):
There is flow from the parameter node `*path` to the pre-update node of the
post-update node `*path [post update]` at `// (2)`.

However, when we start recognizing that the call to `sprintf(path, ...)` at
`// (1)` overrides the value of `*path` and no longer provide use-use flow out
of `*path` the `*path [post update]` node at `// (2)` is no longer recognized
as a `ReturnNodeExt` (because it doesn't satisfy the above criteria).

Thus, we need to identify the flow above without relying on the dataflow
library's summary mechanism. That is, instead of relying on the dataflow
library's mechanism to summarize the `*data -> *path` flow for `call_sprintf`
we need to:
- Ensure that the write to `*path` at `// (2)` is recognized as the "final"
write to the parameter, and
- Ensure that there's flow out of that parameter and back to
`*path [post update]` at `// (3)`.

Luckiky, we do all of this already to support flow out of writes to parameters
that don't have post-update nodes. For example, in something like:
```cpp
void set(int* x, int y) {
  *x = y;
}

void test() {
  int x;
  set(&x, source());
  sink(x);
}
```
So in order to make the original example work, all we need to do is to remove
the restrictions on this mechanism so that the same mechanism that makes the
above example work also makes the original example work!
@MathiasVP MathiasVP requested a review from a team as a code owner February 16, 2024 12:28
@github-actions github-actions bot added the C++ label Feb 16, 2024
Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions / suggestions, but generally LGTM.

The effect on tests is quite positive IMO. Though CI points to a few tests currently failing.

@@ -118,6 +118,8 @@ private class StdSequenceContainerData extends TaintFunction {
input.isReturnValueDeref() and
output.isQualifierObject()
}

override predicate isPartialWrite(FunctionOutput output) { output.isQualifierObject() }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure we don't want this to be the default behaviour for qualifiers. I get that, in principle, a qualifier isn't all that different to an argument (at least an "inout" style argument), but typical usage is different and I would expect a partial write to be the norm???

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be something to think about, but it doesn't need to hold up the PR necessarily.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a partial write is probably the norm. There are obvious outliers such as std::string::clear, but I would also expect it to be the norm.

I'm not sure how to structure this best, though. Since we're now "blocking by default" and have thus added a predicate to "disable blocking" there's no now predicate to override for "enabling blocking". If we should make writes to the qualifier non-blocking by default we should also have a "enabling blocking" predicate. I supposed that could simply be a matter of adding an overridable predicate (with a sensible default) to PartialFlowFunction that says that writes to qualifiers are always non-blocking.

But I'd like to leave that as a follow-up like you suggested.

/** Holds if there is def-use or use-use flow from `nodeFrom` to `nodeTo`. */
predicate ssaFlow(Node nodeFrom, Node nodeTo) {
exists(Node nFrom, boolean uncertain, SsaDefOrUse defOrUse |
ssaFlowImpl(defOrUse, nFrom, nodeTo, uncertain) and nodeFrom != nodeTo
ssaFlowImpl(defOrUse, nFrom, nodeTo, uncertain) and
not modeledFlowBarrier(nFrom) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if someone naively models a flow from an argument to itself, without using the partial flow model? Will the model inexplicably fail?

Copy link
Contributor Author

@MathiasVP MathiasVP Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing will inexplicably fail. In fact, nothing will fail at all. Let's assume you have this snippet:

void test() {
  char* p = taint(); // (1)
  foo(p); // (2)
  sink(p); // (3)
}

Let's consider three scenarios:

  1. You have no model for foo
  2. You have a dataflow model for foo that looks like:
override predicate hasDataFlow(FunctionInput input, FunctionOutput output) {
  input.isParameterDeref(0) and
  output.isParameterDeref(0)
}
  1. You have a dataflow model for foo that looks like the above, but also has:
override predicate isPartialWrite(FunctionOutput output) { output.isParameterDeref(0) }

The behavior for each case will be:

You have no model for foo

There will be flow from taint() to p (in // (2)), and from p (in // (2)) to p (in // (3)). See here:

Screenshot 2024-02-22 at 14 47 50

You have a dataflow model for foo, but no isPartialWrite override

There will be flow from taint() to p (in // (2)), and from p (in // (2)) to the output argument of foo, and from the output argument of foo to p (in // (3). See here:

Screenshot 2024-02-22 at 14 50 17

That is, now that the model for foo implicitly blocks flow it means that flow will have to go through foo in order to reach sink.

You have a dataflow model for foo, and you also override isPartialWrite.

Since the flow is now specified as a partial write it means you get both of the flows from before:

Screenshot 2024-02-22 at 14 51 48

Copy link
Contributor Author

@MathiasVP MathiasVP Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should add: The paths above have some duplication (i.e., there are two *call to taint nodes along the path) since I disable any kind of path compression in my example.

@MathiasVP
Copy link
Contributor Author

The effect on tests is quite positive IMO. Though CI points to a few tests currently failing.

Yeah, those are internal tests I need to update. There's one new result duplication (related to something ending up in both *access to array and access to array. Not completely sure what's up with that. But I'm not too worried about this.

The other test failures are changes to the path explanations, and they simply need to be accepted in the internal repo.

Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing / answering my comments. I'm happy for this to be merged now...

Once we're happy with this PR I'll open an internal PR to accept the test changes in the internal repo.

... well, once this is done.

@MathiasVP
Copy link
Contributor Author

Thanks for addressing / answering my comments. I'm happy for this to be merged now...

Once we're happy with this PR I'll open an internal PR to accept the test changes in the internal repo.

... well, once this is done.

Will get on that now. Thanks!

@MathiasVP MathiasVP added the depends on internal PR This PR should only be merged in sync with an internal Semmle PR label Feb 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ depends on internal PR This PR should only be merged in sync with an internal Semmle PR documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.