C++: Assume modelled functions always override buffers by default #15633
```cpp
char* source();
void sink(const char*);
int sprintf(char *, const char *, ...);

void call_sprintf(char* path, char* data) {
  sprintf(path, "%s", "abc"); // (1)
  sprintf(path, "%s", data);  // (2)
}

void foo() {
  char path[10];
  call_sprintf(path, source()); // (3)
  sink(path);
}
```

we identify that the `*path [post update]` node at `// (2)` is a `ReturnNodeExt`, and since `*data` flows to that node, flow is carried out to `*path [post update]` at `// (3)` and thus reaches `sink(path)`.

The reason `*path [post update]` at `// (2)` is recognized as a `ReturnNodeExt` is that it satisfies the following condition (which is identified by the shared dataflow library): there is flow from the parameter node `*path` to the pre-update node of the post-update node `*path [post update]` at `// (2)`.

However, once we start recognizing that the call to `sprintf(path, ...)` at `// (1)` overrides the value of `*path` and no longer provides use-use flow out of `*path`, the `*path [post update]` node at `// (2)` is no longer recognized as a `ReturnNodeExt` (because it doesn't satisfy the above criterion). Thus, we need to identify the flow above without relying on the dataflow library's summary mechanism. That is, instead of relying on the dataflow library's mechanism to summarize the `*data -> *path` flow for `call_sprintf`, we need to:

- Ensure that the write to `*path` at `// (2)` is recognized as the "final" write to the parameter, and
- Ensure that there's flow out of that parameter and back to `*path [post update]` at `// (3)`.

Luckily, we already do all of this to support flow out of writes to parameters that don't have post-update nodes. For example, in something like:

```cpp
void set(int* x, int y) {
  *x = y;
}

void test() {
  int x;
  set(&x, source());
  sink(x);
}
```

So in order to make the original example work, all we need to do is remove the restrictions on this mechanism, so that the same mechanism that makes the above example work also makes the original example work!
64fe044 to be54a41
A few questions / suggestions, but generally LGTM.
The effect on tests is quite positive IMO. Though CI points to a few tests currently failing.
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternals.qll
```diff
@@ -118,6 +118,8 @@ private class StdSequenceContainerData extends TaintFunction {
     input.isReturnValueDeref() and
     output.isQualifierObject()
   }
+
+  override predicate isPartialWrite(FunctionOutput output) { output.isQualifierObject() }
```
I'm not entirely sure we don't want this to be the default behaviour for qualifiers. I get that, in principle, a qualifier isn't all that different from an argument (at least an "inout" style argument), but typical usage is different, and I would expect a partial write to be the norm?
This might be something to think about, but it doesn't need to hold up the PR necessarily.
Yeah, a partial write is probably the norm. There are obvious outliers such as `std::string::clear`, but overall I would expect partial writes to be the norm.

I'm not sure how to structure this best, though. Since we're now "blocking by default" and have thus added a predicate to "disable blocking", there's now no predicate to override for "enabling blocking". If we make writes to the qualifier non-blocking by default, we should also have an "enable blocking" predicate. I suppose that could simply be a matter of adding an overridable predicate (with a sensible default) to `PartialFlowFunction` that says that writes to qualifiers are always non-blocking.

But I'd like to leave that as a follow-up, like you suggested.
```diff
 /** Holds if there is def-use or use-use flow from `nodeFrom` to `nodeTo`. */
 predicate ssaFlow(Node nodeFrom, Node nodeTo) {
   exists(Node nFrom, boolean uncertain, SsaDefOrUse defOrUse |
-    ssaFlowImpl(defOrUse, nFrom, nodeTo, uncertain) and nodeFrom != nodeTo
+    ssaFlowImpl(defOrUse, nFrom, nodeTo, uncertain) and
+    not modeledFlowBarrier(nFrom) and
```
What happens if someone naively models a flow from an argument to itself, without using the partial flow model? Will the model inexplicably fail?
Nothing will inexplicably fail. In fact, nothing will fail at all. Let's assume you have this snippet:

```cpp
void test() {
  char* p = taint(); // (1)
  foo(p);            // (2)
  sink(p);           // (3)
}
```
Let's consider three scenarios:

- You have no model for `foo`.
- You have a dataflow model for `foo` that looks like:

  ```ql
  override predicate hasDataFlow(FunctionInput input, FunctionOutput output) {
    input.isParameterDeref(0) and
    output.isParameterDeref(0)
  }
  ```

- You have a dataflow model for `foo` that looks like the above, but also has:

  ```ql
  override predicate isPartialWrite(FunctionOutput output) { output.isParameterDeref(0) }
  ```
The behavior for each case will be:

**You have no model for `foo`**

There will be flow from `taint()` to `p` (in `// (2)`), and from `p` (in `// (2)`) to `p` (in `// (3)`).

**You have a dataflow model for `foo`, but no `isPartialWrite` override**

There will be flow from `taint()` to `p` (in `// (2)`), from `p` (in `// (2)`) to the output argument of `foo`, and from the output argument of `foo` to `p` (in `// (3)`).

That is, now that the model for `foo` implicitly blocks flow, flow has to go through `foo` in order to reach `sink`.

**You have a dataflow model for `foo`, and you also override `isPartialWrite`**

Since the flow is now specified as a partial write, you get both of the flows from before.
I should add: the paths above have some duplication (i.e., there are two `*call to taint` nodes along the path), since I disabled any kind of path compression in my example.
Yeah, those are internal tests I need to update. There's one new result duplication (related to something ending up in both …). The other test failures are changes to the path explanations, and they simply need to be accepted in the internal repo.
Thanks for addressing / answering my comments. I'm happy for this to be merged now...
Once we're happy with this PR I'll open an internal PR to accept the test changes in the internal repo.
... well, once this is done.
Will get on that now. Thanks!
This PR supersedes #15528. Instead of adding a new abstract class that one needs to implement to tell the dataflow library that a write by a modelled function always overrides the argument, this is now assumed by default, and one can override `isPartialWrite` to say that the write shouldn't block existing flow.

This PR is fairly large since I hit a couple of issues along the road:

- Since this is now an "opt out" instead of an "opt in" feature, we have to fix up all our existing models, since we'll lose a lot of flow otherwise (and indeed I saw that happening in earlier stages of this work). 7e9bf2a fixes all our existing models.
- 499ab08 fixes an unexpected problem. That commit has a fairly long message that explains the problem, and why that commit fixes it.
- Sadly, this removes a couple of good results on DCA, since we're now more sensitive to the infamous `fieldFlowBranchLimit` constant inside the dataflow library. Getting those results back means bumping the `fieldFlowBranchLimit` value for existing queries, although I don't think that's critical to do now.
- Finally, this also seems to cause a 17% analysis time regression on vim/vim. This is because we're making the dataflow library do less function summarization and more flow into/out of functions.

I actually really like this direction, because it means that we may be able to completely get rid of the pruning SSA stage of dataflow (and subsequently delete a couple of hundred lines of code 🎉). I don't think it really buys us anything anymore, since we've been optimizing the useful SSA stage quite a lot since it was introduced when we initially implemented use-use flow. I'll create an issue to explain this more thoroughly.
Sadly, this commit also introduces an FP in our tests which wasn't there before. This happens because the write `x = &s;` makes dataflow synthesize a write such as `*x = *&s` (which is equivalent to `*x = s`). But it really shouldn't synthesize this write in this case. It's not trivial to remove this FP, so I think we should just leave it in for now, and I'll write an issue about how to fix this as a follow-up. I haven't seen any real-world FPs because of it yet.

Most of the query changes in be54a41 are simply changes to the `subpath` predicate (caused by 499ab08). There's one FP disappearing in there as well 🎉

Once we're happy with this PR I'll open an internal PR to accept the test changes in the internal repo.