C++: Assume modelled functions always override buffers by default #15633

MathiasVP · Feb 16, 2024

This PR supersedes #15528 and, instead of adding a new abstract class that one needs to implement to tell the dataflow library that a write by a modelled function always overrides the argument, this is now assumed by default, and one can override isPartialWrite to say that the write shouldn't block existing flow.

This PR is fairly large since I hit a couple of issues along the road:

Since this is now an "opt out" instead of an "opt in" feature, we have to fix up all our existing models since we'll lose a lot of flow otherwise (and indeed I saw that happening in earlier stages of this work). 7e9bf2a fixes all our existing models.
499ab08 fixes an unexpected problem. That commit has a fairly long message that explains the problem, and why that commit fixes the problem.

Sadly, this removes a couple of good results on DCA since we're now more sensitive to the infamous fieldFlowBranchLimit constant inside the dataflow library. Getting those results back means bumping the fieldFlowBranchLimit value for existing queries, although I don't think that's critical to do now.

Finally, this also seems to cause a 17% analysis time regression on vim/vim. This is because we're making the dataflow library do less function summarization and more flow into/out of functions.

I actually really like this direction, because this means that we may be able to completely get rid of the pruning SSA stage of dataflow (and subsequently delete a couple of hundred lines of code 🎉) because I don't think it really buys us anything anymore since we've been optimizing the useful SSA stage quite a lot since it was introduced when we initially implemented use-use flow. I'll create an issue to explain this more thoroughly.

Sadly, this commit also introduces this FP in our tests:
```
void write_to_param(int* x) {
  int s = source();
  x = &s;
}

void test_write_to_param() {
  int x = 0;
  write_to_param(&x);
  sink(x); // spurious flow now
}
```
which wasn't there before. This happens because the write x = &s; makes dataflow synthesize a write such as*x = *&s (which is equivalent to *x = s). But it really should't synthesize this write in this case. It's not trivial to remove this FP, so I think we should just leave this in for now, and I'll write an issue about how to fix this as a follow-up. I haven't seen any real-world FPs because of it yet.

Most of the query changes in be54a41 are simply changes to the subpath predicate (caused by 499ab08). There's one FP disappearing in there as well 🎉

Once we're happy with this PR I'll open an internal PR to accept the test changes in the internal repo.

```cpp char* source(); void sink(const char*); int sprintf(char *, const char *, ...); void call_sprintf(char* path, char* data) { sprintf(path, "%s", "abc"); // (1) sprintf(path, "%s", data); // (2) } void foo() { char path[10]; call_sprintf(path, source()); // (3) sink(path); } ``` we identify that the `*path [post update]` node at `// (2)` is a `ReturnNodeExt` and since `*data` flows to that node flow will be carried out to `*path [post update]` at // (3) and thus reach `sink(path)`. The reason `*path [post update]` at `// 2` is recognized as a `ReturnNodeExt` is because it satisfies the following condition (which is identified by the shared dataflow library): There is flow from the parameter node `*path` to the pre-update node of the post-update node `*path [post update]` at `// (2)`. However, when we start recognizing that the call to `sprintf(path, ...)` at `// (1)` overrides the value of `*path` and no longer provide use-use flow out of `*path` the `*path [post update]` node at `// (2)` is no longer recognized as a `ReturnNodeExt` (because it doesn't satisfy the above criteria). Thus, we need to identify the flow above without relying on the dataflow library's summary mechanism. That is, instead of relying on the dataflow library's mechanism to summarize the `*data -> *path` flow for `call_sprintf` we need to: - Ensure that the write to `*path` at `// (2)` is recognized as the "final" write to the parameter, and - Ensure that there's flow out of that parameter and back to `*path [post update]` at `// (3)`. Luckiky, we do all of this already to support flow out of writes to parameters that don't have post-update nodes. For example, in something like: ```cpp void set(int* x, int y) { *x = y; } void test() { int x; set(&x, source()); sink(x); } ``` So in order to make the original example work, all we need to do is to remove the restrictions on this mechanism so that the same mechanism that makes the above example work also makes the original example work!

cpp/ql/lib/semmle/code/cpp/models/interfaces/DataFlow.qll

cpp/ql/lib/semmle/code/cpp/models/interfaces/PartialFlow.qll

geoffw0

A few questions / suggestions, but generally LGTM.

The effect on tests is quite positive IMO. Though CI points to a few tests currently failing.

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternals.qll

cpp/ql/lib/semmle/code/cpp/models/interfaces/PartialFlow.qll

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternals.qll

geoffw0 · Feb 20, 2024

cpp/ql/lib/semmle/code/cpp/models/implementations/StdContainer.qll

@@ -118,6 +118,8 @@ private class StdSequenceContainerData extends TaintFunction {
    input.isReturnValueDeref() and
    output.isQualifierObject()
  }
+
+  override predicate isPartialWrite(FunctionOutput output) { output.isQualifierObject() }


I'm not entirely sure we don't want this to be the default behaviour for qualifiers. I get that, in principle, a qualifier isn't all that different to an argument (at least an "inout" style argument), but typical usage is different and I would expect a partial write to be the norm???

This might be something to think about, but it doesn't need to hold up the PR necessarily.

Yeah, a partial write is probably the norm. There are obvious outliers such as std::string::clear, but I would also expect it to be the norm.

I'm not sure how to structure this best, though. Since we're now "blocking by default" and have thus added a predicate to "disable blocking" there's no now predicate to override for "enabling blocking". If we should make writes to the qualifier non-blocking by default we should also have a "enabling blocking" predicate. I supposed that could simply be a matter of adding an overridable predicate (with a sensible default) to PartialFlowFunction that says that writes to qualifiers are always non-blocking.

But I'd like to leave that as a follow-up like you suggested.

cpp/ql/test/query-tests/Security/CWE/CWE-311/semmle/tests/test2.cpp

geoffw0 · Feb 20, 2024

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternals.qll

 /** Holds if there is def-use or use-use flow from `nodeFrom` to `nodeTo`. */
 predicate ssaFlow(Node nodeFrom, Node nodeTo) {
  exists(Node nFrom, boolean uncertain, SsaDefOrUse defOrUse |
-    ssaFlowImpl(defOrUse, nFrom, nodeTo, uncertain) and nodeFrom != nodeTo
+    ssaFlowImpl(defOrUse, nFrom, nodeTo, uncertain) and
+    not modeledFlowBarrier(nFrom) and


What happens if someone naively models a flow from an argument to itself, without using the partial flow model? Will the model inexplicably fail?

Nothing will inexplicably fail. In fact, nothing will fail at all. Let's assume you have this snippet:

void test() { char* p = taint(); // (1) foo(p); // (2) sink(p); // (3) }

Let's consider three scenarios:

You have no model for foo

You have a dataflow model for foo that looks like:

override predicate hasDataFlow(FunctionInput input, FunctionOutput output) { input.isParameterDeref(0) and output.isParameterDeref(0) }

You have a dataflow model for foo that looks like the above, but also has:

override predicate isPartialWrite(FunctionOutput output) { output.isParameterDeref(0) }

The behavior for each case will be:

You have no model for foo

There will be flow from taint() to p (in // (2)), and from p (in // (2)) to p (in // (3)). See here:

You have a dataflow model for foo, but no isPartialWrite override

There will be flow from taint() to p (in // (2)), and from p (in // (2)) to the output argument of foo, and from the output argument of foo to p (in // (3). See here:

That is, now that the model for foo implicitly blocks flow it means that flow will have to go through foo in order to reach sink.

You have a dataflow model for foo, and you also override isPartialWrite.

Since the flow is now specified as a partial write it means you get both of the flows from before:

I should add: The paths above have some duplication (i.e., there are two *call to taint nodes along the path) since I disable any kind of path compression in my example.

MathiasVP · Feb 20, 2024

The effect on tests is quite positive IMO. Though CI points to a few tests currently failing.

Yeah, those are internal tests I need to update. There's one new result duplication (related to something ending up in both *access to array and access to array. Not completely sure what's up with that. But I'm not too worried about this.

The other test failures are changes to the path explanations, and they simply need to be accepted in the internal repo.

cpp/ql/lib/semmle/code/cpp/models/interfaces/PartialFlow.qll

geoffw0

Thanks for addressing / answering my comments. I'm happy for this to be merged now...

Once we're happy with this PR I'll open an internal PR to accept the test changes in the internal repo.

... well, once this is done.

MathiasVP · Feb 22, 2024

Thanks for addressing / answering my comments. I'm happy for this to be merged now...

Once we're happy with this PR I'll open an internal PR to accept the test changes in the internal repo.

... well, once this is done.

Will get on that now. Thanks!

MathiasVP added 5 commits February 16, 2024 12:56

C++: Add a testcase.

625c47f

C++: Block flow by default.

24a63ae

C++: Add a model for 'partial updating' and extend models appropriately.

7e9bf2a

C++: Accept test changes.

9b2019d

MathiasVP requested a review from a team as a code owner February 16, 2024 12:28

github-actions bot added the C++ label Feb 16, 2024

github-advanced-security bot found potential problems Feb 16, 2024

View reviewed changes

cpp/ql/lib/semmle/code/cpp/models/interfaces/DataFlow.qll Show resolved Hide resolved

cpp/ql/lib/semmle/code/cpp/models/interfaces/PartialFlow.qll Fixed Show resolved Hide resolved

C++: Add change note.

497592a

github-actions bot added the documentation label Feb 16, 2024

C++: Add file-level QLDoc.

57c1bf5

github-advanced-security bot found potential problems Feb 16, 2024

View reviewed changes

cpp/ql/lib/semmle/code/cpp/models/interfaces/PartialFlow.qll Fixed Show resolved Hide resolved

MathiasVP added 2 commits February 16, 2024 13:51

C++: Make Code Scanning happy.

b407c86

C++: Accept query test changes.

be54a41

MathiasVP force-pushed the model-experiments branch from 64fe044 to be54a41 Compare February 16, 2024 14:01

geoffw0 reviewed Feb 20, 2024

View reviewed changes

MathiasVP added 2 commits February 22, 2024 13:25

C++: Fix QLoc on 'isPartialWrite'.

aca3970

C++: Fix QLoc on 'PartialFlowFunction'.

671904d

github-advanced-security bot found potential problems Feb 22, 2024

View reviewed changes

cpp/ql/lib/semmle/code/cpp/models/interfaces/PartialFlow.qll Show resolved Hide resolved

C++: Update QLDoc on 'modeledFlowBarrier'.

350d5bf

geoffw0 approved these changes Feb 22, 2024

View reviewed changes

Merge branch 'main' into model-experiments

0bf29f0

MathiasVP added the depends on internal PR This PR should only be merged in sync with an internal Semmle PR label Feb 22, 2024

Merge branch 'main' into model-experiments

c7ee5b2

MathiasVP merged commit 63a5b49 into github:main Feb 22, 2024

MathiasVP mentioned this pull request Mar 5, 2024

C++: Remove the pruning stage from dataflow SSA #15816

Merged

MathiasVP mentioned this pull request Feb 4, 2025

C++: Iterator derefs are partial writes #18674

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

C++: Assume modelled functions always override buffers by default #15633

C++: Assume modelled functions always override buffers by default #15633

Uh oh!

MathiasVP commented Feb 16, 2024 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

geoffw0 left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

geoffw0 Feb 20, 2024

Uh oh!

geoffw0 Feb 22, 2024

Uh oh!

MathiasVP Feb 22, 2024

Uh oh!

Uh oh!

geoffw0 Feb 20, 2024

Uh oh!

MathiasVP Feb 22, 2024 •

edited

Loading

Uh oh!

MathiasVP Feb 22, 2024 •

edited

Loading

Uh oh!

MathiasVP commented Feb 20, 2024

Uh oh!

Uh oh!

geoffw0 left a comment

Uh oh!

MathiasVP commented Feb 22, 2024

Uh oh!

Uh oh!

Search code, repositories, users, issues, pull requests...

C++: Assume modelled functions always override buffers by default #15633

C++: Assume modelled functions always override buffers by default #15633

Uh oh!

Conversation

MathiasVP commented Feb 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

geoffw0 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

geoffw0 Feb 20, 2024

Choose a reason for hiding this comment

Uh oh!

geoffw0 Feb 22, 2024

Choose a reason for hiding this comment

Uh oh!

MathiasVP Feb 22, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

geoffw0 Feb 20, 2024

Choose a reason for hiding this comment

Uh oh!

MathiasVP Feb 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

You have no model for foo

You have a dataflow model for foo, but no isPartialWrite override

You have a dataflow model for foo, and you also override isPartialWrite.

Uh oh!

MathiasVP Feb 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MathiasVP commented Feb 20, 2024

Uh oh!

Uh oh!

geoffw0 left a comment

Choose a reason for hiding this comment

Uh oh!

MathiasVP commented Feb 22, 2024

Uh oh!

Uh oh!

MathiasVP commented Feb 16, 2024 •

edited

Loading

MathiasVP Feb 22, 2024 •

edited

Loading

You have no model for `foo`

You have a dataflow model for `foo`, but no `isPartialWrite` override

You have a dataflow model for `foo`, and you also override `isPartialWrite`.

MathiasVP Feb 22, 2024 •

edited

Loading