Ruby: overhaul API graphs #13496

asgerf · 2023-06-19T11:31:17Z

Makes a significant overhaul of API graphs in Ruby, both in how they are implemented and what they are capable of.

Currently this is only for Ruby, but some code has already been factored out with the intent to share it with JS and Python as well. (The PR is large enough as it is, and actually sharing the file involves moving files around, which makes the diff harder to read.)

The main benefits of this change are:

- Any DataFlow::LocalSourceNode can be converted to an API::Node for tracking where that value flows. This can be done by calling .track() on it.
- Similarly, any DataFlow::Node can be converted to an API::Node for tracking what flows into that value. This can be done by calling .backtrack() on it.
- Some consequences of the above are:
  - Parameters of user-defined methods can now be seen as sources, and likewise, their return values can be seen as sinks. Previously we had to drop down to data flow there, and as a result, some of our models relied on local data flow, leading to FNs.
  - It is much easier to gradually migrate models to use API graphs. Previously the inability to go from data flow to API graphs meant you could get stuck due to a dependency on another model, which you would then also have to migrate at the same time.
- X.getInstance() previously only found calls to X.new, but now includes self parameters that could be an instance of X. Concretely, these are all the self parameters of an instance method of any ancestor of any descendent of X.
- X.getMethod(m) now finds calls to self.m in singleton methods in subclasses of X.
- X.getMethod(m) now also matches super calls inside the relevant methods.
- This means API graphs can and should now be relied upon for finding calls to external methods in virtually all cases. Exceptions mainly have to do with special cases like Kernel and working around inaccurate self capture in blocks.
- API graphs should be easier to use in general, since you don't need to worry about whether an API node exists for a given data-flow node. The documentation also doesn't focus so much on the library boundary, just data flow and inheritance, illustrated with examples.
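
To make the new getInstance() and getMethod() coverage concrete, here is a small runnable Ruby sketch of the code shapes described above. The names `X`, `Sub`, `m`, `run` and `helper` are hypothetical, and the comments restate the claims in this description rather than verified analysis output.

```ruby
# Hypothetical library class; stands in for an external class X.
class X
  def self.m(arg)
    "X.m(#{arg})"
  end

  def run
    "X#run"
  end
end

class Sub < X
  # Singleton method in a subclass of X: per the description, the call to
  # `self.m` here is now found by X.getMethod("m").
  def self.helper
    self.m("from helper")
  end

  # Overriding singleton method: the `super` call is one plausible reading of
  # "super calls inside the relevant methods".
  def self.m(arg)
    super("via super: #{arg}")
  end

  # Instance method: `self` could be an instance of X here (it is a Sub),
  # so per the description it is now covered by X.getInstance().
  def run
    "Sub wraps " + super
  end
end

puts Sub.helper   # prints X.m(via super: from helper)
puts Sub.new.run  # prints Sub wraps X#run
```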

Relation to MaD

To provide some more context as to why this is needed, consider the criteria by which we determine if a given call targets a particular external method. We mainly look at four things:

1. The name of the method being invoked
2. The type of the receiver
3. The number of arguments given
4. And rarely, the values/types of the arguments

The MaD format for dynamic languages requires that these are in fact the only criteria by which calls are identified. However, Ruby has had a number of models that rely on other criteria, such as being syntactically inside a particular class. I believe these were heuristics put in place because (2) was too hard to check at the time the model was written. This has led to FNs, and has also made it difficult to judge which models could be represented in MaD, because those heuristic criteria could not be used in MaD.
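
As a rough illustration of matching on those criteria alone, the following toy Ruby sketch reduces a call site to method name, receiver type and argument count. `CallSite`, `MethodModel`, the `Db::Connection` type and the `1..2` argument range are all made up for the example; this is not how MaD rows or API graphs are actually represented.

```ruby
# A call site reduced to the first three criteria (argument values omitted).
CallSite = Struct.new(:method_name, :receiver_type, :arg_count)

# Hypothetical model of an external method; `arg_counts` is a Range.
MethodModel = Struct.new(:receiver_type, :method_name, :arg_counts) do
  def matches?(call)
    call.method_name == method_name &&
      call.receiver_type == receiver_type &&
      arg_counts.cover?(call.arg_count)
  end
end

# Assumed model: Db::Connection#execute taking one or two arguments.
execute = MethodModel.new("Db::Connection", "execute", 1..2)

calls = [
  CallSite.new("execute", "Db::Connection", 1),
  CallSite.new("execute", "Logger", 1),         # wrong receiver type
  CallSite.new("execute", "Db::Connection", 0)  # wrong argument count
]

puts calls.map { |c| execute.matches?(c) }.inspect # prints [true, false, false]
```

Note that nothing in `matches?` asks where the call is written; the second criterion is the part that needs type information about the receiver.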

API graphs should be the solution to (2). So my hope with this improvement to API graphs is that we can start checking the type of the receiver using API graphs, and move away from the heuristic criteria that block the transition to MaD.

Epsilon edges

Under the hood, we introduce a notion of epsilon edges. An epsilon edge A -> B means anything looked up in A will implicitly be looked up in B as well. Both data flow and inheritance give rise to epsilon edges. The construction of the "epsilon graph" is in a module ApiGraphShared.qll that I intend to share with JS/Python, but which currently just lives in Ruby.

Previously an edge in the API graph incorporated interprocedural flow, but labelled edges are now usually entirely local, as the interprocedural flow is captured by preceding epsilon edges.

For example, given this program,

def func p
  p.bar
end
func Foo

Previously, there was an edge from Foo directly to p.bar in the method:

flowchart LR
  root
  Foo
  p.bar
  root -- "Member[Foo]" --> Foo
  Foo -- "Method[bar]" --> p.bar

Now, the step from Foo -> p is an epsilon edge, and the Method[bar] edge is local:

flowchart LR
  root
  Foo
  p
  p.bar
  root -- "Member[Foo]" --> Foo
  Foo -- epsilon --> p
  p -- "Method[bar]" --> p.bar

Epsilon edges are also used to incorporate inheritance. For example, getTopLevelMember("Foo").getMethod("baz") would identify Bar.baz in the example below:

class Bar < Foo
end
Bar.baz

A somewhat simplified version of the API graph would look like this:

flowchart LR
  root -- "Member[Foo]" --> FooExpr -- epsilon --> BarModule --epsilon --> BarExpr -- "Method[baz]" --> Bar.baz
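
The two example graphs above can be modelled with a toy Ruby class to show what "looked up in A is implicitly looked up in B" means in practice. This is only a sketch: `ToyApiGraph`, `epsilon_star`, `lookup` and `member_then_method` are hypothetical names, and the real implementation is a set of QL predicates, not a Ruby object.

```ruby
require "set"

# Toy API graph: labelled edges are local, epsilon edges carry flow/inheritance.
class ToyApiGraph
  def initialize
    @labelled = Hash.new { |h, k| h[k] = [] } # node => [[label, succ], ...]
    @epsilon = Hash.new { |h, k| h[k] = [] }  # node => [succ, ...]
  end

  def add_edge(pred, label, succ)
    @labelled[pred] << [label, succ]
  end

  def add_epsilon(pred, succ)
    @epsilon[pred] << succ
  end

  # All nodes reachable from `node` via zero or more epsilon edges.
  def epsilon_star(node)
    seen = Set[node]
    work = [node]
    until work.empty?
      @epsilon[work.pop].each { |succ| work << succ if seen.add?(succ) }
    end
    seen
  end

  # Look up `label` in `node` and, implicitly, in every epsilon-successor.
  def lookup(node, label)
    epsilon_star(node).flat_map do |n|
      @labelled[n].select { |lbl, _| lbl == label }.map { |_, succ| succ }
    end
  end
end

# Data-flow example: `func Foo` passes Foo into parameter p.
flow = ToyApiGraph.new
flow.add_edge("root", "Member[Foo]", "Foo")
flow.add_epsilon("Foo", "p")
flow.add_edge("p", "Method[bar]", "p.bar")

# Inheritance example: `class Bar < Foo; end; Bar.baz`.
inherit = ToyApiGraph.new
inherit.add_edge("root", "Member[Foo]", "FooExpr")
inherit.add_epsilon("FooExpr", "BarModule")
inherit.add_epsilon("BarModule", "BarExpr")
inherit.add_edge("BarExpr", "Method[baz]", "Bar.baz")

# Chain two lookups from the root, like getTopLevelMember(..).getMethod(..).
def member_then_method(graph, member, method)
  graph.lookup("root", "Member[#{member}]")
       .flat_map { |n| graph.lookup(n, "Method[#{method}]") }
end

p member_then_method(flow, "Foo", "bar")    # prints ["p.bar"]
p member_then_method(inherit, "Foo", "baz") # prints ["Bar.baz"]
```

In both graphs the labelled edge itself starts at a node other than Foo; it is the epsilon closure taken before each lookup that connects the two.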

Strict evaluation order

Almost every user-facing predicate now has pragma[inline_late] and bindingset[this], meaning chains of API graph calls are now join-ordered more reliably. This also means you must restrict the receiver before using API graphs. Avoid doing something like the following, as it will be forced to enumerate the epsilon-successors of every node in the graph, which is quite a lot:

API::Node barCall(API::Node base) {
  result = base.getMethod("bar") // Do not do this!
}

getASuccessor is deprecated

We no longer support using API graphs as a general labelled graph. getASuccessor is deprecated, getPath() is deprecated and the toString value is now a very direct translation of the internal representation of a node.

Previously we would construct an edge(Node pred, Label lbl, Node succ) relation, but then cache the specialized versions of this relation, making the original edge relation largely unnecessary.

An upside of this is better performance, and adding new kinds of labelled edges is vastly simplified as it's just a matter of:

1. Adding a new cached predicate in Impl
2. Adding a pragma[inline_late] predicate in API::Node, calling getAnEpsilonSuccessor() and passing that as argument to the cached edge relation in Impl

Evaluation

Evaluation shows:

635 new taint sinks
79 new taint sources
901 new tainted nodes
24 new alerts
Performance looks reasonable. There's a 10% regression on two projects, but those also gained some new results, and a 24% speed-up on one. Given the benefits I think we can accept this.

RasmusWL

I have only read the PR description, and the small change to the Python .qll file. Looks really good from my point of view 💪

Especially this bit is super nice!

Any DataFlow::LocalSourceNode can be converted to an API::Node for tracking where that value flows. This can be done by calling .track() on it.
Similarly any DataFlow::Node can be converted to an API::Node for tracking what flows into that value. This can be done by calling .backtrack() on it.

I'm a bit sad about not being able to do the following anymore. I've done it a few times in the past, when writing exploratory queries (such as, give me an API node that .cursor().execute() is called on, to find new SQL modeling) -- I don't remember off the top of my head adding this to any "production" QL code though

API::Node barCall(API::Node base) {
  result = base.getMethod("bar") // Do not do this!
}

python/ql/lib/semmle/python/dataflow/new/internal/TypeTracker.qll

alexrford · 2023-06-22T15:58:58Z

I've not reviewed the actual changes to API graphs, but the library modelling improvements look really great.

asgerf · 2023-06-26T13:40:07Z

ruby/ql/lib/codeql/ruby/ApiGraphs.qll

+      or
+      implicitCallEdge(pred, succ)
+      or
+      exists(DataFlow::HashLiteralNode splat | hashSplatEdge(splat, pred, succ))


At the time of writing, using an _ here triggers a compiler bug and crashes the compiler (it's been fixed on main).

erik-krogh

It's clever, and I think I get the basic idea now.
I haven't looked at everything or at the details, and I'm not familiar enough with the shared dataflow library or Ruby to do a good review of all the parts.

At first I was confused about epsilonStar, because you're encoding the "content" as part of the ApiNode (MkForwardNode / MkBackwardNode).
So surely the this.getAnEpsilonSuccessor() calls would also include all the forwards/backwards nodes that have content, and I thought that would be a problem.

But I can see that I didn't need to worry, because each of the edges in Impl ensures that the predecessor is a relevant "start" edge.
So getAnEpsilonSuccessor produces a lot of irrelevant nodes that are afterwards filtered out.

I can also see why you needed to encode the content as part of the ApiNode (to get fastTc to work).

Could you filter out all the forward/backwards nodes that are not end/start nodes from getAnEpsilonSuccessor() in a post-processing predicate to get a speedup?
Or would that ruin the efficient storage of the fastTc results?

ruby/ql/lib/codeql/ruby/ApiGraphs.qll

ruby/ql/lib/codeql/ruby/typetracking/ApiGraphShared.qll

hvitved

Truly amazing work! My comments are mostly trivial, the overall approach looks really solid, and excellent that it actually scales.

hvitved · 2023-06-23T11:33:43Z

ruby/ql/lib/codeql/ruby/typetracking/ApiGraphShared.qll

@@ -0,0 +1,329 @@
+/**


This file should probably be inside an internal folder?

So should the type-tracker files, so I just kept it here to avoid moving around too many files.

hvitved · 2023-06-23T11:34:57Z

ruby/ql/lib/codeql/ruby/typetracking/ApiGraphShared.qll

+    string toString();
+
+    /** Gets the location associated with this API node, if any. */
+    Location getLocation();


Location needs to be a parameter as well, but can be postponed until other languages actually need to use it.

Yeah it depends on exactly where this file is going to live in relation to type tracking.

Right now the idea is that the library just imports TypeTracker.qll and TypeTrackerSpecific.qll which provides a lot of language-specific stuff which, for the time being, should also work for Python

hvitved · 2023-06-23T11:37:27Z

ruby/ql/lib/codeql/ruby/typetracking/ApiGraphShared.qll

+    pragma[noopt]
+    cached
+    predicate epsilonEdge(ApiNode pred, ApiNode succ) {
+      // forward


hvitved · 2023-06-27T07:25:37Z

ruby/ql/lib/codeql/ruby/dataflow/internal/DataFlowPublic.qll

  }

+  cached
+  predicate methodHasSuperCall(MethodNode method, CallNode call) {


Is this really worth caching?

Right now there aren't any other uses than in API graphs, so it doesn't matter. But in principle I'd say yes, it's a good place to cache, because the predicate is very small, and if re-evaluated at the wrong time it can trigger re-evaluation of getEnclosingMethod which is large and not cached.

hvitved · 2023-06-27T07:27:24Z

ruby/ql/lib/codeql/ruby/dataflow/internal/DataFlowPublic.qll

  }

+  /**
+   * Gets a module for which this constant is the reference to an ancestor module.


this constant -> the constant constRef

hvitved · 2023-06-27T09:05:07Z

ruby/ql/lib/codeql/ruby/ApiGraphs.qll

   * additional entry points may be added by extending this class.
   */
  abstract class EntryPoint extends string {
+    // Note: this class can be deprecated in Ruby, but is still referenced by shared code in ApiGraphModels.qll,


Would it make sense to move this class to ApiGraphModelsSpecific.qll, and then define a deprecated sub class here?

I'd rather just wait and deprecate it simultaneously across languages.

hvitved · 2023-06-27T09:22:05Z

ruby/ql/lib/codeql/ruby/ApiGraphs.qll

+  /**
+   * A node corresponding to an argument, right-hand side of a store, or return value from a callable.
+   *
+   * Such a node may serve as the starting-point of backtracking, and has epsilon edges going


hvitved · 2023-06-27T09:24:47Z

ruby/ql/lib/codeql/ruby/ApiGraphs.qll

    /**
-     * Holds if `rhs` is a definition of a node that should have an incoming edge labeled `lbl`,
-     * from a def node that is reachable from `node`.
+     * Holds if the epsilon `pred -> succ` be generated, to associate `mod` with its references in the codebase.


hvitved · 2023-06-27T09:57:30Z

ruby/ql/lib/codeql/ruby/typetracking/ApiGraphShared.qll

+     * Holds if `pred` can reach `succ` by zero or more epsilon edges.
+     */
+    cached
+    predicate epsilonStar(ApiNode pred, ApiNode succ) = fastTC(epsilonEdge/2)(pred, succ)


Does it work if you remove the reflexive case from epsilonEdge, and instead define

cached
predicate epsilonPlus(ApiNode pred, ApiNode succ) = fastTC(epsilonEdge/2)(pred, succ)

pragma[inline]
predicate epsilonStar(ApiNode pred, ApiNode succ) { pred = succ or epsilonPlus(pred, succ) }

No, it causes misoptimizations. The current solution exists to ensure the RA pipelines we get from a chain of API graph calls become completely linear, which is quite robust against optimizer wobbles.

Inserting a disjunction at every API graph call causes the optimizer to do a bunch of work to try and remove those disjunctions, like DNF rewrites and pulling out shared helper predicates, which often leads to worse performance.
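
For readers comparing the two formulations, this small Ruby sketch checks that they describe the same relation on a toy graph: one closure over an edge relation that already contains the reflexive pairs, versus a closure over the plain edges with `pred = succ or ...` applied at each use. The helper `transitive_closure` and the node names are illustrative assumptions, and the sketch says nothing about the optimizer behaviour discussed above, which is the actual reason for the choice.

```ruby
require "set"

# Transitive closure of a set of [pred, succ] pairs (one or more steps).
def transitive_closure(edges)
  closure = Set.new(edges)
  loop do
    added = false
    closure.to_a.each do |a, b|
      closure.to_a.each do |c, d|
        # Join a->b with c->d when b == c; add? is nil if already present.
        added = true if b == c && closure.add?([a, d])
      end
    end
    break unless added
  end
  closure
end

nodes = %w[Foo p q]
edges = [%w[Foo p], %w[p q]]
reflexive = nodes.map { |n| [n, n] }

# Variant 1: reflexive pairs are part of the edge relation, so a single
# closure already covers "zero or more" steps.
star_single = transitive_closure(edges + reflexive)

# Variant 2: closure of the plain edges, with a disjunction at each use site.
plus = transitive_closure(edges)
star_disjunct = ->(a, b) { a == b || plus.include?([a, b]) }

same = nodes.product(nodes).all? do |a, b|
  star_single.include?([a, b]) == star_disjunct.call(a, b)
end
puts same # prints true
```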

hvitved · 2023-06-27T10:23:37Z

ruby/ql/lib/codeql/ruby/frameworks/ActiveRecord.qll

    )
  }
+
+  /** Gets the class as a `DataFlow::ClasNode`. */


asgerf · 2023-06-28T11:24:27Z

Thanks for the review @hvitved! I've pushed my changes in response to your review.

I've triggered another DCA run to test the changes related to the Twirp model.

erik-krogh

JS plz 🙏

asgerf · 2023-06-28T13:00:51Z

To answer some top-level comments:

From @erik-krogh:

Could you filter out all the forward/backwards nodes that are not end/start nodes from getAnEpsilonSuccessor() in a post-processing predicate to get a speedup?
Or would that ruin the efficient storage of the fastTc results?

Correct, caching a post-processed version of the fastTC predicate would involve materialising all of the tuples in the relation.

@aschackmull has previously mentioned the need for a variant of boundedFastTC that restricts the viable end-points of a path; such a feature would likely be applicable here.

But on the whole I don't think it would be a huge performance win.

From @RasmusWL:

I'm a bit sad about not being able to do the following anymore. I've done it a few times in the past, when writing exploratory queries (such as, give me an API node that .cursor().execute() is called on, to find new SQL modeling) -- I don't remember off the top of my head adding this to any "production" QL code though

Understandable. The use-case you mention is still supported via track() and backtrack(), although perhaps not as quick to write down:

API::Node foo(DataFlow::MethodCall node) {
  node.getMethodName() = "cursor" and
  result = node.track().getMethod("execute")
}

If it's a big enough issue we can easily implement something like API::getAMethodCallByName("...") so you can write API::getAMethodCallByName("cursor").getReturn().getMethod("execute").

asgerf added 14 commits June 19, 2023 12:01

Ruby: overhaul API graphs

0110610

Ruby: switch to local dataflow when dealing with Kernel/IO

5b05e72

Ruby: rename some call sites

61cda97

Ruby: update GraphQL model

2ef010f

Ruby: update SQLite3 model

b305c13

Ruby: update Twirp

f8ae530

This used right-to-left evaluation for API graphs, which is not supported anymore

Ruby: Use new features in ActionMailbox model

1ae4148

Ruby: use new features in ActionMailer

fbfa319

Ruby: use new features in ActionController

bb3b973

Ruby: minor overhaul of ActiveRecord model

8bc4193

The old version had scalability issues when taking more interprocedural flow and inheritance into account.

Ruby: minor overhaul of ActiveResource model

e3a0449

Ruby: update StoredXSS test results

ce0073b

These results were previously flagged for the wrong reason. Calls to a user-defined method were seen as ORM calls. The real source is inside the user-defined method, but we miss that due to lack of 'self' handling in ORM tracking.

Ruby: benign changes to SQLi tests (fixed FNs)

f392af2

Ruby: Update ActiveDispatch due to change in toString

8539db0

github-actions bot added Python Ruby labels Jun 19, 2023

asgerf marked this pull request as ready for review June 19, 2023 14:08

asgerf requested review from a team as code owners June 19, 2023 14:08

RasmusWL previously approved these changes Jun 21, 2023

python/ql/lib/semmle/python/dataflow/new/internal/TypeTracker.qll

Merge branch 'main' into rb/tracking-on-demand

0039cb1

asgerf dismissed RasmusWL’s stale review via 0039cb1 June 23, 2023 10:56

github-advanced-security bot found potential problems Jun 23, 2023

erik-krogh reviewed Jun 26, 2023

asgerf and others added 3 commits June 26, 2023 15:28

Update ruby/ql/lib/codeql/ruby/ApiGraphs.qll

ef9d910

Co-authored-by: Erik Krogh Kristensen <[email protected]>

Ruby: clarify qldoc for getADescendentModule

b61e823

Update ruby/ql/lib/codeql/ruby/ApiGraphs.qll

f6e2449

Co-authored-by: Erik Krogh Kristensen <[email protected]>

hvitved requested changes Jun 27, 2023

Ruby: address some review comments

174ab25

asgerf added 7 commits June 28, 2023 13:20

Ruby: add test for self.class call

67032b5

Ruby: remove forwarder for getADescendentModule

f171c21

Ruby: preserve comment in SQLite3

6feda75

Ruby: add asCallable()

dd86843

Ruby: use asCallable() in Twirp model

423da55

Ruby: expand Twirp test

129e634

Ruby: simplify Twirp model

7af3d22

Ruby: add change note

2f12234

github-actions bot added the documentation label Jun 28, 2023

Ruby: use a valid change note category

39789d4

hvitved approved these changes Jun 28, 2023

erik-krogh approved these changes Jun 28, 2023

asgerf merged commit f051702 into github:main Jun 28, 2023

asgerf mentioned this pull request Jul 7, 2023

Ruby: exclude Object class from API graph #13683

Merged
