@asgerf asgerf commented Jun 19, 2023

Makes a significant overhaul of API graphs in Ruby, both in how they are implemented and what they are capable of.

Currently this is only for Ruby, but some code has already been factored out with the intent of sharing it with JS and Python as well. (The PR is large enough as it is, and actually sharing the file involves moving files around, which makes the diff harder to read.)

The main benefits of this change are:

  • Any DataFlow::LocalSourceNode can be converted to an API::Node for tracking where that value flows. This can be done by calling .track() on it.
  • Similarly any DataFlow::Node can be converted to an API::Node for tracking what flows into that value. This can be done by calling .backtrack() on it.
  • Some consequences of the above are:
    • Parameters of user-defined methods can now be seen as sources, and likewise, their return values can be seen as sinks. Previously we had to drop down to data flow there, and as a result, some of our models relied on local data flow leading to FNs.
    • It is much easier to gradually migrate models to use API graphs. Previously the inability to go from data flow to API graphs meant you could get stuck due to a dependency on another model, which you would then also have to migrate at the same time.
  • X.getInstance() previously only found calls to X.new, but now includes self parameters that could be an instance of X. Concretely, these are all the self parameters of an instance method of any ancestor of any descendant of X.
  • X.getMethod(m) now finds calls to self.m in singleton methods in subclasses of X.
  • X.getMethod(m) now also matches super calls inside the relevant methods.
  • This means API graphs can and should now be relied upon for finding calls to external methods in virtually all cases. Exceptions mainly have to do with special cases like Kernel and working around inaccurate self capture in blocks.
  • API graphs should be easier to use in general, since you don't need to worry about whether an API exists for a given data-flow node. The documentation also doesn't focus so much on the library boundary, just data flow and inheritance, illustrated with examples.
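To make the getInstance/getMethod bullets above more concrete, here is a small Ruby program showing the code shapes they describe. The names `Base`, `Widget`, `build`, `create` and `describe` are made up for illustration, and the comments map each line to a bullet based on the description only, not on verified query output.

```ruby
# Hypothetical "library" class; in a real analysis this would be external code.
class Base
  def self.build(name)
    "built #{name}"
  end

  def describe
    "base"
  end
end

class Widget < Base
  # Singleton method in a subclass of Base: the `self.build` call is the shape
  # described for getMethod("build") on the node for Base.
  def self.create
    self.build("widget")
  end

  # Instance method of a descendant of Base: its `self` is the kind of value
  # described as now being included by getInstance() on the node for Base.
  def describe
    # `super` call inside an overriding method: the shape described by the
    # "super calls" bullet.
    "widget < " + super
  end
end

puts Widget.create
puts Widget.new.describe
```

None of these calls go through `Base.new` or name `Base` as a receiver directly, which is why (per the description) the previous `X.new`-only notion of an instance would not have reached them.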

Relation to MaD

To provide some more context as to why this is needed, consider the criteria by which we determine if a given call targets a particular external method. We mainly look at four things:

  1. The name of the method being invoked
  2. The type of the receiver
  3. The number of arguments given
  4. And rarely, the values/types of the arguments

The MaD format for dynamic languages requires that these are in fact the only criteria by which calls are identified. However, Ruby has had a number of models that rely on other criteria, such as being syntactically inside a particular class. I believe these heuristics were put in place because (2) was too hard to check at the time the model was written. This has led to FNs, and has also made it difficult to judge which models could be expressed in MaD, because those heuristic criteria cannot be used in MaD.

API graphs should be the solution to (2). So my hope with this improvement to API graphs is that we can start checking the type of the receiver using API graphs, and move away from the heuristic criteria that block the transition to MaD.

Epsilon edges

Under the hood, we introduce a notion of epsilon edges. An epsilon edge A -> B means anything looked up in A will implicitly be looked up in B as well. Both data flow and inheritance give rise to epsilon edges. The construction of the "epsilon graph" is in a module ApiGraphShared.qll that I intend to share with JS/Python, but currently just lives in Ruby.

Previously an edge in the API graph incorporated interprocedural flow, but labelled edges are now usually entirely local, as the interprocedural flow is captured by preceding epsilon edges.

For example, given this program,

def func p
  p.bar
end
func Foo

Previously, there was an edge from Foo directly to p.bar in the method:

flowchart LR
  root
  Foo
  p.bar
  root -- "Member[Foo]" --> Foo
  Foo -- "Method[bar]" --> p.bar

Now, the step from Foo -> p is an epsilon edge, and the Method[bar] edge is local:

flowchart LR
  root
  Foo
  p
  p.bar
  root -- "Member[Foo]" --> Foo
  Foo -- epsilon --> p
  p -- "Method[bar]" --> p.bar

Epsilon edges are also used to incorporate inheritance. For example getTopLevelMember("Foo").getMethod("baz") would identify Bar.baz in the example below:

class Bar < Foo
end
Bar.baz

A somewhat simplified version of the API graph would look like this:

flowchart LR
  root -- "Member[Foo]" --> FooExpr -- epsilon --> BarModule -- epsilon --> BarExpr -- "Method[baz]" --> Bar.baz
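The lookup rule for epsilon edges can be sketched with a toy graph in Ruby. This is only an illustration of the idea ("anything looked up in A is also looked up in its epsilon successors"); the class `ToyApiGraph` and its methods are hypothetical and do not mirror the actual QL implementation in ApiGraphShared.qll.

```ruby
require "set"

# Toy model only: nodes are plain strings, not real API::Node values.
class ToyApiGraph
  def initialize
    @epsilon = Hash.new { |h, k| h[k] = [] }   # node => epsilon successors
    @labelled = Hash.new { |h, k| h[k] = [] }  # [node, label] => local successors
  end

  def add_epsilon(from, to)
    @epsilon[from] << to
  end

  def add_edge(from, label, to)
    @labelled[[from, label]] << to
  end

  # All nodes reachable from `node` by zero or more epsilon edges.
  def epsilon_star(node)
    seen = Set[node]
    worklist = [node]
    until worklist.empty?
      current = worklist.pop
      @epsilon[current].each do |succ|
        worklist << succ if seen.add?(succ)
      end
    end
    seen
  end

  # Follow a local labelled edge from `node` or any of its epsilon successors.
  def lookup(node, label)
    epsilon_star(node).flat_map { |n| @labelled[[n, label]] }
  end
end

# The data-flow example: `func Foo` passes Foo into parameter `p`.
graph = ToyApiGraph.new
graph.add_edge("root", "Member[Foo]", "Foo")
graph.add_epsilon("Foo", "p")
graph.add_edge("p", "Method[bar]", "p.bar")

foo_nodes = graph.lookup("root", "Member[Foo]")
bar_calls = foo_nodes.flat_map { |n| graph.lookup(n, "Method[bar]") }
puts bar_calls.inspect
```

In this model the `Method[bar]` edge is attached only to `p`, yet a lookup starting at `Foo` still finds `p.bar`, because the epsilon closure is taken before the labelled edge is followed. The inheritance example works the same way, with `FooExpr -> BarModule -> BarExpr` as the epsilon chain.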

Strict evaluation order

Almost every user-facing predicate now has pragma[inline_late] and bindingset[this], meaning chains of API graph calls are now join-ordered more reliably. This also means you must restrict the receiver before using API graphs. Avoid doing something like this, as it will be forced to enumerate the epsilon-successors of every node in the graph, which is quite a lot:

API::Node barCall(API::Node base) {
  result = base.getMethod("bar") // Do not do this!
}

getASuccessor is deprecated

We no longer support using API graphs as a general labelled graph. getASuccessor is deprecated, getPath() is deprecated and the toString value is now a very direct translation of the internal representation of a node.

Previously we would construct an edge(Node pred, Label lbl, Node succ) relation, but then cache the specialized versions of this relation, making the original edge relation largely unnecessary.

An upside of this is better performance, and adding new kinds of labelled edges is vastly simplified as it's just a matter of:

  • Adding a new cached predicate in Impl
  • Adding a pragma[inline_late] predicate in API::Node, calling getAnEpsilonSuccessor() and passing that as argument to the cached edge relation in Impl

Evaluation

Evaluation shows:

  • 635 new taint sinks
  • 79 new taint sources
  • 901 new tainted nodes
  • 24 new alerts
  • Performance looks reasonable. There's a 10% regression on two projects, but those also gained some new results, and a 24% speed-up on one. Given the benefits I think we can accept this.

asgerf added 14 commits June 19, 2023 12:01
This used right-to-left evaluation for API graphs, which is not supported anymore
Old version had scalability issues when taking more interprocedural flow and inheritance into account.
These results were previously flagged for the wrong reason.

Calls to a user-defined method were seen as ORM calls. The real source is inside the user-defined method, but we miss that due to lack of 'self' handling in ORM tracking.
@asgerf asgerf marked this pull request as ready for review June 19, 2023 14:08
@asgerf asgerf requested review from a team as code owners June 19, 2023 14:08
RasmusWL previously approved these changes Jun 21, 2023

@RasmusWL RasmusWL left a comment

I have only read the PR description, and the small change to the Python .qll file. Looks really good from my point of view 💪

Especially this bit is super nice!

Any DataFlow::LocalSourceNode can be converted to an API::Node for tracking where that value flows. This can be done by calling .track() on it.
Similarly any DataFlow::Node can be converted to an API::Node for tracking what flows into that value. This can be done by calling .backtrack() on it.

I'm a bit sad about not being able to do the following anymore. I've done it a few times in the past when writing exploratory queries (such as: give me an API node that .cursor().execute() is called on, to find new SQL modeling). I don't remember off the top of my head adding this to any "production" QL code though.

API::Node barCall(API::Node base) {
  result = base.getMethod("bar") // Do not do this!
}

@alexrford

I've not reviewed the actual changes to API graphs, but the library modelling improvements look really great.

or
implicitCallEdge(pred, succ)
or
exists(DataFlow::HashLiteralNode splat | hashSplatEdge(splat, pred, succ))

Check warning

Code scanning / CodeQL

Omittable 'exists' variable

This exists variable can be omitted by using a don't-care expression [in this argument](1).

Contributor Author

At the time of writing, using an _ here triggers a compiler bug and crashes the compiler (it's been fixed on main).

@erik-krogh erik-krogh left a comment

It's clever, and I think I get the basic idea now.
I haven't looked at everything, and I haven't looked at the details, and I'm not enough into the shared dataflow-library / ruby to do a good review of all the parts.

At first I was confused about epsilonStar, because you're encoding the "content" as part of the ApiNode (MkForwardNode / MkBackwardNode).
So surely the this.getAnEpsilonSuccessor() calls would also include all the forwards/backwards nodes that have content, and I thought that would be a problem.

But I can see that I didn't need to worry, because each of the edges in Impl ensures that the predecessor is a relevant "start" edge.
So getAnEpsilonSuccessor produces a lot of irrelevant nodes that are afterwards filtered out.

I can also see why you needed to encode the content as part of the ApiNode (to get fastTc to work).

Could you filter out all the forward/backwards nodes that are not end/start nodes from getAnEpsilonSuccessor() in a post-processing predicate to get a speedup?
Or would that ruin the efficient storage of the fastTc results?

@hvitved hvitved left a comment

Truly amazing work! My comments are mostly trivial, the overall approach looks really solid, and excellent that it actually scales.

@@ -0,0 +1,329 @@
/**

Contributor

This file should probably be inside an internal folder?

Contributor Author

So should the type-tracker files, so I just kept it here to avoid moving around too many files.

string toString();

/** Gets the location associated with this API node, if any. */
Location getLocation();

Contributor

Location needs to be a parameter as well, but can be postponed until other languages actually need to use it.

@asgerf asgerf Jun 28, 2023

Yeah it depends on exactly where this file is going to live in relation to type tracking.

Right now the idea is that the library just imports TypeTracker.qll and TypeTrackerSpecific.qll, which provide a lot of language-specific stuff that, for the time being, should also work for Python.

pragma[noopt]
cached
predicate epsilonEdge(ApiNode pred, ApiNode succ) {
// forward

Contributor

remove?

}

cached
predicate methodHasSuperCall(MethodNode method, CallNode call) {

Contributor

Is this really worth caching?

Contributor Author

Right now there aren't any other uses than in API graphs, so it doesn't matter. But in principle I'd say yes, it's a good place to cache, because the predicate is very small, and if re-evaluated at the wrong time it can trigger re-evaluation of getEnclosingMethod which is large and not cached.

}

/**
* Gets a module for which this constant is the reference to an ancestor module.

Contributor

this constant -> the constant constRef

* additional entry points may be added by extending this class.
*/
abstract class EntryPoint extends string {
// Note: this class can be deprecated in Ruby, but is still referenced by shared code in ApiGraphModels.qll,

Contributor

Would it make sense to move this class to ApiGraphModelsSpecific.qll, and then define a deprecated sub class here?

Contributor Author

I'd rather just wait and deprecate it simultaneously across languages.

/**
* A node corresponding to an argument, right-hand side of a store, or return value from a callable.
*
* Such a node may serve as the starting-point of backtracking, and has epsilon edges going

Contributor

going to

/**
* Holds if `rhs` is a definition of a node that should have an incoming edge labeled `lbl`,
* from a def node that is reachable from `node`.
* Holds if the epsilon `pred -> succ` be generated, to associate `mod` with its references in the codebase.

Contributor

should

* Holds if `pred` can reach `succ` by zero or more epsilon edges.
*/
cached
predicate epsilonStar(ApiNode pred, ApiNode succ) = fastTC(epsilonEdge/2)(pred, succ)

Contributor

Does it work if you remove the reflexive case from epsilonEdge, and instead define

cached
predicate epsilonPlus(ApiNode pred, ApiNode succ) = fastTC(epsilonEdge/2)(pred, succ)

pragma[inline]
predicate epsilonStar(ApiNode pred, ApiNode succ) {
  pred = succ
  or
  epsilonPlus(pred, succ)
}

Contributor Author

No, it causes misoptimizations. The current solution exists to ensure the RA pipelines we get from a chain of API graph calls becomes completely linear, which is quite robust against optimizer wobbles.

Inserting a disjunction at every API graph call causes the optimizer to do a bunch of work to try and remove those disjunctions, like DNF rewrites and pulling out shared helper predicates, which often leads to worse performance.
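For readers following this exchange, the two formulations compute the same relation; the disagreement is only about how the optimizer handles them. A minimal Ruby sketch of that equivalence, using a naive closure function as a stand-in for fastTC (the helper `transitive_closure` and the node names are illustrative assumptions, and nothing here models the optimizer behaviour discussed above):

```ruby
require "set"

# Naive transitive closure (one or more steps) over a set of [from, to] pairs.
def transitive_closure(edges)
  closure = Set.new(edges)
  loop do
    additions = Set.new
    closure.each do |a, b|
      closure.each do |c, d|
        additions << [a, d] if b == c && !closure.include?([a, d])
      end
    end
    break if additions.empty?
    closure.merge(additions)
  end
  closure
end

nodes = %w[FooExpr BarModule BarExpr]
edges = Set[%w[FooExpr BarModule], %w[BarModule BarExpr]]
reflexive = nodes.map { |n| [n, n] }.to_set

# Current approach: reflexive pairs are part of the edge relation itself,
# so the closure is already the "star" relation.
star_from_reflexive_edges = transitive_closure(edges | reflexive)

# Suggested alternative: compute "plus", then add `pred = succ` as a disjunct.
star_from_plus = transitive_closure(edges) | reflexive

same = star_from_reflexive_edges == star_from_plus
puts same
```

Since the results agree, the choice between the two is purely about evaluation: the first keeps each API graph call as a single relation lookup, while the second introduces a disjunction at every call site.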

)
}

/** Gets the class as a `DataFlow::ClasNode`. */

Contributor

Class

@asgerf asgerf commented Jun 28, 2023

Thanks for the review @hvitved! I've pushed my changes in response to your review.

I've triggered another DCA run to test the changes related to the Twirp model.

@erik-krogh erik-krogh left a comment

JS plz 🙏

@asgerf asgerf commented Jun 28, 2023

To answer some top-level comments:


From @erik-krogh:

Could you filter out all the forward/backwards nodes that are not end/start nodes from getAnEpsilonSuccessor() in a post-processing predicate to get a speedup?
Or would that ruin the efficient storage of the fastTc results?

Correct, caching a post-processed version of the fastTC predicate would involve materialising all of the tuples in the relation.

@aschackmull has previously mentioned the need for a variant of boundedFastTC that restricts the viable end-points of a path; such a feature would likely be applicable here.

But on the whole I don't think it would be a huge performance win.


From @RasmusWL:

I'm a bit sad about not being able to do the following anymore. I've done it a few times in the past when writing exploratory queries (such as: give me an API node that .cursor().execute() is called on, to find new SQL modeling). I don't remember off the top of my head adding this to any "production" QL code though.

Understandable. The use-case you mention is still supported via track() and backtrack(), although perhaps not as quick to write down:

API::Node foo(DataFlow::MethodCall node) {
  node.getMethodName() = "cursor" and
  result = node.track().getMethod("execute")
}

If it's a big enough issue we can easily implement something like API::getAMethodCallByName("...") so you can write API::getAMethodCallByName("cursor").getReturn().getMethod("execute").

@asgerf asgerf merged commit f051702 into github:main Jun 28, 2023