-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Ruby: overhaul API graphs #13496
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ruby: overhaul API graphs #13496
Conversation
This used right-to-left evaluation for API graphs, which is not supported anymore
Old version had scalability issues when adding taking more interprocedural flow and inheritance into account.
These results were previously flagged for the wrong reason. Calls to a user-define method were seen as ORM calls. The real source is inside the user-defined method, but we miss that due to lack of 'self' handling in ORM tracking.
RasmusWL
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have only read the PR description, and the small change to the Python .qll file. Looks really good from my point of view 💪
Especially this bit is super nice!
Any DataFlow::LocalSourceNode can be converted to an API::Node for tracking where that value flows. This can be done by calling .track() on it.
Similarly any DataFlow::Node can be converted to an API::Node for tracking what flows into that value. This can be done by calling .backtrack() on it.
I'm a bit sad by not being able to do the following anymore. I've done it a few times in the past, when writing exploratory queries (such as, give me an API node that .cursor().execute() is called on, to find new SQL modeling) -- I don't remember on top of my head adding this to any "production" QL code though
API::Node barCall(API::Node base) {
result = base.getMethod("bar") // Do not do this!
}
|
I've not reviewed the actual changes to API graphs, but the library modelling improvements look really great. |
| or | ||
| implicitCallEdge(pred, succ) | ||
| or | ||
| exists(DataFlow::HashLiteralNode splat | hashSplatEdge(splat, pred, succ)) |
Check warning
Code scanning / CodeQL
Omittable 'exists' variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the time of writing, using an _ here triggers a compiler bug and crashes the compiler (it's been fixed on main).
erik-krogh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's clever, and I think I get the basic idea now.
I haven't looked at everything, and I haven't looked at the details, and I'm not enough into the shared dataflow-library / ruby to do a good review of all the parts.
At first I was confused about epsilonStar, because you're encoding the "content" as part of the ApiNode (MkForwardNode / MkBackwardNode).
So surely the this.getAnEpsilonSuccessor() calls would also include all the forwards/backwards nodes that has content, and I thought that would be a problem.
But I can see that I didn't need to worry, because each of the edges in Impl ensure that the predecessor is a relevant "start" edge.
So getAnEpsilonSuccessor produces a lot of irrelevant nodes that are afterwards filtered out.
I can also see why you needed to encode the content as part of the ApiNode (to get fastTc to work).
Could you filter out all the forward/backwards nodes that are not end/start nodes from getAnEpsilonSuccessor() in a post-processing predicate to get a speedup?
Or would that ruin the efficient storage of the fastTc results?
Co-authored-by: Erik Krogh Kristensen <[email protected]>
Co-authored-by: Erik Krogh Kristensen <[email protected]>
hvitved
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Truly amazing work! My comments are mostly trivial, the overall approach looks really solid, and excellent that it actually scales.
| @@ -0,0 +1,329 @@ | |||
| /** | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This files should probably be inside an internal folder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should the type-tracker files, so I just kept it here to avoid moving around too many files.
| string toString(); | ||
|
|
||
| /** Gets the location associated with this API node, if any. */ | ||
| Location getLocation(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Location needs to be a parameter as well, but can be postponed until other languages actually need to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it depends on exactly where this file is going to live in relation to type tracking.
Right now the idea is that the library just imports TypeTracker.qll and TypeTrackerSpecific.qll which provides a lot of language-specific stuff which, for the time being, should also work for Python
| pragma[noopt] | ||
| cached | ||
| predicate epsilonEdge(ApiNode pred, ApiNode succ) { | ||
| // forward |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove?
| } | ||
|
|
||
| cached | ||
| predicate methodHasSuperCall(MethodNode method, CallNode call) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really worth caching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now there aren't any other uses than in API graphs, so it doesn't matter. But in principle I'd say yes, it's a good place to cache, because the predicate is very small, and if re-evaluated at the wrong time it can trigger re-evaluation of getEnclosingMethod which is large and not cached.
| } | ||
|
|
||
| /** | ||
| * Gets a module for which this constant is the reference to an ancestor module. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this constant -> the constant constRef
| * additional entry points may be added by extending this class. | ||
| */ | ||
| abstract class EntryPoint extends string { | ||
| // Note: this class can be deprecated in Ruby, but is still referenced by shared code in ApiGraphModels.qll, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to move this class to ApiGraphModelsSpecific.qll, and then define a deprecated sub class here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather just wait and deprecate it simultaneously across languages.
| /** | ||
| * A node corresponding to an argument, right-hand side of a store, or return value from a callable. | ||
| * | ||
| * Such a node may serve as the starting-point of backtracking, and has epsilon edges going |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going to
| /** | ||
| * Holds if `rhs` is a definition of a node that should have an incoming edge labeled `lbl`, | ||
| * from a def node that is reachable from `node`. | ||
| * Holds if the epsilon `pred -> succ` be generated, to associate `mod` with its references in the codebase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should
| * Holds if `pred` can reach `succ` by zero or more epsilon edges. | ||
| */ | ||
| cached | ||
| predicate epsilonStar(ApiNode pred, ApiNode succ) = fastTC(epsilonEdge/2)(pred, succ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it work if you remove the reflexive case from epsilonEdge, and instead define
cached
predicate epsilonPlus(ApiNode pred, ApiNode succ) = fastTC(epsilonEdge/2)(pred, succ)
pragma[inline]
predicate epsilonStar(ApiNode pred, ApiNode succ) {
pred = succ
or
epsilonPlus(pred, succ)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it causes misoptimizations. The current solution exists to ensure the RA pipelines we get from a chain of API graph calls becomes completely linear, which is quite robust against optimizer wobbles.
Inserting a disjunction at every API graph call causes the optimizer to do bunch of work to try and remove those disjunctions, like DNF rewrites and pulling out shared helper predicates, which oftens leads to worse performance.
| ) | ||
| } | ||
|
|
||
| /** Gets the class as a `DataFlow::ClasNode`. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Class
|
Thanks for the review @hvitved! I've pushed my changes in response to your review. I've triggered another DCA run to test the changes related to the Twirp model. |
erik-krogh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JS plz 🙏
|
To answer some top-level comments: From @erik-krogh:
Correct, caching a post-processed version of the @aschackmull has previously mentioned the need for a variant of But on the whole I don't think it would be a huge performance win. From @RasmusWL:
Understandable. The use-case you mention is still supported via API::Node foo(DataFlow::MethodCall node) {
node.getMethodName() = "cursor" and
result = node.track().getMethod("execute")
}If it's a big enough issue we can easily implement something like |
Makes a significant overhaul of API graphs in Ruby, both in how they are implemented and what they are capable of.
Currently this is only for Ruby, but some code has already been factored out with intent to share with JS and Python as well. (The PR is large enough as it is, and actually sharing the file involves moving files around that make the diff harder to read)
The main benefits of this change are:
DataFlow::LocalSourceNodecan be converted to anAPI::Nodefor tracking where that value flows. This can be done by calling.track()on it.DataFlow::Nodecan be converted to anAPI::Nodefor tracking what flows into that value. This can be done by calling.backtrack()on it.X.getInstance()previously only found calls toX.new, but now includesselfparameters that could be an instance ofX. Concretely, these are all theselfparameters of an instance method of any ancestor of any descendent ofX.X.getMethod(m)now finds calls toself.min singleton methods in a subclasses ofX.X.getMethod(m)now also matchessupercalls inside the relevant methods.Kerneland working around inaccurateselfcapture in blocks.Relation to MaD
To provide some more context as to why this is needed, consider the criteria by which we determine if a given call targets a particular external method. We mainly look at four things:
The MaD format for dynamic languages require that these are in fact the only criteria by which calls are identified. However, Ruby has had a number of models that rely on other criteria, such as being syntactically inside a particular class. I believe these were heuristics in place because (2) was too hard to check at the time the model was written. This had led to FNs, and has also made it difficult to judge which models could be presented in MaD, because those heuristic criteria could not be used in MaD.
API graphs should be the solution to (2). So my hope with this improvement to API graphs that we can start checking the type of the receiver using API graphs, and moving away from the heuristic criteria that block transition to MaD.
Epsilon edges
Under the hood, we introduce a notion of epsilon edges. An epsilon edge
A -> Bmeans anything looked up inAwill implicitly looked up inBas well. Both data flow and inheritance give rise to epsilon edges. The construction of the "epsilon graph" is in a moduleApiGraphShared.qllthat I intend to share with JS/Python, but currently just lives in Ruby.Previously an edge in the API graph incorporated interprocedural flow, but labelled edges are now usually entirely local, as the interprocedural flow is captured by preceding epsilon edges.
For example, given this program,
Previously, there was an edge from
Foodirectly top.barin the method:Now, the step from
Foo -> pis an epsilon edge, at theMethod[bar]edge is local:Epsilon edges are also use to incorporate inheritance. For example
getTopLevelMember("Foo").getMethod("baz")would identifyBar.bazin the example below:A somewhat simplified version of the API graph would look like this:
Strict evaluation order
Almost every user-facing predicate now has
pragma[inline_late]andbindingset[this], meaning chains of API graph calls are now join-ordered more reliably. This also means you must restrict the receiver before using API graphs, so you should avoid doing something like this, as it will be forced to find enumerate the epsilon-successors of every node in the graph, which quite a lot:getASuccessor is deprecated
We no longer support using API graphs as a general labelled graph.
getASuccessoris deprecated,getPath()is deprecated and thetoStringvalue is now a very direct translation of the internal representation of a node.Previously we would construct an
edge(Node pred, Label lbl, Node succ)relation, but then cache the specialized versions of this relation, making the originaledgerelation largely unnecessary.An upside of this is better performance, and adding new kinds of labelled edges is vastly simplified as it's just a matter of:
Implpragma[inline_late]predicate inAPI::Node, callinggetAnEpsilonSuccessor()and passing that as argument to the cached edge relation inImplEvaluation
Evaluation shows: