In this workshop we will explore control flow, how it is represented by the standard library, and how you can use it to reason about reachability.
-
Install Visual Studio Code.
-
Install the CodeQL extension for Visual Studio Code.
-
You do not need to install the CodeQL CLI: the extension will handle this for you.
-
Clone this repository:
git clone https://github.com/rvermeulen/codeql-workshop-control-flow-cpp
-
Install the CodeQL pack dependencies using the command
CodeQL: Install Pack Dependencies
and selectexercises
,exercises-tests
,solutions
, andsolutions-tests
.
The workshop is split into multiple exercises introducing control flow. In these exercises you will learn:
- About control flow and the control flow graph (CFG).
- How control flow is represented in QL.
- Learn about reachability and how you can answer reachability questions using recursive predicates and transitive closures.
- About properties of control flow graph nodes, such as dominating other control flow nodes.
- About corner-cases when reasoning using control flow, how data flow provides higher level construct to answer reachability questions, but still requires control flow to excludes correct cases.
Control flow is an ordering on program elements that dictates the order in which program elements are executed or evaluated. The control flow of a program is captured by a control flow graph that has labelled directed edges between nodes that capture the order and conditions for the flow to occur.
When CodeQL extracts code, it will create an abstract syntax tree (AST), and based on the AST it will create a control flow graph (CFG) to capture the order of execution.
The CodeQL standard library for C/C++ computes an expression-level intra-procedural CFG and exposes the CFG via the class ControlFlowNode
and the successor relation getASuccessor
.
This means that the CFG has edges between expressions, statements, and methods.
An important goal of the computed CFG is to accurately capture the order of side-effects in a program.
Consider the following snippet:
int add(int m, int n) {
return m + n;
}
The following AST captures the parent child relationship and based on that relationship the following CFG is generated. The edges are labelled with the order for clarification and are not part of the actual CFG.
Looking at the CFG we can make a few observations:
- For expression (e.g.,
AddExpr
) their children are executed from left to right before the expression itself is executed (post-order traversal on the AST node). - Statements with side-effects (e.g,
ReturnStmt
) are executed like expression. - Statements without side-effects (e.g,
BlockStmt
) are their own entry point (pre-order traversal on the AST node). - The method itself is the last node of the CFG.
AST ┌───────────────┐ CFG
│ Unconditional │
└───────────────┘
.─────────. .─────────.
( Function ) ( Function )◀────┐
`─────────' `─────────' │
│ │
│
│ │
▼ │
.─────────. .─────────. │
( BlockStmt ) ────────(1)──────▶( BlockStmt ) │
`─────────' `─────────' │(6)
│ │ │
┌────────────────────────┘ │
│ │ │
▼ │ (2) ┌───────────┘
.─────────. │ │
(ReturnStmt ) │ │
`─────────' │ │
│ │ │
│ │
│ ▼ │
.───────────. │
▼ ( ReturnStmt ) .─────────.
.─────────. `───────────' ( AddExpr )◀───────┐
( AddExpr ) │ `─────────' │
`─────────' │ (3) (5) │
Λ │ │
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ▼ │
▼ ▼ .───────────────. .───────────────.
.─────────────. .─────────────. ( VariableAccess )─────────(4)───────▶( VariableAccess )
(VariableAccess ) (VariableAccess ) `───────────────' `───────────────'
`─────────────' `─────────────'
Each CFG node in the previous example has a single successor. When we include constructs like conditions or loops, we will encounter CFG nodes with multiple successors.
Consider the following snippet:
unsigned int absolute(int i) {
if (i < 0) {
return -i;
}
return i;
}
For which we generate the following AST and CFG:
AST CFG
.───────. .───────.
(Function ) (Function )◀───────────────────────────────────────────┐
`───────' ┌───────────────┐ `───────' │
│ │ Conditional │ ▲ │
│ └───────────────┘ │ │
│ └─────────────────────────────────────────────────┼──┐
▼ │ │
.─────────. .─────────. │ │
( BlockStmt ) ────────────▶( BlockStmt ) │ │
`─────────' `─────────' │ │
│ │ │ │
│ │ │ │
│ │ │ │
▼ ▼ │ │
.───────. .───────. │ │
( IfStmt ) ( IfStmt ) │ │
`───────' `───────' │ │
│ │ │ │
│ │ │ │
┌────────────┴────────────────┬───────────────────┐ ┌─────────────────────┘ │ │
│ │ │ │ ┌────────────────────────────(false)──────────────┐ │ │
▼ ▼ ▼ │ │ ▼ │ │
.───────. .─────────. .─────────. │ .───────. .─────────. .─────────. │ │
( LTExpr ) ( BlockStmt ) ( BlockStmt ) │ ( LTExpr )─────────────────▶( BlockStmt ) ( BlockStmt ) │ │
`───────' `─────────' `─────────' │ `───────' (true) `─────────' `─────────' │ │
│ │ │ │ ▲ │ │ │ │
┌────────┴────────┐ │ │ │ └────────┐ │ │ │ │
│ │ │ │ │ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ │ ▼ ▼ │ │
.─────────────. .─────────────. .─────────────. .─────────────. .─────────────. .─────────────. .─────────────. .─────────────. │ │
(VariableAccess ) ( Literal ) ( ReturnStmt ) ( ReturnStmt ) (VariableAccess ) ( Literal ) ( ReturnStmt ) ( ReturnStmt ) │ │
`─────────────' `─────────────' `─────────────' `─────────────' `─────────────' `─────────────' `─────────────' `─────────────' │ │
│ │ │ ▲ │ │ │ │
▼ ▼ │ │ ┌───────────┘ ▼ │ │
.───────────────. .───────────────. └─────────────────┘ │ .───────────────. .───────────────. │ │
( UnaryMinuxExpr ) ( VariableAccess ) │ ( UnaryMinuxExpr ) ( VariableAccess ) ──┘ │
`───────────────' `───────────────' │ `────────────┬──' `───────────────' │
│ │ ▲ │ │
│ │ │ └──────────────────────────────┘
▼ │ │
.───────────────. │ .───────────────.
( VariableAccess ) └─▶( VariableAccess )
`───────────────' `───────────────'
Here we can see a CFG with a node that has multiple successors.
The LTExpr
part of the condition in the IfStmt
statement continues execution to one of two successors depending on whether the condition evaluates to true
or false
.
This is reflected in the labels of the outgoing CFG edges.
Next are the exercises used to further explore control flow.
Find all the control flow nodes and their successors by implementing exercise1.ql.
Hints
- The
cpp
module provides a classControlFlowNode
to reason about control flow nodes in a program.
A solution can be found in the query Exercise1.ql
Now that we have described how two control flow nodes relate using the successor relationship we can reason about reachability. Reachability is a concept from graph theory that refers to the ability to get from one vertex to another within a graph. In our case it means that control can be transferred from one node in the CFG to another node in the CFG. Or more concretely, the reachable node will be executed after executing the node it is reachable from.
Implement the predicate reachable
using the successor relationship by completing the query Exercise2.ql.
Hints
This query can be implemented using recursion or using the transitive closure
A solution can be found in the query exercise2.ql
To visualize the intra-procedural CFG, that is the CFG of a method not crossing method boundaries, restrict the begin node and the reachable node by completing the query in Excercise3.ql.
To understand how to restrict the begin node and the end node we first have to determine what a begin node and end node in a CFG are with respect to the successor relationship.
Hints
- The
ControlFlowNode
class has member predicategetAPredecessor
to get a predecessor. How could you get a predecessor with only thegetASuccessor
member predicate? - You can use the pattern
not exists(...)
to state that a predicate has no results. - The CodeQL CLI command
codeql database analyze
can output to thedot
format. The result can be rendered with Graphviz. See the filecontrol-flow-graph.png
for a rendering of the control flow graphs in test.cpp.
A solution can be found in the query Exercise3.ql
From the results of the previous exercises, we can see that the control flow paths include both statements and expressions.
In some cases, you would be interested in determining the successor statement of another statement.
For example, given an if statement, what is the next reachable if statement.
Our reachable
predicate will return all reachable if statement so additional logic is required to answer the question.
Implement the predicate getANextStmt
in Exercise4.ql.
While implementing the predicate consider the following C/C++ snippet and which if
statements should be returned by Exercise4.ql for the first if
statement.
if (...) {
if (...) {
...
}
}
if (...) {
...
}
with the following CFG.
│
│
│
│
▼
.───────.
( IfStmt )
`───────'
│
┌─────────────┘
▼
.───────. .───────.
( ... )───(true)───▶(BlockStmt)
`───────' `───────'
│ │
│ │
│ ▼
│ .───────.
│ ( IfStmt )
│ `───────'
│ │
(false) ┌──────────┘
│ ▼
│ .───────. .───────.
│ ( ... )─(true)────▶(BlockStmt)
│ `───────' `───────'
│ │ │
│ │ │
│ (false) │
│ │ │
│ ▼ ▼
│ .───────. .───────.
└─────▶( IfStmt )◀───────────( ... )
`───────' `───────'
│
┌────────┘
▼
.───────. .───────.
( ... )───────▶( ... )─ ─ ─ ─ ─▶
`───────' `───────'
Hints
When you use the reachable
predicate you need to exclude results to include only the strict successor statements.
However, you cannot exclude the correct nodes due to conditional nodes. See the above example snippet and CFG.
To implement a correct solution, you need to resort to a recursive predicate.
A solution can be found in the query Exercise4.ql
Now that we understand the successor relation, the control flow graph, and reachability, we can look at how reachability can be used.
In multiple scenarios, including security relevant scenarios, a certain action must be performed before another action. For example, before you can use a method you must first initialize the class providing the method.
Look at the test case for Exercise5.ql and implement Exercise5.ql to ensure the incorrect_use
is detected.
Hints
For all correct uses, the some_action
method access is reachable from the init
function call.
A solution can be found in the query Exercise5.ql
First time use of the successor relationship and reachability can result in surprising result. The successor relationship relates program elements that can reside in different scopes within a method.
For example, any program element following an if
statement will be related to one or more statements in the if
body.
Consider the following case:
void incorrect_use2(const char *input) {
struct ctx ctx;
if (input) {
init(&ctx);
}
some_action(&ctx, input);
}
The some_action
method access is reachable from an init
function call.
This means that the solution to exercise 5 will not find this incorrect use, which is the case for our solution to exercise 5.
To ensure that all accesses of the method some_action
are preceded by an init
access we can make use of the dominator concept from graph theory.
A node dominates another node if every path from the entry node to a node m must go through node n.
If we swap m with method access for someAction
and n with method access of initialize
then this describes the property we want.
The standard library provides the predicate dominates
that is defined in the the module Dominance.
Use the predicate dominates
and update the solution to Exercise5.ql in Exercise6.ql to account for the new case.
Hints
A solution can be found in the query Exercise6.ql
CodeQL constructs an intraprocedural CFG, meaning it only considers the program elements in a function. However, function are not used in isolation and there are cases where we need to extend the control flow analysis.
The test case is extended with the function default_ctx
that initializes the context and sets some fields to default values.
To detect this case we need to reason about the control flow graph interprocedurally.
The reasoning about callers and callees, flow of control crossing function boundaries, requires the use of the (static) Call graph.
To be able to correctly identify the correct_use2
case we need to determine that there exists a function call that dominates the function call some_action
and confirm that the function call always call the init
method.
Implement Exercise7.ql by completing the predicates initCallDominatesExit
, that holds if a function calls the init
function on all execution paths, and callAlwaysCallsInit
, that holds if a function call calls a function that always calls init
on all execution paths.
Hints
- The predicate
dominates
holds if a control flow node can only be reached by going through another control flow node. - The
Function
is the last node in the control flow graph of a function.
A solution can be found in the query Exercise7.ql
In the previous exercise we started to reason interprocedurally about the control flow.
However, our reasoning was incomplete because we only considered calls dominating the some_action
function call within the same function.
Consider the test case that includes multiple calls to correct_or_incorrect_use
where one caller has initialized the context and the other is didn't.
To identify the correct uses, we need to determine if all the callers of correct_or_incorrect_use
are dominated by an init
function call, or a function call that always performs calls the init
function.
Implement the query alwaysPrecededByInitializationCall
in Exercise8.ql and use it to determine that a ControlFlowNode
is always directly preceded by an init
function call, or the property holds for all the callers of its enclosing callable.
Hints
- The
FunctionCall
class has a member predicategetTarget
to reason about the functions it can call. - The
ControlFlowNode
class has a member predicategetControlFlowScope
to reason about the function that contains the control flow node. - To state that a property must hold for all values (e.g., all callers) use the formula forall. Note that the
forall
formula holds vacuously if there are no values to test the property on because it is logicall the same asnot exists(<vars> | <formula 1> | not <formula 2>)
. If there needs to be at least one value for which the property holds you can use the forex formula.
A solution can be found in the query Exercise8.ql
Multiple corner cases down the line the question arises whether there isn't a better way to solve our problem.
While our solutions improved incrementally, we still have a lot of corner cases to cover.
For example, we didn't even consider that the arguments to the init
calls are the same as to the some_action
calls.
The answer to the question is yes. Data flow analysis allows us to more succinctly describe the problem and provide us with more precise answers. However, for the correct answer, we still need to reason about control flow.
As a precursor to the data flow workshop, we are going to implement parts of an interprocedural dataflow configuration to find uses of some_action
without a preceding call to init
. The core idea is to track the struct's instance to its uses as an argument to some_action
calls and stop tracking the instance if it is used as an argument to init
.
However, because C/C++'s data flow implementation relies on def-use relations, and not on use-use relations, we still need the predicates we created so far to filter out correct uses.
Complete the dataflow configuration in Exercise9.ql
Hints
- The class
DataFlow::Node
can transformed to uninitialized local variables using the member predicateasUninitialized
. - The class
LocalVariable
has a member predicategetType()
to get the type of the variable. - Interprocedural dataflow uses the concepts of a source and a sink for which it determines if the source can reach the sink. A barrier is a condition that prevents further analysis to determine if a sink is reachable and is typically used to exclude data that is sanitized or validated.
A solution can be found in the query exercise9.ql