|
| 1 | +\ingroup module_hidden |
| 2 | +\defgroup module_ansi-c ANSI-C Language Front-end |
| 3 | + |
| 4 | +\author Kareem Khazem |
| 5 | + |
| 6 | +\section preprocessing Preprocessing & Parsing |
| 7 | + |
| 8 | +In the \ref ansi-c and \ref java_bytecode directories |
| 9 | + |
| 10 | +**Key classes:** |
| 11 | +* \ref languaget and its subclasses |
| 12 | +* ansi_c_parse_treet |
| 13 | + |
| 14 | +\dot |
| 15 | +digraph G { |
| 16 | + node [shape=box]; |
| 17 | + rankdir="LR"; |
| 18 | + 1 [shape=none, label=""]; |
| 19 | + 2 [label="preprocessing & parsing"]; |
| 20 | + 3 [shape=none, label=""]; |
| 21 | + 1 -> 2 [label="Command line options, file names"]; |
| 22 | + 2 -> 3 [label="Parse tree"]; |
| 23 | +} |
| 24 | +\enddot |
| 25 | + |
| 26 | + |
| 27 | + |
| 28 | +--- |
| 29 | +\section type-checking Type-checking |
| 30 | + |
| 31 | +In the \ref ansi-c and \ref java_bytecode directories. |
| 32 | + |
| 33 | +**Key classes:** |
| 34 | +* \ref languaget and its subclasses |
| 35 | +* \ref irept |
| 36 | +* \ref irep_idt |
| 37 | +* \ref symbolt |
| 38 | +* symbol_tablet |
| 39 | + |
| 40 | +\dot |
| 41 | +digraph G { |
| 42 | + node [shape=box]; |
| 43 | + rankdir="LR"; |
| 44 | + 1 [shape=none, label=""]; |
| 45 | + 2 [label="type checking"]; |
| 46 | + 3 [shape=none, label=""]; |
| 47 | + 1 -> 2 [label="Parse tree"]; |
| 48 | + 2 -> 3 [label="Symbol table"]; |
| 49 | +} |
| 50 | +\enddot |
| 51 | + |
| 52 | +This stage generates a symbol table, mapping identifiers to symbols; |
| 53 | +\ref symbolt "symbols" are tuples of (value, type, location, flags). |
| 54 | + |
| 55 | +This is a good point to introduce the \ref irept ("internal |
| 56 | +representation") class---the base type of many of CBMC's hierarchical |
| 57 | +data structures. In particular, \ref exprt "expressions", |
| 58 | +\ref typet "types" and \ref codet "statements" are all subtypes of |
| 59 | +\ref irept. |
| 60 | +An irep is a tree of ireps. A subtlety is that an irep is actually the |
| 61 | +root of _three_ (possibly empty) trees, i.e. it has three disjoint sets |
| 62 | +of children: \ref irept::get_sub() returns a list of children, and |
| 63 | +\ref irept::get_named_sub() and \ref irept::get_comments() each return an |
| 64 | +association from names to children. **Most clients never use these |
| 65 | +functions directly**, as subtypes of irept generally provide more |
| 66 | +descriptive functions. For example, the operands of an |
| 67 | +\ref exprt "expression" (\ref exprt::op0() "op0", op1 etc) are |
| 68 | +really that expression's children; the |
| 69 | +\ref code_assignt::lhs() "left-hand" and right-hand side of an |
| 70 | +\ref code_assignt "assignment" are the children of that assignment. |
| 71 | +The \ref irept::pretty() function provides a descriptive string |
| 72 | +representation of an irep. |
| 73 | + |
| 74 | +\ref irep_idt "irep_idts" ("identifiers") are strings that use sharing |
| 75 | +to improve memory consumption. A common pattern is a map from irep_idts |
| 76 | +to ireps. A goto-program contains a single symbol table (with a single |
| 77 | +scope), meaning that the names of identifiers in the target program are |
| 78 | +lightly mangled in order to make them globally unique. If there is an |
| 79 | +identifier `foo` in the target program, the `name` field of `foo`'s |
| 80 | +\ref symbolt "symbol" in the goto-program will be |
| 81 | +* `foo` if it is global; |
| 82 | +* <code>bar\::foo</code> if it is a parameter to a function `bar()`; |
| 83 | +* <code>bar\::3\::foo</code> if it is a local variable in a function |
| 84 | + `bar()`, where `3` is a counter that is incremented every time a |
| 85 | + newly-scoped `foo` is encountered in that function. |
| 86 | + |
| 87 | +The use of *sharing* to save memory is a pervasive design decision in |
| 88 | +the implementation of ireps and identifiers. Sharing makes equality |
| 89 | +comparisons fast (as there is no need to traverse entire trees), and |
| 90 | +this is especially important given the large number of map lookups |
| 91 | +throughout the codebase. More importantly, the use of sharing saves vast |
| 92 | +amounts of memory, as there is plenty of duplication within the |
| 93 | +goto-program data structures. For example, every statement, and every |
| 94 | +sub-expression of a statement, contains a \ref source_locationt |
| 95 | +that indicates the source file and location that it came from. Every |
| 96 | +symbol in every expression has a field indicating its type and location; |
| 97 | +etc. Although each of these are constructed as separate objects, the |
| 98 | +values that they eventually point to are shared throughout the codebase, |
| 99 | +decreasing memory consumption dramatically. |
| 100 | + |
| 101 | +The Type Checking stage turns a parse tree into a |
| 102 | +\ref symbol_tablet "symbol table". In this context, the 'symbols' |
| 103 | +consist of code statements as well as what might more traditionally be |
| 104 | +called symbols. Thus, for example: |
| 105 | +* The statement `int foo = 11;` is converted into a symbol whose type is |
| 106 | + integer_typet and value is the \ref constant_exprt |
| 107 | + "constant expression" `11`; that symbol is stored in the symbol table |
| 108 | + using the mangled name of `foo` as the key; |
| 109 | +* The function definition `void foo(){ int x = 11; bar(); }` is |
| 110 | + converted into a symbol whose type is \ref code_typet (not to be |
| 111 | + confused with \ref typet or \ref codet!); the code_typet contains the |
| 112 | + parameter and return types of the function. The value of the symbol is |
| 113 | + the function's body (a \ref codet), and the symbol is stored in the |
| 114 | + symbol table with `foo` as the key. |
0 commit comments