diff --git a/doc/architectural/front-page.md b/doc/architectural/front-page.md
Lines changed: 13 additions & 2 deletions
diff --git a/src/ansi-c/module.md b/src/ansi-c/module.md
Lines changed: 114 additions & 0 deletions
diff --git a/src/cbmc/module.md b/src/cbmc/module.md
Lines changed: 47 additions & 0 deletions
@@ -1,8 +1,11 @@
 CProver Documentation
 =====================
 
-These pages contain both user tutorials and automatically-generated API
-documentation. Users can download CProver tools from the
+\author Kareem Khazem
+
+These pages contain user tutorials, automatically-generated API
+documentation, and higher-level architectural overviews for the
+CProver codebase. Users can download CProver tools from the
 <a href="http://www.cprover.org/">CProver website</a>; contributors
 should use the <a href="https://github.com/diffblue/cbmc">repository</a>
 hosted on GitHub.
@@ -21,4 +24,12 @@ hosted on GitHub.
   members in the search bar at top-right or use one of the links in the
   sidebar.
 
+* For higher-level architectural information, each of the pages under
+  the "Modules" link in the sidebar gives an overview of a directory in
+  the CProver codebase.
+
+* The \ref module_cbmc "CBMC guided tour" is a good start for new
+  contributors to CBMC. It describes the stages through which CBMC
+  transforms source files into bug reports and counterexamples, linking
+  to the relevant documentation for each stage.
 \defgroup module_hidden _hidden
@@ -0,0 +1,114 @@
+\ingroup module_hidden
+\defgroup module_ansi-c ANSI-C Language Front-end
+
+\author Kareem Khazem
+
+\section preprocessing Preprocessing & Parsing
+
+In the \ref ansi-c and \ref java_bytecode directories.
+
+**Key classes:**
+* \ref languaget and its subclasses
+* ansi_c_parse_treet
+
+\dot
+digraph G {
+  node [shape=box];
+  rankdir="LR";
+  1 [shape=none, label=""];
+  2 [label="preprocessing & parsing"];
+  3 [shape=none, label=""];
+  1 -> 2 [label="Command line options, file names"];
+  2 -> 3 [label="Parse tree"];
+}
+\enddot
+
+
+
+---
+\section type-checking Type-checking
+
+In the \ref ansi-c and \ref java_bytecode directories.
+
+**Key classes:**
+* \ref languaget and its subclasses
+* \ref irept
+* \ref irep_idt
+* \ref symbolt
+* symbol_tablet
+
+\dot
+digraph G {
+  node [shape=box];
+  rankdir="LR";
+  1 [shape=none, label=""];
+  2 [label="type checking"];
+  3 [shape=none, label=""];
+  1 -> 2 [label="Parse tree"];
+  2 -> 3 [label="Symbol table"];
+}
+\enddot
+
+This stage generates a symbol table, mapping identifiers to symbols;
+\ref symbolt "symbols" are tuples of (value, type, location, flags).
+
+This is a good point to introduce the \ref irept ("internal
+representation") class---the base type of many of CBMC's hierarchical
+data structures. In particular, \ref exprt "expressions",
+\ref typet "types" and \ref codet "statements" are all subtypes of
+\ref irept.
+An irep is a tree of ireps. A subtlety is that an irep is actually the
+root of _three_ (possibly empty) trees, i.e. it has three disjoint sets
+of children: \ref irept::get_sub() returns a list of children, and
+\ref irept::get_named_sub() and \ref irept::get_comments() each return an
+association from names to children. **Most clients never use these
+functions directly**, as subtypes of irept generally provide more
+descriptive functions. For example, the operands of an
+\ref exprt "expression" (\ref exprt::op0() "op0", op1, etc.) are
+really that expression's children; the
+\ref code_assignt::lhs() "left-hand" and right-hand side of an
+\ref code_assignt "assignment" are the children of that assignment.
+The \ref irept::pretty() function provides a descriptive string
+representation of an irep.
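
The three sets of children can be pictured with a small toy model. This is only a sketch under stated assumptions: `toy_irept`, `toy_assignt`, `leaf`, `make_assignment` and this `pretty` are hypothetical names invented for illustration, and the real \ref irept differs in its storage and in the output format of \ref irept::pretty().

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for an irep (hypothetical, NOT the real irept class).
struct toy_irept;
using toy_irep_ptr = std::shared_ptr<const toy_irept>;

struct toy_irept
{
  std::string id;                                 // node label, e.g. "assign"
  std::vector<toy_irep_ptr> sub;                  // ordered, unnamed children
  std::map<std::string, toy_irep_ptr> named_sub;  // children found by name
  std::map<std::string, toy_irep_ptr> comments;   // annotations, also by name
};

// Builds a node with no children at all.
toy_irep_ptr leaf(const std::string &id)
{
  return std::make_shared<toy_irept>(toy_irept{id, {}, {}, {}});
}

// Builds an "assign" node whose two unnamed children are the operands.
toy_irep_ptr make_assignment(const toy_irep_ptr &lhs, const toy_irep_ptr &rhs)
{
  assert(lhs && rhs);
  return std::make_shared<toy_irept>(toy_irept{"assign", {lhs, rhs}, {}, {}});
}

// Hypothetical wrapper in the spirit of code_assignt: it gives the raw
// children descriptive names so clients need not index `sub` themselves.
struct toy_assignt
{
  toy_irep_ptr node;
  const toy_irept &lhs() const { return *node->sub.at(0); }
  const toy_irept &rhs() const { return *node->sub.at(1); }
};

// Minimal recursive printer over the unnamed children only.
std::string pretty(const toy_irept &node)
{
  std::string out = node.id;
  if(node.sub.empty())
    return out;
  out += "(";
  for(std::size_t i = 0; i < node.sub.size(); ++i)
  {
    if(i != 0)
      out += ", ";
    out += pretty(*node.sub[i]);
  }
  return out + ")";
}
```

With this model, `toy_assignt{make_assignment(leaf("x"), leaf("11"))}.lhs().id` yields `"x"` without the caller touching `sub` directly, which is the convenience the descriptive accessors are meant to provide.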
+
+\ref irep_idt "irep_idts" ("identifiers") are strings that use sharing
+to reduce memory consumption. A common pattern is a map from irep_idts
+to ireps. A goto-program contains a single symbol table (with a single
+scope), meaning that the names of identifiers in the target program are
+lightly mangled in order to make them globally unique. If there is an
+identifier `foo` in the target program, the `name` field of `foo`'s
+\ref symbolt "symbol" in the goto-program will be
+* `foo` if it is global;
+* <code>bar\::foo</code> if it is a parameter to a function `bar()`;
+* <code>bar\::3\::foo</code> if it is a local variable in a function
+  `bar()`, where `3` is a counter that is incremented every time a
+  newly-scoped `foo` is encountered in that function.
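
The three naming cases above can be sketched as a single function. The helper `mangle` and the enum `scope_kindt` are hypothetical and exist only for this illustration; CBMC's front-ends build these names during type-checking rather than through a function like this one.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification of where an identifier was declared.
enum class scope_kindt
{
  GLOBAL,
  PARAMETER,
  LOCAL
};

// Returns the globally unique name for `base_name`, following the three
// cases listed above. `function` and `scope_counter` are ignored for globals.
std::string mangle(
  const std::string &base_name,
  scope_kindt kind,
  const std::string &function = "",
  unsigned scope_counter = 0)
{
  // Parameters and locals only make sense inside a named function.
  assert(kind == scope_kindt::GLOBAL || !function.empty());

  switch(kind)
  {
  case scope_kindt::GLOBAL:
    return base_name;
  case scope_kindt::PARAMETER:
    return function + "::" + base_name;
  case scope_kindt::LOCAL:
    return function + "::" + std::to_string(scope_counter) + "::" + base_name;
  }
  return base_name; // not reached for valid enum values
}
```

For example, `mangle("foo", scope_kindt::LOCAL, "bar", 3)` produces `bar::3::foo`, matching the last case in the list.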
+
+The use of *sharing* to save memory is a pervasive design decision in
+the implementation of ireps and identifiers. Sharing makes equality
+comparisons fast (as there is no need to traverse entire trees), and
+this is especially important given the large number of map lookups
+throughout the codebase. More importantly, the use of sharing saves vast
+amounts of memory, as there is plenty of duplication within the
+goto-program data structures. For example, every statement, and every
+sub-expression of a statement, contains a \ref source_locationt
+that indicates the source file and location that it came from. Every
+symbol in every expression has a field indicating its type and location;
+etc. Although each of these is constructed as a separate object, the
+values that they eventually point to are shared throughout the codebase,
+decreasing memory consumption dramatically.
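
One common way to obtain both benefits for strings is interning: each distinct string is stored once and referred to by a small handle. The sketch below shows that general technique only. `toy_string_tablet` and `toy_idt` are hypothetical names, and nothing here is taken from CBMC's actual implementation of \ref irep_idt.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical interning table: every distinct string is stored exactly once.
class toy_string_tablet
{
public:
  // Returns the index of `s`, storing it only the first time it is seen.
  std::size_t intern(const std::string &s)
  {
    const auto it = index_.find(s);
    if(it != index_.end())
      return it->second;
    strings_.push_back(s);
    index_.emplace(s, strings_.size() - 1);
    return strings_.size() - 1;
  }

  // Recovers the text for a previously returned index.
  const std::string &lookup(std::size_t no) const
  {
    return strings_.at(no);
  }

  // Number of distinct strings stored so far.
  std::size_t size() const
  {
    return strings_.size();
  }

private:
  std::vector<std::string> strings_;
  std::unordered_map<std::string, std::size_t> index_;
};

// Hypothetical identifier handle: comparing two handles compares two
// integers, regardless of how long the underlying strings are.
struct toy_idt
{
  std::size_t no;
  bool operator==(const toy_idt &other) const
  {
    return no == other.no;
  }
};
```

Interning `"main.c"` a thousand times through the same table leaves `size()` at 1, and every resulting handle compares equal in constant time.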
+
+The type-checking stage turns a parse tree into a
+\ref symbol_tablet "symbol table". In this context, the 'symbols'
+consist of code statements as well as what might more traditionally be
+called symbols. Thus, for example:
+* The statement `int foo = 11;` is converted into a symbol whose type is
+  integer_typet and value is the \ref constant_exprt
+  "constant expression" `11`; that symbol is stored in the symbol table
+  using the mangled name of `foo` as the key;
+* The function definition `void foo(){ int x = 11; bar(); }` is
+  converted into a symbol whose type is \ref code_typet (not to be
+  confused with \ref typet or \ref codet!); the code_typet contains the
+  parameter and return types of the function. The value of the symbol is
+  the function's body (a \ref codet), and the symbol is stored in the
+  symbol table with `foo` as the key.
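
The two conversions above can be pictured with a toy table. This is a simplified sketch, not the real \ref symbolt or \ref symbol_tablet: `toy_symbolt`, `toy_symbol_tablet`, `add_symbol` and `example_table` are hypothetical, plain strings stand in for the irep-based type and value fields, and the function is renamed `baz` here (an assumption made purely so that both entries can live in one table).

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified stand-in for a symbol: strings replace the real irep fields.
struct toy_symbolt
{
  std::string name;   // mangled name, also used as the table key
  std::string type;   // e.g. "int", or a function signature
  std::string value;  // initialiser, or the function body
};

using toy_symbol_tablet = std::map<std::string, toy_symbolt>;

// Adds a symbol; returns false if the name is already taken.
bool add_symbol(toy_symbol_tablet &table, const toy_symbolt &symbol)
{
  return table.emplace(symbol.name, symbol).second;
}

// Builds a table holding one variable and one function.
toy_symbol_tablet example_table()
{
  toy_symbol_tablet table;
  // `int foo = 11;` becomes a symbol whose value is the constant 11.
  add_symbol(table, {"foo", "int", "11"});
  // A function becomes a symbol whose value is its body.
  add_symbol(table, {"baz", "void()", "{ int x = 11; bar(); }"});
  return table;
}
```

Looking up `"foo"` in the resulting table returns the variable's type and initialiser, while `"baz"` returns the signature and body, so statements and declarations are reached through the same lookup.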
@@ -0,0 +1,47 @@
+\ingroup module_hidden
+\defgroup module_cbmc CBMC tour
+
+\author Kareem Khazem
+
+CBMC takes C code or a goto-binary as input and tries to emit traces of
+executions that lead to crashes or undefined behaviour. The diagram
+below shows the intermediate steps in this process.
+
+
+\dot
+digraph G {
+
+  rankdir="TB";
+  node [shape=box, fontcolor=blue];
+
+  subgraph top {
+    rank=same;
+    1 -> 2 -> 3 -> 4;
+  }
+
+  subgraph bottom {
+    rank=same;
+    5 -> 6 -> 7 -> 8 -> 9;
+  }
+
+  /* shift bottom subgraph over */
+  9 -> 1 [color=white];
+
+  4 -> 5;
+
+  1 [label="command line\nparsing" URL="\ref cbmc_parse_optionst"];
+  2 [label="preprocessing,\nparsing" URL="\ref preprocessing"];
+  3 [label="language\ntype-checking" URL="\ref type-checking"];
+  4 [label="goto\nconversion" URL="\ref goto-conversion"];
+  5 [label="instrumentation" URL="\ref instrumentation"];
+  6 [label="symbolic\nexecution" URL="\ref symbolic-execution"];
+  7 [label="SAT/SMT\nencoding" URL="\ref sat-smt-encoding"];
+  8 [label="decision\nprocedure" URL="\ref decision-procedure"];
+  9 [label="counter example\nproduction" URL="\ref counter-example-production"];
+}
+\enddot
+
+The \ref cprover-manual "CProver Manual" describes CBMC from a user
+perspective. Each node in the diagram above links to the appropriate
+class or module documentation, describing that particular stage in the
+CBMC pipeline.