Skip to content

Commit 5003672

Browse files
committed
Documentation of GOTO model (de)serialisation.
1 parent dbd6988 commit 5003672

File tree

2 files changed

+226
-0
lines changed

2 files changed

+226
-0
lines changed

src/goto-programs/README.md

Lines changed: 178 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -377,3 +377,181 @@ This stage concludes the *analysis-independent* program transformations.
377377
\subsubsection goto-program-example-1-section Unsigned mult (unsigned a, unsigned b) { int acc, i; for (i = 0; i < b; i++) acc += a; return acc; }
378378

379379
To be documented.
380+
381+
382+
\section section-goto-binary Binary Represenration
383+
384+
An instance of `goto_modelt` can be serialised to a binary stream (which is
385+
typically a file on the disk), and later deserialised from that stream back to
386+
an equivalent `goto_modelt` instance.
387+
388+
\subsection subsection-goto-binary-serialisation Serialisation
389+
390+
The serialisation is implemented in C++ modules:
391+
- `write_goto_binary.h`
392+
- `write_goto_binary.cpp`
393+
394+
To serialise a `goto_modelt` instance `gm` to a stream `ostr` call the function `write_goto_binary`, e.g. `write_goto_binary(ostr, gm)`.
395+
396+
The content of the written stream will have this strucutre:
397+
- The header:
398+
- A magic number: byte `0x7f` followed by 3 characters `GBF`.
399+
- A version number written in the 7-bit encoding. Currently, only version `4` is supported.
400+
- The symbol table:
401+
- The number of symbols in the table in the 7-bit encoding.
402+
- The array of individual symbols in the table. Each written symbol `s` has this structure:
403+
- The `irept` instance `s.type`.
404+
- The `irept` instance `s.value`.
405+
- The `irept` instance `s.location`.
406+
- The string `s.name`.
407+
- The string `s.module`.
408+
- The string `s.base_name`.
409+
- The string `s.mode`.
410+
- The string `s.pretty_name`.
411+
- The number `0` in the 7-bit encoding.
412+
- The flags word in the 7-bit encoding. The bits in the flags word correspond to the following `bool` fields (from the most significant bit):
413+
- `s.is_weak`
414+
- `s.is_type`
415+
- `s.is_property`
416+
- `s.is_macro`
417+
- `s.is_exported`
418+
- `s.is_input`
419+
- `s.is_output`
420+
- `s.is_state_var`
421+
- `s.is_parameter`
422+
- `s.is_auxiliary`
423+
- `0` (corresponding to `s.binding`, i.e. we always clear this info)
424+
- `s.is_lvalue`
425+
- `s.is_static_lifetime`
426+
- `s.is_thread_local`
427+
- `s.is_file_local`
428+
- `s.is_extern`
429+
- `s.is_volatile`
430+
- The functions with bodies, i.e. those missing a body are skipped.
431+
- The number of functions with bodies in the 7-bit encoding.
432+
- The array of individual function with bodies. Each written function has this structure:
433+
- The string with the name of the function.
434+
- The number of instructions in the body of the function in the 7-bit encoding.
435+
- The array of individual instruction in function's body. Each written instruction `I` has this structure:
436+
- The `irept` instance `I.code`, i.e. data of the instruction, like arguments.
437+
- The string `I.function`, i.e. the name of the function this instruction belongs to.
438+
- The `irept` instance `I.source_location`, i.e. the reference to the original source code (file, line).
439+
- The word in the 7-bit encoding `I.type`, i.e. the op-code of the instruction.
440+
- The `irept` instance `I.guard`.
441+
- The empty string (representing former `I.event`).
442+
- The word in the 7-bit encoding `I.target_number`, i.e. the jump target to this instruction from other instructions.
443+
- The word in the 7-bit encoding `I.targets.size()`, i.e. the count of jump targets from this instruction.
444+
- The array of individual jump targets from this instruction, each written as a word in the 7-bit encoding.
445+
- The word in the 7-bit encoding `I.labels.size()`.
446+
- The array of individual labels, each written as a word in the 7-bit encoding.
447+
448+
An important propery of the serialisation is that each serialised `irept`
449+
instance occurs in the stream exactly once. Namely, in the position of
450+
its first serialisation query. All other such queries save only hash
451+
code (i.e. reference) of the `irept` instance.
452+
453+
The similar strategy is used for serialisation of string constants
454+
shared amongst `irept` instances. Such a string is fully saved only in
455+
the first serialisation query of an `irept` instance it appears in and
456+
all other queries only save its integer has code.
457+
458+
Details about serialisation of `irept` instances, strings, and words in
459+
7-bit encoding can be found [here](\ref irep-serialization).
460+
461+
\subsection subsection-goto-binary-deserialisation Deserialisation
462+
463+
The deserialisation is implemented in C++ modules:
464+
- `read_goto_binary.h`
465+
- `read_goto_binary.cpp`
466+
- `read_bin_goto_object.h`
467+
- `read_bin_goto_object.cpp`
468+
469+
First two modules are responsible for location of the stream with the
470+
serialised data within a passed file. And the remaining two modules
471+
perform the actual deserialisation of `goto_modelt` instance from
472+
the located stream.
473+
474+
To deserialise a `goto_modelt` instance `gm` from a file `/some/path/name.gbf`
475+
call the function `read_goto_binary`, e.g.
476+
`read_goto_binary("/some/path/name.gbf", gm, message_handler)`, where
477+
`message_handler` must be an instance of `message_handlert` and serves
478+
for reporting issues during the process.
479+
480+
The passed binary file is assumed to have the same structure as described in
481+
the [previous subsection](\ref subsection-goto-binary-serialisation).
482+
The process of the deserialisation does not involve any seeking in the file.
483+
The content is read linearly from the beginning to the end. `irept` instances
484+
and their string constats are deserialised into the memory only once at their
485+
first occurreces in the stream. All subsequent deserialisation queries are
486+
resolved as in-memory references to the `irept` instances and/or strings.
487+
488+
NOTE: The first deserialisation is detected so that the loaded hash code
489+
is new. That implies that the full definition follows right after the hash.
490+
491+
Details about serialisation of `irept` instances, strings, and words in
492+
7-bit encoding can be found [here](\ref irep-serialization).
493+
494+
\subsubsection subsection-goto-binary-deserialisation-from-elf Deserialisation from ELF image
495+
496+
One can decide to store the serialised stream as a separate section, named
497+
`goto-cc`, into an ELF image. Then the deserialisation has a support of
498+
automatic detection of that section in an ELF file and the deserialisation
499+
will be automatically started from that section.
500+
501+
For reading the ELF images there is used an instance of `elf_readert`
502+
implemented in the C++ module:
503+
- `elf_reader.h`
504+
- `elf_reader.cpp`
505+
506+
\subsubsection subsection-goto-binary-deserialisation-from-mac-o-fat-image Deserialisation from Mac-O fat image
507+
508+
One can decide to store the serialised stream into Mac-O fat image as a
509+
separate non-empty section with flags `CPU_TYPE_HPPA` and
510+
`CPU_SUBTYPE_HPPA_7100LC`. Then the deserialisation has a support of
511+
automatic detection of that section in a Mac-O fat image, extraction
512+
of the section from the emage into a temporary file (this is done by
513+
calling `lipo` utility with `-thin hppa7100LC` option), and the
514+
deserialisation will be automatically started from that temporary
515+
file.
516+
517+
For reading the Mac-O fat images there is used an instance of
518+
`osx_fat_readert` implemented in the C++ module:
519+
- `osx_fat_reader.h`
520+
- `osx_fat_reader.cpp`
521+
522+
NOTE: This functionality is available only when the modules are built
523+
on a MacOS machine.
524+
525+
\subsubsection subsection-goto-binary-is-binary-file Checking file type
526+
527+
You can use function `is_goto_binary` to check whether a passed file contains
528+
a sederialised `goto_modelt` instance or not. This is done by checking the
529+
magic number of the stream (see subsection
530+
[Serialisation](\ref subsection-goto-binary-serialisation)). However, in the
531+
case when the deserialised data were stored into ELF or Mac-O fat image, then
532+
only the check for presence of the concrete section in the image is performed.
533+
534+
\subsubsection subsection-goto-binary-deserialisation-linking Linking Goto Models
535+
536+
Similar to linking of object files together by C/C++ linker, the module
537+
provides linking of a dereserialised `goto_modelt` instance into a given
538+
(i.e. previously deserialised or otherwise created) `goto_modelt` instance.
539+
540+
This is implemented in function `read_object_and_link`. The function first
541+
deserialises the passed file into a temporary `goto_modelt` instance, and
542+
then it performs 'linking' of the temporary into a passed destination
543+
`goto_modelt` instance.
544+
545+
Details about linking of `goto_modelt` instances can be found
546+
[here](\ref section-linking-goto-models).
547+
548+
549+
\section section-linking-goto-models Linking Goto Models
550+
551+
C++ modules:
552+
- `link_goto_model.h`
553+
- `link_goto_model.cpp`
554+
555+
Dependencies:
556+
- [linking folder](\ref linking).
557+
- [typecheck](\ref section-goto-typecheck).

src/util/README.md

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -144,3 +144,51 @@ To be documented.
144144
\subsubsection ast-example-3-section Java arrays: struct Array { int length, int *data };
145145

146146
To be documented.
147+
148+
\subsection section-goto-typecheck Goto Model Typecheck
149+
150+
Class `typecheckt`.
151+
152+
\subsection irep-serialization `irept` Serialization
153+
154+
The module provides serialisation and deserialisation of
155+
integer numbers, strings, and `irept` instances.
156+
157+
This is implemented in C++ modules:
158+
- `irep_serialization.h`
159+
- `irep_serialization.cpp`
160+
- `irep_hash_container.h` (applies only to `irept` instances)
161+
- `irep_hash_container.cpp` (applies only to `irept` instances)
162+
163+
\subsubsection irep-serialization-numbers Serialization of Numbers
164+
165+
A number is serialiased in 7-bit encoding. For example. given a 2-byte
166+
number in base 2, like `10101010 01010101`, then it is save in 3 bytes,
167+
where each byte takes only 7 bits from the number, reading from the
168+
left. The 8th bit in each output byte is set to 1 except in the last
169+
byte, where the bit is 0. That 0 bit indicates the end of the end
170+
encoding of the number. So, the output bytes look like this:
171+
`11010101 11010100 00000010`.
172+
173+
This is implmented in the function `write_gb_word`.
174+
175+
The deserialisation does the oposite process and it is implemented in
176+
`irep_serializationt::read_gb_word`.
177+
178+
\subsubsection irep-serialization-strings Serialization of Strings
179+
180+
A string is encoded as classic 0-terminated C string. However,
181+
characters `0` and `\\` are escaped by writing additional `\\`
182+
before them.
183+
184+
This is implmented in the function `write_gb_string` and the
185+
deserialisation is implemented in `irep_serializationt::read_gb_string`.
186+
187+
Each string which is stored inside an `irept` instance is saved (meaining
188+
its characters) into the ouptut stream, only in the first serialisation
189+
query of the string. In that case, before the string there is also saved
190+
a computed integer hash code of the string. Then, all subsequent
191+
serialisation queries save only that integer hash code of the string.
192+
193+
\subsubsection irep-serialization-ireps Serialization of `irept` Instances
194+

0 commit comments

Comments
 (0)