Skip to content

Commit 0c11a5e

Browse files
committed
Documentation of GOTO model (de)serialisation.
1 parent 401a46c commit 0c11a5e

File tree

2 files changed

+226
-0
lines changed

2 files changed

+226
-0
lines changed

src/goto-programs/README.md

Lines changed: 178 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -329,3 +329,181 @@ This stage concludes the *analysis-independent* program transformations.
329329
\subsubsection goto-program-example-1-section Unsigned mult (unsigned a, unsigned b) { int acc, i; for (i = 0; i < b; i++) acc += a; return acc; }
330330

331331
To be documented.
332+
333+
334+
\section section-goto-binary Binary Represenration
335+
336+
An instance of `goto_modelt` can be serialised to a binary stream (which is
337+
typically a file on the disk), and later deserialised from that stream back to
338+
an equivalent `goto_modelt` instance.
339+
340+
\subsection subsection-goto-binary-serialisation Serialisation
341+
342+
The serialisation is implemented in C++ modules:
343+
- `write_goto_binary.h`
344+
- `write_goto_binary.cpp`
345+
346+
To serialise a `goto_modelt` instance `gm` to a stream `ostr` call the function `write_goto_binary`, e.g. `write_goto_binary(ostr, gm)`.
347+
348+
The content of the written stream will have this strucutre:
349+
- The header:
350+
- A magic number: byte `0x7f` followed by 3 characters `GBF`.
351+
- A version number written in the 7-bit encoding. Currently, only version `4` is supported.
352+
- The symbol table:
353+
- The number of symbols in the table in the 7-bit encoding.
354+
- The array of individual symbols in the table. Each written symbol `s` has this structure:
355+
- The `irept` instance `s.type`.
356+
- The `irept` instance `s.value`.
357+
- The `irept` instance `s.location`.
358+
- The string `s.name`.
359+
- The string `s.module`.
360+
- The string `s.base_name`.
361+
- The string `s.mode`.
362+
- The string `s.pretty_name`.
363+
- The number `0` in the 7-bit encoding.
364+
- The flags word in the 7-bit encoding. The bits in the flags word correspond to the following `bool` fields (from the most significant bit):
365+
- `s.is_weak`
366+
- `s.is_type`
367+
- `s.is_property`
368+
- `s.is_macro`
369+
- `s.is_exported`
370+
- `s.is_input`
371+
- `s.is_output`
372+
- `s.is_state_var`
373+
- `s.is_parameter`
374+
- `s.is_auxiliary`
375+
- `0` (corresponding to `s.binding`, i.e. we always clear this info)
376+
- `s.is_lvalue`
377+
- `s.is_static_lifetime`
378+
- `s.is_thread_local`
379+
- `s.is_file_local`
380+
- `s.is_extern`
381+
- `s.is_volatile`
382+
- The functions with bodies, i.e. those missing a body are skipped.
383+
- The number of functions with bodies in the 7-bit encoding.
384+
- The array of individual function with bodies. Each written function has this structure:
385+
- The string with the name of the function.
386+
- The number of instructions in the body of the function in the 7-bit encoding.
387+
- The array of individual instruction in function's body. Each written instruction `I` has this structure:
388+
- The `irept` instance `I.code`, i.e. data of the instruction, like arguments.
389+
- The string `I.function`, i.e. the name of the function this instruction belongs to.
390+
- The `irept` instance `I.source_location`, i.e. the reference to the original source code (file, line).
391+
- The word in the 7-bit encoding `I.type`, i.e. the op-code of the instruction.
392+
- The `irept` instance `I.guard`.
393+
- The empty string (representing former `I.event`).
394+
- The word in the 7-bit encoding `I.target_number`, i.e. the jump target to this instruction from other instructions.
395+
- The word in the 7-bit encoding `I.targets.size()`, i.e. the count of jump targets from this instruction.
396+
- The array of individual jump targets from this instruction, each written as a word in the 7-bit encoding.
397+
- The word in the 7-bit encoding `I.labels.size()`.
398+
- The array of individual labels, each written as a word in the 7-bit encoding.
399+
400+
An important propery of the serialisation is that each serialised `irept`
401+
instance occurs in the stream exactly once. Namely, in the position of
402+
its first serialisation query. All other such queries save only hash
403+
code (i.e. reference) of the `irept` instance.
404+
405+
The similar strategy is used for serialisation of string constants
406+
shared amongst `irept` instances. Such a string is fully saved only in
407+
the first serialisation query of an `irept` instance it appears in and
408+
all other queries only save its integer has code.
409+
410+
Details about serialisation of `irept` instances, strings, and words in
411+
7-bit encoding can be found [here](\ref irep-serialization).
412+
413+
\subsection subsection-goto-binary-deserialisation Deserialisation
414+
415+
The deserialisation is implemented in C++ modules:
416+
- `read_goto_binary.h`
417+
- `read_goto_binary.cpp`
418+
- `read_bin_goto_object.h`
419+
- `read_bin_goto_object.cpp`
420+
421+
First two modules are responsible for location of the stream with the
422+
serialised data within a passed file. And the remaining two modules
423+
perform the actual deserialisation of `goto_modelt` instance from
424+
the located stream.
425+
426+
To deserialise a `goto_modelt` instance `gm` from a file `/some/path/name.gbf`
427+
call the function `read_goto_binary`, e.g.
428+
`read_goto_binary("/some/path/name.gbf", gm, message_handler)`, where
429+
`message_handler` must be an instance of `message_handlert` and serves
430+
for reporting issues during the process.
431+
432+
The passed binary file is assumed to have the same structure as described in
433+
the [previous subsection](\ref subsection-goto-binary-serialisation).
434+
The process of the deserialisation does not involve any seeking in the file.
435+
The content is read linearly from the beginning to the end. `irept` instances
436+
and their string constats are deserialised into the memory only once at their
437+
first occurreces in the stream. All subsequent deserialisation queries are
438+
resolved as in-memory references to the `irept` instances and/or strings.
439+
440+
NOTE: The first deserialisation is detected so that the loaded hash code
441+
is new. That implies that the full definition follows right after the hash.
442+
443+
Details about serialisation of `irept` instances, strings, and words in
444+
7-bit encoding can be found [here](\ref irep-serialization).
445+
446+
\subsubsection subsection-goto-binary-deserialisation-from-elf Deserialisation from ELF image
447+
448+
One can decide to store the serialised stream as a separate section, named
449+
`goto-cc`, into an ELF image. Then the deserialisation has a support of
450+
automatic detection of that section in an ELF file and the deserialisation
451+
will be automatically started from that section.
452+
453+
For reading the ELF images there is used an instance of `elf_readert`
454+
implemented in the C++ module:
455+
- `elf_reader.h`
456+
- `elf_reader.cpp`
457+
458+
\subsubsection subsection-goto-binary-deserialisation-from-mac-o-fat-image Deserialisation from Mac-O fat image
459+
460+
One can decide to store the serialised stream into Mac-O fat image as a
461+
separate non-empty section with flags `CPU_TYPE_HPPA` and
462+
`CPU_SUBTYPE_HPPA_7100LC`. Then the deserialisation has a support of
463+
automatic detection of that section in a Mac-O fat image, extraction
464+
of the section from the emage into a temporary file (this is done by
465+
calling `lipo` utility with `-thin hppa7100LC` option), and the
466+
deserialisation will be automatically started from that temporary
467+
file.
468+
469+
For reading the Mac-O fat images there is used an instance of
470+
`osx_fat_readert` implemented in the C++ module:
471+
- `osx_fat_reader.h`
472+
- `osx_fat_reader.cpp`
473+
474+
NOTE: This functionality is available only when the modules are built
475+
on a MacOS machine.
476+
477+
\subsubsection subsection-goto-binary-is-binary-file Checking file type
478+
479+
You can use function `is_goto_binary` to check whether a passed file contains
480+
a sederialised `goto_modelt` instance or not. This is done by checking the
481+
magic number of the stream (see subsection
482+
[Serialisation](\ref subsection-goto-binary-serialisation)). However, in the
483+
case when the deserialised data were stored into ELF or Mac-O fat image, then
484+
only the check for presence of the concrete section in the image is performed.
485+
486+
\subsubsection subsection-goto-binary-deserialisation-linking Linking Goto Models
487+
488+
Similar to linking of object files together by C/C++ linker, the module
489+
provides linking of a dereserialised `goto_modelt` instance into a given
490+
(i.e. previously deserialised or otherwise created) `goto_modelt` instance.
491+
492+
This is implemented in function `read_object_and_link`. The function first
493+
deserialises the passed file into a temporary `goto_modelt` instance, and
494+
then it performs 'linking' of the temporary into a passed destination
495+
`goto_modelt` instance.
496+
497+
Details about linking of `goto_modelt` instances can be found
498+
[here](\ref section-linking-goto-models).
499+
500+
501+
\section section-linking-goto-models Linking Goto Models
502+
503+
C++ modules:
504+
- `link_goto_model.h`
505+
- `link_goto_model.cpp`
506+
507+
Dependencies:
508+
- [linking folder](\ref linking).
509+
- [typecheck](\ref section-goto-typecheck).

src/util/README.md

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -281,3 +281,51 @@ To be documented.
281281
\subsubsection ast-example-3-section Java arrays: struct Array { int length, int *data };
282282

283283
To be documented.
284+
285+
\subsection section-goto-typecheck Goto Model Typecheck
286+
287+
Class `typecheckt`.
288+
289+
\subsection irep-serialization `irept` Serialization
290+
291+
The module provides serialisation and deserialisation of
292+
integer numbers, strings, and `irept` instances.
293+
294+
This is implemented in C++ modules:
295+
- `irep_serialization.h`
296+
- `irep_serialization.cpp`
297+
- `irep_hash_container.h` (applies only to `irept` instances)
298+
- `irep_hash_container.cpp` (applies only to `irept` instances)
299+
300+
\subsubsection irep-serialization-numbers Serialization of Numbers
301+
302+
A number is serialiased in 7-bit encoding. For example. given a 2-byte
303+
number in base 2, like `10101010 01010101`, then it is save in 3 bytes,
304+
where each byte takes only 7 bits from the number, reading from the
305+
left. The 8th bit in each output byte is set to 1 except in the last
306+
byte, where the bit is 0. That 0 bit indicates the end of the end
307+
encoding of the number. So, the output bytes look like this:
308+
`11010101 11010100 00000010`.
309+
310+
This is implmented in the function `write_gb_word`.
311+
312+
The deserialisation does the oposite process and it is implemented in
313+
`irep_serializationt::read_gb_word`.
314+
315+
\subsubsection irep-serialization-strings Serialization of Strings
316+
317+
A string is encoded as classic 0-terminated C string. However,
318+
characters `0` and `\\` are escaped by writing additional `\\`
319+
before them.
320+
321+
This is implmented in the function `write_gb_string` and the
322+
deserialisation is implemented in `irep_serializationt::read_gb_string`.
323+
324+
Each string which is stored inside an `irept` instance is saved (meaining
325+
its characters) into the ouptut stream, only in the first serialisation
326+
query of the string. In that case, before the string there is also saved
327+
a computed integer hash code of the string. Then, all subsequent
328+
serialisation queries save only that integer hash code of the string.
329+
330+
\subsubsection irep-serialization-ireps Serialization of `irept` Instances
331+

0 commit comments

Comments
 (0)