| 1 | +# The PDF Format |
| 2 | + |
| 3 | +This page gives only a very rough overview of the format. |
| 4 | +Consult the PDF specification for details and clarifications. |
| 5 | + |
| 6 | +## Overall Structure |
| 7 | + |
| 8 | +A PDF consists of: |
| 9 | + |
| 10 | +1. Header: Contains the version of the PDF, e.g. `%PDF-1.7` |
| 11 | +2. Body: Contains a sequence of indirect objects |
| 12 | +3. Cross-reference table (xref): Contains a list of the indirect objects in the body |
| 13 | +4. Trailer |
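
To make the four parts concrete, here is a minimal sketch that assembles them into one byte string. The function name `build_pdf` is hypothetical and not part of any library. The sketch assumes that the first object passed in is the catalog, that every object has generation 0, and that each keyword sits on its own line.

```python
from typing import List


def build_pdf(objects: List[bytes]) -> bytes:
    """Assemble header, body, xref table and trailer (illustrative only)."""
    # 1. Header
    out = bytearray(b"%PDF-1.7\n")

    # 2. Body: number the objects from 1 and remember where each one starts
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)

    # 3. Cross-reference table: object 0 is the free-list head,
    #    followed by one in-use entry per object
    xref_offset = len(out)
    size = len(objects) + 1
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    # 4. Trailer (assumption: object 1 is the catalog)
    out += b"trailer\n<< /Root 1 0 R /Size %d >>\n" % size
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

The value written after `startxref` is the byte offset at which the `xref` keyword begins, so a reader can jump there directly from the end of the file.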
| 14 | + |
| 15 | +## The xref table |
| 16 | + |
| 17 | +A cross-reference table (xref) is a table of the indirect objects in the body. |
| 18 | +It allows quick access to those objects by pointing to their location in the file. |
| 19 | + |
| 20 | +It looks like this: |
| 21 | + |
| 22 | +```text |
| 23 | +xref 42 5 |
| 24 | +0000001000 65535 f |
| 25 | +0000001234 00000 n |
| 26 | +0000001987 00000 n |
| 27 | +0000011987 00000 n |
| 28 | +0000031987 00000 n |
| 29 | +``` |
| 30 | + |
| 31 | +Let's go through it step-by-step: |
| 32 | + |
| 33 | +* `xref` is just a keyword that marks the start of the xref table. |
| 34 | +* `42` is the object number of the first entry in this section; `5` is the number of entries in the section. |
| 35 | +* Each entry has three fields, `nnnnnnnnnn ggggg n`: a 10-digit number, |
| 36 | + a 5-digit generation number, and a literal keyword which is either `n` or `f`. |
| 37 | + * `nnnnnnnnnn` is the byte offset of an in-use object. It tells the reader where |
| 38 | + the object is in the file. For a free entry, it is the object number of the next free object. |
| 39 | + * `ggggg` is the generation number. It tells the reader how often the object number has been freed. |
| 40 | + * `n` means that the object is a normal in-use object, `f` means that the object |
| 41 | + is a free object. |
| 42 | + * The first free object always has a generation number of 65535. It forms |
| 43 | + the head of a linked list of all free objects. |
| 44 | + * The generation number of a normal object is usually 0. It is incremented |
| 45 | + when an object is deleted, so a reused object number can be told apart |
| 46 | + from the object that held that number before. |
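
The entry layout described above can be sketched as a small parser. The names `XrefEntry` and `parse_xref_section` are hypothetical. The sketch assumes a single section and splits on whitespace, so it accepts the header on one line or on two.

```python
from typing import List, NamedTuple


class XrefEntry(NamedTuple):
    object_number: int
    value: int        # byte offset for "n" entries, next free object for "f" entries
    generation: int
    in_use: bool


def parse_xref_section(text: str) -> List[XrefEntry]:
    """Parse one xref section given as text (illustrative only)."""
    tokens = text.split()
    if len(tokens) < 3 or tokens[0] != "xref":
        raise ValueError("section must start with the xref keyword")
    first_object, count = int(tokens[1]), int(tokens[2])
    fields = tokens[3:]
    if len(fields) != 3 * count:
        raise ValueError("entry count does not match the section header")

    entries = []
    for i in range(count):
        number, generation, keyword = fields[3 * i : 3 * i + 3]
        if keyword not in ("n", "f"):
            raise ValueError(f"unexpected entry keyword: {keyword!r}")
        entries.append(
            XrefEntry(first_object + i, int(number), int(generation), keyword == "n")
        )
    return entries
```

Applied to the table shown above, the second entry describes object 43, which is in use, has generation 0, and starts at byte 1234.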
| 47 | + |
| 48 | +## The body |
| 49 | + |
| 50 | +The body is a sequence of indirect objects: |
| 51 | + |
| 52 | +`counter generationnumber obj << the_object >> endobj` |
| 53 | + |
| 54 | +* `counter` (integer) is a unique identifier for the object, its object number. |
| 55 | +* `generationnumber` (integer) is the generation number of the object. |
| 56 | +* `obj` is the keyword that marks the start of the object. |
| 57 | +* `the_object` is the object itself, here a dictionary between `<<` and `>>`. It can be empty. Its `/Type` entry, if present, names the kind of object. |
| 58 | +* `endobj` marks the end of the object. |
| 59 | + |
| 60 | +A concrete example can be found in `test_reader.py::test_get_images_raw`: |
| 61 | + |
| 62 | +```text |
| 63 | +1 0 obj << /Count 1 /Kids [4 0 R] /Type /Pages >> endobj |
| 64 | +2 0 obj << >> endobj |
| 65 | +3 0 obj << >> endobj |
| 66 | +4 0 obj << /Contents 3 0 R /CropBox [0.0 0.0 2550.0 3508.0] |
| 67 | + /MediaBox [0.0 0.0 2550.0 3508.0] /Parent 1 0 R |
| 68 | + /Resources << /Font << >> >> |
| 69 | + /Rotate 0 /Type /Page >> endobj |
| 70 | +5 0 obj << /Pages 1 0 R /Type /Catalog >> endobj |
| 71 | +``` |
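
As a rough illustration of this layout, the following sketch splits a body like the one above into its indirect objects with a regular expression. The name `split_objects` is hypothetical. The sketch assumes plain-text objects without streams, since stream data may itself contain the bytes `endobj`.

```python
import re
from typing import Dict, Tuple

# object number, generation number, the `obj` keyword, content, `endobj`
OBJECT_RE = re.compile(r"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def split_objects(body: str) -> Dict[Tuple[int, int], str]:
    """Map (object number, generation number) to the object's raw text."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3).strip()
        for m in OBJECT_RE.finditer(body)
    }
```

Keying the result by both numbers mirrors how an indirect reference such as `4 0 R` names its target.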
| 72 | + |
| 73 | +## The trailer |
| 74 | + |
| 75 | +The trailer looks like this: |
| 76 | + |
| 77 | +```text |
| 78 | +trailer << /Root 5 0 R |
| 79 | + /Size 6 |
| 80 | + >> |
| 81 | +startxref 1234 |
| 82 | +%%EOF |
| 83 | +``` |
| 84 | + |
| 85 | +Let's go through it: |
| 86 | + |
| 87 | +* `trailer <<` indicates that the *trailer dictionary* starts. It ends with `>>`. |
| 88 | +* `startxref` is a keyword followed by the byte offset of the `xref` keyword. |
| 89 | + As the trailer is always at the bottom of the file, this allows readers to |
| 90 | + quickly find the xref table. |
| 91 | +* `%%EOF` is the end-of-file marker. |
| 92 | + |
| 93 | +The trailer dictionary is a key-value list. The keys are specified in |
| 94 | +Table 3.13 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required). |
| 95 | + |
| 96 | +* `/Root` (dictionary) refers to the document catalog through an indirect reference. |
| 97 | + * `5` is the object number of the catalog dictionary. |
| 98 | + * `0` is the generation number of the catalog dictionary. |
| 99 | + * `R` is the keyword that marks the two preceding numbers as a reference to the |
| 100 | + catalog dictionary. |
| 101 | +* `/Size` (integer) contains the total number of entries in the file's xref table. |
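
A reader typically starts at the end of the file. The following minimal sketch extracts the three values discussed above from the raw bytes. The name `read_trailer` is hypothetical, and the sketch assumes a classic trailer as shown here, with a single `trailer` keyword and no cross-reference streams.

```python
import re
from typing import Dict, Tuple, Union


def read_trailer(data: bytes) -> Dict[str, Union[int, Tuple[int, int]]]:
    """Extract startxref, /Root and /Size from the end of a file (illustrative only)."""
    eof = data.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("missing %%EOF marker")
    start = data.rfind(b"startxref", 0, eof)
    if start == -1:
        raise ValueError("missing startxref keyword")
    # The number between `startxref` and `%%EOF` is the offset of the xref table
    xref_offset = int(data[start + len(b"startxref") : eof].split()[0])

    trailer = data.rfind(b"trailer", 0, start)
    if trailer == -1:
        raise ValueError("missing trailer keyword")
    dictionary = data[trailer:start]
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", dictionary)
    size = re.search(rb"/Size\s+(\d+)", dictionary)
    if root is None or size is None:
        raise ValueError("/Root and /Size are required")

    return {
        "startxref": xref_offset,
        "root": (int(root.group(1)), int(root.group(2))),
        "size": int(size.group(1)),
    }
```

For the trailer shown above, this yields the offset `1234`, the reference `(5, 0)` and the size `6`.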