Commit 7541047

DOC: The PDF Format + commit prefixes (#810)
1 parent b3247e8 commit 7541047

3 files changed

+154
-1
lines changed

docs/dev/intro.md

Lines changed: 51 additions & 0 deletions
@@ -9,6 +9,57 @@ the users, but for people who want to work on PyPDF2 itself.
```
pip install -r requirements/dev.txt
```

## Running Tests

```
pytest .
```

## Tools: git and pre-commit

Git is a command line application for version control. If you don't know it,
you can [play ohmygit](https://ohmygit.org/) to learn it.

GitHub is the service where the PyPDF2 project is hosted. While git is free and
open source, GitHub is a commercial service owned by Microsoft, although it is
free of charge in a lot of cases.

[pre-commit](https://pypi.org/project/pre-commit/) is a command line application
that uses git hooks to automatically execute code. This allows you to avoid
style issues and other code quality issues. After you have run `pre-commit install`
once in your local copy of PyPDF2, it will be executed automatically whenever
you `git commit`.

## Commit Messages

Having a clean commit message helps people to quickly understand what the commit
was about, without actually looking at the changes. The first line of the
commit message is used to [auto-generate the CHANGELOG](https://github.com/py-pdf/PyPDF2/blob/main/make_changelog.py). For this reason, the format should be:

```
PREFIX: DESCRIPTION

BODY
```

The `PREFIX` can be:

* `BUG`: A bug was fixed. Likely there are one or more related issues. If so,
  write `Closes #123` in the `BODY`, where 123 is the issue number on GitHub.
  It would be absolutely amazing if you could write a regression test in those
  cases. That is a test that would fail without the fix.
* `ENH`: A new feature! Describe in the body what it can be used for.
* `DEP`: A deprecation: either marking something as "this is going to be removed"
  or actually removing it.
* `ROB`: A robustness change. Dealing better with broken PDF files.
* `DOC`: A documentation change.
* `TST`: Adding / adjusting tests.
* `DEV`: Developer experience improvements, e.g. pre-commit or setting up CI.
* `MAINT`: Quite a lot of different stuff. Performance improvements are for sure
  the most interesting changes in here. Refactorings as well.
* `STY`: A style change. Something that makes PyPDF2 code more consistent.
  Typically a small change.

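As a rough illustration of the `PREFIX: DESCRIPTION` convention, the following sketch checks the first line of a commit message against the prefixes listed above. The function name `check_commit_message` is hypothetical; it is not part of PyPDF2 or of `make_changelog.py`, and the exact rules the changelog script applies may differ.

```python
import re

# Prefixes copied from the list above.
ALLOWED_PREFIXES = ("BUG", "ENH", "DEP", "ROB", "DOC", "TST", "DEV", "MAINT", "STY")

# The first line is expected to look like "PREFIX: DESCRIPTION".
FIRST_LINE = re.compile(r"^(?P<prefix>[A-Z]+): (?P<description>\S.*)$")


def check_commit_message(message: str) -> bool:
    """Return True if the first line follows the PREFIX: DESCRIPTION format.

    Hypothetical helper for illustration only.
    """
    lines = message.splitlines()
    if not lines:
        return False
    match = FIRST_LINE.match(lines[0])
    return match is not None and match.group("prefix") in ALLOWED_PREFIXES


if __name__ == "__main__":
    print(check_commit_message("DOC: The PDF Format + commit prefixes (#810)"))
    print(check_commit_message("Fixed a bug"))
```

A check like this could be wired into a git `commit-msg` hook, but that is a design choice outside the scope of this guide.
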
## Benchmarks

We need to keep an eye on performance and thus we have a few benchmarks.

docs/dev/pdf-format.md

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
# The PDF Format

It's recommended to look in the PDF specification for details and clarifications.
This is only intended to give a very rough overview of the format.

## Overall Structure

A PDF consists of:

1. Header: Contains the version of the PDF, e.g. `%PDF-1.7`
2. Body: Contains a sequence of indirect objects
3. Cross-reference table (xref): Contains a list of the indirect objects in the body
4. Trailer

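To make the header concrete, here is a minimal sketch that reads the version from the first bytes of a file. The helper `read_pdf_version` is a hypothetical name used only for illustration; it is not a PyPDF2 function, and it assumes the header sits at the very start of the data.

```python
import re
from typing import Optional

# The header is expected at the very start of the file, e.g. b"%PDF-1.7".
HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def read_pdf_version(data: bytes) -> Optional[str]:
    """Return the version string from the header, or None if there is no header.

    Hypothetical helper for illustration only.
    """
    match = HEADER.match(data)
    if match is None:
        return None
    return match.group(1).decode("ascii")


if __name__ == "__main__":
    print(read_pdf_version(b"%PDF-1.7\n1 0 obj << >> endobj\n"))
```
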
## The xref table

A cross-reference table (xref) is a table of the indirect objects in the body.
It allows quick access to those objects by pointing to their location in the file.

It looks like this:

```text
xref
42 5
0000001000 65535 f
0000001234 00000 n
0000001987 00000 n
0000011987 00000 n
0000031987 00000 n
```

Let's go through it step-by-step:

* `xref` is just a keyword that specifies the start of the xref table.
* `42` is the object number of the first entry in this subsection; `5` is the
  number of entries that follow.
* Every entry has 3 fields, `nnnnnnnnnn ggggg n`: a 10-digit byte offset,
  a 5-digit generation number, and a literal keyword which is either `n` or `f`.
    * `nnnnnnnnnn` is the byte offset of the object. It tells the reader where
      the object is in the file.
    * `ggggg` is the generation number. It tells the reader how old the object is.
    * `n` means that the object is a normal in-use object, `f` means that the object
      is a free object.
        * The first free object always has a generation number of 65535. It forms
          the head of a linked-list of all free objects.
        * The generation number of a normal object is usually 0; it only grows
          when an object number is reused after the object was freed. The generation
          number allows the PDF format to contain multiple versions of the same
          object. This is a version history mechanism.

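The entry layout described above can be sketched as a small parser. This is only an illustration under simplifying assumptions: the `xref` keyword line has already been consumed, there is a single subsection, and the names `XrefEntry` and `parse_xref_subsection` are hypothetical rather than part of PyPDF2.

```python
from typing import List, NamedTuple


class XrefEntry(NamedTuple):
    object_number: int
    # Byte offset for "n" entries. For "f" entries the PDF specification uses
    # this field for the object number of the next free object instead.
    offset: int
    generation: int
    in_use: bool


def parse_xref_subsection(lines: List[str]) -> List[XrefEntry]:
    """Parse a 'first count' header line followed by 'count' entry lines.

    Hypothetical helper for illustration only.
    """
    first, count = (int(part) for part in lines[0].split())
    entries = []
    for index, line in enumerate(lines[1 : 1 + count]):
        offset, generation, keyword = line.split()
        entries.append(
            XrefEntry(first + index, int(offset), int(generation), keyword == "n")
        )
    return entries


example = """42 5
0000001000 65535 f
0000001234 00000 n
0000001987 00000 n
0000011987 00000 n
0000031987 00000 n""".splitlines()

if __name__ == "__main__":
    for entry in parse_xref_subsection(example):
        print(entry)
```
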
## The body

The body is a sequence of indirect objects:

`counter generationnumber obj << the_object >> endobj`

* `counter` (integer) is a unique identifier for the object.
* `generationnumber` (integer) is the generation number of the object.
* `the_object` is the object itself. It can be empty. Starts with `/Keyword` to
  specify which kind of object it is.
* `obj` and `endobj` mark the start and the end of the object.

A concrete example can be found in `test_reader.py::test_get_images_raw`:

```text
1 0 obj << /Count 1 /Kids [4 0 R] /Type /Pages >> endobj
2 0 obj << >> endobj
3 0 obj << >> endobj
4 0 obj << /Contents 3 0 R /CropBox [0.0 0.0 2550.0 3508.0]
/MediaBox [0.0 0.0 2550.0 3508.0] /Parent 1 0 R
/Resources << /Font << >> >>
/Rotate 0 /Type /Page >> endobj
5 0 obj << /Pages 1 0 R /Type /Catalog >> endobj
```

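As a rough sketch of how the `counter generationnumber obj ... endobj` pattern can be recognised, the snippet below pulls the object number, generation number, and raw content out of a body like the one above. It is deliberately naive: a regular expression like this would break on real files where streams or strings contain the text `endobj`. The name `list_indirect_objects` is hypothetical and not a PyPDF2 function.

```python
import re
from typing import List, Tuple

# Object number, generation number, the "obj" keyword, then everything up to "endobj".
OBJECT = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)


def list_indirect_objects(body: bytes) -> List[Tuple[int, int, bytes]]:
    """Return (object number, generation number, raw content) for each object.

    Hypothetical helper for illustration only.
    """
    return [
        (int(match.group(1)), int(match.group(2)), match.group(3).strip())
        for match in OBJECT.finditer(body)
    ]


body = b"""1 0 obj << /Count 1 /Kids [4 0 R] /Type /Pages >> endobj
2 0 obj << >> endobj
5 0 obj << /Pages 1 0 R /Type /Catalog >> endobj
"""

if __name__ == "__main__":
    for number, generation, content in list_indirect_objects(body):
        print(number, generation, content)
```
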
## The trailer

The trailer looks like this:

```text
trailer << /Root 5 0 R
/Size 6
>>
startxref 1234
%%EOF
```

Let's go through it:

* `trailer <<` indicates that the *trailer dictionary* starts. It ends with `>>`.
* `startxref` is a keyword followed by the byte offset of the `xref` keyword.
  As the trailer is always at the bottom of the file, this allows readers to
  quickly find the xref table.
* `%%EOF` is the end-of-file marker.

The trailer dictionary is a key-value list. The keys are specified in
Table 3.13 of the PDF Reference 1.7, e.g. `/Root` and `/Size` (both are required).

* `/Root` (dictionary) contains the document catalog.
    * The `5` is the object number of the catalog dictionary
    * `0` is the generation number of the catalog dictionary
    * `R` is the keyword that indicates that the object is a reference to the
      catalog dictionary.
* `/Size` (integer) contains the total number of entries in the file's xref table.
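
Because `startxref` sits at the bottom of the file, a reader can find the xref table by looking only at the last bytes. The following is a minimal sketch of that idea; the helper name `find_xref_offset` and the 1024-byte window are assumptions made for illustration, not how PyPDF2 implements it.

```python
import re
from typing import Optional

# "startxref", the byte offset of the xref keyword, then the end-of-file marker.
STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def find_xref_offset(data: bytes, tail_size: int = 1024) -> Optional[int]:
    """Return the offset given after the last `startxref`, or None if not found.

    Hypothetical helper for illustration only; tail_size is an arbitrary window.
    """
    tail = data[-tail_size:]
    matches = STARTXREF.findall(tail)
    if not matches:
        return None
    return int(matches[-1])


trailer = b"""trailer << /Root 5 0 R
/Size 6
>>
startxref 1234
%%EOF
"""

if __name__ == "__main__":
    print(find_xref_offset(trailer))
```
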

docs/index.rst

Lines changed: 2 additions & 1 deletion
@@ -49,10 +49,11 @@ You can contribute to `PyPDF2 on Github <https://github.com/py-pdf/PyPDF2>`_.
    modules/PageRange

 .. toctree::
-   :caption: PyPDF Developers
+   :caption: Developer Guide
    :maxdepth: 1

    dev/intro
+   dev/pdf-format

 .. toctree::
    :caption: About PyPDF2
