Commit 22d91c1

gh-99146 struct module documentation should have more predictable examples/warnings (GH-99141)
* nail down a couple examples to have more predictable output
* update a number of things, but this is really just a stash...
* added an applications section to describe typical uses for native and machine-independent formats
* make sure all format strings use a format prefix character
* responding to comments from @gpshead. Not likely finished yet.
* This got more involved than I expected...
* respond to several PR comments
* a lot of wordsmithing
* try and be more consistent in use of ``x`` vs ``'x'``
* expand examples a bit
* update the "see also" to be more up-to-date
* original examples relied on import * so present all examples as if
* reformat based on @gpshead comment (missed before)
* responding to comments
* missed this
* one more suggested edit
* wordsmithing
1 parent 5d41833 commit 22d91c1

1 file changed

+206
-76
lines changed

Doc/library/struct.rst

Lines changed: 206 additions & 76 deletions
Original file line number | Diff line number | Diff line change
@@ -12,21 +12,25 @@
1212

1313
--------------
1414

15-
This module performs conversions between Python values and C structs represented
16-
as Python :class:`bytes` objects. This can be used in handling binary data
17-
stored in files or from network connections, among other sources. It uses
18-
:ref:`struct-format-strings` as compact descriptions of the layout of the C
19-
structs and the intended conversion to/from Python values.
15+
This module converts between Python values and C structs represented
16+
as Python :class:`bytes` objects. Compact :ref:`format strings <struct-format-strings>`
17+
describe the intended conversions to/from Python values.
18+
The module's functions and objects can be used for two largely
19+
distinct applications: data exchange with external sources (files or
20+
network connections), and data transfer between the Python application
21+
and the C layer.
2022

2123
.. note::
2224

23-
By default, the result of packing a given C struct includes pad bytes in
24-
order to maintain proper alignment for the C types involved; similarly,
25-
alignment is taken into account when unpacking. This behavior is chosen so
26-
that the bytes of a packed struct correspond exactly to the layout in memory
27-
of the corresponding C struct. To handle platform-independent data formats
28-
or omit implicit pad bytes, use ``standard`` size and alignment instead of
29-
``native`` size and alignment: see :ref:`struct-alignment` for details.
25+
When no prefix character is given, native mode is the default. It
26+
packs or unpacks data based on the platform and compiler on which
27+
the Python interpreter was built.
28+
The result of packing a given C struct includes pad bytes which
29+
maintain proper alignment for the C types involved; similarly,
30+
alignment is taken into account when unpacking. In contrast, when
31+
communicating data between external sources, the programmer is
32+
responsible for defining byte ordering and padding between elements.
33+
See :ref:`struct-alignment` for details.
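
A minimal sketch of the contrast the note describes. The native size is whatever the local platform and compiler produce, so only the standard-mode value is given as a fixed number; the variable names are illustrative only.

```python
import struct

# Native mode ('@', or no prefix): sizes and padding follow the C compiler's
# rules for this platform, so the result can differ between machines.
native_size = struct.calcsize('@ci')

# Standard mode ('<'): no implicit padding; char is 1 byte and int is 4 bytes.
standard_size = struct.calcsize('<ci')

print(native_size, standard_size)
```

On most common platforms the native size is larger because pad bytes are inserted before the ``int``, but that should not be relied on.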
3034

3135
Several :mod:`struct` functions (and methods of :class:`Struct`) take a *buffer*
3236
argument. This refers to objects that implement the :ref:`bufferobjects` and
@@ -102,10 +106,13 @@ The module defines the following exception and functions:
102106
Format Strings
103107
--------------
104108

105-
Format strings are the mechanism used to specify the expected layout when
106-
packing and unpacking data. They are built up from :ref:`format-characters`,
107-
which specify the type of data being packed/unpacked. In addition, there are
108-
special characters for controlling the :ref:`struct-alignment`.
109+
Format strings describe the data layout when
110+
packing and unpacking data. They are built up from :ref:`format characters<format-characters>`,
111+
which specify the type of data being packed/unpacked. In addition,
112+
special characters control the :ref:`byte order, size and alignment<struct-alignment>`.
113+
Each format string consists of an optional prefix character which
114+
describes the overall properties of the data and one or more format
115+
characters which describe the actual data values and padding.
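
As an illustration of the prefix-plus-format-characters structure, the sketch below packs the same two values with two different prefixes. The format ``HxB`` is an arbitrary example chosen here, not one taken from the documentation.

```python
import struct

# '<' = little-endian, standard sizes. 'H' = unsigned short (2 bytes),
# 'x' = one pad byte, 'B' = unsigned char (1 byte).
little = struct.pack('<HxB', 1, 2)

# Same format characters; '>' selects big-endian byte order instead.
big = struct.pack('>HxB', 1, 2)

print(little)  # b'\x01\x00\x00\x02'
print(big)     # b'\x00\x01\x00\x02'
```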
109116

110117

111118
.. _struct-alignment:
@@ -116,6 +123,11 @@ Byte Order, Size, and Alignment
116123
By default, C types are represented in the machine's native format and byte
117124
order, and properly aligned by skipping pad bytes if necessary (according to the
118125
rules used by the C compiler).
126+
This behavior is chosen so
127+
that the bytes of a packed struct correspond exactly to the memory layout
128+
of the corresponding C struct.
129+
Whether to use native byte ordering
130+
and padding or standard formats depends on the application.
119131

120132
.. index::
121133
single: @ (at); in struct format strings
@@ -144,12 +156,10 @@ following table:
144156

145157
If the first character is not one of these, ``'@'`` is assumed.
146158

147-
Native byte order is big-endian or little-endian, depending on the host
148-
system. For example, Intel x86 and AMD64 (x86-64) are little-endian;
149-
IBM z and most legacy architectures are big-endian;
150-
and ARM, RISC-V and IBM Power feature switchable endianness
151-
(bi-endian, though the former two are nearly always little-endian in practice).
152-
Use ``sys.byteorder`` to check the endianness of your system.
159+
Native byte order is big-endian or little-endian, depending on the
160+
host system. For example, Intel x86, AMD64 (x86-64), and Apple M1 are
161+
little-endian; IBM z and many legacy architectures are big-endian.
162+
Use :data:`sys.byteorder` to check the endianness of your system.
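
One way to see the link between :data:`sys.byteorder` and the prefix characters is the sketch below. It uses ``'='`` (native byte order, standard sizes) so that only the byte order varies between machines.

```python
import struct
import sys

# Pick the explicit prefix that matches this machine's byte order.
explicit_prefix = '<' if sys.byteorder == 'little' else '>'

native = struct.pack('=I', 1)                      # native order, standard size
explicit = struct.pack(explicit_prefix + 'I', 1)   # same order, spelled out

print(sys.byteorder, native == explicit)
```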
153163

154164
Native size and alignment are determined using the C compiler's
155165
``sizeof`` expression. This is always combined with native byte order.
@@ -231,9 +241,9 @@ platform-dependent.
231241
+--------+--------------------------+--------------------+----------------+------------+
232242
| ``d`` | :c:expr:`double` | float | 8 | \(4) |
233243
+--------+--------------------------+--------------------+----------------+------------+
234-
| ``s`` | :c:expr:`char[]` | bytes | | |
244+
| ``s`` | :c:expr:`char[]` | bytes | | \(9) |
235245
+--------+--------------------------+--------------------+----------------+------------+
236-
| ``p`` | :c:expr:`char[]` | bytes | | |
246+
| ``p`` | :c:expr:`char[]` | bytes | | \(8) |
237247
+--------+--------------------------+--------------------+----------------+------------+
238248
| ``P`` | :c:expr:`void \*` | integer | | \(5) |
239249
+--------+--------------------------+--------------------+----------------+------------+
@@ -292,24 +302,40 @@ Notes:
292302
format <half precision format_>`_ for more information.
293303

294304
(7)
295-
For padding, ``x`` inserts null bytes.
296-
305+
When packing, ``'x'`` inserts one NUL byte.
306+
307+
(8)
308+
The ``'p'`` format character encodes a "Pascal string", meaning a short
309+
variable-length string stored in a *fixed number of bytes*, given by the count.
310+
The first byte stored is the length of the string, or 255, whichever is
311+
smaller. The bytes of the string follow. If the string passed in to
312+
:func:`pack` is too long (longer than the count minus 1), only the leading
313+
``count-1`` bytes of the string are stored. If the string is shorter than
314+
``count-1``, it is padded with null bytes so that exactly count bytes in all
315+
are used. Note that for :func:`unpack`, the ``'p'`` format character consumes
316+
``count`` bytes, but that the string returned can never contain more than 255
317+
bytes.
318+
319+
(9)
320+
For the ``'s'`` format character, the count is interpreted as the length of the
321+
bytes, not a repeat count like for the other format characters; for example,
322+
``'10s'`` means a single 10-byte string mapping to or from a single
323+
Python byte string, while ``'10c'`` means 10
324+
separate one byte character elements (e.g., ``cccccccccc``) mapping
325+
to or from ten different Python byte objects. (See :ref:`struct-examples`
326+
for a concrete demonstration of the difference.)
327+
If a count is not given, it defaults to 1. For packing, the string is
328+
truncated or padded with null bytes as appropriate to make it fit. For
329+
unpacking, the resulting bytes object always has exactly the specified number
330+
of bytes. As a special case, ``'0s'`` means a single, empty string (while
331+
``'0c'`` means 0 characters).
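
The behaviour described in notes (8) and (9) can be sketched as follows. The byte strings are arbitrary sample values, and a ``'>'`` prefix is used so the results do not depend on the platform.

```python
import struct

# 'p': the first byte holds the length, then the data, then NUL padding.
short = struct.pack('>5p', b'ab')          # b'\x02ab\x00\x00'
# Too long for the field: only the leading count-1 bytes are kept.
clipped = struct.pack('>3p', b'abcdef')    # b'\x02ab'
# Unpacking consumes all 5 bytes but returns only the stored string.
(value,) = struct.unpack('>5p', short)     # b'ab'

# 's': the count is the field width; values are padded or truncated to fit.
padded = struct.pack('>5s', b'ab')         # b'ab\x00\x00\x00'
truncated = struct.pack('>2s', b'abcdef')  # b'ab'
```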
297332

298333
A format character may be preceded by an integral repeat count. For example,
299334
the format string ``'4h'`` means exactly the same as ``'hhhh'``.
300335

301336
Whitespace characters between formats are ignored; a count and its format must
302337
not contain whitespace though.
303338

304-
For the ``'s'`` format character, the count is interpreted as the length of the
305-
bytes, not a repeat count like for the other format characters; for example,
306-
``'10s'`` means a single 10-byte string, while ``'10c'`` means 10 characters.
307-
If a count is not given, it defaults to 1. For packing, the string is
308-
truncated or padded with null bytes as appropriate to make it fit. For
309-
unpacking, the resulting bytes object always has exactly the specified number
310-
of bytes. As a special case, ``'0s'`` means a single, empty string (while
311-
``'0c'`` means 0 characters).
312-
313339
When packing a value ``x`` using one of the integer formats (``'b'``,
314340
``'B'``, ``'h'``, ``'H'``, ``'i'``, ``'I'``, ``'l'``, ``'L'``,
315341
``'q'``, ``'Q'``), if ``x`` is outside the valid range for that format
@@ -319,17 +345,6 @@ then :exc:`struct.error` is raised.
319345
Previously, some of the integer formats wrapped out-of-range values and
320346
raised :exc:`DeprecationWarning` instead of :exc:`struct.error`.
321347

322-
The ``'p'`` format character encodes a "Pascal string", meaning a short
323-
variable-length string stored in a *fixed number of bytes*, given by the count.
324-
The first byte stored is the length of the string, or 255, whichever is
325-
smaller. The bytes of the string follow. If the string passed in to
326-
:func:`pack` is too long (longer than the count minus 1), only the leading
327-
``count-1`` bytes of the string are stored. If the string is shorter than
328-
``count-1``, it is padded with null bytes so that exactly count bytes in all
329-
are used. Note that for :func:`unpack`, the ``'p'`` format character consumes
330-
``count`` bytes, but that the string returned can never contain more than 255
331-
bytes.
332-
333348
.. index:: single: ? (question mark); in struct format strings
334349

335350
For the ``'?'`` format character, the return value is either :const:`True` or
@@ -345,18 +360,36 @@ Examples
345360
^^^^^^^^
346361

347362
.. note::
348-
All examples assume a native byte order, size, and alignment with a
349-
big-endian machine.
363+
Native byte order examples (designated by the ``'@'`` format prefix or
364+
lack of any prefix character) may not match what the reader's
365+
machine produces as
366+
that depends on the platform and compiler.
367+
368+
Pack and unpack integers of three different sizes, using big endian
369+
ordering::
350370

351-
A basic example of packing/unpacking three integers::
371+
>>> from struct import *
372+
>>> pack(">bhl", 1, 2, 3)
373+
b'\x01\x00\x02\x00\x00\x00\x03'
374+
>>> unpack('>bhl', b'\x01\x00\x02\x00\x00\x00\x03')
375+
(1, 2, 3)
376+
>>> calcsize('>bhl')
377+
7
352378

353-
>>> from struct import *
354-
>>> pack('hhl', 1, 2, 3)
355-
b'\x00\x01\x00\x02\x00\x00\x00\x03'
356-
>>> unpack('hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03')
357-
(1, 2, 3)
358-
>>> calcsize('hhl')
359-
8
379+
Attempt to pack an integer which is too large for the defined field::
380+
381+
>>> pack(">h", 99999)
382+
Traceback (most recent call last):
383+
File "<stdin>", line 1, in <module>
384+
struct.error: 'h' format requires -32768 <= number <= 32767
385+
386+
Demonstrate the difference between ``'s'`` and ``'c'`` format
387+
characters::
388+
389+
>>> pack("@ccc", b'1', b'2', b'3')
390+
b'123'
391+
>>> pack("@3s", b'123')
392+
b'123'
360393

361394
Unpacked fields can be named by assigning them to variables or by wrapping
362395
the result in a named tuple::
@@ -369,35 +402,132 @@ the result in a named tuple::
369402
>>> Student._make(unpack('<10sHHb', record))
370403
Student(name=b'raymond ', serialnum=4658, school=264, gradelevel=8)
371404

372-
The ordering of format characters may have an impact on size since the padding
373-
needed to satisfy alignment requirements is different::
374-
375-
>>> pack('ci', b'*', 0x12131415)
376-
b'*\x00\x00\x00\x12\x13\x14\x15'
377-
>>> pack('ic', 0x12131415, b'*')
378-
b'\x12\x13\x14\x15*'
379-
>>> calcsize('ci')
405+
The ordering of format characters may have an impact on size in native
406+
mode since padding is implicit. In standard mode, the user is
407+
responsible for inserting any desired padding.
408+
Note in
409+
the first ``pack`` call below that three NUL bytes were added after the
410+
packed ``'#'`` to align the following integer on a four-byte boundary.
411+
In this example, the output was produced on a little endian machine::
412+
413+
>>> pack('@ci', b'#', 0x12131415)
414+
b'#\x00\x00\x00\x15\x14\x13\x12'
415+
>>> pack('@ic', 0x12131415, b'#')
416+
b'\x15\x14\x13\x12#'
417+
>>> calcsize('@ci')
380418
8
381-
>>> calcsize('ic')
419+
>>> calcsize('@ic')
382420
5
383421

384-
The following format ``'llh0l'`` specifies two pad bytes at the end, assuming
385-
longs are aligned on 4-byte boundaries::
422+
The following format ``'llh0l'`` results in two pad bytes being added
423+
at the end, assuming the platform's longs are aligned on 4-byte boundaries::
386424

387-
>>> pack('llh0l', 1, 2, 3)
425+
>>> pack('@llh0l', 1, 2, 3)
388426
b'\x00\x00\x00\x01\x00\x00\x00\x02\x00\x03\x00\x00'
389427

390-
This only works when native size and alignment are in effect; standard size and
391-
alignment does not enforce any alignment.
392-
393428

394429
.. seealso::
395430

396431
Module :mod:`array`
397432
Packed binary storage of homogeneous data.
398433

399-
Module :mod:`xdrlib`
400-
Packing and unpacking of XDR data.
434+
Module :mod:`json`
435+
JSON encoder and decoder.
436+
437+
Module :mod:`pickle`
438+
Python object serialization.
439+
440+
441+
.. _applications:
442+
443+
Applications
444+
------------
445+
446+
Two main applications for the :mod:`struct` module exist: data
447+
interchange between Python and C code within an application or another
448+
application compiled using the same compiler (:ref:`native formats<struct-native-formats>`), and
449+
data interchange between applications using an agreed-upon data layout
450+
(:ref:`standard formats<struct-standard-formats>`). Generally speaking, the format strings
451+
constructed for these two domains are distinct.
452+
453+
454+
.. _struct-native-formats:
455+
456+
Native Formats
457+
^^^^^^^^^^^^^^
458+
459+
When constructing format strings which mimic native layouts, the
460+
compiler and machine architecture determine byte ordering and padding.
461+
In such cases, the ``@`` prefix character should be used to specify
462+
native byte ordering and data sizes. Internal pad bytes are normally inserted
463+
automatically. It is possible that a zero-repeat format code will be
464+
needed at the end of a format string to round up to the correct
465+
byte boundary for proper alignment of consecutive chunks of data.
466+
467+
Consider these two simple examples (on a 64-bit, little-endian
468+
machine)::
469+
470+
>>> calcsize('@lhl')
471+
24
472+
>>> calcsize('@llh')
473+
18
474+
475+
Data is not padded to an 8-byte boundary at the end of the second
476+
format string without the use of extra padding. A zero-repeat format
477+
code solves that problem::
478+
479+
>>> calcsize('@llh0l')
480+
24
481+
482+
The ``'x'`` format code can be used to add the pad bytes explicitly, but for
483+
native formats it is better to use a zero-repeat format like ``'0l'``.
484+
485+
By default, native byte ordering and alignment are used, but it is
486+
better to be explicit and use the ``'@'`` prefix character.
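
A platform-neutral way to check what the zero-repeat code does is sketched below. It derives the alignment of a native ``long`` rather than assuming 4 or 8 bytes, and the variable names are illustrative only.

```python
import struct

# Alignment of a native long: a leading char forces padding up to that
# alignment, so the size difference is the alignment itself.
long_align = struct.calcsize('@cl') - struct.calcsize('@l')

unpadded = struct.calcsize('@llh')
padded = struct.calcsize('@llh0l')   # '0l' adds no data, only trailing padding

print(long_align, unpadded, padded)
```

The padded size should be the unpadded size rounded up to the next multiple of the ``long`` alignment.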
487+
488+
489+
.. _struct-standard-formats:
490+
491+
Standard Formats
492+
^^^^^^^^^^^^^^^^
493+
494+
When exchanging data beyond your process, such as over a network or in storage,
495+
be precise. Specify the exact byte order, size, and alignment. Do
496+
not assume they match the native order of a particular machine.
497+
For example, network byte order is big-endian, while many popular CPUs
498+
are little-endian. By defining this explicitly, the user need not
499+
care about the specifics of the platform their code is running on.
500+
The first character should typically be ``<`` or ``>``
501+
(or ``!``). Padding is the responsibility of the programmer. The
502+
zero-repeat format character won't work. Instead, the user must
503+
explicitly add ``'x'`` pad bytes where needed. Revisiting the
504+
examples from the previous section, we have::
505+
506+
>>> calcsize('<qh6xq')
507+
24
508+
>>> pack('<qh6xq', 1, 2, 3) == pack('@lhl', 1, 2, 3)
509+
True
510+
>>> calcsize('@llh')
511+
18
512+
>>> pack('@llh', 1, 2, 3) == pack('<qqh', 1, 2, 3)
513+
True
514+
>>> calcsize('<qqh6x')
515+
24
516+
>>> calcsize('@llh0l')
517+
24
518+
>>> pack('@llh0l', 1, 2, 3) == pack('<qqh6x', 1, 2, 3)
519+
True
520+
521+
The above results (executed on a 64-bit machine) aren't guaranteed to
522+
match when executed on different machines. For example, the examples
523+
below were executed on a 32-bit machine::
524+
525+
>>> calcsize('<qqh6x')
526+
24
527+
>>> calcsize('@llh0l')
528+
12
529+
>>> pack('@llh0l', 1, 2, 3) == pack('<qqh6x', 1, 2, 3)
530+
False
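
For data that leaves the process, a standard-format layout gives the same bytes on every platform. The sketch below uses a hypothetical 8-byte header (type, flags, length) invented for illustration; it does not correspond to any real protocol.

```python
import struct

# '!' = network (big-endian) byte order, standard sizes, no implicit padding.
# Hypothetical header: 2-byte type, 2-byte flags, 4-byte payload length.
HEADER = '!HHI'

packet = struct.pack(HEADER, 1, 0, 512)
msg_type, flags, length = struct.unpack(HEADER, packet)

print(struct.calcsize(HEADER))  # 8
print(packet)                   # b'\x00\x01\x00\x00\x00\x00\x02\x00'
```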
401531

402532

403533
.. _struct-objects:
@@ -411,9 +541,9 @@ The :mod:`struct` module also defines the following type:
411541
.. class:: Struct(format)
412542

413543
Return a new Struct object which writes and reads binary data according to
414-
the format string *format*. Creating a Struct object once and calling its
415-
methods is more efficient than calling the :mod:`struct` functions with the
416-
same format since the format string only needs to be compiled once.
544+
the format string *format*. Creating a ``Struct`` object once and calling its
545+
methods is more efficient than calling module-level functions with the
546+
same format since the format string is only compiled once.
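
A minimal sketch of reusing one ``Struct`` object for several records. The format ``'<HI'`` and the sample values are arbitrary choices for illustration.

```python
import struct

# Compile the format once, then reuse the object for every record.
record = struct.Struct('<HI')

packed = [record.pack(i, i * 10) for i in range(3)]
values = [record.unpack(p) for p in packed]

print(record.size)  # 6
print(values)       # [(0, 0), (1, 10), (2, 20)]
```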
417547

418548
.. note::
419549
