gh-99146 struct module documentation should have more predictable examples/warnings (GH-99141)

smontanaro · web-flow · commit 22d91c16bb03 · 2022-11-22T18:06:36.000Z
* nail down a couple examples to have more predictable output * update a number of things, but this is really just a stash... * added an applications section to describe typical uses for native and machine-independent formats * make sure all format strings use a format prefix character * responding to comments from @gpshead. Not likely finished yet. * This got more involved than I expected... * respond to several PR comments * a lot of wordsmithing * try and be more consistent in use of ``x`` vs ``'x'`` * expand examples a bit * update the "see also" to be more up-to-date * original examples relied on import * so present all examples as if * reformat based on @gpshead comment (missed before) * responding to comments * missed this * one more suggested edit * wordsmithing
diff --git a/Doc/library/struct.rst b/Doc/library/struct.rst
@@ -12,21 +12,25 @@
 
 --------------
 
-This module performs conversions between Python values and C structs represented
-as Python :class:`bytes` objects.  This can be used in handling binary data
-stored in files or from network connections, among other sources.  It uses
-:ref:`struct-format-strings` as compact descriptions of the layout of the C
-structs and the intended conversion to/from Python values.
+This module converts between Python values and C structs represented
+as Python :class:`bytes` objects.  Compact :ref:`format strings <struct-format-strings>`
+describe the intended conversions to/from Python values.
+The module's functions and objects can be used for two largely
+distinct applications, data exchange with external sources (files or
+network connections), or data transfer between the Python application
+and the C layer.
 
 .. note::
 
-   By default, the result of packing a given C struct includes pad bytes in
-   order to maintain proper alignment for the C types involved; similarly,
-   alignment is taken into account when unpacking.  This behavior is chosen so
-   that the bytes of a packed struct correspond exactly to the layout in memory
-   of the corresponding C struct.  To handle platform-independent data formats
-   or omit implicit pad bytes, use ``standard`` size and alignment instead of
-   ``native`` size and alignment: see :ref:`struct-alignment` for details.
+   When no prefix character is given, native mode is the default. It
+   packs or unpacks data based on the platform and compiler on which
+   the Python interpreter was built.
+   The result of packing a given C struct includes pad bytes which
+   maintain proper alignment for the C types involved; similarly,
+   alignment is taken into account when unpacking.  In contrast, when
+   communicating data between external sources, the programmer is
+   responsible for defining byte ordering and padding between elements.
+   See :ref:`struct-alignment` for details.
 
 Several :mod:`struct` functions (and methods of :class:`Struct`) take a *buffer*
 argument.  This refers to objects that implement the :ref:`bufferobjects` and
@@ -102,10 +106,13 @@ The module defines the following exception and functions:
 Format Strings
 --------------
 
-Format strings are the mechanism used to specify the expected layout when
-packing and unpacking data.  They are built up from :ref:`format-characters`,
-which specify the type of data being packed/unpacked.  In addition, there are
-special characters for controlling the :ref:`struct-alignment`.
+Format strings describe the data layout when
+packing and unpacking data.  They are built up from :ref:`format characters<format-characters>`,
+which specify the type of data being packed/unpacked.  In addition,
+special characters control the :ref:`byte order, size and alignment<struct-alignment>`.
+Each format string consists of an optional prefix character which
+describes the overall properties of the data and one or more format
+characters which describe the actual data values and padding.
 
 
 .. _struct-alignment:
@@ -116,6 +123,11 @@ Byte Order, Size, and Alignment
 By default, C types are represented in the machine's native format and byte
 order, and properly aligned by skipping pad bytes if necessary (according to the
 rules used by the C compiler).
+This behavior is chosen so
+that the bytes of a packed struct correspond exactly to the memory layout
+of the corresponding C struct.
+Whether to use native byte ordering
+and padding or standard formats depends on the application.
 
 .. index::
    single: @ (at); in struct format strings
@@ -144,12 +156,10 @@ following table:
 
 If the first character is not one of these, ``'@'`` is assumed.
 
-Native byte order is big-endian or little-endian, depending on the host
-system. For example, Intel x86 and AMD64 (x86-64) are little-endian;
-IBM z and most legacy architectures are big-endian;
-and ARM, RISC-V and IBM Power feature switchable endianness
-(bi-endian, though the former two are nearly always little-endian in practice).
-Use ``sys.byteorder`` to check the endianness of your system.
+Native byte order is big-endian or little-endian, depending on the
+host system. For example, Intel x86, AMD64 (x86-64), and Apple M1 are
+little-endian; IBM z and many legacy architectures are big-endian.
+Use :data:`sys.byteorder` to check the endianness of your system.
 
 Native size and alignment are determined using the C compiler's
 ``sizeof`` expression.  This is always combined with native byte order.
@@ -231,9 +241,9 @@ platform-dependent.
 +--------+--------------------------+--------------------+----------------+------------+
 | ``d``  | :c:expr:`double`         | float              | 8              | \(4)       |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``s``  | :c:expr:`char[]`         | bytes              |                |            |
+| ``s``  | :c:expr:`char[]`         | bytes              |                | \(9)       |
 +--------+--------------------------+--------------------+----------------+------------+
-| ``p``  | :c:expr:`char[]`         | bytes              |                |            |
+| ``p``  | :c:expr:`char[]`         | bytes              |                | \(8)       |
 +--------+--------------------------+--------------------+----------------+------------+
 | ``P``  | :c:expr:`void \*`        | integer            |                | \(5)       |
 +--------+--------------------------+--------------------+----------------+------------+
@@ -292,24 +302,40 @@ Notes:
    format <half precision format_>`_ for more information.
 
 (7)
-   For padding, ``x`` inserts null bytes.
-
+   When packing, ``'x'`` inserts one NUL byte.
+
+(8)
+   The ``'p'`` format character encodes a "Pascal string", meaning a short
+   variable-length string stored in a *fixed number of bytes*, given by the count.
+   The first byte stored is the length of the string, or 255, whichever is
+   smaller.  The bytes of the string follow.  If the string passed in to
+   :func:`pack` is too long (longer than the count minus 1), only the leading
+   ``count-1`` bytes of the string are stored.  If the string is shorter than
+   ``count-1``, it is padded with null bytes so that exactly count bytes in all
+   are used.  Note that for :func:`unpack`, the ``'p'`` format character consumes
+   ``count`` bytes, but that the string returned can never contain more than 255
+   bytes.
+
+(9)
+   For the ``'s'`` format character, the count is interpreted as the length of the
+   bytes, not a repeat count like for the other format characters; for example,
+   ``'10s'`` means a single 10-byte string mapping to or from a single
+   Python byte string, while ``'10c'`` means 10
+   separate one byte character elements (e.g., ``cccccccccc``) mapping
+   to or from ten different Python byte objects. (See :ref:`struct-examples`
+   for a concrete demonstration of the difference.)
+   If a count is not given, it defaults to 1.  For packing, the string is
+   truncated or padded with null bytes as appropriate to make it fit. For
+   unpacking, the resulting bytes object always has exactly the specified number
+   of bytes.  As a special case, ``'0s'`` means a single, empty string (while
+   ``'0c'`` means 0 characters).
 
 A format character may be preceded by an integral repeat count.  For example,
 the format string ``'4h'`` means exactly the same as ``'hhhh'``.
 
 Whitespace characters between formats are ignored; a count and its format must
 not contain whitespace though.
 
-For the ``'s'`` format character, the count is interpreted as the length of the
-bytes, not a repeat count like for the other format characters; for example,
-``'10s'`` means a single 10-byte string, while ``'10c'`` means 10 characters.
-If a count is not given, it defaults to 1.  For packing, the string is
-truncated or padded with null bytes as appropriate to make it fit. For
-unpacking, the resulting bytes object always has exactly the specified number
-of bytes.  As a special case, ``'0s'`` means a single, empty string (while
-``'0c'`` means 0 characters).
-
 When packing a value ``x`` using one of the integer formats (``'b'``,
 ``'B'``, ``'h'``, ``'H'``, ``'i'``, ``'I'``, ``'l'``, ``'L'``,
 ``'q'``, ``'Q'``), if ``x`` is outside the valid range for that format
@@ -319,17 +345,6 @@ then :exc:`struct.error` is raised.
    Previously, some of the integer formats wrapped out-of-range values and
    raised :exc:`DeprecationWarning` instead of :exc:`struct.error`.
 
-The ``'p'`` format character encodes a "Pascal string", meaning a short
-variable-length string stored in a *fixed number of bytes*, given by the count.
-The first byte stored is the length of the string, or 255, whichever is
-smaller.  The bytes of the string follow.  If the string passed in to
-:func:`pack` is too long (longer than the count minus 1), only the leading
-``count-1`` bytes of the string are stored.  If the string is shorter than
-``count-1``, it is padded with null bytes so that exactly count bytes in all
-are used.  Note that for :func:`unpack`, the ``'p'`` format character consumes
-``count`` bytes, but that the string returned can never contain more than 255
-bytes.
-
 .. index:: single: ? (question mark); in struct format strings
 
 For the ``'?'`` format character, the return value is either :const:`True` or
@@ -345,18 +360,36 @@ Examples
 ^^^^^^^^
 
 .. note::
-   All examples assume a native byte order, size, and alignment with a
-   big-endian machine.
+   Native byte order examples (designated by the ``'@'`` format prefix or
+   lack of any prefix character) may not match what the reader's
+   machine produces as
+   that depends on the platform and compiler.
+
+Pack and unpack integers of three different sizes, using big endian
+ordering::
 
-A basic example of packing/unpacking three integers::
+    >>> from struct import *
+    >>> pack(">bhl", 1, 2, 3)
+    b'\x01\x00\x02\x00\x00\x00\x03'
+    >>> unpack('>bhl', b'\x01\x00\x02\x00\x00\x00\x03'
+    (1, 2, 3)
+    >>> calcsize('>bhl')
+    7
 
-   >>> from struct import *
-   >>> pack('hhl', 1, 2, 3)
-   b'\x00\x01\x00\x02\x00\x00\x00\x03'
-   >>> unpack('hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03')
-   (1, 2, 3)
-   >>> calcsize('hhl')
-   8
+Attempt to pack an integer which is too large for the defined field::
+
+    >>> pack(">h", 99999)
+    Traceback (most recent call last):
+      File "<stdin>", line 1, in <module>
+    struct.error: 'h' format requires -32768 <= number <= 32767
+
+Demonstrate the difference between ``'s'`` and ``'c'`` format
+characters::
+
+    >>> pack("@ccc", b'1', b'2', b'3')
+    b'123'
+    >>> pack("@3s", b'123')
+    b'123'
 
 Unpacked fields can be named by assigning them to variables or by wrapping
 the result in a named tuple::
@@ -369,35 +402,132 @@ the result in a named tuple::
     >>> Student._make(unpack('<10sHHb', record))
     Student(name=b'raymond   ', serialnum=4658, school=264, gradelevel=8)
 
-The ordering of format characters may have an impact on size since the padding
-needed to satisfy alignment requirements is different::
-
-    >>> pack('ci', b'*', 0x12131415)
-    b'*\x00\x00\x00\x12\x13\x14\x15'
-    >>> pack('ic', 0x12131415, b'*')
-    b'\x12\x13\x14\x15*'
-    >>> calcsize('ci')
+The ordering of format characters may have an impact on size in native
+mode since padding is implicit. In standard mode, the user is
+responsible for inserting any desired padding.
+Note in
+the first ``pack`` call below that three NUL bytes were added after the
+packed ``'#'`` to align the following integer on a four-byte boundary.
+In this example, the output was produced on a little endian machine::
+
+    >>> pack('@ci', b'#', 0x12131415)
+    b'#\x00\x00\x00\x15\x14\x13\x12'
+    >>> pack('@ic', 0x12131415, b'#')
+    b'\x15\x14\x13\x12#'
+    >>> calcsize('@ci')
     8
-    >>> calcsize('ic')
+    >>> calcsize('@ic')
     5
 
-The following format ``'llh0l'`` specifies two pad bytes at the end, assuming
-longs are aligned on 4-byte boundaries::
+The following format ``'llh0l'`` results in two pad bytes being added
+at the end, assuming the platform's longs are aligned on 4-byte boundaries::
 
-    >>> pack('llh0l', 1, 2, 3)
+    >>> pack('@llh0l', 1, 2, 3)
     b'\x00\x00\x00\x01\x00\x00\x00\x02\x00\x03\x00\x00'
 
-This only works when native size and alignment are in effect; standard size and
-alignment does not enforce any alignment.
-
 
 .. seealso::
 
    Module :mod:`array`
       Packed binary storage of homogeneous data.
 
-   Module :mod:`xdrlib`
-      Packing and unpacking of XDR data.
+   Module :mod:`json`
+      JSON encoder and decoder.
+
+   Module :mod:`pickle`
+      Python object serialization.
+
+
+.. _applications:
+
+Applications
+------------
+
+Two main applications for the :mod:`struct` module exist, data
+interchange between Python and C code within an application or another
+application compiled using the same compiler (:ref:`native formats<struct-native-formats>`), and
+data interchange between applications using agreed upon data layout
+(:ref:`standard formats<struct-standard-formats>`).  Generally speaking, the format strings
+constructed for these two domains are distinct.
+
+
+.. _struct-native-formats:
+
+Native Formats
+^^^^^^^^^^^^^^
+
+When constructing format strings which mimic native layouts, the
+compiler and machine architecture determine byte ordering and padding.
+In such cases, the ``@`` format character should be used to specify
+native byte ordering and data sizes.  Internal pad bytes are normally inserted
+automatically.  It is possible that a zero-repeat format code will be
+needed at the end of a format string to round up to the correct
+byte boundary for proper alignment of consective chunks of data.
+
+Consider these two simple examples (on a 64-bit, little-endian
+machine)::
+
+    >>> calcsize('@lhl')
+    24
+    >>> calcsize('@llh')
+    18
+
+Data is not padded to an 8-byte boundary at the end of the second
+format string without the use of extra padding.  A zero-repeat format
+code solves that problem::
+
+    >>> calcsize('@llh0l')
+    24
+
+The ``'x'`` format code can be used to specify the repeat, but for
+native formats it is better to use a zero-repeat format like ``'0l'``.
+
+By default, native byte ordering and alignment is used, but it is
+better to be explicit and use the ``'@'`` prefix character.
+
+
+.. _struct-standard-formats:
+
+Standard Formats
+^^^^^^^^^^^^^^^^
+
+When exchanging data beyond your process such as networking or storage,
+be precise.  Specify the exact byte order, size, and alignment.  Do
+not assume they match the native order of a particular machine.
+For example, network byte order is big-endian, while many popular CPUs
+are little-endian.  By defining this explicitly, the user need not
+care about the specifics of the platform their code is running on.
+The first character should typically be ``<`` or ``>``
+(or ``!``).  Padding is the responsibility of the programmer.  The
+zero-repeat format character won't work.  Instead, the user must
+explicitly add ``'x'`` pad bytes where needed.  Revisiting the
+examples from the previous section, we have::
+
+    >>> calcsize('<qh6xq')
+    24
+    >>> pack('<qh6xq', 1, 2, 3) == pack('@lhl', 1, 2, 3)
+    True
+    >>> calcsize('@llh')
+    18
+    >>> pack('@llh', 1, 2, 3) == pack('<qqh', 1, 2, 3)
+    True
+    >>> calcsize('<qqh6x')
+    24
+    >>> calcsize('@llh0l')
+    24
+    >>> pack('@llh0l', 1, 2, 3) == pack('<qqh6x', 1, 2, 3)
+    True
+
+The above results (executed on a 64-bit machine) aren't guaranteed to
+match when executed on different machines.  For example, the examples
+below were executed on a 32-bit machine::
+
+    >>> calcsize('<qqh6x')
+    24
+    >>> calcsize('@llh0l')
+    12
+    >>> pack('@llh0l', 1, 2, 3) == pack('<qqh6x', 1, 2, 3)
+    False
 
 
 .. _struct-objects:
@@ -411,9 +541,9 @@ The :mod:`struct` module also defines the following type:
 .. class:: Struct(format)
 
    Return a new Struct object which writes and reads binary data according to
-   the format string *format*.  Creating a Struct object once and calling its
-   methods is more efficient than calling the :mod:`struct` functions with the
-   same format since the format string only needs to be compiled once.
+   the format string *format*.  Creating a ``Struct`` object once and calling its
+   methods is more efficient than calling module-level functions with the
+   same format since the format string is only compiled once.
 
    .. note::