Skip to content

Commit ea23e78

Browse files
rhpvordermanambv
andauthored
bpo-43613: Faster implementation of gzip.compress and gzip.decompress (GH-27941)
Co-authored-by: Łukasz Langa <[email protected]>
1 parent a7ef15a commit ea23e78

File tree

8 files changed

+193
-87
lines changed

8 files changed

+193
-87
lines changed

Doc/library/gzip.rst

+14-3
Original file line numberDiff line numberDiff line change
@@ -174,19 +174,30 @@ The module defines the following items:
174174

175175
Compress the *data*, returning a :class:`bytes` object containing
176176
the compressed data. *compresslevel* and *mtime* have the same meaning as in
177-
the :class:`GzipFile` constructor above.
177+
the :class:`GzipFile` constructor above. When *mtime* is set to ``0``, this
178+
function is equivalent to :func:`zlib.compress` with *wbits* set to ``31``.
179+
The zlib function is faster.
178180

179181
.. versionadded:: 3.2
180182
.. versionchanged:: 3.8
181183
Added the *mtime* parameter for reproducible output.
184+
.. versionchanged:: 3.11
185+
Speed is improved by compressing all data at once instead of in a
186+
streamed fashion. Calls with *mtime* set to ``0`` are delegated to
187+
:func:`zlib.compress` for better speed.
182188

183189
.. function:: decompress(data)
184190

185191
Decompress the *data*, returning a :class:`bytes` object containing the
186-
uncompressed data.
192+
uncompressed data. This function is capable of decompressing multi-member
193+
gzip data (multiple gzip blocks concatenated together). When the data is
194+
certain to contain only one member the :func:`zlib.decompress` function with
195+
*wbits* set to 31 is faster.
187196

188197
.. versionadded:: 3.2
189-
198+
.. versionchanged:: 3.11
199+
Speed is improved by decompressing members at once in memory instead of in
200+
a streamed fashion.
190201

191202
.. _gzip-usage-examples:
192203

Doc/library/zlib.rst

+28-18
Original file line numberDiff line numberDiff line change
@@ -47,19 +47,43 @@ The available exception and functions in this module are:
4747
platforms, use ``adler32(data) & 0xffffffff``.
4848

4949

50-
.. function:: compress(data, /, level=-1)
50+
.. function:: compress(data, /, level=-1, wbits=MAX_WBITS)
5151

5252
Compresses the bytes in *data*, returning a bytes object containing compressed data.
5353
*level* is an integer from ``0`` to ``9`` or ``-1`` controlling the level of compression;
5454
``1`` (Z_BEST_SPEED) is fastest and produces the least compression, ``9`` (Z_BEST_COMPRESSION)
5555
is slowest and produces the most. ``0`` (Z_NO_COMPRESSION) is no compression.
5656
The default value is ``-1`` (Z_DEFAULT_COMPRESSION). Z_DEFAULT_COMPRESSION represents a default
5757
compromise between speed and compression (currently equivalent to level 6).
58+
59+
.. _compress-wbits:
60+
61+
The *wbits* argument controls the size of the history buffer (or the
62+
"window size") used when compressing data, and whether a header and
63+
trailer is included in the output. It can take several ranges of values,
64+
defaulting to ``15`` (MAX_WBITS):
65+
66+
* +9 to +15: The base-two logarithm of the window size, which
67+
therefore ranges between 512 and 32768. Larger values produce
68+
better compression at the expense of greater memory usage. The
69+
resulting output will include a zlib-specific header and trailer.
70+
71+
* −9 to −15: Uses the absolute value of *wbits* as the
72+
window size logarithm, while producing a raw output stream with no
73+
header or trailing checksum.
74+
75+
* +25 to +31 = 16 + (9 to 15): Uses the low 4 bits of the value as the
76+
window size logarithm, while including a basic :program:`gzip` header
77+
and trailing checksum in the output.
78+
5879
Raises the :exc:`error` exception if any error occurs.
5980

6081
.. versionchanged:: 3.6
6182
*level* can now be used as a keyword parameter.
6283

84+
.. versionchanged:: 3.11
85+
The *wbits* parameter is now available to set window bits and
86+
compression type.
6387

6488
.. function:: compressobj(level=-1, method=DEFLATED, wbits=MAX_WBITS, memLevel=DEF_MEM_LEVEL, strategy=Z_DEFAULT_STRATEGY[, zdict])
6589

@@ -76,23 +100,9 @@ The available exception and functions in this module are:
76100
*method* is the compression algorithm. Currently, the only supported value is
77101
:const:`DEFLATED`.
78102

79-
The *wbits* argument controls the size of the history buffer (or the
80-
"window size") used when compressing data, and whether a header and
81-
trailer is included in the output. It can take several ranges of values,
82-
defaulting to ``15`` (MAX_WBITS):
83-
84-
* +9 to +15: The base-two logarithm of the window size, which
85-
therefore ranges between 512 and 32768. Larger values produce
86-
better compression at the expense of greater memory usage. The
87-
resulting output will include a zlib-specific header and trailer.
88-
89-
* −9 to −15: Uses the absolute value of *wbits* as the
90-
window size logarithm, while producing a raw output stream with no
91-
header or trailing checksum.
92-
93-
* +25 to +31 = 16 + (9 to 15): Uses the low 4 bits of the value as the
94-
window size logarithm, while including a basic :program:`gzip` header
95-
and trailing checksum in the output.
103+
The *wbits* parameter controls the size of the history buffer (or the
104+
"window size"), and what header and trailer format will be used. It has
105+
the same meaning as `described for compress() <#compress-wbits>`__.
96106

97107
The *memLevel* argument controls the amount of memory used for the
98108
internal compression state. Valid values range from ``1`` to ``9``.

Lib/gzip.py

+108-53
Original file line numberDiff line numberDiff line change
@@ -403,6 +403,59 @@ def __iter__(self):
403403
return self._buffer.__iter__()
404404

405405

406+
def _read_exact(fp, n):
407+
'''Read exactly *n* bytes from `fp`
408+
409+
This method is required because fp may be unbuffered,
410+
i.e. return short reads.
411+
'''
412+
data = fp.read(n)
413+
while len(data) < n:
414+
b = fp.read(n - len(data))
415+
if not b:
416+
raise EOFError("Compressed file ended before the "
417+
"end-of-stream marker was reached")
418+
data += b
419+
return data
420+
421+
422+
def _read_gzip_header(fp):
423+
'''Read a gzip header from `fp` and progress to the end of the header.
424+
425+
Returns last mtime if header was present or None otherwise.
426+
'''
427+
magic = fp.read(2)
428+
if magic == b'':
429+
return None
430+
431+
if magic != b'\037\213':
432+
raise BadGzipFile('Not a gzipped file (%r)' % magic)
433+
434+
(method, flag, last_mtime) = struct.unpack("<BBIxx", _read_exact(fp, 8))
435+
if method != 8:
436+
raise BadGzipFile('Unknown compression method')
437+
438+
if flag & FEXTRA:
439+
# Read & discard the extra field, if present
440+
extra_len, = struct.unpack("<H", _read_exact(fp, 2))
441+
_read_exact(fp, extra_len)
442+
if flag & FNAME:
443+
# Read and discard a null-terminated string containing the filename
444+
while True:
445+
s = fp.read(1)
446+
if not s or s==b'\000':
447+
break
448+
if flag & FCOMMENT:
449+
# Read and discard a null-terminated string containing a comment
450+
while True:
451+
s = fp.read(1)
452+
if not s or s==b'\000':
453+
break
454+
if flag & FHCRC:
455+
_read_exact(fp, 2) # Read & discard the 16-bit header CRC
456+
return last_mtime
457+
458+
406459
class _GzipReader(_compression.DecompressReader):
407460
def __init__(self, fp):
408461
super().__init__(_PaddedFile(fp), zlib.decompressobj,
@@ -415,53 +468,11 @@ def _init_read(self):
415468
self._crc = zlib.crc32(b"")
416469
self._stream_size = 0 # Decompressed size of unconcatenated stream
417470

418-
def _read_exact(self, n):
419-
'''Read exactly *n* bytes from `self._fp`
420-
421-
This method is required because self._fp may be unbuffered,
422-
i.e. return short reads.
423-
'''
424-
425-
data = self._fp.read(n)
426-
while len(data) < n:
427-
b = self._fp.read(n - len(data))
428-
if not b:
429-
raise EOFError("Compressed file ended before the "
430-
"end-of-stream marker was reached")
431-
data += b
432-
return data
433-
434471
def _read_gzip_header(self):
435-
magic = self._fp.read(2)
436-
if magic == b'':
472+
last_mtime = _read_gzip_header(self._fp)
473+
if last_mtime is None:
437474
return False
438-
439-
if magic != b'\037\213':
440-
raise BadGzipFile('Not a gzipped file (%r)' % magic)
441-
442-
(method, flag,
443-
self._last_mtime) = struct.unpack("<BBIxx", self._read_exact(8))
444-
if method != 8:
445-
raise BadGzipFile('Unknown compression method')
446-
447-
if flag & FEXTRA:
448-
# Read & discard the extra field, if present
449-
extra_len, = struct.unpack("<H", self._read_exact(2))
450-
self._read_exact(extra_len)
451-
if flag & FNAME:
452-
# Read and discard a null-terminated string containing the filename
453-
while True:
454-
s = self._fp.read(1)
455-
if not s or s==b'\000':
456-
break
457-
if flag & FCOMMENT:
458-
# Read and discard a null-terminated string containing a comment
459-
while True:
460-
s = self._fp.read(1)
461-
if not s or s==b'\000':
462-
break
463-
if flag & FHCRC:
464-
self._read_exact(2) # Read & discard the 16-bit header CRC
475+
self._last_mtime = last_mtime
465476
return True
466477

467478
def read(self, size=-1):
@@ -524,7 +535,7 @@ def _read_eof(self):
524535
# We check that the computed CRC and size of the
525536
# uncompressed data matches the stored values. Note that the size
526537
# stored is the true file size mod 2**32.
527-
crc32, isize = struct.unpack("<II", self._read_exact(8))
538+
crc32, isize = struct.unpack("<II", _read_exact(self._fp, 8))
528539
if crc32 != self._crc:
529540
raise BadGzipFile("CRC check failed %s != %s" % (hex(crc32),
530541
hex(self._crc)))
@@ -544,21 +555,65 @@ def _rewind(self):
544555
super()._rewind()
545556
self._new_member = True
546557

558+
559+
def _create_simple_gzip_header(compresslevel: int,
560+
mtime = None) -> bytes:
561+
"""
562+
Write a simple gzip header with no extra fields.
563+
:param compresslevel: Compresslevel used to determine the xfl bytes.
564+
:param mtime: The mtime (must support conversion to a 32-bit integer).
565+
:return: A bytes object representing the gzip header.
566+
"""
567+
if mtime is None:
568+
mtime = time.time()
569+
if compresslevel == _COMPRESS_LEVEL_BEST:
570+
xfl = 2
571+
elif compresslevel == _COMPRESS_LEVEL_FAST:
572+
xfl = 4
573+
else:
574+
xfl = 0
575+
# Pack ID1 and ID2 magic bytes, method (8=deflate), header flags (no extra
576+
# fields added to header), mtime, xfl and os (255 for unknown OS).
577+
return struct.pack("<BBBBLBB", 0x1f, 0x8b, 8, 0, int(mtime), xfl, 255)
578+
579+
547580
def compress(data, compresslevel=_COMPRESS_LEVEL_BEST, *, mtime=None):
548581
"""Compress data in one shot and return the compressed string.
549-
Optional argument is the compression level, in range of 0-9.
582+
583+
compresslevel sets the compression level in range of 0-9.
584+
mtime can be used to set the modification time. The modification time is
585+
set to the current time by default.
550586
"""
551-
buf = io.BytesIO()
552-
with GzipFile(fileobj=buf, mode='wb', compresslevel=compresslevel, mtime=mtime) as f:
553-
f.write(data)
554-
return buf.getvalue()
587+
if mtime == 0:
588+
# Use zlib as it creates the header with 0 mtime by default.
589+
# This is faster and with less overhead.
590+
return zlib.compress(data, level=compresslevel, wbits=31)
591+
header = _create_simple_gzip_header(compresslevel, mtime)
592+
trailer = struct.pack("<LL", zlib.crc32(data), (len(data) & 0xffffffff))
593+
# Wbits=-15 creates a raw deflate block.
594+
return header + zlib.compress(data, wbits=-15) + trailer
595+
555596

556597
def decompress(data):
557598
"""Decompress a gzip compressed string in one shot.
558599
Return the decompressed string.
559600
"""
560-
with GzipFile(fileobj=io.BytesIO(data)) as f:
561-
return f.read()
601+
decompressed_members = []
602+
while True:
603+
fp = io.BytesIO(data)
604+
if _read_gzip_header(fp) is None:
605+
return b"".join(decompressed_members)
606+
# Use a zlib raw deflate compressor
607+
do = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
608+
# Read all the data except the header
609+
decompressed = do.decompress(data[fp.tell():])
610+
crc, length = struct.unpack("<II", do.unused_data[:8])
611+
if crc != zlib.crc32(decompressed):
612+
raise BadGzipFile("CRC check failed")
613+
if length != (len(decompressed) & 0xffffffff):
614+
raise BadGzipFile("Incorrect length of data produced")
615+
decompressed_members.append(decompressed)
616+
data = do.unused_data[8:].lstrip(b"\x00")
562617

563618

564619
def main():

Lib/test/test_zlib.py

+7
Original file line numberDiff line numberDiff line change
@@ -831,6 +831,13 @@ def test_wbits(self):
831831
dco = zlib.decompressobj(32 + 15)
832832
self.assertEqual(dco.decompress(gzip), HAMLET_SCENE)
833833

834+
for wbits in (-15, 15, 31):
835+
with self.subTest(wbits=wbits):
836+
expected = HAMLET_SCENE
837+
actual = zlib.decompress(
838+
zlib.compress(HAMLET_SCENE, wbits=wbits), wbits=wbits
839+
)
840+
self.assertEqual(expected, actual)
834841

835842
def choose_lines(source, number, seed=None, generator=random):
836843
"""Return a list of number lines randomly chosen from the source"""
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
:func:`zlib.compress` now accepts a wbits parameter which allows users to
2+
compress data as a raw deflate block without zlib headers and trailers in
3+
one go. Previously this required instantiating a ``zlib.compressobj``. It
4+
also provides a faster alternative to ``gzip.compress`` when wbits=31 is
5+
used.
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
Improve the speed of :func:`gzip.compress` and :func:`gzip.decompress` by
2+
compressing and decompressing at once in memory instead of in a streamed
3+
fashion.

0 commit comments

Comments
 (0)