Skip to content

base64.b85encode uses significant amount of RAM #101178

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
AbdealiLoKo opened this issue Jan 20, 2023 · 4 comments · May be fixed by #102753
Open

base64.b85encode uses significant amount of RAM #101178

AbdealiLoKo opened this issue Jan 20, 2023 · 4 comments · May be fixed by #102753
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@AbdealiLoKo
Copy link

AbdealiLoKo commented Jan 20, 2023

Bug report

On the same string:

  • b64encode: RAM: 604MB, CPU: 79%, Time taken: 1.18sec and gave a result
  • b85encode: RAM: 6.9GB, CPU: 64%, Time taken: 45sec and crashed due to insufficient RAM (most likely)

b85encode takes up my entire RAM and crashes

On IPython:

In [1]: import base64

In [2]: contents = b'a' * 250 * 1024 * 1024

In [3]: %time len(base64.b64encode(contents).decode("ascii")) / 1024 / 1024
CPU times: user 489 ms, sys: 414 ms, total: 904 ms
Wall time: 974 ms
Out[3]: 333.33333587646484

In [4]: %time len(base64.b85encode(contents).decode("ascii")) / 1024 / 1024
Killed

Here is the GNU time stats:

$ /usr/bin/time /opt/python3.9/bin/python -c "import base64; print(len(base64.b64encode(b'a' * 250 * 1024 * 1024)) / 1024 / 1024)"
333.33333587646484
0.55user 0.39system 0:01.18elapsed 79%CPU (0avgtext+0avgdata 604008maxresident)k
0inputs+0outputs (0major+151155minor)pagefaults 0swaps

$ /usr/bin/time /opt/python3.9/bin/python -c "import base64; print(len(base64.b85encode(b'a' * 250 * 1024 * 1024)) / 1024 / 1024)"
26.08user 3.39system 0:45.86elapsed 64%CPU (0avgtext+0avgdata 6966080maxresident)k
512inputs+0outputs (2major+1756895minor)pagefaults 0swaps

I have gotten same results in Python 3.6, 3.7, 3.8, 3.9

Your environment

  • CPython versions tested on: Python 3.9 installed with miniconda
  • Operating system and architecture: CentOS 7, AWS t3.large machine

Linked PRs

@AbdealiLoKo AbdealiLoKo added the type-bug An unexpected behavior, bug, or error label Jan 20, 2023
@mdboom mdboom added the performance Performance or resource usage label Jan 20, 2023
@sobolevn
Copy link
Member

I can confirm, that this still happens on main:

» time ./python.exe -c "import base64; print(len(base64.b85encode(b'a' * 250 * 1024 * 1024)) / 1024 / 1024)"
312.5
./python.exe -c   160.32s user 74.28s system 87% cpu 4:29.26 total

Compare it to almost-instant 64 version (it uses binascii's C implementation):

» time ./python.exe -c "import base64; print(len(base64.b64encode(b'a' * 250 * 1024 * 1024)) / 1024 / 1024)"
333.33333587646484
./python.exe -c   0.91s user 0.26s system 98% cpu 1.187 total

@eendebakpt
Copy link
Contributor

I looked into the implementation of b85encode (in _85encode). The memory problems are caused by the fact that internally copies of the data are made. The copies consist of small chuncks, where each chuck has the memory overhead of a python object.

  • First with words = struct.Struct('!%dI' % (len(b) // 4)).unpack(b). There is a iter_unpack method in the Struct class, but it does not work as expected.
  • And then with a list comprehension chunks = [ ... for word in words]. In the case padding=False we can replace this with a generator which would save memory. For padding=True we need to modify the last element of the list, which does not work for the generator.

The poor performance (compared to b64encode) is mainly due to the part (chars2[word // 614125] + chars2[word // 85 % 7225] + chars[word % 85]).

To make any significant gains an implementation in C seems the most worthwhile.

kangtastic added a commit to kangtastic/cpython that referenced this issue Mar 13, 2023
Add a base85 encoder and decoder to `binascii` and four new functions,
`binascii.a2b_ascii85()`, `a2b_base85()`, `b2a_ascii85()`, and
`b2a_base85()`. These can be used to replace the existing pure-Python
base85 implementation within `base64.a85encode()`, `b85encode()`,
`a85decode()`, and `b85encode()`. Performance and memory usage both
benefit by at least an order of magnitude.

No API or documentation changes are necessary with respect to
`base64.a85encode()`, `b85encode()`, etc., and all existing unit tests
for those functions continue to pass without modification.

The only significant observable behavioral difference compared to the
old base85 implementation should be that attempting to decode Ascii85
or base85 data of length 1 mod 5 (after accounting for Ascii85 quirks)
now results in an error. Such data would never be produced by any
base85 encoder, so it must be invalid.

Resolves: pythongh-101178
kangtastic added a commit to kangtastic/cpython that referenced this issue Mar 13, 2023
Add a base85 encoder and decoder to `binascii` and four new functions,
`binascii.a2b_ascii85()`, `a2b_base85()`, `b2a_ascii85()`, and
`b2a_base85()`. These are used to replace the existing pure-Python
base85 implementation within `base64.a85encode()`, `b85encode()`,
`a85decode()`, and `b85encode()`. Performance and memory usage both
benefit by at least an order of magnitude.

No API or documentation changes are necessary with respect to
`base64.a85encode()`, `b85encode()`, etc., and all existing unit tests
for those functions continue to pass without modification.

Note that attempting to decode Ascii85 or base85 data of length 1 mod 5
(after accounting for Ascii85 quirks) now produces an error. No base85
encoder would emit such data, so it must be invalid. This should be the
only notable external-facing difference in behavior compared to the old
implementation.

Resolves: pythongh-101178
kangtastic added a commit to kangtastic/cpython that referenced this issue Mar 16, 2023
Add Ascii85 and base85 encoders and decoders to `binascii` and four new
functions, `binascii.a2b_ascii85()`, `a2b_base85()`, `b2a_ascii85()`,
and `b2a_base85()`.

These replace the existing implementations in `base64.a85encode()`,
`b85encode()`, `a85decode()`, and `b85encode()`. Performance is greatly
improved, and memory usage is now constant instead of linear.

No API or documentation changes are necessary with respect to
`base64.a85encode()`, `b85encode()`, etc., and all existing unit tests
for those functions continue to pass without modification.

Note that attempting to decode Ascii85 or base85 data of length 1 mod 5
(after accounting for Ascii85 quirks) now produces an error, as no
encoder would emit such data. This should be the only significant
externally visible difference compared to the old implementation.

Resolves: pythongh-101178
kangtastic added a commit to kangtastic/cpython that referenced this issue Mar 23, 2023
romuald added a commit to romuald/cpython that referenced this issue Nov 18, 2023
Refactor code to make use of generators instead of allocating 2 potentially
huge lists for large datasets
romuald added a commit to romuald/cpython that referenced this issue Nov 19, 2023
Refactor code to make use of generators instead of allocating 2 potentially
huge lists for large datasets
@iritkatriel iritkatriel added the stdlib Python modules in the Lib dir label Nov 26, 2023
romuald added a commit to romuald/cpython that referenced this issue Mar 3, 2024
Refactor code to make use of generators instead of allocating 2 potentially
huge lists for large datasets
kangtastic added a commit to kangtastic/cpython that referenced this issue Mar 19, 2024
kangtastic added a commit to kangtastic/cpython that referenced this issue Apr 21, 2025
Add Ascii85, base85, and Z85 encoders and decoders to `binascii`,
replacing the existing pure Python implementations in `base64`.

No API or documentation changes are necessary with respect to
`base64.a85encode()`, `b85encode()`, etc., and all existing unit
tests for those functions continue to pass without modification.

Note that attempting to decode Ascii85 or base85 data of length 1 mod 5
(after accounting for Ascii85 quirks) now produces an error, as no
encoder would emit such data. This should be the only significant
externally visible difference compared to the old implementation.

Resolves: pythongh-101178
@kangtastic
Copy link

Pinging this issue in hopes of finding a reviewer for my PR gh-102753. It's been over two years 😅

@eendebakpt
Copy link
Contributor

Pinging this issue in hopes of finding a reviewer for my PR gh-102753. It's been over two years 😅

Thanks for the patience! I pinged the cpython triage team to have a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants