Conversation

@eendebakpt (Contributor) commented Mar 19, 2022

  • The array_inplace_repeat method is inefficient for small arrays with a high number of repeats. This can be improved by using the same doubling approach as in https://bugs.python.org/issue47005.
  • Create utility functions _PyBytes_Repeat and _PyBytes_RepeatInPlace for code that was common to several objects (bytes, bytearray, unicodeobject, and array).

Microbenchmark:

import pyperf
runner = pyperf.Runner()

setup = "import array; a=array.array('d',[1,2,3,])"
code = "a*=1"
runner.timeit('repeat 1', stmt=code, setup=setup)

setup = "import array"
code = "a=array.array('d',[1,2,3,]); a*=1000"
runner.timeit('repeat 1000', stmt=code, setup=setup)

Results in:

repeat 1: Mean +- std dev: [base] 53.0 ns +- 1.4 ns -> [patch] 40.4 ns +- 0.7 ns: 1.31x faster
repeat 1000: Mean +- std dev: [base] 2.78 us +- 0.08 us -> [patch] 908 ns +- 52 ns: 3.06x faster

Geometric mean: 2.01x faster

https://bugs.python.org/issue47070

@eendebakpt (Contributor Author)

@sweeneyde Based on your suggestion I created utility functions _PyBytes_Repeat and _PyBytes_RepeatInPlace. They are in bytearrayobject.h, but perhaps there is a better location.

I defined the functions as static inline, but I am not sure they are inlined on all platforms. If not, there could be a (small) performance difference for the bytes and bytearray objects due to the overhead of calling the utility functions.

@sweeneyde (Member)

They should go in Include/internal/pycore_bytes.h so they don't leak into the public API.

I wouldn't worry too much about whether things will get inlined -- if the compiler decides it's beneficial, it can inline it. Also, I'm guessing the single function call overhead is vastly outscaled by the several memcpy calls.

@eendebakpt eendebakpt force-pushed the performance/array_inplace_repeat branch from 0212e6e to b603ad8 Compare March 20, 2022 22:34
Comment on lines 71 to 84
if (len_dest > 0) {
if (len_src == 1)
memset(dest, src[0], len_dest);
else {
memcpy(dest, src, len_src);

Py_ssize_t copied = len_src;
while (copied < len_dest) {
Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
memcpy(dest + copied, dest, bytes_to_copy);
copied += bytes_to_copy;
}
}
}
Member

Suggested change
if (len_dest > 0) {
if (len_src == 1)
memset(dest, src[0], len_dest);
else {
memcpy(dest, src, len_src);
Py_ssize_t copied = len_src;
while (copied < len_dest) {
Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
memcpy(dest + copied, dest, bytes_to_copy);
copied += bytes_to_copy;
}
}
}
if (len_dest > 0) {
memcpy(dest, src, len_src);
_PyBytes_RepeatInPlace(dest, len_src, len_dest);
}

Contributor Author

@sweeneyde Thanks for the suggestion. It really simplifies the code. Part of this I had already considered, but did not push because it seemed to hurt the small cases, e.g. [None] * 10 or b'a' * 5.

I implemented both your suggestions (no inlining, and skipping the special case where len_src == 1) in https://github.com/eendebakpt/cpython/tree/performance/array_inplace_repeat_v2

Microbenchmarks show an overall slowdown, in particular for cases with a low number of repeats or an object of length 1.

array repeat 1: Mean +- std dev: [patchfull] 185 ns +- 9 ns -> [patchv2full] 191 ns +- 1 ns: 1.03x slower
array repeat 2: Mean +- std dev: [patchfull] 185 ns +- 2 ns -> [patchv2full] 194 ns +- 2 ns: 1.05x slower
array repeat 10: Mean +- std dev: [patchfull] 211 ns +- 5 ns -> [patchv2full] 220 ns +- 3 ns: 1.05x slower
array repeat 1000: Mean +- std dev: [patchfull] 1.67 us +- 0.01 us -> [patchv2full] 1.70 us +- 0.01 us: 1.02x slower
array repeat 5000: Mean +- std dev: [patchfull] 9.18 us +- 0.07 us -> [patchv2full] 9.43 us +- 0.08 us: 1.03x slower
bytes(1) repeat 2: Mean +- std dev: [patchfull] 48.3 ns +- 0.7 ns -> [patchv2full] 50.7 ns +- 0.5 ns: 1.05x slower
bytes(1) repeat 10: Mean +- std dev: [patchfull] 48.1 ns +- 3.0 ns -> [patchv2full] 49.4 ns +- 0.6 ns: 1.03x slower
bytes(1) repeat 100: Mean +- std dev: [patchfull] 47.6 ns +- 0.5 ns -> [patchv2full] 50.2 ns +- 0.9 ns: 1.05x slower
bytes(1) repeat 1000: Mean +- std dev: [patchfull] 99.8 ns +- 0.5 ns -> [patchv2full] 103 ns +- 1 ns: 1.03x slower
array(1) repeat 1: Mean +- std dev: [patchfull] 84.4 ns +- 0.5 ns -> [patchv2full] 89.1 ns +- 1.7 ns: 1.06x slower
array(1) repeat 10: Mean +- std dev: [patchfull] 96.3 ns +- 1.3 ns -> [patchv2full] 101 ns +- 1 ns: 1.05x slower
array(1) repeat 1000: Mean +- std dev: [patchfull] 255 ns +- 9 ns -> [patchv2full] 260 ns +- 13 ns: 1.02x slower
array repeat inplace 10: Mean +- std dev: [patchfull] 472 ns +- 8 ns -> [patchv2full] 487 ns +- 22 ns: 1.03x slower
array repeat inplace 1000: Mean +- std dev: [patchfull] 867 ns +- 85 ns -> [patchv2full] 941 ns +- 81 ns: 1.08x slower
[control] create array 1: Mean +- std dev: [patchfull] 224 ns +- 2 ns -> [patchv2full] 226 ns +- 3 ns: 1.01x slower
[control] create array 10: Mean +- std dev: [patchfull] 447 ns +- 3 ns -> [patchv2full] 449 ns +- 3 ns: 1.00x slower
[control] create array 5000: Mean +- std dev: [patchfull] 128 us +- 4 us -> [patchv2full] 127 us +- 3 us: 1.01x faster

Benchmark hidden because not significant (4): bytes(1) repeat 1, array(1) repeat 5000, array repeat inplace 1, [control] create array 1000

My estimate is that the small cases are about 5% slower. Is that acceptable?

One way to reduce the two function calls to one would be something like this (perhaps with the case len_src == 1 added):

void
_PyBytes_Repeat(char* dest, Py_ssize_t len_dest,
    const char* src, Py_ssize_t len_src, bool inplace)
{
    if (len_dest > 0) {
        if (!inplace)
            memcpy(dest, src, len_src);

        Py_ssize_t copied = len_src;
        while (copied < len_dest) {
            Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
            memcpy(dest + copied, dest, bytes_to_copy);
            copied += bytes_to_copy;
        }
    }
}

Member

Thanks for those benchmarks. Out of curiosity, how are you compiling? PGO and LTO can change things.

I'd say the important things to optimize for are code complexity and the linear coefficient, and the O(1) overhead is not too important. Though since the starting-length-1 case is probably the most common, you're probably right that having a memset branch in both functions might be worth it then.

How would you feel about something like this?

def _PyBytes_RepeatInPlace(...):
    if len_src == 1:
        memset(...)
    while ...:
        ...

def _PyBytes_Repeat(...):
    if len_dest == 0:
        return
    elif len_src == 1:
        memset(...)
    else:
        memcpy(...)
        _PyBytes_RepeatInPlace(...)

I would still prefer non-static inline functions though, even if there is a small penalty.

Contributor Author

The PR has been updated.

array(3) repeat 0: Mean +- std dev: [patchv3] 162 ns +- 10 ns -> [patchv3inline] 157 ns +- 3 ns: 1.03x faster
array(3) repeat 1: Mean +- std dev: [patchv3] 189 ns +- 4 ns -> [patchv3inline] 183 ns +- 2 ns: 1.03x faster
array(3) repeat 2: Mean +- std dev: [patchv3] 191 ns +- 3 ns -> [patchv3inline] 184 ns +- 9 ns: 1.04x faster
array(3) repeat 10: Mean +- std dev: [patchv3] 222 ns +- 3 ns -> [patchv3inline] 206 ns +- 2 ns: 1.08x faster
array(3) repeat 1000: Mean +- std dev: [patchv3] 1.67 us +- 0.01 us -> [patchv3inline] 1.66 us +- 0.02 us: 1.01x faster
array(1) repeat 10: Mean +- std dev: [patchv3] 101 ns +- 2 ns -> [patchv3inline] 96.0 ns +- 1.6 ns: 1.05x faster
array(1) repeat 1000: Mean +- std dev: [patchv3] 263 ns +- 4 ns -> [patchv3inline] 260 ns +- 4 ns: 1.01x faster
array(1) repeat 5000: Mean +- std dev: [patchv3] 1.40 us +- 0.02 us -> [patchv3inline] 1.40 us +- 0.01 us: 1.00x faster
array(3) repeat inplace 1: Mean +- std dev: [patchv3] 401 ns +- 9 ns -> [patchv3inline] 405 ns +- 12 ns: 1.01x slower
array(3) repeat inplace 10: Mean +- std dev: [patchv3] 462 ns +- 15 ns -> [patchv3inline] 457 ns +- 10 ns: 1.01x faster
int array(3) repeat inplace 0: Mean +- std dev: [patchv3] 364 ns +- 8 ns -> [patchv3inline] 367 ns +- 11 ns: 1.01x slower
int array(3) repeat inplace 1: Mean +- std dev: [patchv3] 347 ns +- 8 ns -> [patchv3inline] 351 ns +- 7 ns: 1.01x slower
bytes(1) repeat 1: Mean +- std dev: [patchv3] 36.6 ns +- 0.4 ns -> [patchv3inline] 37.0 ns +- 0.7 ns: 1.01x slower
bytes(1) repeat 2: Mean +- std dev: [patchv3] 48.2 ns +- 0.4 ns -> [patchv3inline] 49.1 ns +- 1.0 ns: 1.02x slower
bytes(1) repeat 10: Mean +- std dev: [patchv3] 48.0 ns +- 0.6 ns -> [patchv3inline] 48.8 ns +- 0.5 ns: 1.02x slower
bytes(1) repeat 100: Mean +- std dev: [patchv3] 48.5 ns +- 0.7 ns -> [patchv3inline] 49.1 ns +- 1.2 ns: 1.01x slower
bytes(1) repeat 1000: Mean +- std dev: [patchv3] 103 ns +- 2 ns -> [patchv3inline] 102 ns +- 2 ns: 1.01x faster
bytes(7) repeat 1: Mean +- std dev: [patchv3] 36.5 ns +- 0.3 ns -> [patchv3inline] 36.8 ns +- 0.5 ns: 1.01x slower
bytes(7) repeat 2: Mean +- std dev: [patchv3] 50.8 ns +- 0.4 ns -> [patchv3inline] 50.6 ns +- 1.0 ns: 1.00x faster
bytes(7) repeat 10: Mean +- std dev: [patchv3] 59.3 ns +- 0.9 ns -> [patchv3inline] 58.1 ns +- 0.7 ns: 1.02x faster
bytearray(1) repeat 1: Mean +- std dev: [patchv3] 57.9 ns +- 0.8 ns -> [patchv3inline] 55.2 ns +- 0.6 ns: 1.05x faster
bytearray(1) repeat 2: Mean +- std dev: [patchv3] 58.7 ns +- 0.6 ns -> [patchv3inline] 55.9 ns +- 0.9 ns: 1.05x faster
bytearray(1) repeat 10: Mean +- std dev: [patchv3] 58.1 ns +- 0.6 ns -> [patchv3inline] 55.6 ns +- 0.7 ns: 1.04x faster
bytearray(1) repeat 100: Mean +- std dev: [patchv3] 58.2 ns +- 0.5 ns -> [patchv3inline] 56.3 ns +- 1.0 ns: 1.03x faster
bytearray(1) repeat 1000: Mean +- std dev: [patchv3] 72.8 ns +- 1.5 ns -> [patchv3inline] 71.4 ns +- 2.0 ns: 1.02x faster
bytearray(10) repeat 1: Mean +- std dev: [patchv3] 58.2 ns +- 0.9 ns -> [patchv3inline] 55.4 ns +- 0.6 ns: 1.05x faster
bytearray(10) repeat 2: Mean +- std dev: [patchv3] 61.3 ns +- 1.1 ns -> [patchv3inline] 58.3 ns +- 0.7 ns: 1.05x faster
bytearray(10) repeat 10: Mean +- std dev: [patchv3] 69.5 ns +- 1.2 ns -> [patchv3inline] 67.5 ns +- 0.8 ns: 1.03x faster
bytearray(10) repeat 1000: Mean +- std dev: [patchv3] 231 ns +- 6 ns -> [patchv3inline] 229 ns +- 2 ns: 1.01x faster
[control] create array 1: Mean +- std dev: [patchv3] 220 ns +- 4 ns -> [patchv3inline] 221 ns +- 8 ns: 1.01x slower
[control] create array 1000: Mean +- std dev: [patchv3] 25.5 us +- 0.6 us -> [patchv3inline] 25.2 us +- 0.4 us: 1.01x faster
[control] create array 5000: Mean +- std dev: [patchv3] 126 us +- 1 us -> [patchv3inline] 125 us +- 4 us: 1.01x faster

Benchmark hidden because not significant (11): array(3) repeat 5000, array(1) repeat 0, array(1) repeat 1, array(3) repeat inplace 0, array(3) repeat inplace 1000, int array(3) repeat inplace 10, int array(3) repeat inplace 1000, bytes(7) repeat 100, bytes(7) repeat 1000, bytearray(10) repeat 100, [control] create array 10

Geometric mean: 1.01x faster

I will try to get a benchmark with pyperformance.

Contributor Author

Pyperformance results do not show any significant differences:

chaos: Mean +- std dev: [bb] 90.2 ms +- 0.7 ms -> [vv3] 89.8 ms +- 0.6 ms: 1.00x faster
deltablue: Mean +- std dev: [bb] 4.89 ms +- 0.07 ms -> [vv3] 4.98 ms +- 0.11 ms: 1.02x slower
go: Mean +- std dev: [bb] 171 ms +- 2 ms -> [vv3] 173 ms +- 2 ms: 1.01x slower
hexiom: Mean +- std dev: [bb] 8.35 ms +- 0.05 ms -> [vv3] 8.43 ms +- 0.05 ms: 1.01x slower
json_dumps: Mean +- std dev: [bb] 14.9 ms +- 0.1 ms -> [vv3] 15.1 ms +- 0.4 ms: 1.02x slower
logging_format: Mean +- std dev: [bb] 7.51 us +- 0.24 us -> [vv3] 7.69 us +- 0.12 us: 1.02x slower
logging_simple: Mean +- std dev: [bb] 6.86 us +- 0.10 us -> [vv3] 6.91 us +- 0.09 us: 1.01x slower
meteor_contest: Mean +- std dev: [bb] 126 ms +- 1 ms -> [vv3] 126 ms +- 1 ms: 1.01x faster
nbody: Mean +- std dev: [bb] 121 ms +- 3 ms -> [vv3] 113 ms +- 1 ms: 1.07x faster
nqueens: Mean +- std dev: [bb] 110 ms +- 1 ms -> [vv3] 107 ms +- 1 ms: 1.02x faster
pathlib: Mean +- std dev: [bb] 23.6 ms +- 1.1 ms -> [vv3] 22.7 ms +- 0.5 ms: 1.04x faster
pickle: Mean +- std dev: [bb] 10.7 us +- 0.1 us -> [vv3] 10.5 us +- 0.1 us: 1.02x faster
pickle_dict: Mean +- std dev: [bb] 30.6 us +- 0.2 us -> [vv3] 30.3 us +- 0.4 us: 1.01x faster
pyflate: Mean +- std dev: [bb] 568 ms +- 6 ms -> [vv3] 577 ms +- 9 ms: 1.02x slower
python_startup_no_site: Mean +- std dev: [bb] 7.89 ms +- 0.35 ms -> [vv3] 7.99 ms +- 0.39 ms: 1.01x slower
raytrace: Mean +- std dev: [bb] 390 ms +- 2 ms -> [vv3] 393 ms +- 3 ms: 1.01x slower
regex_compile: Mean +- std dev: [bb] 169 ms +- 1 ms -> [vv3] 168 ms +- 1 ms: 1.01x faster
regex_dna: Mean +- std dev: [bb] 235 ms +- 3 ms -> [vv3] 236 ms +- 2 ms: 1.01x slower
regex_v8: Mean +- std dev: [bb] 29.1 ms +- 0.3 ms -> [vv3] 28.6 ms +- 0.3 ms: 1.02x faster
richards: Mean +- std dev: [bb] 59.8 ms +- 1.6 ms -> [vv3] 61.2 ms +- 1.8 ms: 1.02x slower
scimark_fft: Mean +- std dev: [bb] 451 ms +- 7 ms -> [vv3] 446 ms +- 2 ms: 1.01x faster
scimark_monte_carlo: Mean +- std dev: [bb] 88.4 ms +- 1.3 ms -> [vv3] 87.6 ms +- 1.6 ms: 1.01x faster
scimark_sparse_mat_mult: Mean +- std dev: [bb] 6.04 ms +- 0.04 ms -> [vv3] 5.84 ms +- 0.03 ms: 1.04x faster
spectral_norm: Mean +- std dev: [bb] 147 ms +- 1 ms -> [vv3] 148 ms +- 2 ms: 1.01x slower
unpack_sequence: Mean +- std dev: [bb] 55.4 ns +- 1.3 ns -> [vv3] 54.7 ns +- 0.8 ns: 1.01x faster
unpickle_list: Mean +- std dev: [bb] 5.85 us +- 0.38 us -> [vv3] 5.72 us +- 0.06 us: 1.02x faster
unpickle_pure_python: Mean +- std dev: [bb] 313 us +- 3 us -> [vv3] 317 us +- 2 us: 1.01x slower
xml_etree_parse: Mean +- std dev: [bb] 175 ms +- 4 ms -> [vv3] 173 ms +- 3 ms: 1.02x faster

Benchmark hidden because not significant (18): 2to3, fannkuch, float, json_loads, logging_silent, pickle_list, pickle_pure_python, pidigits, python_startup, regex_effbot, scimark_lu, scimark_sor, sqlite_synth, telco, unpickle, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.00x faster

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@sweeneyde sweeneyde self-assigned this Mar 22, 2022
@eendebakpt eendebakpt force-pushed the performance/array_inplace_repeat branch from 457ef6d to 76a999d Compare March 25, 2022 09:42
@eendebakpt eendebakpt force-pushed the performance/array_inplace_repeat branch from 76a999d to 71d4952 Compare March 25, 2022 09:45
@eendebakpt
Copy link
Contributor Author

I have made the requested changes; please review again

@sweeneyde (Member) left a comment

Thanks for your patience. This is looking close!

The new API feels like a good solution -- it's nice that there's no need to pass an inplace=1, but it's also good that there's only one function to remember.

Side note: I believe --enable-optimizations does not enable LTO by default, you have to explicitly write --enable-optimizations --with-lto.

@sweeneyde (Member) left a comment

Last couple things, but otherwise LGTM.

count = 0;
mysize = Py_SIZE(self);
else if (count == 1) {
Py_INCREF(self);
Member

Indentation

* This method repeatedly doubles the number of bytes copied to reduce
* the number of invocations of memcpy.
*/
PyAPI_FUNC(void)
Contributor

It should not be exported.

Member

Shouldn't it? I think that's the way to be able to use it from the dynamically-loaded extension module created by arraymodule.c.

Contributor

Yes, I thought the array module was a builtin core module.

@sweeneyde sweeneyde merged commit 850687d into python:main Mar 28, 2022