Conversation

@eendebakpt (Contributor) commented Mar 19, 2022

  • The array_inplace_repeat method is inefficient for small arrays with a high number of repeats. This can be improved by using the same doubling approach as in https://bugs.python.org/issue47005.
  • Create utility functions _PyBytes_Repeat and _PyBytes_RepeatInPlace for code that was common to several objects (bytes, bytearray, unicodeobject, and array).

Microbenchmark:

import pyperf
runner = pyperf.Runner()

setup = "import array; a=array.array('d',[1,2,3,])"
code = "a*=1"
runner.timeit('repeat 1', stmt=code, setup=setup)

setup = "import array"
code = "a=array.array('d',[1,2,3,]); a*=1000"
runner.timeit('repeat 1000', stmt=code, setup=setup)

Results in:

repeat 1: Mean +- std dev: [base] 53.0 ns +- 1.4 ns -> [patch] 40.4 ns +- 0.7 ns: 1.31x faster
repeat 1000: Mean +- std dev: [base] 2.78 us +- 0.08 us -> [patch] 908 ns +- 52 ns: 3.06x faster

Geometric mean: 2.01x faster

https://bugs.python.org/issue47070

@eendebakpt (Contributor Author)

@sweeneyde Based on your suggestion I created utility functions _PyBytes_Repeat and _PyBytes_RepeatInPlace. They are in bytearrayobject.h, but perhaps there is a better location.

I defined the functions as static inline, but I am not sure they are inlined on all platforms. If not, there could be a (small) performance difference for the bytes and bytearray objects due to the overhead of calling the utility functions.

@sweeneyde (Member)

They should go in Include/internal/pycore_bytes.h so they don't leak into the public API.

I wouldn't worry too much about whether things will get inlined -- if the compiler decides it's beneficial, it can inline it. Also, I'm guessing the single function call overhead is vastly outscaled by the several memcpy calls.

@eendebakpt eendebakpt force-pushed the performance/array_inplace_repeat branch from 0212e6e to b603ad8 Compare March 20, 2022 22:34
Comment on lines 71 to 84
if (len_dest > 0) {
if (len_src == 1)
memset(dest, src[0], len_dest);
else {
memcpy(dest, src, len_src);

Py_ssize_t copied = len_src;
while (copied < len_dest) {
Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
memcpy(dest + copied, dest, bytes_to_copy);
copied += bytes_to_copy;
}
}
}
Member

Suggested change
if (len_dest > 0) {
if (len_src == 1)
memset(dest, src[0], len_dest);
else {
memcpy(dest, src, len_src);
Py_ssize_t copied = len_src;
while (copied < len_dest) {
Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
memcpy(dest + copied, dest, bytes_to_copy);
copied += bytes_to_copy;
}
}
}
if (len_dest > 0) {
memcpy(dest, src, len_src);
_PyBytes_RepeatInPlace(dest, len_src, len_dest);
}

Contributor Author

@sweeneyde Thanks for the suggestion. It really simplifies the code. Part of this I had already considered, but did not push because it seemed to hurt the small cases, e.g. [None] * 10 or b'a' * 5.

I implemented both your suggestions (no inlining, and skipping the special case where len_src == 1) in https://github.com/eendebakpt/cpython/tree/performance/array_inplace_repeat_v2

Microbenchmarks show an overall slowdown, in particular for cases with a low number of repeats or an object of length 1.

array repeat 1: Mean +- std dev: [patchfull] 185 ns +- 9 ns -> [patchv2full] 191 ns +- 1 ns: 1.03x slower
array repeat 2: Mean +- std dev: [patchfull] 185 ns +- 2 ns -> [patchv2full] 194 ns +- 2 ns: 1.05x slower
array repeat 10: Mean +- std dev: [patchfull] 211 ns +- 5 ns -> [patchv2full] 220 ns +- 3 ns: 1.05x slower
array repeat 1000: Mean +- std dev: [patchfull] 1.67 us +- 0.01 us -> [patchv2full] 1.70 us +- 0.01 us: 1.02x slower
array repeat 5000: Mean +- std dev: [patchfull] 9.18 us +- 0.07 us -> [patchv2full] 9.43 us +- 0.08 us: 1.03x slower
bytes(1) repeat 2: Mean +- std dev: [patchfull] 48.3 ns +- 0.7 ns -> [patchv2full] 50.7 ns +- 0.5 ns: 1.05x slower
bytes(1) repeat 10: Mean +- std dev: [patchfull] 48.1 ns +- 3.0 ns -> [patchv2full] 49.4 ns +- 0.6 ns: 1.03x slower
bytes(1) repeat 100: Mean +- std dev: [patchfull] 47.6 ns +- 0.5 ns -> [patchv2full] 50.2 ns +- 0.9 ns: 1.05x slower
bytes(1) repeat 1000: Mean +- std dev: [patchfull] 99.8 ns +- 0.5 ns -> [patchv2full] 103 ns +- 1 ns: 1.03x slower
array(1) repeat 1: Mean +- std dev: [patchfull] 84.4 ns +- 0.5 ns -> [patchv2full] 89.1 ns +- 1.7 ns: 1.06x slower
array(1) repeat 10: Mean +- std dev: [patchfull] 96.3 ns +- 1.3 ns -> [patchv2full] 101 ns +- 1 ns: 1.05x slower
array(1) repeat 1000: Mean +- std dev: [patchfull] 255 ns +- 9 ns -> [patchv2full] 260 ns +- 13 ns: 1.02x slower
array repeat inplace 10: Mean +- std dev: [patchfull] 472 ns +- 8 ns -> [patchv2full] 487 ns +- 22 ns: 1.03x slower
array repeat inplace 1000: Mean +- std dev: [patchfull] 867 ns +- 85 ns -> [patchv2full] 941 ns +- 81 ns: 1.08x slower
[control] create array 1: Mean +- std dev: [patchfull] 224 ns +- 2 ns -> [patchv2full] 226 ns +- 3 ns: 1.01x slower
[control] create array 10: Mean +- std dev: [patchfull] 447 ns +- 3 ns -> [patchv2full] 449 ns +- 3 ns: 1.00x slower
[control] create array 5000: Mean +- std dev: [patchfull] 128 us +- 4 us -> [patchv2full] 127 us +- 3 us: 1.01x faster

Benchmark hidden because not significant (4): bytes(1) repeat 1, array(1) repeat 5000, array repeat inplace 1, [control] create array 1000

My estimate is that the small cases are about 5% slower. Is that acceptable?

One way to reduce the two function calls to one would be something like this (perhaps with the case len_src == 1 added):

void
_PyBytes_Repeat(char* dest, Py_ssize_t len_dest,
    const char* src, Py_ssize_t len_src, bool inplace)
{
    if (len_dest > 0) {
        if (!inplace)
            memcpy(dest, src, len_src);

        Py_ssize_t copied = len_src;
        while (copied < len_dest) {
            Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
            memcpy(dest + copied, dest, bytes_to_copy);
            copied += bytes_to_copy;
        }
    }
}

Member

Thanks for those benchmarks. Out of curiosity, how are you compiling? PGO and LTO can change things.

I'd say the important things to optimize for are code complexity and the linear coefficient, and the O(1) overhead is not too important. Though since the starting-length-1 case is probably the most common, you're probably right that having a memset branch in both functions might be worth it then.

How would you feel about something like this?

def _PyBytes_RepeatInPlace(...):
    if len_src == 1:
        memset(...)
    while ...:
        ...

def _PyBytes_Repeat(...):
    if len_dest == 0:
        return
    elif len_src == 1:
        memset(...)
    else:
        memcpy(...)
        _PyBytes_RepeatInPlace(...)

I would still prefer non-static inline functions though, even if there is a small penalty.

Contributor Author

The PR has been updated.

array(3) repeat 0: Mean +- std dev: [patchv3] 162 ns +- 10 ns -> [patchv3inline] 157 ns +- 3 ns: 1.03x faster
array(3) repeat 1: Mean +- std dev: [patchv3] 189 ns +- 4 ns -> [patchv3inline] 183 ns +- 2 ns: 1.03x faster
array(3) repeat 2: Mean +- std dev: [patchv3] 191 ns +- 3 ns -> [patchv3inline] 184 ns +- 9 ns: 1.04x faster
array(3) repeat 10: Mean +- std dev: [patchv3] 222 ns +- 3 ns -> [patchv3inline] 206 ns +- 2 ns: 1.08x faster
array(3) repeat 1000: Mean +- std dev: [patchv3] 1.67 us +- 0.01 us -> [patchv3inline] 1.66 us +- 0.02 us: 1.01x faster
array(1) repeat 10: Mean +- std dev: [patchv3] 101 ns +- 2 ns -> [patchv3inline] 96.0 ns +- 1.6 ns: 1.05x faster
array(1) repeat 1000: Mean +- std dev: [patchv3] 263 ns +- 4 ns -> [patchv3inline] 260 ns +- 4 ns: 1.01x faster
array(1) repeat 5000: Mean +- std dev: [patchv3] 1.40 us +- 0.02 us -> [patchv3inline] 1.40 us +- 0.01 us: 1.00x faster
array(3) repeat inplace 1: Mean +- std dev: [patchv3] 401 ns +- 9 ns -> [patchv3inline] 405 ns +- 12 ns: 1.01x slower
array(3) repeat inplace 10: Mean +- std dev: [patchv3] 462 ns +- 15 ns -> [patchv3inline] 457 ns +- 10 ns: 1.01x faster
int array(3) repeat inplace 0: Mean +- std dev: [patchv3] 364 ns +- 8 ns -> [patchv3inline] 367 ns +- 11 ns: 1.01x slower
int array(3) repeat inplace 1: Mean +- std dev: [patchv3] 347 ns +- 8 ns -> [patchv3inline] 351 ns +- 7 ns: 1.01x slower
bytes(1) repeat 1: Mean +- std dev: [patchv3] 36.6 ns +- 0.4 ns -> [patchv3inline] 37.0 ns +- 0.7 ns: 1.01x slower
bytes(1) repeat 2: Mean +- std dev: [patchv3] 48.2 ns +- 0.4 ns -> [patchv3inline] 49.1 ns +- 1.0 ns: 1.02x slower
bytes(1) repeat 10: Mean +- std dev: [patchv3] 48.0 ns +- 0.6 ns -> [patchv3inline] 48.8 ns +- 0.5 ns: 1.02x slower
bytes(1) repeat 100: Mean +- std dev: [patchv3] 48.5 ns +- 0.7 ns -> [patchv3inline] 49.1 ns +- 1.2 ns: 1.01x slower
bytes(1) repeat 1000: Mean +- std dev: [patchv3] 103 ns +- 2 ns -> [patchv3inline] 102 ns +- 2 ns: 1.01x faster
bytes(7) repeat 1: Mean +- std dev: [patchv3] 36.5 ns +- 0.3 ns -> [patchv3inline] 36.8 ns +- 0.5 ns: 1.01x slower
bytes(7) repeat 2: Mean +- std dev: [patchv3] 50.8 ns +- 0.4 ns -> [patchv3inline] 50.6 ns +- 1.0 ns: 1.00x faster
bytes(7) repeat 10: Mean +- std dev: [patchv3] 59.3 ns +- 0.9 ns -> [patchv3inline] 58.1 ns +- 0.7 ns: 1.02x faster
bytearray(1) repeat 1: Mean +- std dev: [patchv3] 57.9 ns +- 0.8 ns -> [patchv3inline] 55.2 ns +- 0.6 ns: 1.05x faster
bytearray(1) repeat 2: Mean +- std dev: [patchv3] 58.7 ns +- 0.6 ns -> [patchv3inline] 55.9 ns +- 0.9 ns: 1.05x faster
bytearray(1) repeat 10: Mean +- std dev: [patchv3] 58.1 ns +- 0.6 ns -> [patchv3inline] 55.6 ns +- 0.7 ns: 1.04x faster
bytearray(1) repeat 100: Mean +- std dev: [patchv3] 58.2 ns +- 0.5 ns -> [patchv3inline] 56.3 ns +- 1.0 ns: 1.03x faster
bytearray(1) repeat 1000: Mean +- std dev: [patchv3] 72.8 ns +- 1.5 ns -> [patchv3inline] 71.4 ns +- 2.0 ns: 1.02x faster
bytearray(10) repeat 1: Mean +- std dev: [patchv3] 58.2 ns +- 0.9 ns -> [patchv3inline] 55.4 ns +- 0.6 ns: 1.05x faster
bytearray(10) repeat 2: Mean +- std dev: [patchv3] 61.3 ns +- 1.1 ns -> [patchv3inline] 58.3 ns +- 0.7 ns: 1.05x faster
bytearray(10) repeat 10: Mean +- std dev: [patchv3] 69.5 ns +- 1.2 ns -> [patchv3inline] 67.5 ns +- 0.8 ns: 1.03x faster
bytearray(10) repeat 1000: Mean +- std dev: [patchv3] 231 ns +- 6 ns -> [patchv3inline] 229 ns +- 2 ns: 1.01x faster
[control] create array 1: Mean +- std dev: [patchv3] 220 ns +- 4 ns -> [patchv3inline] 221 ns +- 8 ns: 1.01x slower
[control] create array 1000: Mean +- std dev: [patchv3] 25.5 us +- 0.6 us -> [patchv3inline] 25.2 us +- 0.4 us: 1.01x faster
[control] create array 5000: Mean +- std dev: [patchv3] 126 us +- 1 us -> [patchv3inline] 125 us +- 4 us: 1.01x faster

Benchmark hidden because not significant (11): array(3) repeat 5000, array(1) repeat 0, array(1) repeat 1, array(3) repeat inplace 0, array(3) repeat inplace 1000, int array(3) repeat inplace 10, int array(3) repeat inplace 1000, bytes(7) repeat 100, bytes(7) repeat 1000, bytearray(10) repeat 100, [control] create array 10

Geometric mean: 1.01x faster

I will try to get a benchmark with pyperformance.

Contributor Author

Pyperformance results do not show any significant differences:

chaos: Mean +- std dev: [bb] 90.2 ms +- 0.7 ms -> [vv3] 89.8 ms +- 0.6 ms: 1.00x faster
deltablue: Mean +- std dev: [bb] 4.89 ms +- 0.07 ms -> [vv3] 4.98 ms +- 0.11 ms: 1.02x slower
go: Mean +- std dev: [bb] 171 ms +- 2 ms -> [vv3] 173 ms +- 2 ms: 1.01x slower
hexiom: Mean +- std dev: [bb] 8.35 ms +- 0.05 ms -> [vv3] 8.43 ms +- 0.05 ms: 1.01x slower
json_dumps: Mean +- std dev: [bb] 14.9 ms +- 0.1 ms -> [vv3] 15.1 ms +- 0.4 ms: 1.02x slower
logging_format: Mean +- std dev: [bb] 7.51 us +- 0.24 us -> [vv3] 7.69 us +- 0.12 us: 1.02x slower
logging_simple: Mean +- std dev: [bb] 6.86 us +- 0.10 us -> [vv3] 6.91 us +- 0.09 us: 1.01x slower
meteor_contest: Mean +- std dev: [bb] 126 ms +- 1 ms -> [vv3] 126 ms +- 1 ms: 1.01x faster
nbody: Mean +- std dev: [bb] 121 ms +- 3 ms -> [vv3] 113 ms +- 1 ms: 1.07x faster
nqueens: Mean +- std dev: [bb] 110 ms +- 1 ms -> [vv3] 107 ms +- 1 ms: 1.02x faster
pathlib: Mean +- std dev: [bb] 23.6 ms +- 1.1 ms -> [vv3] 22.7 ms +- 0.5 ms: 1.04x faster
pickle: Mean +- std dev: [bb] 10.7 us +- 0.1 us -> [vv3] 10.5 us +- 0.1 us: 1.02x faster
pickle_dict: Mean +- std dev: [bb] 30.6 us +- 0.2 us -> [vv3] 30.3 us +- 0.4 us: 1.01x faster
pyflate: Mean +- std dev: [bb] 568 ms +- 6 ms -> [vv3] 577 ms +- 9 ms: 1.02x slower
python_startup_no_site: Mean +- std dev: [bb] 7.89 ms +- 0.35 ms -> [vv3] 7.99 ms +- 0.39 ms: 1.01x slower
raytrace: Mean +- std dev: [bb] 390 ms +- 2 ms -> [vv3] 393 ms +- 3 ms: 1.01x slower
regex_compile: Mean +- std dev: [bb] 169 ms +- 1 ms -> [vv3] 168 ms +- 1 ms: 1.01x faster
regex_dna: Mean +- std dev: [bb] 235 ms +- 3 ms -> [vv3] 236 ms +- 2 ms: 1.01x slower
regex_v8: Mean +- std dev: [bb] 29.1 ms +- 0.3 ms -> [vv3] 28.6 ms +- 0.3 ms: 1.02x faster
richards: Mean +- std dev: [bb] 59.8 ms +- 1.6 ms -> [vv3] 61.2 ms +- 1.8 ms: 1.02x slower
scimark_fft: Mean +- std dev: [bb] 451 ms +- 7 ms -> [vv3] 446 ms +- 2 ms: 1.01x faster
scimark_monte_carlo: Mean +- std dev: [bb] 88.4 ms +- 1.3 ms -> [vv3] 87.6 ms +- 1.6 ms: 1.01x faster
scimark_sparse_mat_mult: Mean +- std dev: [bb] 6.04 ms +- 0.04 ms -> [vv3] 5.84 ms +- 0.03 ms: 1.04x faster
spectral_norm: Mean +- std dev: [bb] 147 ms +- 1 ms -> [vv3] 148 ms +- 2 ms: 1.01x slower
unpack_sequence: Mean +- std dev: [bb] 55.4 ns +- 1.3 ns -> [vv3] 54.7 ns +- 0.8 ns: 1.01x faster
unpickle_list: Mean +- std dev: [bb] 5.85 us +- 0.38 us -> [vv3] 5.72 us +- 0.06 us: 1.02x faster
unpickle_pure_python: Mean +- std dev: [bb] 313 us +- 3 us -> [vv3] 317 us +- 2 us: 1.01x slower
xml_etree_parse: Mean +- std dev: [bb] 175 ms +- 4 ms -> [vv3] 173 ms +- 3 ms: 1.02x faster

Benchmark hidden because not significant (18): 2to3, fannkuch, float, json_loads, logging_silent, pickle_list, pickle_pure_python, pidigits, python_startup, regex_effbot, scimark_lu, scimark_sor, sqlite_synth, telco, unpickle, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.00x faster

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@sweeneyde sweeneyde self-assigned this Mar 22, 2022
@eendebakpt eendebakpt force-pushed the performance/array_inplace_repeat branch from 457ef6d to 76a999d Compare March 25, 2022 09:42
@eendebakpt eendebakpt force-pushed the performance/array_inplace_repeat branch from 76a999d to 71d4952 Compare March 25, 2022 09:45
@eendebakpt
Copy link
Contributor Author

I have made the requested changes; please review again

@sweeneyde (Member) left a comment

Thanks for your patience. This is looking close!

The new API feels like a good solution -- it's nice that there's no need to pass an inplace=1, but it's also good that there's only one function to remember.

Side note: I believe --enable-optimizations does not enable LTO by default, you have to explicitly write --enable-optimizations --with-lto.

@sweeneyde (Member) left a comment

Last couple things, but otherwise LGTM.

count = 0;
mysize = Py_SIZE(self);
else if (count == 1) {
Py_INCREF(self);
Member

Indentation

* This method repeatedly doubles the number of bytes copied to reduce
* the number of invocations of memcpy.
*/
PyAPI_FUNC(void)
Contributor

It should not be exported.

Member

Shouldn't it? I think that's the way to be able to use it from the dynamically-loaded extension module created by arraymodule.c.

Contributor

Yes, I thought the array module was a builtin core module.

@sweeneyde sweeneyde merged commit 850687d into python:main Mar 28, 2022