bpo-47070: improve performance of array_inplace_repeat #31999
Conversation
@sweeneyde Based on your suggestion I created the utility methods. I defined the objects to be

They should go in Include/internal/pycore_bytes.h so they don't leak into the public API. I wouldn't worry too much about whether things will get inlined -- if the compiler decides it's beneficial, it can inline it. Also, I'm guessing the single function call overhead is vastly outscaled by the several memcpy calls.
Force-pushed 0212e6e to b603ad8
```c
if (len_dest > 0) {
    if (len_src == 1)
        memset(dest, src[0], len_dest);
    else {
        memcpy(dest, src, len_src);
        Py_ssize_t copied = len_src;
        while (copied < len_dest) {
            Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
            memcpy(dest + copied, dest, bytes_to_copy);
            copied += bytes_to_copy;
        }
    }
}
```
Suggested change:

```c
if (len_dest > 0) {
    if (len_src == 1)
        memset(dest, src[0], len_dest);
    else {
        memcpy(dest, src, len_src);
        Py_ssize_t copied = len_src;
        while (copied < len_dest) {
            Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
            memcpy(dest + copied, dest, bytes_to_copy);
            copied += bytes_to_copy;
        }
    }
}
```

becomes

```c
if (len_dest > 0) {
    memcpy(dest, src, len_src);
    _PyBytes_RepeatInPlace(dest, len_src, len_dest);
}
```
@sweeneyde Thanks for the suggestion. It really simplifies the code. Part of this I already considered, but did not push because it seems to hurt the small cases, e.g. [None] * 10 or b'a' * 5.
Both your suggestions (e.g. no inlining and skipping the special case where len_src==1) I implemented in https://github.com/eendebakpt/cpython/tree/performance/array_inplace_repeat_v2
Microbenchmarks show an overall slowdown, in particular for the cases with a low number of repeats or length of the object equal to 1.
array repeat 1: Mean +- std dev: [patchfull] 185 ns +- 9 ns -> [patchv2full] 191 ns +- 1 ns: 1.03x slower
array repeat 2: Mean +- std dev: [patchfull] 185 ns +- 2 ns -> [patchv2full] 194 ns +- 2 ns: 1.05x slower
array repeat 10: Mean +- std dev: [patchfull] 211 ns +- 5 ns -> [patchv2full] 220 ns +- 3 ns: 1.05x slower
array repeat 1000: Mean +- std dev: [patchfull] 1.67 us +- 0.01 us -> [patchv2full] 1.70 us +- 0.01 us: 1.02x slower
array repeat 5000: Mean +- std dev: [patchfull] 9.18 us +- 0.07 us -> [patchv2full] 9.43 us +- 0.08 us: 1.03x slower
bytes(1) repeat 2: Mean +- std dev: [patchfull] 48.3 ns +- 0.7 ns -> [patchv2full] 50.7 ns +- 0.5 ns: 1.05x slower
bytes(1) repeat 10: Mean +- std dev: [patchfull] 48.1 ns +- 3.0 ns -> [patchv2full] 49.4 ns +- 0.6 ns: 1.03x slower
bytes(1) repeat 100: Mean +- std dev: [patchfull] 47.6 ns +- 0.5 ns -> [patchv2full] 50.2 ns +- 0.9 ns: 1.05x slower
bytes(1) repeat 1000: Mean +- std dev: [patchfull] 99.8 ns +- 0.5 ns -> [patchv2full] 103 ns +- 1 ns: 1.03x slower
array(1) repeat 1: Mean +- std dev: [patchfull] 84.4 ns +- 0.5 ns -> [patchv2full] 89.1 ns +- 1.7 ns: 1.06x slower
array(1) repeat 10: Mean +- std dev: [patchfull] 96.3 ns +- 1.3 ns -> [patchv2full] 101 ns +- 1 ns: 1.05x slower
array(1) repeat 1000: Mean +- std dev: [patchfull] 255 ns +- 9 ns -> [patchv2full] 260 ns +- 13 ns: 1.02x slower
array repeat inplace 10: Mean +- std dev: [patchfull] 472 ns +- 8 ns -> [patchv2full] 487 ns +- 22 ns: 1.03x slower
array repeat inplace 1000: Mean +- std dev: [patchfull] 867 ns +- 85 ns -> [patchv2full] 941 ns +- 81 ns: 1.08x slower
[control] create array 1: Mean +- std dev: [patchfull] 224 ns +- 2 ns -> [patchv2full] 226 ns +- 3 ns: 1.01x slower
[control] create array 10: Mean +- std dev: [patchfull] 447 ns +- 3 ns -> [patchv2full] 449 ns +- 3 ns: 1.00x slower
[control] create array 5000: Mean +- std dev: [patchfull] 128 us +- 4 us -> [patchv2full] 127 us +- 3 us: 1.01x faster
Benchmark hidden because not significant (4): bytes(1) repeat 1, array(1) repeat 5000, array repeat inplace 1, [control] create array 1000
My estimate is the small cases are about 5% slower. Is that acceptable?
One way to reduce two function calls to one would be something like (perhaps with the case len_src==1 added):
```c
void
_PyBytes_Repeat(char* dest, Py_ssize_t len_dest,
                const char* src, Py_ssize_t len_src, bool inplace)
{
    if (len_dest > 0) {
        if (!inplace)
            memcpy(dest, src, len_src);
        Py_ssize_t copied = len_src;
        while (copied < len_dest) {
            Py_ssize_t bytes_to_copy = Py_MIN(copied, len_dest - copied);
            memcpy(dest + copied, dest, bytes_to_copy);
            copied += bytes_to_copy;
        }
    }
}
```
Thanks for those benchmarks. Out of curiosity, how are you compiling? PGO and LTO can change things.
I'd say the important things to optimize for are code complexity and the linear coefficient, and the O(1) overhead is not too important. Though since the starting-length-1 case is probably the most common, you're probably right that having a memset branch in both functions might be worth it then.
How would you feel about something like this?
```
def _PyBytes_RepeatInPlace(...):
    if len_dest == 1:
        memset(...)
    while ...:
        ...

def _PyBytes_Repeat(...):
    if len_dest == 0:
        return
    elif len_dest == 1:
        memset(...)
    else:
        memcpy(...)
        _PyBytes_RepeatInPlace(...)
```

I would still prefer non-static inline functions though, even if there is a small penalty.
The PR has been updated.

- Performance for the cases that this PR was initially started for (e.g. the inplace repeat of array.array) is better than main.
- The helper function _PyBytes_Repeat is used at 6 places (potentially at 3 more places in gh-91247: improve performance of list and tuple repeat #32045). If not inlined (this PR) that helps reduce the amount of code.
- Performance is better if we inline the method (branch https://github.com/eendebakpt/cpython/tree/performance/array_inplace_repeat_v3_inline). When compiled with --enable-optimizations (so LTO and PGO) and benchmarked with ./python test_array.py --rigorous --min-time .2 -o base.json I get:
array(3) repeat 0: Mean +- std dev: [patchv3] 162 ns +- 10 ns -> [patchv3inline] 157 ns +- 3 ns: 1.03x faster
array(3) repeat 1: Mean +- std dev: [patchv3] 189 ns +- 4 ns -> [patchv3inline] 183 ns +- 2 ns: 1.03x faster
array(3) repeat 2: Mean +- std dev: [patchv3] 191 ns +- 3 ns -> [patchv3inline] 184 ns +- 9 ns: 1.04x faster
array(3) repeat 10: Mean +- std dev: [patchv3] 222 ns +- 3 ns -> [patchv3inline] 206 ns +- 2 ns: 1.08x faster
array(3) repeat 1000: Mean +- std dev: [patchv3] 1.67 us +- 0.01 us -> [patchv3inline] 1.66 us +- 0.02 us: 1.01x faster
array(1) repeat 10: Mean +- std dev: [patchv3] 101 ns +- 2 ns -> [patchv3inline] 96.0 ns +- 1.6 ns: 1.05x faster
array(1) repeat 1000: Mean +- std dev: [patchv3] 263 ns +- 4 ns -> [patchv3inline] 260 ns +- 4 ns: 1.01x faster
array(1) repeat 5000: Mean +- std dev: [patchv3] 1.40 us +- 0.02 us -> [patchv3inline] 1.40 us +- 0.01 us: 1.00x faster
array(3) repeat inplace 1: Mean +- std dev: [patchv3] 401 ns +- 9 ns -> [patchv3inline] 405 ns +- 12 ns: 1.01x slower
array(3) repeat inplace 10: Mean +- std dev: [patchv3] 462 ns +- 15 ns -> [patchv3inline] 457 ns +- 10 ns: 1.01x faster
int array(3) repeat inplace 0: Mean +- std dev: [patchv3] 364 ns +- 8 ns -> [patchv3inline] 367 ns +- 11 ns: 1.01x slower
int array(3) repeat inplace 1: Mean +- std dev: [patchv3] 347 ns +- 8 ns -> [patchv3inline] 351 ns +- 7 ns: 1.01x slower
bytes(1) repeat 1: Mean +- std dev: [patchv3] 36.6 ns +- 0.4 ns -> [patchv3inline] 37.0 ns +- 0.7 ns: 1.01x slower
bytes(1) repeat 2: Mean +- std dev: [patchv3] 48.2 ns +- 0.4 ns -> [patchv3inline] 49.1 ns +- 1.0 ns: 1.02x slower
bytes(1) repeat 10: Mean +- std dev: [patchv3] 48.0 ns +- 0.6 ns -> [patchv3inline] 48.8 ns +- 0.5 ns: 1.02x slower
bytes(1) repeat 100: Mean +- std dev: [patchv3] 48.5 ns +- 0.7 ns -> [patchv3inline] 49.1 ns +- 1.2 ns: 1.01x slower
bytes(1) repeat 1000: Mean +- std dev: [patchv3] 103 ns +- 2 ns -> [patchv3inline] 102 ns +- 2 ns: 1.01x faster
bytes(7) repeat 1: Mean +- std dev: [patchv3] 36.5 ns +- 0.3 ns -> [patchv3inline] 36.8 ns +- 0.5 ns: 1.01x slower
bytes(7) repeat 2: Mean +- std dev: [patchv3] 50.8 ns +- 0.4 ns -> [patchv3inline] 50.6 ns +- 1.0 ns: 1.00x faster
bytes(7) repeat 10: Mean +- std dev: [patchv3] 59.3 ns +- 0.9 ns -> [patchv3inline] 58.1 ns +- 0.7 ns: 1.02x faster
bytearray(1) repeat 1: Mean +- std dev: [patchv3] 57.9 ns +- 0.8 ns -> [patchv3inline] 55.2 ns +- 0.6 ns: 1.05x faster
bytearray(1) repeat 2: Mean +- std dev: [patchv3] 58.7 ns +- 0.6 ns -> [patchv3inline] 55.9 ns +- 0.9 ns: 1.05x faster
bytearray(1) repeat 10: Mean +- std dev: [patchv3] 58.1 ns +- 0.6 ns -> [patchv3inline] 55.6 ns +- 0.7 ns: 1.04x faster
bytearray(1) repeat 100: Mean +- std dev: [patchv3] 58.2 ns +- 0.5 ns -> [patchv3inline] 56.3 ns +- 1.0 ns: 1.03x faster
bytearray(1) repeat 1000: Mean +- std dev: [patchv3] 72.8 ns +- 1.5 ns -> [patchv3inline] 71.4 ns +- 2.0 ns: 1.02x faster
bytearray(10) repeat 1: Mean +- std dev: [patchv3] 58.2 ns +- 0.9 ns -> [patchv3inline] 55.4 ns +- 0.6 ns: 1.05x faster
bytearray(10) repeat 2: Mean +- std dev: [patchv3] 61.3 ns +- 1.1 ns -> [patchv3inline] 58.3 ns +- 0.7 ns: 1.05x faster
bytearray(10) repeat 10: Mean +- std dev: [patchv3] 69.5 ns +- 1.2 ns -> [patchv3inline] 67.5 ns +- 0.8 ns: 1.03x faster
bytearray(10) repeat 1000: Mean +- std dev: [patchv3] 231 ns +- 6 ns -> [patchv3inline] 229 ns +- 2 ns: 1.01x faster
[control] create array 1: Mean +- std dev: [patchv3] 220 ns +- 4 ns -> [patchv3inline] 221 ns +- 8 ns: 1.01x slower
[control] create array 1000: Mean +- std dev: [patchv3] 25.5 us +- 0.6 us -> [patchv3inline] 25.2 us +- 0.4 us: 1.01x faster
[control] create array 5000: Mean +- std dev: [patchv3] 126 us +- 1 us -> [patchv3inline] 125 us +- 4 us: 1.01x faster
Benchmark hidden because not significant (11): array(3) repeat 5000, array(1) repeat 0, array(1) repeat 1, array(3) repeat inplace 0, array(3) repeat inplace 1000, int array(3) repeat inplace 10, int array(3) repeat inplace 1000, bytes(7) repeat 100, bytes(7) repeat 1000, bytearray(10) repeat 100, [control] create array 10
Geometric mean: 1.01x faster
I will try to get a benchmark with pyperformance.
Pyperformance results do not show any significant differences:
chaos: Mean +- std dev: [bb] 90.2 ms +- 0.7 ms -> [vv3] 89.8 ms +- 0.6 ms: 1.00x faster
deltablue: Mean +- std dev: [bb] 4.89 ms +- 0.07 ms -> [vv3] 4.98 ms +- 0.11 ms: 1.02x slower
go: Mean +- std dev: [bb] 171 ms +- 2 ms -> [vv3] 173 ms +- 2 ms: 1.01x slower
hexiom: Mean +- std dev: [bb] 8.35 ms +- 0.05 ms -> [vv3] 8.43 ms +- 0.05 ms: 1.01x slower
json_dumps: Mean +- std dev: [bb] 14.9 ms +- 0.1 ms -> [vv3] 15.1 ms +- 0.4 ms: 1.02x slower
logging_format: Mean +- std dev: [bb] 7.51 us +- 0.24 us -> [vv3] 7.69 us +- 0.12 us: 1.02x slower
logging_simple: Mean +- std dev: [bb] 6.86 us +- 0.10 us -> [vv3] 6.91 us +- 0.09 us: 1.01x slower
meteor_contest: Mean +- std dev: [bb] 126 ms +- 1 ms -> [vv3] 126 ms +- 1 ms: 1.01x faster
nbody: Mean +- std dev: [bb] 121 ms +- 3 ms -> [vv3] 113 ms +- 1 ms: 1.07x faster
nqueens: Mean +- std dev: [bb] 110 ms +- 1 ms -> [vv3] 107 ms +- 1 ms: 1.02x faster
pathlib: Mean +- std dev: [bb] 23.6 ms +- 1.1 ms -> [vv3] 22.7 ms +- 0.5 ms: 1.04x faster
pickle: Mean +- std dev: [bb] 10.7 us +- 0.1 us -> [vv3] 10.5 us +- 0.1 us: 1.02x faster
pickle_dict: Mean +- std dev: [bb] 30.6 us +- 0.2 us -> [vv3] 30.3 us +- 0.4 us: 1.01x faster
pyflate: Mean +- std dev: [bb] 568 ms +- 6 ms -> [vv3] 577 ms +- 9 ms: 1.02x slower
python_startup_no_site: Mean +- std dev: [bb] 7.89 ms +- 0.35 ms -> [vv3] 7.99 ms +- 0.39 ms: 1.01x slower
raytrace: Mean +- std dev: [bb] 390 ms +- 2 ms -> [vv3] 393 ms +- 3 ms: 1.01x slower
regex_compile: Mean +- std dev: [bb] 169 ms +- 1 ms -> [vv3] 168 ms +- 1 ms: 1.01x faster
regex_dna: Mean +- std dev: [bb] 235 ms +- 3 ms -> [vv3] 236 ms +- 2 ms: 1.01x slower
regex_v8: Mean +- std dev: [bb] 29.1 ms +- 0.3 ms -> [vv3] 28.6 ms +- 0.3 ms: 1.02x faster
richards: Mean +- std dev: [bb] 59.8 ms +- 1.6 ms -> [vv3] 61.2 ms +- 1.8 ms: 1.02x slower
scimark_fft: Mean +- std dev: [bb] 451 ms +- 7 ms -> [vv3] 446 ms +- 2 ms: 1.01x faster
scimark_monte_carlo: Mean +- std dev: [bb] 88.4 ms +- 1.3 ms -> [vv3] 87.6 ms +- 1.6 ms: 1.01x faster
scimark_sparse_mat_mult: Mean +- std dev: [bb] 6.04 ms +- 0.04 ms -> [vv3] 5.84 ms +- 0.03 ms: 1.04x faster
spectral_norm: Mean +- std dev: [bb] 147 ms +- 1 ms -> [vv3] 148 ms +- 2 ms: 1.01x slower
unpack_sequence: Mean +- std dev: [bb] 55.4 ns +- 1.3 ns -> [vv3] 54.7 ns +- 0.8 ns: 1.01x faster
unpickle_list: Mean +- std dev: [bb] 5.85 us +- 0.38 us -> [vv3] 5.72 us +- 0.06 us: 1.02x faster
unpickle_pure_python: Mean +- std dev: [bb] 313 us +- 3 us -> [vv3] 317 us +- 2 us: 1.01x slower
xml_etree_parse: Mean +- std dev: [bb] 175 ms +- 4 ms -> [vv3] 173 ms +- 3 ms: 1.02x faster
Benchmark hidden because not significant (18): 2to3, fannkuch, float, json_loads, logging_silent, pickle_list, pickle_pure_python, pidigits, python_startup, regex_effbot, scimark_lu, scimark_sor, sqlite_synth, telco, unpickle, xml_etree_iterparse, xml_etree_generate, xml_etree_process
Geometric mean: 1.00x faster
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again".
Force-pushed 457ef6d to 76a999d
Force-pushed 76a999d to 71d4952
I have made the requested changes; please review again
sweeneyde
left a comment
Thanks for your patience. This is looking close!
The new API feels like a good solution -- it's nice that there's no need to pass an inplace=1, but it's also good that there's only one function to remember.
Side note: I believe --enable-optimizations does not enable LTO by default, you have to explicitly write --enable-optimizations --with-lto.
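For reference, the two options are passed separately to configure. A sketch of a typical optimized build using the standard CPython configure flags (a build recipe, not a claim about the author's exact setup):

```shell
# PGO via --enable-optimizations; LTO must be requested explicitly.
./configure --enable-optimizations --with-lto
make -j"$(nproc)"
```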
sweeneyde
left a comment
Last couple things, but otherwise LGTM.
Objects/bytearrayobject.c (outdated):

```c
count = 0;
mysize = Py_SIZE(self);
else if (count == 1) {
Py_INCREF(self);
```
Indentation
```c
 * This method repeatedly doubles the number of bytes copied to reduce
 * the number of invocations of memcpy.
 */
PyAPI_FUNC(void)
```
It should not be exported.
Shouldn't it? I think that's the way to be able to use it from the dynamically-loaded extension module created by arraymodule.c.
Ya, I thought that array module was a builtin core module.
array_inplace_repeat is inefficient for small arrays and a high number of repeats. This can be improved by using the same approach as in the array_inplace method (same approach as https://bugs.python.org/issue47005). The utility methods _PyBytes_Repeat and _PyBytes_RepeatInPlace were created for code that was common to various objects (bytes, bytearray, unicodeobject and array). Microbenchmark:
Results in:
https://bugs.python.org/issue47070
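The microbenchmark script itself is not shown on this page; below is a rough stand-in using only the stdlib timeit module. The labels and sizes mirror the tables above, but this is not the author's actual pyperf script:

```python
import timeit
from array import array

def bench(stmt, setup, number=10_000):
    """Return best-of-5 timing for `stmt`, in microseconds per iteration."""
    best = min(timeit.repeat(stmt, setup=setup, number=number, repeat=5))
    return best / number * 1e6

setup = "from array import array; base = array('b', [1, 2, 3])"

# Out-of-place repeat: allocates a new array each time.
out_us = bench("base * 1000", setup)

# In-place repeat: the path this PR optimizes; the array is rebuilt inside
# the statement so each iteration starts from the small 3-element array.
in_us = bench("a = array('b', [1, 2, 3]); a *= 1000", setup)

print(f"array repeat 1000:         {out_us:.2f} us")
print(f"array repeat inplace 1000: {in_us:.2f} us")
```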