Increase _PY_NSMALLPOSINTS size #725
In your second example I fail to see how this would measure a difference between freelist usage or not. The iteration size (100_000) is much larger than the maximum freelist size, so any initial freelist size will not really matter.
Standard library grep over a CPython checkout:

```python
ints = [int(m.group(1)) for m in re.finditer(r'range\((\d+)\)', cpython_source)]  # cpython_source: concatenated repo source
len(ints)                                  # 15233
len([i for i in ints if i > 1000])         # 288   1.9%
len([i for i in ints if i > 1500])         # 198   1.3%
len([i for i in ints if i > 2000])         # 158   1.0%
len([i for i in ints if i > 3000])         # 154   1.0%
len([i for i in ints if 256 < i <= 2000])  # 504   3.3%
```

So 2000 would capture an extra 3.3%, leaving 1.0% of use cases out of range. This would be 16 KB. Will run benchmarks.
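For anyone who wants to rerun that count, here is a hedged, self-contained version; the `cpython` checkout path and the cap values are assumptions, and it only sees literal `range(N)` calls, not computed sizes.

```python
# Hypothetical reproduction of the grep above: scan .py files in a CPython
# checkout for literal range(N) calls and check how many would be covered by a
# given small-int cache size. The checkout path and the caps are assumptions.
import pathlib
import re

REPO = pathlib.Path("cpython")          # assumed location of a CPython checkout
PATTERN = re.compile(r"range\((\d+)\)")

sizes: list[int] = []
for path in REPO.rglob("*.py"):
    text = path.read_text(encoding="utf-8", errors="ignore")
    sizes.extend(int(m.group(1)) for m in PATTERN.finditer(text))

for cap in (256, 1000, 2048, 8192):
    covered = sum(s < cap for s in sizes)
    print(f"cap={cap:5d}: {covered}/{len(sizes)} ({covered / max(len(sizes), 1):.1%}) covered")
```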
I played with it a bit. There were many slower ones, but I will refrain from sharing until I run a reliable benchmark. :)
I am not completely sure how they work in practice, but what I did was add a print. In the first case, nothing was printed at all. Thus I figured that this mostly benefits initialisation, and once there are more ints around than they can store, they don't make much of a difference. What am I missing?
By the way:
Given that it went in the opposite direction just shows that I screwed up the benchmarks, but it also suggests that this doesn't have a significant impact on startup time.
So those tests that have …, except for startup. So all in all, this has actually improved startup time, not made it worse.

A bit more info regarding the number: 2048 feels like a happy middle if there is a need to be conservative on memory, while 8192 would be great for capturing a bit more of the far tail if 64 KB of memory is not an issue.

Benchmarks: Results for 512, 2048 & 8192
```
+------------------------+--------------+----------------+--------------+-----------------------+
| Benchmark              | pyMAIN2.json | pyWT5_512.json | Change       | Significance          |
+========================+==============+================+==============+=======================+
| python_startup         | 23.9 ms      | 22.2 ms        | 1.08x faster | Significant (t=15.65) |
| python_startup_no_site | 19.2 ms      | 18.2 ms        | 1.06x faster | Significant (t=8.79)  |
| regex_dna              | 210 ms       | 201 ms         | 1.04x faster | Significant (t=10.03) |
| regex_v8               | 32.3 ms      | 29.3 ms        | 1.10x faster | Significant (t=17.15) |
+------------------------+--------------+----------------+--------------+-----------------------+

+------------------------+--------------+-----------------+--------------+-----------------------+
| Benchmark              | pyMAIN2.json | pyWT5_2048.json | Change       | Significance          |
+========================+==============+=================+==============+=======================+
| python_startup         | 23.9 ms      | 22.0 ms         | 1.09x faster | Significant (t=18.10) |
| python_startup_no_site | 19.2 ms      | 18.1 ms         | 1.06x faster | Significant (t=9.35)  |
| regex_dna              | 210 ms       | 205 ms          | 1.03x faster | Significant (t=4.35)  |
| regex_v8               | 32.3 ms      | 29.8 ms         | 1.08x faster | Significant (t=12.71) |
+------------------------+--------------+-----------------+--------------+-----------------------+

+------------------------+--------------+-----------------+--------------+-----------------------+
| Benchmark              | pyMAIN2.json | pyWT5_8192.json | Change       | Significance          |
+========================+==============+=================+==============+=======================+
| pprint_pformat         | 2.05 sec     | 1.97 sec        | 1.04x faster | Significant (t=11.12) |
| python_startup         | 23.9 ms      | 22.3 ms         | 1.07x faster | Significant (t=15.45) |
| python_startup_no_site | 19.2 ms      | 18.5 ms         | 1.04x faster | Significant (t=6.37)  |
| regex_dna              | 210 ms       | 201 ms          | 1.05x faster | Significant (t=10.98) |
| regex_effbot           | 3.66 ms      | 3.54 ms         | 1.03x faster | Significant (t=5.53)  |
| regex_v8               | 32.3 ms      | 29.1 ms         | 1.11x faster | Significant (t=18.73) |
+------------------------+--------------+-----------------+--------------+-----------------------+
```
So this is not correct. I looked a bit more into it by taking …

Benchmark with capped range sizes: …

Also, see python/pyperformance#388. Those two regexes run 15-20% faster (they run on strings of size 2000), so that is a noticeable improvement.
You are right.
I gathered some statistics on how many ints are allocated during startup. I executed … So unless I am missing some other important path to create ints, increasing `_PY_NSMALLPOSINTS` should not make much difference for startup itself.

I will try to collect some statistics on some of the pyperformance benchmarks as well.
That would be great. I am having a bit of trouble getting reliable results. The majority of results are far from the true value when I do a batch run, and then I need to keep rerunning them individually, one by one, eliminating false positives and finding out which ones are actually significant. Maybe your machine is more reliable for this.

Some timers report the minimum, not the average. Nevertheless, eliminating a certain percentage of outliers on the high end might make this more reliable. Or maybe even an extra command-line arg to do something like:

```python
timing_list = ...  # raw per-run timings
minimums = [min(batch) for batch in itertools.batched(timing_list, 3)]  # batched: Python 3.12+
result = statistics.mean(minimums)
std = statistics.stdev(minimums)
```

I will test a couple of options and issue a PR if some approach manages to report more accurately for the files that I am having trouble with.
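To show why this helps, here is a self-contained sketch with made-up timings comparing a plain mean against the batched-minimum estimate on data with occasional high outliers; the spike pattern and batch size of 3 are assumptions, and `itertools.batched` needs Python 3.12+.

```python
# Demonstration of the batched-minimum idea on synthetic timings: occasional
# system-noise spikes pull the plain mean up, while taking the minimum of each
# small batch before averaging stays close to the true cost.
import statistics
from itertools import batched  # Python 3.12+

true_cost = 10.0
raw = [true_cost + (1.5 if i % 7 == 0 else 0.05) for i in range(30)]  # spike every 7th run

plain_mean = statistics.mean(raw)
minima = [min(batch) for batch in batched(raw, 3)]
robust_mean = statistics.mean(minima)

print(f"plain mean:         {plain_mean:.3f}")
print(f"mean of batch mins: {robust_mean:.3f}")
```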
Yup, these were false positives. Every other run I get higher results, and this has repeated in several consecutive runs.

By the way, with this in `_PyLong_FromMedium(sdigit x)`:

```c
if ((int)x >= 0 && (int)x < 2092) {
    printf("hit %i\n", (int)x);
}
```

I get 135 hits on startup.
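Not the same measurement, but as a rough Python-level cross-check of how many small ints show up at startup, one can count the non-negative int constants embedded in the code objects of the modules loaded after interpreter start; this sketch only sees compile-time constants, not runtime allocations, and the 2048 cutoff is just the value being discussed here.

```python
# Rough Python-level cross-check: count distinct non-negative int constants
# below a candidate cache size in the code objects of all currently loaded
# modules. This sees compile-time constants only, not runtime allocations,
# so it complements the printf hook above rather than replacing it.
import sys
import types

def iter_consts(code):
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_consts(const)   # nested functions, comprehensions, classes
        else:
            yield const

def int_constants_below(limit=2048):
    hits = set()
    for module in list(sys.modules.values()):
        for obj in getattr(module, "__dict__", {}).values():
            code = getattr(obj, "__code__", None)
            if code is not None:
                hits.update(c for c in iter_consts(code)
                            if type(c) is int and 0 <= c < limit)
    return hits

print(len(int_constants_below()), "distinct int constants below 2048 in loaded modules")
```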
But I think both: "…"

One more thing:

```c
#define _PY_IS_SMALL_INT(val) ((val) >= 0 && (val) < 256 && (val) < _PY_NSMALLPOSINTS)
```

It doesn't seem to have any observable impact on performance, but if there is no specific reason to double-cap this for the compiler, it might be sensible to remove the second condition and let the compiler make use of it.
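As a Python-level illustration of the cutoff that macro encodes, the sketch below checks which values are served from the shared small-int cache in the running interpreter; the exact boundary is version-dependent, so treat it as an observational check rather than a spec, and the test values are arbitrary.

```python
# Observational check of which non-negative ints come out of the shared
# small-int cache: a cached value is returned as the same object every time,
# so identity survives recomputation, while non-cached values get fresh objects.
def looks_cached(n: int) -> bool:
    return (n + 0) is (n + 0)  # same object both times => served from the cache

for value in (5, 100, 255, 300, 2000):
    print(value, looks_cached(value))
# On current CPython builds the small values print True and the larger ones
# False; with a bigger _PY_NSMALLPOSINTS, more of them would flip to True.
```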
I don't think I know exactly what would be appropriate - my information on other implications is too limited for me to have a strong opinion. I have stored 200K for myself at the cost of 1.6 MB, so personally I would like it as high as possible. :) But from the GitHub search and the chart above, 2048 seems to capture a good number of cases.
To see the impact on startup time I tested with 100K: …
So roughly 1-2% slower for each additional 10K.
Did some more testing. There is still a chance of false positives, but I ran these individually and a few times. Regarding individual tests, of course this will have an impact, but the observable changes are:

For 2048: …

Then, for 8192, the additional ones that surface are: …

And a few more come to the surface when run with 100K. These are sparse matrices and integer operations where the integers fall more or less within the range.

Additionally, I counted all calls to … The data is split into 2 parts, as the second half is dominated by a couple of benchmarks. Empirical CDFs for both: …

So at least for the benchmarks in …, two values stand out: 2K and 8/10K.

Cost: …
Benefits: from the GitHub search …; from benchmarks plot 1: …

4K/5K is also an option; on top of 2K it would capture the most beneficial part of the 8/10K range. But I think 2K would be most reasonable.
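To make the coverage argument concrete, here is a small sketch that, given the int values requested by some workload, reports what fraction each candidate cache size would serve directly; the value list below is made up, and the candidate sizes are just the ones discussed in this thread.

```python
# Sketch of the coverage comparison behind the 2K vs 8K discussion: given the
# values requested during a workload (the list here is placeholder data),
# compute the fraction that each candidate cache size would serve directly.
from bisect import bisect_left

requested = sorted([3, 17, 100, 999, 1500, 2047, 4096, 9000, 120_000])  # placeholder data

for cache_size in (256, 2048, 4096, 8192):
    covered = bisect_left(requested, cache_size)  # values strictly below cache_size
    print(f"{cache_size:5d}: {covered / len(requested):6.1%} of requested values covered")
```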
The compiler will handle this, but it is a bit confusing why it is written this way. Note that python/cpython#125972 (comment) … (So a better name would be something like …)
@dg-pb Here is data similar to your ECDF. For … And zoomed into the region around zero with the other types of allocations: … The performance increase of some of the benchmarks might be due to the …
@eendebakpt By the way, how do you create your plots? Do you aggregate to certain ranges on the x-axis?
Ah, OK, I guess these are frequencies of individual numbers. I was just wondering why there are no values below 10, but it makes sense that there aren't.
Although this would be beneficial for all `int` use cases, I am currently motivated by `range`/`enumerate`/`itertools.count` performance.

What dominates performance in all 3 mentioned utilities is `PyLongObject` creation: …

Also, there are FREELISTs: … I checked that the FREELIST is utilised when using `S1` and is not when using `S1`.

python/cpython#126865 suggests a 10-20% performance improvement, but I can not see any big difference in performance when objects are re-used versus when they are not. Either way, although a 10-20% improvement is great, this would provide an improvement of a different order.

E.g. the performance difference when using pre-stored ints: …
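(The original numbers behind this example are not preserved in this thread. As a rough, machine-dependent way to see the effect, the sketch below times a loop over values inside the current small-int cache against a loop over equally many values outside it; the specific ranges and repeat counts are arbitrary choices.)

```python
# Rough illustration of cached vs freshly created ints: both loops walk 256
# consecutive values, but only the first range lies inside the current
# small-int cache, so the second one has to create a PyLongObject per step.
# Absolute numbers are machine- and version-dependent.
import timeit

cached    = "for i in range(256): pass"
allocated = "for i in range(1_000_000, 1_000_256): pass"

print("cached:   ", min(timeit.repeat(cached, number=100_000, repeat=5)))
print("allocated:", min(timeit.repeat(allocated, number=100_000, repeat=5)))
```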
So I would dare to suggest increasing `_PY_NSMALLPOSINTS` to `1000` or `10_000`.

It would provide 4x better performance. It would cost an extra 10 or 100 KB.

Given the extensive usage of the mentioned objects (and `int` in general), I suspect this could have an observable benefit at quite reasonable cost.

If this is feasible, the exact number needs to be determined. It should not be too hard to find out what sort of number would cover a certain percentage of use cases.