Increase _PY_NSMALLPOSINTS size #725

Open
dg-pb opened this issue Apr 24, 2025 · 17 comments

@dg-pb

dg-pb commented Apr 24, 2025

Although this would be beneficial for all int use cases, I am currently motivated by range / enumerate / itertools.count performance.

What dominates performance in all 3 mentioned utilities is PyLongObject creation:

# PY3.12
from collections import deque

consume = deque(maxlen=0).extend

INT_RANGE = list(range(100_000))

%timeit consume(range(100_000))    # 2.3 ms
%timeit consume(iter(INT_RANGE))   # 0.6 ms

Also, there are FREELISTs:

S1="
import collections as coll
BHOLE = coll.deque(maxlen=0).extend
"

S2="
import collections as coll
BHOLE = coll.deque(maxlen=0).extend
a = list(range(100_000))   # This consumes FREELIST
"

PYEXE="./cpython/main/python.exe"

$PYEXE -m timeit -s "$S1" 'BHOLE(range(100_000))'     # 1.2 ms
$PYEXE -m timeit -s "$S2" 'BHOLE(range(100_000))'     # 1.2 ms
$PYEXE -m timeit -s "$S2" 'BHOLE(iter(a))'            # 0.3 ms

I checked that the FREELIST is utilised when using S1 and is not when using S2.

python/cpython#126865 suggests a 10-20% performance improvement, but I cannot see any big difference in performance when objects are re-used versus when they are not. Either way, although a 10-20% improvement is great, this would provide an improvement of a different order.

E.g. the performance difference when using pre-stored ints:

S="
from collections import deque
from itertools import chain
chain = chain.from_iterable
consume = deque(maxlen=0).extend
"
$PYEXE -m timeit -s $S "consume(range(100_000))"                            # 1.2 ms
$PYEXE -m timeit -s $S "consume(chain([range(256)] * (100_000 // 256)))"    # 0.5 ms

So I would dare to suggest increasing _PY_NSMALLPOSINTS to 1000 or 10_000.

It would provide 4x better performance.
It would cost an extra 10 or 100 KB of memory.
Given the extensive usage of the mentioned utilities (and of int in general), I suspect this could have an observable benefit at a quite reasonable cost.

If this is feasible, the exact number needs to be determined. It should not be too hard to find out what sort of number would cover a certain percentage of use cases; see the sketch below.
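For illustration, a rough sketch of such a survey (the source-tree path and the candidate sizes are placeholders):

# Rough sketch: count range(N) literals in a source tree and report how
# many of them a candidate _PY_NSMALLPOSINTS value would fully cover
# (range(N) produces the ints 0 .. N-1).
import re
from pathlib import Path

pattern = re.compile(r'range\((\d+)\)')
sizes = []
for path in Path("cpython/Lib").rglob("*.py"):   # placeholder source tree
    sizes += [int(m.group(1)) for m in pattern.finditer(path.read_text(errors="ignore"))]

for cap in (256, 1024, 2048, 8192):              # candidate cache sizes
    covered = sum(s <= cap for s in sizes)
    print(f"{cap:>5}: {covered / len(sizes):.1%} of {len(sizes)} range() literals covered")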

@eendebakpt

The creation of PyLongObjects will indeed be a significant part of the computation cost in the examples above. Increasing the size of _PY_NSMALLPOSINTS will help, and the memory trade-off for 1_000 seems reasonable. But the increase will also add to the python startup time, and 10_000 might be too much there (I did not benchmark this).

In your second example I fail to see how this would measure a difference between freelist usage or not. The iteration size (100_000) is much larger than the max freelist size, so any initial freelist size will not really matter.

@dg-pb

dg-pb commented Apr 24, 2025

memory trade-off for 1_000 seems reasonable. But the increase will also add to the python startup time, and 10_000 might be too much there (I did not benchmark this).

Standard library grep:

import re
ints = [int(m.group(1)) for m in re.finditer(r'range\((\d+)\)', source)]    # source: concatenated CPython sources
len(ints)                                   # 15233
len([i for i in ints if i > 1000])          #   288 1.9%
len([i for i in ints if i > 1500])          #   198 1.3%
len([i for i in ints if i > 2000])          #   158 1.0%
len([i for i in ints if i > 3000])          #   154 1.0%

len([i for i in ints if 256 < i <= 2000])   #   504 3.3%

So 2000 would capture an extra 3.3%, leaving 1.0% of use cases out of range. This would be 16 KB.

Will run pyperf later to see what impact this would have for different sizes. I guess I will try 512, 1024, 1536, 2048.

10K is way too big.

I played with 1K for a bit, but since I was continuing to do other things on the machine, the benchmarks do not seem reliable. Potentially significant positives were:

+--------------------------+-------------+------------+--------------+-------------------------+
| Benchmark                | pyMAIN.json | pyWT5.json | Change       | Significance            |
+==========================+=============+============+==============+=========================+
| json_loads               | 39.8 us     | 35.4 us    | 1.13x faster | Significant (t=16.21)   |
| nbody                    | 138 ms      | 123 ms     | 1.12x faster | Significant (t=12.78)   |
| regex_v8                 | 33.3 ms     | 29.4 ms    | 1.13x faster | Significant (t=27.24)   |
| unpack_sequence          | 56.3 ns     | 42.1 ns    | 1.34x faster | Significant (t=82.05)   |
+--------------------------+-------------+------------+--------------+-------------------------+

There were many slower ones, but I will refrain from sharing until I run a reliable benchmark. :)

In your second example I fail to see how this would measure a difference between freelist usage or not. The iteration size (100_000) is much larger than the max freelist size, so any initial freelist size will not really matter.

I am not completely sure how they work in practice, but what I did was added printf in _PyLong_FromMedium to print out when v == NULL.

In first case, nothing was printed at all.
It looked like consumption was freeing the list.
But in the second case, when I initialised large list of ints prior to that, I got all 100_000 being newly allocated.

Thus I figured that this mostly benefits initialisation and once there are more ints around than they can store, they don't make much of a difference.

What am I missing?
Are there any docs on how they work?
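For reference, the general pattern is roughly the following; a simplified sketch with a hypothetical cap and names, not CPython's actual code:

/* Simplified sketch of the generic freelist pattern: a small bounded
 * stack of recycled objects. Cap and names are hypothetical. */
#define MAX_FREELIST 100

static PyLongObject *freelist[MAX_FREELIST];
static int numfree = 0;

static PyLongObject *
alloc_long(void)
{
    if (numfree > 0) {
        return freelist[--numfree];    /* reuse a recently freed object */
    }
    return (PyLongObject *)PyObject_Malloc(sizeof(PyLongObject));
}

static void
dealloc_long(PyLongObject *op)
{
    if (numfree < MAX_FREELIST) {
        freelist[numfree++] = op;      /* recycle for the next allocation */
    }
    else {
        PyObject_Free(op);             /* freelist full: really free */
    }
}

Because the stack is bounded, it mainly helps when allocations and deallocations interleave tightly (as when a range is consumed and each int dies immediately); it cannot cover the initial population of a large list of live ints.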

@dg-pb

dg-pb commented Apr 24, 2025

By the way:

+--------------------------+-------------+------------+--------------+-------------------------+
| Benchmark                | pyMAIN.json | pyWT5.json | Change       | Significance            |
+==========================+=============+============+==============+=========================+
| python_startup           | 24.5 ms     | 22.7 ms    | 1.08x faster | Significant (t=16.77)   |
+--------------------------+-------------+------------+--------------+-------------------------+
| python_startup_no_site   | 19.4 ms     | 17.8 ms    | 1.09x faster | Significant (t=27.39)   |
+--------------------------+-------------+------------+--------------+-------------------------+

Given that it went in the opposite direction, this just shows that I screwed up the benchmarks, but it also suggests that the change doesn't have a significant impact on startup time.

@dg-pb

dg-pb commented Apr 25, 2025

So the tests that have range loops with values above 256 in them have improved. So really, it is not the benchmarked cases themselves that run faster, but the test scripts that are more performant. :)

Except for startup, which has improved.
I suspect that this is due to various pre-stored maps that use range(>256) for construction.
I am not sure about the exact reasons here, but this pattern is quite common, and it should improve import times of some packages as well.
The actual performance penalty due to the initialisation of extra ints is hardly observable.

So all in all, this has actually improved startup time, not made it worse.

A bit more info regarding the number:
Github:
/range\(\d{1}\)/ Language:Python - 2.2M files
/range\(\d{2}\)/ Language:Python - 1.4M files
/range\(\d{3}\)/ Language:Python - 563K files
/range\(\d{4}\)/ Language:Python - 295K files
/range\(\d{5}\)/ Language:Python - 83K files

2048 does feel like some sort of happy middle if there is a need to be conservative on memory, while 8192 would be great for capturing a bit more of far tail if 64KB of memory is not an issue.

Benchmarks

Results for 512, 2048 & 8192

+-------------------------+--------------+----------------+--------------+-----------------------+
| Benchmark               | pyMAIN2.json | pyWT5_512.json | Change       | Significance          |
+=========================+==============+================+==============+=======================+
| python_startup          | 23.9 ms      | 22.2 ms        | 1.08x faster | Significant (t=15.65) |
| python_startup_no_site  | 19.2 ms      | 18.2 ms        | 1.06x faster | Significant (t=8.79)  |
| regex_dna               | 210 ms       | 201 ms         | 1.04x faster | Significant (t=10.03) |
| regex_v8                | 32.3 ms      | 29.3 ms        | 1.10x faster | Significant (t=17.15) |
+-------------------------+--------------+----------------+--------------+-----------------------+

+------------------------+--------------+-----------------+--------------+-----------------------+
| Benchmark              | pyMAIN2.json | pyWT5_2048.json | Change       | Significance          |
+========================+==============+=================+==============+=======================+
| python_startup         | 23.9 ms      | 22.0 ms         | 1.09x faster | Significant (t=18.10) |
| python_startup_no_site | 19.2 ms      | 18.1 ms         | 1.06x faster | Significant (t=9.35)  |
| regex_dna              | 210 ms       | 205 ms          | 1.03x faster | Significant (t=4.35)  |
| regex_v8               | 32.3 ms      | 29.8 ms         | 1.08x faster | Significant (t=12.71) |
+------------------------+--------------+-----------------+--------------+-----------------------+

+------------------------+--------------+-----------------+--------------+-----------------------+
| Benchmark              | pyMAIN2.json | pyWT5_8192.json | Change       | Significance          |
+========================+==============+=================+==============+=======================+
| pprint_pformat         | 2.05 sec     | 1.97 sec        | 1.04x faster | Significant (t=11.12) |
| python_startup         | 23.9 ms      | 22.3 ms         | 1.07x faster | Significant (t=15.45) |
| python_startup_no_site | 19.2 ms      | 18.5 ms         | 1.04x faster | Significant (t=6.37)  |
| regex_dna              | 210 ms       | 201 ms          | 1.05x faster | Significant (t=10.98) |
| regex_effbot           | 3.66 ms      | 3.54 ms         | 1.03x faster | Significant (t=5.53)  |
| regex_v8               | 32.3 ms      | 29.1 ms         | 1.11x faster | Significant (t=18.73) |
+------------------------+--------------+-----------------+--------------+-----------------------+

@dg-pb

dg-pb commented Apr 25, 2025

So really, it is not the benchmarked cases themselves that run faster, but the test scripts that are more performant. :)

So this is not correct. I looked a bit more into it by taking regex_v8 and capping all loop counts to 256.

Benchmark with capped range sizes:
a) Current speed: 2.53 s
b) 8192 size: 2.29 s (10% lower)

Also, see python/pyperformance#388. Those 2 regexes run 15-20% faster (they run on strings of size 2000), which is a noticeable improvement.

@dg-pb

dg-pb commented Apr 25, 2025

In your second example I fail to see how this would measure a difference between freelist usage or not.

You are right.
Both cases use freelist there.

@eendebakpt

eendebakpt commented Apr 25, 2025

Except for startup, which has improved. I suspect that this is due to various pre-stored maps that use range(>256) for construction. I am not sure about the exact reasons here, but this pattern is quite common, and it should improve import times of some packages as well. The actual performance penalty due to the initialisation of extra ints is hardly observable.

So all in all, this has actually improved startup time, not made it worse.

I gathered some statistics on how many ints are allocated during startup. I executed ./python -X pystats -c "pass" and collected the calls to long_alloc, _PyLong_FromMedium and get_small_int. In total I get 1065 calls to get_small_int, 291 calls to _PyLong_FromMedium (most in the 0 to 2000 range) and 857 calls to long_alloc. long_alloc is only called with small values (i.e. below _PY_NSMALLPOSINTS).

So unless I am missing some other important path to create ints, increasing _PY_NSMALLPOSINTS (or _PY_NSMALLNEGINTS) will not improve startup time by reducing the number of ints that have to be allocated. It will slightly increase startup time because the small-int table is bigger (not by a lot though; increasing the size into the 1000 to 2000 range seems ok).
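For context, the get_small_int path mentioned above is essentially a table lookup into a preallocated array; roughly (a sketch with approximate names, based on CPython's longobject.c):

/* Sketch of the small-int cache fast path (names approximate). The table
 * holds preallocated ints for -_PY_NSMALLNEGINTS .. _PY_NSMALLPOSINTS-1,
 * created once at startup. */
#define IS_SMALL_INT(ival) \
    (-_PY_NSMALLNEGINTS <= (ival) && (ival) < _PY_NSMALLPOSINTS)

static PyObject *
get_small_int(sdigit ival)
{
    assert(IS_SMALL_INT(ival));
    return (PyObject *)&_PyLong_SMALL_INTS[_PY_NSMALLNEGINTS + ival];
}

Values outside that window have to be allocated, e.g. via _PyLong_FromMedium.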

I will try to collect some statistics on some of the pyperformance benchmarks as well.

@dg-pb

dg-pb commented Apr 25, 2025

I will try to collect some statistics on some of the pyperformance benchmarks as well.

That would be great. I am having a bit of trouble getting reliable results. The majority of results are far from the true value when I do a batch run, and then I need to keep rerunning them individually, one by one, eliminating false positives and finding out which ones are actually significant. Maybe your machine is more reliable for this.

Some timers report the minimum, not the average.
But in this case the standard deviation is meaningless and statistical comparison isn't possible.

Nevertheless, eliminating a certain percentage of outliers at the high end might make this more reliable.

Or maybe even an extra command-line arg to do something like:

import itertools
from statistics import mean, stdev

timing_list = ...  # raw per-run timings

minimums = [min(batch) for batch in itertools.batched(timing_list, 3)]
result = mean(minimums)
spread = stdev(minimums)

I will test couple of options and issue a PR if some approach manages to report more accurately for the files that I am having trouble with.



So unless I am missing some other important path to create ints, increasing _PY_NSMALLPOSINTS (or _PY_NSMALLNEGINTS) will not improve startup time

Yup, these were false positives. Every other run I get higher results and this has repeated in several consecutive runs.

By the way, with this:

static PyObject *
_PyLong_FromMedium(sdigit x)
{
    /* debugging printf added at the top of the function */
    if ((int)x >= 0 && (int)x < 2092) {
        printf("hit %i\n", (int)x);
    }
    /* ... rest of the original function ... */
}

I get 135 hits on startup:


hit 1013
hit 1024
hit 308
hit 640
hit 512
hit 1024
hit 2048
hit 392
hit 413
hit 565
hit 974
hit 1026
hit 1221
hit 438
hit 683
hit 740
hit 770
hit 915
hit 965
hit 1010
hit 1035
hit 1088
hit 1159
hit 1201
hit 1325
hit 1464
hit 512
hit 2048
hit 1024
hit 2048
hit 1980
hit 747
hit 501
hit 501
hit 501
hit 501
hit 501
hit 501
hit 257
hit 306
hit 349
hit 425
hit 677
hit 767
hit 501
hit 501
hit 501
hit 501
hit 501
hit 501
hit 501
hit 1005
hit 501
hit 1005
hit 1000
hit 283
hit 301
hit 336
hit 352
hit 403
hit 418
hit 435
hit 446
hit 461
hit 529
hit 549
hit 690
hit 775
hit 830
hit 846
hit 864
hit 889
hit 908
hit 995
hit 1065
hit 695
hit 1043
hit 1121
hit 1142
hit 511
hit 2048
hit 1024
hit 512
hit 448
hit 1032
hit 513
hit 2048
hit 1024
hit 512
hit 1024
hit 448
hit 501
hit 501
hit 501
hit 501
hit 501
hit 501
hit 501
hit 501
hit 288
hit 501
hit 501
hit 501
hit 501
hit 501
hit 501
hit 501
hit 501
hit 501
hit 288
hit 501
hit 288
hit 501
hit 288
hit 501
hit 501
hit 501
hit 501
hit 501
hit 501
hit 288
hit 501
hit 501
hit 501
hit 501
hit 288
hit 501
hit 1139
hit 1139
hit 1139
hit 1139
hit 1139
hit 1139
hit 501
hit 1139

But I think both the "small_ints initialisation" cost and its "benefit for the startup time" are insignificant among everything else.



One more thing.

pycore_long.h defines:

#define _PY_IS_SMALL_INT(val) ((val) >= 0 && (val) < 256 && (val) < _PY_NSMALLPOSINTS)

It doesn't seem to have any observable impact on performance.
I would guess the compiler optimizations do a good job with constants?
Or maybe the impact is just too small to be observed?

But in either case, if there is no specific reason to cap this twice for the compiler, it might be sensible to remove the second condition and let the compiler make use of it.
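Concretely, the suggested change would be something like this (a hypothetical, untested simplification):

/* Hypothetical: drop the hard-coded 256 cap and let the cache size decide. */
#define _PY_IS_SMALL_INT(val) ((val) >= 0 && (val) < _PY_NSMALLPOSINTS)

(As noted in a later comment, the 256 cap turns out to be intentional for LOAD_SMALL_INT, so relaxing it would also require changes on the bytecode side.)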

@eendebakpt

For regex_v8 there are 979460 calls to get_small_int. For _PyLong_FromMedium, the distribution of sizes (up to 10_000) is:

[Image: distribution of _PyLong_FromMedium allocation sizes for regex_v8]

So for this benchmark increasing _PY_NSMALLPOSINTS to 1000 or 2000 would cover the bulk of the allocations.

@dg-pb

dg-pb commented Apr 25, 2025

1000 or 2000

I don't think I know exactly what would be appropriate; my information on the other implications is too limited for me to have a strong opinion.

I have stored 200K for myself at the cost of 1.6MB, so personally I would like it as high as possible. :)

But from the GitHub search and the chart above, 2048 seems to capture a good number of cases.
1K feels slightly lacking, while 4K/8K feels like a bit of a stretch.

@dg-pb

dg-pb commented Apr 26, 2025

To see the impact on startup time I tested with 100K:

+------------------------+-------------+-------------+--------------+-------------------------+
| Benchmark              | pyMAIN.json | py100K.json | Change       | Significance            |
+========================+=============+=============+==============+=========================+
| python_startup         | 23.2 ms     | 26.0 ms     | 1.12x slower | Significant (t=-85.37)  |
+------------------------+-------------+-------------+--------------+-------------------------+
| python_startup_no_site | 18.7 ms     | 21.9 ms     | 1.17x slower | Significant (t=-146.07) |
+------------------------+-------------+-------------+--------------+-------------------------+

So roughly 1-2% slower for each 10K.

@dg-pb

dg-pb commented Apr 26, 2025

Did some more testing. There is still a chance of false positives, but I ran these individually and a few times.


Regarding the individual tests, of course this will have an impact, but the observable changes are:

For 2048:

+--------------------------+-------------+-----------------+--------------+------------------------+
| Benchmark                | pyMAIN.json | pyWT5_2048.json | Change       | Significance           |
+==========================+=============+=================+==============+========================+
| regex_dna                | 203 ms      | 195 ms          | 1.04x faster | Significant (t=19.88)  |
| regex_effbot             | 3.73 ms     | 3.45 ms         | 1.08x faster | Significant (t=26.66)  |
| regex_v8                 | 32.6 ms     | 29.0 ms         | 1.13x faster | Significant (t=36.89)  |
+--------------------------+-------------+-----------------+--------------+------------------------+

Then, for 8192, additional ones that surface are:

+--------------------------+-------------+-----------------+--------------+------------------------+
| Benchmark                | pyMAIN.json | pyWT5_8192.json | Change       | Significance           |
+==========================+=============+=================+==============+========================+
| nbody                    | 127 ms      | 135 ms          | 1.06x slower | Significant (t=-12.08) |
+--------------------------+-------------+-----------------+--------------+------------------------+
| deepcopy_memo            | 39.1 us     | 36.6 us         | 1.07x faster | Significant (t=18.75)  |
| genshi_text              | 28.7 ms     | 27.5 ms         | 1.05x faster | Significant (t=9.37)   |
| genshi_xml               | 68.9 ms     | 66.0 ms         | 1.05x faster | Significant (t=9.28)   |
| pickle_list              | 5.52 us     | 5.29 us         | 1.04x faster | Significant (t=11.05)  |
| regex_compile            | 182 ms      | 164 ms          | 1.11x faster | Significant (t=23.62)  |
| scimark_fft              | 454 ms      | 431 ms          | 1.05x faster | Significant (t=18.67)  |
| scimark_sor              | 151 ms      | 142 ms          | 1.06x faster | Significant (t=19.80)  |
| spectral_norm            | 142 ms      | 133 ms          | 1.07x faster | Significant (t=23.72)  |
| sqlglot_parse            | 1.67 ms     | 1.53 ms         | 1.09x faster | Significant (t=23.51)  |
| xml_etree_generate       | 129 ms      | 123 ms          | 1.05x faster | Significant (t=8.87)   |
| xml_etree_process        | 88.4 ms     | 85.0 ms         | 1.04x faster | Significant (t=11.19)  |
+--------------------------+-------------+-----------------+--------------+------------------------+

And a few more come to the surface when run with 100K. These are sparse-matrix and integer operations where the integers fall more or less within the range.

+-------------------------+-------------+-----------------+--------------+-------------------------+
| Benchmark               | pyMAIN.json | pyWT5_100k.json | Change       | Significance            |
+=========================+=============+=================+==============+=========================+
| python_startup          | 23.2 ms     | 26.0 ms         | 1.12x slower | Significant (t=-85.37)  |
| python_startup_no_site  | 18.7 ms     | 21.9 ms         | 1.17x slower | Significant (t=-146.07) |
+-------------------------+-------------+-----------------+--------------+-------------------------+
| scimark_monte_carlo     | 85.3 ms     | 78.0 ms         | 1.09x faster | Significant (t=36.84)   |
| scimark_sparse_mat_mult | 6.37 ms     | 6.00 ms         | 1.06x faster | Significant (t=9.88)    |
| spectral_norm           | 142 ms      | 126 ms          | 1.12x faster | Significant (t=45.80)   |
+-------------------------+-------------+-----------------+--------------+-------------------------+

Additionally, I counted all calls to _PyLong_FromMedium for the larger part of the benchmark suite.

The data is split into 2 parts, as the second half is dominated by a couple of benchmarks. Empirical CDFs for both:

[Image: empirical CDF of _PyLong_FromMedium call values, part 1]

[Image: empirical CDF of _PyLong_FromMedium call values, part 2]


So, at least for the benchmarks in pyperformance, the cutoff points seem to be 2K, 10K, then 300K for some reason, then 10M.

So 2 values stand out: 2K and 8/10K.

Cost:
a) 2K: 0.2-0.3% startup-time increase, 16 KB of memory
b) 8K: 1.0-1.5% startup-time increase, 64 KB of memory

Benefits:

From the GitHub range search:
a) the current 256 captures ~78% of use cases
b) 2K would capture an additional 15%
c) 8/10K would add ~1%

From benchmark plot 1:
a) 2K would capture 40% of the used integers above 256
b) 8/10K would add an extra 20%

4K/5K is also an option; on top of 2K it would capture the most beneficial part of the 8/10K range.


But I think 2K would be the most reasonable.

@eendebakpt

One more thing.

pycore_long.h defines:

#define _PY_IS_SMALL_INT(val) ((val) >= 0 && (val) < 256 && (val) < _PY_NSMALLPOSINTS)
It doesn't seem to have any observable impact on performance. I would guess the compiler optimizations do a good job with constants? Or maybe the impact is just too small to be observed?

The compiler will handle this, but it is a bit confusing why it is written this way. Note that _PY_IS_SMALL_INT is only used in flowgraph.c for the LOAD_SMALL_INT bytecode (and not in any of the methods in longobject.c; there, IS_SMALL_INT is used). The reason for the condition that val needs to be >= 0 and < 256 is mentioned in

python/cpython#125972 (comment)

(So a better name would be something like _PY_IS_SMALL_INT_FOR_LOAD_SMALL_INT.) If we increase _PY_NSMALLPOSINTS, then we could try to relax this condition so that more ints can be optimized.

@eendebakpt

@dg-pb Here is data similar to your ECDF. For _PyLong_FromMedium, the distribution of allocated ints:

[Image: distribution of ints allocated via _PyLong_FromMedium]

And zoomed into the region around zero with the other types of allocations:

[Image: allocation distribution zoomed in around zero, including the other allocation types]

The performance increase of some of the benchmarks might be due to the for loop around the actual benchmark. For example, for deepcopy_memo I see no reason for a performance increase with a larger _PY_NSMALLPOSINTS in the actual deepcopy.

https://github.com/python/pyperformance/blob/9e29e3f69595fed2b74a8dab081ce9caf1640cd9/pyperformance/data-files/benchmarks/bm_deepcopy/run_benchmark.py#L42-L51
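To illustrate the point, here is a hypothetical harness (not the actual pyperformance code): even if the benchmarked body allocates no ints itself, the timing loop iterates over a range, so once the loop count exceeds 256 every loop counter is a fresh PyLongObject:

# Hypothetical timing harness. The counter i is a newly allocated int for
# every i >= 256 (modulo freelist reuse), so enlarging the small-int cache
# can speed up the harness loop rather than the workload being measured.
import time

def bench(body, loops=2000):
    t0 = time.perf_counter()
    for i in range(loops):
        body()
    return time.perf_counter() - t0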

@dg-pb

dg-pb commented Apr 26, 2025

For example for deepcopy_memo

Yup, this one and pickle_list did not seem plausible.

Then there are sqlglot_parse, genshi_* and xml_* that I haven't checked.

But the rest seem to be true positives.

Here is data similar to your ECDF

Here is one for the stop argument of all created ranges:

[Image: empirical CDF of the stop argument of all created ranges]

@dg-pb

dg-pb commented Apr 26, 2025

@eendebakpt By the way, how do you create your plots? Do you aggregate into certain ranges on the x-axis?

@dg-pb

dg-pb commented Apr 26, 2025

Ah, OK, I guess these are frequencies of individual numbers. I was just wondering why there are no values below 10, but it makes sense that there aren't.
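For anyone reproducing these plots, a minimal sketch (assuming the logged values sit one per line, as in the "hit N" output above):

# Minimal sketch: empirical CDF of logged allocation values.
# Assumes a log file whose lines end with an integer (e.g. "hit 501").
import sys
import matplotlib.pyplot as plt

values = sorted(int(line.split()[-1]) for line in open(sys.argv[1]))
fractions = [i / len(values) for i in range(1, len(values) + 1)]

plt.step(values, fractions, where="post")
plt.xlabel("allocated int value")
plt.ylabel("fraction of allocations <= x")
plt.title("Empirical CDF of _PyLong_FromMedium values")
plt.show()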
