gh-115758: Optimizer constant propagation for 64-bit ints and doubles #117396

Closed

Conversation


@Fidget-Spinner Fidget-Spinner commented Mar 30, 2024

@Fidget-Spinner Fidget-Spinner changed the title gh-115758: Optimizer constant propagation for 64-bit ints gh-115758: Optimizer constant propagation for 64-bit ints and doubles Mar 31, 2024
@Fidget-Spinner
Member Author

For the following code:


def bench_constants():
    for _ in range(10000000):
        x = 1.0
        y = x + x + x + x + x + x + x + x + x + x + x
    for _ in range(10000000):
        x = 1
        y = x + x + x + x + x + x + x + x + x + x + x

if __name__ == '__main__':
    bench_constants()

Main with uops: 2.726s
This branch with uops: 861.7ms
The microbenchmark runs in less than a third of the time on this branch compared to main (roughly a 3x speedup).

@@ -792,48 +792,44 @@ def testfunc(n):

def test_float_add_constant_propagation(self):
def testfunc(n):
a = 1.0
Member Author


This is a limitation of loop-based traces. We should start projecting traces from function entries as well, but I'll leave that for Mark or someone else to do in the future.

@Fidget-Spinner
Member Author

@brandtbucher Windows builds are complaining there's no PyAPI_DATA for memcpy. But memcpy is a C stdlib function. I know stencils are built without stdlib, but is there some way around this? Or do I just expose memcpy as an extern?

@brandtbucher
Member

@brandtbucher Windows builds are complaining there's no PyAPI_DATA for memcpy. But memcpy is a C stdlib function. I know stencils are built without stdlib, but is there some way around this? Or do I just expose memcpy as an extern?

Yeah, sorry. You're gonna need to get rid of memcpy in the jitted code, either by rewriting the copy as a loop or making a call to a C API function that wraps memcpy.

The problem is that the C compiler wants to assume that memcpy is nearby, which isn't necessarily the case. So we have to add a layer of indirection to turn the 32-bit jump into a 64-bit one.
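
For illustration only (hypothetical code, not the actual change in this PR), rewriting the copy as a plain byte loop avoids any out-of-line libc call in the jitted code:

    #include <stddef.h>

    // Hypothetical sketch: copy n bytes without calling memcpy, so the
    // compiler emits an inline loop rather than a call it assumes is nearby.
    static void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            dst[i] = src[i];
        }
    }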

@gvanrossum
Member

The problem is that the C compiler wants to assume that memcpy is nearby, which isn't necessarily the case. So we have to add a layer of indirection to turn the 32-bit jump into a 64-bit one.

TIL. :-) What is it that makes it work for C API functions? Are they always nearby, or do we always do an extra indirection?

And why would the C compiler "know" that memcpy is nearby?

@brandtbucher
Member

brandtbucher commented Apr 1, 2024

TIL. :-) What is it that makes it work for C API functions? Are they always nearby, or do we always do an extra indirection?

We always do the indirection. If it turns out that they are in range, we rewrite the jump to a more efficient one at runtime.

And why would the C compiler "know" that memcpy is nearby?

Because the default "small" code model assumes that all code lives within 31 bits of each other. We want this code model so the (common) jumps between instructions are efficient, at the cost of a layer of indirection between the (ideally less common) function calls.

So basically, all function calls have to be extern (or, for Windows, __declspec(dllimport)). We could change this by jitting trampolines for the out-of-range functions at some point, but that's just trading one type of indirection for another (and adds quite a bit of complexity). We already do this for AArch64 Linux out of necessity... and it ain't pretty.
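
As a rough illustration of the two call forms (simplified, not the actual stencil code):

    // Hypothetical sketch: under the small code model, a direct call is a
    // 32-bit relative jump, so the callee must sit within +/- 2 GiB of the
    // caller. Calling through a function pointer first loads a full 64-bit
    // address from data, so the callee can live anywhere.
    extern void nearby(void);          // direct call: rel32 displacement
    extern void (*far_away)(void);     // indirect call: address loaded from memory

    void caller(void)
    {
        nearby();       // efficient, but assumes the target is in range
        (*far_away)();  // one extra load -- the layer of indirection
    }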

@Fidget-Spinner
Member Author

Fidget-Spinner commented Apr 1, 2024

@brandtbucher I'm getting an assertion failure on the JIT debug build of 32-bit Windows. Specifically, the failing assertion is

                // 32-bit absolute address.
                // Check that we're not out of range of 32 unsigned bits:
                assert(value < (1ULL << 32));

from Jit.c (at least, this is all I can decode anyway, because the error logs have special broken characters).

WASI is also failing, but that seems to be a floating-point issue.

@brandtbucher
Member

Hm, okay, I'll try to dig into that a bit today. Sorry for the friction!

@brandtbucher
Member

It looks like non-JIT 32-bit Windows is failing too, though. Is it because you're trying to cram a 64-bit double into a 32-bit pointer?

@brandtbucher
Member

WASI is 32 bits also. I think that's the issue.

@Fidget-Spinner
Member Author

Fidget-Spinner commented Apr 1, 2024

It looks like non-JIT 32-bit Windows is failing too, though. Is it because you're trying to cram a 64-bit double into a 32-bit pointer?

WAIT, I forgot it casts to a 32-bit pointer. Ahh, thanks so much for that! So sorry for the noise!

@brandtbucher
Member

FWIW, I think 32-bit JIT builds only have a 32-bit operand for uops. This is by accident, not by design, but I just don't think we've noticed since we never stick anything wider than a pointer in there.
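
A small, hypothetical illustration of the mismatch (not the actual operand-packing code): on a 32-bit platform, a pointer-sized field simply cannot hold a double's bit pattern.

    #include <stdint.h>
    #include <string.h>

    // Hypothetical sketch: pack a double's bits into a pointer-sized value.
    uintptr_t pack_double(double x)
    {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);   // 8 bytes of payload
        // On a 32-bit build uintptr_t is only 4 bytes, so this cast silently
        // drops the upper half -- the kind of overflow the JIT's
        // "value < (1ULL << 32)" assertion catches.
        return (uintptr_t)bits;
    }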

@Fidget-Spinner
Member Author

Ok WASI now passes. So there is indeed a bug in 32-bit JIT builds.

@markshannon
Member

The stats show a very small reduction in the number of _BINARY_OP_ADD_INTs, with an equivalent number of _LOAD_INTs added. Given that _LOAD_INT still needs to create a new int, and we still need to destroy the left operand and most of the right operands, I would expect the performance impact to be unmeasurably small.

An optimization like this could make sense when combined with another pass that more effectively eliminates dead values, but I'd be surprised if that were worth the effort, at least until the tier 2 optimizer is much more capable.
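
For context, the kind of folding under discussion looks roughly like this; the types and helper below are hypothetical, not the actual tier 2 optimizer API:

    #include <stdbool.h>
    #include <stdint.h>

    // Hypothetical sketch: during abstract interpretation, if both inputs to
    // an integer add are known constants, compute the sum now (overflow
    // handling omitted) so the add can be replaced with a constant load.
    // The load still has to materialize a fresh int object at runtime,
    // which is why the win here is small.
    typedef struct { bool is_const; int64_t value; } abstract_int;

    static bool fold_int_add(abstract_int left, abstract_int right, abstract_int *out)
    {
        if (left.is_const && right.is_const) {
            out->is_const = true;
            out->value = left.value + right.value;
            return true;   // caller rewrites _BINARY_OP_ADD_INT into a constant load
        }
        return false;      // inputs unknown: keep the original uop
    }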

@brandtbucher
Member

Is this still something we want to do right now, or should it wait for the partial evaluation pass?

@Fidget-Spinner
Member Author

I have a feeling we might want a pre-processing step in the specializer/redundancy eliminator pass first, before passing it to the partial evaluator. But either way works. So let's wait and see.

@Fidget-Spinner
Member Author

Doing this in the partial evaluation pass instead.
