
WASM to X86 Backend #1222


Closed
Tracked by #1485
ubaidsk opened this issue Oct 22, 2022 · 25 comments


ubaidsk commented Oct 22, 2022

There is an idea to implement a WASM to x86 backend. We have to benchmark ASR->x86 against ASR->WASM->x86. If the performance difference is not large, then the big advantage of WASM->x86 would (hopefully) be that it is easier and quicker for us to deliver a working backend.

  • We can start with 32-bit, since it is already figured out; if that turns out to be more complicated, we can just do 64-bit instead.

Notes:

ubaidsk commented Oct 22, 2022

For x86, do we need the binary format, the text-based format, or both?

certik commented Oct 23, 2022

I think you can use the X86 assembler, which just generates a binary, but in Debug mode it can also do a text format.

Alternatively, we just do a binary, and then do x86 -> text assembly.

ubaidsk commented Oct 30, 2022

} else if (t->type == ASR::ttypeType::Character) {
    // the string location is already on function stack
    // now, push the string length onto stack
    wasm::emit_i32_const(
        m_code_section, m_al,
        ASR::down_cast<ASR::Character_t>(t)->m_len);
    // call JavaScript print_str
    wasm::emit_call(
        m_code_section, m_al,
        m_func_name_idx_map[get_hash(
            m_import_func_asr_map["print_str"])]
            ->index);
}

In wasm, when printing strings, we have the length of the string and its location in memory on the stack. In x86 as well, when there is a call to print a string, we would have the length of the string and its memory location on the x86 stack. The concern here is that the memory location of the string is not available to the wasm->x86 backend at compile time; it only becomes available at runtime, when the location is already on the stack. Thus, even if we (somehow) create a label for each string, we are unable to reference the label for a particular string (the label and the string's memory location might not be the same, and we might not be able to relate them). It seems we might need to dynamically allocate strings in x86.
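One common way around this (a sketch of a possible approach, not something implemented here) is to pool all string literals into a static data segment at compile time, so that each string's address becomes a known constant (segment base + offset) that the backend can emit directly. A minimal sketch in Python, with hypothetical names (`StringPool`, `intern`):

```python
# Sketch: a compile-time string pool. Every literal gets a fixed
# (address, length) pair, so the backend can emit the address as an
# immediate instead of needing a runtime label for each string.
class StringPool:
    def __init__(self, base_address: int):
        self.base = base_address   # where the data segment will be loaded
        self.data = bytearray()    # pooled string bytes
        self.offsets = {}          # literal -> offset within the pool

    def intern(self, s: str):
        """Return (address, length) for a literal, adding it only once."""
        encoded = s.encode("utf-8")
        if s not in self.offsets:
            self.offsets[s] = len(self.data)
            self.data += encoded
        return self.base + self.offsets[s], len(encoded)

pool = StringPool(base_address=0x08048000)  # hypothetical load address
addr, length = pool.intern("hello")
addr2, _ = pool.intern("hello")  # interned: same address both times
```

At the end of codegen the pooled bytes are written out as the data segment, so the addresses handed out at compile time are valid at runtime.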

ubaidsk commented Oct 30, 2022

In wasm, when printing strings, we have the length of the string and its location in memory on the stack.

This idea of printing strings using their length and memory location is our own concept/choice. WASI might have a different mechanism for printing strings. At the moment, we do not have support for WASI.

ubaidsk commented Nov 2, 2022

Prospective Roadmap:

@Thirumalai-Shaktivel

@Shaikh-Ubaid sorry, I was a little busy with this PR: #1256.
So, were you able to figure out a way for the comment?
I'm interested in working on this issue alongside you! What are you working on now?

certik commented Nov 2, 2022

Here is how to benchmark:

N = 10000

A_functions = ""
calls = ""
for i in range(N):
    func_A = f"""
def A{i}(x: i32) -> i32:
    y: i32
    z: i32
    y = {i}
    z = 5
    x = x + y * z
    return x
"""
    A_functions += func_A
    calls += f"    y = A{i}(y)\n"

source = f"""\
from ltypes import i32

{A_functions}

def Driver(radius: i32) -> i32:
    y: i32
    y = radius
{calls}
    return y

def Main0():
    print(Driver(5))

Main0()
"""

print(source)

ubaidsk commented Nov 2, 2022

So, were you able to figure out a way for the #1222 (comment)?

There is a hackish idea to support printing strings in the wasm_x86 backend. I am not sure if it will work, but I am hopeful.

I'm interested in working on this issue alongside you! What are you working on now?

Sure, we can work together. I am hoping to follow the roadmap shared here: #1222 (comment). You can proceed with supporting if/else in the wasm_x86 backend. Please feel free to get in touch with any clarification/query related to the wasm or wasm_x86 backends.

certik commented Nov 2, 2022

Here are the timing results. With N = 10000, the above script produces a file with 90,015 lines. Results (best timing for each):

$ time lpython --backend=llvm a.py -o a_llvm.x
lpython --backend=llvm a.py -o a_llvm.x  2.94s user 0.10s system 99% cpu 3.034 total
$ time lpython --backend=x86 a.py -o a_x86.x
lpython --backend=x86 a.py -o a_x86.x  0.09s user 0.01s system 93% cpu 0.104 total
$ time lpython --backend=wasm_x86 a.py -o a_wasm_x86.x
lpython --backend=wasm_x86 a.py -o a_wasm_x86.x  0.11s user 0.01s system 95% cpu 0.129 total

The programs run on x86 Linux:

$ ./a_wasm_x86.x
249975005
$ ./a_x86.x
249975005

Summary for this particular example:

| Backend  | Time (s) | Speedup over LLVM | Lines / s |
| -------- | -------- | ----------------- | --------- |
| LLVM     | 3.034    | 1x                | 29,668    |
| WASM x86 | 0.129    | 23.5x             | 697,790   |
| x86      | 0.104    | 29.2x             | 865,528   |

The WASM x86 backend is roughly 25% slower than the direct x86 backend, which is not bad, given that both are over 20x faster than LLVM. The timings include parsing and semantics too, so the relative speeds of the backends themselves are larger, but the above is a good start to get an idea of the performance.
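The summary numbers follow directly from the raw timings and the 90,015-line file; as a sanity check:

```python
# Recompute the summary table from the raw "user" timings above.
lines = 90015
timings = {"LLVM": 3.034, "WASM x86": 0.129, "x86": 0.104}  # seconds

summary = {
    backend: (round(timings["LLVM"] / t, 1),  # speedup over LLVM
              int(lines / t))                 # lines compiled per second
    for backend, t in timings.items()
}
# e.g. summary["WASM x86"] == (23.5, 697790)
```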

certik commented Nov 2, 2022

@Shaikh-Ubaid can you please submit a PR with the script in #1222 (comment) and put it into src/bin? That way we can use it for benchmarking.

ubaidsk commented Nov 3, 2022

@Shaikh-Ubaid can you please submit a PR with the script in #1222 (comment) and put it into src/bin? That way we can use it for benchmarking.

Submitted here #1260

ubaidsk commented Nov 3, 2022

Debug Mode Benchmark results for N = 10000 on AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx @1.6GHz

(lp) ubaid@ubaid-Lenovo-ideapad-330-15ARR:~/OpenSource/lpython$ time python bench.py 
249975005

real    0m0.461s
user    0m0.396s
sys     0m0.064s
(lp) ubaid@ubaid-Lenovo-ideapad-330-15ARR:~/OpenSource/lpython$ time lpython bench.py --backend llvm
249975005

real    1m55.329s
user    1m55.093s
sys     0m0.154s
(lp) ubaid@ubaid-Lenovo-ideapad-330-15ARR:~/OpenSource/lpython$ time lpython bench.py --backend x86

real    0m24.622s
user    0m24.588s
sys     0m0.028s
(lp) ubaid@ubaid-Lenovo-ideapad-330-15ARR:~/OpenSource/lpython$ time lpython bench.py --backend wasm_x86

real    0m48.122s
user    0m47.993s
sys     0m0.081s

Release Mode Benchmark results for N = 10000 on AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx @1.6GHz

(lp) ubaid@ubaid-Lenovo-ideapad-330-15ARR:~/OpenSource/lpython$ time python bench.py 
249975005

real    0m0.511s
user    0m0.412s
sys     0m0.095s
(lp) ubaid@ubaid-Lenovo-ideapad-330-15ARR:~/OpenSource/lpython$ time lpython bench.py --backend llvm
249975005

real    0m6.126s
user    0m6.021s
sys     0m0.084s
(lp) ubaid@ubaid-Lenovo-ideapad-330-15ARR:~/OpenSource/lpython$ time lpython bench.py --backend x86

real    0m0.282s
user    0m0.241s
sys     0m0.040s
(lp) ubaid@ubaid-Lenovo-ideapad-330-15ARR:~/OpenSource/lpython$ time lpython bench.py --backend wasm_x86

real    0m0.369s
user    0m0.308s
sys     0m0.061s

Note: The above results (for both Debug mode and Release mode) are from a single run each.

@Thirumalai-Shaktivel

This might help with printing negative numbers; I am looking for some other resources.

@Thirumalai-Shaktivel

How do I implement asm_shr_r32_imm8?
Just like:

void asm_shl_r32_imm8(X86Reg r32, uint8_t imm8) {
    if (r32 == X86Reg::eax) {
        m_code.push_back(m_al, 0xc1);
        m_code.push_back(m_al, 0xe0);
        m_code.push_back(m_al, imm8);
    } else {
        throw AssemblerError("Not implemented.");
    }
    EMIT("shl " + r2s(r32) + ", " + i2s(imm8));
}

Where do I look for the hex value for shr?
@certik @Shaikh-Ubaid

@Thirumalai-Shaktivel

@Shaikh-Ubaid, I have a doubt.

void visit_I32Const(int32_t value) {
    m_a.asm_push_imm32(value);

Here the ConstI32 value is pushed onto the stack, right? Does this mean it will be stored in eax?
Instead, why not mov the value to eax and pop eax in the ConstI32 visitor?

ubaidsk commented Nov 7, 2022

Here the ConstI32 value is pushed onto the stack, right? Does this mean it will be stored in eax?

It will be stored on the stack. It will not be stored in eax, I think.

Instead, why not mov the value to eax and pop eax in the ConstI32 visitor?

We need the value to be on top of the stack. Do you mean to first store the value in eax and then push eax onto the stack? That would be a two-step process and hence less efficient, since we can achieve the same result by pushing the value directly onto the stack.
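For context, push imm32 is already a single five-byte instruction: opcode 0x68 followed by the 32-bit immediate in little-endian order. A small sketch of what such a helper emits (asm_push_imm32 here is just an illustrative name, not the exact code in x86_assembler.cpp):

```python
import struct

def asm_push_imm32(code: bytearray, value: int) -> None:
    """Emit `push imm32`: opcode 0x68 + 32-bit little-endian immediate."""
    code.append(0x68)
    code += struct.pack("<i", value)

code = bytearray()
asm_push_imm32(code, 5)  # emits 68 05 00 00 00
# The mov-then-push alternative (B8 imm32 / 50) would cost two
# instructions and six bytes for the same effect.
```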

Thirumalai-Shaktivel commented Nov 7, 2022

Okay. After pushing the value onto the stack,
do we use that value later in the process? How do we use it?

Thirumalai-Shaktivel commented Nov 7, 2022

I'm a little confused here. Can you share some useful resources to better understand the workings of eax, ebx, the stack, ...?

ubaidsk commented Nov 7, 2022

Do we use that value later in the process? How do we use it?

Yes, the instructions that follow use it. For example, after the I32Const there could be an instruction I32Neg (a hypothetical example) which would pop the value off the top of the stack, negate it, and push the result back onto the stack.
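The push/pop discipline can be sketched as a tiny interpreter. I32Neg is the hypothetical instruction from the example above (real WASM spells negation differently), but the pattern is the same for every operator:

```python
# A minimal WASM-style value stack: every instruction pops its
# operands off the stack and pushes its result back on.
stack = []

def i32_const(value: int):   # I32Const: push a constant
    stack.append(value)

def i32_neg():               # hypothetical I32Neg: pop, negate, push
    stack.append(-stack.pop())

def i32_add():               # I32Add: pop two operands, push the sum
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)

i32_const(7)
i32_neg()      # stack is now [-7]
i32_const(10)
i32_add()      # stack is now [3]
```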

@Thirumalai-Shaktivel

Cool, got it. Thanks

ubaidsk commented Nov 7, 2022

I'm a little confused here. Can you share some useful resources to better understand the workings of eax, ebx, the stack, ...?

I mostly used a combination of https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf (also shared on Zulip; it is slightly big), Google Search, and (some) YouTube.

I have not found any specific (great) resource for x86 yet. If you find something, please let me know.

certik commented Nov 10, 2022

In here:

void asm_shl_r32_imm8(X86Reg r32, uint8_t imm8) {
    if (r32 == X86Reg::eax) {
        m_code.push_back(m_al, 0xc1);
        m_code.push_back(m_al, 0xe0);
        m_code.push_back(m_al, imm8);
    } else {
        throw AssemblerError("Not implemented.");
    }
    EMIT("shl " + r2s(r32) + ", " + i2s(imm8));
}

we see the encoding of the SHL r/m32, imm8 instruction, which has the opcode C1 /4 ib.

The /4 goes into the reg field of ModRM_byte(0b11, 4, rm), where rm is the register, in the above case eax (0), reg is 4, and the mode is 0b11 (3), so in total we get 0xE0. The mode has to be 0b11, as documented in "Table 2-2. 32-Bit Addressing Forms with the ModR/M Byte", which specifies that rm then contains a register 0-7 (not a memory address).
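The ModRM packing itself is just three bit fields: mod (2 bits), reg (3 bits), rm (3 bits). A quick sketch; per the same C1 opcode table in the Intel manual, SHR r/m32, imm8 should be C1 /5 (worth double-checking there), which would give 0xE8 for eax:

```python
def modrm(mod: int, reg: int, rm: int) -> int:
    """Pack the ModRM byte: mod (2 bits) | reg (3 bits) | rm (3 bits)."""
    return (mod << 6) | (reg << 3) | rm

EAX = 0  # eax's register number in the rm field

shl_eax = modrm(0b11, 4, EAX)  # SHL r/m32, imm8 is C1 /4 -> 0xE0
shr_eax = modrm(0b11, 5, EAX)  # SHR r/m32, imm8 is C1 /5 -> 0xE8
```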

ubaidsk commented Nov 18, 2022

support floating point numbers (and operations on them)

@certik do we need support for floating-point numbers in the x86 and wasm_x86 backends? It seems that for floating-point numbers we would need to extend x86_assembler.cpp.

certik commented Nov 21, 2022

Yes, we should do floating point numbers.

@certik certik added the wasm label Jan 31, 2023
@certik certik mentioned this issue Jan 31, 2023
certik commented Jan 31, 2023

We now have WASM->x86 backends (both 32 and 64 bit), so I am closing this issue. We can open specific issues for missing features.
