Debugging Performance #40
Also, if it means anything, I'm almost exclusively using
Source map support appears to be a bit shaky at this point. If you are bundling the wasm binary in some way, you'd have to provide an absolute path to the source map (here: as the argument to). I haven't experimented a lot with it, but AFAIK Firefox has some support, and I think I read somewhere that even breakpoints might be supported, but I'm not sure. I'd expect the most useful information from compiling without optimizations (otherwise inlining might hide where the load actually comes from), with source maps, and then maybe splitting up the heavy functions into multiple smaller parts so these show up as their own entries in the analysis.
In AssemblyScript a
Thanks for all the tips! So I ran through them, and running builds without optimizations does help in figuring things out a tiny bit. But after reading your blurb about Firefox, I decided to do some research, and Firefox has really good wasm profiling support! Chrome will just tell you how long you were running in wasm as an arbitrary number, whereas Firefox will tell you how long you were in each function, and knows the names thanks to the source maps :) So I guess the solution to this is: use Firefox haha!

If I could ask, I notice that my functions that call

I think this is good to be closed unless you have anything else to say on this topic 😄
Sure, feel free to close. Did you find where the load comes from? If it is related to AssemblyScript doing something bad, let me know :)
Sorry, just noticed your question.
Yeah, that seems odd. Are they doing unaligned accesses maybe? But even that should be much faster than 1ms.
Yeah I kinda snuck that question in an awkward spot haha! Actually let me get back to you on the loads; it may be some weird logic I am doing that makes the loads appear slow, even though they are probably running fine.

EDIT: Yeah, I think it is on my end. I have to do a lot of "is it currently safe to read/write in this spot of memory" checks for the Game Boy, and I think all the logic being performed there is taking a lot longer than it should. However, the weird part is that the Firefox profiler only shows the "traps" being checked sometimes, when I know for sure that they are. But I'll investigate in a bit.

EDIT 2: After investigating, I'm not doing any heavy logic in the "traps" functions. It's just 3 if statements. So I actually think it may be the load itself. Also, after learning some more, I found out the source map URL is placed in the actual wasm file, which now makes more sense of what you said earlier. Is there any way to manually pass the source map URL in the CLI? And could you elaborate more on
The
Oh awesome! I'll try the source map flag later tonight :) And I think I found a good function where we can discuss what's going on under the hood. So, I have a function called

Here's a performance profile screenshot:

Here is the code for the function:
So in particular, I want to look at

Anyways, this function is quite small, and it appears that it should run fast. Do you have any idea why it may run slow? And the other function,

So from here we see that only one call is ever made to

But what I am starting to suspect is that all of the variable masking, or the number of one-time variables being declared, is what makes things slow down? Though I also just noticed I'm doing quite a lot of loads in the

Also, thanks for all of the help!
On

```ts
export function checkBitOnByte(bitPosition: u8, byte: u8): boolean {
  byte = byte & (0x01 << bitPosition);
  if (byte > 0) {
    return true;
  } else {
    return false;
  }
}
```

I believe a faster way to write it is:

```ts
export function checkBitOnByte(bitPosition: u8, byte: u8): bool {
  return (<u32>byte & (1 << bitPosition)) != 0;
}
```

compiling to just

```wat
(func $checkBitOnByte (; 0 ;) (type $iii) (param $0 i32) (param $1 i32) (result i32)
 (return
  (i32.ne
   (i32.and
    (get_local $1)
    (i32.shl
     (i32.const 1)
     (get_local $0)
    )
   )
   (i32.const 0)
  )
 )
)
```

before optimizations, that is, without a branch or any masking/sign-extension being performed. Upcasting a smaller type to a larger one is basically free, and eliminating the branch avoids the penalty of missed branch predictions. Still strange that it takes up so much time. Might well be, though, that wrapping of emulated small types like
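For quick sanity checking outside the wasm build, the branchless version above can be mirrored in plain TypeScript (a sketch; `number` stands in for the `u8`/`u32` types):

```typescript
// Plain-TS mirror of the branchless checkBitOnByte: mask the byte to 8 bits,
// AND with the selected bit, and compare against zero directly instead of
// branching on an intermediate value.
function checkBitOnByte(bitPosition: number, byte: number): boolean {
  return ((byte & 0xff) & (1 << bitPosition)) !== 0;
}

console.log(checkBitOnByte(3, 0b1000)); // true
```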
I don't know the exact performance characteristics, but similar things like memcpy try hard to optimize small loads/stores, like just a byte, into larger ones of up to 64 bits where possible, while retaining alignment. Not sure if optimizations like these make sense in your scenario, though.
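As a runnable illustration of that widening idea, here is a sketch in plain TypeScript, with a `DataView` standing in for wasm linear memory (nothing here is from the project itself):

```typescript
// Four adjacent byte stores vs. one aligned little-endian 32-bit store.
// Both write the same bytes; the wide store issues a single memory access.
const mem = new DataView(new ArrayBuffer(16));

function storeBytes(offset: number, bytes: number[]): void {
  for (let i = 0; i < bytes.length; i++) {
    mem.setUint8(offset + i, bytes[i]);
  }
}

function storeWord(offset: number, b0: number, b1: number, b2: number, b3: number): void {
  // Pack the four bytes into one 32-bit little-endian word and store once.
  const word = (b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)) >>> 0;
  mem.setUint32(offset, word, true);
}

storeBytes(0, [1, 2, 3, 4]);
storeWord(4, 1, 2, 3, 4); // same bytes, one store
```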
Look inside setPixelOnFrame:

```ts
export function setPixelOnFrame(x: u16, y: u16, color: u8): void {
  let offset: i32 = Memory.frameInProgressVideoOutputLocation + (y * 160) + x;
  store<u8>(offset, color + 1);
}
```

And how you call it:

```ts
for (let i: u16 = 0; i < 160; i++) {
  ...
  let pixelColorInTileFromPalette = ...
  setPixelOnFrame(i, scanlineRegister, pixelColorInTileFromPalette);
}
```

It seems you can do better, like:

```ts
let stride: i32 = Memory.frameInProgressVideoOutputLocation + (scanlineRegister * 160); // cache
for (let i: i32 = 0; i < 160; i++) { // <- use i32 instead of u16
  let pixelColorInTileFromPalette = ...
  store<u8>(stride + i, pixelColorInTileFromPalette + 1); // unwrapped setPixelOnFrame
}
```
You call this getWasmBoyOffsetFromGameBoyOffset function many times, and it looks very unoptimal, with a lot of branches.
@dcodeIO Rad, thanks! I'll plop that in later today, and post screenshots of the performance improvements :)

@MaxGraey Welcome to the party! Thanks for the tips! I think I may just go through the whole project and do a scan. To be honest, I wrote a lot of this at 3am, so all the code may have some areas like that haha!

Cool, so I think we can agree that I should just go back and look through my code; it's probably not an AssemblyScript thing besides #38. I'll close this, and re-open if anything else seems weird. Thank you everyone! 😄
@dcodeIO As I see, @torch2424 used the -O3z flag, but in the documentation we have only -O3s. Or is this combination also valid?
Oops, I just closed this right as you commented haha! Want me to re-open, @MaxGraey?
Not sure how relevant this is, but also note that static class 'fields' like

Hence, sometimes it might make sense to pull these into locals before a loop or otherwise accessing them frequently. See also: WebAssembly/binaryen#1371
It's valid, yeah. A combination of optimize level 3 and shrink level 2.
@dcodeIO I'll take note of that as well! Thanks for opening the issue on binaryen. Really helps programmers like me haha! 😂
Also, @MaxGraey, I was looking at the function once more, and I honestly can't think of a more optimal way of doing the logic. Essentially, it just points one section of memory in one place to another section of the wasm memory, which is why I have the if statements that way. But I guess I could try reducing the number of branches? Any tips would help, thanks!
@torch2424 Try to assign static-access stuff like Memory.gameBytesLocation and so on to local variables, and work with those variables.
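A sketch of that tip in plain TypeScript (the `Memory` class, its value, and the helper names here are hypothetical stand-ins, not the project's real layout):

```typescript
// Static class field; per the discussion above, AssemblyScript compiles these
// to mutable globals that may be re-read on every access.
class Memory {
  static gameBytesLocation: number = 0x073800;
}

// Re-reads Memory.gameBytesLocation on every iteration.
function sumOffsetsNaive(offsets: number[]): number {
  let total = 0;
  for (const o of offsets) {
    total += o + Memory.gameBytesLocation;
  }
  return total;
}

// Reads the static field once into a local before the loop.
function sumOffsetsHoisted(offsets: number[]): number {
  const base = Memory.gameBytesLocation;
  let total = 0;
  for (const o of offsets) {
    total += o + base;
  }
  return total;
}
```

Both produce identical results; the hoisted version just moves the global read out of the hot loop.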
Thanks! I'll try that tomorrow once I wake up! :) Just in case anyone is wondering, I've been putting a lot of work into performance, and I'm almost at a state where I have enough to run it on my mobile phone browser. So we shall see! :)
Would you be interested in contributing a wiki page on debugging / performance analysis based on your experience? :)
Oh yeah definitely :) Let me see what other small hacks I can find, and I still need to see how much of a difference creating locals for static globals will make. Then I will write some stuff up!

Also, for the static globals, would it perhaps be faster if I just pulled some of these out as global constants, and imported/exported them as needed? Or is it simply the fact that it is global that makes it slow? Thanks!
👍
No need to. A
Slow is relative here. Also note that I have no knowledge of how these things are actually implemented in VMs, just speculating. The worst case scenario I see is that a global has the performance characteristics of a direct
Ahhhhhh! I'm not using read-only on my "static constants"; I'll go through and do that after I'm done running errands for the day. And yes, slow is definitely relative here, and even then, that doesn't sound too bad. And I just realized I forgot to change my build flag from

And I saw your comment on my commit, thank you for that! :)
Oh wow, just a quick update, I changed

Still need to try the
Great news! 👍 |
I just benchmarked the latest branch as native code for fun, and it's about 2 to 3ms there, very rarely 5ms.
Oh what :o On your awesome Rust version? And within the Chrome dev tools profiler, correct?
With the Rust version and some very crude timestamp tracking around the update calls. So no, no Chrome dev tools. I don't think I can use that with native code, right?
Ahhh, no you do not. I am running everything in the context of the browser, which makes perfect sense of why it takes longer. But that's definitely good to know, thank you :)
Why static readonly for most constants and not const?

```rust
// wasm/joypad/index/_getJoypadButtonStateFromButtonId
fn func145<I: Imports<Memory = M>>(&mut self, imports: &mut I, mut var0: i32) -> i32 {
    let var1 = var0;
    if var1 != 0 {
        let var2 = var0;
        if (var2 == 1i32) as i32 != 0 {
            let var3 = self.global25;
            return var3;
        } else {
            let var4 = var0;
            if (var4 == 2i32) as i32 != 0 {
                let var5 = self.global26;
                return var5;
            } else {
                let var6 = var0;
                if (var6 == 3i32) as i32 != 0 {
                    let var7 = self.global27;
                    return var7;
                } else {
                    let var8 = var0;
                    if (var8 == 4i32) as i32 != 0 {
                        let var9 = self.global20;
                        return var9;
                    } else {
                        let var10 = var0;
                        if (var10 == 5i32) as i32 != 0 {
                            let var11 = self.global21;
                            return var11;
                        } else {
                            let var12 = var0;
                            if (var12 == 6i32) as i32 != 0 {
                                let var13 = self.global22;
                                return var13;
                            } else {
                                let var14 = var0;
                                if (var14 == 7i32) as i32 != 0 {
                                    let var15 = self.global23;
                                    return var15;
                                }
                            }
                        }
                    }
                }
            }
        }
    } else {
        let var16 = self.global24;
        return var16;
    }
    0i32
}
```

Also this could just be a switch, unless that's not supported yet.
Switches are supported, and I am just weird in the fact that I prefer

Also, for the

The reason for having the joypad state as mutable globals is simply that instance classes aren't quite done yet, and this is the only way I could think of to section my code to be somewhat object oriented. 😞 Do you have another idea for doing this? I'd appreciate the help :)
Switches are more efficient than if-else-elseif when done right, as they compile to a jump table. The key is that the values you switch over are close to each other, like switching over an enum of 0, 1, 2, 3, 4. If the values are spread out (large holes in between), it will either fall back to br_ifs, for example with an
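A sketch of that dense-values shape in TypeScript (the button ids and return values here are made up for illustration); because the cases are contiguous 0..7, the wasm backend can lower this to a single jump table rather than a chain of compares:

```typescript
// Dense switch over contiguous case values 0..7 — the shape that can become
// a br_table jump table. The default case handles everything off the table.
function getButtonState(buttonId: number): number {
  switch (buttonId) {
    case 0: return 100;
    case 1: return 101;
    case 2: return 102;
    case 3: return 103;
    case 4: return 104;
    case 5: return 105;
    case 6: return 106;
    case 7: return 107;
    default: return 0;
  }
}
```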
Yeah, that's essentially because TypeScript doesn't have
You can actually make them manually point to some memory using
Oh, it seems I just misread the code. I checked again, and the actual static readonly values are indeed inlined :D
So I tried using switch statements, and they only helped by a slight amount, unfortunately. I think my last major bottleneck is my

Though, after all of this, it runs like butter on desktop. Just not there for mobile devices :(
You could reduce the number of checks a bit by eliminating the

```ts
export function getWasmBoyOffsetFromGameBoyOffset(gameboyOffset: u32): u32 {
  // Find the wasmboy offset
  if (gameboyOffset < Memory.switchableCartridgeRomLocation) {
    // Cartridge ROM - Bank 0 (fixed)
    // 0x0000 -> 0x073800
    return gameboyOffset + Memory.gameBytesLocation;
  }
  if (gameboyOffset < Memory.videoRamLocation) {
    // Cartridge ROM - Switchable Banks 1-xx
    // 0x4000 -> (0x073800 + 0x4000)
    return getRomBankAddress(gameboyOffset) + Memory.gameBytesLocation;
  }
  if (gameboyOffset < Memory.cartridgeRamLocation) {
    // Video RAM
    // 0x8000 -> 0x000400
    return (gameboyOffset - Memory.videoRamLocation) + Memory.gameBoyInternalMemoryLocation;
  }
  if (gameboyOffset < Memory.internalRamBankZeroLocation) {
    // Cartridge RAM - A.K.A. External RAM
    // 0xA000 -> 0x008400
    return getRamBankAddress(gameboyOffset) + Memory.gameRamBanksLocation;
  }
  // NOTE / TODO: Switchable Internal RAM Banks?
  // 0xC000 -> 0x000400
  return (gameboyOffset - Memory.videoRamLocation) + Memory.gameBoyInternalMemoryLocation;
}
```

Certainly depends on the correct order, that is, the compared

I believe what MaxGraey meant above (or was it regarding
Oh snap! Thanks! :) I plugged that in and it helped a little bit. I think I am going to see if I can get some more info from the FF profiler, and go from there. Again, thank you EVERYONE for all the help, it is very much appreciated :)

I may try to re-architect how I am syncing all of my stuff by audio to perhaps get it running more consistently; rather than having one frame take 10ms while the next takes 2, that may help a lot. I'm going to dig around a bit more, and then I'll start a write-up :)
Maybe one note on switches: In an unoptimized build, switches work like if statements (through
Oh awesome, that is nice to know! It makes sense with some results I am about to post below... So on the topic of this, I actually got a great idea from @binji of using performance.now() instead of the Firefox profiler. And lo and behold, the docs for performance.now() say:

While this is for performance.now(), I am starting to suspect the Firefox profiler won't show any functions taking less than 2ms, which is why my loads (which happen very often but I now know are short) would appear to look so large :)

I wrote a performance timestamping function in the env namespace in my project that simply bubbles up to JS to call performance.now() to track times of individual functions, and now things are starting to make a LOT more sense. However, I can't verify how accurate this is, as some functions run so fast that Chrome can't catch them with performance.now().

My sound implementation takes the longest to run, with the CPU being second, and graphics third. Sound makes sense, as other Game Boy emu devs said they had to optimize theirs, so it leads me to think my numbers and methods are working, allowing me to more accurately figure this out.

Since I am not quite done yet, and am still learning new tricks/hacks like this, I think I am going to wait a bit longer before writing that write-up, but I am super close to being able to say I have a good grasp on how to solve these performance problems 😄
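The timestamping approach described above could be sketched like this in TypeScript (the timeStart/timeEnd names are made up for the example; in the real project the wasm module would reach helpers like these through its env imports):

```typescript
import { performance } from "perf_hooks";

// Accumulates total wall-clock time per labeled region across many calls.
const totals = new Map<string, number>();
const starts = new Map<string, number>();

function timeStart(label: string): void {
  starts.set(label, performance.now());
}

function timeEnd(label: string): void {
  const start = starts.get(label);
  if (start === undefined) return;
  totals.set(label, (totals.get(label) ?? 0) + (performance.now() - start));
}

// Wrap a suspect region; repeated calls keep accumulating under the same label.
timeStart("updateSound");
let acc = 0;
for (let i = 0; i < 100000; i++) acc += i; // stand-in for emulated work
timeEnd("updateSound");
```

Note that performance.now() is clamped to a coarse resolution in browsers, so very short functions may still read as 0.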
So, I think I finally figured out my problem. It ended up not being how fast or slow my wasm was, but how often I was running it. Essentially, I accidentally ran two requestAnimationFrame() loops of my emulator at once, and all of a sudden, mobile could handle running everything at 1.5x speed. I realized that I need to pre-fetch some things when I can afford it (like when my wasm is only taking 2-5 ms), for when I can't (7-12 ms). I have yet to implement this, but I am like 99.99% sure it is going to work, and this is not a wasm performance problem anymore :)

But, with that being said, @CryZe confirmed all the debugging I did sped up the emulator by about 30-40%. So in my write-up, I was going to go over profiling, and

If not, I can split this off into a separate Medium article (which I may do anyway), and just link to it in the write-up :)
If you are doing a Medium article anyway, I am also fine with just linking to it from the wiki, and you can include any information you'd like right away. Think that'd be easiest? :)
Yeah I agree, thanks! :) I can stay up late tonight, because I don't have to start my day till about noon tomorrow. So if I can finish up my new requestAnimationFrame syncing tonight, and verify the performance on mobile by a reasonable hour, I'll go ahead and start the write-up 😄 If not, maybe this weekend or Monday evening.
Hello,
Figured I'd ask here instead of Twitter, just so this is archived in some form.
So, I'm reaching a point where performance is now my bottleneck on my wasmBoy project before moving on to the final major task :)
By using the Chrome dev tools profiler, I can see that my main emulation loop in wasm is on average taking about 14ms to run, which means I should probably work on optimizing it if I hope to reach the 16ms needed for 60fps.
I'm currently building using the following CLI command:
npx asc wasm/index.ts -b dist/wasm/index.untouched.wasm -t dist/wasm/index.untouched.wast -O3z --validate --sourceMap --memoryBase 8861696 && echo \"Built Wasm!\"
And here's a screenshot of my profile, if that helps at all:
I know the project is still very young, and therefore performance is something that would be focused on much later. But if you know any tips/tricks/hacks I could use to get this running a bit faster, that would be awesome :). Especially any tips on running benchmarks through the source map or something. No worries if not; I am sure I could manually go through and find things here and there. Thanks!