
Debugging Performance #40


Closed
torch2424 opened this issue Mar 12, 2018 · 44 comments

@torch2424
Contributor

Hello,

Figured I'd ask here instead of Twitter, just so this is archived in some form.

So, I'm reaching a point where performance is now the bottleneck on my wasmBoy project, before moving on to the final major task :)

Using the Chrome dev tools profiler, I can see that my main emulation loop in wasm is on average taking about 14 seconds to run, which means I should probably work on optimizing it if I hope to reach the 16ms needed for 60fps.

I'm currently building using the following CLI command: npx asc wasm/index.ts -b dist/wasm/index.untouched.wasm -t dist/wasm/index.untouched.wast -O3z --validate --sourceMap --memoryBase 8861696 && echo \"Built Wasm!\".

And here's a screenshot of my profile if that helps at all:

screen shot 2018-03-12 at 2 28 12 am

I know the project is still very young, and that performance is something that would normally be focused on much later.

But if you know any tips/tricks/hacks I could use to get this running a bit faster, that would be awesome :). Especially any tips on running benchmarks through the source map or something.

No worries if not; I'm sure I could manually go through and find things here and there. Thanks!

@torch2424
Contributor Author

Also, if it means anything, I'm almost exclusively using let instead of const, simply because I get weird errors when building with const. If that would make a huge difference, let me know :)

@dcodeIO
Member

dcodeIO commented Mar 12, 2018

Especially any tips on running benchmarks through the source map or something.

Source map support appears to be a bit shaky at this point. If you are bundling the wasm binary in some way, you'd have to provide an absolute path to the source map (here: as the argument to --sourceMap), because the blob doesn't have a base URL that it could use to look up the relative source map location. Alternatively, instantiateStreaming on an actual binary can be used. The latter is able to figure out relative locations because it is passed the result of fetch and thus retains location information.

I haven't experimented a lot with it, but afaik Firefox has some support, and I think I read somewhere that even breakpoints might be supported, but not sure.

I'd expect the most useful information from compiling without optimizations (otherwise inlining might hide where the load actually comes from), with source maps, and then maybe splitting up the heavy functions into multiple smaller parts so these show up as their own entries in the analysis.

Also, if it means anything, I'm almost exclusively using let instead of const

In AssemblyScript a const declaration requires that the expression evaluates to an actual constant value at compile time. Something like const b = a + 1 can become a constant, for example, but only if a is constant as well. There shouldn't be much of a penalty for using let, though (not sure how these are implemented exactly, but I'd assume locals to go into registers), except that constant values are automatically inlined. The difference between var and let is that the former is guaranteed to become a distinct WebAssembly local while the latter become shared locals depending on their block's lifetime. Globals might differ, as these probably become loads and stores, but again, not exactly sure.
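A rough plain-TypeScript stand-in for the constant-folding rule described above (ordinary `number`s instead of wasm types; the variable names are made up for illustration):

```typescript
// `const` requires the initializer to fold to a compile-time constant:
const a = 1;      // constant -> its value can be inlined wherever it is used
const b = a + 1;  // still constant, because `a` is constant
let c = b + 1;    // a `let` stays a local; its value is not inlined
```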

@torch2424
Contributor Author

torch2424 commented Mar 12, 2018

Thanks for all the tips!

So I ran through them, and running builds without optimizations does help in figuring things out a bit.

But after reading your blurb about Firefox, I decided to do some research, and Firefox has really good wasm profiling support!

screen shot 2018-03-12 at 11 34 28 am

Chrome will just tell you how long you were running in wasm as one aggregate number, whereas Firefox will tell you how long you spent in each function, and knows the names thanks to the source maps :)

So I guess the solution to this is: use Firefox haha!

If I could ask: I notice that my functions that call load() or store() take about 1ms on average. Do you think that should be expected? 1ms feels acceptable, but it also seems a little slow?

I think this is good to be closed unless you have anything else to say on this topic 😄

@dcodeIO
Member

dcodeIO commented Mar 12, 2018

Sure, feel free to close. Did you find where the load comes from? If it is related to AssemblyScript doing something bad, let me know :)

@dcodeIO
Member

dcodeIO commented Mar 12, 2018

Sorry, just noticed your question.

If I could ask: I notice that my functions that call load() or store() take about 1ms on average. Do you think that should be expected? 1ms feels acceptable, but it also seems a little slow?

Yeah, that seems odd. Are they doing unaligned accesses maybe? But even that should be much faster than 1ms.

@torch2424
Contributor Author

torch2424 commented Mar 12, 2018

Yeah, I kinda snuck that question into an awkward spot haha!

Actually, let me get back to you on the loads; it may be some weird logic I am doing that makes the loads appear slow, even though they are probably running fine.

EDIT: Yeah, I think it is on my end. I have to do a lot of "is it currently safe to read/write in this spot of memory" checks for the gameboy, and I think all the logic being performed there is taking a lot longer than it should. However, the weird part is that the Firefox profiler only shows the "traps" being checked sometimes, when I know for sure that they always are. But I'll investigate in a bit.

EDIT 2: After investigating, I'm not doing any heavy logic in the "traps" functions; it's just 3 if statements. So I actually think it may be the load itself.

screen shot 2018-03-12 at 12 14 03 pm

Also, after learning some more, I found out the source map URL is placed in the actual wasm file, which now makes more sense given what you said earlier. Is there any way to manually pass the source map URL in the CLI? And could you elaborate more on instantiateStreaming? Is that something I pass when I instantiate the wasm module?

@dcodeIO
Member

dcodeIO commented Mar 12, 2018

Is there any way to manually pass the source map URL in the CLI

The --sourceMap parameter takes an optional URL as its parameter. That's then written into the binary.

And could you elaborate more on instantiateStreaming Is that something I pass when I instantiated the wasm module?

WebAssembly.instantiateStreaming is the streaming compilation/instantiation variant of WebAssembly.instantiate. The former takes a fetch(...) result, while the latter takes a buffer. (see)
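A minimal sketch of the two paths; the eight bytes below are just the empty-module header (wasm magic number plus version), for illustration:

```javascript
// Smallest valid wasm module: magic number + version, no sections.
const bytes = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);

// Buffer path: no URL is retained, so a relative source-map URL embedded in
// the binary cannot be resolved -- it would need to be absolute.
WebAssembly.instantiate(bytes).then(({ module, instance }) => {
  console.log('instantiated from buffer');
});

// Streaming path (browser only): the fetch Response carries its URL,
// so a relative source-map location can be resolved against it.
// WebAssembly.instantiateStreaming(fetch('index.untouched.wasm'))
//   .then(({ instance }) => { /* ... */ });
```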

@torch2424
Contributor Author

torch2424 commented Mar 12, 2018

Oh awesome! I'll try the source map flag later tonight :)

And I think I found a good function where we can discuss what's going on underneath the hood.

So, I have a function called renderBackground(), which I have found can take forever (~4ms) to run.

Here's a performance profile screenshot:

screen shot 2018-03-12 at 12 49 36 pm

Here is the code for the function:

export function renderBackground(scanlineRegister: u8, tileDataMemoryLocation: u16, tileMapMemoryLocation: u16): void {

  // NOTE: Camera is referring to what you can see inside the 160x144 viewport of the entire rendered 256x256 map.

  // Get our scrollX and scrollY (u16 to play nice with assemblyscript)
  let scrollX: u16 = <u16>eightBitLoadFromGBMemorySkipTraps(Graphics.memoryLocationScrollX);
  let scrollY: u16 = <u16>eightBitLoadFromGBMemorySkipTraps(Graphics.memoryLocationScrollY);

  // Get our current pixel y position on the 160x144 camera (row that the scanline draws across)
  // this is done by getting the current scroll Y position,
  // and adding it to the Y value the scanline is drawing on the camera.
  let pixelYPositionInMap: u16 = <u16>scanlineRegister + scrollY;

  // Gameboy camera will "wrap" around the background map,
  // meaning that if the pixelValue is 350, then we need to subtract 256 (decimal) to get its actual value
  // pixel values (scrollX and scrollY) range from 0x00 - 0xFF
  if(pixelYPositionInMap >= 0x100) {
    pixelYPositionInMap -= 0x100;
  }

  // Loop through x to draw the line like a CRT
  for (let i: u16 = 0; i < 160; i++) {

    // Get our current X position of our pixel on the 160x144 camera
    // this is done by getting the current scroll X position,
    // and adding it to the X value the scanline is drawing on the camera.
    let pixelXPositionInMap: u16 = i + scrollX;

    // This is to compensate wrapping, same as above
    if(pixelXPositionInMap >= 0x100) {
      pixelXPositionInMap -= 0x100;
    }

    // Divide our pixel position by 8 to get our tile.
    // Since, there are 256x256 pixels, and 32x32 tiles.
    // 256 / 8 = 32.
    // Also, bitshifting by 3 does a division by 8
    // Need to use u16s, as they will be used to compute an address, which will cause weird errors and overflows
    let tileXPositionInMap: u16 = pixelXPositionInMap >> 3;
    let tileYPositionInMap: u16 = pixelYPositionInMap >> 3;


    // Get our tile address on the tileMap
    // NOTE: (tileMap represents where each tile is displayed on the screen)
    // NOTE: (tile map represents the entire map, not just what is within the "camera")
    // For instance, if we have y pixel 144. 144 / 8 = 18. 18 * 32 = line address in map memory.
    // And we have x pixel 160. 160 / 8 = 20.
    // * 32, because remember, this is NOT only for the camera, the actual map is 32x32. Therefore, the next tile line of the map, is 32 byte offset.
    // Think of indexing a 2d array as a 1d array and it makes sense :)
    let tileMapAddress: u16 = tileMapMemoryLocation + (tileYPositionInMap * 32) + tileXPositionInMap;

    // Get the tile Id on the Tile Map
    let tileIdFromTileMap: u8 = eightBitLoadFromGBMemorySkipTraps(tileMapAddress);

    // Now get our tileDataAddress for the corresponding tileID we found in the map
    // Read the comments in _getTileDataAddress() to see what's going on.
    // tl;dr if we had the tile map of "a b c d", and wanted tileId 2.
    // This function returns the starting memory location for the tile 'c'.
    let tileDataAddress: u16 = getTileDataAddress(tileDataMemoryLocation, tileIdFromTileMap);

    // Now we can process the individual bytes that represent the pixels on a tile

    // Get the y pixel of the 8 by 8 tile.
    // Simply modulo the scanline.
    // For instance, let's say we are printing the first line of pixels on our camera,
    // And the first line of pixels on our tile.
    // yPixel = 1. 1 % 8 = 1.
    // And for the last line
    // yPixel = 144. 144 % 8 = 0.
    // 0 Represents last line of pixels in a tile, 1 represents first. 1 2 3 4 5 6 7 0.
    // Because remember, we are counting lines on the display NOT including zero
    let pixelYInTile: u16 = pixelYPositionInMap % 8;
    // Remember to represent a single line of 8 pixels on a tile, we need two bytes.
    // Therefore, we need to multiply our modulo by 2, to get the correct line of pixels on the tile.
    // Again, think like you had to map a 2d array as a 1d.
    let byteOneForLineOfTilePixels: u8 = eightBitLoadFromGBMemorySkipTraps(tileDataAddress + (pixelYInTile * 2))
    let byteTwoForLineOfTilePixels: u8 = eightBitLoadFromGBMemorySkipTraps(tileDataAddress + (pixelYInTile * 2) + 1);

    // Same logic as pixelYInTile.
    // However, We need to reverse our byte,
    // As pixel 0 is on byte 7, and pixel 1 is on byte 6, etc...
    // Therefore, if pixelX was 2, then it really needs to be 5
    // So 2 - 7 = -5, * -1 = 5
    let reversedPixelXInTile: i16 = <i16>pixelXPositionInMap % 8;
    let pixelXInTile: u8 = <u8>((reversedPixelXInTile - 7) * -1);

    // Now we can get the color for that pixel
    // Colors are represented by getting the X position of byteTwo, and the X position of byteOne
    // To Get the color Id.
    // For example, the result of the color id is 0000 00[xPixelByteTwo][xPixelByteOne]
    // See: How to draw a tile/sprite from memory: http://www.codeslinger.co.uk/pages/projects/gameboy/graphics.html
    let paletteColorId: u8 = 0;
    if (checkBitOnByte(<u8>pixelXInTile, byteTwoForLineOfTilePixels)) {
      // Byte one represents the second bit in our color id, so bit shift
      paletteColorId += 1;
      paletteColorId = (paletteColorId << 1);
    }
    if (checkBitOnByte(<u8>pixelXInTile, byteOneForLineOfTilePixels)) {
      paletteColorId += 1;
    }

    // Now get the colorId from the pallete, to get our final color
    // Developers could change colorIds to represents different colors
    // in their palette, thus we need to grab the color from there
    let pixelColorInTileFromPalette: u8 = getColorFromPalette(paletteColorId, Graphics.memoryLocationBackgroundPalette);

    // FINALLY, RENDER THAT PIXEL!
    // Only rendering camera for now, so coordinates are for the camera.
    setPixelOnFrame(i, scanlineRegister, pixelColorInTileFromPalette);
  }
}

So in particular, I want to look at checkBitOnByte() and getTileDataAddress(), as those appear to be the functions slowing everything down. Though I am starting to second-guess the accuracy of the profiler.

Anyways, checkBitOnByte() is the following:

export function checkBitOnByte(bitPosition: u8, byte: u8): boolean {
  byte = byte & (0x01 << bitPosition);
  if(byte > 0) {
    return true;
  } else {
    return false;
  }
}

This function is quite small and looks like it should run fast. Do you have any idea why it might be slow?

And the other function, getTileDataAddress() is as follows:

export function getTileDataAddress(tileDataMemoryLocation: u16, tileIdFromTileMap: u8): u16 {

  // Watch this part of The ultimate gameboy talk: https://youtu.be/HyzD8pNlpwI?t=30m50s
  // A line of 8 pixels on a single tile, is represented by 2 bytes.
  // since a single tile is 8x8 pixels, 8 * 2 = 16 bytes
  let sizeOfTileInMemory: u8 = 16;
  let tileDataAddress: u16 = 0;

  // Get the tile ID's tile address from tile data.
  // For instance, let's say our first line of tile data represents tiles for letters:
  // a b c d e f g
  // And we have tileId 0x02. That means we want the tile for the 'c' character
  // Since each tile is 16 bytes, it would be the starting tileDataAddress + (tileId * tileSize), to skip over tiles we don't want
  // The whole signed thing is weird, and has something to do with how the second set of tile data is stored :p
  if(tileDataMemoryLocation === Graphics.memoryLocationTileDataSelectZeroStart) {
    // Treat the tile Id as a signed int, subtract an offset of 128
    // if the tileId was 0 then the tile would be in memory region 0x9000-0x900F
    // NOTE: Assemblyscript, Can't cast to i16, need to make negative manually
    let convertedTileIdFromTileMap = <i16>tileIdFromTileMap;
    let signedTileId: i16 = tileIdFromTileMap + 128;
    if (checkBitOnByte(7, tileIdFromTileMap)) {
      signedTileId = convertedTileIdFromTileMap - 128;
    }
    let tileIdAddress: i16 = signedTileId * sizeOfTileInMemory;
    tileDataAddress = tileDataMemoryLocation + <u16>tileIdAddress;
  } else {
    // if the background layout gave us the tileId 0, then the tile data would be between 0x8000-0x800F.
    let sixteenBitTileIdFromTileMap: u16 = <u16>tileIdFromTileMap;
    tileDataAddress = tileDataMemoryLocation + <u16>(sixteenBitTileIdFromTileMap * sizeOfTileInMemory);
  }

  return tileDataAddress;
}

So from here we see that only one call is ever made to checkBitOnByte(). And please excuse all the messiness I am doing to mask the byte back into a signed int.

But what I'm starting to suspect is that all of the variable masking, or the number of one-time variables being declared, is what slows things down? Though I also just noticed I'm doing quite a lot of loads in the renderBackground() function. Perhaps it all leads back to the performance of load(), or the logic I'm doing before loading (which is very little)?

Also, thanks for all of the help!

@dcodeIO
Member

dcodeIO commented Mar 12, 2018

On

export function checkBitOnByte(bitPosition: u8, byte: u8): boolean {
  byte = byte & (0x01 << bitPosition);
  if(byte > 0) {
    return true;
  } else {
    return false;
  }
}

I believe a faster way to write it is:

export function checkBitOnByte(bitPosition: u8, byte: u8): bool {
  return (<u32>byte & (1 << bitPosition)) != 0;
}

compiling to just

 (func $checkBitOnByte (; 0 ;) (type $iii) (param $0 i32) (param $1 i32) (result i32)
  (return
   (i32.ne
    (i32.and
     (get_local $1)
     (i32.shl
      (i32.const 1)
      (get_local $0)
     )
    )
    (i32.const 0)
   )
  )
 )

before optimizations, that is, without a branch or any masking/sign-extension being performed. Upcasting a smaller type to a larger one is basically free, and eliminating the branch avoids the penalty of missed branch predictions.

Still strange that it takes up so much time. Might well be though that wrapping of emulated small types like u8, u16 etc. can be optimized a lot more than it is at this point, see #38.
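A quick plain-TypeScript stand-in (using `number` in place of `u8`/`u32`) confirming that the branchless rewrite returns the same results as the original branchy form:

```typescript
// Original, branchy form:
function checkBitBranchy(bitPosition: number, byte: number): boolean {
  byte = byte & (0x01 << bitPosition);
  return byte > 0;
}

// Branchless rewrite suggested above:
function checkBitBranchless(bitPosition: number, byte: number): boolean {
  return (byte & (1 << bitPosition)) !== 0;
}
```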

@dcodeIO
Member

dcodeIO commented Mar 12, 2018

Though I also just noticed I'm doing quite a lot of loads in the renderBackground() function

I don't know the exact performance characteristics, but similar things like memcpy and so on try hard to optimize small loads/stores, like just a byte, to larger ones like up to 64-bits where possible while retaining alignment. Not sure if optimizations like these make sense in your scenario, though.

@MaxGraey
Member

MaxGraey commented Mar 12, 2018

Look inside setPixelOnFrame:

export function setPixelOnFrame(x: u16, y: u16, color: u8): void {
  let offset: i32 = Memory.frameInProgressVideoOutputLocation + (y * 160) + x;
  store<u8>(offset, color + 1);
}

And how you call:

 for (let i: u16 = 0; i < 160; i++) {
    ....
    let pixelColorInTileFromPalette = ...
    setPixelOnFrame(i, scanlineRegister, pixelColorInTileFromPalette);
  }

It seems you can do better like:

  let stride: i32 = Memory.frameInProgressVideoOutputLocation + (scanlineRegister * 160); // cache
  for (let i: i32 = 0; i < 160; i++) { // <- use i32 instead u16
     let pixelColorInTileFromPalette = ...
     store<u8>(stride + i, pixelColorInTileFromPalette + 1); // unwrapped setPixelOnFrame
  }

@MaxGraey
Member

MaxGraey commented Mar 12, 2018

You call this getWasmBoyOffsetFromGameBoyOffset function many times, and it looks quite unoptimized, with a lot of branches.

@torch2424
Contributor Author

torch2424 commented Mar 12, 2018

@dcodeIO Rad, thanks! I'll plop that in later today, and post screenshots of performance improvements :)

@MaxGraey Welcome to the party! Thanks for the tips! I think I may just go through the whole project and make a scan. To be honest, I wrote a lot of this at 3am, so all the code may have some areas like that haha!

Cool, so I think we can agree that I should just go back and look through my code, probably not an Assemblyscript thing besides #38 . I'll close this, and re-open if anything else seems weird.

Thank you everyone! 😄

@MaxGraey
Member

@dcodeIO As I see, @torch2424 used the -O3z flag, but per the documentation we only have -O3s. Is this combination also valid?

@torch2424
Contributor Author

torch2424 commented Mar 12, 2018

Oops, I just closed this right as you commented haha! Want me to re-open @MaxGraey ?

@dcodeIO
Member

dcodeIO commented Mar 12, 2018

Not sure how relevant this is, but also note that static class 'fields' like Memory.frameInProgressVideoOutputLocation become WASM globals. As mentioned earlier, globals can't be optimized as well as locals. Actually, Binaryen was even missing quite a few validation checks for globals until lately, because compiling C/C++ usually doesn't use them at all but places them into memory right away.

Hence, sometimes it might make sense to pull these into locals before a loop or before otherwise accessing them frequently.

See also: WebAssembly/binaryen#1371

@dcodeIO
Member

dcodeIO commented Mar 12, 2018

used -O3z flag, but via documentation we have only -O3s. Or this combination also valid?

It's valid, yeah. A combination of optimize level 3 and shrink level 2. s stands for shrink level 1. Where performance is the most important concern, just -O3 might yield better results because s/z make trade-offs in favor of code size. The default, that is just -O, is the same as -O2s.

@torch2424
Contributor Author

torch2424 commented Mar 12, 2018

@dcodeIO I'll take note of that as well! Thanks for opening the issue on binaryen. Really helps programmers like me haha! 😂

@torch2424
Contributor Author

Also, @MaxGraey, I was looking at the function once more, and I honestly can't think of a more optimal way of doing the logic. Essentially, it just maps one section of the gameboy memory to another section of the wasm memory.

Which is why I have the if statements that way. But I guess I could try reducing the number of branches? Any tips would help, thanks!

@MaxGraey
Member

@torch2424 Try to assign static-access stuff like Memory.gameBytesLocation and so on to local variables, and work with these variables.
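A plain-TypeScript stand-in for this advice (the class name, field, and value here are invented for illustration): read the static field once into a local before the loop, instead of on every iteration.

```typescript
class Memory {
  // assumed value, for illustration only
  static frameInProgressVideoOutputLocation = 1024;
}

function rowOffsets(scanline: number): number[] {
  // one read of the static "global", cached in a local before the loop
  const base = Memory.frameInProgressVideoOutputLocation + scanline * 160;
  const offsets: number[] = [];
  for (let x = 0; x < 160; x++) {
    offsets.push(base + x); // the loop body now touches only locals
  }
  return offsets;
}
```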

@torch2424
Contributor Author

Thanks! I'll try that tomorrow once I wake up! :)

Just in case anyone is wondering, I've been putting a lot of work into performance. And I'm almost at a state where I have enough to run it on my mobile phone browser. So we shall see! :)

@dcodeIO
Member

dcodeIO commented Mar 13, 2018

Would you be interested in contributing a wiki page on debugging / performance analysis based on your experience? :)

@torch2424
Contributor Author

Oh yeah definitely :) Let me see what other small hacks I can find, and I still need to see how much of a difference creating locals for static globals will make.

Then I will write some stuff up!

Also, for the static globals, would it perhaps be faster if I just pulled some of these out as global constants and imported/exported them as needed? Or is it simply the fact that it is a global that makes it slow?

Thanks!

@AssemblyScript AssemblyScript deleted a comment from torch2424 Mar 13, 2018
@dcodeIO
Member

dcodeIO commented Mar 13, 2018

Then I will write some stuff up!

👍

static globals, would it perhaps be faster if I just pulled some of these out as global constants

No need to. A static readonly 'field' compiles to a constant as well, and is inlined just like other constants.

Or is it simply the fact that it is a global that makes it slow

Slow is relative here. Also note that I have no knowledge of how these things are actually implemented in VMs; I'm just speculating. The worst-case scenario I see is that a global has the performance characteristics of a direct load, which I'd not call slow per se.

@torch2424
Contributor Author

Ahhhhhh! I'm not using readonly on my "static constants"; I'll go through and do that after I'm done running errands for the day.

And yes, slow is definitely relative here, and even then, that doesn't sound too bad.

And I just realized I forgot to change my build flag from -O3z to just -O3. I'll do that as well.

And I saw your comment in my commit, thank you for that! :)

@torch2424
Contributor Author

Oh wow, just a quick update: I changed -O3z to just -O3 and it got my worst-case scenarios down from 16ms to 10ms. And on average I'm hitting about 6ms now. I'm stoked haha!

Still need to try the static readonly 😄

@MaxGraey
Member

Great news! 👍

@torch2424
Contributor Author

static readonly didn't help much, if at all, unfortunately. But I'm really close to getting to a point where I can consider this playable on mobile :)

@CryZe

CryZe commented Mar 13, 2018

I just benchmarked the latest branch as native code for fun, and it's about 2 to 3ms there, very rarely 5ms.

@torch2424
Contributor Author

Oh what :o On your awesome Rust version? And within the Chrome dev tools profiler, correct?

@CryZe

CryZe commented Mar 13, 2018

With the Rust version and some very crude timestamp tracking around the update calls. So no, no Chrome dev tools. I don't think I can use that with native code, right?

@torch2424
Contributor Author

Ahhh, no, you do not. I am running everything in the context of the browser, which explains perfectly why it takes longer.

But that's definitely good to know thank you :)

@CryZe

CryZe commented Mar 13, 2018

Why static readonly for most constants and not const?
The _getJoypadButtonStateFromButtonId function compiles down to this (warning: big), which is suboptimal, as those are mutable globals and are therefore not optimized away at compile time:

    // wasm/joypad/index/_getJoypadButtonStateFromButtonId
    fn func145<I: Imports<Memory = M>>(&mut self, imports: &mut I, mut var0: i32) -> i32 {
        let var1 = var0;
        if var1 != 0 {
            let var2 = var0;
            if (var2 == 1i32) as i32 != 0 {
                let var3 = self.global25;
                return var3;
            } else {
                let var4 = var0;
                if (var4 == 2i32) as i32 != 0 {
                    let var5 = self.global26;
                    return var5;
                } else {
                    let var6 = var0;
                    if (var6 == 3i32) as i32 != 0 {
                        let var7 = self.global27;
                        return var7;
                    } else {
                        let var8 = var0;
                        if (var8 == 4i32) as i32 != 0 {
                            let var9 = self.global20;
                            return var9;
                        } else {
                            let var10 = var0;
                            if (var10 == 5i32) as i32 != 0 {
                                let var11 = self.global21;
                                return var11;
                            } else {
                                let var12 = var0;
                                if (var12 == 6i32) as i32 != 0 {
                                    let var13 = self.global22;
                                    return var13;
                                } else {
                                    let var14 = var0;
                                    if (var14 == 7i32) as i32 != 0 {
                                        let var15 = self.global23;
                                        return var15;
                                    }
                                }
                            }
                        }
                    }
                }
            }
        } else {
            let var16 = self.global24;
            return var16;
        }
        0i32
    }

Also this could just be a switch, unless that's not supported yet.

@torch2424
Contributor Author

torch2424 commented Mar 13, 2018

Switches are supported; I'm just weird in that I prefer if statements, but I can definitely go through and make the major if/else blocks switches. Do you think that will improve performance a lot? I know switches are faster than ifs, but not by how much.

Also, for the static readonly, dcodeIO stated a little above, "A static readonly 'field' compiles to a constant as well, and is inlined just like other constants", so I went ahead and did that on my sound-fixes branch.

The reason for having the joypad state as mutable globals is simply that instance classes aren't quite done yet, and this is the only way I could think of to section my code to be somewhat object-oriented. 😞 Do you have another idea for doing this? I'd appreciate the help :)

@dcodeIO
Member

dcodeIO commented Mar 13, 2018

Do you think that will improve performance a lot?

Switches are more efficient than if-else-elseif when done right, as they compile to a jump table. The key is that the values you switch over are close to each other, like switching over an enum of 0, 1, 2, 3, 4. If the values are spread out (large holes in between), it will either fall back to br_ifs, for example with an -s flag, or, in -O3, will generate huge jump tables that blow up the code size significantly. Someone told me that if is the fastest, but only if branch prediction can do its job very well, that is most of the time one specific branch is taken. Next comes a select<T>, which becomes a conditional move but is limited to simple operands (Binaryen optimizes simple ifs this way) and then jump tables. An if where branch prediction doesn't perform well comes in last.
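For instance, a switch over a dense run of values like the one sketched here (the names are invented for illustration) is a good candidate for a single jump table:

```typescript
// Case values 0..4 are contiguous, so an optimizing build can emit one
// br_table instead of a chain of compares and branches.
function buttonName(id: number): string {
  switch (id) {
    case 0: return 'up';
    case 1: return 'right';
    case 2: return 'down';
    case 3: return 'left';
    case 4: return 'a';
    default: return 'unmapped';
  }
}
```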

Also, for the static readonly, dcodeIO stated a little above, "A static readonly 'field' compiles to a constant as well, and is inlined just like other constants", so I went ahead and did that on my sound-fixes branch.

Yeah, that's essentially because TypeScript doesn't have static const, so I had to be creative.

  • A static mutable becomes a mutable global, e.g. "SomeClass.someVar", similar to a global var
  • A static readonly becomes an immutable global (removed by the optimizer if not exported) with its value being inlined, similar to a global const

because instance classes aren't quite done yet

You can actually make them manually point to some memory using changetype, if they are simple. You won't gain anything from it, though, as property accesses like `this.someVar` compile to direct loads with an offset immediate anyway, similar to the worst-case scenario I made up for global accesses above.

@CryZe

CryZe commented Mar 13, 2018

Oh, seems like I just misread the code. I just checked again, and the actual static readonly values are indeed inlined :D

@torch2424
Contributor Author

So I tried using switch statements, and they only helped by a slight amount, unfortunately.

I think my last major bottleneck is my getWasmBoyOffsetFromGameboyOffset() function. I know MaxGraey suggested I try caching the globals, but personally I don't see how/where I could cache them 😢

Though, after all of this, it runs like butter on desktop. Just not there yet for mobile devices :(

@dcodeIO
Member

dcodeIO commented Mar 14, 2018

You could reduce the number of checks a bit by eliminating the && parts where it is already known in the else that the offset must be greater than or equal to the value compared against in the if:

export function getWasmBoyOffsetFromGameBoyOffset(gameboyOffset: u32): u32 {
  // Find the wasmboy offset
  if (gameboyOffset < Memory.switchableCartridgeRomLocation) {
    // Cartridge ROM - Bank 0 (fixed)
    // 0x0000 -> 0x073800
    return gameboyOffset + Memory.gameBytesLocation;
  }
  if (gameboyOffset < Memory.videoRamLocation) {
    // Cartridge ROM - Switchable Banks 1-xx
    // 0x4000 -> (0x073800 + 0x4000)
    return getRomBankAddress(gameboyOffset) + Memory.gameBytesLocation;
  }
  if (gameboyOffset < Memory.cartridgeRamLocation) {
    // Video RAM
    // 0x8000 -> 0x000400
    return (gameboyOffset - Memory.videoRamLocation) + Memory.gameBoyInternalMemoryLocation;
  }
  if (gameboyOffset < Memory.internalRamBankZeroLocation) {
    // Cartridge RAM - A.K.A External RAM
    // 0xA000 -> 0x008400
    return getRamBankAddress(gameboyOffset) + Memory.gameRamBanksLocation;
  }
  // NOTE / TODO: Switchable Internal Ram Banks?
  // 0xC000 -> 0x000400
  return (gameboyOffset - Memory.videoRamLocation) + Memory.gameBoyInternalMemoryLocation;
}

This certainly depends on correct ordering, that is, the compared Memory.XY offset should increase with each if. Alternatively, you could come up with a bit-fiddle that maps 0x0000+, 0x4000+ etc. to specific integer values and switch over these (e.g., making the cases [0, 1, 2, 3], [4, 5, 6, 7], [8, 9], [A, B], default, and shifting gameboyOffset so it maps).
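As a rough sketch of that bit-fiddle idea: the top nibble of a 16-bit GameBoy address already identifies the memory region, so one shift can drive a dense switch. The constants and bank helpers below are hypothetical stand-ins for wasmBoy's real Memory class and helpers, not its actual values:

```typescript
// Hypothetical layout constants standing in for wasmBoy's Memory class.
const GAME_BYTES_LOCATION = 0x073800;
const INTERNAL_MEMORY_LOCATION = 0x000400;
const GAME_RAM_BANKS_LOCATION = 0x008400;
const VIDEO_RAM_LOCATION = 0x8000;

// Stubs standing in for wasmBoy's bank-translation helpers.
function getRomBankAddress(gameboyOffset: number): number { return gameboyOffset; }
function getRamBankAddress(gameboyOffset: number): number { return gameboyOffset; }

function getWasmBoyOffsetSwitch(gameboyOffset: number): number {
  // The top nibble of the 16-bit GameBoy address selects the memory region,
  // so a single shift turns the chain of range checks into a dense jump table.
  switch (gameboyOffset >> 12) {
    case 0x0: case 0x1: case 0x2: case 0x3:
      // Cartridge ROM - Bank 0 (fixed), 0x0000-0x3FFF
      return gameboyOffset + GAME_BYTES_LOCATION;
    case 0x4: case 0x5: case 0x6: case 0x7:
      // Cartridge ROM - switchable banks, 0x4000-0x7FFF
      return getRomBankAddress(gameboyOffset) + GAME_BYTES_LOCATION;
    case 0x8: case 0x9:
      // Video RAM, 0x8000-0x9FFF
      return (gameboyOffset - VIDEO_RAM_LOCATION) + INTERNAL_MEMORY_LOCATION;
    case 0xa: case 0xb:
      // Cartridge RAM (external RAM), 0xA000-0xBFFF
      return getRamBankAddress(gameboyOffset) + GAME_RAM_BANKS_LOCATION;
    default:
      // Internal RAM and everything above, 0xC000+
      return (gameboyOffset - VIDEO_RAM_LOCATION) + INTERNAL_MEMORY_LOCATION;
  }
}
```

With contiguous cases 0x0-0xB plus default, an optimized build can emit this as a single br_table rather than a cascade of comparisons.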

I believe what MaxGraey meant above (or was it regarding setPixelOnFrame?) is that the function is used in loops, and if it were inlined instead, some of the values could be cached in locals.
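A minimal (hypothetical) illustration of that caching idea: when the translation is inlined into a loop, values that would otherwise be re-read through the class on every iteration can be hoisted into locals once. The `memoryLayout` object here is a made-up stand-in for wasmBoy's Memory class:

```typescript
// Hypothetical stand-in for wasmBoy's Memory class.
const memoryLayout = { videoRamLocation: 0x8000, internalMemoryLocation: 0x000400 };

function translateMany(offsets: number[]): number[] {
  // Hoist the property reads into locals once, instead of re-reading them
  // (a load each time in wasm) on every loop iteration.
  const videoRamLocation = memoryLayout.videoRamLocation;
  const internalMemoryLocation = memoryLayout.internalMemoryLocation;
  const out: number[] = [];
  for (const gameboyOffset of offsets) {
    out.push((gameboyOffset - videoRamLocation) + internalMemoryLocation);
  }
  return out;
}
```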

@torch2424
Copy link
Contributor Author

Oh snap! Thanks! :)

I plugged that in and it helped a little bit.

I think I am going to see if I can get some more info from the FF profiler, and go from there.

Again, thank you EVERYONE for all the help, it is very much appreciated :) I may try to re-architect how I am syncing everything by audio to get it running more consistently; smoothing things out so one frame isn't 10ms while the next is 2 may help a lot.

I'm going to dig around a bit more, and then I'll start a write up :)

@dcodeIO
Copy link
Member

dcodeIO commented Mar 14, 2018

Maybe one note on switches: In an unoptimized build, switches work like if statements (through br_if), while in an optimized build, Binaryen makes them proper br_tables. This is done so Binaryen can decide, based on optimization levels, how much space a br_table may take when there are 'holes' between the values.

@torch2424
Copy link
Contributor Author

Oh awesome, that is nice to know! It makes sense with some results I am about to post below...

So on the topic of this, I actually got a great idea from @binji: using performance.now() instead of the Firefox profiler.

And lo and behold, the docs for performance.now() say: "To offer protection against timing attacks and fingerprinting, the precision of performance.now() might get rounded depending on browser settings. In Firefox, the privacy.reduceTimerPrecision preference is enabled by default and defaults to 20us in Firefox 59; in 60 it will be 2ms."

While this is for performance.now(), I am starting to suspect the Firefox profiler won't show any function taking less than 2ms, which is why my loads (which happen very often, but I now know are short) would appear to look so large :)

I wrote a performance timestamping function in the env namespace of my project that simply bubbles up to JS to call performance.now() to track the times of individual functions, and now things are starting to make a LOT more sense. However, I can't verify how accurate this is, as some functions run so fast that Chrome can't catch them with performance.now()
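The general shape of that approach looks something like the sketch below. The import names and bookkeeping here are assumptions for illustration, not wasmBoy's actual API: the wasm module imports begin/end hooks through the env namespace, and the JS side pairs them up with performance.now():

```typescript
// Use performance.now() when available (browser / modern Node), Date.now() otherwise.
const now = (): number => (globalThis as any).performance?.now?.() ?? Date.now();

// Per-function-id bookkeeping: open start times and collected durations.
const starts = new Map<number, number>();
const timings = new Map<number, number[]>();

// Hypothetical begin/end hooks, passed to the wasm module as env imports.
const envTimers = {
  performanceTimestampStart(id: number): void {
    starts.set(id, now());
  },
  performanceTimestampEnd(id: number): void {
    const start = starts.get(id);
    if (start === undefined) return; // end without a matching start
    const samples = timings.get(id) ?? [];
    samples.push(now() - start);
    timings.set(id, samples);
  },
};

// These would be wired up at instantiation time, e.g.:
// WebAssembly.instantiateStreaming(fetch("index.untouched.wasm"), { env: envTimers });
```

The wasm side then just calls the imported start/end pair around the function body it wants measured.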

My Sound implementation takes the longest to run, with the CPU being second, and graphics third. Sound makes sense, as other Gameboy emu devs said they had to optimize theirs, so it leads me to think my numbers and methods are working, allowing me to more accurately figure this out.

Since I am not quite done yet, and am still learning new tricks/hacks like this, I think I am going to wait a bit longer before writing that write up, but I am super close to being able to say I got a good grasp on how to solve these performance problems 😄

@torch2424
Copy link
Contributor Author

@dcodeIO

So, I think I finally figured out my problem. Ended up not being how fast or slow my wasm was, but how often I was running it.

Essentially, I accidentally ran two requestAnimationFrame() loops of my emulator at once, and all of a sudden mobile could handle running everything at 1.5x speed. I realized that I need to pre-fetch some things when I can afford it (like when my wasm is only taking 2-5 ms), for when I can't (7-12 ms).
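For anyone hitting the same trap, a minimal guard against double-scheduling the loop could look like this; runWasmFrame and the wiring are hypothetical stand-ins, not wasmBoy's actual code:

```typescript
// Declare the browser API so this also type-checks outside a DOM environment.
declare function requestAnimationFrame(callback: () => void): number;

let rafId: number | null = null;

function runWasmFrame(): void {
  // ... run one frame of the wasm core (hypothetical) ...
}

function emulationLoop(): void {
  runWasmFrame();
  rafId = requestAnimationFrame(emulationLoop);
}

function startEmulator(): void {
  // Bail if a frame is already scheduled; calling this twice would
  // silently double the per-frame work, which is exactly the trap above.
  if (rafId !== null) return;
  rafId = requestAnimationFrame(emulationLoop);
}
```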

I have yet to implement it, but I am like 99.99% sure it is going to work, and that this is not a wasm performance problem anymore :)

But, with that being said, @CryZe confirmed that all the debugging I did sped up the emulator by about 30-40%.

So in my write-up, I was going to go over profiling and performance.now(), but this requestAnimationFrame() issue is more JS-specific. However, games, emulators, and other heavy tasks that wasm is kind of "meant" for (huge air quotes) would probably fall into the same trap as me. Do you mind if I include this in the write-up?

If not, I can split this off into a separate Medium article (which I may do anyway), and just link to it in the write up :)

@dcodeIO
Copy link
Member

dcodeIO commented Mar 15, 2018

If not, I can split this off into a separate Medium article (which I may do anyway), and just link to it in the write up :)

If you are doing a Medium article anyway, I am also fine with just linking to it from the wiki and you can include any information you'd like right away. Think that'd be easiest? :)

@torch2424
Copy link
Contributor Author

Yeah I agree, thanks! :) I can stay up late tonight because I don't have to start my day till about noon tomorrow. So if I can finish up my new requestAnimationFrame syncing tonight and verify the performance on mobile by a reasonable hour, I'll go ahead and start the write-up 😄 If not, maybe this weekend or Monday evening.

radu-matei pushed a commit to radu-matei/assemblyscript that referenced this issue Oct 13, 2020
Bumps [assemblyscript](https://github.com/AssemblyScript/assemblyscript) from 0.12.3 to 0.12.4.
- [Release notes](https://github.com/AssemblyScript/assemblyscript/releases)
- [Commits](AssemblyScript/assemblyscript@v0.12.3...v0.12.4)

Signed-off-by: dependabot-preview[bot] <[email protected]>

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>