Skip to content

Conversation

@IgnacioFDM
Copy link
Contributor

I'm surprised nobody found this earlier, it's been here for 10 years.

Very simple bug

    new Float:test = 1261496448.00000;
    client_print(0, print_chat, "%d %f", floatround(test), test); // floatround will return a negative number

floatround would overflow for floats greater than 2^30 because internally
it would double the number (therefore anything greater than 2^30 results
in something greater than 2^31 which would cause overflow of course)

floatround behaviour is left exactly identical otherwise (although I find
it very weird and wrong to deliberately avoid banker's rounding, it would
be a bad idea to change this behaviour due to compatibility)

Remember to reassemble amxexecn and amxjitsn. I have not included the updated obj files because it's best if arkshine assembles them by himself for security reasons.

floatround would overflow for floats greater than 2^30 because internally
it would double the number (therefore anything greater than 2^30 results
in something greater than 2^31 which would cause overflow of course)

floatround behaviour is left exactly identical otherwise (although I find
it very weird and wrong to deliberately avoid banker's rounding, it would
be a bad idea to change this behaviour due to compatibility)

Remember to reassemble amxexecn and amxjitsn
@dvander
Copy link
Member

dvander commented Jun 21, 2016

Great catch, I'd be curious how you found this.

The intent is round-to-nearest, with the half-up tiebreaking method. I have no idea why, but that's how it's implemented upstream: https://github.com/compuphase/pawn/blob/master/amx/amxfloat.c#L185

I have only very vague memories of why we implemented this by doubling the value. My guess is performance - changing the FPU control word is very slow. See [1] for where this algorithm came from.

If Arkshine & friends don't care about the performance difference, this seems like a reasonable change. Otherwise, you could switch to 64-bit integers. This bug would be present in SourceMod too and that's how we'd want to fix it there.[2]

[1] http://ldesoras.free.fr/doc/articles/rounding_en.pd page 7 or 8
[2] https://github.com/alliedmodders/sourcepawn/blob/master/vm/x86/jit_x86.cpp#L925

@IgnacioFDM
Copy link
Contributor Author

Great catch, I'd be curious how you found this.

Well, to be honest literally half of the possible values floatround can return were erroneous 😝, it's just that you don't usually deal with numbers this big. Being serious though, in my mod players sometimes have ridiculous amounts of HP, and I noticed that their HP started printing incorrectly for 2^30 instead of 2^31 and I narrowed it down to floatround. I looked at amxmodx source code, found there were two codepaths for floatround (with or without optimizations), so I turned off optimizations in core.ini and floatround started working correctly. Therefore the issue had to be in the hand assembled floatround.

If Arkshine & friends don't care about the performance difference, this seems like a reasonable change.

In this commit I tried to change the least code possible (mostly because I was tired and lazy), but both the old bugged and new fixed functions could be optimized quite a bit. Regarding perf differences, although as I said it could be optimized further, my fixed version should be faster than the old bugged version. The old version already channged FCW every single time, and had extra operations (it has an additional add, an additional shift, and extra branching) which I removed. So I see no downsides to this change, as this should be actually faster.

Since I'm no expert in x87 wizardry and I'm kind of busy right now, it would be best if someone more knowledgeable or with more time came with a more optimized version, although as I said, at least for amxmodx this commit should actually be faster than before.

@Arkshine
Copy link
Member

Arkshine commented Jun 22, 2016

Assembly is out of my knowledge atm, therefore, I can't measure the performance impact on the old and new code. Assuming new code is correct, if the performance is equal or superior, and no one is willing is to switch to 64 bits integers, then it seems reasonable to leave as it is. But considering this native is used quite often and everywhere if we could have the right code from the start, this would be preferred.

@IgnacioFDM
Copy link
Contributor Author

@dvander

By the way, now that I look at that that SourcePawn code, there is an additional "bug". If SSE is enabled, it uses cvtss2si, which doesn't support in any MXCSR rounding mode this rounding to nearest with half-up tie breaking. So basically floatround will produce different results if SSE is enabled or not (with SSE enabled, floatround(0.5) == 0, with SSE disabled floatround(0.5) == 1).

Also I find it odd that you are not used the optimized fastfloor for OP_RND_TO_FLOOR

About 64 bit ints, considering everything in amxmodx is compiled for 32 bits, introducing a x86_64 dependency would be a bad idea (since some guy in china who still uses XP will complain, and 32 and 64bit builds or a dispatcher would be needed, and that would be way beyond the scope of this PR).

@Arkshine
You could always benchmark in a plugin by calling floatround() a lot of times. Anyway, my modified version should be faster since simply less instructions are being executed.

I don't think this could be much further optimized without either introducing new instruction set dependencies (x86_64 basically) or keeping it broken for ints > 2^30, other than a few little microoptimizations (FCW could be calculated from a lookup table, and I think some stack operations could be avoided too) which I have no time to do now.

@dvander
Copy link
Member

dvander commented Jun 22, 2016

@IgnacioFDM

By the way, now that I look at that that SourcePawn code, there is an additional "bug". If SSE is enabled, it uses cvtss2si, which doesn't support in any MXCSR rounding mode this rounding to nearest with half-up tie breaking. So basically floatround will produce different results if SSE is enabled or not (with SSE enabled, floatround(0.5) == 0, with SSE disabled floatround(0.5) == 1).

According to Intel docs, "when the conversion is inexact, the result is rounded according to the rounding mode selected in the MXCSR register". So SSE shouldn't make a difference.

Also I find it odd that you are not used the optimized fastfloor for OP_RND_TO_FLOOR

We only optimized the float functions that get really heavy use. But it's certainly game too.

About 64 bit ints, considering everything in amxmodx is compiled for 32 bits, introducing a x86_64 dependency would be a bad idea (since some guy in china who still uses XP will complain, and 32 and 64bit builds or a dispatcher would be needed, and that would be way beyond the scope of this PR).

I meant use FISTP with encoding DF /7 - which stores the result as a 64-bit integer at the given address. Then you would divide that by two, and return 0x80000000 if the result is out of bounds. I'll try to fix this in SourcePawn today and see which approach is faster there.

@dvander
Copy link
Member

dvander commented Jun 22, 2016

@IgnacioFDM Sorry, I misunderstood what you were saying. CVT indeed does not implement half-up, it's half-to-even. Apparently I totally misunderstood that when making the change[1].

Given that change was three years ago, I'm going to remove the half-up rounding entirely from SourceMod. It's dead code anyway since like 99% of machines are SSE2.

For AMX Mod X, I still leave it up to Arkshine whether to break or take a small perf hit - it wouldn't be hard to get performance numbers for this, but on the other hand it's going to be in the single or double digits of extra cycles, so not the worst thing in the world.

[1] https://bugs.alliedmods.net/show_bug.cgi?id=5841

@IgnacioFDM
Copy link
Contributor Author

@dvander

According to Intel docs, "when the conversion is inexact, the result is rounded according to the rounding mode selected in the MXCSR register". So SSE shouldn't make a difference.

I think you don't understand what I mean. This is from sourcepawn

 case OP_RND_TO_NEAREST:
    {
      if (MacroAssemblerX86::Features().sse) {
        // Assume no one is touching MXCSR.
        __ cvtss2si(pri, Operand(stk, 0));
      } else {
        static float kRoundToNearest = 0.5f;
        // From http://wurstcaptures.untergrund.net/assembler_tricks.html#fastfloorf
        __ fld32(Operand(stk, 0));
        __ fadd32(st0, st0);
        __ fadd32(Operand(ExternalAddress(&kRoundToNearest)));
        __ subl(esp, 4);
        __ fistp32(Operand(esp, 0));
        __ pop(pri);
        __ sarl(pri, 1);
      }
      __ addl(stk, 4);
      break;
    }

If the assembler detects SSE it will use the cvtss2si, otherwise it will use the code in the else block. These two alternatives produce different results as I explained above

(Quote myself)

If SSE is enabled, it uses cvtss2si, which doesn't support in any MXCSR rounding mode this rounding to nearest with half-up tie breaking. So basically floatround will produce different results if SSE is enabled or not (with SSE enabled, floatround(0.5) == 0, with SSE disabled floatround(0.5) == 1).

Dvander

We only optimized the float functions that get really heavy use. But it's certainly game too.

Yeah, I just found it very odd considering the quoted assembly tricks doc was about "fastfloor" that you optimized everything (including ceil) but floor 😝

I meant use FISTP with encoding DF /7 - which stores the result as a 64-bit integer at the given address. Then you would divide that by two, and return 0x80000000 if the result is out of bounds. I'll try to fix this in SourcePawn today and see which approach is faster there.

Oh okay I see. That could work. I really have no idea how slow setting FCW is, so this should definitely be benchmarked (since we're trading setting FCW for a few additional ops)

I'll adapt your sourcepawn version to this PR when you're done.

@dvander
Copy link
Member

dvander commented Jun 22, 2016

alliedmodders/sourcepawn#66 for SP patch.

@IgnacioFDM
Copy link
Contributor Author

Ok didn't see your second post in time. Anyway, since the old amxmodx was unoptimized (even though it was using the fast floor method, it still unnecessarily was setting FCW every single time) my change here should actually make things faster than before, so that wouldn't be an issue. It could of course be optimized further both by doing what you suggested, or by breaking compatibility and using the half-even approach with my code which would be very slightly faster (no fadd, no branching, and remove an or) but this second one alone is not worth it in my opinion.

Anyway, it's up to Arkshine, but since I have no time to do a 64 bit int version, and this commit should already make things faster than before, unless someone else wants to improve it further we should go ahead with this change.

@dvander
Copy link
Member

dvander commented Jun 22, 2016

From alliedmodders/sourcepawn#67 I did some performance testing on my Core 2.

  • Broken 32-bit method is about 2.5ns per operation.
  • Correct 64-bit method is about 3ns per operation.
  • FPU control word method is about 4.5ns per operation.

So I think AMX Mod X could take the control word method without any problem.

@Arkshine Arkshine merged commit b9997eb into alliedmodders:master Jun 27, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants