Skip to content

Removed usage of BMI2 pdep with SSE2 alternative #19009

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
5 commits merged into from
Feb 26, 2020

Conversation

gfoidl
Copy link
Member

@gfoidl gfoidl commented Feb 13, 2020

Description

Replaced Bmi2.ParallelBitDeposit with SSE2 alternative.

  • Perf-win on Intel (see benchmark results)
  • Bmi2 pdep has bad perf on AMD, so this issue can be sailed around
  • SSE2 has broader support than BMI2

Addresses #18949

Benchmarks

Code
For the benchmarks the vectorized path are disabled, so only sequential processing is performed, to have better figures for replacing BMI2. With the vectorized path only the "remainder" is processed by BMI2, so the gain won't be as huge.

I don't have access to any AMD machine, so can't run a benchmark for this.
But according to the results from the runtime-repo for a similar change, this should be a huge boost for AMD-machines.

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-alpha1-015774
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT

Method BytesLen Mean Error StdDev Ratio RatioSD
Current_master 4 6.431 ns 0.1013 ns 0.0846 ns 1.00 0.00
This_PR 4 5.083 ns 0.0531 ns 0.0471 ns 0.79 0.01
Current_master 8 5.446 ns 0.0757 ns 0.0671 ns 1.00 0.00
This_PR 8 5.071 ns 0.0529 ns 0.0495 ns 0.93 0.01
Current_master 16 7.341 ns 0.1798 ns 0.1766 ns 1.00 0.00
This_PR 16 6.341 ns 0.1068 ns 0.0999 ns 0.86 0.03
Current_master 32 10.612 ns 0.0405 ns 0.0379 ns 1.00 0.00
This_PR 32 8.439 ns 0.0964 ns 0.0902 ns 0.80 0.01
Current_master 100 26.499 ns 0.2693 ns 0.2519 ns 1.00 0.00
This_PR 100 18.494 ns 0.1560 ns 0.1459 ns 0.70 0.01

@@ -560,12 +561,14 @@ private static bool CheckBytesInAsciiRange(long check)
return (((check - 0x0101010101010101L) | check) & HighBits) == 0;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This didn't inline always in some constellations. So I'll force the JIT to do so 😉

@ghost
Copy link

ghost commented Feb 26, 2020

Hello @halter73!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@ghost ghost merged commit a29bacc into dotnet:master Feb 26, 2020
@gfoidl gfoidl deleted the bmi2_alternative branch February 26, 2020 09:55
@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Aug 24, 2023
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants