-
Notifications
You must be signed in to change notification settings - Fork 10.3k
Don't mutate strings in Kestrel #17556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some notes for review.
@@ -86,52 +88,65 @@ private static unsafe ulong GetMaskAsLong(byte[] bytes) | |||
} | |||
|
|||
// The same as GetAsciiStringNonNullCharacters but throws BadRequest | |||
[MethodImpl(MethodImplOptions.AggressiveInlining)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK to inline this methods, as in effect it's just the delegate invocation.
var resultString = string.Create(span.Length, new IntPtr(source), s_getAsciiOrUtf8StringNonNullCharacters); | ||
|
||
// If rersultString is marked, perform UTF-8 encoding | ||
if (resultString[0] == '\0') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment below.
if (!StringUtilities.TryGetAsciiString((byte*)state.ToPointer(), output, buffer.Length)) | ||
{ | ||
// Mark resultString for UTF-8 encoding | ||
output[0] = '\0'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't find a better solution on how to communicate back that non-acii data is available. Except using exceptions, but it's not exceptional and slow / miuse of exceptions.
That's why is marked with a "flag", which is checked above at L167.
@@ -17,102 +18,208 @@ internal class StringUtilities | |||
[MethodImpl(MethodImplOptions.AggressiveOptimization)] | |||
public static unsafe bool TryGetAsciiString(byte* input, char* output, int count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm aware of "no further unsafe code in Kestrel", but for code that used hardware intrinsics there is nearly no other way. Therefore I left the signature of this method as is.
{ | ||
// If Vector not-accelerated or remaining less than vector size | ||
if (!Vector.IsHardwareAccelerated || input > end - Vector<sbyte>.Count) | ||
if (Avx2.IsSupported && input <= end - Vector256<sbyte>.Count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some checks with higher threshold for AVX2 kicking in, but with not so good results. Hence AVX2 kicks in as soon there are enough elements.
} | ||
else | ||
{ | ||
throw new PlatformNotSupportedException("should not be here"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will never be reached, but to indiciate that if Arm / Neon comes up as intrinsics the check may be expanded here. JIT eleminates this branch anyway.
} | ||
else | ||
{ | ||
throw new PlatformNotSupportedException("should not be here"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above by AVX2.
Did using string.Create cause slowdown that was mitigated by the new usage of hardware intrinsics? |
Yes. Due the overhead of the delegate call.
Not really.
I would not force the inlining of the default implementations, as there is more code working and would result in bigger call-sites. |
src/Servers/Kestrel/Core/src/Internal/Infrastructure/HttpUtilities.cs
Outdated
Show resolved
Hide resolved
This reverts commit 5ae80c2834471baf34d1e5a05a42e3cce1ff02d7. This is a .NET STandard 2.0 project, so no span is available by default. I think it's not worth it to add a reference to System.Memory-package just for this change.
…ing a null check for the compiler generated cached delegate
Rebased due conflicts in
|
a477a53
to
80a77ff
Compare
{ | ||
// If Vector not-accelerated or remaining less than vector size | ||
if (!Vector.IsHardwareAccelerated || input > end - Vector<sbyte>.Count) | ||
if (Avx2.IsSupported && input <= end - Vector256<sbyte>.Count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pointers are unsigned and end - Vector256<sbyte>.Count
can't go negative; so
byte* p = null;
if (p <= p - 15)
{
Console.WriteLine(true);
}
Will output true
So its better to switch the subtraction around with a <=
test and have input
overflow if it goes negative e.g.
- input <= end - Vector256<sbyte>.Count
+ input - Vector256<sbyte>.Count <= end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this condition won't work.
E.g. this constellation is valid and could be processed by AVX-codepath:
byte* input = 0;
byte* end = 32;
if (input - Vector256<sbyte>.Count <= end)
{
// ...
}
Here it will wrap around to 0xffffffffffffffe0 <= 32
which is false.
So it should be input + Vector256<sbyte>.Count <= end
which itself can overflow, hence the condition is written with the subtraction.
Your concern is valid, so I could add
public static unsafe bool TryGetAsciiString(byte* input, char* output, int count)
{
+ Debug.Assert(input != null);
+ Debug.Assert(output != null);
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or just count >= Vector256<sbyte>.Count
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the easy way 😃
In
} while (input <= end - Vector256<sbyte>.Count); |
And in the lines below -- either for differnt code-path or to process the remainder -- similar conditions are used. To use count
here this value needs bookkeeping.
As this is a Try...
-method I think it's better to just return false
in case input == null
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it provably impossible for end < Vector256<sbyte>.Count
if we know it's not null? I know we were already doing end - Vector<sbyte>.Count
before, but I'm curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory end < Vector256<sbyte>.Count
can be true
, but not in practice.
Even if count = 0
then end
will be the same as input
and that pointer won't reside at addresses lower than 32
. Hence there is no real problem.
As it's internal I'll add another assert for this.
@@ -418,6 +529,36 @@ private static bool CheckBytesInAsciiRange(Vector<sbyte> check) | |||
return Vector.GreaterThanAll(check, Vector<sbyte>.Zero); | |||
} | |||
|
|||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | |||
public static bool CheckBytesInAsciiRange(Vector256<sbyte> check, Vector256<sbyte> zero) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why is the zero vector passed in as an argument in these methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that the "zero register" can be reused. Otherwise the JIT won't reuse that register and xors a register again.
Although zeroing (xor) a register has no latency, the cpu-frontend must fetch and decode the instruction, so it still has some cost, that can be avoided with this way.
Note: this works only because this method will be inlined.
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
public static bool CheckBytesInAsciiRange(Vector256<sbyte> check, Vector256<sbyte> zero) | ||
{ | ||
if (Avx2.IsSupported) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not actually possible for this to be called without Avx2.IsSupported being true, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct.
Right now I ask myself why I didn't go with a Debug.Assert
to make this obvious.
There must be some other code pattern that influenced me to write it that way, but I can't remember.
Also why this method is public
and not private
.
I'll update this to the be simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh, I didn't remember my own comment 😉
#17556 (comment)
That's why it's written that way.
Which way to go? With Debug.Assert
or leave as is (plus add a comment explaining this).
public static unsafe string GetHeaderName(this ReadOnlySpan<byte> span) | ||
{ | ||
if (span.IsEmpty) | ||
{ | ||
return string.Empty; | ||
} | ||
|
||
var asciiString = new string('\0', span.Length); | ||
fixed (byte* source = &MemoryMarshal.GetReference(span)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this perform better than fixed (byte* buffer = span)
used previously?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes (it safes one length == 0 check).
GetPinnableReference
needs to check for empty span, this check is already done above, so we are sure that we have a non-empty span, so can skip the check for empty by using MemoryMarshal.GetReference
.
|
||
fixed (char* output = asciiString) | ||
fixed (byte* buffer = span) | ||
private static readonly SpanAction<char, IntPtr> s_getHeaderName = GetHeaderName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we move the static field to the top of the class and start it with _
for consistency? I know the convention is different in the runtime repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done -- will be updated with the next push.
|
||
fixed (char* output = asciiString) | ||
fixed (byte* buffer = span) | ||
private static readonly SpanAction<char, IntPtr> s_getAsciiStringNonNullCharacters = GetAsciiStringNonNullCharacters; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto for s_getHeaderName
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done -- will be updated with the next push.
output += Vector128<sbyte>.Count; | ||
} while (input <= end - Vector128<sbyte>.Count); | ||
|
||
if (input == end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only helps if the decodes string happens to have a multiple of 16 characters, right? Seems unlikely to be worth it on average.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a single check avoids the 4 check chain: <= sizeof(long)
, <= sizeof(int)
, <= sizeof(char)
, <= 1
; though suppose it adds one more to everything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's right. I tend to keep it, as @benaadams explained, and I think it doesn't harm on input-length that are not multiples of 16 chars long.
return false; | ||
} | ||
|
||
if (Bmi2.X64.IsSupported) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GrabYourPitchforks with respect to dotnet/runtime#2251 and your PR dotnet/runtime#31904 should Bmi2
be removed here too?
I don't know what's the current stance on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, BMI2 should be removed if possible. For AMD processors this PR is likely to introduce similar regressions as those we are fixing via dotnet/runtime#2251.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> #18949 to track this.
Hello @halter73! Because this pull request has the p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (
|
Thanks @gfoidl! |
Description
Addresses #8978
Replaced the mutating of strings with
string.Create
. To get cheap invocations of theSpanAction
s they are cached in static readonly fields, as this is quite faster than by caching via the C#-compiler generated code (it safes the null-check amongst some other boilerplate-code).In
TryGetAsciiStringBenchmarks
added specific code-path for AVX2 and SSE2, as the codegen is much tighter than theVector<T>
-method. Further made use ofBmi2.ParallelBitDeposit
to gain some perf in the non-vectorized path.Also changed the algorithm to exit early on non-ascii values, instead of iterating to the end and then returning true/false.
Benchmarks
Code for benchmarks
BytesToStringBenchmark
Before
After
TryGetAsciiStringBenchmarks
ThisPR_ExitEarly
is the method finally used in this PR.Further benchmarks
GetHeaderName
StringCreate
is used in this PR.NewString
is a variant with stackallocing and thennew string(Span<char>)
, as it avoid the delegate-call. But it's slower here.Default_withImprovedStringUtitlites
is the default version (master-branch), but with the improvedTryGetAsciiStringBenchmarks
-method.GetAsciiStringNonNullCharactersBenchmark
StringCreate
is used in this PR.Default_withImprovedStringUtitlites
is the default version (master-branch), but with the improvedTryGetAsciiStringBenchmarks
-method.GetAsciiOrUTF8StringNonNullCharactersBenchmark
StringCreate
is used in this PR.Unfortunately BenchmarkDotNet trims the TestData-string so that the cases aren't clearly seen.
But each "set" of lenght has
ü
inserted at endü
inserted at index 0