-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Fast-path for ASCII & UTF8 Encoding ASCII data #8969
Conversation
|
|
||
int lengthEncoded; | ||
if ((encoder?.InternalHasFallbackBuffer ?? false) && | ||
(encoder.FallbackBuffer.InternalGetNextChar()) != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the correct test to check if the ASCII and UTF8 EncoderNLS encoder has a char in their buffer from a previous call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc: @shiftylogic, @ahsonkhan |
@dotnet-bot retest Linux ARM Emulator Cross Debug Build |
The non-allocating method
|
Worse-case regression from complete fallback (though the perf test does double validation for Fastpath and coreclr version doesn't)
Isn't too significant? (as ascii is a very common path for high thoughput data e.g. json, html, xml, web apis; data based files etc) |
Will take a look today. |
@@ -12,6 +12,10 @@ namespace System.Text | |||
|
|||
internal static class EncodingForwarder | |||
{ | |||
#if !BIGENDIAN | |||
const int Shift16Shift24 = 256 * 256 * 256 + 256 * 256; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: (1 << 16) | (1 << 24)
seems less magic, less error prone and more obvious when looking at both name and value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed, also moved const into function
@@ -12,6 +12,10 @@ namespace System.Text | |||
|
|||
internal static class EncodingForwarder | |||
{ | |||
#if !BIGENDIAN | |||
const int Shift16Shift24 = 256 * 256 * 256 + 256 * 256; | |||
const int Shift8Identity = 256 + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed, also moved const into function
public override byte[] GetBytes(String s) | ||
=> EncodingForwarder.GetBytesAsciiFastPath(this, s); | ||
|
||
public override int GetBytes(String s, int charIndex, int charCount, byte[] bytes, int byteIndex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String s [](start = 37, length = 8)
please keep the original paramter names . like in this case it should be "String chars"
@@ -281,7 +302,7 @@ internal override unsafe int GetByteCount(char* chars, int charCount, EncoderNLS | |||
return byteCount; | |||
} | |||
|
|||
internal override unsafe int GetBytes(char* chars, int charCount, | |||
internal override unsafe int GetBytesFallback(char* chars, int charCount, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this change will cause the code compiled with BIGENDIAN to have stackoverflow (recursive calls).
public override int GetBytes(String chars, int charIndex, int charCount,byte[] bytes, int byteIndex) calls
EncodingForwarder.GetBytes(this, chars, charIndex, charCount, bytes, byteIndex); which will call
encoding.GetBytes(pChars + charIndex, charCount, pBytes + byteIndex, byteCount, encoder: null);
because you have renamed it here, the base encoding (Encoding class) version will get called which is
Encoding.GetBytes(char* chars, int charCount, byte* bytes, int byteCount, EncoderNLS encoder) which calls
ASCIIEncoding.GetBytes(char* chars, int charCount, byte* bytes, int byteCount) which calls
EncodingForwarder.GetBytes(this, chars, charIndex, charCount, bytes, byteIndex);
and we'll loop forever
byte[] bytes, int byteIndex) | ||
{ | ||
return EncodingForwarder.GetBytes(this, chars, charIndex, charCount, bytes, byteIndex); | ||
#if !BIGENDIAN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is acceptable for now to have the ifdef as we have other code placed under same ifdef too.
@jkotas what you think of changing this to runtime check instead of compile time check? I mean we can detect if we are running on BIGENDIA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CoreLib is build architecture specific for number of different reasons. If you would like the code to be more pretty (and create more work for JIT to get rid of unreachable code), you can do the check dynamically based on a static property and have the property return true/false constant under ifdef.
@@ -580,7 +601,7 @@ private static bool InRange(int ch, int start, int end) | |||
|
|||
// Our workhorse | |||
// Note: We ignore mismatched surrogates, unless the exception flag is set in which case we throw | |||
internal override unsafe int GetBytes(char* chars, int charCount, | |||
internal override unsafe int GetBytesFallback(char* chars, int charCount, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same previous comment
it would be better to move these lines to its own helper method and have both overloads of GetByteCount to call this helper instead of duplicating the code. the reason is if we need to do more changes in the future we'll have one place to change Refers to: src/mscorlib/src/System/Text/EncoderNLS.cs:118 in 1b8a70c. [](commit_id = 1b8a70c, deletion_comment = False) |
byte[] bytes; | ||
fixed (char* pChar = s) { | ||
int byteCount = GetByteCount(pChar + index, count); | ||
if (byteCount == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (byteCount == 0) { [](start = 15, length = 22)
please try to use the coding style
if ()
{
}
else
{
}
I know the old code was not doing that but any changed code we should use the right style
{ | ||
int byteCount = GetByteCount(pChar + index, count); | ||
if (byteCount == 0) | ||
return Array.Empty<byte>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why not keeping this logic? I mean just return when byteCount is zero. that will make the code readability better (I guess)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoiding an inline return in fixed block.
However I was wondering about this; if count > 0
could byteCount
ever be zero? If not the test and return could be done before either the fixed
or call to GetByteCount
. Was assuming an edge case I hadn't considered...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe BOM, or encoder always outputting something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is wrong if you return from the fixed block?
for returning 0 byte count, I believe this can happen if we have custom encoder who can fallback and will have a state after finishing.
In reply to: 97555441 [](ancestors = 97555441)
int charCount = s.Length; | ||
|
||
byte[] bytes; | ||
if (charCount > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use the right coding style. thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that my comment apply in different places so I'll not repeat the comment again assuming you'll fix all touched places.
In reply to: 97451474 [](ancestors = 97451474)
char* input = pInput + charIndex; | ||
byte* output = pOutput + byteIndex; | ||
lengthEncoded = GetBytesAsciiFastPath(input, output, Math.Min(charCount, byteCount)); | ||
if (lengthEncoded < byteCount) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we check if lengthEncoded is >= charCount and then exit? your check here can cause calling encoding.GetBytesFallback while is not needed (like in cases having pure ascii string and calling with big value of byteCount)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be testing against the Min
d length in all cases? e.g. Math.Min(charCount, byteCount)
will add var and change
ThrowBytesOverflow(encoding); | ||
|
||
int lengthEncoded; | ||
if (charCount > 0 && byteCount > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (charCount > 0 && byteCount > 0) { [](start = 1, length = 48)
same as previous comment
} | ||
else if (charCount > 0 && byteCount > 0) { | ||
lengthEncoded = GetBytesAsciiFastPath(chars, bytes, Math.Min(charCount, byteCount)); | ||
if (lengthEncoded < charCount) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto. I'll not comment again on the same issue in the rest of the changes
#if BIT64 | ||
if (byteCount < 4) goto trailing; | ||
|
||
int unaligned = (unchecked(-(int)input) >> 1) & 0x3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int unaligned = (unchecked(-(int)input) >> 1) & 0x3; [](start = 11, length = 53)
would it be clearer if we write this as
int unaligned = (((ulong)input) & 0x7) >> 1;
@benaadams thanks a lot for your effort and getting this done. LGTM |
Just want to factor out a couple throw blocks |
you mean you are going to push one more commit? I'll wait for that before merging then. |
@tarekgh added, also added for other function that had all the vars to check |
@benaadams thanks again. |
* ASCII Encoding fast-path * Add skipp for BIGENDIAN * fixes * ascii GetBytes(char[] chars) fix * feedback * Clean up * Reuse exception block * Add debug Asserts Commit migrated from dotnet/coreclr@5c20488
Up to x4 faster on
Encoding.GetBytes(String s)
for UTF8 and ASCII when string is ascii.Have
#if !BIGENDIAN
the changes as don't have have a good way of confirming the pattern for little endian.Call chains changed mostly to Virtual call->validate->concrete call; previously it would do virtual call chains with validations on each call.
The string overload has a wrinkle. It will optimistically take
string.length
as the length of they output bytes; if it encounters a non-ascii char it will then count from that point forward, reallocate the array and copy the bytes to the correct sized array; then pass over to the current implementation to complete - so you maybe get doublebyte[]
allocations. The array versions don't exhibit this behaviour.Following up on https://github.com/dotnet/coreclr/issues/6759 /cc @jamesqo