API proposal: UTF-8 web encoders #28509

@GrabYourPitchforks

Description

(This is the follow-up to https://github.com/dotnet/corefx/issues/33509, where the review was marked as "needs work".)

Some customers (notably, the JSON libraries) need to perform UTF-8 escaping of strings. The existing UTF-16 encoding APIs are insufficient for their needs. This proposal is for a sister set of APIs that operate on UTF-8 data.
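
To make the gap concrete, here is a minimal sketch of what a caller holding UTF-8 bytes has to do when only the UTF-16 API is available: transcode to a string, escape, then transcode back. `Utf16RoundTrip` and `EscapeViaUtf16` are hypothetical names used for illustration only; the calls they make (`Encoding.UTF8` and `JavaScriptEncoder.Default.Encode(string)`) are existing APIs.

```csharp
using System;
using System.Text;
using System.Text.Encodings.Web;

public static class Utf16RoundTrip
{
    // Hypothetical helper: escapes UTF-8 input by detouring through UTF-16.
    public static byte[] EscapeViaUtf16(ReadOnlySpan<byte> utf8Input)
    {
        // Transcode UTF-8 -> UTF-16 (allocates a string).
        string utf16 = Encoding.UTF8.GetString(utf8Input);

        // Escape using the existing UTF-16 encoder (may allocate a second string).
        string escaped = JavaScriptEncoder.Default.Encode(utf16);

        // Transcode UTF-16 -> UTF-8 (allocates the result array).
        return Encoding.UTF8.GetBytes(escaped);
    }
}
```

Each call costs up to three allocations and two transcoding passes, which is the overhead a UTF-8-native encoder would avoid.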

Proposed API

// Proposed NEW types in existing namespace

namespace System.Text.Encodings.Web
{
   public abstract class Utf8TextEncoder
   {    
      /*
       * ABSTRACT METHODS
       * Any subclassed type must override at minimum these two methods. All other methods
       * are built on top of these two.
       */

      // Returns the number of elements written to the destination buffer,
      // or -1 if the destination buffer is too small to contain the encoding of this value.
      public abstract int EncodeSingleRune(Rune value, Span<Utf8Char> destination);

      // Returns true if this scalar value must be encoded before being written to the destination
      // buffer; false if it can be copied to the destination buffer as-is.
      public abstract bool RuneMustBeEncoded(Rune value);

      /*
       * VIRTUAL METHODS
       * The default implementations of these methods will work but may not be very optimized.
       * Subclassed types may wish to override them to provide more optimized behavior.
       */

      // Typical OperationStatus-based API.
      public virtual OperationStatus Encode(ReadOnlySpan<Utf8Char> source, Span<Utf8Char> destination, out int countConsumed, out int countWritten, bool isFinalChunk = true);

      // Takes an IBufferWriter instead of a Span as a sink. Can never return DestinationTooSmall
      // since buffer writers should always grow. Note that the "count written" parameter is outed
      // as an int64 instead of an int32 so that we can avoid integer overflow. Also note that we're
      // using IBufferWriter<byte> instead of IBufferWriter<Utf8Char>, as the typical use case is
      // for buffer writers to wrap i/o pipes.
      public virtual OperationStatus Encode(ReadOnlySpan<Utf8Char> source, IBufferWriter<byte> utf8Destination, out int countConsumed, out long countWritten, bool isFinalChunk = true);

      // Convenience API for working with UTF-16 source data.
      // Transcoding and escaping are performed as a single step when writing to the destination.
      public virtual OperationStatus Encode(ReadOnlySpan<char> source, Span<Utf8Char> destination, out int countConsumed, out int countWritten, bool isFinalChunk = true);
      public virtual OperationStatus Encode(ReadOnlySpan<char> source, IBufferWriter<byte> utf8Destination, out int countConsumed, out long countWritten, bool isFinalChunk = true);

      // When Utf8String comes online, this virtual method will be added.
      public virtual Utf8String Encode(Utf8String value);

      // Returns the index in the span of the first element that must be escaped, or -1 if no
      // elements require escaping. This is generally used as an optimization by callers who
      // may wish to perform their own bulk memcpy of data that doesn't require escaping.
      // (Or, if it returns -1, they may opt to skip the memcpy entirely.)
      public virtual int GetIndexOfFirstElementToEncode(ReadOnlySpan<Utf8Char> span);
   }
}
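
As a rough illustration of how the span-based `Encode` could be layered on the two abstract methods, here is a minimal sketch. It is not the proposed implementation: `SketchUtf8Encoder` and `AngleBracketEncoder` are hypothetical stand-ins, `byte` is used in place of the proposed `Utf8Char`, and ill-formed input is simply reported as `InvalidData` (the proposal does not say how it should be handled).

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical stand-in for the proposed Utf8TextEncoder (byte replaces Utf8Char).
public abstract class SketchUtf8Encoder
{
    public abstract int EncodeSingleRune(Rune value, Span<byte> destination);
    public abstract bool RuneMustBeEncoded(Rune value);

    // Unoptimized default: decode one scalar at a time, then copy or escape it.
    public virtual OperationStatus Encode(ReadOnlySpan<byte> source, Span<byte> destination,
        out int countConsumed, out int countWritten, bool isFinalChunk = true)
    {
        countConsumed = 0;
        countWritten = 0;
        while (countConsumed < source.Length)
        {
            OperationStatus status = Rune.DecodeFromUtf8(
                source.Slice(countConsumed), out Rune rune, out int runeLength);
            if (status == OperationStatus.NeedMoreData && !isFinalChunk)
                return OperationStatus.NeedMoreData; // caller supplies the rest later
            if (status != OperationStatus.Done)
                return OperationStatus.InvalidData;  // assumption: reject ill-formed input

            Span<byte> remaining = destination.Slice(countWritten);
            int written;
            if (RuneMustBeEncoded(rune))
            {
                written = EncodeSingleRune(rune, remaining);
                if (written < 0) return OperationStatus.DestinationTooSmall;
            }
            else if (!rune.TryEncodeToUtf8(remaining, out written))
            {
                return OperationStatus.DestinationTooSmall;
            }

            countConsumed += runeLength;
            countWritten += written;
        }
        return OperationStatus.Done;
    }
}

// Hypothetical subclass: escapes only '<' and '>'.
public sealed class AngleBracketEncoder : SketchUtf8Encoder
{
    public override bool RuneMustBeEncoded(Rune value) =>
        value.Value == '<' || value.Value == '>';

    public override int EncodeSingleRune(Rune value, Span<byte> destination)
    {
        byte[] escaped = Encoding.ASCII.GetBytes(value.Value == '<' ? "&lt;" : "&gt;");
        if (escaped.Length > destination.Length) return -1; // buffer too small
        escaped.AsSpan().CopyTo(destination);
        return escaped.Length;
    }
}
```

Note that the subclass needs neither unsafe code nor pointer arithmetic, and that the counts reported on `DestinationTooSmall` only cover whole scalars, so the caller can resume from `countConsumed` with a larger buffer.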

And the subclassed types and the factories to create them:

// Proposed NEW types in existing namespace

namespace System.Text.Encodings.Web
{
   public abstract class Utf8HtmlEncoder : Utf8TextEncoder
   {
      // These methods mimic the static factories on the existing UTF-16 "HtmlEncoder" type.

      public static Utf8HtmlEncoder Default { get; }
      public static Utf8HtmlEncoder Create(TextEncoderSettings settings);
      public static Utf8HtmlEncoder Create(params UnicodeRange[] allowedRanges);
   }

   public abstract class Utf8JavaScriptEncoder : Utf8TextEncoder
   {
      // These methods mimic the static factories on the existing UTF-16 "JavaScriptEncoder" type.

      public static Utf8JavaScriptEncoder Default { get; }
      public static Utf8JavaScriptEncoder Create(TextEncoderSettings settings);
      public static Utf8JavaScriptEncoder Create(params UnicodeRange[] allowedRanges);
   }

   public abstract class Utf8UrlEncoder : Utf8TextEncoder
   {
      // These methods mimic the static factories on the existing UTF-16 "UrlEncoder" type.

      public static Utf8UrlEncoder Default { get; }
      public static Utf8UrlEncoder Create(TextEncoderSettings settings);
      public static Utf8UrlEncoder Create(params UnicodeRange[] allowedRanges);
   }
}
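
For reference, this is the existing UTF-16 factory pattern that the proposed `Default` and `Create` members would mirror. The sketch below uses only the current `HtmlEncoder`, `TextEncoderSettings`, and `UnicodeRanges` APIs; `FactoryPatternDemo` and its method names are hypothetical and exist only for this example.

```csharp
using System;
using System.Text.Encodings.Web;
using System.Text.Unicode;

public static class FactoryPatternDemo
{
    // The default encoder allows only a safe subset of Basic Latin.
    public static string EncodeWithDefault(string value) =>
        HtmlEncoder.Default.Encode(value);

    // Create(params UnicodeRange[]) widens the set of characters passed through unescaped.
    public static string EncodeAllowingCyrillic(string value)
    {
        HtmlEncoder encoder = HtmlEncoder.Create(UnicodeRanges.BasicLatin, UnicodeRanges.Cyrillic);
        return encoder.Encode(value);
    }

    // Create(TextEncoderSettings) expresses the same thing through a settings object.
    public static string EncodeWithSettings(string value)
    {
        var settings = new TextEncoderSettings(UnicodeRanges.BasicLatin);
        settings.AllowRange(UnicodeRanges.Cyrillic);
        return HtmlEncoder.Create(settings).Encode(value);
    }
}
```

Under the first proposal, a caller working with UTF-8 data would presumably swap `HtmlEncoder` for `Utf8HtmlEncoder` and keep the same settings and ranges.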

Alternative API proposal

If we instead want to enlighten the existing encoding APIs to understand UTF-8 data in addition to UTF-16 data, the modifications to the existing types might look like the following.

namespace System.Text.Encodings.Web
{
   // MODIFICATIONS to existing class

   public abstract class TextEncoder
   {
      // MODIFICATION: existing abstract property becomes virtual
      [EditorBrowsable(EditorBrowsableState.Never)] // EXISTING attribute
      [Obsolete("This value is a lie. Don't depend on it.")] // NEW attribute
      public virtual int MaxOutputCharactersPerInputCharacter
      {
         get
         {
            // default implementation
            return 0;
         }
      }

      // MODIFICATION: existing abstract method becomes virtual
      [EditorBrowsable(EditorBrowsableState.Never)] // EXISTING attribute
      public virtual bool WillEncode(int unicodeScalar)
      {
         // default implementation
         return (!Rune.TryCreate(unicodeScalar, out Rune value) || RuneMustBeEncoded(value));
      }

      // MODIFICATION: existing abstract method becomes virtual
      [CLSCompliant(false)] // EXISTING attribute
      [EditorBrowsable(EditorBrowsableState.Never)] // EXISTING attribute
      public unsafe virtual int FindFirstCharacterToEncode(char* text, int textLength)
      {
         // default implementation
         return GetIndexOfFirstElementToEncode(new ReadOnlySpan<char>(text, textLength));
      }

      // MODIFICATION: existing abstract method becomes virtual
      [CLSCompliant(false)] // EXISTING attribute
      [EditorBrowsable(EditorBrowsableState.Never)] // EXISTING attribute
      public unsafe virtual bool TryEncodeUnicodeScalar(int unicodeScalar, char* buffer, int bufferLength, out int numberOfCharactersWritten)
      {
         // default implementation
         if (Rune.TryCreate(unicodeScalar, out Rune rune))
         {
            int charsWritten = EncodeSingleRune(rune, new Span<char>(buffer, bufferLength));
            if (charsWritten > 0)
            {
               numberOfCharactersWritten = charsWritten;
               return true;
            }
         }
         numberOfCharactersWritten = 0;
         return false;
      }

      /*
       * NEW virtual methods that serve as the two "abstract" methods.
       */

      // Default implementation attempts to call WillEncode(int) if it has been overridden.
      // If neither this nor WillEncode(int) has been overridden, the default implementation
      // throws an exception saying "you must override me."
      public virtual bool RuneMustBeEncoded(Rune value);

      // Default implementation attempts to call TryEncodeUnicodeScalar(...) if it has been
      // overridden. If neither this nor TryEncodeUnicodeScalar(...) has been overridden, the
      // default implementation throws an exception saying "you must override me."
      //
      // Returns the number of chars written to the buffer, or -1 if the buffer is too small
      // to contain the escaped scalar value.
      public virtual int EncodeSingleRune(Rune value, Span<char> buffer);

      /*
       * NEW virtual method that serves as a UTF-16 workhorse method. The derived type
       * is _not required_ to override it since we can provide an appropriate base implementation,
       * but best performance will be achieved when this method is overridden.
       */

      public virtual int GetIndexOfFirstElementToEncode(ReadOnlySpan<char> span);

      /*
       * EXISTING APIs dealing with UTF-16 output - no modifications to signatures.
       */

      public virtual string Encode(string value);
      public virtual void Encode(TextWriter output, char[] value, int startIndex, int characterCount);
      public void Encode(TextWriter output, string value);
      public virtual void Encode(TextWriter output, string value, int startIndex, int characterCount);

      /*
       * NEW APIs dealing with UTF-8 output - all are virtual but do not require overriding.
       * The default implementations of all UTF-8 APIs delegate to the UTF-16 implementations.
       * This means that the behavior will still be correct if these virtuals are not overridden, but
       * performance will tank.
       */
       
      // UTF-8 equivalent of EncodeSingleRune(Rune, Span<char>).
      public virtual int EncodeSingleRune(Rune value, Span<Utf8Char> buffer);

      // UTF-8 equivalent of GetIndexOfFirstElementToEncode(ReadOnlySpan<char>).
      public virtual int GetIndexOfFirstElementToEncode(ReadOnlySpan<Utf8Char> span);

      // Typical OperationStatus-based API.
      public virtual OperationStatus Encode(ReadOnlySpan<Utf8Char> source, Span<Utf8Char> destination, out int countConsumed, out int countWritten, bool isFinalChunk = true);

      // Takes an IBufferWriter instead of a Span as a sink. Can never return DestinationTooSmall
      // since buffer writers should always grow. Note that the "count written" parameter is outed
      // as an int64 instead of an int32 so that we can avoid integer overflow. Also note that we're
      // using IBufferWriter<byte> instead of IBufferWriter<Utf8Char>, as the typical use case is
      // for buffer writers to wrap i/o pipes.
      public virtual OperationStatus Encode(ReadOnlySpan<Utf8Char> source, IBufferWriter<byte> utf8Destination, out int countConsumed, out long countWritten, bool isFinalChunk = true);

      // Convenience API for working with UTF-16 source data.
      // Transcoding and escaping are performed as a single step when writing to the destination.
      public virtual OperationStatus Encode(ReadOnlySpan<char> source, Span<Utf8Char> destination, out int countConsumed, out int countWritten, bool isFinalChunk = true);
      public virtual OperationStatus Encode(ReadOnlySpan<char> source, IBufferWriter<byte> utf8Destination, out int countConsumed, out long countWritten, bool isFinalChunk = true);

      // When Utf8String comes online, this virtual method will be added.
      public virtual Utf8String Encode(Utf8String value);
   }
}
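
As an illustration of the "appropriate base implementation" mentioned for the UTF-16 `GetIndexOfFirstElementToEncode`, here is a minimal, unoptimized sketch. `FirstElementSketch` is a hypothetical name, and the `runeMustBeEncoded` delegate stands in for the virtual `RuneMustBeEncoded(Rune)` call so that the example is self-contained. Treating ill-formed UTF-16 as "must be encoded" is an assumption, not something the proposal states.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class FirstElementSketch
{
    // Walks the span one scalar value at a time and returns the index of the
    // first char that needs escaping, or -1 if the whole span can be copied as-is.
    public static int GetIndexOfFirstElementToEncode(
        ReadOnlySpan<char> span, Func<Rune, bool> runeMustBeEncoded)
    {
        int index = 0;
        while (index < span.Length)
        {
            OperationStatus status = Rune.DecodeFromUtf16(
                span.Slice(index), out Rune rune, out int charsConsumed);

            // Assumption: a lone surrogate can never be copied verbatim.
            if (status != OperationStatus.Done || runeMustBeEncoded(rune))
                return index;

            index += charsConsumed; // 1 for BMP scalars, 2 for surrogate pairs
        }
        return -1;
    }
}
```

A caller can bulk-copy everything before the returned index and hand only the remainder to the slower escaping path. An overriding type would typically replace the per-scalar loop with a table lookup or vectorized scan.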

Discussion

The first proposal uses newer data structures and concepts that weren't available when we designed the original types, so we can eliminate the warts from the original types. It can also be fully subclassed without requiring the developer to enable advanced IntelliSense features or to drop down to unsafe code blocks.

The biggest problem with the first proposal is that it leads to type explosion throughout the namespace. Since there are two copies of each encoder, consumers would need to be trained on when it's appropriate to use each type. Dependency injection systems (such as the one used in ASP.NET) would need to take both as inputs since they don't know what the application will need at runtime.

The second proposal attempts to avoid type explosion by modifying the existing types. On one hand this is an improvement on https://github.com/dotnet/corefx/issues/33509 in that this is not a source or binary breaking change. However, it leads to significant complexity both for developers who subclass this (how does the dev know which methods are mandatory to override if everything is virtual instead of abstract?) and for the implementation (how do we quickly and reliably tell if the current instance has overridden a specific method?).
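
To show where that implementation complexity comes from, here is one possible (and deliberately naive) way to detect overrides: a reflection check performed once in the base constructor. Everything here is hypothetical, including the `SketchTextEncoder` type and the choice of exception; it is a sketch of the problem, not a proposed solution.

```csharp
using System;
using System.Reflection;
using System.Text;

// Hypothetical stand-in for the modified TextEncoder.
public abstract class SketchTextEncoder
{
    private readonly bool _willEncodeOverridden;

    protected SketchTextEncoder()
    {
        // Resolve the most-derived WillEncode(int) and see who declares it.
        MethodInfo method = GetType().GetMethod(nameof(WillEncode), new[] { typeof(int) });
        _willEncodeOverridden = method != null && method.DeclaringType != typeof(SketchTextEncoder);
    }

    // Legacy entry point: the default forwards to the new method.
    public virtual bool WillEncode(int unicodeScalar) =>
        !Rune.TryCreate(unicodeScalar, out Rune value) || RuneMustBeEncoded(value);

    // New entry point: the default forwards to the legacy method only if a
    // subclass overrode it; otherwise the two defaults would call each other forever.
    public virtual bool RuneMustBeEncoded(Rune value)
    {
        if (!_willEncodeOverridden)
            throw new InvalidOperationException("Override RuneMustBeEncoded or WillEncode.");
        return WillEncode(value.Value);
    }
}

// Existing-style subclass: overrides only the legacy method.
public sealed class LegacyEncoder : SketchTextEncoder
{
    public override bool WillEncode(int unicodeScalar) => unicodeScalar == '<';
}

// New-style subclass: overrides only the Rune-based method.
public sealed class ModernEncoder : SketchTextEncoder
{
    public override bool RuneMustBeEncoded(Rune value) => value.Value == '<';
}

// Subclass that overrides nothing: compiles, but fails on first use.
public sealed class EmptyEncoder : SketchTextEncoder
{
}
```

`EmptyEncoder` compiles cleanly and only fails at run time, which is exactly the discoverability problem described above. The reflection lookup also adds per-instance construction cost unless the result is cached per type.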
