Modernize API shape of HtmlEncoder and related classes

The classes `HtmlEncoder`, `JavaScriptEncoder`, and `UrlEncoder` are intended to be the One True Way™ of performing HTML-encoding, JavaScript/JSON string-escaping, and URL-encoding from untrusted input framework-wide. Their primary consumer is ASP.NET, but they're also intended to replace the methods on types like `WebUtility` which don't always generate standards-compliant output and which are disallowed per SDL.

I believe there are two primary problems with the current shape of these APIs. First, they all subclass a shared `TextEncoder` type ([see apisof.net](https://apisof.net/catalog/System.Text.Encodings.Web.TextEncoder)), which doesn't quite make sense given the scenario. These methods aren't intended to act as generic sinks like our `Stream`, `TextWriter`, and other abstractions. They're not interchangeable with one another, even though they all share a common shape. These are transforms. It's up to the consuming component to know which specific transform is applicable to the scenario at hand and to flow the class which represents that transform through its internal logic.

Second, they expose `char*`-consuming abstract methods, which necessitates that any component which wants to call the more powerful versions of these methods drop down to unsafe code. Even worse, since these methods are _abstract_, we require that anybody who wants to subclass these types drop down to unsafe code as well. This can be made safer and more accessible by using modern types like `Span`.

Here is my proposal for the new API shape for the `HtmlEncoder`, `JavaScriptEncoder`, and `UrlEncoder` classes.

```cs
namespace System.Text.Encodings.Web
{
    // Also the same shape for JavaScriptEncoder and UrlEncoder
    public abstract class HtmlEncoder
    {
        /* factories */
        public static HtmlEncoder Default { get; }
        public static HtmlEncoder Create(TextEncoderSettings settings);
        public static HtmlEncoder Create(params UnicodeRange[] allowedRanges);

        /* protected ctor */
        protected HtmlEncoder();
        
        /* abstract methods - must override */
        public abstract int MaxOutputCharactersPerInputCharacter { get; } // referring to UTF-16 "System.Char"
        public abstract bool RuneMustBeEncoded(Rune rune);
        public abstract bool TryEncodeRune(Rune rune, Span<char> output, out int charsWritten);
        
        /* virtual methods - *may* override as optimization, but not required */
        public virtual string Encode(string value);
        public virtual void Encode(TextWriter output, char[] value, int startIndex, int characterCount);
        public virtual void Encode(TextWriter output, string value, int startIndex, int characterCount);
        public virtual void Encode(TextWriter output, ReadOnlySpan<char> value);
        public virtual void Encode(IBufferWriter<char> output, ReadOnlySpan<char> value);
        public virtual OperationStatus Encode(ReadOnlySpan<char> input, Span<char> output, bool isFinalChunk, out int numCharsConsumed, out int numCharsWritten);
        public virtual int GetIndexOfFirstCharacterToEncode(ReadOnlySpan<char> input);
        
        /* non-virtual methods */
        public void Encode(TextWriter output, string value);
    }
}
```

The existing `TextEncoderSettings` and `UnicodeRange` types remain unchanged.

This addresses both of the concerns I raised above and is generally source-compatible with existing callers. (It may not be binary-compatible due to removal of the base type.) Data from apisof.net shows that the number of callers of the methods which have been removed is insignificant, less than 0.1% of analyzed code. This refactoring also puts us in a position going forward to get UTF-8 support for these APIs.

_Open issue:_ the property `MaxOutputCharactersPerInputCharacter` is written in terms of UTF-16 `System.Char` instances, but perhaps it's better off as `MaxOutputCharactersPerRune` (maximum UTF-16 `System.Char` elements output per input `Rune`). This may trade off some internal efficiency in order to make life easier for callers, but the internal implementations may be able to make up for the loss in efficiency through some other mechanism.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Modernize API shape of HtmlEncoder and related classes #27919

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Modernize API shape of HtmlEncoder and related classes #27919

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions