-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Description
The classes HtmlEncoder
, JavaScriptEncoder
, and UrlEncoder
are intended to be the One True Way™ of performing HTML-encoding, JavaScript/JSON string-escaping, and URL-encoding from untrusted input framework-wide. Their primary consumer is ASP.NET, but they're also intended to replace the methods on types like WebUtility
which don't always generate standards-compliant output and which are disallowed per SDL.
I believe there are two primary problems with the current shape of these APIs. First, they all subclass a shared TextEncoder
type (see apisof.net), which doesn't quite make sense given the scenario. These methods aren't intended to act as generic sinks like our Stream
, TextWriter
, and other abstractions. They're not interchangeable with one another, even though they all share a common shape. These are transforms. It's up to the consuming component to know which specific transform is applicable to the scenario at hand and to flow the class which represents that transform through its internal logic.
Second, they expose char*
-consuming abstract methods, which necessitates that any component which wants to call the more powerful versions of these methods drop down to unsafe code. Even worse, since these methods are abstract, we require that anybody who wants to subclass these types drop down to unsafe code as well. This can be made safer and more accessible by using modern types like Span
.
Here is my proposal for the new API shape for the HtmlEncoder
, JavaScriptEncoder
, and UrlEncoder
classes.
namespace System.Text.Encodings.Web
{
// Also the same shape for JavaScriptEncoder and UrlEncoder
public abstract class HtmlEncoder
{
/* factories */
public static HtmlEncoder Default { get; }
public static HtmlEncoder Create(TextEncoderSettings settings);
public static HtmlEncoder Create(params UnicodeRange[] allowedRanges);
/* protected ctor */
protected HtmlEncoder();
/* abstract methods - must override */
public abstract int MaxOutputCharactersPerInputCharacter { get; } // referring to UTF-16 "System.Char"
public abstract bool RuneMustBeEncoded(Rune rune);
public abstract bool TryEncodeRune(Rune rune, Span<char> output, out int charsWritten);
/* virtual methods - *may* override as optimization, but not required */
public virtual string Encode(string value);
public virtual void Encode(TextWriter output, char[] value, int startIndex, int characterCount);
public virtual void Encode(TextWriter output, string value, int startIndex, int characterCount);
public virtual void Encode(TextWriter output, ReadOnlySpan<char> value);
public virtual void Encode(IBufferWriter<char> output, ReadOnlySpan<char> value);
public virtual OperationStatus Encode(ReadOnlySpan<char> input, Span<char> output, bool isFinalChunk, out int numCharsConsumed, out int numCharsWritten);
public virtual int GetIndexOfFirstCharacterToEncode(ReadOnlySpan<char> input);
/* non-virtual methods */
public void Encode(TextWriter output, string value);
}
}
The existing TextEncoderSettings
and UnicodeRange
types remain unchanged.
This addresses both of the concerns I raised above and is generally source-compatible with existing callers. (It may not be binary-compatible due to removal of the base type.) Data from apisof.net shows that the number of callers of the methods which have been removed is insignificant, less than 0.1% of analyzed code. This refactoring also puts us in a position going forward to get UTF-8 support for these APIs.
Open issue: the property MaxOutputCharactersPerInputCharacter
is written in terms of UTF-16 System.Char
instances, but perhaps it's better off as MaxOutputCharactersPerRune
(maximum UTF-16 System.Char
elements output per input Rune
). This may trade off some internal efficiency in order to make life easier for callers, but the internal implementations may be able to make up for the loss in efficiency through some other mechanism.