Make a distinction between Unicode whitespace and regular whitespace #107

TimothyGu · 2016-08-05T08:25:15Z

The spec makes a distinction between "whitespace" and "Unicode whitespace": the latter includes many additional whitespace characters, most notably the non-breaking space (U+00A0), whereas the former does not.

Per ECMA-262 6th Edition ("ECMAScript 2015") §21.2.2.12 CharacterClassEscape, the JavaScript \s escape character matches the characters specified by "Unicode whitespace," but not "whitespace."
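The behavior of `\s` can be checked directly. The snippet below is only an illustrative probe, not code from this library; the sample characters and the `samples`/`results` names are chosen for the example.

```javascript
// Probe which characters the JavaScript \s escape matches.
// The labels are for display only.
const samples = [
  ['SPACE (U+0020)', '\u0020'],
  ['NO-BREAK SPACE (U+00A0)', '\u00A0'],
  ['EM SPACE (U+2003)', '\u2003'],
  ['IDEOGRAPHIC SPACE (U+3000)', '\u3000'],
  ['LATIN SMALL LETTER A (U+0061)', 'a'],
];

// Each entry becomes [label, whether \s matched the character].
const results = samples.map(([label, ch]) => [label, /\s/.test(ch)]);

for (const [label, matched] of results) {
  console.log(label + ': ' + matched);
}
```

Every sample except the letter `a` is reported as matched, which is why `\s` is too broad wherever the spec asks for plain "whitespace."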

To fix this issue, rename the existing regular expression variable to UnicodeWhitespace, and create and use a new regular expression variable that only matches the limited set of "whitespace" characters.
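A minimal sketch of the split described above, assuming the spec 0.26 definition of a whitespace character (space, tab, newline, line tabulation, form feed, carriage return). The variable and function names here are hypothetical and may not match what the patch actually uses. Note also that `\s` is slightly broader than the spec's "Unicode whitespace" (it also matches U+FEFF and the line/paragraph separators), so it serves as an approximation.

```javascript
// Hypothetical names, for illustration only.

// Approximates the spec's "Unicode whitespace character" via \s.
const reUnicodeWhitespaceChar = /^\s/;

// The limited "whitespace character" set: space, tab, newline,
// line tabulation (U+000B), form feed, carriage return.
const reWhitespaceChar = /^[ \t\n\x0B\f\r]/;

function isUnicodeWhitespaceChar(ch) {
  return reUnicodeWhitespaceChar.test(ch);
}

function isWhitespaceChar(ch) {
  return reWhitespaceChar.test(ch);
}

// The non-breaking space is Unicode whitespace but not plain whitespace.
console.log(isUnicodeWhitespaceChar('\u00A0'), isWhitespaceChar('\u00A0'));
```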

The test suite does not yet cover this distinction. I will add a corresponding test case when this pull request has been accepted.

For additional information, the distinction in the spec was challenged and reaffirmed by commonmark/commonmark-spec#343.

[whitespace]: http://spec.commonmark.org/0.26/#whitespace-character
[Unicode whitespace]: http://spec.commonmark.org/0.26/#unicode-whitespace-character
[CharacterClassEscape]: http://www.ecma-international.org/ecma-262/6.0/#sec-characterclassescape

TimothyGu · 2016-08-05T08:40:55Z

For the sake of completeness, I did an audit on the usage of whitespace characters in this library.

parseBackticks: Should use whitespace for code spans:

> The contents of the code span are the characters between the two backtick strings, with leading and trailing spaces and line endings removed, and whitespace collapsed to single spaces.

scanDelims: Should use Unicode whitespace for delimiter runs:

> A left-flanking delimiter run is a delimiter run that is (a) not followed by Unicode whitespace, ...
>
> A right-flanking delimiter run is a delimiter run that is (a) not preceded by Unicode whitespace, ...

parseCloseBracket: Should use whitespace for inline link:

> An inline link consists of a link text followed immediately by a left parenthesis `(`, optional whitespace, an optional link destination, an optional link title separated from the link destination by whitespace, optional whitespace, and a right parenthesis `)`.

This pull request addresses all these instances.
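To make the audit concrete, here is a rough sketch of how two of the cases differ in practice. It is not the library's code: `normalizeCodeSpan` and `followedByUnicodeWhitespace` are hypothetical helpers written for this example, and the second one covers only clause (a) of the left-flanking rule, treating the end of the line as whitespace as the spec does.

```javascript
// Hypothetical helpers, for illustration only.

// Runs of plain "whitespace" characters (no NBSP or other Unicode spaces).
const reWhitespaceRun = /[ \t\n\x0B\f\r]+/g;

// Code spans: strip leading/trailing spaces and line endings, then
// collapse plain whitespace to single spaces. String.prototype.trim is
// avoided on purpose because it also strips Unicode whitespace.
function normalizeCodeSpan(raw) {
  return raw
    .replace(/^[ \r\n]+|[ \r\n]+$/g, '')
    .replace(reWhitespaceRun, ' ');
}

// Delimiter runs: clause (a) of the left-flanking rule looks at the
// character right after the run and uses Unicode whitespace.
// runEnd is the index just past the last delimiter character.
function followedByUnicodeWhitespace(text, runEnd) {
  const next = text.charAt(runEnd);
  // An empty string means end of line, which counts as whitespace.
  return next === '' || /^\s/.test(next);
}

// Non-breaking spaces survive inside a code span...
console.log(JSON.stringify(normalizeCodeSpan('  a\u00A0\u00A0b   c  ')));
// ...but do stop a delimiter run from being left-flanking.
console.log(followedByUnicodeWhitespace('*\u00A0foo', 1));
```

With a single `\s`-based pattern used everywhere, the first call would collapse the two non-breaking spaces into one ordinary space, which is the behavior this pull request is meant to avoid.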

TimothyGu · 2016-08-05T08:47:34Z

On another note, cmark handles this correctly.

jgm · 2016-08-06T20:34:19Z

Many thanks for the careful work!

TimothyGu force-pushed the unicode-whitespace branch from e716e2d to 7b27c58 on August 5, 2016 08:27

TimothyGu mentioned this pull request Aug 5, 2016

Escaping typographer markdown-it/markdown-it#271

Closed

jgm merged commit 3587c91 into commonmark:master Aug 6, 2016

TimothyGu deleted the unicode-whitespace branch August 6, 2016 20:50

TimothyGu mentioned this pull request Aug 6, 2016

Add examples for Unicode whitespace commonmark/commonmark-spec#422

Merged
