Make a distinction between Unicode whitespace and regular whitespace #107

Merged 1 commit on Aug 6, 2016

TimothyGu (Contributor) commented Aug 5, 2016

The spec makes a distinction between "whitespace" and "Unicode whitespace": whereas the latter includes many additional whitespace characters, particularly the non-breaking space (U+00A0), the former does not.

Per ECMA-262 6th Edition ("ECMAScript 2015") §21.2.2.12 CharacterClassEscape, the JavaScript \s character class escape matches the characters the spec calls "Unicode whitespace," not only those it calls "whitespace."

To fix this issue, rename the existing regular expression variable to UnicodeWhitespace, and create and use a new regular expression variable that only matches the limited set of "whitespace" characters.
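
To make the difference concrete, here is a minimal sketch of the two kinds of regular expression. The variable names are illustrative and not necessarily the ones used in the library, and the narrow character class assumes the spec's "whitespace" set is space, tab, newline, line tabulation, form feed, and carriage return.

```javascript
// Illustrative names only; not necessarily those used in the library.

// "Whitespace" in the narrow sense (assumed set): space, tab, newline,
// line tabulation, form feed, carriage return.
var reWhitespaceChar = /^[ \t\n\x0b\x0c\x0d]/;

// JavaScript's \s also matches U+00A0 and the other Unicode space
// separators, among others, so it behaves like "Unicode whitespace".
var reUnicodeWhitespaceChar = /^\s/;

var nbsp = '\u00A0';
console.log(reWhitespaceChar.test(nbsp));        // false
console.log(reUnicodeWhitespaceChar.test(nbsp)); // true
console.log(reWhitespaceChar.test(' '));         // true
```

A non-breaking space is therefore treated as whitespace only where the spec asks for Unicode whitespace.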

The test suite does not yet cover this distinction. I will add a corresponding test case when this pull request has been accepted.

For additional information, the distinction in the spec was challenged and reaffirmed by commonmark/commonmark-spec#343.

The spec makes a distinction between "[whitespace]" and "[Unicode
whitespace]": whereas the latter includes many additional whitespace
characters, particularly the non-breaking space (U+00A0), the former
does not.

Per ECMA-262 6th Edition ("ECMAScript 2015") §21.2.2.12
[CharacterClassEscape], the JavaScript `\s` escape character matches the
characters specified by "Unicode whitespace," but not "whitespace."

To fix this issue, rename the existing regular expression variable to
`UnicodeWhitespace`, and create and use a new regular expression
variable that only matches the limited set of "whitespace" characters.

For additional information, the distinction in the spec was challenged
and reaffirmed by commonmark/commonmark-spec#343.

[whitespace]: http://spec.commonmark.org/0.26/#whitespace-character
[Unicode whitespace]: http://spec.commonmark.org/0.26/#unicode-whitespace-character
[CharacterClassEscape]: http://www.ecma-international.org/ecma-262/6.0/#sec-characterclassescape
@TimothyGu force-pushed the unicode-whitespace branch from e716e2d to 7b27c58 on August 5, 2016 08:27
@TimothyGu
Contributor Author

For the sake of completeness, I did an audit on the usage of whitespace characters in this library.

  1. parseBackticks: Should use whitespace for code spans:

    The contents of the code span are the characters between the two backtick strings, with leading and trailing spaces and line endings removed, and whitespace collapsed to single spaces.

  2. scanDelims: Should use Unicode whitespace for delimiter runs:

    A left-flanking delimiter run is a delimiter run that is (a) not followed by Unicode whitespace, ...

    A right-flanking delimiter run is a delimiter run that is (a) not preceded by Unicode whitespace, ...

  3. parseCloseBracket: Should use whitespace for inline links:

    An inline link consists of a link text followed immediately by a left parenthesis (, optional whitespace, an optional link destination, an optional link title separated from the link destination by whitespace, optional whitespace, and a right parenthesis ).

This pull request addresses all these instances.
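
As a rough illustration of why the choice matters in items 1 and 2, the sketch below uses two hypothetical helpers (neither is the library's actual code). The first collapses only narrow whitespace in a code span, so non-breaking spaces survive. The second checks condition (a) of the left-flanking rule with \s, treating the end of the line as Unicode whitespace as the spec does.

```javascript
// Hypothetical helpers for illustration; not the library's functions.

// Runs of narrow "whitespace" (assumed set: space, tab, newline,
// line tabulation, form feed, carriage return).
var reWhitespaceRun = /[ \t\n\x0b\x0c\x0d]+/g;

// Item 1: collapse whitespace runs to one space, then drop a single
// leading/trailing space. String.prototype.trim is avoided on purpose,
// because it would also strip U+00A0.
function normalizeCodeSpan(contents) {
    return contents.replace(reWhitespaceRun, ' ').replace(/^ | $/g, '');
}

// Item 2: condition (a) for a left-flanking delimiter run ending at
// index `end`: the next character must not be Unicode whitespace.
// The end of the line counts as Unicode whitespace.
function notFollowedByUnicodeWhitespace(text, end) {
    var next = text.charAt(end);
    return next !== '' && !/^\s/.test(next);
}

console.log(JSON.stringify(normalizeCodeSpan('  a \n b\u00A0\u00A0c ')));
console.log(notFollowedByUnicodeWhitespace('*foo', 1));       // true
console.log(notFollowedByUnicodeWhitespace('*\u00A0foo', 1)); // false
```

In the first call the two non-breaking spaces between `b` and `c` are kept, while the ordinary spaces and the newline collapse to a single space.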

@TimothyGu
Contributor Author

On another note, cmark handles this correctly.

@jgm jgm merged commit 3587c91 into commonmark:master Aug 6, 2016
@jgm
Member

jgm commented Aug 6, 2016

Many thanks for the careful work!
