Commit 354780b

leebyron and andimarek committed
Revised RFC after feedback
Co-authored-by: Andreas Marek <[email protected]>
1 parent f33e275 commit 354780b

5 files changed: +130 -67 lines changed
build.sh

Lines changed: 2 additions & 2 deletions
@@ -7,13 +7,13 @@ GITTAG=$(git tag --points-at HEAD)
 # Build the specification draft document
 echo "Building spec draft"
 mkdir -p public/draft
-spec-md --githubSource "https://github.com/graphql/graphql-spec/blame/main/" spec/GraphQL.md > public/draft/index.html
+spec-md --metadata spec/metadata.json --githubSource "https://github.com/graphql/graphql-spec/blame/main/" spec/GraphQL.md > public/draft/index.html
 
 # If this is a tagged commit, also build the release document
 if [ -n "$GITTAG" ]; then
   echo "Building spec release $GITTAG"
   mkdir -p "public/$GITTAG"
-  spec-md --githubSource "https://github.com/graphql/graphql-spec/blame/$GITTAG/" spec/GraphQL.md > "public/$GITTAG/index.html"
+  spec-md --metadata spec/metadata.json --githubSource "https://github.com/graphql/graphql-spec/blame/$GITTAG/" spec/GraphQL.md > "public/$GITTAG/index.html"
 fi
 
 # Create the index file

package.json

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
   },
   "scripts": {
     "test": "npm run test:build && npm run test:spellcheck",
-    "test:build": "spec-md spec/GraphQL.md > /dev/null",
+    "test:build": "spec-md --metadata spec/metadata.json spec/GraphQL.md > /dev/null",
     "test:spellcheck": "cspell 'spec/**/*.md' README.md",
     "format": "prettier --write '**/*.{md,yml,yaml,json}'",
     "format:check": "prettier --check '**/*.{md,yml,yaml,json}'",

spec/Appendix B -- Grammar Summary.md

Lines changed: 2 additions & 7 deletions
@@ -2,12 +2,7 @@
 
 ## Source Text
 
-SourceCharacter ::
-
-- "U+0009"
-- "U+000A"
-- "U+000D"
-- "U+0020–U+10FFFF"
+SourceCharacter :: "Any Unicode scalar value"
 
 ## Ignored Tokens
 
@@ -115,8 +110,8 @@ StringCharacter ::
 
 EscapedUnicode ::
 
+- `{` HexDigit+ `}`
 - HexDigit HexDigit HexDigit HexDigit
-- `{` HexDigit+ `}` "but only if <= 0x10FFFF"
 
 HexDigit :: one of
 
spec/Section 2 -- Language.md

Lines changed: 110 additions & 57 deletions
@@ -45,32 +45,22 @@ match, however some lookahead restrictions include additional constraints.
 
 ## Source Text
 
-SourceCharacter ::
+SourceCharacter :: "Any Unicode scalar value"
 
-- "U+0009"
-- "U+000A"
-- "U+000D"
-- "U+0020–U+10FFFF"
+GraphQL documents are interpreted from a source text, which is a sequence of
+{SourceCharacter}, each {SourceCharacter} being a _Unicode scalar value_ which
+may be any Unicode code point from U+0000 to U+D7FF or U+E000 to U+10FFFF
+(informally referred to as _"characters"_ through most of this specification).
 
-GraphQL documents are expressed as a sequence of
-[Unicode](https://unicode.org/standard/standard.html) code points (informally
-referred to as _"characters"_ through most of this specification). However, with
-few exceptions, most of GraphQL is expressed only in the original non-control
-ASCII range so as to be as widely compatible with as many existing tools,
-languages, and serialization formats as possible and avoid display issues in
-text editors and source control.
+A GraphQL document may be expressed only in the ASCII range to be as widely
+compatible with as many existing tools, languages, and serialization formats as
+possible and avoid display issues in text editors and source control. Non-ASCII
+Unicode scalar values may appear within {StringValue} and {Comment}.
 
-Note: Non-ASCII Unicode characters may appear freely within {StringValue} and
-{Comment} portions of GraphQL.
-
-### Unicode
-
-UnicodeBOM :: "Byte Order Mark (U+FEFF)"
-
-The "Byte Order Mark" is a special Unicode character which may appear at the
-beginning of a file containing Unicode which programs may use to determine the
-fact that the text stream is Unicode, what endianness the text stream is in, and
-which of several Unicode encodings to interpret.
+Note: An implementation which uses _UTF-16_ to represent GraphQL documents in
+memory (for example, JavaScript or Java) may encounter a _surrogate pair_. This
+encodes a _supplementary code point_ and is a single valid source character,
+however an unpaired _surrogate code point_ is not a valid source character.
 
 ### White Space
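
The revised {SourceCharacter} definition in the hunk above amounts to a range check on code points. The following is a minimal Python sketch of that check, not code from this commit; the function names `is_unicode_scalar_value` and `first_invalid_source_character` are hypothetical. Python strings are sequences of code points, so a surrogate pair never appears here as two units; an implementation that stores UTF-16 would presumably need to combine valid pairs before applying the same check.

```python
from typing import Optional


def is_unicode_scalar_value(code_point: int) -> bool:
    """True for U+0000..U+D7FF and U+E000..U+10FFFF (surrogates excluded)."""
    return 0x0000 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF


def first_invalid_source_character(source: str) -> Optional[int]:
    """Return the index of the first code point that is not a scalar value.

    Returns None when every code point is a valid SourceCharacter.
    """
    for index, char in enumerate(source):
        if not is_unicode_scalar_value(ord(char)):
            return index
    return None
```

With this sketch, a document containing an unpaired surrogate such as `"a\ud800b"` is reported as invalid at index 1, while a supplementary code point such as U+1F4A9 passes.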
 
@@ -175,6 +165,17 @@ significant way, for example a {StringValue} may contain white space characters.
 No {Ignored} may appear _within_ a {Token}, for example no white space
 characters are permitted between the characters defining a {FloatValue}.
 
+**Byte order mark**
+
+UnicodeBOM :: "Byte Order Mark (U+FEFF)"
+
+The _Byte Order Mark_ is a special Unicode code point which may appear at the
+beginning of a file which programs may use to determine the fact that the text
+stream is Unicode, and what specific encoding has been used.
+
+As files are often concatenated, a _Byte Order Mark_ may appear anywhere within
+a GraphQL document and is {Ignored}.
+
 ### Punctuators
 
 Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | }
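
Because the hunk above makes U+FEFF an {Ignored} token wherever it appears between tokens, a lexer can skip it in the same loop that skips white space. The sketch below is a simplified assumption about how that might look, not the reference implementation: `skip_ignored` and `IGNORED_CHARACTERS` are hypothetical names, and comment handling is left out for brevity.

```python
# Single-character Ignored tokens: Byte Order Mark, tab, space,
# line terminators, and comma. Comments are also Ignored in GraphQL but
# are omitted from this sketch.
IGNORED_CHARACTERS = frozenset({"\ufeff", "\t", " ", "\n", "\r", ","})


def skip_ignored(source: str, position: int) -> int:
    """Advance past ignored characters and return the next token position."""
    while position < len(source) and source[position] in IGNORED_CHARACTERS:
        position += 1
    return position
```

A loop like this only runs between tokens, so a U+FEFF inside a {StringValue} would still be part of the string's value rather than being dropped.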
@@ -814,8 +815,8 @@ StringCharacter ::
 
 EscapedUnicode ::
 
+- `{` HexDigit+ `}`
 - HexDigit HexDigit HexDigit HexDigit
-- `{` HexDigit+ `}` "but only if <= 0x10FFFF"
 
 HexDigit :: one of
 
@@ -830,19 +831,58 @@ BlockStringCharacter ::
 - SourceCharacter but not `"""` or `\"""`
 - `\"""`
 
-Strings are sequences of characters wrapped in quotation marks (U+0022). (ex.
-{`"Hello World"`}). White space and other otherwise-ignored characters are
-significant within a string value.
+{StringValue} is a sequence of characters wrapped in quotation marks (U+0022).
+(ex. {`"Hello World"`}). White space and other characters ignored in other parts
+of a GraphQL document are significant within a string value.
+
+A {StringValue} is evaluated to a Unicode text value, a sequence of Unicode
+scalar values, by interpreting all escape sequences using the static semantics
+defined below.
 
 The empty string {`""`} must not be followed by another {`"`} otherwise it would
 be interpreted as the beginning of a block string. As an example, the source
 {`""""""`} can only be interpreted as a single empty block string and not three
 empty strings.
 
-Non-ASCII Unicode characters are allowed within single-quoted strings. Since
-{SourceCharacter} must not contain some ASCII control characters, escape
-sequences must be used to represent these characters. The {`\`}, {`"`}
-characters also must be escaped. All other escape sequences are optional.
+**Escape Sequences**
+
+In a single-quoted {StringValue}, any Unicode scalar value may be expressed
+using an escape sequence. GraphQL strings allow both C-style escape sequences
+(for example `\n`) and two forms of Unicode escape sequences: one with a
+fixed-width of 4 hexadecimal digits (for example `\u000A`) and one with a
+variable-width most useful for representing a _supplementary character_ such as
+an Emoji (for example `\u{1F4A9}`).
+
+The hexadecimal number encoded by a Unicode escape sequence must describe a
+Unicode scalar value, otherwise parsing should stop with an early error. For
+example both sources `"\uDEAD"` and `"\u{110000}"` should not be considered
+valid {StringValue}.
+
+Escape sequences are only meaningful within a single-quoted string. Within a
+block string, they are simply that sequence of characters (for example
+`"""\n"""` represents the Unicode text [U+005C, U+006E]). Within a comment an
+escape sequence is not a significant sequence of characters. They may not appear
+elsewhere in a GraphQL document.
+
+Since {StringCharacter} must not contain some characters, escape sequences must
+be used to represent these characters. All other escape sequences are optional
+and unescaped non-ASCII Unicode characters are allowed within strings. If using
+GraphQL within a system which only supports ASCII, then escape sequences may be
+used to represent all Unicode characters outside of the ASCII range.
+
+For legacy reasons, a _supplementary character_ may be escaped by two
+fixed-width unicode escape sequences forming a _surrogate pair_. For example the
+input `"\uD83D\uDCA9"` is a valid {StringValue} which represents the same
+Unicode text as `"\u{1F4A9}"`. While this legacy form is allowed, it should be
+avoided as a variable-width unicode escape sequence is a clearer way to encode
+such code points.
+
+When producing a {StringValue}, implementations should use escape sequences to
+represent non-printable control characters (U+0000 to U+001F and U+007F to
+U+009F). Other escape sequences are not necessary, however an implementation may
+use escape sequences to represent any other range of code points. If an
+implementation chooses to escape a _supplementary character_, it should not use
+a fixed-width surrogate pair unicode escape sequence.
 
 **Block Strings**
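
The guidance in the hunk above on producing a {StringValue} can be sketched as a small printer. This is an illustrative Python sketch under stated assumptions, not code from this commit: `print_string_value` and its `ascii_only` flag are hypothetical, the input is assumed to be valid Unicode text (no lone surrogates), and the named escapes come from the specification's existing escape table, which this diff does not show.

```python
# Characters that have a short C-style escape in a single-quoted string.
NAMED_ESCAPES = {
    '"': '\\"',
    "\\": "\\\\",
    "\b": "\\b",
    "\f": "\\f",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def print_string_value(text: str, ascii_only: bool = False) -> str:
    """Serialize Unicode text as a single-quoted GraphQL StringValue."""
    parts = ['"']
    for char in text:
        code_point = ord(char)
        if char in NAMED_ESCAPES:
            parts.append(NAMED_ESCAPES[char])
        elif code_point <= 0x1F or 0x7F <= code_point <= 0x9F:
            # Non-printable control characters: fixed-width escape.
            parts.append(f"\\u{code_point:04X}")
        elif ascii_only and code_point > 0xFFFF:
            # Supplementary characters: variable-width escape, never a
            # surrogate pair of fixed-width escapes.
            parts.append(f"\\u{{{code_point:X}}}")
        elif ascii_only and code_point > 0x7F:
            parts.append(f"\\u{code_point:04X}")
        else:
            parts.append(char)
    parts.append('"')
    return "".join(parts)
```

The `ascii_only` option corresponds to the case of a system that only supports ASCII; by default non-ASCII characters are written unescaped.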
 
@@ -898,44 +938,57 @@ Note: If non-printable ASCII characters are needed in a string value, a standard
 quoted string with appropriate escape sequences must be used instead of a block
 string.
 
-**Semantics**
+**Static Semantics**
+
+A {StringValue} describes a Unicode text value, a sequence of *Unicode scalar
+value*s. These semantics describe how to apply the {StringValue} grammar to a
+source text to evaluate a Unicode text. Errors encountered during this
+evaluation are considered a failure to apply the {StringValue} grammar to a
+source and result in a parsing error.
 
 StringValue :: `""`
 
 - Return an empty sequence.
 
 StringValue :: `"` StringCharacter+ `"`
 
-- Let {string} be the sequence of all {StringCharacter} code points.
-- For each {codePoint} at {index} in {string}:
-  - If {codePoint} is >= 0xD800 and <= 0xDBFF (a
-    [_High Surrogate_](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)):
-    - Let {lowPoint} be the code point at {index} + {1} in {string}.
-    - Assert {lowPoint} is >= 0xDC00 and <= 0xDFFF (a
-      [_Low Surrogate_](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
-    - Let {decodedPoint} = ({codePoint} - 0xD800) × 0x400 + ({lowPoint} -
-      0xDC00) + 0x10000.
-    - Within {string}, replace {codePoint} and {lowPoint} with {decodedPoint}.
-  - Otherwise, assert {codePoint} is not >= 0xDC00 and <= 0xDFFF (a
-    [_Low Surrogate_](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
-- Return {string}.
-
-Note: {StringValue} should avoid encoding code points as surrogate pairs. While
-services must interpret them accordingly, a braced escape (for example
-`"\u{1F4A9}"`) is a clearer way to encode code points outside of the
-[Basic Multilingual Plane](https://unicodebook.readthedocs.io/unicode.html#bmp).
+- Return the concatenated sequence of _Unicode scalar value_ by evaluating all
+  {StringCharacter}.
 
 StringCharacter :: SourceCharacter but not `"` or `\` or LineTerminator
 
-- Return the code point {SourceCharacter}.
+- Return the _Unicode scalar value_ {SourceCharacter}.
 
 StringCharacter :: `\u` EscapedUnicode
 
-- Let {value} be the 21-bit hexadecimal value represented by the sequence of
-  {HexDigit} within {EscapedUnicode}.
-- Assert {value} <= 0x10FFFF.
+- Let {value} be the hexadecimal value represented by the sequence of {HexDigit}
+  within {EscapedUnicode}.
+- Assert {value} is a within the _Unicode scalar value_ range (>= 0x0000 and <=
+  0xD7FF or >= 0xE000 and <= 0x10FFFF).
 - Return the code point {value}.
 
+StringCharacter :: `\u` HexDigit HexDigit HexDigit HexDigit `\u` HexDigit
+HexDigit HexDigit HexDigit
+
+- Let {leadingValue} be the hexadecimal value represented by the first sequence
+  of {HexDigit}.
+- Let {trailingValue} be the hexadecimal value represented by the second
+  sequence of {HexDigit}.
+- If {leadingValue} is >= 0xD800 and <= 0xDBFF (a _Leading Surrogate_):
+  - Assert {trailingValue} is >= 0xDC00 and <= 0xDFFF (a _Trailing Surrogate_).
+  - Return ({leadingValue} - 0xD800) × 0x400 + ({trailingValue} - 0xDC00) +
+    0x10000.
+- Otherwise:
+  - Assert {leadingValue} is within the _Unicode scalar value_ range.
+  - Assert {trailingValue} is within the _Unicode scalar value_ range.
+  - Return the sequence of the code point {leadingValue} followed by the code
+    point {trailingValue}.
+
+Note: If both escape sequences encode a _Unicode scalar value_, then this
+semantic is identical to applying the prior semantic on each fixed-width escape
+sequence. A variable-width escape sequence must only encode a _Unicode scalar
+value_.
+
 StringCharacter :: `\` EscapedCharacter
 
 - Return the code point represented by {EscapedCharacter} according to the table
@@ -954,13 +1007,13 @@ StringCharacter :: `\` EscapedCharacter
 
 StringValue :: `"""` BlockStringCharacter\* `"""`
 
-- Let {rawValue} be the Unicode character sequence of all {BlockStringCharacter}
-  Unicode character values (which may be an empty sequence).
+- Let {rawValue} be the concatenated sequence of _Unicode scalar value_ by
+  evaluating all {BlockStringCharacter} (which may be an empty sequence).
 - Return the result of {BlockStringValue(rawValue)}.
 
 BlockStringCharacter :: SourceCharacter but not `"""` or `\"""`
 
-- Return the character value of {SourceCharacter}.
+- Return the _Unicode scalar value_ {SourceCharacter}.
 
 BlockStringCharacter :: `\"""`
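
The static semantics added for `\u` escapes in single-quoted strings (variable-width, fixed-width, and the legacy surrogate pair form) can be sketched as a single evaluation loop. This is a hedged Python illustration, not the reference implementation: `evaluate_string_characters` and `GraphQLStringError` are hypothetical names, the input is assumed to be the text between the quotation marks, and the simple escape table is taken from the specification's existing {EscapedCharacter} table rather than from this diff.

```python
import re

FIXED = re.compile(r"\\u([0-9A-Fa-f]{4})")
UNICODE_ESCAPE = re.compile(r"\\u(?:\{([0-9A-Fa-f]+)\}|([0-9A-Fa-f]{4}))")
SIMPLE = {'"': '"', "\\": "\\", "/": "/", "b": "\b", "f": "\f",
          "n": "\n", "r": "\r", "t": "\t"}


class GraphQLStringError(ValueError):
    """Raised where the semantics call for a parsing error."""


def is_scalar(value: int) -> bool:
    return 0 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF


def evaluate_string_characters(body: str) -> str:
    """Evaluate the characters between the quotes of a StringValue."""
    out = []
    pos = 0
    while pos < len(body):
        if body[pos] != "\\":
            out.append(body[pos])
            pos += 1
            continue
        if not body.startswith("\\u", pos):
            escaped = body[pos + 1:pos + 2]
            if escaped not in SIMPLE:
                raise GraphQLStringError(f"invalid escape at {pos}")
            out.append(SIMPLE[escaped])
            pos += 2
            continue
        match = UNICODE_ESCAPE.match(body, pos)
        if match is None:
            raise GraphQLStringError(f"invalid Unicode escape at {pos}")
        braced, fixed = match.groups()
        start, pos = pos, match.end()
        value = int(braced if braced is not None else fixed, 16)
        if fixed is not None and 0xD800 <= value <= 0xDBFF:
            # Leading surrogate: a trailing surrogate escape must follow.
            trailing = FIXED.match(body, pos)
            if trailing is None or not (
                0xDC00 <= int(trailing.group(1), 16) <= 0xDFFF
            ):
                raise GraphQLStringError(f"unpaired surrogate at {start}")
            low = int(trailing.group(1), 16)
            value = (value - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000
            pos = trailing.end()
        if not is_scalar(value):
            raise GraphQLStringError(f"not a Unicode scalar value at {start}")
        out.append(chr(value))
    return "".join(out)
```

As a worked instance of the surrogate arithmetic, `\uD83D\uDCA9` gives (0xD83D - 0xD800) × 0x400 + (0xDCA9 - 0xDC00) + 0x10000 = 0xF400 + 0xA9 + 0x10000 = 0x1F4A9, the same scalar value as `\u{1F4A9}`. The inputs `\uDEAD` and `\u{110000}` raise an error in this sketch, matching the examples the diff lists as invalid.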
 
spec/metadata.json

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+{
+  "biblio": {
+    "https://www.unicode.org/glossary": {
+      "byte-order-mark": "#byte_order_mark",
+      "leading-surrogate": "#leading_surrogate",
+      "trailing-surrogate": "#trailing_surrogate",
+      "supplementary-character": "#supplementary_character",
+      "supplementary-code-point": "#supplementary_code_point",
+      "surrogate-code-point": "#surrogate_code_point",
+      "surrogate-pair": "#surrogate_pair",
+      "unicode-scalar-value": "#unicode_scalar_value",
+      "utf-16": "#UTF_16"
+    }
+  }
+}
