@@ -45,32 +45,22 @@ match, however some lookahead restrictions include additional constraints.
45
45
46
46
## Source Text
47
47
48
- SourceCharacter ::
48
+ SourceCharacter :: "Any Unicode scalar value"
49
49
50
- - "U+0009"
51
- - "U+000A"
52
- - "U+000D"
53
- - "U+0020–U+10FFFF"
50
+ GraphQL documents are interpreted from a source text, which is a sequence of
51
+ {SourceCharacter}, each {SourceCharacter} being a _Unicode scalar value_ which
52
+ may be any Unicode code point from U+0000 to U+D7FF or U+E000 to U+10FFFF
53
+ (informally referred to as _"characters"_ through most of this specification).
54
54
55
- GraphQL documents are expressed as a sequence of
56
- [Unicode](https://unicode.org/standard/standard.html) code points (informally
57
- referred to as _"characters"_ through most of this specification). However, with
58
- few exceptions, most of GraphQL is expressed only in the original non-control
59
- ASCII range so as to be as widely compatible with as many existing tools,
60
- languages, and serialization formats as possible and avoid display issues in
61
- text editors and source control.
55
+ A GraphQL document may be expressed only in the ASCII range to be compatible
56
+ with as many existing tools, languages, and serialization formats as
57
+ possible and avoid display issues in text editors and source control. Non-ASCII
58
+ Unicode scalar values may appear within {StringValue} and {Comment}.
62
59
63
- Note: Non-ASCII Unicode characters may appear freely within {StringValue} and
64
- {Comment} portions of GraphQL.
65
-
66
- ### Unicode
67
-
68
- UnicodeBOM :: "Byte Order Mark (U+FEFF)"
69
-
70
- The "Byte Order Mark" is a special Unicode character which may appear at the
71
- beginning of a file containing Unicode which programs may use to determine the
72
- fact that the text stream is Unicode, what endianness the text stream is in, and
73
- which of several Unicode encodings to interpret.
60
+ Note: An implementation which uses _UTF-16_ to represent GraphQL documents in
61
+ memory (for example, JavaScript or Java) may encounter a _surrogate pair_. This
62
+ encodes a _supplementary code point_ and is a single valid source character;
63
+ however, an unpaired _surrogate code point_ is not a valid source character.
74
64
75
65
### White Space
76
66
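Review sketch (not part of the diff): the revised {SourceCharacter} definition in the hunk above amounts to a range check on code points. A minimal Python illustration follows, assuming the document has already been decoded into a sequence of code points. The function names `is_unicode_scalar_value` and `first_invalid_source_index` are hypothetical and do not come from the spec or from any library.

```python
def is_unicode_scalar_value(code_point: int) -> bool:
    """True for U+0000..U+D7FF and U+E000..U+10FFFF (surrogates excluded)."""
    return 0x0000 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF


def first_invalid_source_index(source: str) -> int:
    """Index of the first code point that is not a SourceCharacter, or -1."""
    for index, char in enumerate(source):
        if not is_unicode_scalar_value(ord(char)):
            return index
    return -1
```

Python strings are sequences of code points, so a supplementary character is already a single element here. An implementation that stores text as UTF-16 code units would first need to combine each valid leading/trailing pair before applying this check, and treat any unpaired surrogate as invalid.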
@@ -175,6 +165,17 @@ significant way, for example a {StringValue} may contain white space characters.
175
165
No {Ignored} may appear _ within_ a {Token}, for example no white space
176
166
characters are permitted between the characters defining a {FloatValue}.
177
167
168
+ **Byte order mark**
169
+
170
+ UnicodeBOM :: "Byte Order Mark (U+FEFF)"
171
+
172
+ The _Byte Order Mark_ is a special Unicode code point which may appear at the
173
+ beginning of a file and which programs may use to determine that the text
174
+ stream is Unicode and which specific encoding has been used.
175
+
176
+ As files are often concatenated, a _Byte Order Mark_ may appear anywhere within
177
+ a GraphQL document and is {Ignored}.
178
+
178
179
### Punctuators
179
180
180
181
Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | }
@@ -814,8 +815,8 @@ StringCharacter ::
814
815
815
816
EscapedUnicode ::
816
817
818
+ - `{` HexDigit+ `}`
817
819
- HexDigit HexDigit HexDigit HexDigit
818
- - `{` HexDigit+ `}` "but only if <= 0x10FFFF"
819
820
820
821
HexDigit :: one of
821
822
@@ -830,19 +831,58 @@ BlockStringCharacter ::
830
831
- SourceCharacter but not `"""` or `\"""`
831
832
- `\"""`
832
833
833
- Strings are sequences of characters wrapped in quotation marks (U+0022). (ex.
834
- {`"Hello World"`}). White space and other otherwise-ignored characters are
835
- significant within a string value.
834
+ {StringValue} is a sequence of characters wrapped in quotation marks (U+0022)
835
+ (for example, {`"Hello World"`}). White space and other characters ignored in other parts
836
+ of a GraphQL document are significant within a string value.
837
+
838
+ A {StringValue} is evaluated to a Unicode text value, a sequence of Unicode
839
+ scalar values, by interpreting all escape sequences using the static semantics
840
+ defined below.
836
841
837
842
The empty string {`""`} must not be followed by another {`"`} otherwise it would
838
843
be interpreted as the beginning of a block string. As an example, the source
839
844
{`""""""`} can only be interpreted as a single empty block string and not three
840
845
empty strings.
841
846
842
- Non-ASCII Unicode characters are allowed within single-quoted strings. Since
843
- {SourceCharacter} must not contain some ASCII control characters, escape
844
- sequences must be used to represent these characters. The {`\`}, {`"`}
845
- characters also must be escaped. All other escape sequences are optional.
847
+ **Escape Sequences**
848
+
849
+ In a single-quoted {StringValue}, any Unicode scalar value may be expressed
850
+ using an escape sequence. GraphQL strings allow both C-style escape sequences
851
+ (for example `\n`) and two forms of Unicode escape sequences: one with a
852
+ fixed width of 4 hexadecimal digits (for example `\u000A`) and one with a
853
+ variable width, most useful for representing a _supplementary character_ such as
854
+ an Emoji (for example `\u{1F4A9}`).
855
+
856
+ The hexadecimal number encoded by a Unicode escape sequence must describe a
857
+ Unicode scalar value, otherwise parsing should stop with an early error. For
858
+ example, neither `"\uDEAD"` nor `"\u{110000}"` should be considered a
859
+ valid {StringValue}.
860
+
861
+ Escape sequences are only meaningful within a single-quoted string. Within a
862
+ block string, they are simply that sequence of characters (for example
863
+ `"""\n"""` represents the Unicode text [U+005C, U+006E]). Within a comment an
864
+ escape sequence is not a significant sequence of characters. Escape sequences may not appear
865
+ elsewhere in a GraphQL document.
866
+
867
+ Since {StringCharacter} must not contain some characters, escape sequences must
868
+ be used to represent these characters. All other escape sequences are optional
869
+ and unescaped non-ASCII Unicode characters are allowed within strings. If using
870
+ GraphQL within a system which only supports ASCII, then escape sequences may be
871
+ used to represent all Unicode characters outside of the ASCII range.
872
+
873
+ For legacy reasons, a _supplementary character_ may be escaped by two
874
+ fixed-width Unicode escape sequences forming a _surrogate pair_. For example the
875
+ input `"\uD83D\uDCA9"` is a valid {StringValue} which represents the same
876
+ Unicode text as `"\u{1F4A9}"`. While this legacy form is allowed, it should be
877
+ avoided as a variable-width Unicode escape sequence is a clearer way to encode
878
+ such code points.
879
+
880
+ When producing a {StringValue}, implementations should use escape sequences to
881
+ represent non-printable control characters (U+0000 to U+001F and U+007F to
882
+ U+009F). Other escape sequences are not necessary, however an implementation may
883
+ use escape sequences to represent any other range of code points. If an
884
+ implementation chooses to escape a _supplementary character_, it should not use
885
+ a fixed-width surrogate pair Unicode escape sequence.
846
886
847
887
**Block Strings**
848
888
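Review sketch (not part of the diff): the guidance in the hunk above on producing a {StringValue} could be approximated as follows. This is a minimal Python illustration under stated assumptions: the name `print_string_value` and its `ascii_only` flag are hypothetical, the input is assumed to contain only Unicode scalar values, and the choice to prefer the short C-style escapes where they exist is a design decision of this sketch rather than something the diff requires.

```python
# Short C-style escapes available for some control characters.
SHORT_ESCAPES = {"\b": "\\b", "\f": "\\f", "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def print_string_value(text: str, ascii_only: bool = False) -> str:
    """Serialize text as a single-quoted StringValue."""
    parts = ['"']
    for char in text:
        cp = ord(char)
        if char == '"':
            parts.append('\\"')
        elif char == "\\":
            parts.append("\\\\")
        elif char in SHORT_ESCAPES:
            parts.append(SHORT_ESCAPES[char])
        elif cp <= 0x1F or 0x7F <= cp <= 0x9F:
            # Non-printable control characters are always escaped.
            parts.append(f"\\u{cp:04X}")
        elif ascii_only and cp > 0xFFFF:
            # Supplementary character: variable-width form, never a surrogate pair.
            parts.append(f"\\u{{{cp:X}}}")
        elif ascii_only and cp > 0x7E:
            parts.append(f"\\u{cp:04X}")
        else:
            parts.append(char)
    parts.append('"')
    return "".join(parts)
```

With `ascii_only=True` the output stays within the ASCII range, which matches the case the diff describes for systems that only support ASCII.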
@@ -898,44 +938,57 @@ Note: If non-printable ASCII characters are needed in a string value, a standard
898
938
quoted string with appropriate escape sequences must be used instead of a block
899
939
string.
900
940
901
- **Semantics**
941
+ **Static Semantics**
942
+
943
+ A {StringValue} describes a Unicode text value, a sequence of *Unicode scalar
944
+ value*s. These semantics describe how to apply the {StringValue} grammar to a
945
+ source text to evaluate a Unicode text. Errors encountered during this
946
+ evaluation are considered a failure to apply the {StringValue} grammar to a
947
+ source and result in a parsing error.
902
948
903
949
StringValue :: `""`
904
950
905
951
- Return an empty sequence.
906
952
907
953
StringValue :: `"` StringCharacter+ `"`
908
954
909
- - Let {string} be the sequence of all {StringCharacter} code points.
910
- - For each {codePoint} at {index} in {string}:
911
- - If {codePoint} is >= 0xD800 and <= 0xDBFF (a
912
- [_High Surrogate_](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)):
913
- - Let {lowPoint} be the code point at {index} + {1} in {string}.
914
- - Assert {lowPoint} is >= 0xDC00 and <= 0xDFFF (a
915
- [_Low Surrogate_](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
916
- - Let {decodedPoint} = ({codePoint} - 0xD800) × 0x400 + ({lowPoint} -
917
- 0xDC00) + 0x10000.
918
- - Within {string}, replace {codePoint} and {lowPoint} with {decodedPoint}.
919
- - Otherwise, assert {codePoint} is not >= 0xDC00 and <= 0xDFFF (a
920
- [_Low Surrogate_](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
921
- - Return {string}.
922
-
923
- Note: {StringValue} should avoid encoding code points as surrogate pairs. While
924
- services must interpret them accordingly, a braced escape (for example
925
- `"\u{1F4A9}"`) is a clearer way to encode code points outside of the
926
- [Basic Multilingual Plane](https://unicodebook.readthedocs.io/unicode.html#bmp).
955
+ - Return the concatenated sequence of _Unicode scalar values_ obtained by evaluating all
956
+ {StringCharacter}.
927
957
928
958
StringCharacter :: SourceCharacter but not `"` or `\` or LineTerminator
929
959
930
- - Return the code point {SourceCharacter}.
960
+ - Return the _Unicode scalar value_ {SourceCharacter}.
931
961
932
962
StringCharacter :: `\u` EscapedUnicode
933
963
934
- - Let {value} be the 21-bit hexadecimal value represented by the sequence of
935
- {HexDigit} within {EscapedUnicode}.
936
- - Assert {value} <= 0x10FFFF.
964
+ - Let {value} be the hexadecimal value represented by the sequence of {HexDigit}
965
+ within {EscapedUnicode}.
966
+ - Assert {value} is within the _Unicode scalar value_ range (>= 0x0000 and <=
967
+ 0xD7FF or >= 0xE000 and <= 0x10FFFF).
937
968
- Return the code point {value}.
938
969
970
+ StringCharacter :: `\u` HexDigit HexDigit HexDigit HexDigit `\u` HexDigit
971
+ HexDigit HexDigit HexDigit
972
+
973
+ - Let {leadingValue} be the hexadecimal value represented by the first sequence
974
+ of {HexDigit}.
975
+ - Let {trailingValue} be the hexadecimal value represented by the second
976
+ sequence of {HexDigit}.
977
+ - If {leadingValue} is >= 0xD800 and <= 0xDBFF (a _Leading Surrogate_):
978
+ - Assert {trailingValue} is >= 0xDC00 and <= 0xDFFF (a _Trailing Surrogate_).
979
+ - Return ({leadingValue} - 0xD800) × 0x400 + ({trailingValue} - 0xDC00) +
980
+ 0x10000.
981
+ - Otherwise:
982
+ - Assert {leadingValue} is within the _Unicode scalar value_ range.
983
+ - Assert {trailingValue} is within the _Unicode scalar value_ range.
984
+ - Return the sequence of the code point {leadingValue} followed by the code
985
+ point {trailingValue}.
986
+
987
+ Note: If both escape sequences encode a _Unicode scalar value_, then this
988
+ semantic is identical to applying the prior semantic on each fixed-width escape
989
+ sequence. A variable-width escape sequence must only encode a _Unicode scalar
990
+ value_.
991
+
939
992
StringCharacter :: `\` EscapedCharacter
940
993
941
994
- Return the code point represented by {EscapedCharacter} according to the table
@@ -954,13 +1007,13 @@ StringCharacter :: `\` EscapedCharacter
954
1007
955
1008
StringValue :: `"""` BlockStringCharacter\* `"""`
956
1009
957
- - Let {rawValue} be the Unicode character sequence of all {BlockStringCharacter}
958
- Unicode character values (which may be an empty sequence).
1010
+ - Let {rawValue} be the concatenated sequence of _Unicode scalar values_ obtained by
1011
+ evaluating all {BlockStringCharacter} (which may be an empty sequence).
959
1012
- Return the result of {BlockStringValue(rawValue)}.
960
1013
961
1014
BlockStringCharacter :: SourceCharacter but not `"""` or `\"""`
962
1015
963
- - Return the character value of {SourceCharacter}.
1016
+ - Return the _Unicode scalar value_ {SourceCharacter}.
964
1017
965
1018
BlockStringCharacter :: `\"""`
966
1019
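Review sketch (not part of the diff): the static semantics added for `\u` escapes, including the legacy surrogate-pair form, can be read as the following evaluation loop. This is a minimal Python illustration, not the reference implementation. It assumes the lexer has already isolated the text between the quotation marks of a single-quoted string; the names `evaluate_string_body`, `is_scalar`, `FIXED`, `BRACED` and `SIMPLE_ESCAPES` are hypothetical.

```python
import re

# C-style escapes permitted by EscapedCharacter.
SIMPLE_ESCAPES = {'"': '"', "\\": "\\", "/": "/", "b": "\b",
                  "f": "\f", "n": "\n", "r": "\r", "t": "\t"}
FIXED = re.compile(r"\\u([0-9A-Fa-f]{4})")
BRACED = re.compile(r"\\u\{([0-9A-Fa-f]+)\}")


def is_scalar(value: int) -> bool:
    return 0 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF


def evaluate_string_body(body: str) -> str:
    """Evaluate escape sequences; raise ValueError on an invalid escape."""
    out = []
    pos = 0
    while pos < len(body):
        if body[pos] != "\\":
            out.append(body[pos])
            pos += 1
            continue
        braced = BRACED.match(body, pos)
        fixed = FIXED.match(body, pos)
        if braced:
            value = int(braced.group(1), 16)
            pos = braced.end()
        elif fixed:
            value = int(fixed.group(1), 16)
            pos = fixed.end()
            if 0xD800 <= value <= 0xDBFF:
                # Leading surrogate: a trailing surrogate escape must follow.
                trailing = FIXED.match(body, pos)
                low = int(trailing.group(1), 16) if trailing else -1
                if not 0xDC00 <= low <= 0xDFFF:
                    raise ValueError("leading surrogate without trailing surrogate")
                value = (value - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000
                pos = trailing.end()
        else:
            escaped = body[pos + 1:pos + 2]
            if escaped not in SIMPLE_ESCAPES:
                raise ValueError(f"invalid escape at index {pos}")
            out.append(SIMPLE_ESCAPES[escaped])
            pos += 2
            continue
        if not is_scalar(value):
            raise ValueError(f"not a Unicode scalar value: {value:#x}")
        out.append(chr(value))
    return "".join(out)
```

The examples from the diff behave as described there: `\uD83D\uDCA9` and `\u{1F4A9}` evaluate to the same text, while `\uDEAD` and `\u{110000}` are rejected.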