Description
HTML 5 Proposed Recommendation §8.2.2 The input byte stream, HTML 5.1 Draft §8.2.2 The input byte stream:
Note: Bytes or sequences of bytes in the original byte stream that did not conform to the Encoding standard (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors that conformance checkers are expected to report. [ENCODING]
Test case:
import io
import unittest

import html5lib

class TestInvalidSequences(unittest.TestCase):
    def test_invalid_sequences(self):
        parser = html5lib.HTMLParser()
        # 0xA0 is not a valid ASCII byte, so decoding it should be reported.
        doc = parser.parse(io.BytesIO(b'<!DOCTYPE html>\xA0'), encoding='ascii')
        self.assertTrue(parser.errors)

Expected behavior: parser.errors is not empty
Observed behavior: parser.errors is empty; doc contains a tree which contains the \uFFFD replacement character in place of the invalid byte.
Cause: In HTMLBinaryInputStream.reset, the codec is constructed with the error handler 'replace', so invalid bytes are silently converted to \uFFFD without being recorded. HTMLUnicodeInputStream only reports errors for Unicode code points that were successfully decoded but are either noncharacters or surrogates.
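The effect of 'replace' can be reproduced with the standard library alone, without html5lib. The sketch below is only an illustration of one possible direction, not a patch against html5lib: the function name decode_reporting_errors and the handler name "record-and-replace" are made up for this example. It uses codecs.register_error to keep the replacement behavior while recording where each invalid byte sequence occurred.

```python
import codecs


def decode_reporting_errors(data, encoding):
    """Decode bytes like errors='replace', but also record each invalid sequence.

    Hypothetical helper for illustration; html5lib does not provide it.
    Returns (text, errors), where errors is a list of (offset, bad_bytes).
    """
    errors = []

    def handler(exc):
        # Only decoding errors are expected here; re-raise anything else.
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        errors.append((exc.start, exc.object[exc.start:exc.end]))
        # Substitute U+FFFD and resume decoding after the bad bytes.
        return ("\ufffd", exc.end)

    # Re-registering under the same name replaces the previous handler.
    # This is global state, so the sketch is not thread-safe.
    codecs.register_error("record-and-replace", handler)
    text = data.decode(encoding, errors="record-and-replace")
    return text, errors


data = b"<!DOCTYPE html>\xA0"

# With 'replace', the invalid byte vanishes without a trace of an error.
silent = data.decode("ascii", errors="replace")

# With the recording handler, the same text comes back plus an error list.
text, errors = decode_reporting_errors(data, "ascii")
```

Here silent and text are the same string, ending in U+FFFD, but only the second call leaves a non-empty errors list that a parser could turn into parse errors.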