Update comment about utf-8 BOM being ignored #107607
Comments
According to the Python
cc @corona10 [1] https://docs.python.org/3/library/codecs.html#encodings-and-unicode |
@terryjreedy |
If it's referring to our |
Since the current text has the incorrect 'UTF-8' unmarked, I think the replacement should be an 'official name' of the inferred encoding, unmarked. @zooba proposes 'UTF-8-BOM', but I do not believe this is endorsed by the Unicode Consortium. Current Win 10 Microsoft Notepad lists this Encoding option as "UTF-8 with BOM". Given the immediately following parenthetical comment "(this is supported, among others, by Microsoft’s notepad)", I am inclined to use 'with BOM'. ("Notepad" should be titlecased.) I have made both changes on the PR, but have more to say in another comment.
I did a bit more research and thinking. The current 2 sentence paragraph is this:
The first sentence and "In addition, " were added for Python 3. Before that, the default assumption was only that the encoding was 7-bit ASCII compatible. The presence of the UTF-8 BOM then acted as a declaration that the encoding was specifically UTF-8. In Python 3, the default encoding is already UTF-8, so the sentence is redundant except for the implication that the BOM is ignored rather than treated as a syntax error, as it is for encodings other than UTF-8. I checked that this is also the case if the encoding is explicitly UTF-8
in a file with BOM runs. So I now think the line should be replaced with "If the implicit or explicit encoding of a file is UTF-8, a UTF-8 byte-order mark (b'\xef\xbb\xbf') is ignored rather than being a syntax error." This explicitly says what a user needs to know. In other words, the actual issue was the redundancy in the 3.x version of the paragraph, and I now think that I proposed the wrong fix by focusing on the wrong thing. EDIT: I am not sure how I wrote the file above, bom.py. But on loading it into Notepad or Notepad++, the encoding is given as "UTF-8 with BOM" or "UTF-8-BOM". I reran with the first line quoted as a docstring and it printed "ran" again. In current Notepad, the encoding defaults to UTF-8, but one can select ASCII, UTF-16-XY, or UTF-8-BOM when saving.
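The behavior described in this comment can be checked without an editor. The following is a minimal sketch, not the test that was actually run for this issue: it uses `tokenize.detect_encoding` and `compile()` on in-memory bytes, and the helper name `detect` and the filename `<bom-demo>` are made up for illustration.

```python
import io
import tokenize

BOM = b"\xef\xbb\xbf"  # the UTF-8 byte-order mark


def detect(source: bytes) -> str:
    """Return the encoding that tokenize infers for the given source bytes."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# BOM with no declaration (implicit UTF-8) and BOM with an explicit
# UTF-8 declaration: both are accepted.
bom_only = detect(BOM + b'print("ran")\n')
bom_and_utf8 = detect(BOM + b'# -*- coding: utf-8 -*-\nprint("ran")\n')

# BOM combined with a declaration of a different encoding is rejected.
try:
    detect(BOM + b'# -*- coding: latin-1 -*-\nprint("ran")\n')
    conflict_rejected = False
except SyntaxError:
    conflict_rejected = True

# compile() also accepts source bytes that start with the BOM.
namespace = {}
exec(compile(BOM + b'result = "ran"\n', "<bom-demo>", "exec"), namespace)

print(bom_only, bom_and_utf8, conflict_rejected, namespace["result"])
```

Note that `tokenize.detect_encoding` reports the name `utf-8-sig` when a BOM is present, which is the codec name rather than a name endorsed by the Unicode Consortium.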
--------- Co-authored-by: Terry Jan Reedy <[email protected]>
…GH-107858) --------- Co-authored-by: Terry Jan Reedy <[email protected]> (cherry picked from commit 7f64ae3) Co-authored-by: Sunghyun Kim <[email protected]>
…7858) (#117015) (cherry picked from commit 7f64ae3) Co-authored-by: Terry Jan Reedy [email protected]
…7858) (#117016) (cherry picked from commit 7f64ae3) Co-authored-by: Terry Jan Reedy [email protected]
…#107858) --------- Co-authored-by: Terry Jan Reedy <[email protected]>
[EDIT: I opened this because I saw a redundancy in a paragraph in Reference / 2. Lexical analysis / 2.1 Line structure / 2.1.4 Encoding declarations. I neglected to explain the problem and instead jumped to what I now think is the wrong solution. See my explanation and better fix in https://github.com//issues/107607#issuecomment-1675967835. I leave the original post so the ensuing discussion makes sense.]
I believe "if the first bytes of the file are the UTF-8 byte-order mark (b'\xef\xbb\xbf'), the declared file encoding is UTF-8" in Encoding declarations should end with "UTF-8-sig" or "UTF_8_sig". (Not sure which.)
Easy issue once fix verified.
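On the "not sure which" spelling: as far as the `codecs` module is concerned, both appear to name the same codec. A small sketch, assuming only the standard library, that also shows how `utf-8-sig` differs from plain `utf-8` when decoding bytes that start with the BOM:

```python
import codecs

# Bytes as an editor would save them with the "UTF-8 with BOM" option.
data = codecs.BOM_UTF8 + "print('ran')".encode("utf-8")

# Plain utf-8 keeps the BOM as the character U+FEFF at the start.
with_utf8 = data.decode("utf-8")

# utf-8-sig strips a leading BOM when decoding.
with_sig = data.decode("utf-8-sig")

# Both spellings from the post resolve to one codec.
name_hyphen = codecs.lookup("UTF-8-sig").name
name_underscore = codecs.lookup("UTF_8_sig").name

print(repr(with_utf8[0]), with_sig, name_hyphen, name_underscore)
```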