
Update comment about utf-8 BOM being ignored #107607


Closed
terryjreedy opened this issue Aug 3, 2023 · 5 comments
Labels
docs Documentation in the Doc dir easy OS-windows

Comments

@terryjreedy
Member

terryjreedy commented Aug 3, 2023

[EDIT: I opened this because I saw a redundancy in a paragraph in Reference / 2. Lexical analysis / 2.1 Line structure / 2.1.4 Encoding declarations. I neglected to explain the problem and instead jumped to what I now think is the wrong solution. See my explanation and better fix in https://github.com//issues/107607#issuecomment-1675967835. I leave the original post so the ensuing discussion makes sense.]

I believe "if the first bytes of the file are the UTF-8 byte-order mark (b'\xef\xbb\xbf'), the declared file encoding is UTF-8" in
Encoding declarations should end with "UTF-8-sig" or "UTF_8_sig". (Not sure which.)

Easy issue once fix verified.

Linked PRs

@terryjreedy terryjreedy added docs Documentation in the Doc dir OS-windows easy labels Aug 3, 2023
@AlexWaygood AlexWaygood changed the title from "utf-8 byte-order means utf_8_sig codec" to ""Lexical analysis" docs: utf-8 byte-order means utf_8_sig codec" Aug 3, 2023
@rscarrera27
Contributor

rscarrera27 commented Aug 11, 2023

@terryjreedy

According to the Python codecs docs[1], Python refers to UTF-8 with a BOM as utf-8-sig. Therefore, using "UTF-8-Sig" seems more appropriate.

(...) To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python calls "utf-8-sig")

cc @corona10


[1] https://docs.python.org/3/library/codecs.html#encodings-and-unicode
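
As a small illustration of the naming question above, here is a sketch (not part of the original comment) of how the two codecs differ and how the spellings are resolved. It uses only the standard `codecs` module; the sample source bytes are made up for the example.

```python
import codecs

# Bytes of a tiny script preceded by the UTF-8 byte-order mark.
raw = codecs.BOM_UTF8 + "print('ran')\n".encode("utf-8")

# The plain "utf-8" codec keeps the BOM as the character U+FEFF.
plain = raw.decode("utf-8")

# The "utf-8-sig" codec strips a leading BOM when decoding...
stripped = raw.decode("utf-8-sig")

# ...and prepends one when encoding.
encoded = "x".encode("utf-8-sig")

# Hyphen/underscore and case variants are aliases for one codec.
names = {codecs.lookup(n).name for n in ("utf-8-sig", "UTF_8_sig", "UTF-8-Sig")}

print(repr(plain[0]))   # '\ufeff'
print(repr(stripped))
print(encoded)
print(names)
```

So "UTF-8-sig" and "UTF_8_sig" both name the same codec as far as `codecs.lookup` is concerned; the choice between them is about documentation style only.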

@corona10
Member

@terryjreedy
@sierrasevn is a participant in the KR sprint :)

@zooba
Member

zooba commented Aug 11, 2023

If it's referring to our utf-8-sig encoding rather than Unicode's UTF-8-BOM definition, then we should ensure it's quoted as code, and probably shown as a string literal. That way people will (more likely) know that we're referring to our own parameter rather than the proper title.

@terryjreedy
Member Author

Since the current text has the incorrect 'UTF-8' unmarked, I think the replacement should be an 'official name' of the inferred encoding, unmarked. @zooba proposes 'UTF-8-BOM', but I do not believe this is endorsed by the Unicode Consortium. Current Win 10 Microsoft Notepad lists this Encoding option as "UTF-8 with BOM". Given the immediately following parenthetical comment "(this is supported, among others, by Microsoft’s notepad)", I am inclined to use 'with BOM'. ("Notepad" should be titlecased.) I have made both changes on the PR, but have more to say in another comment.

@terryjreedy
Member Author

terryjreedy commented Aug 12, 2023

I did a bit more research and thinking. The current two-sentence paragraph is this:

If no encoding declaration is found, the default encoding is UTF-8. In addition, if the first bytes of the file are the UTF-8 byte-order mark (b'\xef\xbb\xbf'), the declared file encoding is UTF-8 (this is supported, among others, by Microsoft’s notepad).

The first sentence and "In addition, " were added for Python 3. Before that, the default assumption was only that the encoding was 7-bit ASCII compatible. The presence of the UTF-8 BOM then acted as a declaration that the encoding was specifically UTF-8. In Python 3, the default encoding is already UTF-8, so the sentence is redundant except for the implication that the BOM is ignored rather than seen as a syntax error, which is how it is treated for encodings other than UTF-8. I checked that this is also the case if the encoding is explicitly UTF-8: a file with a BOM containing

# coding: utf-8
print('ran')

runs. So I now think the line should be replaced with "If the implicit or explicit encoding of a file is UTF-8, a UTF-8 byte-order mark (b'\xef\xbb\xbf') is ignored rather than being a syntax error." This explicitly says what a user needs to know.
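
For anyone who wants to repeat the check, the following is a minimal sketch of the experiment, not the exact procedure used above. The helper `run_source` and the temporary `bom.py` path are hypothetical names introduced for illustration, and the conflicting `latin-1` declaration is an assumed example of "an encoding other than UTF-8".

```python
import subprocess
import sys
import tempfile
from pathlib import Path

BOM = b"\xef\xbb\xbf"  # the UTF-8 byte-order mark


def run_source(source: bytes) -> subprocess.CompletedProcess:
    """Write the raw bytes to a temporary bom.py and run it with this interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "bom.py"
        path.write_bytes(source)
        return subprocess.run(
            [sys.executable, str(path)], capture_output=True, text=True
        )


# BOM plus an explicit UTF-8 declaration: the BOM is ignored.
explicit = run_source(BOM + b"# coding: utf-8\nprint('ran')\n")

# BOM with no declaration (implicit UTF-8): the BOM is also ignored.
implicit = run_source(BOM + b"print('ran')\n")

# BOM plus a conflicting declaration: the script is expected to fail.
conflict = run_source(BOM + b"# coding: latin-1\nprint('ran')\n")

print(explicit.stdout.strip(), implicit.stdout.strip(), conflict.returncode)
```

The first two runs should print "ran"; the third should exit with a nonzero status and report a syntax error on stderr, whose exact wording may vary between Python versions.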

In other words, the actual issue was the redundancy in the 3.x version of the paragraph and I now think that I proposed the wrong fix by focusing on the wrong thing.

EDIT: I am not sure how I wrote the file above, bom.py. But when it is loaded into Notepad or Notepad++, the encoding is given as "UTF-8 with BOM" or "UTF-8-BOM". I reran with the first line quoted as a docstring and it printed "ran" again. In current Notepad, the encoding defaults to UTF-8 but one can select ASCII, UTF-16-XY, or UTF-8-BOM when saving.
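
Editors are not the only tools with their own name for this. As a hedged side note, the standard library's `tokenize.detect_encoding` reports a source file with a BOM as `utf-8-sig`, whether or not a `coding` line is present, and raises `SyntaxError` when the BOM and the declaration disagree. The helper `detect` below is a hypothetical wrapper written for this sketch.

```python
import io
import tokenize

BOM = b"\xef\xbb\xbf"  # the UTF-8 byte-order mark


def detect(source: bytes) -> str:
    """Return the encoding name tokenize infers from raw source bytes."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


no_bom = detect(b"print('ran')\n")                             # 'utf-8'
bom_only = detect(BOM + b"print('ran')\n")                     # 'utf-8-sig'
bom_and_cookie = detect(BOM + b"# coding: utf-8\nprint('ran')\n")  # 'utf-8-sig'

# A BOM combined with a non-UTF-8 declaration is rejected.
try:
    detect(BOM + b"# coding: latin-1\nprint('ran')\n")
    conflict_raised = False
except SyntaxError:
    conflict_raised = True

print(no_bom, bom_only, bom_and_cookie, conflict_raised)
```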

terryjreedy added a commit that referenced this issue Mar 19, 2024
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Mar 19, 2024
…GH-107858)

---------
Co-authored-by: Terry Jan Reedy <[email protected]>
(cherry picked from commit 7f64ae3)

Co-authored-by: Sunghyun Kim <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Mar 19, 2024
…GH-107858)

---------
Co-authored-by: Terry Jan Reedy <[email protected]>
(cherry picked from commit 7f64ae3)

Co-authored-by: Sunghyun Kim <[email protected]>
terryjreedy pushed a commit that referenced this issue Mar 19, 2024
terryjreedy pushed a commit that referenced this issue Mar 19, 2024
@terryjreedy terryjreedy changed the title from ""Lexical analysis" docs: utf-8 byte-order means utf_8_sig codec" to "Update comment about utf-8 BOM being ignored" Mar 19, 2024
vstinner pushed a commit to vstinner/cpython that referenced this issue Mar 20, 2024
adorilson pushed a commit to adorilson/cpython that referenced this issue Mar 25, 2024
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024