Update comment about utf-8 BOM being ignored #107607

terryjreedy · 2023-08-03T21:00:46Z

[EDIT: I opened this because I saw a redundancy in a paragraph in Reference / 2. Lexical analysis / 2.1 Line structure / 2.1.4 Encoding declarations. I neglected to explain the problem and instead jumped to what I now think is the wrong solution. See my explanation and better fix in https://github.com//issues/107607#issuecomment-1675967835. I leave the original post so the ensuing discussion makes sense.]

I believe "if the first bytes of the file are the UTF-8 byte-order mark (b'\xef\xbb\xbf'), the declared file encoding is UTF-8" in
Encoding declarations should end with "UTF-8-sig" or "UTF_8_sig". (Not sure which.)

Easy issue once fix verified.

rscarrera27 · 2023-08-11T08:09:31Z

@terryjreedy

According to the Python codecs docs[1], Python calls UTF-8 with a BOM "utf-8-sig". Therefore, using "UTF-8-Sig" seems more appropriate.

(...) To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python calls "utf-8-sig")

cc @corona10

[1] https://docs.python.org/3/library/codecs.html#encodings-and-unicode
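For readers unfamiliar with the codec being discussed, here is a minimal sketch of how "utf-8-sig" differs from plain "utf-8" when decoding bytes that start with the BOM. It uses only the standard `codecs` module; the sample bytes and variable names are made up for illustration.

```python
import codecs

# Bytes as an editor such as Notepad might save them: BOM followed by UTF-8 text.
data = codecs.BOM_UTF8 + "print('ran')".encode("utf-8")

# Plain UTF-8 keeps the BOM as a leading U+FEFF character.
plain = data.decode("utf-8")

# utf-8-sig strips a leading BOM on decode and writes one on encode.
sig = data.decode("utf-8-sig")
encoded = "x".encode("utf-8-sig")

# Both spellings from the original post resolve to the same codec.
canonical_dash = codecs.lookup("utf-8-sig").name
canonical_underscore = codecs.lookup("utf_8_sig").name

print(repr(plain[0]), repr(sig), encoded, canonical_dash, canonical_underscore)
```

So "utf-8-sig" and "utf_8_sig" are interchangeable as codec names; which spelling belongs in the prose is a documentation style question rather than a functional one.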

corona10 · 2023-08-11T08:14:23Z

@terryjreedy
@sierrasevn is a participant in the KR sprint :)

zooba · 2023-08-11T09:20:45Z

If it's referring to our utf-8-sig encoding rather than Unicode's UTF-8-BOM definition, then we should ensure it's quoted as code, and probably shown as a string literal. That way people will (more likely) know that we're referring to our own parameter rather than the proper title.

terryjreedy · 2023-08-11T18:44:50Z

Since the current text has the incorrect 'UTF-8' unmarked, I think the replacement should be an 'official name' of the inferred encoding, unmarked. @zooba proposes 'UTF-8-BOM', but I do not believe this is endorsed by the Unicode Consortium. Current Win 10 Microsoft Notepad lists this Encoding option as "UTF-8 with BOM". Given the immediately following parenthetical comment "(this is supported, among others, by Microsoft’s notepad)", I am inclined to use 'with BOM'. ("Notepad" should be titlecased.) I have made both changes on the PR, but have more to say in another comment.

terryjreedy · 2023-08-12T15:37:40Z

I did a bit more research and thinking. The current two-sentence paragraph is this:

If no encoding declaration is found, the default encoding is UTF-8. In addition, if the first bytes of the file are the UTF-8 byte-order mark (b'\xef\xbb\xbf'), the declared file encoding is UTF-8 (this is supported, among others, by Microsoft’s notepad).

The first sentence and "In addition, " were added for Python 3. Before that, the default assumption was only that the encoding was 7-bit ASCII compatible. The presence of the UTF-8 BOM then acted as a declaration that the encoding was specifically UTF-8. In Python 3, the default encoding is already UTF-8, so the sentence is redundant except for the implication that the BOM is ignored rather than treated as a syntax error, as it is for encodings other than UTF-8. I checked that this is also the case if the encoding is explicitly UTF-8

# coding: utf-8
print('ran')

in a file with BOM runs. So I now think the line should be replaced with "If the implicit or explicit encoding of a file is UTF-8, a UTF-8 byte-order mark (b'\xef\xbb\xbf') is ignored rather than being a syntax error." This explicitly says what a user needs to know.
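The behavior described above can be checked without creating files, as a rough sketch: `compile()` accepts source bytes, and `tokenize.detect_encoding()` reports the encoding it infers. The helper name `compiles_ok` and the sample sources are hypothetical, chosen only for this illustration, and the results reflect CPython 3 as I understand it.

```python
import io
import tokenize

BOM = b"\xef\xbb\xbf"


def compiles_ok(source: bytes) -> bool:
    """Return True if the source bytes compile without a SyntaxError."""
    try:
        compile(source, "bom.py", "exec")
    except SyntaxError:
        return False
    return True


# BOM with no declaration: the implicit encoding is UTF-8, so the BOM is ignored.
implicit_utf8 = compiles_ok(BOM + b"x = 'ran'\n")

# BOM with an explicit UTF-8 declaration: also accepted.
explicit_utf8 = compiles_ok(BOM + b"# coding: utf-8\nx = 'ran'\n")

# BOM with a declaration of another encoding: rejected as a syntax error.
other_encoding = compiles_ok(BOM + b"# coding: latin-1\nx = 'ran'\n")

# The tokenize module reports the BOM case under the codec name "utf-8-sig".
detected, _ = tokenize.detect_encoding(io.BytesIO(BOM + b"x = 'ran'\n").readline)

print(implicit_utf8, explicit_utf8, other_encoding, detected)
```

This matches the proposed wording: the BOM is tolerated whenever the implicit or explicit encoding is UTF-8, and it is an error otherwise.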

In other words, the actual issue was the redundancy in the 3.x version of the paragraph and I now think that I proposed the wrong fix by focusing on the wrong thing.

EDIT: I am not sure how I wrote the file above, bom.py, but when it is loaded into Notepad or Notepad++, the encoding is given as "UTF-8 with BOM" or "UTF-8-BOM". I reran with the first line quoted as a docstring and it printed "ran" again. In current Notepad, the encoding defaults to UTF-8 but one can select ASCII, UTF-16-XY, or UTF-8-BOM when saving.

terryjreedy added docs Documentation in the Doc dir OS-windows easy labels Aug 3, 2023

AlexWaygood changed the title ~~utf-8 byte-order means utf_8_sig codec~~ "Lexical analysis" docs: utf-8 byte-order means utf_8_sig codec Aug 3, 2023

rscarrera27 added a commit to rscarrera27/cpython that referenced this issue Aug 11, 2023

pythonGH-107607: Correct encoding name in docs

24f7470

bedevere-bot mentioned this issue Aug 11, 2023

gh-107607: Update comment about utf-8 BOM being ignored #107858

Merged

terryjreedy added a commit that referenced this issue Mar 19, 2024

gh-107607: Update comment about utf-8 BOM being ignored (#107858)

7f64ae3

--------- Co-authored-by: Terry Jan Reedy <[email protected]>

This was referenced Mar 19, 2024

[3.11] gh-107607: Update comment about utf-8 BOM being ignored (GH-107858) #117015

Merged

[3.12] gh-107607: Update comment about utf-8 BOM being ignored (GH-107858) #117016

Merged

terryjreedy pushed a commit that referenced this issue Mar 19, 2024

[3.11] gh-107607: Update comment about utf-8 BOM being ignored (GH-10…

bb7a6d4

…7858) (#117015) (cherry picked from commit 7f64ae3) Co-authored-by: Terry Jan Reedy [email protected]

terryjreedy pushed a commit that referenced this issue Mar 19, 2024

[3.12] gh-107607: Update comment about utf-8 BOM being ignored (GH-10…

1627c1e

…7858) (#117016) (cherry picked from commit 7f64ae3) Co-authored-by: Terry Jan Reedy [email protected]

terryjreedy changed the title ~~"Lexical analysis" docs: utf-8 byte-order means utf_8_sig codec~~ Update comment about utf-8 BOM being ignored Mar 19, 2024

terryjreedy closed this as completed Mar 19, 2024

vstinner pushed a commit to vstinner/cpython that referenced this issue Mar 20, 2024

pythongh-107607: Update comment about utf-8 BOM being ignored (python…

0517860

…#107858) --------- Co-authored-by: Terry Jan Reedy <[email protected]>

adorilson pushed a commit to adorilson/cpython that referenced this issue Mar 25, 2024

pythongh-107607: Update comment about utf-8 BOM being ignored (python…

1b0d593

…#107858) --------- Co-authored-by: Terry Jan Reedy <[email protected]>

diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024

pythongh-107607: Update comment about utf-8 BOM being ignored (python…

b30f7e6

…#107858) --------- Co-authored-by: Terry Jan Reedy <[email protected]>
