html.parser produces different output than documented #131535

Closed
mouchen626 opened this issue Mar 21, 2025 · 3 comments · Fixed by #131551
Labels
3.13 bugs and security fixes 3.14 bugs and security fixes 3.15 new features, bugs and security fixes docs Documentation in the Doc dir type-bug An unexpected behavior, bug, or error

Comments

@mouchen626

mouchen626 commented Mar 21, 2025

When parsing &gt;&#62;&#x3E; using html.parser, the actual output differs from the output shown in the documentation.

Run the following code:

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)

parser = MyHTMLParser()
parser.feed('&gt;&#62;&#x3E;')

According to the documentation, the expected output should be:

Named ent: >
Num ent  : >
Num ent  : >

The actual output is:

Data     : >>>

@brianschubert
Contributor

Hi! This is the expected behavior. You need to set convert_charrefs to False in order for handle_entityref / handle_charref to be called (the default is True):

>>> parser = MyHTMLParser(convert_charrefs=False)
>>> parser.feed('&gt;&#62;&#x3E;')
Named ent: >
Num ent  : >
Num ent  : >

From the docs:

HTMLParser.handle_entityref(name)
This method is called to process a named character reference of the form &name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt'). This method is never called if convert_charrefs is True.

HTMLParser.handle_charref(name)
This method is called to process decimal and hexadecimal numeric character references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent for &gt; is &#62;, whereas the hexadecimal is &#x3E;; in this case the method will receive '62' or 'x3E'. This method is never called if convert_charrefs is True.
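
To make the difference between the two modes concrete, here is a minimal sketch that records which handlers fire for the same input. EventRecorder and record are hypothetical names introduced only for this illustration; the sketch relies solely on the convert_charrefs keyword and the handler methods quoted above.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: stores each handler call as a (kind, value) tuple."""

    def __init__(self, *, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def record(markup, *, convert_charrefs):
    """Parse markup once and return the recorded handler calls."""
    parser = EventRecorder(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()  # flush any text the parser may still be buffering
    return parser.events


markup = "&gt;&#62;&#x3E;"

# Default mode: references are converted and delivered as ordinary data.
print(record(markup, convert_charrefs=True))
# [('data', '>>>')]

# Opt-out mode: each reference is routed to its dedicated handler.
print(record(markup, convert_charrefs=False))
# [('entityref', 'gt'), ('charref', '62'), ('charref', 'x3E')]
```

With the default setting the three references arrive as a single handle_data call, which matches the "Data     : >>>" output reported above.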

@mouchen626
Author

Thank you, I understand now. However, the example code at the bottom of this documentation page does not explicitly set convert_charrefs=False, yet its documented output looks as if convert_charrefs were False. This confused me. Would it be better to refine this example for clarity?

@brianschubert
Contributor

Ah, in that case I agree the example should be clarified. It looks like it wasn't updated when the default for convert_charrefs was changed from False to True in Python 3.5.

It would be good to migrate the examples to .. doctest:: blocks at the same time to help catch this sort of discrepancy.
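
As a rough illustration of why executable examples would have caught this, the sketch below uses the standard-library doctest module (not the Sphinx .. doctest:: directive itself) to run a stale and a corrected version of the example. STALE_EXAMPLE, FIXED_EXAMPLE, and count_failures are hypothetical names, and the handlers print the raw reference names rather than the decoded characters to keep the sketch short.

```python
import doctest
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Data     :", data)

    def handle_entityref(self, name):
        print("Named ent:", name)

    def handle_charref(self, name):
        print("Num ent  :", name)


# Stale example: relies on the pre-3.5 default of convert_charrefs=False.
STALE_EXAMPLE = """
>>> parser = MyHTMLParser()
>>> parser.feed('&gt;&#62;&#x3E;')
Named ent: gt
Num ent  : 62
Num ent  : x3E
"""

# Corrected example: requests the old behavior explicitly.
FIXED_EXAMPLE = """
>>> parser = MyHTMLParser(convert_charrefs=False)
>>> parser.feed('&gt;&#62;&#x3E;')
Named ent: gt
Num ent  : 62
Num ent  : x3E
"""


def count_failures(example_text):
    """Run one doctest snippet and return the number of failing examples."""
    test = doctest.DocTestParser().get_doctest(
        example_text, {"MyHTMLParser": MyHTMLParser}, "example", None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test, out=lambda text: None)  # silence the report
    return results.failed


print(count_failures(STALE_EXAMPLE))  # 1: the feed() output no longer matches
print(count_failures(FIXED_EXAMPLE))  # 0
```

Run as part of a documentation build, a check of this kind would flag the stale output as soon as the default changed.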

@encukou encukou added type-feature A feature request or enhancement docs Documentation in the Doc dir labels Mar 21, 2025
@picnixz picnixz added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels Mar 22, 2025
brianschubert added a commit to brianschubert/cpython that referenced this issue Apr 19, 2025
@serhiy-storchaka serhiy-storchaka added 3.13 bugs and security fixes 3.14 bugs and security fixes 3.15 new features, bugs and security fixes labels May 7, 2025
serhiy-storchaka added a commit to brianschubert/cpython that referenced this issue May 7, 2025
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 7, 2025
… doctests (pythonGH-131551)

(cherry picked from commit ee76e36)

Co-authored-by: Brian Schubert <[email protected]>
brianschubert added a commit to brianschubert/cpython that referenced this issue May 7, 2025
…xamples doctests (pythonGH-131551)

(cherry picked from commit ee76e36)

Co-authored-by: Brian Schubert <[email protected]>
serhiy-storchaka pushed a commit that referenced this issue May 7, 2025
…s doctests (GH-131551) (GH-133587)

(cherry picked from commit ee76e36)

Co-authored-by: Brian Schubert <[email protected]>
serhiy-storchaka pushed a commit that referenced this issue May 7, 2025