-
-
Notifications
You must be signed in to change notification settings - Fork 31.9k
gh-104400: pygettext: use an AST parser instead of a tokenizer #104402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@ambv This is what I talked to you about at EuroPython. If you have time I'd be very happy if you could have a look :) The TL;DR is pygettext has a couple of bugs which stem from it using a tokenizer-based extraction (and overall the code needs modernizing). I fix those bugs in this PR by switching to a parser. Otherwise I try to keep the functionality as close as possible. I also added lots more tests which compare the entire output and not just the messages as it was previously. There are also lots of features missing in pygettext - handling ngettext, pgettext and others, format flags, etc.. Thank you! |
@tomasr8 would it be possible to break this PR up into several chunks / stages? You may have more luck with reviewers & progress -- I'm happy to help if wanted. A |
I can definitely give it a try! I think it'll be difficult to separate the actual change from tokens to AST, since that's kind of an all-or-nothing change but I could start with improving the tests first in a separate PR. That should be an added value regardless of whether the rest gets merged or not. I'll see if I can get a separate PR for the tests in the coming days. Any help/review is greatly appreciated of course! :) |
@AA-Turner I opened a separate PR just adding extra tests, if you wanna have a look ;) |
There was a recent issue in support of f-string #113604 , that mentioned this. AST makes this feature easier to add https://github.com/python/cpython/pull/108173/files already contains the required tests, can you remove them in this PR, so that the diff will be smaller and easier to review |
Don't waste your time reviewing this PR just yet, we should get the tests merged before moving on with this :) Actually, it's been a while since I opened the tests PR, it might need updating first.. |
Thanks for making the requested changes! @AA-Turner: please review the changes made to this pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not an amazing fan of reusing the same visitor for every file. There's not massive performance benefits, and it means we have tricks with resetting the filename, etc.
However, if you'd prefer to keep the current design, could we add a walkabout
or initiate_visit
or etc method that takes the node tree and filename, to conceptually keep the two together during tree traversal?
Tools/i18n/pygettext.py
Outdated
if ttype == tokenize.NAME and tstring in ('class', 'def'): | ||
self.__state = self.__ignorenext | ||
class GettextVisitor(ast.NodeVisitor): | ||
def __init__(self, options, filename=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def __init__(self, options, filename=None): | |
def __init__(self, options): |
Tools/i18n/pygettext.py
Outdated
def __init__(self, options, filename=None): | ||
super().__init__() | ||
self.options = options | ||
self.filename = filename |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The filename argument is never used
self.filename = filename | |
self.filename: str = '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! It is now much simpler. And it works more correctly!
There are still some issues with var-positional arguments. Since we cannot determine the argument position in such case, it should be rejected.
I didn't really do this for performance but so that I don't need to pass the |
(The Windows failure is some unrelated build failure) |
Woohoo! 🎉🎉 it took almost two years but we got there :) Thanks @serhiy-storchaka and @AA-Turner for your patience and help! |
Congratulations on getting this through, it's a great improvement! A |
I think pushing other changes (tests, etc.) before this helped with that. |
This PR replaces the token-based message extraction with one that uses the AST parser instead.
See the issue or the forum discussion for more info.
This change fixes some issues just by virtue of using AST instead of working directly with tokens:
"""Hello, {}!""".format('world')
are no longer extractedinspect.cleandoc()
viaast.get_docstring()
I added a CLI argument--charset
(same as in pybabel and--from-code
in xgettext) to force a file encoding, e.g.--charset=utf-8
will open the source files withutf-8
encoding. This is useful because currently we are relying on the system default which is error-prone. For example on Windows,open()
in my locale defaults tocp1250
which mangles uputf-8
files and vice versa (with someUnicodeDecoreError
s in between).-Xutf8
)This PR has lots more tests to make sure we don't regress on anything. The tests now compare the script output to a
.po
file rather than just comparing themsgid
s (basically snapshot tests). This ensures that we also catch issues with formatting, line locations or anything else.@warsaw if you feel like having a look (or anyone else ;))