email.message get_payload throws UnicodeEncodeError with some surrogate Unicode characters #94606
I tracked this down to code in email/message.py that calls utils._has_surrogates(payload) and, when it returns true, encodes payload with surrogateescape.
This looks like it is supposed to check whether payload is a string that was created by decoding with surrogateescape and, if so, convert it back to bytes the same way. PEP 383 is clear that surrogateescape is only for round-tripping: decode, then encode. Encoding with surrogateescape should never be applied to strings that were not created by decoding with surrogateescape.

The problem is that utils._has_surrogates(payload) is only a fast heuristic when it is used to guess whether payload was produced by decoding with surrogateescape. The function actually flags strings that contain any Unicode surrogate code point. However, strings that round-trip through the ASCII codec with surrogateescape only have surrogates in the range U+DC80 through U+DCFF, and their remaining characters are all 7-bit ASCII. If a string either has a non-ASCII character in addition to a surrogate, or has a surrogate outside that range, utils._has_surrogates() returns true but the encode raises UnicodeEncodeError.

This should be fixable by catching the exception raised by the encode and proceeding to the same result the code would produce if utils._has_surrogates(payload) had returned false. I'll submit a PR for that after testing.
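As a rough illustration of the approach described above (not the actual patch), the sketch below pairs a broad surrogate check with an encode that falls back when the string is not a valid surrogateescape round trip. The names has_any_surrogate and payload_to_bytes are hypothetical; has_any_surrogate is only assumed to behave like the private helper utils._has_surrogates.

```python
import re
from typing import Optional

# Hypothetical stand-in for the private helper email.utils._has_surrogates:
# it flags any string containing a code point in U+D800..U+DFFF.
_SURROGATE_RE = re.compile(r'[\ud800-\udfff]')


def has_any_surrogate(text: str) -> bool:
    """Fast heuristic: True if text contains any surrogate code point."""
    return _SURROGATE_RE.search(text) is not None


def payload_to_bytes(payload: str) -> Optional[bytes]:
    """Return the original bytes if payload is a valid surrogateescape
    round trip, otherwise None so the caller can take the same path it
    would take for a string without surrogates."""
    if not has_any_surrogate(payload):
        return None
    try:
        return payload.encode('ascii', 'surrogateescape')
    except UnicodeEncodeError:
        # The heuristic said "yes", but the string was not produced by
        # decoding with surrogateescape; treat it as a plain string.
        return None
```

With this shape, a string such as b'caf\xe9'.decode('ascii', 'surrogateescape') converts back to its original bytes, while a string that mixes a surrogate with a non-ASCII character yields None instead of raising.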
…ot valid surrogateescaped string
…escaped string (GH-94641) Co-authored-by: Serhiy Storchaka <[email protected]>
…rogateescaped string (pythonGH-94641) (cherry picked from commit 27a5fd8) Co-authored-by: Sidney Markowitz <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
…rrogateescaped string (GH-94641) (GH-112972) (cherry picked from commit 27a5fd8) Co-authored-by: Sidney Markowitz <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
…rrogateescaped string (GH-94641) (GH-112971) (cherry picked from commit 27a5fd8) Co-authored-by: Sidney Markowitz <[email protected]> Co-authored-by: Serhiy Storchaka <[email protected]>
…rogateescaped string (pythonGH-94641) Co-authored-by: Serhiy Storchaka <[email protected]>
email.message get_payload gets a UnicodeEncodeError if the message body contains a line that has either:
a Unicode surrogate code point that is valid for surrogateescape encoding (U+DC80 through U+DCFF) and a non-ASCII character
OR
a Unicode surrogate character that is not valid for surrogateescape encoding
Here is a minimal code example with one of the cases commented out:
On Python 3.10.5 on macOS this produces:
This was tested on Python 3.10.5 on macOS; however, I tracked it down from a report in the wild that involved Python 3.8 on Ubuntu 20.04 processing actual emails.
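The two failing conditions listed above can be checked without the email package, since they come down to how str.encode behaves with the surrogateescape error handler. The following is a small sketch under that assumption; is_surrogateescaped is a hypothetical helper written for illustration, not part of the standard library.

```python
def is_surrogateescaped(text: str) -> bool:
    """True if text contains at least one escaped byte (U+DC80..U+DCFF)
    and every other character is 7-bit ASCII, i.e. it can be encoded
    with text.encode('ascii', 'surrogateescape')."""
    has_escape = False
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            has_escape = True
        elif cp > 0x7F:
            # Non-ASCII character or a surrogate outside the escape range.
            return False
    return has_escape


# A valid round trip: the byte 0xEF becomes U+DCEF when decoded.
valid = b'na\xefve'.decode('ascii', 'surrogateescape')
# Case 1: an escape-range surrogate next to a non-ASCII character.
mixed = '\u00e9' + '\udcc3'
# Case 2: a surrogate outside U+DC80..U+DCFF.
out_of_range = '\ud800'

for sample in (valid, mixed, out_of_range):
    # ascii() keeps the output printable even for lone surrogates.
    print(ascii(sample), is_surrogateescaped(sample))
```

Only the first sample can be passed to encode('ascii', 'surrogateescape') safely; the other two raise UnicodeEncodeError, which matches the two cases in the report.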