-
-
Notifications
You must be signed in to change notification settings - Fork 32.1k
email: set_content() always assumes trailing EOL #121515
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I think this is more "odd behavior" than a bug. Anyway, it would be a breaking change to fix this. It's probably better to just update the documentation. FWIW, |
I agreed with @ZeroIntensity. |
Thank you both for the review! :D
Tiny correction: I was specifically comparing the behavior of Still there seems to be a tiny misunderstanding here... I think the misbehaving method is |
This is fixed by #121543 |
This feels like it is a bug, in that it breaks an implied symmetry between the modern body = "body content"
message = email.message.EmailMessage()
message.set_content(body)
assert message.get_content() == body
# AssertionError: 'body content\n' != 'body content' In practice, it probably doesn't matter that much: most real-world text/* content probably already ends with a newline, or at least won't have its meaning changed by adding one. Where it's most likely to come up is in test cases, like the one above. (Or like the large Django email test suite I've been updating to use the modern email API; apologies for the "referenced this issue" spam earlier.) This could be fixed—while maintaining compatibility—by introducing a new policy option. Something like (A precedent for adding new options like this is the recently introduced |
Sounds reasonable, but I would create a new issue (marked as a feature instead of a bug) to get some fresh eyes on it. |
Thank you so much for justifying my proposal! Seeing that there is hope for getting this into Python standard library, I'd like to give my two cents of thoughts :) First off though:
The symmetry between old and new APIs were only a cheap motivation for me to escalate this to an issue. ;P (Usually I don't dare to question decisions the Python core developers made...) However the real motivation behind this issue is simply that it can be done (depending on whether my further justification below is valid, of course). RFC imposes no restrictions on the trailing newlines (in fact it even explicitly allows for that possibility). As an email library (as opposed to a client) we should leave the decision on whether to "fix" an issue that is merely a matter of convention to the user... at least that's what I believe in. Another problem (as you addressed in your pull request in your Django repo; which by the way, I appreciate the reference!) is that there is virtually no way to work around this if one so insists that the email content do not end with an EOL without re-implementing everything: the line wrapping for variable max_line_length has to be implemented from scratch since Regarding the practicality of preserving the lack of trailing newline:
I agree that most text subtypes are not affected by this distinction. In Unix, it is even expected that files terminate with a line feed (e.g. One consideration is enabling the possibility round-trip conversion: on a purely theoretical basis, RFC only mandates that we represent line breaks as CRLF, and either invalidate or translate lone CR and LF characters. Thus the email library should only concern itself with the translation newline sequences into the canonical representation in Python, as the implementation of In practice, none of this matters if the text payload is the sole MIME part and the email is to be delivered to another mailbox (as is often the case): as the email message is transmitted over SMTP, that last CRLF will have to be appended for the server to acknowledge that the DATA command has terminated. (Python adds a trailing CRLF if the original email did not end with one, and I assume many other email clients do too.) For multipart messages however there is a difference, since contents that do not end with a newline can still be properly represented 1 and are thus "protected" when transmitted over SMTP --- this was my initial motivation for this proposal, because as a MIME part I see no reason to impose a trailing newline when the multipart syntax is perfectly capable of representing the lack of one. To push this further, even if SMTP limitations prevent the absence of trailing newline from being represented in text parts as the sole part of a message, the same does not apply to content transfer encodings like quoted-printable and base64. (The reason why I gave an example with quoted-printable rather than 7bit/8bit 2) base64 requires little explanation as it is capable of representing any arbitrary number of bytes. As for quoted-printable, the lack of trailing newline can be "protected" (similar to how multipart boundary works) by appending a soft wrap at the end. This is how the qprint command works, which I stumbled upon after sometime of writing up this issue 3: $ printf résumé | iconv -t iso-8859-1 | qprint -e | xxd
00000000: 723d 4539 7375 6d3d 4539 3d0d 0a r=E9sum=E9=.. This, in retrospect, is a lot better than how
For the record I think adding a keyword argument can also work. ( (Sorry if there are any typos as I do not have a lot of time to proofread all this /\ but I do want to get this out as soon as possible since this is an issue I care about.) Footnotes
|
@rapidcow appreciate your enthusiasm and deep dive into the labyrinth that is email RFCs. You might be aware that Python's modern email API has no guarantee of compatibility with the legacy API. So reports that compare the behavior of the two (or mix use of them) tend not to get much attention. I agree with you it's a bug that the modern API's
And personally, I think it would be appropriate to handle this as a bug, so long as there's also an option for users who need to retain the current buggy behavior. (Similar to what was done with verify_generated_headers. Part of the point of the modern API was to faithfully implement the email specs without needing to maintain compatibility with earlier bugs.) But those are just my opinions. And to be clear, I'm not part of the Python email team. (Also, to set expectations, those volunteers face a huge backlog of email related issues. It can take a very long time—months or even years—for email PRs to get any attention.)
I think a policy option might be preferable to a
You might raise this as a separate, unrelated feature request. And probably report it against |
Sorry it's been a while @medmunds, but I did not mean to ignore your response. College is starting soon and so I have a lot of stuff to deal with, but I'll try my best to make my point clear. ^^;
My proposal in this issue was focused not on the compatibility with the legacy API itself, but rather on the compliance with the design principles which the email library seems to follow. (When suggesting that we encode the empty string as Namely, Message objects are designed to be a model of RFC5322/MIME-conformant messages, while, as per Robustness Principle, enabling the representation of a loose superset of common yet invalid messages. As an interface to a model, the Yes, I tried to make a point that this deviates from the implementation of the now-obsolete def body_encode(body, maxlinelen=76, eol=NL):
"""Encode with quoted-printable, wrapping at maxlinelen characters."""
if not body:
return body
...
encoded_body = []
append = encoded_body.append
...
# add back final newline if present
if body[-1] in CRLF:
append('')
return eol.join(encoded_body) It is apparent that the original author intended to represent messages that lack an EOL. It is this intent to care and respect for user input that I find blatantly missing in the new email library code.
Very much agreed! :D
And those were mine too :) Honestly, I'm not ranting to you because I thought I'd make a difference; I'm only writing because doing this is not completely pointless, and someone is willing to accept that this is a bug of sorts (or whatever you call this... "removal of feature", I think XD).
You spotted that, haha! That was just a slightly sarcastic side rant of my many dissatisfactions with the content manager library, which I apologize :P (I try not to get opinionated, but um... it's hard to not get worked up over the sheer overwhelming number of arguments.) I'll try to lay down the real problem that I believe the content manager module has below: my point was that There are many reasons I speculate why this elaborate method signature was made, but the most important one is this: when I want to attach a MIME part, I have to set everything in one go, because # NOTE: to make this actually run in python 3.12.4+, replace
# part = msg.add_blah(...)
# with
# msg.add_blah(...)
# part = msg.get_payload()[-1]
# get_content() doesn't work because it doesn't know
# how to return content for a multipart message.
big_msg = EmailMessage()
text_part = big_msg.add_alternative('hello', 'plain', charset='iso-8859-1')
text_part.set_param('format', 'flowed')
text_part['Content-Description'] = 'a text part.'
text_part['Content-ID'] = '<[email protected]>'
html_part = big_msg.add_alternative('<p>hello peeps</p>', 'html', charset='us-ascii')
html_part['Content-Description'] = 'an HTML part.'
html_part['Content-ID'] = '<[email protected]>'
# being the weirdo i am, i sometimes add filename to inline parts :)
html_part['Content-Disposition'] = 'inline'
html_part.set_param('filename', 'peepsgreet.html', header='Content-Disposition') (NOTE: in spite of the substantially lengthy code, I am effectively only using three methods, responsible for three clearly defined actions: set body, set header, and set header parameter. I would like it better if I can just create the MIMEPart from scratch, but welp, ... I get to write this: big_msg = EmailMessage()
big_msg.add_alternative('hello', 'plain', charset='iso-8859-1',
params=dict(format='flowed'),
headers=['Content-Description: a text part.'],
cid='<[email protected]>')
big_msg.add_alternative('<p>hello peeps</p>', 'plain', charset='us-ascii',
headers=['Content-Description: an HTML part.'],
cid='<[email protected]>',
# I was, admittedly, being weird only to demonstrate that
# having disposition default to "attachment" for when
# filename is provided doesn't hide the the fact that
# you still need to specify it when it comes to peculiar
# combinations.
#
# I am, by no means, against this though.
# It's just that I believe a functionality so specific
# deserves to be managed singly and separately, in say,
# a separate function, rather than bundled with
# random arguments from which it cannot be separated.
disposition='inline', filename='peepsgreet.html') As you can see, each indented line above (save for "disposition" and "filename") is identical to calls to TL;DR: The new email library is an improvement in many ways but gives us, the users, too little control. And I especially don't appreciate the sentiment of deprecating the legacy API when parts like |
The issue seems to have a successful resolution at #121543. There was discussion for adding this as a switch, named along the lines of It was once tempting to think that a str with a CTE of base64, or even quoted-printable (if care is taken to escape special characters (and the lack thereof) with soft wrap (see RFC 2045 Section 6.7 Rule 5)), would be naïvely encoded with the given charset by raw_data_manager, the default content manager of EmailMessages. But since the PR above made it clear that text and binary content are differentiated on the basis of the data type rather than the CTE, there should be no such confusion anymore. It is unfortunate that the new email API masks the naïve legacy quoted-printable encoder (which conditionally appends the trailing newline with respect to the given text), though in hindsight, this is probably more consistent with most of Python's interface (e.g. the print function, where the end parameter is a newline by default. The difference here is there isn't an option to not have that trailing newline; but as I explained in the first paragraph, there probably isn't a strong practical reason to have such an option anyway.) As a final remark I just thought of (somewhat more relevant to the documentation PR): the empty string is treated as having one line and encoded with exactly one newline (and that happens even if you set CTE to base64); this is expected behavior. In fact, one can expect every email message whose content is created from a str to contain at least one newline (that being the |
Python's modern email API will force a trailing newline onto all text/* bodies and attachments. Updated mail tests to include (and check for) the newline while still using the legacy email API. See python/cpython#121515 which reasons that, apart from artificial test cases, most text content already ends in a newline. If it doesn't, adding one won't change the meaning.
Python's modern email API will force a trailing newline onto all text/* bodies and attachments. Updated mail tests to include (and check for) the newline while still using the legacy email API. See python/cpython#121515 which reasons that, apart from artificial test cases, most text content already ends in a newline. If it doesn't, adding one won't change the meaning.
Python's modern email API will force a trailing newline onto all text/* bodies and attachments. Updated mail tests to include (and check for) the newline while still using the legacy email API. See python/cpython#121515 which reasons that, apart from artificial test cases, most text content already ends in a newline. If it doesn't, adding one won't change the meaning.
Uh oh!
There was an error while loading. Please reload this page.
Bug report
Bug description:
The
set_content()
handler for Unicode strings does not differentiate between strings with and without a trailing EOL. As of Python v3.12.3 (and v3.12.4), these two calls are equivalent:...whereas
set_payload()
differentiates the two:I thought that this might not be intended, considering how, in this particular case where CTE is quoted-printable,
quoprimime.body_encode()
(whatset_payload()
ultimately calls to perform QP encoding) conditionally appends that last'\n'
depending on whether the original string ended with CR/LF. Meanwhile, the str set handler called byset_content()
appends the'\n'
(orlinesep
in the case of base64) unconditionally.I was a bit hesitant to submit this as a bug because a handful of unit tests depend on this implicit EOL; some of these tests I found were:
test_contentmanager.py
, line 309test_contentmanager.py
, line 506test_contentmanager.py
, lines 535–536test_contentmanager.py
, line 768test_generator.py
, line 363test_generator.py
, line 384However I can't think of a good reason why implicit EOL should be assumed... and IMHO unless there is a reason why users should be discouraged from setting content with no EOL, I think the current implementation deserves a reconsideration. :)
CPython versions tested on:
3.12
Operating systems tested on:
macOS
The text was updated successfully, but these errors were encountered: