Skip to content

email: set_content() always assumes trailing EOL #121515

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rapidcow opened this issue Jul 8, 2024 · 10 comments
Closed

email: set_content() always assumes trailing EOL #121515

rapidcow opened this issue Jul 8, 2024 · 10 comments
Labels
topic-email type-bug An unexpected behavior, bug, or error

Comments

@rapidcow
Copy link
Contributor

rapidcow commented Jul 8, 2024

Bug report

Bug description:

The set_content() handler for Unicode strings does not differentiate between strings with and without a trailing EOL. As of Python v3.12.3 (and v3.12.4), these two calls are equivalent:

# Note: the following code blocks are meant to be run in REPL
# (I left out the >>> prompts to make copying easier)
from email.message import *; msg = EmailMessage()
msg.set_content('resumé', charset='latin-1', cte='quoted-printable')
msg.get_content() # 'resumé\n'

msg.set_content('resumé\n', charset='latin-1', cte='quoted-printable')
msg.get_content() # 'resumé\n'

...whereas set_payload() differentiates the two:

# Note: The legacy 'set_charset()' method encodes
# latin-1 charset with QP encoding by default
msg.clear_content()
msg.set_payload('resumé', charset='latin-1')
msg.get_content() # 'resumé'

msg.clear_content()
msg.set_payload('resumé\n', charset='latin-1')
msg.get_content() # 'resumé\n'

I thought that this might not be intended, considering how, in this particular case where CTE is quoted-printable, quoprimime.body_encode() (what set_payload() ultimately calls to perform QP encoding) conditionally appends that last '\n' depending on whether the original string ended with CR/LF. Meanwhile, the str set handler called by set_content() appends the '\n' (or linesep in the case of base64) unconditionally.

I was a bit hesitant to submit this as a bug because a handful of unit tests depend on this implicit EOL; some of these tests I found were:

However I can't think of a good reason why implicit EOL should be assumed... and IMHO unless there is a reason why users should be discouraged from setting content with no EOL, I think the current implementation deserves a reconsideration. :)

CPython versions tested on:

3.12

Operating systems tested on:

macOS

@rapidcow rapidcow added the type-bug An unexpected behavior, bug, or error label Jul 8, 2024
@ZeroIntensity
Copy link
Member

ZeroIntensity commented Jul 9, 2024

I think this is more "odd behavior" than a bug. Anyway, it would be a breaking change to fix this. It's probably better to just update the documentation. FWIW, get_payload is a legacy method, we shouldn't try and fix things on it.

@Eclips4
Copy link
Member

Eclips4 commented Jul 9, 2024

I agreed with @ZeroIntensity.
The best thing we can do here is to document the behaviour of trailing newline in docs.

@rapidcow
Copy link
Contributor Author

rapidcow commented Jul 10, 2024

Thank you both for the review! :D

FWIW, get_payload is a legacy method, we shouldn't try and fix things on it.

Tiny correction: I was specifically comparing the behavior of set_content() to set_payload(), not get_payload(). But otherwise yes -- *_payload() methods are legacy methods and I understand we should not change those.

Still there seems to be a tiny misunderstanding here... I think the misbehaving method is set_content(), not set_payload(), because it fails to account for the absence of trailing newline. Whether that would be considered breaking change is beyond my expertise, but just to clarify I am not proposing to change the legacy interface.

@ZeroIntensity
Copy link
Member

This is fixed by #121543

@medmunds
Copy link
Contributor

medmunds commented Aug 21, 2024

This feels like it is a bug, in that it breaks an implied symmetry between the modern set_content() and get_content() APIs:

body = "body content"
message = email.message.EmailMessage()
message.set_content(body)
assert message.get_content() == body
# AssertionError: 'body content\n' != 'body content'

In practice, it probably doesn't matter that much: most real-world text/* content probably already ends with a newline, or at least won't have its meaning changed by adding one.

Where it's most likely to come up is in test cases, like the one above. (Or like the large Django email test suite I've been updating to use the modern email API; apologies for the "referenced this issue" spam earlier.)

This could be fixed—while maintaining compatibility—by introducing a new policy option. Something like force_trailing_text_eol. Then the relevant code in content_manager._encode_text() could be changed to only add a trailing newline if that policy option is enabled (or the original string already had one).

(A precedent for adding new options like this is the recently introduced verify_generated_headers policy option, which can be used to restore earlier buggy behavior.)

@ZeroIntensity
Copy link
Member

Sounds reasonable, but I would create a new issue (marked as a feature instead of a bug) to get some fresh eyes on it.

@rapidcow
Copy link
Contributor Author

Thank you so much for justifying my proposal! Seeing that there is hope for getting this into Python standard library, I'd like to give my two cents of thoughts :)

First off though:

This feels like it is a bug, in that it breaks an implied symmetry between the modern set_content() and get_content() APIs.

The symmetry between old and new APIs were only a cheap motivation for me to escalate this to an issue. ;P (Usually I don't dare to question decisions the Python core developers made...)

However the real motivation behind this issue is simply that it can be done (depending on whether my further justification below is valid, of course). RFC imposes no restrictions on the trailing newlines (in fact it even explicitly allows for that possibility). As an email library (as opposed to a client) we should leave the decision on whether to "fix" an issue that is merely a matter of convention to the user... at least that's what I believe in.

Another problem (as you addressed in your pull request in your Django repo; which by the way, I appreciate the reference!) is that there is virtually no way to work around this if one so insists that the email content do not end with an EOL without re-implementing everything: the line wrapping for variable max_line_length has to be implemented from scratch since email.quoprimime is an undocumented module and you supposedly should not depend on that. Not to mention line-wrapping for base64 which is nowhere to be seen (and given that the implementation is email.base64mime is wrong even, so that email.contentmanager uses a private re-implementation of that, which is something you can rely on even less...) If it were the other way around, you can still enforce a newline by... appending a newline before passing to set_content() if one isn't there already.

Regarding the practicality of preserving the lack of trailing newline:

In practice, it probably doesn't matter that much: most real-world text/* content probably already ends with a newline, or at least won't have its meaning changed by adding one.

Where it's most likely to come up is in test cases, like the one above.

I agree that most text subtypes are not affected by this distinction. In Unix, it is even expected that files terminate with a line feed (e.g. wc -l). Contents where EOL matters are more or less binary, and probably shouldn't be encoded as text (I am guilty for doing this...)

One consideration is enabling the possibility round-trip conversion: on a purely theoretical basis, RFC only mandates that we represent line breaks as CRLF, and either invalidate or translate lone CR and LF characters. Thus the email library should only concern itself with the translation newline sequences into the canonical representation in Python, as the implementation of set_content() did.

In practice, none of this matters if the text payload is the sole MIME part and the email is to be delivered to another mailbox (as is often the case): as the email message is transmitted over SMTP, that last CRLF will have to be appended for the server to acknowledge that the DATA command has terminated. (Python adds a trailing CRLF if the original email did not end with one, and I assume many other email clients do too.) For multipart messages however there is a difference, since contents that do not end with a newline can still be properly represented 1 and are thus "protected" when transmitted over SMTP --- this was my initial motivation for this proposal, because as a MIME part I see no reason to impose a trailing newline when the multipart syntax is perfectly capable of representing the lack of one.

To push this further, even if SMTP limitations prevent the absence of trailing newline from being represented in text parts as the sole part of a message, the same does not apply to content transfer encodings like quoted-printable and base64. (The reason why I gave an example with quoted-printable rather than 7bit/8bit 2) base64 requires little explanation as it is capable of representing any arbitrary number of bytes. As for quoted-printable, the lack of trailing newline can be "protected" (similar to how multipart boundary works) by appending a soft wrap at the end. This is how the qprint command works, which I stumbled upon after sometime of writing up this issue 3:

$ printf résumé | iconv -t iso-8859-1 | qprint -e | xxd
00000000: 723d 4539 7375 6d3d 4539 3d0d 0a         r=E9sum=E9=..

This, in retrospect, is a lot better than how set_payload() handled it. (I guess this was a feature request, after all! XD)

This could be fixed—while maintaining compatibility—by introducing a new policy option. Something like force_trailing_text_eol.

For the record I think adding a keyword argument can also work. (set_content() is missing a bunch of parameters 4 for content disposition anyways...) As for the name, Vim has an option called fixeol that does precisely this (or the longer name fixendofline), so to make it fit the readable names in Python, I suppose we can call it fix_trailing_newline? :) There is also Sublime Text's ensure_newline_at_eof_on_save, which is another word we can use.

(Sorry if there are any typos as I do not have a lot of time to proofread all this /\ but I do want to get this out as soon as possible since this is an issue I care about.)

Footnotes

  1. RFC 2046 5.1.1 writes: "NOTE: The CRLF preceding the boundary delimiter line is conceptually attached to the boundary so that it is possible to have a part that does not end with a CRLF (line break). Body parts that must be considered to end with line breaks, therefore, must have two CRLFs preceding the boundary delimiter line" (emphasis mine).

  2. Very ironically though, the "breaking" test cases I found were all using 7bit/8bit CTE. I apologize for misguiding everyone! It appears that the end-of-line fix is necessary in those cases (unless, of course, it is a MIME part, which is an easy exception we can make with isinstance(message, MIMEPart).)

  3. I just realized that I forgot the second accent mark in "résumé" in my original example...

  4. In the same RFC where the Content-Disposition parameter filename is defined, the parameters creation-date, modification-date, read-date, size are also defined. They are lesser known but it doesn't stop me from questioning why they were left out despite being standard!

@medmunds
Copy link
Contributor

@rapidcow appreciate your enthusiasm and deep dive into the labyrinth that is email RFCs.

You might be aware that Python's modern email API has no guarantee of compatibility with the legacy API. So reports that compare the behavior of the two (or mix use of them) tend not to get much attention.

I agree with you it's a bug that the modern API's set_content() forces trailing line breaks in MIME text parts, based solely on:

  1. RFC 2046's specific support for text parts that don't end in line breaks (the note you identified)
  2. The fact that get_content() does not always return the value passed to set_content()

And personally, I think it would be appropriate to handle this as a bug, so long as there's also an option for users who need to retain the current buggy behavior. (Similar to what was done with verify_generated_headers. Part of the point of the modern API was to faithfully implement the email specs without needing to maintain compatibility with earlier bugs.)

But those are just my opinions. And to be clear, I'm not part of the Python email team. (Also, to set expectations, those volunteers face a huge backlog of email related issues. It can take a very long time—months or even years—for email PRs to get any attention.)

I think adding a keyword argument can also work

I think a policy option might be preferable to a set_content() arg, in case the message is later refolded in a generator. (Otherwise refolding might reintroduce the trailing line break. But I could be misunderstanding that.)

set_content() is missing a bunch of parameters for content disposition

You might raise this as a separate, unrelated feature request. And probably report it against add_attachment() where one is more likely to want the content-disposition params. (There is a params arg, but it adds params to content-type, not content-disposition.)

@rapidcow
Copy link
Contributor Author

rapidcow commented Aug 31, 2024

Sorry it's been a while @medmunds, but I did not mean to ignore your response. College is starting soon and so I have a lot of stuff to deal with, but I'll try my best to make my point clear. ^^;

You might be aware that Python's modern email API has no guarantee of compatibility with the legacy API. So reports that compare the behavior of the two (or mix use of them) tend not to get much attention.

My proposal in this issue was focused not on the compatibility with the legacy API itself, but rather on the compliance with the design principles which the email library seems to follow. (When suggesting that we encode the empty string as =\r\n in quoted-printable CTE, I am not at all concerned with compatibility with the legacy API. Granted, I have started this proposal on the grounds of how the legacy API did it, but I have better reasons to believe that this change would result in more desirable behavior.)

Namely, Message objects are designed to be a model of RFC5322/MIME-conformant messages, while, as per Robustness Principle, enabling the representation of a loose superset of common yet invalid messages. As an interface to a model, the set_content() method should only concern itself with the validation and representation of user input, and nothing more. I've mentioned that performing EOL translation is obligatory to ensure that the content of text/* messages conforms with RFC 2046 Section 4.1.1 - that aligns with the API's duty to validate user input. Enforcing EOL at the end of the content of text/* messages is, as I tried to argue, neither a necessary step for validation (not by RFC with regards to any standard track specification or convention - at least to my knowledge) nor a satisfactory attempt to accurately represent user input. My rationale for this proposal is as simple as that a lack of EOL can (and should) be represented upon user's request.

Yes, I tried to make a point that this deviates from the implementation of the now-obsolete set_payload() in the legacy API. But if you haven't already, I would kindly ask you to examine this part of the code I have linked in my original post (the comment + the two lines following it):

def body_encode(body, maxlinelen=76, eol=NL):
    """Encode with quoted-printable, wrapping at maxlinelen characters."""
    if not body:
        return body
    ...
    encoded_body = []
    append = encoded_body.append
    ...
    # add back final newline if present
    if body[-1] in CRLF:
        append('')
    return eol.join(encoded_body)

It is apparent that the original author intended to represent messages that lack an EOL. It is this intent to care and respect for user input that I find blatantly missing in the new email library code.

And personally, I think it would be appropriate to handle this as a bug, so long as there's also an option for users who need to retain the current buggy behavior.

Very much agreed! :D

But those are just my opinions. And to be clear, I'm not part of the Python email team. (Also, to set expectations, those volunteers face a huge backlog of email related issues. It can take a very long time—months or even years—for email PRs to get any attention.)

And those were mine too :) Honestly, I'm not ranting to you because I thought I'd make a difference; I'm only writing because doing this is not completely pointless, and someone is willing to accept that this is a bug of sorts (or whatever you call this... "removal of feature", I think XD).

set_content() is missing a bunch of parameters for content disposition

You might raise this as a separate, unrelated feature request. And probably report it against add_attachment() where one is more likely to want the content-disposition params. (There is a params arg, but it adds params to content-type, not content-disposition.)

You spotted that, haha! That was just a slightly sarcastic side rant of my many dissatisfactions with the content manager library, which I apologize :P (I try not to get opinionated, but um... it's hard to not get worked up over the sheer overwhelming number of arguments.)

I'll try to lay down the real problem that I believe the content manager module has below: my point was that set_content() tries to assume too many responsibilities at once. With set_payload() it affects at most two places: payload and the charset parameter. With set_content() it affects... well, everything.

There are many reasons I speculate why this elaborate method signature was made, but the most important one is this: when I want to attach a MIME part, I have to set everything in one go, because add_attachment(), add_related(), add_alternative() all return nothing. So (assuming that I am not allowed to use any legacy API methods), instead of taking my time, setting all the fields I want to set with the well-established methods that fulfill a single dedicated job...

# NOTE: to make this actually run in python 3.12.4+, replace
#   part = msg.add_blah(...)
# with
#   msg.add_blah(...)
#   part = msg.get_payload()[-1]
# get_content() doesn't work because it doesn't know
# how to return content for a multipart message.
big_msg = EmailMessage()

text_part = big_msg.add_alternative('hello', 'plain', charset='iso-8859-1')
text_part.set_param('format', 'flowed')
text_part['Content-Description'] = 'a text part.'
text_part['Content-ID'] = '<[email protected]>'

html_part = big_msg.add_alternative('<p>hello peeps</p>', 'html', charset='us-ascii')
html_part['Content-Description'] = 'an HTML part.'
html_part['Content-ID'] = '<[email protected]>'
# being the weirdo i am, i sometimes add filename to inline parts :)
html_part['Content-Disposition'] = 'inline'
html_part.set_param('filename', 'peepsgreet.html', header='Content-Disposition')

(NOTE: in spite of the substantially lengthy code, I am effectively only using three methods, responsible for three clearly defined actions: set body, set header, and set header parameter. I would like it better if I can just create the MIMEPart from scratch, but welp, attach() is marked "legacy", and set_content() takes no lists)

... I get to write this:

big_msg = EmailMessage()

big_msg.add_alternative('hello', 'plain', charset='iso-8859-1',
  params=dict(format='flowed'),
  headers=['Content-Description: a text part.'],
  cid='<[email protected]>')

big_msg.add_alternative('<p>hello peeps</p>', 'plain', charset='us-ascii',
  headers=['Content-Description: an HTML part.'],
  cid='<[email protected]>',
  # I was, admittedly, being weird only to demonstrate that
  # having disposition default to "attachment" for when
  # filename is provided doesn't hide the the fact that
  # you still need to specify it when it comes to peculiar
  # combinations.
  #
  # I am, by no means, against this though.
  # It's just that I believe a functionality so specific
  # deserves to be managed singly and separately, in say,
  # a separate function, rather than bundled with
  # random arguments from which it cannot be separated.
  disposition='inline', filename='peepsgreet.html')

As you can see, each indented line above (save for "disposition" and "filename") is identical to calls to __setitem__() and set_param() above. They are shorter to type, sure - but that is no business for the standard library IMO. Users can define their own "content manager" if it is merely a bundle of convenient calls to set the MIME headers by their full name, but I would say leave it to the users. (And for goodness sake, don't encourage this sort of everything-function paradigm! I mean, I would have at least appreciated a separate set_disposition() method that does the non-trivial default disposition inference logic! XD)

TL;DR: The new email library is an improvement in many ways but gives us, the users, too little control. And I especially don't appreciate the sentiment of deprecating the legacy API when parts like attach() and get_payload() (for multipart messages (with the exception of quirkily represented embedded messages like message/delivery-status and message/rfc822)) work perfectly fine even in the settings of the new API, while introducing new problems (multiple responsibilities, being hard to implement and hard to extend (I mean set_content() and get_content() are meant to be hooks), and - worst of all - inability to represent perfectly valid RFC text content)!

@rapidcow
Copy link
Contributor Author

rapidcow commented Jan 20, 2025

The issue seems to have a successful resolution at #121543. There was discussion for adding this as a switch, named along the lines of force_trailing_text_eol and fix_trailing_newline. But I imagine there won't be great uses for this feature: files whose content that ought to be preserved exactly are often encoded in base64 anyways; and the fact that the content type is assumed to be text/* for str content does it clear that it cares about representation of text as lines of Unicode characters; and lines ought to be terminated by CRLF in transit (see RFC 2046 Section 4.1.1 and RFC 2049 Section 4.)

It was once tempting to think that a str with a CTE of base64, or even quoted-printable (if care is taken to escape special characters (and the lack thereof) with soft wrap (see RFC 2045 Section 6.7 Rule 5)), would be naïvely encoded with the given charset by raw_data_manager, the default content manager of EmailMessages. But since the PR above made it clear that text and binary content are differentiated on the basis of the data type rather than the CTE, there should be no such confusion anymore.

It is unfortunate that the new email API masks the naïve legacy quoted-printable encoder (which conditionally appends the trailing newline with respect to the given text), though in hindsight, this is probably more consistent with most of Python's interface (e.g. the print function, where the end parameter is a newline by default. The difference here is there isn't an option to not have that trailing newline; but as I explained in the first paragraph, there probably isn't a strong practical reason to have such an option anyway.)

As a final remark I just thought of (somewhat more relevant to the documentation PR): the empty string is treated as having one line and encoded with exactly one newline (and that happens even if you set CTE to base64); this is expected behavior. In fact, one can expect every email message whose content is created from a str to contain at least one newline (that being the linesep of the current policy); even though raw_data_manager's set_content() does use str.splitlines() as I noted in email.contentmanager.rst, concatenation is done by linesep.join(lines) + linesep, rather than ''.join(linesep + line for line in lines) (what one might suspect as the more natural way of creating newline-delimited text) thus forcing a newline on all strings including the empty string. (Come to think of it, it was not the best idea to reveal implementation detail in the documentation. I wanted to describe what is considered to be a line terminator by set_content in a terse manner (especially given that Python is Unicode-aware and looks for more than CR/LF/CRLF), but the fact that str.splitlines() returns the empty list on an empty string probably makes this behavior more surprising than it is obvious.)

sarahboyce pushed a commit to sarahboyce/django that referenced this issue Apr 8, 2025
Python's modern email API will force a trailing newline onto all text/*
bodies and attachments. Updated mail tests to include (and check for)
the newline while still using the legacy email API.

See python/cpython#121515 which reasons that,
apart from artificial test cases, most text content already ends in a
newline. If it doesn't, adding one won't change the meaning.
sarahboyce pushed a commit to django/django that referenced this issue Apr 9, 2025
Python's modern email API will force a trailing newline onto all text/*
bodies and attachments. Updated mail tests to include (and check for)
the newline while still using the legacy email API.

See python/cpython#121515 which reasons that,
apart from artificial test cases, most text content already ends in a
newline. If it doesn't, adding one won't change the meaning.
gmaOCR pushed a commit to gmaOCR/django that referenced this issue Apr 16, 2025
Python's modern email API will force a trailing newline onto all text/*
bodies and attachments. Updated mail tests to include (and check for)
the newline while still using the legacy email API.

See python/cpython#121515 which reasons that,
apart from artificial test cases, most text content already ends in a
newline. If it doesn't, adding one won't change the meaning.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-email type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

5 participants