Bug report
Bug description:
When using email.parser with the modern email.policy.default, the parser incorrectly inserts a space between adjacent rfc2047 encoded-words that are separated by folding whitespace. This can result in splitting words or names in unexpected places in the parsed headers.
Example:
Python 3.13.1 (main, Dec 3 2024, 17:59:52) [Clang 16.0.0 (clang-1600.0.26.4)] on darwin
>>> from email import message_from_bytes, policy
>>> from email.message import EmailMessage
>>> address = "Bérénice-Amélie Rosemonde Dûbois-Bénard <rose@example.com>"
>>> message = EmailMessage()
>>> message["From"] = address
>>> message_bytes = message.as_bytes()
>>> default_parsed = message_from_bytes(message_bytes, policy=policy.default)
>>> default_parsed_from = str(default_parsed["From"].addresses[0])
>>> assert default_parsed_from == address
Traceback (most recent call last):
  File "<python-input-22>", line 1, in <module>
    assert default_parsed_from == address
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
>>> print(default_parsed_from); print(address)
Bérénice-Amélie Rosemonde Dûbois-Béna rd <rose@example.com>
Bérénice-Amélie Rosemonde Dûbois-Bénard <rose@example.com>
>>> print(default_parsed["From"].addresses[0].display_name)
Bérénice-Amélie Rosemonde Dûbois-Béna rd
Notice the unexpected space in the parsed version. (I have more-problematic examples involving variations of the Scunthorpe problem, but they're not suitable for GitHub. This example uses Python's modern email API to generate the message, but the bug applies to parsing messages from any source.)
This seems to be caused by two consecutive rfc2047 encoded-words [edit: whether or not they cross a fold, see later comment]:
>>> print(message_bytes.decode())
From: =?utf-8?b?QsOpcsOpbmljZS1BbcOpbGllIFJvc2Vtb25kZSBEw7tib2lzLULDqW5h?=
 =?utf-8?q?rd?= <rose@example.com>
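To check the same-line case independently of the generator, a hand-written header can be parsed directly. This is only a minimal sketch: the sample header bytes and the variable names are made up for illustration, and what gets printed depends on whether the Python version in use is affected.

```python
from email import message_from_bytes, policy

# Hypothetical input: two adjacent encoded-words on one line (no fold).
# Per RFC 2047 they should decode to the single word "Dubois".
raw = b"From: =?utf-8?q?Dub?= =?utf-8?q?ois?= <rose@example.com>\r\n\r\n"

parsed = message_from_bytes(raw, policy=policy.default)
display_name = parsed["From"].addresses[0].display_name

# An affected parser is expected to show a space inside the name.
print(repr(display_name))
```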
RFC 2047 section 6.2 requires that any linear white space separating a pair of adjacent encoded-words be ignored when the header is displayed. (This exists specifically so that a long string can be split across encoded-words at any character, which is behavior that the policy.default email.generator relies on.)
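The rule can be illustrated with a small standalone decoder. This is a rough sketch of the section 6.2 behavior, not how email.parser is implemented: the regular expressions and the names `decode_encoded_words` and `_decode_word` are assumptions made for this example, and it ignores structured-header details such as quoted strings and comments.

```python
import base64
import quopri
import re

# One RFC 2047 encoded-word: =?charset?encoding?encoded-text?=
WORD = r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?="
ENCODED = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")
# White space (including a fold) between two adjacent encoded-words.
ADJACENT = re.compile(rf"({WORD})[ \t\r\n]+(?={WORD})")


def _decode_word(match):
    """Decode a single encoded-word to text."""
    charset, encoding, payload = match.groups()
    if encoding.lower() == "b":
        raw = base64.b64decode(payload)
    else:
        # header=True makes "_" decode to a space, as the Q encoding requires.
        raw = quopri.decodestring(payload.encode("ascii"), header=True)
    return raw.decode(charset)


def decode_encoded_words(value):
    """Drop white space between adjacent encoded-words, then decode each one."""
    joined = ADJACENT.sub(r"\1", value)
    return ENCODED.sub(_decode_word, joined)


print(decode_encoded_words("=?utf-8?q?Dub?= =?utf-8?q?ois?="))  # Dubois
```

White space between an encoded-word and ordinary text is left alone, so only the separator between two encoded-words disappears.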
Note that the legacy compat32 parser does not have this problem. (The legacy parser leaves RFC 2047 encoded-words undecoded, so decoding them requires a separate call to the legacy email.header.decode_header().)
>>> from email.header import decode_header
>>> compat32_parsed = message_from_bytes(message_bytes, policy=policy.compat32)
>>> compat32_parsed_from = "".join(
...     segment if charset is None and isinstance(segment, str)
...     else segment.decode(charset or "ascii")
...     for segment, charset in decode_header(compat32_parsed["From"])
... )
>>> assert compat32_parsed_from == address # (success)
>>> print("original:", address); print("compat32:", compat32_parsed_from); print(" default:", default_parsed_from)
original: Bérénice-Amélie Rosemonde Dûbois-Bénard <rose@example.com>
compat32: Bérénice-Amélie Rosemonde Dûbois-Bénard <rose@example.com>
 default: Bérénice-Amélie Rosemonde Dûbois-Béna rd <rose@example.com>
CPython versions tested on:
3.12, 3.13
Operating systems tested on:
Linux, macOS