email.parser can insert extraneous spaces when parsing rfc2047 headers with policy.default

Bug report

Bug description:

When using email.parser with the modern email.policy.default, the parser incorrectly inserts a space between adjacent rfc2047 encoded-words that are separated by whitespace. This can split words or names in unexpected places in the parsed headers.

Example:

Python 3.13.1 (main, Dec  3 2024, 17:59:52) [Clang 16.0.0 (clang-1600.0.26.4)] on darwin
>>> from email import message_from_bytes, policy
>>> from email.message import EmailMessage

>>> address = "Bérénice-Amélie Rosemonde Dûbois-Bénard <rose@example.com>"
>>> message = EmailMessage()
>>> message["From"] = address
>>> message_bytes = message.as_bytes()

>>> default_parsed = message_from_bytes(message_bytes, policy=policy.default)
>>> default_parsed_from = str(default_parsed["From"].addresses[0])  # Address never compares equal to a str, so compare strings
>>> assert default_parsed_from == address
Traceback (most recent call last):
  File "<python-input-22>", line 1, in <module>
    assert default_parsed_from == address
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

>>> print(default_parsed_from); print(address)
Bérénice-Amélie Rosemonde Dûbois-Béna rd <rose@example.com>
Bérénice-Amélie Rosemonde Dûbois-Bénard <rose@example.com>

>>> print(default_parsed["From"].addresses[0].display_name)
Bérénice-Amélie Rosemonde Dûbois-Béna rd

Notice the unexpected space in the parsed version. (I have more-problematic examples involving variations of the Scunthorpe problem, but they're not suitable for GitHub. This example uses Python's modern email API to generate the message, but the bug applies to parsing messages from any source.)

This seems to be caused by two consecutive rfc2047 encoded-words [edit: whether or not they cross a fold, see later comment]:

>>> print(message_bytes.decode())
From: =?utf-8?b?QsOpcsOpbmljZS1BbcOpbGllIFJvc2Vtb25kZSBEw7tib2lzLULDqW5h?=
 =?utf-8?q?rd?= <rose@example.com>

RFC 2047 section 6.2 requires that any linear whitespace separating a pair of adjacent encoded-words be ignored when the header is displayed. (This exists specifically to allow splitting encoded text at any character, which is behavior the policy.default email.generator relies on.)
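
To make the section 6.2 rule concrete, here is a minimal sketch of a decoder that drops whitespace only when it sits between two encoded-words. It is an illustration, not CPython's implementation: the names decode_rfc2047, ENCODED_WORD and BETWEEN_ENCODED_WORDS are hypothetical, and the sketch ignores address structure, comments, and RFC 2231 language tags.

```python
import base64
import quopri
import re

# One RFC 2047 encoded-word: =?charset?encoding?payload?=
_EW = r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?="
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")
# Whitespace (including a folded line break) between two encoded-words.
BETWEEN_ENCODED_WORDS = re.compile(rf"({_EW})[ \t\r\n]+(?={_EW})")


def _decode_word(match):
    """Decode a single encoded-word match to text."""
    charset, encoding, payload = match.groups()
    if encoding.lower() == "b":
        raw = base64.b64decode(payload)
    else:
        # header=True makes "_" decode to a space, as the "Q" encoding requires.
        raw = quopri.decodestring(payload.encode("ascii"), header=True)
    return raw.decode(charset)


def decode_rfc2047(value):
    """Hypothetical helper: decode encoded-words in an unstructured value."""
    # Step 1: ignore whitespace separating adjacent encoded-words (section 6.2).
    joined = BETWEEN_ENCODED_WORDS.sub(r"\1", value)
    # Step 2: decode each encoded-word; other text and its spacing are kept.
    return ENCODED_WORD.sub(_decode_word, joined)


folded = (
    "=?utf-8?b?QsOpcsOpbmljZS1BbcOpbGllIFJvc2Vtb25kZSBEw7tib2lzLULDqW5h?=\n"
    " =?utf-8?q?rd?= <rose@example.com>"
)
print(decode_rfc2047(folded))
```

Whitespace between an encoded-word and ordinary text (such as the space before the angle-bracketed address) is left alone; only the gap between two encoded-words is removed, which is the distinction the policy.default parser appears to miss here.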

Note that the legacy compat32 parser did not have this problem. (The legacy parser requires separately calling legacy email.header.decode_header() to decode rfc2047.)

>>> from email.header import decode_header
>>> compat32_parsed = message_from_bytes(message_bytes, policy=policy.compat32)
>>> compat32_parsed_from = "".join(
...     segment if charset is None and isinstance(segment, str)
...     else segment.decode(charset or "ascii")
...     for segment, charset in decode_header(compat32_parsed["From"])
... )
>>> assert compat32_parsed_from == address  # (success)
>>> print("original:", address); print("compat32:", compat32_parsed_from); print(" default:", default_parsed_from)
original: Bérénice-Amélie Rosemonde Dûbois-Bénard <rose@example.com>
compat32: Bérénice-Amélie Rosemonde Dûbois-Bénard <rose@example.com>
 default: Bérénice-Amélie Rosemonde Dûbois-Béna rd <rose@example.com>

CPython versions tested on:

3.12, 3.13

Operating systems tested on:

Linux, macOS

Linked PRs

gh-128110: Fix rfc2047 handling in email parser address headers #130749
