email.parser can insert extraneous spaces when parsing rfc2047 headers with policy.default  #128110

@medmunds

Description

Bug report

Bug description:

When using email.parser with the modern email.policy.default, the parser incorrectly inserts a space between adjacent rfc2047 encoded-words that are separated by folding whitespace. This can result in splitting words or names in unexpected places in the parsed headers.

Example:

Python 3.13.1 (main, Dec  3 2024, 17:59:52) [Clang 16.0.0 (clang-1600.0.26.4)] on darwin
>>> from email import message_from_bytes, policy
>>> from email.message import EmailMessage

>>> address = "Bérénice-Amélie Rosemonde Dûbois-Bénard <rose@example.com>"
>>> message = EmailMessage()
>>> message["From"] = address
>>> message_bytes = message.as_bytes()

>>> default_parsed = message_from_bytes(message_bytes, policy=policy.default)
>>> default_parsed_from = str(default_parsed["From"].addresses[0])  # Address -> str, so the comparison below is str to str
>>> assert default_parsed_from == address
Traceback (most recent call last):
  File "<python-input-22>", line 1, in <module>
    assert default_parsed_from == address
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

>>> print(default_parsed_from); print(address)
Bérénice-Amélie Rosemonde Dûbois-Béna rd <rose@example.com>
Bérénice-Amélie Rosemonde Dûbois-Bénard <rose@example.com>

>>> print(default_parsed["From"].addresses[0].display_name)
Bérénice-Amélie Rosemonde Dûbois-Béna rd

Notice the unexpected space in the parsed version. (I have more-problematic examples involving variations of the Scunthorpe problem, but they're not suitable for GitHub. This example uses Python's modern email API to generate the message, but the bug applies to parsing messages from any source.)

This seems to be caused by two consecutive rfc2047 encoded-words [edit: whether or not they cross a fold, see later comment]:

>>> print(message_bytes.decode())
From: =?utf-8?b?QsOpcsOpbmljZS1BbcOpbGllIFJvc2Vtb25kZSBEw7tib2lzLULDqW5h?=
 =?utf-8?q?rd?= <rose@example.com>
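
Because the problem is in parsing rather than generation, it can also be checked against a hand-written header. The following is a minimal sketch, not part of the original session: the helper name `parsed_display_name` and the shortened name "Bénard" are made up for illustration. The result depends on whether the interpreter running it has the bug, so no output is shown.

```python
import base64
from email import message_from_bytes, policy


def parsed_display_name(raw_message: bytes) -> str:
    """Parse raw_message with policy.default and return the From display name."""
    parsed = message_from_bytes(raw_message, policy=policy.default)
    return parsed["From"].addresses[0].display_name


# Hand-build two adjacent encoded-words on a single line (no fold involved).
first_word = (
    "=?utf-8?b?" + base64.b64encode("Béna".encode("utf-8")).decode("ascii") + "?="
)
second_word = "=?utf-8?q?rd?="
raw = f"From: {first_word} {second_word} <rose@example.com>\n\n".encode("ascii")

name = parsed_display_name(raw)
# RFC 2047 calls for "Bénard"; an affected parser yields "Béna rd" instead.
print(repr(name), "(bug present)" if " " in name else "(ok)")
```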

RFC 2047 section 6.2 requires that whitespace separating adjacent encoded-words be ignored when the header is displayed. (This exists specifically to allow encoded text to be split between any two characters, behavior that the policy.default email.generator relies on.)
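
The rule can be illustrated with the legacy decoding helpers, which join adjacent encoded-words without inserting anything between them. This is only a sketch of the expected behavior: `decode_rfc2047` and `b_word` are hypothetical helper names, and the short example string is an assumption chosen to keep the header readable.

```python
import base64
from email.header import decode_header, make_header


def decode_rfc2047(value: str) -> str:
    """Decode a header value using the legacy email.header helpers."""
    return str(make_header(decode_header(value)))


def b_word(text: str) -> str:
    """Build one base64 ("b") encoded-word for text, using utf-8."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"=?utf-8?b?{payload}?="


# The same two encoded-words, separated by a space and by a fold.
same_line = f"{b_word('Béna')} {b_word('rd')}"
folded = f"{b_word('Béna')}\n {b_word('rd')}"

# The separating whitespace is dropped in both cases, giving "Bénard".
print(decode_rfc2047(same_line))
print(decode_rfc2047(folded))
```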

Note that the legacy compat32 parser does not have this problem. (With the legacy parser, rfc2047 decoding is a separate step that requires calling the legacy email.header.decode_header().)

>>> from email.header import decode_header
>>> compat32_parsed = message_from_bytes(message_bytes, policy=policy.compat32)
>>> compat32_parsed_from = "".join(
...     segment if charset is None and isinstance(segment, str)
...     else segment.decode(charset or "ascii")
...     for segment, charset in decode_header(compat32_parsed["From"])
... )
>>> assert compat32_parsed_from == address  # (success)
>>> print("original:", address); print("compat32:", compat32_parsed_from); print(" default:", default_parsed_from)
original: Bérénice-Amélie Rosemonde Dûbois-Bénard <rose@example.com>
compat32: Bérénice-Amélie Rosemonde Dûbois-Bénard <rose@example.com>
 default: Bérénice-Amélie Rosemonde Dûbois-Béna rd <rose@example.com>

CPython versions tested on:

3.12, 3.13

Operating systems tested on:

Linux, macOS


Labels: stdlib (Python modules in the Lib dir), topic-email, type-bug (An unexpected behavior, bug, or error)
