Segfault/UB from expat when re-entering the XML Parser #146169

@stestagg


Crash report

What happened?

Semi-reliable crash with this code (interestingly, Python on macOS doesn't segfault for me, but Docker/Linux aarch64 does so reliably):

from xml.parsers import expat

p = expat.ParserCreate(encoding="utf-16")

def start(name, attrs):
    # Install a character-data handler that re-enters the same parser.
    p.CharacterDataHandler = lambda data: p.Parse(data, 0)

p.StartElementHandler = start

# UTF-16-LE BOM followed by "<a>x", fed to the parser one byte at a time.
data = b"\xff\xfe<\x00a\x00>\x00x\x00"
for i in range(len(data)):
    try:
        p.Parse(data[i:i+1], i == len(data) - 1)
    except Exception:
        pass

This code /is/ doing some pretty naughty stuff, but the main problem seems to be that a handler is set up to re-enter the parser that called it. The expat docs do say:

To state the obvious: the three parsing functions XML_Parse, XML_ParseBuffer and XML_GetBuffer must not be called from within a handler unless they operate on a separate parser instance, that is, one that did not call the handler. For example, it is OK to call the parsing functions from within an XML_ExternalEntityRefHandler, if they apply to the parser created by XML_ExternalEntityParserCreate.

and I see that the python expat parser code tracks in_callback:

int in_callback; /* Is a callback active? */

So I wonder if we can avoid the segfault by preventing Parse calls when in_callback==true?
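
To illustrate the idea at the Python level, here is a minimal sketch of a re-entrancy guard. GuardedParser and its _in_parse flag are hypothetical names invented for this example; the flag only mimics the role of in_callback and is not how pyexpat itself is implemented. The example uses a plain UTF-8 document so that the re-entrant call never reaches expat.

```python
from xml.parsers import expat


class GuardedParser:
    """Hypothetical wrapper that refuses re-entrant Parse() calls."""

    def __init__(self, **kwargs):
        self.parser = expat.ParserCreate(**kwargs)
        self._in_parse = False  # plays the role of in_callback in this sketch

    def Parse(self, data, isfinal=False):
        # Reject the call if a Parse() on this wrapper is already running,
        # which means we were invoked from inside one of its handlers.
        if self._in_parse:
            raise RuntimeError("Parse() called from within a handler")
        self._in_parse = True
        try:
            return self.parser.Parse(data, isfinal)
        finally:
            self._in_parse = False


guarded = GuardedParser()
# Same shape as the reproducer: the handler tries to re-enter the parser,
# but goes through the guard instead of calling expat directly.
guarded.parser.CharacterDataHandler = lambda data: guarded.Parse(data, False)

try:
    guarded.Parse(b"<a>x</a>", True)
except RuntimeError as exc:
    result = str(exc)
else:
    result = "no re-entry detected"
```

The exception raised inside the handler propagates out of the outer Parse() call, so the caller sees an ordinary Python error rather than a crash. A fix inside the C extension would presumably do the equivalent check against in_callback before calling into expat.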

There's also a secondary issue in play here: Parse() seems to call XML_SetEncoding

(void)XML_SetEncoding(self->itself, "utf-8");

without the check outlined in the expat docs:

Set the encoding to be used by the parser. It is equivalent to passing a non-NULL encoding argument to the parser creation functions. It must not be called after XML_Parse or XML_ParseBuffer have been called on the given parser. Returns XML_STATUS_OK on success or XML_STATUS_ERROR on error.

This is almost certainly not going to cause issues unless the encoding actually changes (not that common), at which point the UB will rear its head as the internal state of the parser becomes inconsistent.
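
As a caller-side illustration, the sketch below feeds the parser bytes in the encoding it was created with, rather than a str. This rests on an assumption that is not confirmed in this report: that the XML_SetEncoding call shown above is only reached when Parse() is handed a str. parse_text is a hypothetical helper name used only for this example.

```python
from xml.parsers import expat


def parse_text(xml_text, encoding="utf-8"):
    """Hypothetical helper: parse a str by encoding it to bytes first."""
    parser = expat.ParserCreate(encoding)
    chunks = []
    parser.CharacterDataHandler = chunks.append
    # Assumption: passing bytes keeps Parse() from overriding the encoding,
    # so the parser's encoding stays what it was at creation time.
    parser.Parse(xml_text.encode(encoding), True)
    return "".join(chunks)


text = parse_text("<a>x</a>")
```

This does not address the re-entrancy crash itself; it only avoids mixing a str argument with a parser created for a different encoding.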

Dockerfile reproducer
ARG REPO=https://github.com/python/cpython.git
ARG BRANCH=main

FROM ubuntu:24.04

ARG REPO
ARG BRANCH
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    build-essential git pkg-config \
    libssl-dev libbz2-dev libreadline-dev libsqlite3-dev \
    liblzma-dev libffi-dev zlib1g-dev uuid-dev \
    && rm -rf /var/lib/apt/lists/*

RUN git clone --branch ${BRANCH} --depth 1 \
    ${REPO} /cpython

RUN cd /cpython && \
    ./configure --prefix=/python --without-ensurepip && \
    make -j$(nproc) && \
    make install

# ── TEST SCRIPT ──────────────────────────────────────────────────
RUN cat > /test.py << 'EOF'
from xml.parsers import expat

p = expat.ParserCreate(encoding="utf-16")

def start(name, attrs):
    p.CharacterDataHandler = lambda data: p.Parse(data, 0)

p.StartElementHandler = start

data = b"\xff\xfe<\x00a\x00>\x00x\x00"
for i in range(len(data)):
    try:
        p.Parse(data[i:i+1], i == len(data) - 1)
    except Exception:
        pass
EOF
# ──────────────────────────────────────────────────────────────────────────────

CMD ["/bin/sh", "-c", "uname -m && /python/bin/python3 -VV && /python/bin/python3 /test.py"]

This gives, on my PC:

docker run --rm -it expattest
aarch64
Python 3.15.0a7+ (heads/main:52c0186, Mar 19 2026, 13:06:19) [GCC 13.3.0]
52c01864c4778a351e5aa3584e86ed6fd212a5a4
Segmentation fault (core dumped)

CPython versions tested on:

CPython main branch

Operating systems tested on:

macOS

Output from running 'python -VV' on the command line:

Python 3.15.0a7+ (heads/main:52c0186, Mar 19 2026, 13:06:19) [GCC 13.3.0]

Labels

extension-modules (C modules in the Modules dir), topic-XML, type-crash (a hard crash of the interpreter, possibly with a core dump)