RFC: Towards reproducible builds for our PyPI release wheels

Given the popularity of our project, our release automation could be an attractive target for supply chain attacks that make our binaries ship spyware or ransomware to some of our users.

One way to detect such attacks would be to:

make sure we produce reproducible builds;
rebuild our wheels in independent build environments and check that we obtain the same hashes as for the binaries produced by our release CI, to make sure that our release CI environment has not been tampered with to inject malware into our binaries;
optionally make it possible to publish GPG signed statements that some released artifact digests were successfully byte-for-byte reproduced from source independently.
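The comparison step above could be sketched as follows. This is only a minimal illustration, assuming the released wheels and the independently rebuilt wheels have been downloaded into two local directories; the function names and directory layout are hypothetical and not part of any existing tooling.

```python
import hashlib
from pathlib import Path


def sha256_digest(path, chunk_size=1 << 20):
    """Return the hex sha256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_wheel_dirs(release_dir, rebuild_dir):
    """Compare wheels with the same filename in two directories.

    Returns a dict mapping each wheel filename found in release_dir to:
    True if the rebuilt wheel is byte-for-byte identical,
    False if the digests differ,
    None if no wheel with that name exists in rebuild_dir.
    """
    results = {}
    for released in sorted(Path(release_dir).glob("*.whl")):
        rebuilt = Path(rebuild_dir) / released.name
        if not rebuilt.exists():
            results[released.name] = None
        else:
            results[released.name] = (
                sha256_digest(released) == sha256_digest(rebuilt)
            )
    return results
```

Any `False` or `None` entry would then need manual investigation before a reproducibility statement could be published for that release.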

The first step to make our wheels as reproducible as possible would be to define deterministic values for the SOURCE_DATE_EPOCH (and maybe PYTHONHASHSEED, that cannot hurt) environment variables.
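As a rough sketch of what this could look like in a release script: one common choice (an assumption here, not something decided in this RFC) is to derive SOURCE_DATE_EPOCH from the committer timestamp of the tagged commit, so that anyone checking out the same tag gets the same value. The helper names below are hypothetical.

```python
import os
import subprocess


def commit_timestamp(repo_dir="."):
    """Return the committer timestamp (Unix seconds) of HEAD in repo_dir."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%ct"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return int(out.stdout.strip())


def reproducible_build_env(source_date_epoch, base_env=None):
    """Return a copy of the environment with deterministic build settings."""
    env = dict(os.environ if base_env is None else base_env)
    # Build tools that honor this variable use it instead of the current time.
    env["SOURCE_DATE_EPOCH"] = str(int(source_date_epoch))
    # A fixed seed disables hash randomization in Python subprocesses.
    env["PYTHONHASHSEED"] = "0"
    return env
```

The resulting mapping could then be passed as the `env` argument of `subprocess.run` when invoking the actual build command.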

However, this would not be enough.

To get this to work fully as expected, we would also need to guarantee that:

we use recent enough versions of pip/setuptools/wheel/auditwheel/delocate that honor SOURCE_DATE_EPOCH;
a full description of the build environment (e.g. versions and sha256 digests of the compilers and other build dependencies) is archived in our source repo for a given tag of scikit-learn. Ideally, all those build dependencies should themselves be byte-for-byte reproducible from their own public source code repo.

Currently some build dependencies such as NumPy and Cython come from the pyproject.toml file, which only specifies a minimum version. This means that we may end up with newer versions of these dependencies than the ones used to build the wheels for a given tag. cibuildwheel itself is not pinned, and hence neither are the dependencies it installs in its managed venvs (pip, setuptools, wheel, auditwheel, delocate).
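To make the pinning problem concrete, a small check like the one below could flag build requirements that do not use an exact version. It is a simplified sketch: the regular expression only approximates the requirement syntax, the function names are hypothetical, and the version numbers in the usage example are purely illustrative.

```python
import re

# Matches "name==version" (optionally with extras), rejecting wildcards
# such as "==1.26.*" and any range specifier such as ">=" or "<".
_EXACT_PIN = re.compile(
    r"^\s*[A-Za-z0-9._-]+(\[[^\]]*\])?\s*==\s*[^*,\s]+\s*$"
)


def is_pinned(requirement):
    """Return True if the requirement string uses a single exact pin."""
    # Drop any environment marker after ";" before checking the specifier.
    spec = requirement.split(";", 1)[0]
    return _EXACT_PIN.match(spec) is not None


def unpinned_requirements(requirements):
    """Return the requirement strings that are not exactly pinned."""
    return [req for req in requirements if not is_pinned(req)]


def load_build_requirements(pyproject_path):
    """Read [build-system] requires from pyproject.toml (Python >= 3.11)."""
    import tomllib  # imported lazily: only in the standard library since 3.11

    with open(pyproject_path, "rb") as f:
        return tomllib.load(f)["build-system"]["requires"]
```

For example, `unpinned_requirements(["numpy>=1.22", "Cython==3.0.8"])` returns `["numpy>=1.22"]`. Note that an exact version pin alone does not fix the sha256 digest of the installed artifact, so this check would only cover part of the requirement described above.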

Furthermore, we do not yet archive or pin the versions and sha256 digests of the compilers. For Linux, this depends on the manylinux docker image used by cibuildwheel, which, at the time of writing, is not guaranteed to be reproducible, even when using the same docker image tag. For Windows and macOS, the compilers come from the VM image used on our CI, and we archive neither their version numbers nor the hashes of their binaries.

Ideally all this information should be in our source code at the time of the release (reachable via a checkout of our commit tag).
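One possible way to capture such information is a small manifest generated at build time and committed alongside the release tag. The sketch below is only an illustration under assumptions: the manifest layout and function names are hypothetical, and which packages and executables to record would have to be decided per platform.

```python
import hashlib
import json
import platform
import shutil
import sys
from importlib import metadata


def file_sha256(path):
    """Return the hex sha256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_version(name):
    """Return the installed version of a Python package, or None."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def describe_executable(name):
    """Return the resolved path and digest of an executable, or None."""
    path = shutil.which(name)
    if path is None:
        return None
    return {"path": path, "sha256": file_sha256(path)}


def build_environment_manifest(packages, executables):
    """Collect versions and digests describing the build environment."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "executables": {
            name: describe_executable(name) for name in executables
        },
    }
```

A call such as `json.dumps(build_environment_manifest(["numpy", "Cython"], ["gcc"]), indent=2, sort_keys=True)` would produce text that can be archived and diffed between the release build and an independent rebuild.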

Finally, we might need to set a specific umask:

pypa/wheel#362: Builds are not (fully) reproducible due to file permissions stored in .whl

Not sure about how to get deterministic file permission metadata for macOS and Windows wheels.

EDIT: now that we use meson, this problem with umask might have gone away, but we need to check.

EDIT2: I tried and I think we still have a sensitivity to umask after the switch to the meson build system.
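To check this kind of sensitivity, the permission bits recorded in two wheels built under different umask values can be compared directly. The sketch below relies on the fact that zip archives written on Unix store the file mode in the upper 16 bits of each entry's external attributes; the function names are hypothetical.

```python
import stat
import zipfile


def wheel_file_modes(wheel_path):
    """Map each archive member to its stored permission bits (octal string)."""
    modes = {}
    with zipfile.ZipFile(wheel_path) as zf:
        for info in zf.infolist():
            # The Unix mode is stored in the high 16 bits of external_attr.
            unix_mode = info.external_attr >> 16
            modes[info.filename] = oct(stat.S_IMODE(unix_mode))
    return modes


def differing_modes(wheel_a, wheel_b):
    """Return members present in both wheels whose permission bits differ."""
    modes_a = wheel_file_modes(wheel_a)
    modes_b = wheel_file_modes(wheel_b)
    return {
        name: (modes_a[name], modes_b[name])
        for name in sorted(modes_a.keys() & modes_b.keys())
        if modes_a[name] != modes_b[name]
    }
```

An empty result from `differing_modes` for two builds made with, say, umask 022 and umask 002 would suggest the permission metadata no longer depends on the umask; a non-empty result lists the affected files.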

Finally, once our builds are made 100% reproducible, we would need to:

document instructions (and provide official scripts) to allow anyone to easily re-build the binaries independently from source;
make this process easy to automate on private infrastructure (distinct from our usual public CI);
publish official reproducibility results on a public site, ideally not only on our GitHub Pages hosted website, and maybe sign those with GPG for instance;
coordinate with people from https://scientific-python.org or even the PSF or pypi.org admins to define and follow community-wide best practices.

This is just for scikit-learn itself. But for this kind of supply chain audit to be meaningful, we would need to make sure that all the tools in the build pipeline of scikit-learn are themselves reproducible and regularly and independently reproduced, including:

compilers;
runtime libraries such as the libc;
build dependencies (numpy, Cython, meson-python, ninja, cibuildwheel);
wheel binary editing tools such as auditwheel/delocate/delvewheel/repairwheel;
the sha256sum command :)
the whole manylinux docker image and probably docker since it is required to build manylinux wheels.

We would also need to snapshot the provenance info before running the tests (in case pytest or any test dependency is itself the target of a supply chain attack). For instance, the test environment was effectively used to hide the attack on the xz binaries.

Note that a large fraction of Debian is already reproducible, but we would need to trace everything in our build process to check that this is the case.

Doing all of this will require a significant investment of maintainer time, but we can probably start with low-hanging fruit such as setting SOURCE_DATE_EPOCH in our release CI scripts.
