Type-1 font subsetting #20716
f6861ad
to
d8ae364
Compare
Is this ready for review, now that #20715 has been merged?
This is failing on Ubuntu 22.04 and Windows but passing on 24.04 and Mac. Here's one failing image (test_usetex_pdf.png, so converted from pdf to png on the test system). This looks like the font is entirely broken. The expected image similarly converted looks like this: Strangely enough, the generated pdf file looks fine on my Mac.
I can repeat the error running on an Ubuntu 22.04 docker image:
The subsetting is wrong in some way that breaks Ghostscript 9.55 but not the viewer in macOS or the newer Ghostscript in Ubuntu 24.04. (Ghostscript 9.56 has a completely rewritten PDF interpreter.)
I hope I found the culprit... I was writing an extra delimiter between the Subrs and the Charstrings when one was already there.
9c5d971
to
9a9dc05
Compare
lib/matplotlib/_type1font.py
Outdated
@@ -35,10 +37,64 @@

from matplotlib.cbook import _format_approx
from . import _api
if T.TYPE_CHECKING:
    from collections.abc import Iterable
We tend not to have inline type hints (and I personally very much don't like them), but there are a few exceptions (e.g. _mathtext.py) so I guess it's up to you whether you want to leave them in.
Right – I saw that there are some files with type hints, and figured that the project might be in the process of adding them. I've found type hints pretty useful at work, but of course we should have a consistent style in the project. Has this been discussed in the past on the dev list or somewhere?
I gave my opinion on our gitter, but I'll copy the main comment here for the sake of consolidation/potential for finding it in the future:
I would say type hints are a net positive in my opinion, though I acknowledge that there are problems (perhaps especially in a case like ours where APIs were designed well before type hints).
We went with stub files on initial implementation specifically to minimize some risk: since the stub files are not even loaded at runtime, there was no chance of them interfering.
However, I do think inline reduces a level of ongoing maintenance risk, in particular the chance of the two files getting out of sync. (We have tooling/CI in place to help catch such things, but I don't think it fully removes the risk.)
The other factor is that stub files allow us to draw the line at public APIs, and not have to worry about typing our internal logic. (Which has positives and negatives, the positives being largely catching additional problems by the type checker, the negative largely being the surface area that needs to be covered.)
So in all, I think my personal recommendation would be a relatively slow uptake, where we keep the stub files for most public facing things in the interim, do inline type hints for internal logic when it makes sense to do so (e.g. when doing so is actively helpful for some refactor/feature addition/etc) then once the majority of internal logic has inline hints, consider inlining the hints of the more public facing APIs.
So looking at this example in particular, I think this does fall within my recommendation there. I am not personally opposed to it, but also not hard pulling for it. I do question a bit whether to advocate for "do minimal as you are working and motivated" or "the unit for adding type hints should be one file at a time". Doing the latter would carry the advantage that you get a better sense of how complete the hints are, but the disadvantage of asking people to type hint 3-4x what they were otherwise looking at (in this case, for example), which also impacts the review-ability of the PR as there are lots of changes that are actually orthogonal.
For what it's worth, this particular file already has one function which got an inline type hint added in #27796.
and @tacaswell responded
It is worth re-opening the discussion of in-line type hints
the world seems to have stabilized and in other projects I have had type hints catch actual bugs before I ran the code (but I have also had to spend 15 minutes trying to sort out how to placate the type checker for code that clearly works a couple of times)
The type hints were useful for me while working on this, since VS Code pointed out some obvious mistakes in real time. But I agree that it is not ideal to leave the file only partially annotated, since it does not pass a full type check in its current state.
I could remove the extra type hints from this PR and make a separate PR to add hints to the whole file?
Sure, let's just go for it. I'm not going to be able to resist this forever 🙄
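For readers following this thread, the inline style under discussion can be sketched roughly as below. It reuses the `T.TYPE_CHECKING` guard visible in the diff above; the function `count_unique` is a hypothetical example and is not code from this PR.

```python
import typing as T

if T.TYPE_CHECKING:
    # This import is only seen by static type checkers; it is skipped at runtime.
    from collections.abc import Iterable


def count_unique(names: "Iterable[str]") -> int:
    """Return the number of distinct names (hypothetical helper)."""
    # The annotation is a string, so Iterable need not exist at runtime.
    return len(set(names))
```

A stub file would instead carry the same signature in a separate `.pyi` file and leave the function body unannotated, which is the trade-off described in the comments above.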
A few minor points to be considered, but overall this looks great.
lib/matplotlib/_type1font.py
Outdated
    postscript_stack: list[float],
    opcode: int | str,
) -> tuple[set, set, list[float], list[float]]:
    """Run one step in the charstring interpreter."""
I wonder if this may be clearer if you mutate buildchar_stack and postscript_stack in-place? this way you would write
glyphs = set(); subrs = set()
if opcode in {...}:
    buildchar_stack[:] = []
elif opcode == "seac":
    codes = ...; glyphs.update(...)
    buildchar_stack[:] = []
elif opcode == "div":
    num2 = buildchar_stack.pop()
    num1 = buildchar_stack.pop()
    buildchar_stack.append(num1 / num2)
...
return glyphs, subrs
which feels perhaps more in the spirit of a postscript interpreter?
Could be! I'll think about this a little.
I moved this into a separate class with the stacks and glyph/subr sets as members. I agree that it looks clearer that way.
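The class-based shape described here might look something like the following sketch. It is a heavily simplified illustration, not the code in this PR: the class name `CharstringTracker` and its attribute names are hypothetical, only a few opcodes are handled, subroutine bodies are not followed recursively, and `seac` records raw character codes rather than glyph names.

```python
class CharstringTracker:
    """Hypothetical, simplified tracker for glyph and subroutine references."""

    def __init__(self):
        self.buildchar_stack = []  # operand stack of the charstring interpreter
        self.glyphs = set()        # character codes referenced through seac
        self.subrs = set()         # subroutine numbers referenced through callsubr

    def step(self, opcode):
        """Run one step, mutating the stack and the sets in place."""
        if isinstance(opcode, (int, float)):
            # Numbers are operands: push them.
            self.buildchar_stack.append(opcode)
        elif opcode == "div":
            num2 = self.buildchar_stack.pop()
            num1 = self.buildchar_stack.pop()
            self.buildchar_stack.append(num1 / num2)
        elif opcode == "callsubr":
            # The subroutine number is the topmost operand.
            self.subrs.add(int(self.buildchar_stack.pop()))
        elif opcode == "seac":
            # The last two operands are the base and accent character codes.
            bchar, achar = self.buildchar_stack[-2:]
            self.glyphs.update({int(bchar), int(achar)})
            self.buildchar_stack.clear()
        else:
            # Simplifying assumption: every other operator consumes the stack.
            self.buildchar_stack.clear()


tracker = CharstringTracker()
for op in [10, 2, "div", 3, "callsubr", 0, 100, 200, 65, 194, "seac"]:
    tracker.step(op)
```

Keeping the stacks and sets as members means each step needs no return value, which is the clarity gain the review comment was after.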
Type 1 fonts are now subsetted in PDF output
--------------------------------------------
- Type 1 fonts are now subsetted in PDF output
- --------------------------------------------
+ Type 1 fonts are now subset in PDF output
+ -----------------------------------------
I think I disagree here... the English verb "set" is irregular in that way, but if you search for "subsetted" in the context of fonts, it seems to be fairly common, including in fonttools and various Adobe forums. See also the accepted answer to this question and possibly the discussion of flied out in Steven Pinker's Words and Rules.
Maybe just be a bit more verbose?
- Type 1 fonts are now subsetted in PDF output
- --------------------------------------------
+ PDFs embed just the subset of Type 1 glyphs that are used
+ -----------------------------------------------------------
I don't disagree that this is the correct conjugation in the past tense, rather that these sentences are not in the past tense. It is stating what is and in the (foreseeable) future shall occur.
I reworded the whole paragraph to be hopefully more understandable on its own.

When using the usetex feature with the PDF backend, Type 1 fonts are embedded
in the PDF output. These fonts used to be embedded in full, but they are now
subsetted to only include the glyphs that are actually used in the figure.
- subsetted to only include the glyphs that are actually used in the figure.
+ subset to only include the glyphs that are actually used in the figure.
The ligature problem is probably because we don't apply the encoding from TeX's font configuration to the font before subsetting. The custom encoding array is output in the PDF file but should also be used to map from character codes to glyph names. The seac issue might be a different encoding problem where we should do the lookups using Adobe Standard Encoding and not the font's own encoding.
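The mapping step suggested here can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the function name `glyph_names_for_codes`, the dict-based encoding tables, and the toy code values are hypothetical and do not come from this PR or from any real font.

```python
def glyph_names_for_codes(codes, font_encoding, tex_encoding=None):
    """Map character codes to glyph names, preferring the TeX-side encoding.

    Both encodings are modelled as dicts from character code to glyph name
    (an assumption of this sketch). Unmapped codes fall back to ".notdef".
    """
    encoding = tex_encoding if tex_encoding is not None else font_encoding
    return {encoding.get(code, ".notdef") for code in codes}


# Toy tables: the same code names different glyphs in the two encodings.
builtin = {65: "A", 28: "space"}
custom = {65: "A", 28: "fi"}

without_custom = glyph_names_for_codes({65, 28}, builtin)
with_custom = glyph_names_for_codes({65, 28}, builtin, custom)
```

If the subsetter consulted only the built-in table in a situation like this, it would keep the wrong glyph for the ligature code, which matches the symptom described above.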
It seems that some of the latest changes broke compatibility with older GhostScript again. But while I debug that, a note about the new tests: they use font packages that are available on Debian or Ubuntu only by installing texlive-fonts-extra, which brings in a lot of other fonts too. Currently these tests get skipped on all runners, but would it make sense to install the extra fonts on just one of the runners to allow these tests to get run somewhere?
4a8b6ff
to
cb204cd
Compare
I added a test using Bitstream Charter, which is part of texlive-fonts-recommended, so we get at least some coverage of the full Type-1 subsetting code path. I fixed the gs compatibility issue, which was about a broken Encoding object.
lib/matplotlib/_type1font.py
Outdated
lenIV = self.prop.get('lenIV', 4)
encrypted = [
    self._encrypt(charstrings[glyph], 'charstring', lenIV).decode('latin-1')
    for glyph in glyphs
Should this (and _subset_subrs below) sort the glyphs (and subrs) to ensure reproducibility? (as set ordering changes over runs)
The subrs are already in order (the loop is for i in range(n_subrs)) but sorting the glyphs is a good idea.
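The reproducibility concern can be shown with a short sketch. The function name `ordered_glyphs` is hypothetical; the point is only that iterating over a set of strings can give a different order from one interpreter run to the next, whereas `sorted` gives a stable order.

```python
def ordered_glyphs(glyphs):
    """Return glyph names in a deterministic order for writing to the font."""
    # A set of str may iterate in a different order on each run because
    # string hashing is randomized by default; sorting removes that variation.
    return sorted(glyphs)


names = ordered_glyphs({"b", "a", ".notdef"})
```

With this in place, two runs over the same figure would emit the charstrings in the same order, so the output bytes do not depend on hash ordering.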
I removed the extra type annotations, which were incomplete in any case. I'll make a separate PR to annotate the entire file.
This reduces pdf file sizes when usetex is active, at the cost of some complexity in the code. We implement a charstring bytecode interpreter to keep track of subroutine calls in font programs.

Give dviread.DviFont a fake filename attribute and a get_fontmap method for character tracking.

In backend_pdf.py, refactor _get_subsetted_psname so it calls a method _get_subset_prefix, and reuse that to create tags for Type-1 fonts. Mark the methods static since they don't use anything from the instance.

Recommend merging to main to give people time to test this, not to a 3.10 point release.

Closes matplotlib#127.

Co-Authored-By: Elliott Sales de Andrade <quantum.analyst@gmail.com>
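As background for the tag refactoring mentioned in the commit message: PDF names a subsetted font with a tag of six uppercase letters, a plus sign, and the original PostScript name. The sketch below shows one way such a tag could be derived from the glyph set. It is an assumption-laden illustration, not the implementation in backend_pdf.py; the function names `subset_prefix` and `subsetted_psname` and the hashing scheme are hypothetical.

```python
import hashlib
import string


def subset_prefix(glyph_names):
    """Derive a six-letter uppercase tag from a collection of glyph names."""
    # Sort first so the tag does not depend on iteration order.
    joined = "\0".join(sorted(glyph_names)).encode("utf-8")
    value = int.from_bytes(hashlib.sha256(joined).digest()[:8], "big")
    letters = []
    for _ in range(6):
        value, remainder = divmod(value, 26)
        letters.append(string.ascii_uppercase[remainder])
    return "".join(letters)


def subsetted_psname(ps_name, glyph_names):
    """Prepend the subset tag to a PostScript font name."""
    return f"{subset_prefix(glyph_names)}+{ps_name}"


tagged = subsetted_psname("CMR10", {"A", "fi", "space"})
```

Neither function uses instance state, which is consistent with the commit message's note about marking the real methods static.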
The fonts that get used are usually "Type 1" fonts.
They used to be embedded in full
but are now limited to the glyphs that are actually used in the figure.
This reduces the size of the resulting PDF files.
This reads well to me. Thanks!
Co-authored-by: Elliott Sales de Andrade <quantum.analyst@gmail.com>
Will these new tests be adjusted by #29816? If so we should sequence that one first.
I don't think these depend on FreeType, since the usetex case uses TeX for layout and parses dvi files to determine the coordinates of glyphs.
PR Summary
Type-1 subsetting
This reduces pdf file sizes when usetex is active, at the cost of
some complexity in the code. We implement a charstring bytecode
interpreter to keep track of subroutine calls in font programs.
Recommend merging to main to give people time to test this, not to
a 3.10 point release.
Give dviread.DviFont a fake filename attribute and a get_fontmap
method for character tracking.
Add type hints to the code this touches.
Closes #127.