-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Allow surrogates in str #5587
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Allow surrogates in str #5587
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
0a1bdc1
to
1fdf376
Compare
can we separately merge less wtf8 related changes like format/cformat changes? |
c3388b2
to
a556bd7
Compare
youknowone
approved these changes
Mar 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This switches the backing format of
PyStr
fromBox<str>
toBox<Wtf8>
. From the motivation I wrote in a doc comment:An implementation of WTF-8, a utf8-compatible encoding that allows for unpaired surrogate codepoints. This implementation additionally allows for paired surrogates that are nonetheless treated as two separate codepoints.
RustPython uses this because CPython internally uses a variant of UCS-1/2/4 as its string storage, which treats each
u8
/u16
/u32
value (depending on the highest codepoint value in the string) as simply integers, unlike UTF-8 or UTF-16 where some characters are encoded using multi-byte sequences. CPython additionally doesn't disallow the use of surrogates instr
s (which in UTF-16 pair together to represent codepoints with a value higher thanu16::MAX
) and in fact takes quite extensive advantage of the fact that they're allowed. Thesurrogateescape
codec-error handler uses them to represent byte sequences which are invalid in the given codec (e.g. bytes with their high bit set in ASCII or UTF-8) by mapping them into the surrogate range.surrogateescape
is the default error handler in Python for interacting with the filesystem, and thus if RustPython is to properly supportsurrogateescape
, itsstr
s must be able to represent surrogates.We use WTF-8 over something more similar to CPython's string implementation because of its compatibility with UTF-8, meaning that in the case where a string has no surrogates, it can be viewed as a UTF-8 Rust
str
without needing any copies or re-encoding.This implementation is mostly copied from the WTF-8 implentation in the Rust standard library, which is used as the backing for
OsStr
on Windows targets. As previously mentioned, however, it is modified to not join two surrogates into one codepoint when concatenating strings, in order to match CPython's behavior.Method
I opted not to change the signature of
PyStr::as_str()
, since that would mean a truly massive diff. Instead, I have it panic, and my hope is that we can get to a place with this where tests pass, and then we can deal with the rest of theas_str
calls incrementally.Status
Currently panicking in cformat code. Is there still a reason
Parser
et al need to be a separate repo if ruff isn't using it anymore? It'd be really convenient to merge it into the monorepo.