Description
When decoding or encoding str/bytes, you can pick an error handler, for example:
'strict' : raise an exception
'ignore' : invalid bytes are dropped
'replace' : invalid bytes are replaced by U+FFFD (REPLACEMENT CHARACTER)
'backslashreplace' : invalid bytes are replaced with backslashed escape sequences
The current implementation has no problem with these.
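For reference, here is a small CPython sketch comparing these four handlers on the same input. The sample bytes and the helper name `try_decode` are made up for illustration; 0xFF was picked because it can never appear in valid UTF-8.

```python
# Sample input (hypothetical): 0xFF is never valid in UTF-8.
data = b"ab\xffcd"


def try_decode(raw: bytes, errors: str) -> str:
    """Decode as UTF-8 with the given handler; report the exception type on failure."""
    try:
        return raw.decode("utf-8", errors=errors)
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__} at byte {exc.start}"


handlers = ("strict", "ignore", "replace", "backslashreplace")
results = {name: try_decode(data, name) for name in handlers}

for name, value in results.items():
    # ascii() keeps the output printable whatever the terminal encoding is.
    print(f"{name:>16}: {ascii(value)}")
```

All four results are ordinary, valid Unicode strings, which is why they fit in a storage type that only accepts valid Unicode.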
PEP 383 adds one more handler: "This PEP proposes a means of dealing with [such] irregularities by embedding the bytes in character strings in such a way that allows recreation of the original byte string."
'surrogateescape' : undecodable bytes are replaced by lone surrogates
With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF.
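A minimal CPython sketch of that mapping, using the two bytes 0xC3 0xA9 (UTF-8 for "é") decoded as ASCII so that both are undecodable. The variable names are illustrative only.

```python
raw = b"\xc3\xa9"  # valid UTF-8 for "é", but neither byte is valid ASCII

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("ascii", errors="surrogateescape")
code_points = [hex(ord(ch)) for ch in text]
print(ascii(text))   # ascii() avoids printing raw surrogates to the terminal
print(code_points)

# Encoding with the same handler restores the original bytes.
restored = text.encode("ascii", errors="surrogateescape")
print(restored == raw)
```

The round trip only works if the str can actually hold U+DCC3 and U+DCA9 as individual code points.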
But the current RustPython implementation stores str as valid Unicode, so this fails:
a.push('\u{dcc3}');
help: src/main.rs:25: unicode escape must not be a surrogate
Storing a surrogate is not allowed.
I tried doing it inside an unsafe block, but that is not possible either.
I'm not sure I have a clear idea of how to do it. I was thinking of a wrapper that combines a "homemade" Unicode type with the Rust Unicode implementation to store strings, but Python is actually able to call some methods on strings that are not valid UTF-8:
a.capitalize()
>>> '\udcc3\udca9'
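To make the constraint concrete, here is a short CPython sketch of what such a string supports. It builds the same value directly with `chr`; the variable names are illustrative. The `surrogatepass` handler is included only to show the bytes a relaxed UTF-8 encoding would produce, as one possible starting point for a storage format, not as a proposal for how RustPython should do it.

```python
# CPython allows lone surrogates as ordinary code points in a str.
a = chr(0xDCC3) + chr(0xDCA9)

length = len(a)                 # counted as two code points
capitalized = a.capitalize()    # surrogates have no case mapping, so nothing changes

# Strict UTF-8 refuses to encode lone surrogates...
try:
    a.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but the 'surrogatepass' handler emits them in the usual 3-byte form.
relaxed = a.encode("utf-8", errors="surrogatepass")

print(length, ascii(capitalized), strict_ok, relaxed)
```

So str methods need to work on code points that strict UTF-8, and therefore a plain Rust `String`, cannot represent.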
I'd be happy to hear some opinions on this.