Should we change the str/unicode implementation for PEP 383? #935

When decoding or encoding str/bytes, you can choose an error handler, for example:
'strict' : raise an exception
'ignore' : invalid bytes are dropped
'replace' : invalid bytes are replaced by U+FFFD (REPLACEMENT CHARACTER)
'backslashreplace' : invalid bytes are replaced with backslashed escape sequences

The current implementation has no problem with these.
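To make the handlers above concrete, here is a minimal CPython sketch. The sample input `b"caf\xff"` and the helper name `decode_with` are made up for illustration. Every result is either an exception or a string made only of valid Unicode scalar values, which is why these handlers fit a valid-Unicode string type.

```python
# Decode the same invalid UTF-8 input with each handler listed above.
raw = b"caf\xff"  # hypothetical sample: 0xff never appears in valid UTF-8


def decode_with(handler):
    """Return the decoded text, or the exception name if decoding fails."""
    try:
        return raw.decode("utf-8", errors=handler)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


handlers = ("strict", "ignore", "replace", "backslashreplace")
results = {h: decode_with(h) for h in handlers}

for handler, value in results.items():
    print(f"{handler:>16}: {value!r}")
```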

PEP 383 adds one more handler. From the PEP: "This PEP proposes a means of dealing with [such] irregularities by embedding the bytes in character strings in such a way that allows recreation of the original byte string."

'surrogateescape' : undecodable bytes are replaced by lone surrogates

With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF.
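The mapping can be checked in CPython with a short sketch. The input bytes are chosen here to match the `'\udcc3\udca9'` value that appears later in this issue; the variable names are illustrative only.

```python
raw = b"\xc3\xa9"  # not valid ASCII: both bytes are >= 128

# Each undecodable byte b becomes the lone surrogate U+DC00 + b.
text = raw.decode("ascii", errors="surrogateescape")
code_points = [hex(ord(ch)) for ch in text]

# Encoding with the same handler recreates the original byte string.
restored = text.encode("ascii", errors="surrogateescape")

print(repr(text))    # repr() escapes the lone surrogates so they can be printed
print(code_points)
print(restored == raw)
```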

But the current RustPython implementation stores str as valid Unicode, so this fails to compile:

a.push('\u{dcc3}');
help: src/main.rs:25: unicode escape must not be a surrogate

Storing a surrogate is not allowed.
I tried to do it inside an unsafe block, but that is not possible either.

I'm not sure I have a clear idea of how to do it. I was thinking of a wrapper around a "homemade" Unicode type and the Rust Unicode implementation to store strings, but Python is actually able to call some methods on strings that are not valid UTF-8:

>>> a.capitalize()
'\udcc3\udca9'
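A small CPython sketch of what that implies for any replacement string type: ordinary str methods accept lone surrogates, a strict UTF-8 encode rejects them, and the `surrogatepass` handler shows one possible byte-level form for such code points. This is only an illustration of CPython's behaviour, not a proposal for how RustPython should store strings; the variable names are invented here.

```python
a = "\udcc3\udca9"  # two lone surrogates, as produced by surrogateescape

# str methods work; surrogates have no case mapping, so nothing changes.
capitalized = a.capitalize()

# A strict UTF-8 encode refuses lone surrogates.
try:
    a.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' writes each surrogate as a 3-byte sequence (not valid UTF-8).
stored = a.encode("utf-8", errors="surrogatepass")
round_trip = stored.decode("utf-8", errors="surrogatepass")

print(repr(capitalized), strict_ok, stored)
```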

I'd be happy to hear some opinions about it.
