gh-119182: Use strict error handler in PyUnicode_FromFormat() #120307


Closed
wants to merge 1 commit into from

Conversation

vstinner
Member

@vstinner vstinner commented Jun 10, 2024

PyUnicode_FromFormat() now decodes the "%s" format argument from UTF-8 with the "strict" error handler, instead of the "replace" error handler.

Remove the unused 'consumed' parameter of
unicode_decode_utf8_writer().
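
A minimal Python sketch of how the two error handlers differ, using `bytes.decode()` as a stand-in for the C-level decoding of the `%s` argument. The sample bytes are an assumption chosen for illustration (Latin-1 text passed where UTF-8 is expected); this is not the code path changed by the PR.

```python
# Hypothetical input: "café" encoded as Latin-1, which is not valid UTF-8.
raw = b"caf\xe9"

# "replace" substitutes U+FFFD for the undecodable byte and keeps going.
replaced = raw.decode("utf-8", errors="replace")

# "strict" raises UnicodeDecodeError instead of returning a string.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(replaced), strict_failed)
```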


📚 Documentation preview 📚: https://cpython-previews--120307.org.readthedocs.build/

@vstinner
Member Author

cc @methane @serhiy-storchaka

@serhiy-storchaka
Member

What happens with truncated strings, like %.50s, if they are truncated in the middle of a multibyte UTF-8 sequence?

@vstinner
Member Author

What happens with truncated strings, like %.50s, if they are truncated in the middle of a multibyte UTF-8 sequence?

There are two tests on that: UnicodeDecodeError is raised in this case.

@vstinner
Member Author

Example of test: test_capi.test_unicode

        # test "%s" format with precision
        check_format('abc',
                     b'%.3s', b'abcdef')
        with self.assertRaises(UnicodeDecodeError):
            PyUnicode_FromFormat(b'%.5s', 'abc[\u20ac]'.encode('utf8'))
        check_format('abc[\u20ac',
                     b'%.7s', 'abc[\u20ac]'.encode('utf8'))
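
The byte arithmetic behind these test cases can be reproduced in plain Python. This is a sketch that slices the encoded bytes by hand to mimic a byte-counted precision; it does not call `PyUnicode_FromFormat()` itself, and the variable names are illustrative.

```python
data = "abc[\u20ac]".encode("utf-8")  # the euro sign is 3 bytes: e2 82 ac

cut_mid = data[:5]  # b'abc[\xe2': ends inside the euro sign
cut_ok = data[:7]   # b'abc[\xe2\x82\xac': ends on a character boundary

# A cut on a character boundary decodes cleanly under "strict".
ok_text = cut_ok.decode("utf-8")

# A cut inside a multibyte sequence fails under "strict" ...
try:
    cut_mid.decode("utf-8")
    mid_raised = False
except UnicodeDecodeError:
    mid_raised = True

# ... and yields a replacement character under "replace".
mid_text = cut_mid.decode("utf-8", errors="replace")

print(ok_text, mid_raised, repr(mid_text))
```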

@serhiy-storchaka
Member

This is bad. Such formats are common in error formatting code (not only in CPython, but also in third-party code), and now you will get a UnicodeDecodeError instead of the original error even if everything was fine with the encoding. In this case I think that it is better to truncate the string before the truncated sequence.

But even without truncation, it may be better to get a replacement character in the error message of the correct exception than a UnicodeDecodeError.
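
One way to picture "truncate the string before the truncated sequence" is with Python's incremental UTF-8 decoder, which holds back an incomplete trailing sequence when `final=False`. This is only a sketch of the idea, not what CPython does; the helper name `decode_dropping_partial_tail` is hypothetical.

```python
import codecs


def decode_dropping_partial_tail(data: bytes, max_bytes: int) -> str:
    """Decode at most max_bytes bytes, dropping a multibyte sequence cut by the limit.

    Invalid bytes elsewhere in the input still become U+FFFD ("replace").
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    # final=False: an incomplete sequence at the end is buffered, not emitted.
    return decoder.decode(data[:max_bytes], final=False)


data = "abc[\u20ac]".encode("utf-8")
print(repr(decode_dropping_partial_tail(data, 5)))
print(repr(decode_dropping_partial_tail(data, 7)))
```

With this approach, an error message truncated by `%.5s` would end just before the cut character rather than raising or showing a replacement character.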

@vstinner
Member Author

On my PR gh-120248, @methane wrote:

I prefer "strict" because "hard to notice" is also hard to debug.

So I created this PR. @methane: What do you think?

I can modify the %.100s format ("%s" with precision) to truncate to 100 characters instead of 100 bytes, to avoid the risk of creating invalid UTF-8 strings.

@serhiy-storchaka
Member

I think that @methane's comment was only related to the format string (which currently is ASCII-only), not to arguments.

@methane
Member

methane commented Jun 11, 2024

I think 100 codepoints is the best option.

As for the error handler, there is no correct answer. Theoretically speaking, "strict" follows "Errors should never pass silently."
But both "replace" and "backslashreplace" are acceptable.
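
For comparison, a small Python sketch of what "backslashreplace" keeps that "replace" discards. The input bytes are a made-up example; `bytes.decode()` is used here only to show the handlers' behaviour.

```python
raw = b"name=\xff\xfe"  # hypothetical argument containing two invalid bytes

# "replace" hides which bytes were invalid: each becomes U+FFFD.
as_replace = raw.decode("utf-8", errors="replace")

# "backslashreplace" keeps the raw byte values as escape text,
# which can make the bad input easier to track down.
as_backslash = raw.decode("utf-8", errors="backslashreplace")

print(repr(as_replace))
print(as_backslash)
```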

@serhiy-storchaka
Member

Precision should specify the length in bytes. This feature can be used to format non-null-terminated strings.

char buffer[100];
PyUnicode_FromFormat("%.100s", buffer);

If you start to count codepoints, you can read past the end of the array.
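
The overrun risk comes down to byte counts, which can be modelled in Python (Python itself cannot read out of bounds, so this only illustrates the arithmetic). The buffer contents are an assumed worst-ish case of three-byte characters.

```python
BUFFER_SIZE = 100

text = "\u20ac" * 100              # 100 code points, 3 bytes each in UTF-8
encoded = text.encode("utf-8")     # 300 bytes in total
buffer = encoded[:BUFFER_SIZE]     # what a 100-byte array could actually hold

# Bytes a decoder would have to read to produce 100 code points.
bytes_for_100_codepoints = len(text[:100].encode("utf-8"))

# Number of bytes that would be read beyond the end of the array.
overrun = bytes_for_100_codepoints - len(buffer)

print(bytes_for_100_codepoints, len(buffer), overrun)
```

A byte-counted precision never reads more than the array size, which is why it is safe for buffers without a terminating NUL.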

@vstinner
Member Author

I'm abandoning this PR. It seems that using the "replace" error handler is more appropriate here.

@vstinner vstinner closed this Jun 17, 2024
@vstinner vstinner deleted the format_strict branch June 17, 2024 20:02