gh-119182: Use strict error handler in PyUnicode_FromFormat() #120307

vstinner · Jun 10, 2024

PyUnicode_FromFormat() now decodes the "%s" format argument from UTF-8 with the "strict" error handler, instead of the "replace" error handler.

Remove the unused 'consumed' parameter of unicode_decode_utf8_writer().

Issue: [C API] Add an efficient public PyUnicodeWriter API #119182

📚 Documentation preview 📚: https://cpython-previews--120307.org.readthedocs.build/

vstinner · Jun 10, 2024

cc @methane @serhiy-storchaka

serhiy-storchaka · Jun 10, 2024

What happens with truncated strings, like %.50s, if they are truncated in the middle of a multibyte UTF-8 sequence?

vstinner · Jun 10, 2024

What happens with truncated strings, like %.50s, if they are truncated in the middle of a multibyte UTF-8 sequence?

There are two tests on that: UnicodeDecodeError is raised in this case.

vstinner · Jun 10, 2024

Example of test: test_capi.test_unicode

        # test "%s" format with precision
        check_format('abc',
                     b'%.3s', b'abcdef')
        with self.assertRaises(UnicodeDecodeError):
            PyUnicode_FromFormat(b'%.5s', 'abc[\u20ac]'.encode('utf8'))
        check_format('abc[\u20ac',
                     b'%.7s', 'abc[\u20ac]'.encode('utf8'))
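
The behavior checked by this test can be approximated in pure Python. The sketch below is an illustration only, not CPython's C implementation: `decode_with_precision` is a hypothetical helper, and it assumes that "%.Ns" keeps the first N bytes of the argument and then decodes them as UTF-8.

```python
# Assumption: the precision slices bytes first, then the slice is decoded.
data = 'abc[\u20ac]'.encode('utf8')  # 8 bytes: 'abc[', a 3-byte euro sign, ']'


def decode_with_precision(raw: bytes, precision: int, errors: str = 'strict') -> str:
    """Hypothetical helper: keep `precision` bytes, then decode them as UTF-8."""
    return raw[:precision].decode('utf-8', errors)


# 7 bytes end exactly after the euro sign, so strict decoding succeeds.
print(ascii(decode_with_precision(data, 7)))

# 5 bytes cut the euro sign after its first byte, so strict decoding fails.
try:
    decode_with_precision(data, 5)
except UnicodeDecodeError as exc:
    print('strict decoding failed:', exc.reason)
```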

serhiy-storchaka · Jun 10, 2024

This is bad. Such formats are common in error formatting code (not only in CPython, but also in third-party code), and now you will get a UnicodeDecodeError instead of the original error even if everything was fine with the encoding. In this case I think it is better to truncate the string before the truncated sequence.

But even without truncation, it may be better to get a replacement character in the error message of the correct exception than a UnicodeDecodeError.
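
One way to picture "truncate the string before the truncated sequence" is an incremental decoder that is never told the input is final. This is a minimal Python sketch under that assumption, not what the PR or CPython's C code does; `decode_prefix` is a hypothetical name.

```python
import codecs


def decode_prefix(raw: bytes, precision: int, errors: str = 'replace') -> str:
    """Hypothetical helper: decode at most `precision` bytes of UTF-8 and
    drop a multibyte sequence that the byte limit cuts off at the end."""
    decoder = codecs.getincrementaldecoder('utf-8')(errors=errors)
    # final=False: an incomplete trailing sequence is buffered by the decoder
    # instead of being reported as an error or replaced.
    return decoder.decode(raw[:precision], final=False)


data = 'abc[\u20ac]'.encode('utf8')
print(ascii(decode_prefix(data, 5)))  # the cut-off euro sign is dropped
print(ascii(decode_prefix(data, 7)))  # the complete euro sign is kept
```

Invalid bytes in the middle of the data still go through the chosen error handler; only an incomplete sequence at the very end is dropped.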

vstinner · Jun 10, 2024

On my PR gh-120248, @methane wrote:

I prefer "strict" because "hard to notice" is also hard to debug.

So I created this PR. @methane: What do you think?

I can modify the %.100s format ("%s" with precision) to truncate to 100 characters instead of 100 bytes, to avoid the risk of creating invalid UTF-8 strings.

serhiy-storchaka · Jun 10, 2024

I think that @methane's comment was only related to the format string (which currently is ASCII-only), not to arguments.

methane · Jun 11, 2024

I think 100 codepoints is the best option.

About the error handler, there is no correct answer. Theoretically speaking, "strict" follows "Errors should never pass silently."
But both "replace" and "backslashreplace" are acceptable.
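
For comparison, here is how the three handlers mentioned above treat the same truncated input in Python's own UTF-8 decoder. This is only a sketch of the trade-off; the sample bytes are made up for illustration.

```python
# "café" encoded as UTF-8, with the last byte of the 2-byte "é" cut off.
raw = b'caf\xc3'

for errors in ('strict', 'replace', 'backslashreplace'):
    try:
        # ascii() keeps the output readable regardless of terminal encoding.
        print(errors, '->', ascii(raw.decode('utf-8', errors)))
    except UnicodeDecodeError as exc:
        print(errors, '-> UnicodeDecodeError:', exc.reason)
```

"strict" hides nothing but replaces the caller's result with an exception, "replace" yields U+FFFD and loses the offending byte, and "backslashreplace" keeps the byte value visible as an escape.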

serhiy-storchaka · Jun 11, 2024

Precision should specify the length in bytes. This feature can be used to format strings that are not null-terminated.

char buffer[100];
PyUnicode_FromFormat("%.100s", buffer);

If you start to count codepoints, you can read past the end of the array.
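
The size mismatch can be made concrete with a small Python sketch. `bytes_spanned` is a hypothetical helper that assumes valid UTF-8 input; it only counts how many bytes a given number of code points would cover, which is the amount a code-point-based precision would have to read.

```python
def bytes_spanned(raw: bytes, codepoints: int) -> int:
    """Hypothetical helper: number of bytes covered by the first
    `codepoints` code points of valid UTF-8 data."""
    count = 0
    for index, byte in enumerate(raw):
        # Bytes of the form 0b10xxxxxx continue a sequence;
        # any other byte starts a new code point.
        if (byte & 0xC0) != 0x80:
            if count == codepoints:
                return index
            count += 1
    return len(raw)


buffer_size = 100                         # size of the C array in the example
raw = ('\u20ac' * 100).encode('utf-8')    # 100 code points, 3 bytes each
print(bytes_spanned(raw, 100), 'bytes needed; the buffer holds', buffer_size)
```

With a byte-based precision, "%.100s" never reads more than 100 bytes, so the fixed-size buffer stays safe whatever it contains.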

vstinner · Jun 17, 2024

I am abandoning this PR. It seems that using the "replace" error handler is more appropriate here.

pythongh-119182: Use strict error handler in PyUnicode_FromFormat()

3541237

bedevere-app bot mentioned this pull request Jun 10, 2024

[C API] Add an efficient public PyUnicodeWriter API #119182

Closed

bedevere-app bot added the awaiting core review label Jun 10, 2024

vstinner mentioned this pull request Jun 10, 2024

gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8 #120248

Closed

vstinner closed this Jun 17, 2024

vstinner deleted the format_strict branch June 17, 2024 20:02
