gh-119182: Use strict error handler in PyUnicode_FromFormat() #120307
PyUnicode_FromFormat() now decodes the "%s" format argument from UTF-8 with the "strict" error handler, instead of the "replace" error handler. Remove the unused 'consumed' parameter of unicode_decode_utf8_writer().
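The difference between the two error handlers can be illustrated from Python, where `bytes.decode()` accepts the same handler names. This is only a sketch of the decoding behaviour, not of the C implementation, and the sample bytes are made up for illustration.

```python
# Illustration only: bytes.decode() with the "replace" and "strict" handlers.
data = "abc[\u20ac]".encode("utf-8")  # 8 bytes; the euro sign takes 3 of them
truncated = data[:5]                  # cuts the euro sign after its first byte

# "replace" substitutes U+FFFD for the incomplete sequence.
replaced = truncated.decode("utf-8", errors="replace")

# "strict" raises UnicodeDecodeError instead of substituting anything.
try:
    truncated.decode("utf-8", errors="strict")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc
```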
What happens with truncated strings, like |
There are two tests on that: UnicodeDecodeError is raised in this case.
Example of test in test_capi.test_unicode:

    # test "%s" format with precision
    check_format('abc',
                 b'%.3s', b'abcdef')
    with self.assertRaises(UnicodeDecodeError):
        PyUnicode_FromFormat(b'%.5s', 'abc[\u20ac]'.encode('utf8'))
    check_format('abc[\u20ac',
                 b'%.7s', 'abc[\u20ac]'.encode('utf8'))
This is bad. Such formats are common in error formatting code (not only in CPython, but also in third-party code), and now you will get a UnicodeDecodeError instead of the original error even if the encoding was fine. In this case I think it is better to truncate the string before the truncated sequence. But even without truncation, it may be better to get a replacement character in the error message of the correct exception than a UnicodeDecodeError.
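The "truncate before the truncated sequence" idea can be sketched in Python with an incremental decoder. This is an assumption about what such behaviour could look like, not code from the PR; `decode_complete_prefix` is a hypothetical helper name.

```python
import codecs


def decode_complete_prefix(data: bytes) -> str:
    """Decode UTF-8, dropping an incomplete trailing sequence (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    # With final=False the decoder keeps an incomplete trailing sequence
    # buffered instead of raising, so only complete characters are returned.
    return decoder.decode(data, final=False)


encoded = "abc[\u20ac]".encode("utf-8")
cut_mid_char = decode_complete_prefix(encoded[:5])  # euro sign cut after 1 byte
cut_on_boundary = decode_complete_prefix(encoded[:7])  # euro sign complete
```

Bytes that are invalid in the middle of the input would still raise UnicodeDecodeError in this sketch; only the cut-off tail is tolerated.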
On my PR gh-120248, @methane wrote:
So I created this PR. @methane: What do you think? I can modify the |
I think that @methane's comment was only related to the format string (which currently is ASCII-only), not to arguments.
I think 100 codepoints is the best option. As for the error handler, there is no correct answer. Theoretically speaking, "strict" is "Errors should never pass silently."
Precision should specify the length in bytes. This feature can be used to format non-null-terminated strings.

    char buffer[100];
    PyUnicode_FromFormat("%.100s", buffer);

If you start to count codepoints, you can read past the end of the array.
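The byte-versus-codepoint concern can be checked with simple arithmetic in Python. This is only a sketch under the assumption of a 100-byte buffer filled with 3-byte characters; the variable names are illustrative and nothing here calls the C API.

```python
BUFFER_SIZE = 100  # size of the hypothetical C buffer, in bytes

# 100 euro signs: 100 codepoints, but 3 bytes each in UTF-8.
text = "\u20ac" * 100
encoded = text.encode("utf-8")
bytes_for_100_codepoints = len(encoded)

# A precision counted in codepoints would need this many bytes past the buffer.
overrun = bytes_for_100_codepoints - BUFFER_SIZE

# A precision counted in bytes stays inside the buffer, but can end mid-character.
window = encoded[:BUFFER_SIZE]
complete_chars, leftover_bytes = divmod(len(window), 3)
```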
I abandon this PR. It seems like using |