Message 97438 - Python tracker

Author	lemburg
Recipients	flox, lemburg, michael.foord
Date	2010-01-08.20:18:20
Message-id	<4B47930B.6010708@egenix.com>
In-reply-to	<1262946727.6.0.739397040405.issue7643@psf.upfronthosting.co.za>

Content

Florent Xicluna wrote:
> 
> Florent Xicluna <laxyf@yahoo.fr> added the comment:
> 
> Some technical background.
> 
> == Unicode ==
> 
> According to the Unicode Standard Annex #9, a character with
> bidirectional class B is a "Paragraph Separator". And “Because a
> Paragraph Separator breaks lines, there will be at most one per line,
> at the end of that line.”
> 
> As a consequence, there are three reasons to identify a character as a
> linebreak:
>  - General Category Zl "Line Separator"
>  - General Category Zp "Paragraph Separator"
>  - Bidirectional Class B "Paragraph Separator"

This definition is what we use in Python for Py_UNICODE_ISLINEBREAK(ch).
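
A minimal sketch of that definition in Python, assuming the standard
unicodedata module. The helper name is_linebreak is hypothetical, and this
is only an illustration of the three criteria, not how the C macro is
implemented. The result depends on the Unicode database bundled with the
interpreter.

```python
import unicodedata


def is_linebreak(ch: str) -> bool:
    """Return True if ch matches one of the three criteria quoted above."""
    return (
        unicodedata.category(ch) in ("Zl", "Zp")   # Line / Paragraph Separator
        or unicodedata.bidirectional(ch) == "B"    # Bidi class B
    )


# Scan every code point and collect the ones that qualify.
linebreaks = [cp for cp in range(0x110000) if is_linebreak(chr(cp))]

for cp in linebreaks:
    ch = chr(cp)
    print(f"{cp:04X}  {unicodedata.category(ch)}  {unicodedata.bidirectional(ch)}")
```

With a database that matches the table below, the scan should list the same
code points; a newer database could in principle list more.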

> There are 8 linebreaks in the current Unicode Database (5.2):
> ------------------------------------------------------------------------
> 000A    LF  LINE FEED                   Cc  B
> 000D    CR  CARRIAGE RETURN             Cc  B
> 001C    FS  INFORMATION SEPARATOR FOUR  Cc  B (UCD 3.1 FILE SEPARATOR)
> 001D    GS  INFORMATION SEPARATOR THREE Cc  B (UCD 3.1 GROUP SEPARATOR)
> 001E    RS  INFORMATION SEPARATOR TWO   Cc  B (UCD 3.1 RECORD SEPARATOR)
> 0085    NEL NEXT LINE                   Cc  B (C1 Control Code)
> 2028    LS  LINE SEPARATOR              Zl  WS  (Unicode)
> 2029    PS  PARAGRAPH SEPARATOR         Zp  B   (Unicode)
> ------------------------------------------------------------------------

And that's the list we're currently using.

> == ASCII ==
> 
> The Standard ASCII control codes (C0) are in the range 00-1F.
> This limits the list to LF, CR, FS, GS, RS.
> The last three are not considered linebreaks:
> “The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to
> structure data, usually on a tape, in order to simulate punched cards. End of
> medium (EM) warns that the tape (or whatever) is ending. While many systems use
> CR/LF and TAB for structuring data, it is possible to encounter the separator
> control characters in data that needs to be structured. The separator control
> characters are not overloaded; there is no general use of them except to
> separate data into structured groupings. Their numeric values are contiguous
> with the space character, which can be considered a member of the group, as a
> word separator.”
> (Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)
> 
> In conclusion, it may be better to keep things unchanged.

Agreed.

> We may add some words to the documentation for str.splitlines() and bytes.splitlines() to explain what is considered a line break character.

For ASCII, we should make the list of characters explicit.
For Unicode, we should mention the above definition and give
the table as an example list (the Unicode database may add more
such characters in the future).
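
A short example of the difference such documentation would need to spell
out, as observed on a current CPython 3 interpreter (an assumption; the
versions under discussion here may differ in detail). The sample strings are
made up for illustration.

```python
# str.splitlines() splits on the Unicode line breaks, including FS and LS.
text = "one\x1ctwo\u2028three\nfour"
print(text.splitlines())   # ['one', 'two', 'three', 'four']

# bytes.splitlines() only splits on the ASCII newlines \n, \r and \r\n,
# so the FS byte (0x1C) stays inside the first item.
data = b"one\x1ctwo\nthree\r\nfour"
print(data.splitlines())   # [b'one\x1ctwo', b'three', b'four']
```

Later CPython releases (3.2 and up) also treat \v and \f as line boundaries
in str.splitlines(), so the table above should not be read as the complete
list for current versions.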

> References:
>  - The Unicode Character Database (UCD): http://www.unicode.org/ucd/
>  - UCD Property Values: http://unicode.org/reports/tr44/#Property_Values
>  - The Bidirectional Algorithm: http://www.unicode.org/reports/tr9/
>  - C0 and C1 Control Codes:
>      http://en.wikipedia.org/wiki/C0_and_C1_control_codes

History
Date	User	Action	Args
2010-01-08 20:18:23	lemburg	set	recipients: + lemburg, michael.foord, flox
2010-01-08 20:18:22	lemburg	link	issue7643 messages
2010-01-08 20:18:20	lemburg	create