
Author lemburg
Recipients flox, lemburg, michael.foord
Date 2010-01-08.20:18:20
Message-id <4B47930B.6010708@egenix.com>
In-reply-to <1262946727.6.0.739397040405.issue7643@psf.upfronthosting.co.za>
Content
Florent Xicluna wrote:
> 
> Florent Xicluna <laxyf@yahoo.fr> added the comment:
> 
> Some technical background.
> 
> == Unicode ==
> 
> According to the Unicode Standard Annex #9, a character with
> bidirectional class B is a "Paragraph Separator". And “Because a
> Paragraph Separator breaks lines, there will be at most one per line,
> at the end of that line.”
> 
> As a consequence, there are 3 reasons to identify a character as a
> linebreak:
>  - General Category Zl "Line Separator"
>  - General Category Zp "Paragraph Separator"
>  - Bidirectional Class B "Paragraph Separator"

This definition is what we use in Python for Py_UNICODE_ISLINEBREAK(ch).

> There are 8 linebreaks in the current Unicode Character Database (5.2):
> ------------------------------------------------------------------------
> 000A    LF  LINE FEED                   Cc  B
> 000D    CR  CARRIAGE RETURN             Cc  B
> 001C    FS  INFORMATION SEPARATOR FOUR  Cc  B (UCD 3.1 FILE SEPARATOR)
> 001D    GS  INFORMATION SEPARATOR THREE Cc  B (UCD 3.1 GROUP SEPARATOR)
> 001E    RS  INFORMATION SEPARATOR TWO   Cc  B (UCD 3.1 RECORD SEPARATOR)
> 0085    NEL NEXT LINE                   Cc  B (C1 Control Code)
> 2028    LS  LINE SEPARATOR              Zl  WS  (Unicode)
> 2029    PS  PARAGRAPH SEPARATOR         Zp  B   (Unicode)
> ------------------------------------------------------------------------

And that's the list we're currently using.
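
For readers who want to re-derive the table, here is a minimal sketch that applies the three-part definition quoted above using the standard `unicodedata` module. The helper name `is_linebreak` is hypothetical and is not the C macro `Py_UNICODE_ISLINEBREAK` itself; the result depends on the Unicode database version bundled with the interpreter, so it may differ from the 5.2 table if later versions add such characters.

```python
import sys
import unicodedata


def is_linebreak(ch):
    """Hypothetical helper: apply the definition quoted in this message.

    A character counts as a linebreak if its General Category is
    Zl or Zp, or if its Bidirectional Class is B.
    """
    return (unicodedata.category(ch) in ("Zl", "Zp")
            or unicodedata.bidirectional(ch) == "B")


# Scan the whole code space and collect the matching characters.
linebreaks = [chr(cp) for cp in range(sys.maxunicode + 1)
              if is_linebreak(chr(cp))]

# Print code point, General Category, Bidirectional Class and name.
# Control characters have no name in the database, hence the default.
for ch in linebreaks:
    print("%04X  %-2s  %-2s  %s" % (
        ord(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        unicodedata.name(ch, "<control>"),
    ))
```

Note that this only reproduces the definition discussed here; the set of characters an interpreter actually splits on is whatever its own implementation and documentation say.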

> == ASCII ==
> 
> The standard ASCII control codes (C0) are in the range 00-1F,
> which limits the list to LF, CR, FS, GS, RS.
> The last three are not considered linebreaks:
> “The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to
> structure data, usually on a tape, in order to simulate punched cards. End of
> medium (EM) warns that the tape (or whatever) is ending. While many systems use
> CR/LF and TAB for structuring data, it is possible to encounter the separator
> control characters in data that needs to be structured. The separator control
> characters are not overloaded; there is no general use of them except to
> separate data into structured groupings. Their numeric values are contiguous
> with the space character, which can be considered a member of the group, as a
> word separator.”
> (Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)
> 
> In conclusion, it may be better to keep things unchanged.

Agreed.

> We may add some words to the documentation for str.splitlines() and bytes.splitlines() to explain what is considered a line break character.

For ASCII we should make the list of characters explicit.
For Unicode, we should mention the above definition and give
the table as an example list (the Unicode database may add more
such characters in the future).
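
To illustrate why the documentation should spell this out, here is a small sketch of how the two methods differ on a current Python 3 interpreter. This is an observation about present-day behaviour, not necessarily that of the interpreter under discussion in 2010; the sample strings are made up for the example.

```python
# Text containing FS (U+001C), LS (U+2028) and LF (U+000A).
text = "one\x1ctwo\u2028three\nfour"

# Bytes containing FS (0x1C) and LF (0x0A).
data = b"one\x1ctwo\nfour"

# str.splitlines() uses the Unicode linebreak list, so FS and LS split.
str_lines = text.splitlines()

# bytes.splitlines() only splits on ASCII line boundaries (LF, CR, CR LF),
# so the FS byte stays inside the first line.
bytes_lines = data.splitlines()

print(str_lines)
print(bytes_lines)
```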

> References:
>  - The Unicode Character Database (UCD): http://www.unicode.org/ucd/
>  - UCD Property Values: http://unicode.org/reports/tr44/#Property_Values
>  - The Bidirectional Algorithm: http://www.unicode.org/reports/tr9/
>  - C0 and C1 Control Codes:
>      http://en.wikipedia.org/wiki/C0_and_C1_control_codes
History
Date                 User     Action         Args
2010-01-08 20:18:23  lemburg  setrecipients  + lemburg, michael.foord, flox
2010-01-08 20:18:22  lemburg  link           issue7643 messages
2010-01-08 20:18:20  lemburg  create