From 5c8f375cf18bed364b332b6095bf041265ea255b Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Thu, 13 Jun 2024 10:29:51 +0100 Subject: [PATCH 1/2] gh-119786: move locations doc to InternalDocs. Delete lnotab_notes.txt as it is for older versions --- InternalDocs/README.md | 6 +- {Objects => InternalDocs}/locations.md | 8 +- Objects/lnotab_notes.txt | 228 ------------------------- 3 files changed, 8 insertions(+), 234 deletions(-) rename {Objects => InternalDocs}/locations.md (88%) delete mode 100644 Objects/lnotab_notes.txt diff --git a/InternalDocs/README.md b/InternalDocs/README.md index 42f6125794266a..2918ead265dcd1 100644 --- a/InternalDocs/README.md +++ b/InternalDocs/README.md @@ -14,6 +14,8 @@ it is not, please report that through the [Compiler Design](compiler.md) -[Exception Handling](exception_handling.md) - [Adaptive Instruction Families](adaptive.md) + +[The Source Code Locations Table](locations.md) + +[Exception Handling](exception_handling.md) diff --git a/Objects/locations.md b/InternalDocs/locations.md similarity index 88% rename from Objects/locations.md rename to InternalDocs/locations.md index 18a338a95978a9..91a7824e2a8e4d 100644 --- a/Objects/locations.md +++ b/InternalDocs/locations.md @@ -1,10 +1,10 @@ # Locations table -For versions up to 3.10 see ./lnotab_notes.txt +The `co_linetable` bytes object of code objects contains a compact +representation of the source code positions of instructions, which are +returned by the `co_positions()` iterator. -In version 3.11 the `co_linetable` bytes object of code objects contains a compact representation of the positions returned by the `co_positions()` iterator. - -The `co_linetable` consists of a sequence of location entries. +`co_linetable` consists of a sequence of location entries. Each entry starts with a byte with the most significant bit set, followed by zero or more bytes with most significant bit unset. Each entry contains the following information: diff --git a/Objects/lnotab_notes.txt b/Objects/lnotab_notes.txt deleted file mode 100644 index 0f3599340318f0..00000000000000 --- a/Objects/lnotab_notes.txt +++ /dev/null @@ -1,228 +0,0 @@ -Description of the internal format of the line number table in Python 3.10 -and earlier. - -(For 3.11 onwards, see Objects/locations.md) - -Conceptually, the line number table consists of a sequence of triples: - start-offset (inclusive), end-offset (exclusive), line-number. - -Note that not all byte codes have a line number so we need handle `None` for the line-number. - -However, storing the above sequence directly would be very inefficient as we would need 12 bytes per entry. - -First, note that the end of one entry is the same as the start of the next, so we can overlap entries. -Second, we don't really need arbitrary access to the sequence, so we can store deltas. - -We just need to store (end - start, line delta) pairs. The start offset of the first entry is always zero. - -Third, most deltas are small, so we can use a single byte for each value, as long we allow several entries for the same line. - -Consider the following table - Start End Line - 0 6 1 - 6 50 2 - 50 350 7 - 350 360 No line number - 360 376 8 - 376 380 208 - -Stripping the redundant ends gives: - - End-Start Line-delta - 6 +1 - 44 +1 - 300 +5 - 10 No line number - 16 +1 - 4 +200 - - -Note that the end - start value is always positive. - -Finally, in order to fit into a single byte we need to convert start deltas to the range 0 <= delta <= 254, -and line deltas to the range -127 <= delta <= 127. -A line delta of -128 is used to indicate no line number. -Also note that a delta of zero indicates that there are no bytecodes in the given range, -which means we can use an invalid line number for that range. - -Final form: - - Start delta Line delta - 6 +1 - 44 +1 - 254 +5 - 46 0 - 10 -128 (No line number, treated as a delta of zero) - 16 +1 - 0 +127 (line 135, but the range is empty as no bytecodes are at line 135) - 4 +73 - -Iterating over the table. -------------------------- - -For the `co_lines` method we want to emit the full form, omitting the (350, 360, No line number) and empty entries. - -The code is as follows: - -def co_lines(code): - line = code.co_firstlineno - end = 0 - table_iter = iter(code.internal_line_table): - for sdelta, ldelta in table_iter: - if ldelta == 0: # No change to line number, just accumulate changes to end - end += sdelta - continue - start = end - end = start + sdelta - if ldelta == -128: # No valid line number -- skip entry - continue - line += ldelta - if end == start: # Empty range, omit. - continue - yield start, end, line - - - - -The historical co_lnotab format -------------------------------- - -prior to 3.10 code objects stored a field named co_lnotab. -This was an array of unsigned bytes disguised as a Python bytes object. - -The old co_lnotab did not account for the presence of bytecodes without a line number, -nor was it well suited to tracing as a number of workarounds were required. - -The old format can still be accessed via `code.co_lnotab`, which is lazily computed from the new format. - -Below is the description of the old co_lnotab format: - - -The array is conceptually a compressed list of - (bytecode offset increment, line number increment) -pairs. The details are important and delicate, best illustrated by example: - - byte code offset source code line number - 0 1 - 6 2 - 50 7 - 350 207 - 361 208 - -Instead of storing these numbers literally, we compress the list by storing only -the difference from one row to the next. Conceptually, the stored list might -look like: - - 0, 1, 6, 1, 44, 5, 300, 200, 11, 1 - -The above doesn't really work, but it's a start. An unsigned byte (byte code -offset) can't hold negative values, or values larger than 255, a signed byte -(line number) can't hold values larger than 127 or less than -128, and the -above example contains two such values. (Note that before 3.6, line number -was also encoded by an unsigned byte.) So we make two tweaks: - - (a) there's a deep assumption that byte code offsets increase monotonically, - and - (b) if byte code offset jumps by more than 255 from one row to the next, or if - source code line number jumps by more than 127 or less than -128 from one row - to the next, more than one pair is written to the table. In case #b, - there's no way to know from looking at the table later how many were written. - That's the delicate part. A user of co_lnotab desiring to find the source - line number corresponding to a bytecode address A should do something like - this: - - lineno = addr = 0 - for addr_incr, line_incr in co_lnotab: - addr += addr_incr - if addr > A: - return lineno - if line_incr >= 0x80: - line_incr -= 0x100 - lineno += line_incr - -(In C, this is implemented by PyCode_Addr2Line().) In order for this to work, -when the addr field increments by more than 255, the line # increment in each -pair generated must be 0 until the remaining addr increment is < 256. So, in -the example above, assemble_lnotab in compile.c should not (as was actually done -until 2.2) expand 300, 200 to - 255, 255, 45, 45, -but to - 255, 0, 45, 127, 0, 73. - -The above is sufficient to reconstruct line numbers for tracebacks, but not for -line tracing. Tracing is handled by PyCode_CheckLineNumber() in codeobject.c -and maybe_call_line_trace() in ceval.c. - -*** Tracing *** - -To a first approximation, we want to call the tracing function when the line -number of the current instruction changes. Re-computing the current line for -every instruction is a little slow, though, so each time we compute the line -number we save the bytecode indices where it's valid: - - *instr_lb <= frame->f_lasti < *instr_ub - -is true so long as execution does not change lines. That is, *instr_lb holds -the first bytecode index of the current line, and *instr_ub holds the first -bytecode index of the next line. As long as the above expression is true, -maybe_call_line_trace() does not need to call PyCode_CheckLineNumber(). Note -that the same line may appear multiple times in the lnotab, either because the -bytecode jumped more than 255 indices between line number changes or because -the compiler inserted the same line twice. Even in that case, *instr_ub holds -the first index of the next line. - -However, we don't *always* want to call the line trace function when the above -test fails. - -Consider this code: - -1: def f(a): -2: while a: -3: print(1) -4: break -5: else: -6: print(2) - -which compiles to this: - - 2 0 SETUP_LOOP 26 (to 28) - >> 2 LOAD_FAST 0 (a) - 4 POP_JUMP_IF_FALSE 18 - - 3 6 LOAD_GLOBAL 0 (print) - 8 LOAD_CONST 1 (1) - 10 CALL_NO_KW 1 - 12 POP_TOP - - 4 14 BREAK_LOOP - 16 JUMP_ABSOLUTE 2 - >> 18 POP_BLOCK - - 6 20 LOAD_GLOBAL 0 (print) - 22 LOAD_CONST 2 (2) - 24 CALL_NO_KW 1 - 26 POP_TOP - >> 28 LOAD_CONST 0 (None) - 30 RETURN_VALUE - -If 'a' is false, execution will jump to the POP_BLOCK instruction at offset 18 -and the co_lnotab will claim that execution has moved to line 4, which is wrong. -In this case, we could instead associate the POP_BLOCK with line 5, but that -would break jumps around loops without else clauses. - -We fix this by only calling the line trace function for a forward jump if the -co_lnotab indicates we have jumped to the *start* of a line, i.e. if the current -instruction offset matches the offset given for the start of a line by the -co_lnotab. For backward jumps, however, we always call the line trace function, -which lets a debugger stop on every evaluation of a loop guard (which usually -won't be the first opcode in a line). - -Why do we set f_lineno when tracing, and only just before calling the trace -function? Well, consider the code above when 'a' is true. If stepping through -this with 'n' in pdb, you would stop at line 1 with a "call" type event, then -line events on lines 2, 3, and 4, then a "return" type event -- but because the -code for the return actually falls in the range of the "line 6" opcodes, you -would be shown line 6 during this event. This is a change from the behaviour in -2.2 and before, and I've found it confusing in practice. By setting and using -f_lineno when tracing, one can report a line number different from that -suggested by f_lasti on this one occasion where it's desirable. From 524e4449ed98cad1c0a8c0f32f6fdfcb80f5d6a4 Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Wed, 19 Jun 2024 16:43:58 +0100 Subject: [PATCH 2/2] revert deletion of lnotab_notes.txt --- Objects/lnotab_notes.txt | 228 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 228 insertions(+) create mode 100644 Objects/lnotab_notes.txt diff --git a/Objects/lnotab_notes.txt b/Objects/lnotab_notes.txt new file mode 100644 index 00000000000000..0f3599340318f0 --- /dev/null +++ b/Objects/lnotab_notes.txt @@ -0,0 +1,228 @@ +Description of the internal format of the line number table in Python 3.10 +and earlier. + +(For 3.11 onwards, see Objects/locations.md) + +Conceptually, the line number table consists of a sequence of triples: + start-offset (inclusive), end-offset (exclusive), line-number. + +Note that not all byte codes have a line number so we need handle `None` for the line-number. + +However, storing the above sequence directly would be very inefficient as we would need 12 bytes per entry. + +First, note that the end of one entry is the same as the start of the next, so we can overlap entries. +Second, we don't really need arbitrary access to the sequence, so we can store deltas. + +We just need to store (end - start, line delta) pairs. The start offset of the first entry is always zero. + +Third, most deltas are small, so we can use a single byte for each value, as long we allow several entries for the same line. + +Consider the following table + Start End Line + 0 6 1 + 6 50 2 + 50 350 7 + 350 360 No line number + 360 376 8 + 376 380 208 + +Stripping the redundant ends gives: + + End-Start Line-delta + 6 +1 + 44 +1 + 300 +5 + 10 No line number + 16 +1 + 4 +200 + + +Note that the end - start value is always positive. + +Finally, in order to fit into a single byte we need to convert start deltas to the range 0 <= delta <= 254, +and line deltas to the range -127 <= delta <= 127. +A line delta of -128 is used to indicate no line number. +Also note that a delta of zero indicates that there are no bytecodes in the given range, +which means we can use an invalid line number for that range. + +Final form: + + Start delta Line delta + 6 +1 + 44 +1 + 254 +5 + 46 0 + 10 -128 (No line number, treated as a delta of zero) + 16 +1 + 0 +127 (line 135, but the range is empty as no bytecodes are at line 135) + 4 +73 + +Iterating over the table. +------------------------- + +For the `co_lines` method we want to emit the full form, omitting the (350, 360, No line number) and empty entries. + +The code is as follows: + +def co_lines(code): + line = code.co_firstlineno + end = 0 + table_iter = iter(code.internal_line_table): + for sdelta, ldelta in table_iter: + if ldelta == 0: # No change to line number, just accumulate changes to end + end += sdelta + continue + start = end + end = start + sdelta + if ldelta == -128: # No valid line number -- skip entry + continue + line += ldelta + if end == start: # Empty range, omit. + continue + yield start, end, line + + + + +The historical co_lnotab format +------------------------------- + +prior to 3.10 code objects stored a field named co_lnotab. +This was an array of unsigned bytes disguised as a Python bytes object. + +The old co_lnotab did not account for the presence of bytecodes without a line number, +nor was it well suited to tracing as a number of workarounds were required. + +The old format can still be accessed via `code.co_lnotab`, which is lazily computed from the new format. + +Below is the description of the old co_lnotab format: + + +The array is conceptually a compressed list of + (bytecode offset increment, line number increment) +pairs. The details are important and delicate, best illustrated by example: + + byte code offset source code line number + 0 1 + 6 2 + 50 7 + 350 207 + 361 208 + +Instead of storing these numbers literally, we compress the list by storing only +the difference from one row to the next. Conceptually, the stored list might +look like: + + 0, 1, 6, 1, 44, 5, 300, 200, 11, 1 + +The above doesn't really work, but it's a start. An unsigned byte (byte code +offset) can't hold negative values, or values larger than 255, a signed byte +(line number) can't hold values larger than 127 or less than -128, and the +above example contains two such values. (Note that before 3.6, line number +was also encoded by an unsigned byte.) So we make two tweaks: + + (a) there's a deep assumption that byte code offsets increase monotonically, + and + (b) if byte code offset jumps by more than 255 from one row to the next, or if + source code line number jumps by more than 127 or less than -128 from one row + to the next, more than one pair is written to the table. In case #b, + there's no way to know from looking at the table later how many were written. + That's the delicate part. A user of co_lnotab desiring to find the source + line number corresponding to a bytecode address A should do something like + this: + + lineno = addr = 0 + for addr_incr, line_incr in co_lnotab: + addr += addr_incr + if addr > A: + return lineno + if line_incr >= 0x80: + line_incr -= 0x100 + lineno += line_incr + +(In C, this is implemented by PyCode_Addr2Line().) In order for this to work, +when the addr field increments by more than 255, the line # increment in each +pair generated must be 0 until the remaining addr increment is < 256. So, in +the example above, assemble_lnotab in compile.c should not (as was actually done +until 2.2) expand 300, 200 to + 255, 255, 45, 45, +but to + 255, 0, 45, 127, 0, 73. + +The above is sufficient to reconstruct line numbers for tracebacks, but not for +line tracing. Tracing is handled by PyCode_CheckLineNumber() in codeobject.c +and maybe_call_line_trace() in ceval.c. + +*** Tracing *** + +To a first approximation, we want to call the tracing function when the line +number of the current instruction changes. Re-computing the current line for +every instruction is a little slow, though, so each time we compute the line +number we save the bytecode indices where it's valid: + + *instr_lb <= frame->f_lasti < *instr_ub + +is true so long as execution does not change lines. That is, *instr_lb holds +the first bytecode index of the current line, and *instr_ub holds the first +bytecode index of the next line. As long as the above expression is true, +maybe_call_line_trace() does not need to call PyCode_CheckLineNumber(). Note +that the same line may appear multiple times in the lnotab, either because the +bytecode jumped more than 255 indices between line number changes or because +the compiler inserted the same line twice. Even in that case, *instr_ub holds +the first index of the next line. + +However, we don't *always* want to call the line trace function when the above +test fails. + +Consider this code: + +1: def f(a): +2: while a: +3: print(1) +4: break +5: else: +6: print(2) + +which compiles to this: + + 2 0 SETUP_LOOP 26 (to 28) + >> 2 LOAD_FAST 0 (a) + 4 POP_JUMP_IF_FALSE 18 + + 3 6 LOAD_GLOBAL 0 (print) + 8 LOAD_CONST 1 (1) + 10 CALL_NO_KW 1 + 12 POP_TOP + + 4 14 BREAK_LOOP + 16 JUMP_ABSOLUTE 2 + >> 18 POP_BLOCK + + 6 20 LOAD_GLOBAL 0 (print) + 22 LOAD_CONST 2 (2) + 24 CALL_NO_KW 1 + 26 POP_TOP + >> 28 LOAD_CONST 0 (None) + 30 RETURN_VALUE + +If 'a' is false, execution will jump to the POP_BLOCK instruction at offset 18 +and the co_lnotab will claim that execution has moved to line 4, which is wrong. +In this case, we could instead associate the POP_BLOCK with line 5, but that +would break jumps around loops without else clauses. + +We fix this by only calling the line trace function for a forward jump if the +co_lnotab indicates we have jumped to the *start* of a line, i.e. if the current +instruction offset matches the offset given for the start of a line by the +co_lnotab. For backward jumps, however, we always call the line trace function, +which lets a debugger stop on every evaluation of a loop guard (which usually +won't be the first opcode in a line). + +Why do we set f_lineno when tracing, and only just before calling the trace +function? Well, consider the code above when 'a' is true. If stepping through +this with 'n' in pdb, you would stop at line 1 with a "call" type event, then +line events on lines 2, 3, and 4, then a "return" type event -- but because the +code for the return actually falls in the range of the "line 6" opcodes, you +would be shown line 6 during this event. This is a change from the behaviour in +2.2 and before, and I've found it confusing in practice. By setting and using +f_lineno when tracing, one can report a line number different from that +suggested by f_lasti on this one occasion where it's desirable.