Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Commit 0abf997

Browse filesBrowse files
pulkintim-one
andauthored
gh-119105: difflib: improve recursion for degenerate cases (#119131)
Code from https://github.com/pulkin, in PR #119131 Greatly speeds `Differ` when there are many identically scoring pairs, by splitting the recursion near the inputs' midpoints instead of degenerating (as now) into just peeling off the first two lines. Co-authored-by: Tim Peters <tim.peters@gmail.com>
1 parent 3c28510 commit 0abf997
Copy full SHA for 0abf997

File tree

2 files changed

+20
-5
lines changed
Filter options

2 files changed

+20
-5
lines changed

‎Lib/difflib.py

Copy file name to clipboardExpand all lines: Lib/difflib.py
+19-5Lines changed: 19 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -911,33 +911,47 @@ def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
911911

912912
# don't synch up unless the lines have a similarity score of at
913913
# least cutoff; best_ratio tracks the best score seen so far
914-
best_ratio, cutoff = 0.74, 0.75
914+
# best_ratio is a tuple storing the best .ratio() seen so far, and
915+
# a measure of how far the indices are from their index range
916+
# midpoints. The latter is used to resolve ratio ties. Favoring
917+
# indices near the midpoints tends to cut the ranges in half. Else,
918+
# if there are many pairs with the best ratio, recursion can grow
919+
# very deep, and runtime becomes cubic. See:
920+
# https://github.com/python/cpython/issues/119105
921+
best_ratio, cutoff = (0.74, 0), 0.75
915922
cruncher = SequenceMatcher(self.charjunk)
916923
eqi, eqj = None, None # 1st indices of equal lines (if any)
917924

918925
# search for the pair that matches best without being identical
919926
# (identical lines must be junk lines, & we don't want to synch up
920927
# on junk -- unless we have to)
928+
amid = (alo + ahi - 1) / 2
929+
bmid = (blo + bhi - 1) / 2
921930
for j in range(blo, bhi):
922931
bj = b[j]
923932
cruncher.set_seq2(bj)
933+
weight_j = - abs(j - bmid)
924934
for i in range(alo, ahi):
925935
ai = a[i]
926936
if ai == bj:
927937
if eqi is None:
928938
eqi, eqj = i, j
929939
continue
930940
cruncher.set_seq1(ai)
941+
# weight is used to balance the recursion by prioritizing
942+
# i and j in the middle of their ranges
943+
weight = weight_j - abs(i - amid)
931944
# computing similarity is expensive, so use the quick
932945
# upper bounds first -- have seen this speed up messy
933946
# compares by a factor of 3.
934947
# note that ratio() is only expensive to compute the first
935948
# time it's called on a sequence pair; the expensive part
936949
# of the computation is cached by cruncher
937-
if cruncher.real_quick_ratio() > best_ratio and \
938-
cruncher.quick_ratio() > best_ratio and \
939-
cruncher.ratio() > best_ratio:
940-
best_ratio, best_i, best_j = cruncher.ratio(), i, j
950+
if (cruncher.real_quick_ratio(), weight) > best_ratio and \
951+
(cruncher.quick_ratio(), weight) > best_ratio and \
952+
(cruncher.ratio(), weight) > best_ratio:
953+
best_ratio, best_i, best_j = (cruncher.ratio(), weight), i, j
954+
best_ratio, _ = best_ratio
941955
if best_ratio < cutoff:
942956
# no non-identical "pretty close" pair
943957
if eqi is None:
+1Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
``difflib.Differ`` is much faster for some cases of diffs where many pairs of lines are equally similar.

0 commit comments

Comments
0 (0)
Morty Proxy This is a proxified and sanitized view of the page, visit original site.