Class SequenceMatcher

Lib/difflib.py:45–664

SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching".

Source from the content-addressed store, hash-verified

class SequenceMatcher:

    """
    SequenceMatcher is a flexible class for comparing pairs of sequences of
    any type, so long as the sequence elements are hashable.  The basic
    algorithm predates, and is a little fancier than, an algorithm
    published in the late 1980's by Ratcliff and Obershelp under the
    hyperbolic name "gestalt pattern matching".  The basic idea is to find
    the longest contiguous matching subsequence that contains no "junk"
    elements (R-O doesn't address junk).  The same idea is then applied
    recursively to the pieces of the sequences to the left and to the right
    of the matching subsequence.  This does not yield minimal edit
    sequences, but does tend to yield matches that "look right" to people.

    SequenceMatcher tries to compute a "human-friendly diff" between two
    sequences.  Unlike e.g. UNIX(tm) diff, the fundamental notion is the
    longest contiguous & junk-free matching subsequence.  That's what
    catches peoples' eyes.  The Windows(tm) windiff has another interesting
    notion, pairing up elements that appear uniquely in each sequence.
    That, and the method here, appear to yield more intuitive difference
    reports than does diff.  This method appears to be the least vulnerable
    to syncing up on blocks of "junk lines", though (like blank lines in
    ordinary text files, or maybe "<P>" lines in HTML files).  That may be
    because this is the only method of the 3 that has a concept of
    "junk" <wink>.

    Example, comparing two strings, and considering blanks to be "junk":

    >>> s = SequenceMatcher(lambda x: x == " ",
    ...                     "private Thread currentThread;",
    ...                     "private volatile Thread currentThread;")
    >>>

    .ratio() returns a float in [0, 1], measuring the "similarity" of the
    sequences.  As a rule of thumb, a .ratio() value over 0.6 means the
    sequences are close matches:

    >>> print(round(s.ratio(), 2))
    0.87
    >>>

    If you're only interested in where the sequences match,
    .get_matching_blocks() is handy:

    >>> for block in s.get_matching_blocks():
    ...     print("a[%d] and b[%d] match for %d elements" % block)
    a[0] and b[0] match for 8 elements
    a[8] and b[17] match for 21 elements
    a[29] and b[38] match for 0 elements

    Note that the last tuple returned by .get_matching_blocks() is always a
    dummy, (len(a), len(b), 0), and this is the only case in which the last
    tuple element (number of elements matched) is 0.

    If you want to know how to change the first sequence into the second,
    use .get_opcodes():

    >>> for opcode in s.get_opcodes():
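
The listing above is cut off partway through the docstring, so the following is only a minimal sketch of how the three methods it mentions fit together, not an excerpt from difflib. It reuses the two strings from the docstring example. The helper `apply_opcodes` and the variable `manual_ratio` are hypothetical names introduced here for illustration, and the `2.0 * M / T` relation is taken from the difflib documentation rather than from the excerpt.

```python
from difflib import SequenceMatcher

a = "private Thread currentThread;"
b = "private volatile Thread currentThread;"

# Treat blanks as junk, as in the docstring example.
s = SequenceMatcher(lambda x: x == " ", a, b)

# Similarity in [0, 1]; the docstring reports 0.87 for this pair.
print(round(s.ratio(), 2))

# Matching blocks are (i, j, n) triples: a[i:i+n] == b[j:j+n].
# The final triple is the (len(a), len(b), 0) sentinel.
blocks = s.get_matching_blocks()
for i, j, n in blocks:
    print(f"a[{i}] and b[{j}] match for {n} elements")

# ratio() is documented as 2.0 * M / T, where M is the number of
# matched elements and T is the combined length of both sequences.
matched = sum(n for _, _, n in blocks)
manual_ratio = 2.0 * matched / (len(a) + len(b))


def apply_opcodes(a, b, opcodes):
    """Hypothetical helper: rebuild b by replaying the opcodes over a."""
    out = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            out.append(a[i1:i2])   # unchanged span, copied from a
        elif tag in ("replace", "insert"):
            out.append(b[j1:j2])   # new material, taken from b
        # "delete" contributes nothing to the result
    return "".join(out)


# Each opcode is (tag, i1, i2, j1, j2), describing how a[i1:i2]
# relates to b[j1:j2].
for tag, i1, i2, j1, j2 in s.get_opcodes():
    print(f"{tag:7} a[{i1}:{i2}] b[{j1}:{j2}]")

rebuilt = apply_opcodes(a, b, s.get_opcodes())
print(rebuilt == b)
```

Replaying the opcodes is a convenient sanity check: because they describe how to turn the first sequence into the second, the rebuilt string should equal `b` whatever junk function is used.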

Callers (5)

get_close_matches · Function · 0.85
compare · Method · 0.85
_fancy_replace · Method · 0.85
unified_diff · Function · 0.85
context_diff · Function · 0.85
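
Two of the callers listed above, `get_close_matches` and `unified_diff`, are public difflib functions, so the class is often reached through them rather than instantiated directly. The sketch below is illustrative only: the word list, the sample lines, and the file names `old.txt` and `new.txt` are made up for the example.

```python
from difflib import get_close_matches, unified_diff

# get_close_matches ranks candidates by similarity to the target word.
# By default it keeps at most 3 results that score at least 0.6.
words = ["ape", "apple", "peach", "puppy"]
close = get_close_matches("appel", words)
print(close)

# unified_diff compares two lists of lines and yields diff lines.
# The file names are hypothetical labels used only in the header.
old = ["one\n", "two\n", "three\n"]
new = ["one\n", "2\n", "three\n"]
diff = list(unified_diff(old, new, fromfile="old.txt", tofile="new.txt"))
print("".join(diff), end="")
```

The default cutoff of 0.6 in `get_close_matches` is the same rule-of-thumb threshold the docstring gives for "close matches".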

Calls

no outgoing calls

Tested by

no test coverage detected
