SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic
| 43 | return 1.0 |
| 44 | |
| 45 | class SequenceMatcher: |
| 46 | |
| 47 | """ |
| 48 | SequenceMatcher is a flexible class for comparing pairs of sequences of |
| 49 | any type, so long as the sequence elements are hashable. The basic |
| 50 | algorithm predates, and is a little fancier than, an algorithm |
| 51 | published in the late 1980's by Ratcliff and Obershelp under the |
| 52 | hyperbolic name "gestalt pattern matching". The basic idea is to find |
| 53 | the longest contiguous matching subsequence that contains no "junk" |
| 54 | elements (R-O doesn't address junk). The same idea is then applied |
| 55 | recursively to the pieces of the sequences to the left and to the right |
| 56 | of the matching subsequence. This does not yield minimal edit |
| 57 | sequences, but does tend to yield matches that "look right" to people. |
| 58 | |
| 59 | SequenceMatcher tries to compute a "human-friendly diff" between two |
| 60 | sequences. Unlike e.g. UNIX(tm) diff, the fundamental notion is the |
| 61 | longest *contiguous* & junk-free matching subsequence. That's what |
| 62 | catches peoples' eyes. The Windows(tm) windiff has another interesting |
| 63 | notion, pairing up elements that appear uniquely in each sequence. |
| 64 | That, and the method here, appear to yield more intuitive difference |
| 65 | reports than does diff. This method appears to be the least vulnerable |
| 66 | to syncing up on blocks of "junk lines", though (like blank lines in |
| 67 | ordinary text files, or maybe "<P>" lines in HTML files). That may be |
| 68 | because this is the only method of the 3 that has a *concept* of |
| 69 | "junk" <wink>. |
| 70 | |
| 71 | Example, comparing two strings, and considering blanks to be "junk": |
| 72 | |
| 73 | >>> s = SequenceMatcher(lambda x: x == " ", |
| 74 | ... "private Thread currentThread;", |
| 75 | ... "private volatile Thread currentThread;") |
| 76 | >>> |
| 77 | |
| 78 | .ratio() returns a float in [0, 1], measuring the "similarity" of the |
| 79 | sequences. As a rule of thumb, a .ratio() value over 0.6 means the |
| 80 | sequences are close matches: |
| 81 | |
| 82 | >>> print(round(s.ratio(), 2)) |
| 83 | 0.87 |
| 84 | >>> |
| 85 | |
| 86 | If you're only interested in where the sequences match, |
| 87 | .get_matching_blocks() is handy: |
| 88 | |
| 89 | >>> for block in s.get_matching_blocks(): |
| 90 | ... print("a[%d] and b[%d] match for %d elements" % block) |
| 91 | a[0] and b[0] match for 8 elements |
| 92 | a[8] and b[17] match for 21 elements |
| 93 | a[29] and b[38] match for 0 elements |
| 94 | |
| 95 | Note that the last tuple returned by .get_matching_blocks() is always a |
| 96 | dummy, (len(a), len(b), 0), and this is the only case in which the last |
| 97 | tuple element (number of elements matched) is 0. |
| 98 | |
| 99 | If you want to know how to change the first sequence into the second, |
| 100 | use .get_opcodes(): |
| 101 | |
| 102 | >>> for opcode in s.get_opcodes(): |
no outgoing calls
no test coverage detected
searching dependent graphs…