1 | // Copyright 2022 The Go Authors. All rights reserved. |
---|---|
2 | // Use of this source code is governed by a BSD-style |
3 | // license that can be found in the LICENSE file. |
4 | |
5 | // package lcs contains code to find longest-common-subsequences |
6 | // (and diffs) |
7 | package lcs |
8 | |
9 | /* |
10 | Compute longest-common-subsequences of two slices A, B using |
11 | algorithms from Myers' paper. A longest-common-subsequence |
12 | (LCS from now on) of A and B is a maximal set of lexically increasing |
13 | pairs of subscripts (x,y) with A[x]==B[y]. There may be many LCS, but |
14 | they all have the same length. An LCS determines a sequence of edits |
15 | that changes A into B. |
16 | |
17 | The key concept is the edit graph of A and B. |
18 | If A has length N and B has length M, then the edit graph has |
19 | vertices v[i][j] for 0 <= i <= N, 0 <= j <= M. There is a |
20 | horizontal edge from v[i][j] to v[i+1][j] whenever both are in |
21 | the graph, and a vertical edge from v[i][j] to f[i][j+1] similarly. |
22 | When A[i] == B[j] there is a diagonal edge from v[i][j] to v[i+1][j+1]. |
23 | |
24 | A path between in the graph between (0,0) and (N,M) determines a sequence |
25 | of edits converting A into B: each horizontal edge corresponds to removing |
26 | an element of A, and each vertical edge corresponds to inserting an |
27 | element of B. |
28 | |
29 | A vertex (x,y) is on (forward) diagonal k if x-y=k. A path in the graph |
30 | is of length D if it has D non-diagonal edges. The algorithms generate |
31 | forward paths (in which at least one of x,y increases at each edge), |
32 | or backward paths (in which at least one of x,y decreases at each edge), |
33 | or a combination. (Note that the orientation is the traditional mathematical one, |
34 | with the origin in the lower-left corner.) |
35 | |
36 | Here is the edit graph for A:"aabbaa", B:"aacaba". (I know the diagonals look weird.) |
37 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
38 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
39 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
40 | b | | | ___/‾‾‾ | ___/‾‾‾ | | | |
41 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
42 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
43 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
44 | c | | | | | | | |
45 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
46 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
47 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
48 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
49 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
50 | a a b b a a |
51 | |
52 | |
53 | The algorithm labels a vertex (x,y) with D,k if it is on diagonal k and at |
54 | the end of a maximal path of length D. (Because x-y=k it suffices to remember |
55 | only the x coordinate of the vertex.) |
56 | |
57 | The forward algorithm: Find the longest diagonal starting at (0,0) and |
58 | label its end with D=0,k=0. From that vertex take a vertical step and |
59 | then follow the longest diagonal (up and to the right), and label that vertex |
60 | with D=1,k=-1. From the D=0,k=0 point take a horizontal step and the follow |
61 | the longest diagonal (up and to the right) and label that vertex |
62 | D=1,k=1. In the same way, having labelled all the D vertices, |
63 | from a vertex labelled D,k find two vertices |
64 | tentatively labelled D+1,k-1 and D+1,k+1. There may be two on the same |
65 | diagonal, in which case take the one with the larger x. |
66 | |
67 | Eventually the path gets to (N,M), and the diagonals on it are the LCS. |
68 | |
69 | Here is the edit graph with the ends of D-paths labelled. (So, for instance, |
70 | 0/2,2 indicates that x=2,y=2 is labelled with 0, as it should be, since the first |
71 | step is to go up the longest diagonal from (0,0).) |
72 | A:"aabbaa", B:"aacaba" |
73 | ⊙ ------- ⊙ ------- ⊙ -------(3/3,6)------- ⊙ -------(3/5,6)-------(4/6,6) |
74 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
75 | ⊙ ------- ⊙ ------- ⊙ -------(2/3,5)------- ⊙ ------- ⊙ ------- ⊙ |
76 | b | | | ___/‾‾‾ | ___/‾‾‾ | | | |
77 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ -------(3/5,4)------- ⊙ |
78 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
79 | ⊙ ------- ⊙ -------(1/2,3)-------(2/3,3)------- ⊙ ------- ⊙ ------- ⊙ |
80 | c | | | | | | | |
81 | ⊙ ------- ⊙ -------(0/2,2)-------(1/3,2)-------(2/4,2)-------(3/5,2)-------(4/6,2) |
82 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
83 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
84 | a | ___/‾‾‾ | ___/‾‾‾ | | | ___/‾‾‾ | ___/‾‾‾ | |
85 | ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ ------- ⊙ |
86 | a a b b a a |
87 | |
88 | The 4-path is reconstructed starting at (4/6,6), horizontal to (3/5,6), diagonal to (3,4), vertical |
89 | to (2/3,3), horizontal to (1/2,3), vertical to (0/2,2), and diagonal to (0,0). As expected, |
90 | there are 4 non-diagonal steps, and the diagonals form an LCS. |
91 | |
92 | There is a symmetric backward algorithm, which gives (backwards labels are prefixed with a colon): |
93 | A:"aabbaa", B:"aacaba" |
94 | ⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ |
95 | a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ | |
96 | ⊙ -------- ⊙ -------- ⊙ -------- ⊙ -------- ⊙ --------(:0/5,5)-------- ⊙ |
97 | b | | | ____/‾‾‾ | ____/‾‾‾ | | | |
98 | ⊙ -------- ⊙ -------- ⊙ --------(:1/3,4)-------- ⊙ -------- ⊙ -------- ⊙ |
99 | a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ | |
100 | (:3/0,3)--------(:2/1,3)-------- ⊙ --------(:2/3,3)--------(:1/4,3)-------- ⊙ -------- ⊙ |
101 | c | | | | | | | |
102 | ⊙ -------- ⊙ -------- ⊙ --------(:3/3,2)--------(:2/4,2)-------- ⊙ -------- ⊙ |
103 | a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ | |
104 | (:3/0,1)-------- ⊙ -------- ⊙ -------- ⊙ --------(:3/4,1)-------- ⊙ -------- ⊙ |
105 | a | ____/‾‾‾ | ____/‾‾‾ | | | ____/‾‾‾ | ____/‾‾‾ | |
106 | (:4/0,0)-------- ⊙ -------- ⊙ -------- ⊙ --------(:4/4,0)-------- ⊙ -------- ⊙ |
107 | a a b b a a |
108 | |
109 | Neither of these is ideal for use in an editor, where it is undesirable to send very long diffs to the |
110 | front end. It's tricky to decide exactly what 'very long diffs' means, as "replace A by B" is very short. |
111 | We want to control how big D can be, by stopping when it gets too large. The forward algorithm then |
112 | privileges common prefixes, and the backward algorithm privileges common suffixes. Either is an undesirable |
113 | asymmetry. |
114 | |
115 | Fortunately there is a two-sided algorithm, implied by results in Myers' paper. Here's what the labels in |
116 | the edit graph look like. |
117 | A:"aabbaa", B:"aacaba" |
118 | ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ |
119 | a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ | |
120 | ⊙ --------- ⊙ --------- ⊙ --------- (2/3,5) --------- ⊙ --------- (:0/5,5)--------- ⊙ |
121 | b | | | ____/‾‾‾‾ | ____/‾‾‾‾ | | | |
122 | ⊙ --------- ⊙ --------- ⊙ --------- (:1/3,4)--------- ⊙ --------- ⊙ --------- ⊙ |
123 | a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ | |
124 | ⊙ --------- (:2/1,3)--------- (1/2,3) ---------(2:2/3,3)--------- (:1/4,3)--------- ⊙ --------- ⊙ |
125 | c | | | | | | | |
126 | ⊙ --------- ⊙ --------- (0/2,2) --------- (1/3,2) ---------(2:2/4,2)--------- ⊙ --------- ⊙ |
127 | a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ | |
128 | ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ |
129 | a | ____/‾‾‾‾ | ____/‾‾‾‾ | | | ____/‾‾‾‾ | ____/‾‾‾‾ | |
130 | ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ --------- ⊙ |
131 | a a b b a a |
132 | |
133 | The algorithm stopped when it saw the backwards 2-path ending at (1,3) and the forwards 2-path ending at (3,5). The criterion |
134 | is a backwards path ending at (u,v) and a forward path ending at (x,y), where u <= x and the two points are on the same |
135 | diagonal. (Here the edgegraph has a diagonal, but the criterion is x-y=u-v.) Myers proves there is a forward |
136 | 2-path from (0,0) to (1,3), and that together with the backwards 2-path ending at (1,3) gives the expected 4-path. |
137 | Unfortunately the forward path has to be constructed by another run of the forward algorithm; it can't be found from the |
138 | computed labels. That is the worst case. Had the code noticed (x,y)=(u,v)=(3,3) the whole path could be reconstructed |
139 | from the edgegraph. The implementation looks for a number of special cases to try to avoid computing an extra forward path. |
140 | |
141 | If the two-sided algorithm has stop early (because D has become too large) it will have found a forward LCS and a |
142 | backwards LCS. Ideally these go with disjoint prefixes and suffixes of A and B, but disjointness may fail and the two |
143 | computed LCS may conflict. (An easy example is where A is a suffix of B, and shares a short prefix. The backwards LCS |
144 | is all of A, and the forward LCS is a prefix of A.) The algorithm combines the two |
145 | to form a best-effort LCS. In the worst case the forward partial LCS may have to |
146 | be recomputed. |
147 | */ |
148 | |
149 | /* Eugene Myers paper is titled |
150 | "An O(ND) Difference Algorithm and Its Variations" |
151 | and can be found at |
152 | http://www.xmailserver.org/diff2.pdf |
153 | |
154 | (There is a generic implementation of the algorithm the the repository with git hash |
155 | b9ad7e4ade3a686d608e44475390ad428e60e7fc) |
156 | */ |
157 |
Members