# Copyright 2017 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# pylint: disable=invalid-name,missing-docstring

# Ported from Java com.google.gwt.dev.util.editdistance, which is:
# Copyright 2010 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not
# use this file except in compliance with the License. You may obtain a copy of
# the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations under
# the License.

def dist(a, b, limit=None):
    return BerghelRoach(a).getDistance(b, limit or len(a) + len(b))

# This is a modification of the original Berghel-Roach edit
# distance (based on prior work by Ukkonen) described in
# ACM Transactions on Information Systems, Vol. 14, No. 1,
# January 1996, pages 94-106.
#
# I observed that only O(d) prior computations are required
# to compute edit distance.
# Rather than keeping all prior
# f(k,p) results in a matrix, we keep only the two "outer edges"
# in the triangular computation pattern that will be used in
# subsequent rounds. We cannot reconstruct the edit path,
# but many applications do not require that; for them, this
# modification uses less space (and empirically, slightly
# less time).
#
# First, some history behind the algorithm necessary to understand
# Berghel-Roach and our modification...
#
# The traditional algorithm for edit distance uses dynamic programming,
# building a matrix of distances for substrings:
# D[i,j] holds the distance for string1[0..i]=>string2[0..j].
# The matrix is initially populated with the trivial values
# D[0,j]=j and D[i,0]=i; and then expanded with the rule:
# <pre>
#    D[i,j] = min( D[i-1,j]+1,       // insertion
#                  D[i,j-1]+1,       // deletion
#                  (D[i-1,j-1]
#                   + ((string1[i]==string2[j])
#                      ? 0           // match
#                      : 1           // substitution ) ) )
# </pre>
#
# Ukkonen observed that each diagonal of the matrix must increase
# by either 0 or 1 from row to row. If D[i,j] = p, then the
# matching rule requires that D[i+x,j+x] = p for all x
# where string1[i..i+x) matches string2[j..j+x). Ukkonen
# defined a function f(k,p) as the highest row number in which p
# appears on the k-th diagonal (those D[i,j] where k=(i-j), noting
# that k may be negative). The final result of the edit
# distance is the D[n,m] cell, on the (n-m) diagonal; it is
# the value of p for which f(n-m, p) = m.
# The function f can
# also be computed dynamically, according to a simple recursion:
# <pre>
#    f(k,p) {
#      contains_p = max(f(k-1,p-1), f(k,p-1)+1, f(k+1,p-1)+1)
#      while (string1[contains_p] == string2[contains_p + k])
#        contains_p++;
#      return contains_p;
#    }
# </pre>
# The max() expression finds a row where the k-th diagonal must
# contain p by virtue of an edit from the prior, same, or following
# diagonal (corresponding to an insert, substitute, or delete);
# we need not consider more distant diagonals because row-to-row
# and column-to-column changes are at most +/- 1.
#
# The original Ukkonen algorithm computed f(k,p) roughly as
# follows:
# <pre>
#    for (p = 0; ; p++) {
#      compute f(k,p) for all valid k
#      if (f(n-m, p) == m) return p;
#    }
# </pre>
#
# Berghel and Roach observed that many values of f(k,p) are
# computed unnecessarily, and reorganized the computation into
# a just-in-time sequence. In each iteration, we are primarily
# interested in the terminating value f(main,p), where main=(n-m)
# is the main diagonal. To compute that we need f(x,p-1) for
# three values of x: main-1, main, and main+1. Those depend on
# values for p-2, and so forth. We will already have computed
# f(main,p-1) in the prior round, and thus f(main-1,p-2) and
# f(main+1,p-2), and so forth. The only new values we need to compute
# are on the edges: f(main-i,p-i) and f(main+i,p-i). Noting that
# f(k,p) is only meaningful when abs(k) is no greater than p,
# one of the Berghel-Roach reviewers noted that we can compute
# the bounds for i:
# <pre>
#    (main+i ≤ p-i) implies (i ≤ (p-main)/2)
# </pre>
# (where main+i is limited on the positive side) and similarly
# <pre>
#    (-(main-i) ≤ p-i) implies (i ≤ (p+main)/2).
# </pre>
# (where main-i is limited on the negative side).
#
# This reduces the computation sequence to
# <pre>
#    for (i = (p-main)/2; i > 0; i--) compute f(main+i,p-i);
#    for (i = (p+main)/2; i > 0; i--) compute f(main-i,p-i);
#    if (f(main, p) == m) return p;
# </pre>
#
# The original Berghel-Roach algorithm recorded prior values
# of f(k,p) in a matrix, using O(distance^2) space, enabling
# reconstruction of the edit path, but if all we want is the
# edit *distance*, we only need to keep O(distance) prior computations.
#
# The requisite prior k-1, k, and k+1 values are conveniently
# computed in the current round and the two preceding it.
# For example, on the higher-diagonal side, we compute:
# <pre>
#    current[i] = f(main+i, p-i)
# </pre>
# We keep the two prior rounds of results, where p was one and two
# smaller. So, from the preceding round
# <pre>
#    last[i] = f(main+i, (p-1)-i)
# </pre>
# and from the prior round, but one position back:
# <pre>
#    prior[i-1] = f(main+(i-1), (p-2)-(i-1))
# </pre>
# In the current round, one iteration earlier:
# <pre>
#    current[i+1] = f(main+(i+1), p-(i+1))
# </pre>
# Note that the distance in all of these evaluates to p-i-1,
# and the diagonals are (main+i) and its neighbors... just
# what we need. The lower-diagonal side behaves similarly.
#
# We need to materialize values that are not computed in prior
# rounds, for either of two reasons: <ul>
# <li> Initially, we have no prior rounds, so we need to fill
#      all of the "last" and "prior" values for use in the
#      first round. The first round uses only one side
#      of the main diagonal or the other.
# <li> In every other round, we compute one more diagonal than before.
# </ul>
# In all of these cases, the missing f(k,p) values are for abs(k) > p,
# where a real value of f(k,p) is undefined.
# [The original Berghel-Roach
# algorithm prefills its F matrix with these values, but we fill
# them as we go, as needed.] We define
# <pre>
#    f(-p-1,p) = p, so that we start diagonal -p with row p,
#    f(p+1,p) = -1, so that we start diagonal p with row 0.
# </pre>
# [We also allow f(p+2,p)=f(-p-2,p)=-1, causing those values to
# have no effect in the starting row computation.]
#
# We only expand the set of diagonals visited every other round,
# when (p-main) or (p+main) is even. We keep track of even/oddness
# to save some arithmetic. The first round is always even, as p=abs(main).
# Note that we rename the "f" function to "computeRow" to be Googley.

class BerghelRoach(object):
    def __init__(self, pattern):
        # The "pattern" string against which others are compared.
        self.pattern = pattern

        # The current and two preceding sets of Ukkonen f(k,p) values for diagonals
        # around the main, computed by the main loop of {@code getDistance}. These
        # arrays are retained between calls to save allocation costs. (They are all
        # initialized to a real array so that we can indiscriminately use length
        # when ensuring/resizing.)
        self.currentLeft = []
        self.currentRight = []
        self.lastLeft = []
        self.lastRight = []

        self.priorLeft = []
        self.priorRight = []

    def getDistance(self, target, limit):
        # pylint: disable=too-many-branches

        # Compute the main diagonal number.
        # The final result lies on this diagonal.
        main = len(self.pattern) - len(target)
        # Compute our initial distance candidate.
        # The result cannot be less than the difference in
        # string lengths, so we start there.
        distance = abs(main)
        if distance > limit:
            # More than we wanted. Give up right away
            return distance

        # In the main loop below, the current{Right,Left} arrays record results
        # from the current outer loop pass.
        # The last{Right,Left} and
        # prior{Right,Left} arrays hold the results from the preceding two passes.
        # At the end of the outer loop, we shift them around (reusing the prior
        # array as the current for the next round, to avoid reallocating).
        # The Right reflects higher-numbered diagonals, Left lower-numbered.
        # Fill in "prior" values for the first two passes through
        # the distance loop. Note that we will execute only one side of
        # the main diagonal in these passes, so we only need
        # initialize one side of prior values.

        if main <= 0:
            self.ensureCapacityRight(distance, False)
            for j in range(distance):
                self.lastRight[j] = distance - j - 1  # Make diagonal -k start in row k
                self.priorRight[j] = -1
        else:
            self.ensureCapacityLeft(distance, False)
            for j in range(distance):
                self.lastLeft[j] = -1  # Make diagonal +k start in row 0
                self.priorLeft[j] = -1

        # Keep track of even rounds. Only those rounds consider new diagonals,
        # and thus only they require artificial "last" values below.
        even = True

        # MAIN LOOP: try each successive possible distance until one succeeds.
        while True:
            # Before calling computeRow(main, distance), we need to fill in
            # missing cache elements. See the high-level description above.
            # Higher-numbered diagonals
            # (floor division keeps offDiagonal an integer list index)
            offDiagonal = (distance - main) // 2
            self.ensureCapacityRight(offDiagonal, True)

            if even:
                # Higher diagonals start at row 0
                self.lastRight[offDiagonal] = -1

            immediateRight = -1
            while offDiagonal > 0:
                immediateRight = computeRow(
                    (main + offDiagonal),
                    (distance - offDiagonal),
                    self.pattern,
                    target,
                    self.priorRight[offDiagonal-1],
                    self.lastRight[offDiagonal],
                    immediateRight)
                self.currentRight[offDiagonal] = immediateRight
                offDiagonal -= 1
            # Lower-numbered diagonals
            offDiagonal = (distance + main) // 2
            self.ensureCapacityLeft(offDiagonal, True)

            if even:
                # Lower diagonals, fictitious values for f(-x-1,x) = x
                self.lastLeft[offDiagonal] = (distance - main) // 2 - 1

            if even:
                immediateLeft = -1
            else:
                immediateLeft = (distance - main) // 2

            while offDiagonal > 0:
                immediateLeft = computeRow(
                    (main - offDiagonal),
                    (distance - offDiagonal),
                    self.pattern, target,
                    immediateLeft,
                    self.lastLeft[offDiagonal],
                    self.priorLeft[offDiagonal-1])
                self.currentLeft[offDiagonal] = immediateLeft
                offDiagonal -= 1

            # We are done if the main diagonal has distance in the last row.
            mainRow = computeRow(main, distance, self.pattern, target,
                                 immediateLeft, self.lastLeft[0], immediateRight)

            if mainRow == len(target):
                break
            distance += 1
            if distance > limit or distance < 0:
                break

            # The [0] element goes to both sides.
            self.currentRight[0] = mainRow
            self.currentLeft[0] = mainRow

            # Rotate rows around for next round: current=>last=>prior (=>current)
            tmp = self.priorLeft
            self.priorLeft = self.lastLeft
            self.lastLeft = self.currentLeft
            self.currentLeft = tmp

            tmp = self.priorRight
            self.priorRight = self.lastRight
            self.lastRight = self.currentRight
            self.currentRight = tmp

            # Update evenness, too
            even = not even

        return distance

    def ensureCapacityLeft(self, index, cp):
        # Ensures that the Left arrays can be indexed through {@code index},
        # inclusively, resizing (and copying) as necessary.
        if len(self.currentLeft) <= index:
            index += 1
            self.priorLeft = resize(self.priorLeft, index, cp)
            self.lastLeft = resize(self.lastLeft, index, cp)
            self.currentLeft = resize(self.currentLeft, index, False)

    def ensureCapacityRight(self, index, cp):
        # Ensures that the Right arrays can be indexed through {@code index},
        # inclusively, resizing (and copying) as necessary.
        if len(self.currentRight) <= index:
            index += 1
            self.priorRight = resize(self.priorRight, index, cp)
            self.lastRight = resize(self.lastRight, index, cp)
            self.currentRight = resize(self.currentRight, index, False)


# Resize an array, copying old contents if requested
def resize(array, size, cp):
    if cp:
        return array + [0] * (size - len(array))
    return [0] * size

# Computes the highest row in which the distance {@code p} appears
# in diagonal {@code k} of the edit distance computation for
# strings {@code a} and {@code b}. The diagonal number is
# represented by the difference in the indices for the two strings;
# it can range from {@code -b.length()} through {@code a.length()}.
#
# More precisely, this computes the highest value x such that
# <pre>
#     p = edit-distance(a[0:(x+k)), b[0:x)).
# </pre>
#
# This is the "f" function described by Ukkonen.
#
# The caller must ensure that abs(k) ≤ p, the only values for
# which this is well-defined.
#
# The implementation depends on the cached results of prior
# computeRow calls for diagonals k-1, k, and k+1 for distance p-1.
# These must be supplied in {@code knownLeft}, {@code knownAbove},
# and {@code knownRight}, respectively.
# @param k diagonal number
# @param p edit distance
# @param a one string to be compared
# @param b other string to be compared
# @param knownLeft value of {@code computeRow(k-1, p-1, ...)}
# @param knownAbove value of {@code computeRow(k, p-1, ...)}
# @param knownRight value of {@code computeRow(k+1, p-1, ...)}
def computeRow(k, p, a, b,
               knownLeft, knownAbove, knownRight):
    assert abs(k) <= p
    assert p >= 0
    # Compute our starting point using the recurrence.
    # That is, find the first row where the desired edit distance
    # appears in our diagonal. This is at least one past
    # the highest row for distance p-1 on this diagonal.
    if p == 0:
        t = 0
    else:
        # We look at the adjacent diagonals for the next lower edit distance.
        # We can start in the next row after the prior result from
        # our own diagonal (the "substitute" case), or the next diagonal
        # ("delete"), but only the same row as the prior result from
        # the prior diagonal ("insert").
        t = max(max(knownAbove, knownRight)+1, knownLeft)
    # Look down our diagonal for matches to find the maximum
    # row with edit-distance p.
    tmax = min(len(b), len(a)-k)
    while t < tmax and b[t] == a[t+k]:
        t += 1

    return t
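
# A minimal cross-check sketch, not part of the original port. It writes out
# the traditional D[i,j] recurrence from the header comment so that results
# from a faster routine can be compared against it on small inputs. The names
# reference_distance and find_mismatches are illustrative assumptions;
# nothing else in this file uses them.
def reference_distance(a, b):
    # previous[j] holds D[i-1,j]; current[j] holds D[i,j].
    previous = list(range(len(b) + 1))  # D[0,j] = j
    for i in range(1, len(a) + 1):
        current = [i]  # D[i,0] = i
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1  # match or substitution
            current.append(min(previous[j] + 1,          # step down one row
                               current[j - 1] + 1,       # step right one column
                               previous[j - 1] + cost))  # step along the diagonal
        previous = current
    return previous[len(b)]


# Returns the (a, b) pairs for which dist_fn disagrees with the reference.
# dist_fn is any two-argument distance function, for example dist above
# called without a limit.
def find_mismatches(dist_fn, pairs):
    return [(a, b) for a, b in pairs
            if dist_fn(a, b) != reference_distance(a, b)]

# Possible use: find_mismatches(dist, [("kitten", "sitting"), ("", "abc")])
# lists any pairs on which the two computations differ.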