github.com/munnerz/test-infra@v0.0.0-20190108210205-ce3d181dc989/triage/berghelroach.py

# Copyright 2017 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# pylint: disable=invalid-name,missing-docstring

# Ported from Java com.google.gwt.dev.util.editdistance, which is:
# Copyright 2010 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not
# use this file except in compliance with the License. You may obtain a copy of
# the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations under
# the License.

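# Returns the edit distance between a and b.  The search gives up once the
# distance exceeds limit, in which case the returned value is some number
# greater than limit.  A limit of None (or 0) falls back to len(a) + len(b),
# which the distance can never exceed.
def dist(a, b, limit=None):
    return BerghelRoach(a).getDistance(b, limit or len(a) + len(b))
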
# This is a modification of the original Berghel-Roach edit
# distance (based on prior work by Ukkonen) described in
#   ACM Transactions on Information Systems, Vol. 14, No. 1,
#   January 1996, pages 94-106.
#
# I observed that only O(d) prior computations are required
# to compute edit distance.  Rather than keeping all prior
# f(k,p) results in a matrix, we keep only the two "outer edges"
# in the triangular computation pattern that will be used in
# subsequent rounds.  We cannot reconstruct the edit path,
# but many applications do not require that; for them, this
# modification uses less space (and empirically, slightly
# less time).
#
# First, some history behind the algorithm necessary to understand
# Berghel-Roach and our modification...
#
# The traditional algorithm for edit distance uses dynamic programming,
# building a matrix of distances for substrings:
# D[i,j] holds the distance for string1[0..i]=>string2[0..j].
# The matrix is initially populated with the trivial values
# D[0,j]=j and D[i,0]=i; and then expanded with the rule:
# <pre>
#    D[i,j] = min( D[i-1,j]+1,       // insertion
#                  D[i,j-1]+1,       // deletion
#                  D[i-1,j-1]
#                   + ((string1[i]==string2[j])
#                      ? 0           // match
#                      : 1) )        // substitution
# </pre>
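For reference, the recurrence above can be sketched as a small standalone function. The name `dp_edit_distance` is hypothetical and not part of this module; it uses O(n*m) time and space, so it is mainly useful for cross-checking the faster algorithm on small inputs.

```python
def dp_edit_distance(s1, s2):
    """Classic dynamic-programming edit distance (illustrative sketch)."""
    n, m = len(s1), len(s2)
    # D[i][j] is the distance between the first i chars of s1
    # and the first j chars of s2.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # i edits to reach the empty prefix of s2
    for j in range(m + 1):
        D[0][j] = j  # j edits starting from the empty prefix of s1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Prefix lengths are 1-based here, so compare s1[i-1] and s2[j-1].
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # step from the row above
                          D[i][j - 1] + 1,         # step from the column to the left
                          D[i - 1][j - 1] + cost)  # match or substitution
    return D[n][m]
```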
#
# Ukkonen observed that each diagonal of the matrix must increase
# by either 0 or 1 from row to row.  If D[i,j] = p, then the
# matching rule requires that D[i+x,j+x] = p for all x
# where string1[i..i+x) matches string2[j..j+x). Ukkonen
# defined a function f(k,p) as the highest row number in which p
# appears on the k-th diagonal (those D[i,j] where k=(i-j), noting
# that k may be negative).  The final result of the edit
# distance is the D[n,m] cell, on the (n-m) diagonal; it is
# the value of p for which f(n-m, p) = m.  The function f can
# also be computed dynamically, according to a simple recursion:
# <pre>
#    f(k,p) {
#      contains_p = max(f(k-1,p-1), f(k,p-1)+1, f(k+1,p-1)+1)
#      while (string1[contains_p] == string2[contains_p + k])
#        contains_p++;
#      return contains_p;
#    }
# </pre>
# The max() expression finds a row where the k-th diagonal must
# contain p by virtue of an edit from the prior, same, or following
# diagonal (corresponding to an insert, substitute, or delete);
# we need not consider more distant diagonals because row-to-row
# and column-to-column changes are at most +/- 1.
#
# The original Ukkonen algorithm computed f(k,p) roughly as
# follows:
# <pre>
#    for (p = 0; ; p++) {
#      compute f(k,p) for all valid k
#      if (f(n-m, p) == m) return p;
#    }
# </pre>
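The loop above might be written out roughly as follows. This is a sketch under assumptions, not Ukkonen's original code: the name `ukkonen_distance` is hypothetical, the indexing follows the computeRow convention used later in this file (rows index `b`, and diagonal `k` is the offset into `a`), missing neighbours default to -1, and values are clamped at the matrix edge for simplicity.

```python
def ukkonen_distance(a, b):
    """Edit distance by computing f(k, p) for every valid diagonal k."""
    n, m = len(a), len(b)
    main = n - m   # the diagonal that contains the final cell
    prev = {}      # f(k, p-1) for each diagonal k of the previous round
    p = 0
    while True:
        cur = {}
        # Only diagonals with abs(k) <= p that lie inside the matrix.
        for k in range(max(-p, -m), min(p, n) + 1):
            if p == 0:
                t = 0
            else:
                t = max(prev.get(k - 1, -1),      # same row, from diagonal k-1
                        prev.get(k, -1) + 1,      # next row, same diagonal
                        prev.get(k + 1, -1) + 1)  # next row, from diagonal k+1
            tmax = min(m, n - k)  # last row that exists on diagonal k
            t = min(t, tmax)
            # Slide down the diagonal while the characters match.
            while t < tmax and b[t] == a[t + k]:
                t += 1
            cur[k] = t
        if cur.get(main) == m:
            return p
        prev = cur
        p += 1
```

Keeping only the previous round in `prev` is enough here because each f(k,p) depends solely on values for p-1.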
#
# Berghel and Roach observed that many values of f(k,p) are
# computed unnecessarily, and reorganized the computation into
# a just-in-time sequence.  In each iteration, we are primarily
# interested in the terminating value f(main,p), where main=(n-m)
# is the main diagonal.  To compute that we need f(x,p-1) for
# three values of x: main-1, main, and main+1.  Those depend on
# values for p-2, and so forth.  We will already have computed
# f(main,p-1) in the prior round, and thus f(main-1,p-2) and
# f(main+1,p-2), and so forth.  The only new values we need to compute
# are on the edges: f(main-i,p-i) and f(main+i,p-i).  Noting that
# f(k,p) is only meaningful when abs(k) is no greater than p,
# one of the Berghel-Roach reviewers noted that we can compute
# the bounds for i:
# <pre>
#    (main+i <= p-i) implies (i <= (p-main)/2)
# </pre>
# (where main+i is limited on the positive side) and similarly
# <pre>
#    (-(main-i) <= p-i) implies (i <= (p+main)/2).
# </pre>
# (where main-i is limited on the negative side).
#
# This reduces the computation sequence to
# <pre>
#   for (i = (p-main)/2; i > 0; i--) compute f(main+i,p-i);
#   for (i = (p+main)/2; i > 0; i--) compute f(main-i,p-i);
#   if (f(main, p) == m) return p;
# </pre>
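To make the visiting order concrete, the sketch below lists the (k, p) pairs that one round of the sequence above would compute. The helper name `cells_for_round` is hypothetical, and it assumes p >= abs(main), which is where the main loop starts.

```python
def cells_for_round(main, p):
    """List the (diagonal, distance) pairs computed in round p, in order."""
    cells = []
    # Higher-numbered diagonals, outermost first.
    for i in range((p - main) // 2, 0, -1):
        cells.append((main + i, p - i))
    # Lower-numbered diagonals, outermost first.
    for i in range((p + main) // 2, 0, -1):
        cells.append((main - i, p - i))
    # Finally the main diagonal, whose value decides termination.
    cells.append((main, p))
    return cells
```

For example, `cells_for_round(1, 3)` yields `[(2, 2), (-1, 1), (0, 2), (1, 3)]`, and every pair satisfies abs(k) <= p.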
#
# The original Berghel-Roach algorithm recorded prior values
# of f(k,p) in a matrix, using O(distance^2) space, enabling
# reconstruction of the edit path, but if all we want is the
# edit *distance*, we only need to keep O(distance) prior computations.
#
# The requisite prior k-1, k, and k+1 values are conveniently
# computed in the current round and the two preceding it.
# For example, on the higher-diagonal side, we compute:
# <pre>
#    current[i] = f(main+i, p-i)
# </pre>
# We keep the two prior rounds of results, where p was one and two
# smaller.  So, from the preceding round:
# <pre>
#    last[i] = f(main+i, (p-1)-i)
# </pre>
# and from the round before that, but one position back:
# <pre>
#    prior[i-1] = f(main+(i-1), (p-2)-(i-1))
# </pre>
# In the current round, one iteration earlier:
# <pre>
#    current[i+1] = f(main+(i+1), p-(i+1))
# </pre>
# Note that the distance in all of these evaluates to p-i-1,
# and the diagonals are (main+i) and its neighbors... just
# what we need.  The lower-diagonal side behaves similarly.
#
# We need to materialize values that are not computed in prior
# rounds, for either of two reasons: <ul>
#    <li> Initially, we have no prior rounds, so we need to fill
#     all of the "last" and "prior" values for use in the
#     first round.  The first round works on only one side
#     of the main diagonal or the other.
#    <li> In every other round, we compute one more diagonal than before.
# </ul>
# In all of these cases, the missing f(k,p) values are for abs(k) > p,
# where a real value of f(k,p) is undefined.  [The original Berghel-Roach
# algorithm prefills its F matrix with these values, but we fill
# them as we go, as needed.]  We define
# <pre>
#    f(-p-1,p) = p, so that we start diagonal -p with row p,
#    f(p+1,p) = -1, so that we start diagonal p with row 0.
# </pre>
# (We also allow f(p+2,p)=f(-p-2,p)=-1, causing those values to
# have no effect in the starting row computation.)
#
# We only expand the set of diagonals visited every other round,
# when (p-main) or (p+main) is even.  We keep track of even/oddness
# to save some arithmetic.  The first round is always even, as p=abs(main).
# Note that we rename the "f" function to "computeRow" to be Googley.

class BerghelRoach(object):
    def __init__(self, pattern):
        # The "pattern" string against which others are compared.
        self.pattern = pattern

        # The current and two preceding sets of Ukkonen f(k,p) values for diagonals
        # around the main, computed by the main loop of {@code getDistance}.  These
        # arrays are retained between calls to save allocation costs.  (They are all
        # initialized to a real array so that we can indiscriminately use length
        # when ensuring/resizing.)
        self.currentLeft = []
        self.currentRight = []
        self.lastLeft = []
        self.lastRight = []

        self.priorLeft = []
        self.priorRight = []

    def getDistance(self, target, limit):
        # pylint: disable=too-many-branches

        # Compute the main diagonal number.
        # The final result lies on this diagonal.
        main = len(self.pattern) - len(target)
        # Compute our initial distance candidate.
        # The result cannot be less than the difference in
        # string lengths, so we start there.
        distance = abs(main)
        if distance > limit:
            # More than we wanted.  Give up right away.
            return distance

        # In the main loop below, the current{Right,Left} arrays record results
        # from the current outer loop pass.  The last{Right,Left} and
        # prior{Right,Left} arrays hold the results from the preceding two passes.
        # At the end of the outer loop, we shift them around (reusing the prior
        # array as the current for the next round, to avoid reallocating).
        # The Right reflects higher-numbered diagonals, Left lower-numbered.
        # Fill in "prior" values for the first two passes through
        # the distance loop.  Note that we will execute only one side of
        # the main diagonal in these passes, so we only need to
        # initialize one side of prior values.

        if main <= 0:
            self.ensureCapacityRight(distance, False)
            for j in range(distance):
                self.lastRight[j] = distance - j - 1 # Make diagonal -k start in row k
                self.priorRight[j] = -1
        else:
            self.ensureCapacityLeft(distance, False)
            for j in range(distance):
                self.lastLeft[j] = -1 # Make diagonal +k start in row 0
                self.priorLeft[j] = -1

        # Keep track of even rounds.  Only those rounds consider new diagonals,
        # and thus only they require artificial "last" values below.
        even = True

        # MAIN LOOP: try each successive possible distance until one succeeds.
        while True:
            # Before calling computeRow(main, distance), we need to fill in
            # missing cache elements.  See the high-level description above.
            # Higher-numbered diagonals
            # (Floor division keeps the index an int on Python 3 as well.)
            offDiagonal = (distance - main) // 2
            self.ensureCapacityRight(offDiagonal, True)

            if even:
                # Higher diagonals start at row 0
                self.lastRight[offDiagonal] = -1

            immediateRight = -1
            while offDiagonal > 0:
                immediateRight = computeRow(
                    (main + offDiagonal),
                    (distance - offDiagonal),
                    self.pattern,
                    target,
                    self.priorRight[offDiagonal-1],
                    self.lastRight[offDiagonal],
                    immediateRight)
                self.currentRight[offDiagonal] = immediateRight
                offDiagonal -= 1
            # Lower-numbered diagonals
            offDiagonal = (distance + main) // 2
            self.ensureCapacityLeft(offDiagonal, True)

            if even:
                # Lower diagonals, fictitious values for f(-x-1,x) = x
                self.lastLeft[offDiagonal] = (distance - main) // 2 - 1

            if even:
                immediateLeft = -1
            else:
                immediateLeft = (distance - main) // 2

            while offDiagonal > 0:
                immediateLeft = computeRow(
                    (main - offDiagonal),
                    (distance - offDiagonal),
                    self.pattern, target,
                    immediateLeft,
                    self.lastLeft[offDiagonal],
                    self.priorLeft[offDiagonal-1])
                self.currentLeft[offDiagonal] = immediateLeft
                offDiagonal -= 1

            # We are done if the main diagonal has distance in the last row.
            mainRow = computeRow(main, distance, self.pattern, target,
                                 immediateLeft, self.lastLeft[0], immediateRight)

            if mainRow == len(target):
                break
            distance += 1
            if distance > limit or distance < 0:
                break

            # The [0] element goes to both sides.
            self.currentRight[0] = mainRow
            self.currentLeft[0] = mainRow

            # Rotate rows around for next round: current=>last=>prior (=>current)
            tmp = self.priorLeft
            self.priorLeft = self.lastLeft
            self.lastLeft = self.currentLeft
            self.currentLeft = tmp

            tmp = self.priorRight
            self.priorRight = self.lastRight
            self.lastRight = self.currentRight
            self.currentRight = tmp

            # Update evenness, too
            even = not even

        return distance

    def ensureCapacityLeft(self, index, cp):
        # Ensures that the Left arrays can be indexed through {@code index},
        # inclusively, resizing (and copying) as necessary.
        if len(self.currentLeft) <= index:
            index += 1
            self.priorLeft = resize(self.priorLeft, index, cp)
            self.lastLeft = resize(self.lastLeft, index, cp)
            self.currentLeft = resize(self.currentLeft, index, False)

    def ensureCapacityRight(self, index, cp):
        # Ensures that the Right arrays can be indexed through {@code index},
        # inclusively, resizing (and copying) as necessary.
        if len(self.currentRight) <= index:
            index += 1
            self.priorRight = resize(self.priorRight, index, cp)
            self.lastRight = resize(self.lastRight, index, cp)
            self.currentRight = resize(self.currentRight, index, False)


# Resize an array, copying old contents if requested
def resize(array, size, cp):
    if cp:
        return array + [0] * (size - len(array))
    return [0] * size

# Computes the highest row in which the distance {@code p} appears
# in diagonal {@code k} of the edit distance computation for
# strings {@code a} and {@code b}.  The diagonal number is
# represented by the difference in the indices for the two strings;
# it can range from {@code -len(b)} through {@code len(a)}.
#
# More precisely, this computes the highest value x such that
# <pre>
#     p = edit-distance(a[0:x+k], b[0:x]).
# </pre>
#
# This is the "f" function described by Ukkonen.
#
# The caller must ensure that abs(k) <= p, the only values for
# which this is well-defined.
#
# The implementation depends on the cached results of prior
# computeRow calls for diagonals k-1, k, and k+1 for distance p-1.
# These must be supplied in {@code knownLeft}, {@code knownAbove},
# and {@code knownRight}, respectively.
# @param k diagonal number
# @param p edit distance
# @param a one string to be compared
# @param b other string to be compared
# @param knownLeft value of {@code computeRow(k-1, p-1, ...)}
# @param knownAbove value of {@code computeRow(k, p-1, ...)}
# @param knownRight value of {@code computeRow(k+1, p-1, ...)}
def computeRow(k, p, a, b,
               knownLeft, knownAbove, knownRight):
    assert abs(k) <= p
    assert p >= 0
    # Compute our starting point using the recurrence.
    # That is, find the first row where the desired edit distance
    # appears in our diagonal.  This is at least one past
    # the highest row for distance p-1 on the same diagonal.
    if p == 0:
        t = 0
    else:
        # We look at the adjacent diagonals for the next lower edit distance.
        # We can start in the next row after the prior result from
        # our own diagonal (the "substitute" case), or the next diagonal
        # ("delete"), but only the same row as the prior result from
        # the prior diagonal ("insert").
        t = max(max(knownAbove, knownRight)+1, knownLeft)
    # Look down our diagonal for matches to find the maximum
    # row with edit-distance p.
    tmax = min(len(b), len(a)-k)
    while t < tmax and b[t] == a[t+k]:
        t += 1

    return t