github.com/shogo82148/std@v1.22.1-0.20240327122250-4e474527810c/index/suffixarray/sais.go

// Copyright 2019 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// Suffix array construction by induced sorting (SAIS).
// See Ge Nong, Sen Zhang, and Wai Hong Chen,
// "Two Efficient Algorithms for Linear Time Suffix Array Construction",
// especially section 3 (https://ieeexplore.ieee.org/document/5582081).
// See also http://zork.net/~st/jottings/sais.html.
//
// With optimizations inspired by Yuta Mori's sais-lite
// (https://sites.google.com/site/yuta256/sais).
//
// And with other new optimizations.

// Many of these functions are parameterized by the sizes of
// the types they operate on. The generator gen.go makes
// copies of these functions for use with other sizes.
// Specifically:
//
// - A function with a name ending in _8_32 takes []byte and []int32 arguments
//   and is duplicated into _32_32, _8_64, and _64_64 forms.
//   The _32_32 and _64_64 suffixes are shortened to plain _32 and _64.
//   Any lines in the function body that contain the text "byte-only" or "256"
//   are stripped when creating _32_32 and _64_64 forms.
//   (Those lines are typically 8-bit-specific optimizations.)
//
// - A function with a name ending only in _32 operates on []int32
//   and is duplicated into a _64 form. (Note that it may still take a []byte,
//   but there is no need for a version of the function in which the []byte
//   is widened to a full integer array.)
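
// As an illustration of the first rule above, the following standalone sketch
// rewrites the text of a _8_32 function into its _32 form. It is not the real
// gen.go: the helper name widen and the plain string replacement of []byte
// with []int32 are assumptions made for this example only.

```go
package main

import (
	"fmt"
	"strings"
)

// widen is a hypothetical helper, not part of gen.go. It drops every line
// that mentions "byte-only" or "256", renames the _8_32 suffix to _32
// (short for _32_32), and widens []byte to []int32.
func widen(src string) string {
	var out []string
	for _, line := range strings.Split(src, "\n") {
		if strings.Contains(line, "byte-only") || strings.Contains(line, "256") {
			continue // 8-bit-specific line: strip it
		}
		line = strings.ReplaceAll(line, "_8_32", "_32")
		line = strings.ReplaceAll(line, "[]byte", "[]int32")
		out = append(out, line)
	}
	return strings.Join(out, "\n")
}

func main() {
	src := "func count_8_32(text []byte, freq []int32) {\n" +
		"\tfreq = freq[:256] // eliminate bounds check\n" +
		"\tfor _, c := range text {\n" +
		"\t\tfreq[c]++\n" +
		"\t}\n" +
		"}"
	fmt.Println(widen(src))
}
```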

// The overall runtime of this code is linear in the input size:
// it runs a sequence of linear passes to reduce the problem to
// a subproblem at most half as big, invokes itself recursively,
// and then runs a sequence of linear passes to turn the answer
// for the subproblem into the answer for the original problem.
// This gives T(N) = O(N) + T(N/2) = O(N) + O(N/2) + O(N/4) + ... = O(N).
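//
// As a rough numeric check of that bound, the sketch below adds up the work
// under a hypothetical cost model: each level costs n units and recurses on
// exactly half the input (the worst case). The function name totalWork and
// the unit cost per element are assumptions for illustration only.

```go
package main

import "fmt"

// totalWork sums n + n/2 + n/4 + ... down to 1, modeling one set of
// linear passes per recursion level on a subproblem half as big.
func totalWork(n int) int {
	total := 0
	for n > 0 {
		total += n
		n /= 2
	}
	return total
}

func main() {
	for _, n := range []int{1 << 10, 1 << 20, 1000003} {
		w := totalWork(n)
		// The ratio stays below 2, so the total is O(N).
		fmt.Printf("n=%d work=%d ratio=%.3f\n", n, w, float64(w)/float64(n))
	}
}
```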
//
// The outline of the code, with the forward and backward scans
// through O(N)-sized arrays called out, is:
//
// sais_I_B
//	placeLMS_I_B
//		bucketMax_I_B
//			freq_I_B
//				<scan +text> (1)
//			<scan +freq> (2)
//		<scan -text, random bucket> (3)
//	induceSubL_I_B
//		bucketMin_I_B
//			freq_I_B
//				<scan +text, often optimized away> (4)
//			<scan +freq> (5)
//		<scan +sa, random text, random bucket> (6)
//	induceSubS_I_B
//		bucketMax_I_B
//			freq_I_B
//				<scan +text, often optimized away> (7)
//			<scan +freq> (8)
//		<scan -sa, random text, random bucket> (9)
//	assignID_I_B
//		<scan +sa, random text substrings> (10)
//	map_B
//		<scan -sa> (11)
//	recurse_B
//		(recursive call to sais_B_B for a subproblem of size at most 1/2 input, often much smaller)
//	unmap_I_B
//		<scan -text> (12)
//		<scan +sa> (13)
//	expand_I_B
//		bucketMax_I_B
//			freq_I_B
//				<scan +text, often optimized away> (14)
//			<scan +freq> (15)
//		<scan -sa, random text, random bucket> (16)
//	induceL_I_B
//		bucketMin_I_B
//			freq_I_B
//				<scan +text, often optimized away> (17)
//			<scan +freq> (18)
//		<scan +sa, random text, random bucket> (19)
//	induceS_I_B
//		bucketMax_I_B
//			freq_I_B
//				<scan +text, often optimized away> (20)
//			<scan +freq> (21)
//		<scan -sa, random text, random bucket> (22)
//
// Here, _B indicates the suffix array size (_32 or _64) and _I the input size (_8 or _B).
//
// The outline shows there are in general 22 scans through
// O(N)-sized arrays for a given level of the recursion.
// In the top level, operating on 8-bit input text,
// the six freq scans are fixed size (256) instead of potentially
// input-sized. Also, the frequency is counted once and cached
// whenever there is room to do so (there is nearly always room in general,
// and always room at the top level), which eliminates all but
// the first freq_I_B text scans (that is, 5 of the 6).
// So the top level of the recursion only does 22 - 6 - 5 = 11
// input-sized scans and a typical level does 16 scans.
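//
// The freq and bucket scans named in the outline can be sketched as follows.
// This is a simplified illustration, not this package's code: freq, bucketMin,
// and bucketMax here are hypothetical stand-ins using fixed 256-entry arrays,
// and the sketch assumes bucketMin means the first slot of each byte's bucket
// and bucketMax the index one past its last slot. Counting once and reusing
// the result for both bucket computations mirrors the caching described above.

```go
package main

import "fmt"

// freq counts how often each byte value occurs: one forward scan of text.
func freq(text []byte) [256]int32 {
	var f [256]int32
	for _, c := range text {
		f[c]++
	}
	return f
}

// bucketMin returns, for each byte value, the index where its bucket starts:
// one forward scan of the fixed-size frequency table.
func bucketMin(f [256]int32) [256]int32 {
	var b [256]int32
	total := int32(0)
	for c, n := range f {
		b[c] = total
		total += n
	}
	return b
}

// bucketMax returns, for each byte value, the index one past the end of
// its bucket: again one forward scan of the frequency table.
func bucketMax(f [256]int32) [256]int32 {
	var b [256]int32
	total := int32(0)
	for c, n := range f {
		total += n
		b[c] = total
	}
	return b
}

func main() {
	f := freq([]byte("banana")) // counted once, reused below
	lo, hi := bucketMin(f), bucketMax(f)
	for _, c := range []byte("abn") {
		fmt.Printf("%c: [%d, %d)\n", c, lo[c], hi[c])
	}
}
```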
//
// The linear scans do not cost anywhere near as much as
// the random accesses to the text made during a few of
// the scans (specifically #6, #9, #16, #19, #22 marked above).
// In real texts, there is some locality to the accesses,
// though not much, due to the repetitive structure of the text
// (the same reason Burrows-Wheeler compression is so effective).
// For random inputs, there is no locality, which makes those
// accesses even more expensive, especially once the text
// no longer fits in cache.
// For example, running on 50 MB of Go source code, induceSubL_8_32
// (which runs only once, at the top level of the recursion)
// takes 0.44s, while on 50 MB of random input, it takes 2.55s.
// Nearly all the relative slowdown is explained by the text access:
//
//		c0, c1 := text[k-1], text[k]
//
// That line runs for 0.23s on the Go text and 2.02s on random text.
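
// To experiment with this effect, one could use a sketch like the one below.
// It reads text[k-1] and text[k] at positions taken either in sequential or in
// shuffled order. The helper sumPairs, the input size, and the fixed seed are
// assumptions for illustration; the sketch only checks that both orders do the
// same work and makes no timing claims. Timing each call (for example with a
// testing.B benchmark) is left to the reader.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sumPairs visits text in the order given by order and reads the byte pair
// text[k-1], text[k] at each position k. Every k must satisfy 1 <= k < len(text).
func sumPairs(text []byte, order []int32) uint64 {
	var sum uint64
	for _, k := range order {
		c0, c1 := text[k-1], text[k]
		sum += uint64(c0) + uint64(c1)
	}
	return sum
}

func main() {
	const n = 1 << 20
	rng := rand.New(rand.NewSource(1))

	text := make([]byte, n)
	for i := range text {
		text[i] = byte(rng.Intn(256))
	}

	// Sequential order: k = 1, 2, ..., n-1.
	seq := make([]int32, n-1)
	for i := range seq {
		seq[i] = int32(i + 1)
	}

	// The same positions, visited in random order.
	shuffled := append([]int32(nil), seq...)
	rng.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})

	// Both orders touch the same pairs, so the checksums agree.
	fmt.Println(sumPairs(text, seq) == sumPairs(text, shuffled))
}
```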

//go:generate go run gen.go

package suffixarray