// Copyright 2019 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// Suffix array construction by induced sorting (SAIS).
// See Ge Nong, Sen Zhang, and Wai Hong Chen,
// "Two Efficient Algorithms for Linear Time Suffix Array Construction",
// especially section 3 (https://ieeexplore.ieee.org/document/5582081).
// See also http://zork.net/~st/jottings/sais.html.
//
// With optimizations inspired by Yuta Mori's sais-lite
// (https://sites.google.com/site/yuta256/sais).
//
// And with other new optimizations.

// Many of these functions are parameterized by the sizes of
// the types they operate on. The generator gen.go makes
// copies of these functions for use with other sizes.
// Specifically:
//
//   - A function with a name ending in _8_32 takes []byte and []int32 arguments
//     and is duplicated into _32_32, _8_64, and _64_64 forms.
//     The _32_32 and _64_64 suffixes are shortened to plain _32 and _64.
//     Any lines in the function body that contain the text "byte-only" or "256"
//     are stripped when creating _32_32 and _64_64 forms.
//     (Those lines are typically 8-bit-specific optimizations.)
//
//   - A function with a name ending only in _32 operates on []int32
//     and is duplicated into a _64 form. (Note that it may still take a []byte,
//     but there is no need for a version of the function in which the []byte
//     is widened to a full integer array.)

// The overall runtime of this code is linear in the input size:
// it runs a sequence of linear passes to reduce the problem to
// a subproblem at most half as big, invokes itself recursively,
// and then runs a sequence of linear passes to turn the answer
// for the subproblem into the answer for the original problem.
// This gives T(N) = O(N) + T(N/2) = O(N) + O(N/2) + O(N/4) + ... = O(N).
//
// The outline of the code, with the forward and backward scans
// through O(N)-sized arrays called out, is:
//
//	sais_I_N
//		placeLMS_I_B
//			bucketMax_I_B
//				freq_I_B
//					<scan +text> (1)
//					<scan +freq> (2)
//				<scan -text, random bucket> (3)
//		induceSubL_I_B
//			bucketMin_I_B
//				freq_I_B
//					<scan +text, often optimized away> (4)
//					<scan +freq> (5)
//				<scan +sa, random text, random bucket> (6)
//		induceSubS_I_B
//			bucketMax_I_B
//				freq_I_B
//					<scan +text, often optimized away> (7)
//					<scan +freq> (8)
//				<scan -sa, random text, random bucket> (9)
//		assignID_I_B
//			<scan +sa, random text substrings> (10)
//		map_B
//			<scan -sa> (11)
//		recurse_B
//			(recursive call to sais_B_B for a subproblem of size at most 1/2 input, often much smaller)
//		unmap_I_B
//			<scan -text> (12)
//			<scan +sa> (13)
//		expand_I_B
//			bucketMax_I_B
//				freq_I_B
//					<scan +text, often optimized away> (14)
//					<scan +freq> (15)
//				<scan -sa, random text, random bucket> (16)
//		induceL_I_B
//			bucketMin_I_B
//				freq_I_B
//					<scan +text, often optimized away> (17)
//					<scan +freq> (18)
//				<scan +sa, random text, random bucket> (19)
//		induceS_I_B
//			bucketMax_I_B
//				freq_I_B
//					<scan +text, often optimized away> (20)
//					<scan +freq> (21)
//				<scan -sa, random text, random bucket> (22)
//
// Here, _B indicates the suffix array size (_32 or _64) and _I the input size (_8 or _B).
//
// The outline shows there are in general 22 scans through
// O(N)-sized arrays for a given level of the recursion.
// In the top level, operating on 8-bit input text,
// the six freq scans are fixed size (256) instead of potentially
// input-sized.
// Also, the frequency is counted once and cached
// whenever there is room to do so (there is nearly always room in general,
// and always room at the top level), which eliminates all but
// the first of the freq_I_B text scans (that is, 5 of the 6).
// So the top level of the recursion only does 22 - 6 - 5 = 11
// input-sized scans and a typical level does 16 scans.
//
// The linear scans do not cost anywhere near as much as
// the random accesses to the text made during a few of
// the scans (specifically #6, #9, #16, #19, #22 marked above).
// In real texts, there is some locality to the accesses,
// though not much, due to the repetitive structure of the text
// (the same reason Burrows-Wheeler compression is so effective).
// For random inputs, there is no locality, which makes those
// accesses even more expensive, especially once the text
// no longer fits in cache.
// For example, running on 50 MB of Go source code, induceSubL_8_32
// (which runs only once, at the top level of the recursion)
// takes 0.44s, while on 50 MB of random input, it takes 2.55s.
// Nearly all the relative slowdown is explained by the text access:
//
//	c0, c1 := text[k-1], text[k]
//
// That line runs for 0.23s on the Go text and 2.02s on random text.

//go:generate go run gen.go

package suffixarray
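
// Illustrative sketch, not part of this package: the header comment's outline
// names freq, bucketMin, and bucketMax steps and notes that, for 8-bit text,
// the freq scans are fixed size (256). The standalone program below shows one
// plausible reading of those steps. The function names and signatures here
// are assumptions made for illustration; they are not the package's actual
// freq_8_32, bucketMin_8_32, or bucketMax_8_32.

```go
package main

import "fmt"

// freq counts how often each byte value occurs in text.
// This is the single forward scan over the text ("<scan +text>").
func freq(text []byte) [256]int {
	var f [256]int
	for _, c := range text {
		f[c]++
	}
	return f
}

// bucketMin returns, for each byte value, the index of the first
// suffix-array slot in that byte's bucket. It is a forward scan
// over the 256-entry frequency table ("<scan +freq>").
func bucketMin(f [256]int) [256]int {
	var b [256]int
	total := 0
	for c, n := range f {
		b[c] = total
		total += n
	}
	return b
}

// bucketMax returns, for each byte value, one past the index of the
// last suffix-array slot in that byte's bucket.
func bucketMax(f [256]int) [256]int {
	var b [256]int
	total := 0
	for c, n := range f {
		total += n
		b[c] = total
	}
	return b
}

func main() {
	text := []byte("banana")
	f := freq(text)
	lo, hi := bucketMin(f), bucketMax(f)
	for _, c := range []byte("abn") {
		fmt.Printf("%c: freq=%d bucket=[%d,%d)\n", c, f[c], lo[c], hi[c])
	}
}
```

// In this sketch the frequency table is computed once and passed to both
// bucket functions, which mirrors the caching idea described in the comment:
// the text itself is scanned only the first time.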