Text normalization in Go
26 Nov 2013
Tags: strings, bytes, runes, characters

Marcel van Lohuizen

* Introduction

An earlier [[http://blog.golang.org/strings][post]] talked about strings, bytes
and characters in Go. I've been working on various packages for multilingual
text processing for the go.text repository. Several of these packages deserve a
separate blog post, but today I want to focus on
[[http://godoc.org/code.google.com/p/go.text/unicode/norm][go.text/unicode/norm]],
which handles normalization, a topic touched on in the
[[http://blog.golang.org/strings][strings article]] and the subject of this
post. Normalization works at a higher level of abstraction than raw bytes.

To learn pretty much everything you ever wanted to know about normalization
(and then some), [[http://unicode.org/reports/tr15/][Annex 15 of the Unicode Standard]]
is a good read. A more approachable article is the corresponding
[[http://en.wikipedia.org/wiki/Unicode_equivalence][Wikipedia page]]. Here we
focus on how normalization relates to Go.

* What is normalization?

There are often several ways to represent the same string. For example, an é
(e-acute) can be represented in a string as a single rune ("\u00e9") or an 'e'
followed by an acute accent ("e\u0301"). According to the Unicode standard,
these two are "canonically equivalent" and should be treated as equal.

Using a byte-to-byte comparison to determine equality would clearly not give
the right result for these two strings. Unicode defines a set of normal forms
such that if two strings are canonically equivalent and are normalized to the
same normal form, their byte representations are the same.

Unicode also defines a "compatibility equivalence" to equate characters that
represent the same abstract character but may have a different visual
appearance. For example, the superscript digit '⁹' and the regular digit '9'
are equivalent in this form.

For each of these two equivalences, Unicode defines a composing and decomposing
form. The former replaces runes that can combine into a single rune with this
single rune. The latter breaks runes apart into their components. This table
shows the names, all starting with NF, by which the Unicode Consortium
identifies these forms:

.html normalization/table1.html
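
To make the equivalence concrete, here is a minimal sketch, using the
golang.org/x/text/unicode/norm import path that appears again later in this
post, showing that the two encodings of 'é' compare unequal byte for byte yet
become identical once brought into the same normal form:

    package main

    import (
        "fmt"

        "golang.org/x/text/unicode/norm"
    )

    func main() {
        composed := "\u00e9"    // 'é' as a single rune
        decomposed := "e\u0301" // 'e' followed by a combining acute accent

        fmt.Println(composed == decomposed)                  // false: the bytes differ
        fmt.Println(norm.NFC.String(decomposed) == composed) // true: same NFC form
        fmt.Println(norm.NFD.String(composed) == decomposed) // true: same NFD form
    }
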
* Go's approach to normalization

As mentioned in the strings blog post, Go does not guarantee that characters in
a string are normalized. However, the go.text packages can compensate. For
example, the
[[http://godoc.org/code.google.com/p/go.text/collate][collate]] package, which
can sort strings in a language-specific way, works correctly even with
unnormalized strings. The packages in go.text do not always require normalized
input, but in general normalization may be necessary for consistent results.

Normalization isn't free but it is fast, particularly for collation and
searching or if a string is either in NFD or in NFC and can be converted to NFD
by decomposing without reordering its bytes. In practice,
[[http://www.macchiato.com/unicode/nfc-faq#TOC-How-much-text-is-already-NFC-][99.98%]] of
the web's HTML page content is in NFC form (not counting markup, in which case
it would be more). By far most NFC can be decomposed to NFD without the need
for reordering (which requires allocation). Also, it is efficient to detect
when reordering is necessary, so we can save time by doing it only for the rare
segments that need it.

To make things even better, the collation package typically does not use the
norm package directly, but instead uses the norm package to interleave
normalization information with its own tables. Interleaving the two problems
allows for reordering and normalization on the fly with almost no impact on
performance. The cost of on-the-fly normalization is compensated by not having
to normalize text beforehand and ensuring that the normal form is maintained
upon edits. The latter can be tricky. For instance, the result of concatenating
two NFC-normalized strings is not guaranteed to be in NFC.

Of course, we can also avoid the overhead outright if we know in advance that a
string is already normalized, which is often the case.

* Why bother?

After all this discussion about avoiding normalization, you might ask why it's
worth worrying about at all. The reason is that there are cases where
normalization is required and it is important to understand what those are, and
in turn how to do it correctly.

Before discussing those, we must first clarify the concept of 'character'.

* What is a character?

As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301"
in NFD). Together these two runes are one character. The definition of a
character may vary depending on the application. For normalization we will
define it as a sequence of runes that starts with a starter, a rune that does
not modify or combine backwards with any other rune, followed by a possibly
empty sequence of non-starters, that is, runes that do (typically accents). The
normalization algorithm processes one character at a time.

Theoretically, there is no bound to the number of runes that can make up a
Unicode character. In fact, there are no restrictions on the number of
modifiers that can follow a character and a modifier may be repeated, or
stacked. Ever seen an 'e' with three acutes? Here you go: 'é́́'. That is a
perfectly valid 4-rune character according to the standard.

As a consequence, even at the lowest level, text needs to be processed in
increments of unbounded chunk sizes. This is especially awkward with a
streaming approach to text processing, as used by Go's standard Reader and
Writer interfaces, as that model potentially requires any intermediate buffers
to have unbounded size as well. Also, a straightforward implementation of
normalization will have an O(n²) running time.

There are really no meaningful interpretations for such large sequences of
modifiers for practical applications. Unicode defines a Stream-Safe Text
format, which allows capping the number of modifiers (non-starters) to at most
30, more than enough for any practical purpose. Subsequent modifiers will be
placed after a freshly inserted Combining Grapheme Joiner (CGJ or U+034F). Go
adopts this approach for all normalization algorithms. This decision gives up a
little conformance but gains a little safety.
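
To see these boundaries in practice, here is a small sketch, assuming nothing
beyond the form methods of the norm package, that splits a decomposed string
into characters with `NextBoundaryInString`; the 'e' and its accent come out
as a single segment:

    package main

    import (
        "fmt"

        "golang.org/x/text/unicode/norm"
    )

    func main() {
        // "café" in NFD: five runes, but only four characters.
        s := "cafe\u0301"
        for len(s) > 0 {
            n := norm.NFD.NextBoundaryInString(s, true) // index of the next boundary
            fmt.Printf("%+q\n", s[:n])                  // "c", "a", "f", "e\u0301"
            s = s[n:]
        }
    }
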
* Writing in normal form

Even if you don't need to normalize text within your Go code, you might still
want to do so when communicating to the outside world. For example, normalizing
to NFC might compact your text, making it cheaper to send down a wire. For some
languages, like Korean, the savings can be substantial. Also, some external
APIs might expect text in a certain normal form. Or you might just want to fit
in and output your text as NFC like the rest of the world.

To write your text as NFC, use the
[[http://godoc.org/code.google.com/p/go.text/unicode/norm][unicode/norm]] package
to wrap your `io.Writer` of choice:

    wc := norm.NFC.Writer(w)
    defer wc.Close()
    // write as before...

If you have a small string and want to do a quick conversion, you can use this
simpler form:

    norm.NFC.Bytes(b)

Package norm provides various other methods for normalizing text.
Pick the one that suits your needs best.

* Catching look-alikes

Can you tell the difference between 'K' ("\u004B") and 'K' (Kelvin sign
"\u212A") or 'Ω' ("\u03a9") and 'Ω' (Ohm sign "\u2126")? It is easy to overlook
the sometimes minute differences between variants of the same underlying
character. It is generally a good idea to disallow such variants in identifiers
or anything where deceiving users with such look-alikes can pose a security
hazard.

The compatibility normal forms, NFKC and NFKD, will map many visually nearly
identical forms to a single value. Note that they will not do so when two
symbols look alike but are really from two different alphabets. For example the
Latin 'o', Greek 'ο', and Cyrillic 'о' are still different characters as
defined by these forms.

* Correct text modifications

The norm package might also come to the rescue when one needs to modify text.
Consider a case where you want to search and replace the word "cafe" with its
plural form "cafes". A code snippet could look like this.

    s := "We went to eat at multiple cafe"
    cafe := "cafe"
    if p := strings.Index(s, cafe); p != -1 {
        p += len(cafe)
        s = s[:p] + "s" + s[p:]
    }
    fmt.Println(s)

This prints "We went to eat at multiple cafes" as desired and expected. Now
consider our text contains the French spelling "café" in NFD form:

    s := "We went to eat at multiple cafe\u0301"

Using the same code from above, the plural "s" would still be inserted after
the 'e', but before the acute, resulting in "We went to eat at multiple
cafeś". This behavior is undesirable.

The problem is that the code does not respect the boundaries between multi-rune
characters and inserts a rune in the middle of a character. Using the norm
package, we can rewrite this piece of code as follows:

    s := "We went to eat at multiple cafe\u0301"
    cafe := "cafe"
    if p := strings.Index(s, cafe); p != -1 {
        p += len(cafe)
        if bp := norm.NFC.FirstBoundaryInString(s[p:]); bp > 0 {
            p += bp
        }
        s = s[:p] + "s" + s[p:]
    }
    fmt.Println(s)

This may be a contrived example, but the gist should be clear. Be mindful of
the fact that characters can span multiple runes. Generally these kinds of
problems can be avoided by using search functionality that respects character
boundaries (such as the planned go.text/search package).

* Iteration

Another tool provided by the norm package that may help in dealing with
character boundaries is its iterator,
[[http://godoc.org/code.google.com/p/go.text/unicode/norm#Iter][`norm.Iter`]].
It iterates over characters one at a time in the normal form of choice.
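
As an illustrative sketch rather than canonical usage, the following program
walks the NFD-encoded "café" from the previous section one character at a time
while producing NFC output, so the final segment comes out as the composed 'é':

    package main

    import (
        "fmt"

        "golang.org/x/text/unicode/norm"
    )

    func main() {
        var it norm.Iter
        it.InitString(norm.NFC, "cafe\u0301") // NFD input, iterated in NFC
        for !it.Done() {
            seg := it.Next()         // the next character, normalized to NFC
            fmt.Printf("%+q\n", seg) // "c", "a", "f", "\u00e9"
        }
    }
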
* Performing magic

As mentioned earlier, most text is in NFC form, where base characters and
modifiers are combined into a single rune whenever possible. For the purpose
of analyzing characters, it is often easier to handle runes after decomposition
into their smallest components. This is where the NFD form comes in handy. For
example, the following piece of code creates a `transform.Transformer` that
decomposes text into its smallest parts, removes all accents, and then
recomposes the text into NFC:

    import (
        "unicode"

        "golang.org/x/text/transform"
        "golang.org/x/text/unicode/norm"
    )

    isMn := func(r rune) bool {
        return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
    }
    t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)

The resulting `Transformer` can be used to remove accents from an `io.Reader`
of choice as follows:

    r = transform.NewReader(r, t)
    // read as before...

This will, for example, convert any mention of "cafés" in the text to "cafes",
regardless of the normal form in which the original text was encoded.

* Normalization info

As mentioned earlier, some packages precompute normalizations into their tables
to minimize the need for normalization at run time. The type `norm.Properties`
provides access to the per-rune information needed by these packages, most
notably the Canonical Combining Class and decomposition information. Read the
[[http://godoc.org/code.google.com/p/go.text/unicode/norm/#Properties][documentation]]
for this type if you want to dig deeper.

* Performance

To give an idea of the performance of normalization, we compare it against the
performance of strings.ToLower. The sample in the first row is both lowercase
and NFC and can in every case be returned as is. The second sample is neither
and requires writing a new version.

.html normalization/table2.html

The column with the results for the iterator shows measurements both with and
without initialization of the iterator, which contains buffers that don't need
to be reinitialized upon reuse.

As you can see, detecting whether a string is normalized can be quite
efficient. A lot of the cost of normalizing in the second row is for the
initialization of buffers, the cost of which is amortized when one is
processing larger strings. As it turns out, these buffers are rarely needed, so
we may change the implementation at some point to speed up the common case for
small strings even further.

* Conclusion

If you're dealing with text inside Go, you generally do not have to use the
unicode/norm package to normalize your text. The package may still be useful
for things like ensuring that strings are normalized before sending them out or
to do advanced text manipulation.

This article briefly mentioned the existence of other go.text packages as well
as multilingual text processing and it may have raised more questions than it
has given answers. The discussion of these topics, however, will have to wait
until another day.