Text normalization in Go
26 Nov 2013
Tags: strings, bytes, runes, characters

Marcel van Lohuizen

* Introduction

An earlier [[http://blog.golang.org/strings][post]] talked about strings, bytes
and characters in Go. I've been working on various packages for multilingual
text processing for the golang.org/x/text repository (formerly known as
go.text). Several of these packages deserve a separate blog post, but today I
want to focus on
[[https://godoc.org/golang.org/x/text/unicode/norm][golang.org/x/text/unicode/norm]],
which handles normalization, a topic touched on in the
[[http://blog.golang.org/strings][strings article]] and the subject of this
post. Normalization works at a higher level of abstraction than raw bytes.

To learn pretty much everything you ever wanted to know about normalization
(and then some), [[http://unicode.org/reports/tr15/][Annex 15 of the Unicode Standard]]
is a good read. A more approachable article is the corresponding
[[http://en.wikipedia.org/wiki/Unicode_equivalence][Wikipedia page]]. Here we
focus on how normalization relates to Go.

* What is normalization?

There are often several ways to represent the same string. For example, an é
(e-acute) can be represented in a string as a single rune ("\u00e9") or an 'e'
followed by an acute accent ("e\u0301"). According to the Unicode standard,
these two are "canonically equivalent" and should be treated as equal.

Using a byte-to-byte comparison to determine equality would clearly not give
the right result for these two strings. Unicode defines a set of normal forms
such that if two strings are canonically equivalent and are normalized to the
same normal form, their byte representations are the same.
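
As a quick illustration, here is a small, self-contained sketch (using the
`String` convenience method from the norm package discussed below) of how two
canonically equivalent strings differ byte for byte yet agree once normalized:

	package main

	import (
		"fmt"

		"golang.org/x/text/unicode/norm"
	)

	func main() {
		a, b := "\u00e9", "e\u0301" // é as a single rune vs. 'e' plus a combining acute

		fmt.Println(a == b)                                   // false: the bytes differ
		fmt.Println(norm.NFC.String(a) == norm.NFC.String(b)) // true: same NFC bytes
	}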

Unicode also defines a "compatibility equivalence" to equate characters that
represent the same character, but may have a different visual appearance. For
example, the superscript digit '⁹' and the regular digit '9' are equivalent in
this form.

For each of these two equivalence forms, Unicode defines a composing and
decomposing form. The former replaces runes that can combine into a single rune
with this single rune. The latter breaks runes apart into their components.
This table shows the names, all starting with NF, by which the Unicode
Consortium identifies these forms:

.html normalization/table1.html

* Go's approach to normalization

As mentioned in the strings blog post, Go does not guarantee that characters in
a string are normalized. However, the golang.org/x/text packages can
compensate. For example, the
[[https://godoc.org/golang.org/x/text/collate][collate]] package, which can
sort strings in a language-specific way, works correctly even with
unnormalized strings. The packages in golang.org/x/text do not always require
normalized input, but in general normalization may be necessary for consistent
results.

Normalization isn't free but it is fast, particularly for collation and
searching or if a string is either in NFD or in NFC and can be converted to NFD
by decomposing without reordering its bytes. In practice,
[[http://www.macchiato.com/unicode/nfc-faq#TOC-How-much-text-is-already-NFC-][99.98%]] of
the web's HTML page content is in NFC form (not counting markup, in which case
it would be more). By far most NFC can be decomposed to NFD without the need
for reordering (which requires allocation). Also, it is efficient to detect
when reordering is necessary, so we can save time by doing it only for the rare
segments that need it.

To make things even better, the collation package typically does not use the
norm package directly, but instead uses the norm package to interleave
normalization information with its own tables. Interleaving the two problems
allows for reordering and normalization on the fly with almost no impact on
performance. The cost of on-the-fly normalization is compensated by not having
to normalize text beforehand and ensuring that the normal form is maintained
upon edits. The latter can be tricky. For instance, the result of concatenating
two NFC-normalized strings is not guaranteed to be in NFC.
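
Here is a minimal sketch of that pitfall using methods from the norm package:

	a, b := "e", "\u0301" // each string is in NFC form on its own
	s := a + b            // but "e\u0301" is not NFC: it recomposes to "\u00e9"

	fmt.Println(norm.NFC.IsNormalString(s)) // false
	fmt.Println(norm.NFC.String(s))         // "é", the single rune U+00E9

(The package's `AppendString` method exists for exactly this case: it appends a
string to a normalized byte slice while keeping the result normalized.)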

Of course, we can also avoid the overhead outright if we know in advance that a
string is already normalized, which is often the case.
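
In code, that check is cheap. A sketch of the resulting pattern:

	// IsNormalString is fast; pay for a rewrite only when needed.
	if !norm.NFC.IsNormalString(s) {
		s = norm.NFC.String(s)
	}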

* Why bother?

After all this discussion about avoiding normalization, you might ask why it's
worth worrying about at all. The reason is that there are cases where
normalization is required and it is important to understand what those are, and
in turn how to do it correctly.

Before discussing those, we must first clarify the concept of 'character'.

* What is a character?

As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301"
in NFD). Together these two runes are one character. The definition of a
character may vary depending on the application. For normalization we will
define it as a sequence of runes that starts with a starter, a rune that does
not modify or combine backwards with any other rune, followed by a possibly
empty sequence of non-starters, that is, runes that do (typically accents). The
normalization algorithm processes one character at a time.
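
As a small sketch of this definition at work, the norm package's boundary
methods can measure where the first character ends:

	s := "e\u0301e" // 'é' in NFD (two runes) followed by a plain 'e'

	// The first character spans 'e' plus the two-byte combining acute,
	// so the boundary before the next character lies at byte offset 3.
	fmt.Println(norm.NFD.NextBoundaryInString(s, true)) // 3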

Theoretically, there is no bound to the number of runes that can make up a
Unicode character. In fact, there are no restrictions on the number of
modifiers that can follow a character and a modifier may be repeated, or
stacked. Ever seen an 'e' with three acutes? Here you go: 'é́́'. That is a
perfectly valid 4-rune character according to the standard.

As a consequence, even at the lowest level, text needs to be processed in
increments of unbounded chunk sizes. This is especially awkward with a
streaming approach to text processing, as used by Go's standard Reader and
Writer interfaces, as that model potentially requires any intermediate buffers
to have unbounded size as well. Also, a straightforward implementation of
normalization will have an O(n²) running time.

In practice, there are no meaningful interpretations of such long sequences of
modifiers. Unicode defines a Stream-Safe Text format, which allows capping the
number of modifiers (non-starters) to at most 30, more than enough for any
practical purpose. Subsequent modifiers will be placed after a freshly inserted
Combining Grapheme Joiner (CGJ or U+034F). Go adopts this approach for all
normalization algorithms. This decision gives up a little conformance but gains
a little safety.
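
A short sketch to see this in action; if I read the package's Stream-Safe
handling correctly, normalizing an overlong run of accents inserts a CGJ:

	s := "e" + strings.Repeat("\u0301", 40) // an 'e' buried under 40 acutes

	// The run of non-starters is capped at 30, so a Combining Grapheme
	// Joiner should appear in the normalized output.
	fmt.Println(strings.ContainsRune(norm.NFC.String(s), '\u034f')) // true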

* Writing in normal form

Even if you don't need to normalize text within your Go code, you might still
want to do so when communicating to the outside world. For example, normalizing
to NFC might compact your text, making it cheaper to send down a wire. For some
languages, like Korean, the savings can be substantial. Also, some external
APIs might expect text in a certain normal form. Or you might just want to fit
in and output your text as NFC like the rest of the world.

To write your text as NFC, use the
[[https://godoc.org/golang.org/x/text/unicode/norm][unicode/norm]] package
to wrap your `io.Writer` of choice:

	wc := norm.NFC.Writer(w)
	defer wc.Close()
	// write as before...

If you have a small string and want to do a quick conversion, you can use this
simpler form:

	norm.NFC.Bytes(b)

Package norm provides various other methods for normalizing text.
Pick the one that suits your needs best.

* Catching look-alikes

Can you tell the difference between 'K' ("\u004B") and 'K' (Kelvin sign
"\u212A") or 'Ω' ("\u03a9") and 'Ω' (Ohm sign "\u2126")? It is easy to overlook
the sometimes minute differences between variants of the same underlying
character. It is generally a good idea to disallow such variants in identifiers
or anything where deceiving users with such look-alikes can pose a security
hazard.

The compatibility normal forms, NFKC and NFKD, will map many visually nearly
identical forms to a single value. Note that they will not do so when two
symbols look alike but are really from two different alphabets. For example,
the Latin 'o', Greek 'ο', and Cyrillic 'о' are still different characters as
defined by these forms.
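
A quick sketch of both behaviors with `norm.NFKC.String`:

	fmt.Println(norm.NFKC.String("\u212a") == "K") // true: Kelvin sign folds to 'K'
	fmt.Println(norm.NFKC.String("\u2126") == "Ω") // true: Ohm sign folds to Greek omega
	fmt.Println(norm.NFKC.String("\u043e") == "o") // false: Cyrillic 'о' stays distinct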

* Correct text modifications

The norm package might also come to the rescue when one needs to modify text.
Consider a case where you want to search and replace the word "cafe" with its
plural form "cafes". A code snippet could look like this.

	s := "We went to eat at multiple cafe"
	cafe := "cafe"
	if p := strings.Index(s, cafe); p != -1 {
		p += len(cafe)
		s = s[:p] + "s" + s[p:]
	}
	fmt.Println(s)

This prints "We went to eat at multiple cafes" as desired and expected. Now
consider that our text contains the French spelling "café" in NFD form:

	s := "We went to eat at multiple cafe\u0301"

Using the same code from above, the plural "s" would still be inserted after
the 'e', but before the acute, resulting in "We went to eat at multiple
cafeś". This behavior is undesirable.

The problem is that the code does not respect the boundaries between multi-rune
characters and inserts a rune in the middle of a character. Using the norm
package, we can rewrite this piece of code as follows:

	s := "We went to eat at multiple cafe\u0301"
	cafe := "cafe"
	if p := strings.Index(s, cafe); p != -1 {
		p += len(cafe)
		// Skip past any trailing non-starters (here, the combining
		// acute) to the end of the current character.
		if bp := norm.NFC.FirstBoundaryInString(s[p:]); bp > 0 {
			p += bp
		}
		s = s[:p] + "s" + s[p:]
	}
	fmt.Println(s)

This may be a contrived example, but the gist should be clear. Be mindful of
the fact that characters can span multiple runes. Generally these kinds of
problems can be avoided by using search functionality that respects character
boundaries (such as the golang.org/x/text/search package).

* Iteration

Another tool provided by the norm package that may help when dealing with
character boundaries is its iterator,
[[https://godoc.org/golang.org/x/text/unicode/norm#Iter][`norm.Iter`]].
It iterates over characters one at a time in the normal form of choice.
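
A small sketch of its use, printing the characters of an NFD string one at a
time:

	var it norm.Iter
	it.InitString(norm.NFD, "cafe\u0301")
	for !it.Done() {
		fmt.Printf("%+q\n", it.Next()) // "c", "a", "f", then "e\u0301" as one chunk
	}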

* Performing magic

As mentioned earlier, most text is in NFC form, where base characters and
modifiers are combined into a single rune whenever possible. For the purpose
of analyzing characters, it is often easier to handle runes after decomposition
into their smallest components. This is where the NFD form comes in handy. For
example, the following piece of code creates a `transform.Transformer` that
decomposes text into its smallest parts, removes all accents, and then
recomposes the text into NFC:

	import (
		"unicode"

		"golang.org/x/text/transform"
		"golang.org/x/text/unicode/norm"
	)

	isMn := func(r rune) bool {
		return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
	}
	t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)

The resulting `Transformer` can be used to remove accents from an `io.Reader`
of choice as follows:

	r = transform.NewReader(r, t)
	// read as before ...

This will, for example, convert any mention of "cafés" in the text to "cafes",
regardless of the normal form in which the original text was encoded.
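
A sketch of the full round trip with `transform.String`, which applies a
`Transformer` to a string in one call:

	s, _, err := transform.String(t, "résumé, café")
	if err == nil {
		fmt.Println(s) // "resume, cafe"
	}

(In current releases of golang.org/x/text, `transform.RemoveFunc` is deprecated;
`runes.Remove(runes.In(unicode.Mn))` from the golang.org/x/text/runes package is
the suggested replacement.)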

* Normalization info

As mentioned earlier, some packages precompute normalizations into their tables
to minimize the need for normalization at run time. The type `norm.Properties`
provides access to the per-rune information needed by these packages, most
notably the Canonical Combining Class and decomposition information. Read the
[[https://godoc.org/golang.org/x/text/unicode/norm#Properties][documentation]]
for this type if you want to dig deeper.
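
A brief sketch of what this looks like for the acute accent and for 'é':

	acute := norm.NFD.PropertiesString("\u0301")
	fmt.Println(acute.CCC()) // 230: the Canonical Combining Class of U+0301

	d := norm.NFC.PropertiesString("\u00e9").Decomposition()
	fmt.Printf("%+q\n", d) // "e\u0301": the canonical decomposition of é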

* Performance

To give an idea of the performance of normalization, we compare it against the
performance of `strings.ToLower`. The sample in the first row is both lowercase
and NFC and can in every case be returned as is. The second sample is neither
and requires writing a new version.

.html normalization/table2.html

The column with the results for the iterator shows both the measurement with
and without initialization of the iterator, which contains buffers that don't
need to be reinitialized upon reuse.

As you can see, detecting whether a string is normalized can be quite
efficient. A lot of the cost of normalizing in the second row is for the
initialization of buffers, the cost of which is amortized when one is
processing larger strings. As it turns out, these buffers are rarely needed, so
we may change the implementation at some point to speed up the common case for
small strings even further.

* Conclusion

If you're dealing with text inside Go, you generally do not have to use the
unicode/norm package to normalize your text. The package may still be useful
for things like ensuring that strings are normalized before sending them out or
doing advanced text manipulation.

This article briefly mentioned the existence of other golang.org/x/text
packages as well as multilingual text processing in general, and it may have
raised more questions than it has given answers. The discussion of these
topics, however, will have to wait until another day.