Language and Locale Matching in Go
09 Feb 2016
Tags: language, locale, tag, BCP 47, matching

Marcel van Lohuizen

* Introduction

Consider an application, such as a web site, with support for multiple languages
in its user interface.
When a user arrives with a list of preferred languages, the application must
decide which language it should use in its presentation to the user.
This requires finding the best match between the languages the application supports
and those the user prefers.
This post explains why this is a difficult decision and how Go can help.

* Language Tags

Language tags, also known as locale identifiers, are machine-readable
identifiers for the language and/or dialect being used.
The most common reference for them is the IETF BCP 47 standard, and that is the
standard the Go libraries follow.
Here are some examples of BCP 47 language tags and the language or dialect they
represent.

.html matchlang/tags.html

The general form of the language tag is
a language code (“en”, “cmn”, “zh”, “nl”, “az” above)
followed by an optional subtag for script (“-Arab”),
region (“-US”, “-BE”, “-419”),
variants (“-oxendict” for Oxford English Dictionary spelling),
and extensions (“-u-co-phonebk” for phone-book sorting).
The most common form is assumed if a subtag is omitted, for instance
“az-Latn-AZ” for “az”.

The most common use of language tags is to select from a set of system-supported
languages according to a list of the user's language preferences, for example
deciding that a user who prefers Afrikaans would be best served (assuming
Afrikaans is not available) by the system showing Dutch. Resolving such matches
involves consulting data on mutual language comprehensibility.

The tag resulting from this match is subsequently used to obtain
language-specific resources such as translations, sorting order,
and casing algorithms.
This involves a different kind of matching. For example, as there is no specific
sorting order for Portuguese, a collate package may fall back to the sorting
order for the default, or “root”, language.

* The Messy Nature of Matching Languages

Handling language tags is tricky.
This is partly because the boundaries of human languages are not well defined
and partly because of the legacy of evolving language tag standards.
In this section we will show some of the messy aspects of handling language tags.

_Tags_with_different_language_codes_can_indicate_the_same_language_

For historical and political reasons, many language codes have changed over
time, leaving languages with an older legacy code as well as a new one.
But even two current codes may refer to the same language.
For example, the official language code for Mandarin is “cmn”, but “zh” is by
far the most commonly used designator for this language.
The code “zh” is officially reserved for a so-called macro language, identifying
the group of Chinese languages.
Tags for macro languages are often used interchangeably with the most-spoken
language in the group.

_Matching_language_code_alone_is_not_sufficient_

Azerbaijani (“az”), for example, is written in different scripts depending on
the country in which it is spoken: "az-Latn" for Latin (the default script),
"az-Arab" for Arabic, and “az-Cyrl” for Cyrillic.
If you replace "az-Arab" with just "az", the result will be in Latin script and
may not be understandable to a user who only knows the Arabic form.

Also, different regions may imply different scripts.
For example: “zh-TW” and “zh-SG” respectively imply the use of Traditional and
Simplified Han.
As another example, “sr” (Serbian) defaults to Cyrillic script,
but “sr-RU” (Serbian as written in Russia) implies the Latin script!
A similar thing can be said for Kyrgyz and other languages.

If you ignore subtags, you might as well present Greek to the user.

_The_best_match_might_be_a_language_not_listed_by_the_user_

The most common written form of Norwegian (“nb”) looks an awful lot like Danish.
If Norwegian is not available, Danish may be a good second choice.
Similarly, a user requesting Swiss German (“gsw”) will likely be happy to be
presented with German (“de”), though the converse is far from true.
A user requesting Uygur may be happier to fall back to Chinese than to English.
Other examples abound.
If a user-requested language is not supported, falling back to English is often
not the best thing to do.

_The_choice_of_language_decides_more_than_translation_

Suppose a user asks for Danish, with German as a second choice.
If an application chooses German, it must not only use German translations
but also use German (not Danish) collation.
Otherwise, for example, a list of animals might sort “Bär” before “Äffin”.

Selecting a supported language given the user’s preferred languages is like a
handshaking algorithm: first you determine which protocol to communicate in (the
language) and then you stick with this protocol for all communication for the
duration of a session.

_Using_a_“parent”_of_a_language_as_fallback_is_non-trivial_

Suppose your application supports Angolan Portuguese (“pt-AO”).
Packages in [[http://golang.org/x/text]], like collation and display, may not
have specific support for this dialect.
The correct course of action in such cases is to match the closest parent dialect.
Languages are arranged in a hierarchy, with each specific language having a more
general parent.
For example, the parent of “en-GB-oxendict” is “en-GB”, whose parent is “en”,
whose parent is the undetermined language “und”, also known as the root language.
In the case of collation, there is no specific collation order for Portuguese,
so the collate package will select the sorting order of the root language.
The closest parent to Angolan Portuguese supported by the display package is
European Portuguese (“pt-PT”) and not the more obvious “pt”, which implies
Brazilian.

In general, parent relationships are non-trivial.
To give a few more examples, the parent of “es-CL” is “es-419”, the parent of
“zh-TW” is “zh-Hant”, and the parent of “zh-Hant” is “und”.
If you compute the parent by simply removing subtags, you may select a “dialect”
that is incomprehensible to the user.

* Language Matching in Go

The Go package [[http://golang.org/x/text/language]] implements the BCP 47
standard for language tags and adds support for deciding which language to use
based on data published in the Unicode Common Locale Data Repository (CLDR).

Here is a sample program, explained below, matching a user's language
preferences against an application's supported languages:

.code -edit matchlang/complete.go

** Creating Language Tags

The simplest way to create a language.Tag from a user-given language code string
is with language.Make.
It extracts meaningful information even from malformed input.
For example, “en-USD” will result in “en” even though USD is not a valid subtag.

Make doesn’t return an error.
It is common practice to use the default language if an error occurs anyway so
this makes it more convenient. Use Parse to handle any error manually.

The HTTP Accept-Language header is often used to pass a user’s desired
languages.
The ParseAcceptLanguage function parses it into a slice of language tags,
ordered by preference.

By default, the language package does not canonicalize tags.
For example, it does not follow the BCP 47 recommendation of eliminating a script
subtag if it is the common choice in the “overwhelming majority”.
It similarly ignores CLDR recommendations: “cmn” is not replaced by “zh” and
“zh-Hant-HK” is not simplified to “zh-HK”.
Canonicalizing tags may throw away useful information about user intent.
Canonicalization is handled in the Matcher instead.
A full array of canonicalization options is available if the programmer still
desires to do so.

** Matching User-Preferred Languages to Supported Languages

A Matcher matches user-preferred languages to supported languages.
Users are strongly advised to use it if they don’t want to deal with all the
intricacies of matching languages.

The Match method may pass through user settings (from BCP 47 extensions) from
the preferred tags to the selected supported tag.
It is therefore important that the tag returned by Match is used to obtain
language-specific resources.
For example, “de-u-co-phonebk” requests phone-book ordering for German.
The extension is ignored for matching, but is used by the collate package to
select the respective sorting order variant.

A Matcher is initialized with the languages supported by an application, which
are usually the languages for which there are translations.
This set is typically fixed, allowing a matcher to be created at startup.
Matcher is optimized to improve the performance of Match at the expense of
initialization cost.

The language package provides a predefined set of the most commonly used
language tags that can be used for defining the supported set.
Users generally don’t have to worry about the exact tags to pick for supported
languages.
For example, AmericanEnglish (“en-US”) may be used interchangeably with the more
common English (“en”), which defaults to American.
It is all the same for the Matcher. An application may even add both, allowing
for more specific American slang for “en-US”.

** Matching Example

Consider the following Matcher and list of supported languages:

	var supported = []language.Tag{
		language.AmericanEnglish,    // en-US: first language is fallback
		language.German,             // de
		language.Dutch,              // nl
		language.Portuguese,         // pt (defaults to Brazilian)
		language.EuropeanPortuguese, // pt-PT
		language.Romanian,           // ro
		language.Serbian,            // sr (defaults to Cyrillic script)
		language.SerbianLatin,       // sr-Latn
		language.SimplifiedChinese,  // zh-Hans
		language.TraditionalChinese, // zh-Hant
	}
	var matcher = language.NewMatcher(supported)

Let's look at the matches against this list of supported languages for various
user preferences.

For a user preference of "he" (Hebrew), the best match is "en-US" (American
English).
There is no good match, so the matcher uses the fallback language (the first in
the supported list).

For a user preference of "hr" (Croatian), the best match is "sr-Latn" (Serbian
with Latin script), because, once they are written in the same script, Serbian
and Croatian are mutually intelligible.

For a user preference of "ru, mo" (Russian, then Moldavian), the best match is
"ro" (Romanian), because Moldavian is now canonically classified as "ro-MD"
(Romanian in Moldova).

For a user preference of "zh-TW" (Mandarin in Taiwan), the best match is
"zh-Hant" (Mandarin written in Traditional Chinese), not "zh-Hans" (Mandarin
written in Simplified Chinese).

For a user preference of "af, ar" (Afrikaans, then Arabic), the best match is
"nl" (Dutch).
Neither preference is supported directly, but Dutch is a
significantly closer match to Afrikaans than the fallback language English is to
either.

For a user preference of "pt-AO, id" (Angolan Portuguese, then Indonesian), the
best match is "pt-PT" (European Portuguese), not "pt" (Brazilian Portuguese).

For a user preference of "gsw-u-co-phonebk" (Swiss German with phone-book
collation order), the best match is "de-u-co-phonebk" (German with phone-book
collation order).
German is the best match for Swiss German in the server's language list, and the
option for phone-book collation order has been carried over.

** Confidence Scores

Go uses coarse-grained confidence scoring with rule-based elimination.
A match is classified as Exact, High (not exact, but no known ambiguity), Low
(probably the correct match, but maybe not), or No.
In case of multiple matches, there is a set of tie-breaking rules that are
executed in order.
The first match is returned in the case of multiple equal matches.
These confidence scores may be useful, for example, to reject relatively weak
matches.
They are also used to score, for example, the most likely region or script from
a language tag.

Implementations in other languages often use more fine-grained, variable-scale
scoring.
We found that using coarse-grained scoring in the Go implementation ended up
simpler to implement, more maintainable, and faster, meaning that we could
handle more rules.

** Displaying Supported Languages

The [[http://golang.org/x/text/language/display]] package allows naming language
tags in many languages.
It also contains a “Self” namer for displaying a tag in its own language.

For example:

.code -edit matchlang/display.go /START/,/END/

prints

	English              (English)
	French               (français)
	Dutch                (Nederlands)
	Flemish              (Vlaams)
	Simplified Chinese   (简体中文)
	Traditional Chinese  (繁體中文)
	Russian              (русский)

In the second column, note the differences in capitalization, reflecting the
rules of the respective language.

* Conclusion

At first glance, language tags look like nicely structured data, but because
they describe human languages, the structure of relationships between language
tags is actually quite complex.
It is often tempting, especially for English-speaking programmers, to write
ad-hoc language matching using nothing other than string manipulation of the
language tags.
As described above, this can produce awful results.

Go's [[http://golang.org/x/text/language]] package solves this complex problem
while still presenting a simple, easy-to-use API. Enjoy.