golang.org/toolchain@v0.0.1-go1.9rc2.windows-amd64/blog/content/matchlang.article

1 Language and Locale Matching in Go
2 09 Feb 2016
3 Tags: language, locale, tag, BCP 47, matching
4
5 Marcel van Lohuizen
6
7 * Introduction
8
9 Consider an application, such as a web site, with support for multiple languages
10 in its user interface.
11 When a user arrives with a list of preferred languages, the application must
12 decide which language it should use in its presentation to the user.
13 This requires finding the best match between the languages the application supports
14 and those the user prefers.
15 This post explains why this is a difficult decision and how Go can help.
16
17 * Language Tags
18
19 Language tags, also known as locale identifiers, are machine-readable
20 identifiers for the language and/or dialect being used.
21 The most common reference for them is the IETF BCP 47 standard, and that is the
22 standard the Go libraries follow.
23 Here are some examples of BCP 47 language tags and the language or dialect they
24 represent.
25
26 .html matchlang/tags.html
27
28 The general form of the language tag is
29 a language code (“en”, “cmn”, “zh”, “nl”, “az” above)
30 followed by an optional subtag for script (“-Arab”),
31 region (“-US”, “-BE”, “-419”),
32 variants (“-oxendict” for Oxford English Dictionary spelling),
33 and extensions (“-u-co-phonebk” for phone-book sorting).
34 The most common form is assumed if a subtag is omitted, for instance
35 “az-Latn-AZ” for “az”.
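
To make the shape of a tag concrete, here is a minimal sketch that classifies subtags purely by their form. The `Subtags` type and the `Split` function are hypothetical names made up for this illustration and are not part of any Go package. The sketch does not validate subtags against the registry, ignores extended language subtags, and leaves variants and extensions unparsed.

```go
package main

import (
	"fmt"
	"strings"
)

// Subtags holds the main parts of a language tag. The type and its field
// names are hypothetical; they are not the x/text API.
type Subtags struct {
	Language string
	Script   string
	Region   string
	Rest     []string // variants and extensions, left unparsed
}

// isAlpha reports whether s is non-empty and consists of ASCII letters only.
func isAlpha(s string) bool {
	for _, r := range s {
		if (r < 'a' || r > 'z') && (r < 'A' || r > 'Z') {
			return false
		}
	}
	return s != ""
}

// isDigits reports whether s is non-empty and consists of ASCII digits only.
func isDigits(s string) bool {
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return s != ""
}

// Split classifies subtags by shape only: a 4-letter subtag directly after
// the language is a script, and a following 2-letter or 3-digit subtag is
// a region. Everything after that is returned untouched in Rest.
func Split(tag string) Subtags {
	parts := strings.Split(tag, "-")
	st := Subtags{Language: parts[0]}
	i := 1
	if i < len(parts) && len(parts[i]) == 4 && isAlpha(parts[i]) {
		st.Script = parts[i]
		i++
	}
	if i < len(parts) {
		p := parts[i]
		if (len(p) == 2 && isAlpha(p)) || (len(p) == 3 && isDigits(p)) {
			st.Region = p
			i++
		}
	}
	st.Rest = parts[i:]
	return st
}

func main() {
	for _, tag := range []string{"az-Latn-AZ", "es-419", "en-GB-oxendict", "de-u-co-phonebk"} {
		fmt.Printf("%-16s %+v\n", tag, Split(tag))
	}
}
```

Shape alone says nothing about which script or region is implied when a subtag is omitted; that information comes from data, which is why a library is needed for anything beyond inspection.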
36
37 The most common use of language tags is to select from a set of system-supported
38 languages according to a list of the user's language preferences, for example
39 deciding that a user who prefers Afrikaans would be best served (assuming
40 Afrikaans is not available) by the system showing Dutch. Resolving such matches
41 involves consulting data on mutual language comprehensibility.
42
43 The tag resulting from this match is subsequently used to obtain
44 language-specific resources such as translations, sorting order,
45 and casing algorithms.
46 This involves a different kind of matching. For example, as there is no specific
47 sorting order for Portuguese, a collate package may fall back to the sorting
48 order for the default, or “root”, language.
49
50 * The Messy Nature of Matching Languages
51
52 Handling language tags is tricky.
53 This is partly because the boundaries of human languages are not well defined
54 and partly because of the legacy of evolving language tag standards.
55 In this section we will show some of the messy aspects of handling language tags.
56
57 _Tags_with_different_language_codes_can_indicate_the_same_language_
58
59 For historical and political reasons, many language codes have changed over
60 time, leaving languages with an older legacy code as well as a new one.
61 But even two current codes may refer to the same language.
62 For example, the official language code for Mandarin is “cmn”, but “zh” is by
63 far the most commonly used designator for this language.
64 The code “zh” is officially reserved for a so-called macro language, identifying
65 the group of Chinese languages.
66 Tags for macro languages are often used interchangeably with the most-spoken
67 language in the group.
68
69 _Matching_language_code_alone_is_not_sufficient_
70
71 Azerbaijani (“az”), for example, is written in different scripts depending on
72 the country in which it is spoken: “az-Latn” for Latin (the default script),
73 “az-Arab” for Arabic, and “az-Cyrl” for Cyrillic.
74 If you replace “az-Arab” with just “az”, the result will be in Latin script and
75 may not be understandable to a user who only knows the Arabic form.
76
77 Also, different regions may imply different scripts.
78 For example, “zh-TW” and “zh-SG” imply the use of Traditional and Simplified
79 Han, respectively. As another example, “sr” (Serbian) defaults to Cyrillic script,
80 but “sr-RU” (Serbian as written in Russia) implies the Latin script!
81 A similar thing can be said for Kyrgyz and other languages.
82
83 If you ignore subtags, you might as well present Greek to the user.
84
85 _The_best_match_might_be_a_language_not_listed_by_the_user_
86
87 The most common written form of Norwegian (“nb”) looks an awful lot like Danish.
88 If Norwegian is not available, Danish may be a good second choice.
89 Similarly, a user requesting Swiss German (“gsw”) will likely be happy to be
90 presented German (“de”), though the converse is far from true.
91 A user requesting Uygur may be happier to fall back to Chinese than to English.
92 Other examples abound.
93 If a user-requested language is not supported, falling back to English is often
94 not the best thing to do.
95
96 _The_choice_of_language_decides_more_than_translation_
97
98 Suppose a user asks for Danish, with German as a second choice.
99 If an application chooses German, it must not only use German translations
100 but also use German (not Danish) collation.
101 Otherwise, for example, a list of animals might sort “Bär” before “Äffin”.
102
103 Selecting a supported language given the user’s preferred languages is like a
104 handshaking algorithm: first you determine which protocol to communicate in (the
105 language) and then you stick with this protocol for all communication for the
106 duration of a session.
107
108 _Using_a_“parent”_of_a_language_as_fallback_is_non-trivial_
109
110 Suppose your application supports Angolan Portuguese (“pt-AO”).
111 Packages in [[http://golang.org/x/text]], like collation and display, may not
112 have specific support for this dialect.
113 The correct course of action in such cases is to match the closest parent dialect.
114 Languages are arranged in a hierarchy, with each specific language having a more
115 general parent.
116 For example, the parent of “en-GB-oxendict” is “en-GB”, whose parent is “en”,
117 whose parent is the undefined language “und”, also known as the root language.
118 In the case of collation, there is no specific collation order for Portuguese,
119 so the collate package will select the sorting order of the root language.
120 The closest parent to Angolan Portuguese supported by the display package is
121 European Portuguese (“pt-PT”) and not the more obvious “pt”, which implies
122 Brazilian.
123
124 In general, parent relationships are non-trivial.
125 To give a few more examples, the parent of “es-CL” is “es-419”, the parent of
126 “zh-TW” is “zh-Hant”, and the parent of “zh-Hant” is “und”.
127 If you compute the parent by simply removing subtags, you may select a “dialect”
128 that is incomprehensible to the user.
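
The difference between truncating a tag and looking up its parent can be sketched as follows. The `parents` table contains only the relationships quoted in this section; the real data comes from CLDR and is far larger. The function names are hypothetical, and falling back to truncation for tags missing from the table is an assumption of this sketch, not a statement about how the Go packages behave.

```go
package main

import (
	"fmt"
	"strings"
)

// parents lists only the relationships quoted in the article. It is an
// illustrative subset, not the CLDR data set.
var parents = map[string]string{
	"en-GB-oxendict": "en-GB",
	"en-GB":          "en",
	"en":             "und",
	"es-CL":          "es-419",
	"zh-TW":          "zh-Hant",
	"zh-Hant":        "und",
}

// NaiveParent drops the last subtag, which is what ad-hoc code tends to do.
func NaiveParent(tag string) string {
	if i := strings.LastIndex(tag, "-"); i >= 0 {
		return tag[:i]
	}
	return "und"
}

// TableParent consults the table first and falls back to truncation only
// for tags the table does not know (an assumption made for this sketch).
func TableParent(tag string) string {
	if p, ok := parents[tag]; ok {
		return p
	}
	return NaiveParent(tag)
}

func main() {
	for _, tag := range []string{"es-CL", "zh-TW", "zh-Hant", "en-GB-oxendict"} {
		fmt.Printf("%-15s naive=%-6s table=%s\n", tag, NaiveParent(tag), TableParent(tag))
	}
}
```

For “es-CL” the naive parent is “es” while the table gives “es-419”, and for “zh-Hant” truncation yields “zh”, which implies Simplified Chinese, while the table gives “und”.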
129
130 * Language Matching in Go
131
132 The Go package [[http://golang.org/x/text/language]] implements the BCP 47
133 standard for language tags and adds support for deciding which language to use
134 based on data published in the Unicode Common Locale Data Repository (CLDR).
135
136 Here is a sample program, explained below, matching a user's language
137 preferences against an application's supported languages:
138
139 .code -edit matchlang/complete.go
140
141 ** Creating Language Tags
142
143 The simplest way to create a language.Tag from a user-given language code string
144 is with language.Make.
145 It extracts meaningful information even from malformed input.
146 For example, “en-USD” will result in “en” even though USD is not a valid subtag.
147
148 Make doesn’t return an error.
149 Since it is common practice to fall back to the default language when an error
150 occurs, this makes Make more convenient to use. Use Parse to handle errors manually.
151
152 The HTTP Accept-Language header is often used to pass a user’s desired
153 languages.
154 The ParseAcceptLanguage function parses it into a slice of language tags,
155 ordered by preference.
156
157 By default, the language package does not canonicalize tags.
158 For example, it does not follow the BCP 47 recommendation of eliminating a script
159 if it is the common choice in the “overwhelming majority”.
160 It similarly ignores CLDR recommendations: “cmn” is not replaced by “zh” and
161 “zh-Hant-HK” is not simplified to “zh-HK”.
162 Canonicalizing tags may throw away useful information about user intent.
163 Canonicalization is handled in the Matcher instead.
164 A full array of canonicalization options is available for programmers who
165 still want to canonicalize tags themselves.
166
167 ** Matching User-Preferred Languages to Supported Languages
168
169 A Matcher matches user-preferred languages to supported languages.
170 Users are strongly advised to use it if they don’t want to deal with all the
171 intricacies of matching languages.
172
173 The Match method may pass through user settings (from BCP 47 extensions) from
174 the preferred tags to the selected supported tag.
175 It is therefore important that the tag returned by Match is used to obtain
176 language-specific resources.
177 For example, “de-u-co-phonebk” requests phone-book ordering for German.
178 The extension is ignored for matching, but is used by the collate package to
179 select the respective sorting order variant.
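
The idea of passing user settings through can be sketched with plain strings. `CarryExtension` is a hypothetical helper, not the Matcher's implementation. It assumes the selected tag has no “-u-” extension of its own and copies everything from “-u-” onward, so it does not handle private-use subtags or other edge cases.

```go
package main

import (
	"fmt"
	"strings"
)

// CarryExtension appends the "-u-" extension of the preferred tag, if any,
// to the supported tag that was selected for it.
func CarryExtension(preferred, selected string) string {
	i := strings.Index(preferred, "-u-")
	if i < 0 {
		return selected // nothing to carry over
	}
	return selected + preferred[i:]
}

func main() {
	fmt.Println(CarryExtension("gsw-u-co-phonebk", "de")) // de-u-co-phonebk
	fmt.Println(CarryExtension("nl-BE", "nl"))            // nl
}
```

Because the returned tag can differ from every entry in the supported list, later lookups for collation or formatting should start from the returned tag rather than from the supported entry it was derived from.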
180
181 A Matcher is initialized with the languages supported by an application, which
182 are usually the languages for which there are translations.
183 This set is typically fixed, allowing a matcher to be created at startup.
184 Matcher is optimized to improve the performance of Match at the expense of
185 initialization cost.
186
187 The language package provides a predefined set of the most commonly used
188 language tags that can be used for defining the supported set.
189 Users generally don’t have to worry about the exact tags to pick for supported
190 languages.
191 For example, AmericanEnglish (“en-US”) may be used interchangeably with the more
192 common English (“en”), which defaults to American.
193 It is all the same for the Matcher. An application may even add both, allowing
194 for more specific American slang for “en-US”.
195
196 ** Matching Example
197
198 Consider the following list of supported languages and the Matcher built from it:
199
200 var supported = []language.Tag{
201 	language.AmericanEnglish,    // en-US: first language is fallback
202 	language.German,             // de
203 	language.Dutch,              // nl
204 	language.Portuguese,         // pt (defaults to Brazilian)
205 	language.EuropeanPortuguese, // pt-PT
206 	language.Romanian,           // ro
207 	language.Serbian,            // sr (defaults to Cyrillic script)
208 	language.SerbianLatin,       // sr-Latn
209 	language.SimplifiedChinese,  // zh-Hans
210 	language.TraditionalChinese, // zh-Hant
211 }
212 var matcher = language.NewMatcher(supported)
213
214 Let's look at the matches against this list of supported languages for various
215 user preferences.
216
217 For a user preference of "he" (Hebrew), the best match is "en-US" (American
218 English).
219 There is no good match, so the matcher uses the fallback language (the first in
220 the supported list).
221
222 For a user preference of "hr" (Croatian), the best match is "sr-Latn" (Serbian
223 with Latin script), because, once they are written in the same script, Serbian
224 and Croatian are mutually intelligible.
225
226 For a user preference of "ru, mo" (Russian, then Moldavian), the best match is
227 "ro" (Romanian), because Moldavian is now canonically classified as "ro-MD"
228 (Romanian in Moldova).
229
230 For a user preference of "zh-TW" (Mandarin in Taiwan), the best match is
231 "zh-Hant" (Mandarin written in Traditional Chinese), not "zh-Hans" (Mandarin
232 written in Simplified Chinese).
233
234 For a user preference of "af, ar" (Afrikaans, then Arabic), the best match is
235 "nl" (Dutch). Neither preference is supported directly, but Dutch is a
236 significantly closer match to Afrikaans than the fallback language English is to
237 either.
238
239 For a user preference of "pt-AO, id" (Angolan Portuguese, then Indonesian), the
240 best match is "pt-PT" (European Portuguese), not "pt" (Brazilian Portuguese).
241
242 For a user preference of "gsw-u-co-phonebk" (Swiss German with phone-book
243 collation order), the best match is "de-u-co-phonebk" (German with phone-book
244 collation order).
245 German is the best match for Swiss German in the server's language list, and the
246 option for phone-book collation order has been carried over.
247
248 ** Confidence Scores
249
250 Go uses coarse-grained confidence scoring with rule-based elimination.
251 A match is classified as Exact, High (not exact, but no known ambiguity), Low
252 (probably the correct match, but maybe not), or No.
253 In case of multiple matches, there is a set of tie-breaking rules that are
254 executed in order.
255 The first match is returned in the case of multiple equal matches.
256 These confidence scores may be useful, for example, to reject relatively weak
257 matches.
258 They are also used to score, for example, the most likely region or script from
259 a language tag.
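
Coarse-grained scoring with a first-match tie-break can be sketched in a few lines. The `Confidence` type below only mirrors the four levels named above, and `Candidate` and `Best` are hypothetical names for this illustration. How a confidence is computed is out of scope here, so the values in `main` are made up.

```go
package main

import "fmt"

// Confidence mirrors the four levels named in the text, ordered from
// weakest to strongest so that values can be compared with < and >.
type Confidence int

const (
	No Confidence = iota
	Low
	High
	Exact
)

// String returns the name of the confidence level.
func (c Confidence) String() string {
	return [...]string{"No", "Low", "High", "Exact"}[c]
}

// Candidate pairs a supported tag with the confidence of its match.
type Candidate struct {
	Tag  string
	Conf Confidence
}

// Best returns the strongest candidate at or above threshold. Among
// equally strong candidates the first one wins. ok is false if no
// candidate qualifies, which lets the caller reject weak matches.
func Best(cands []Candidate, threshold Confidence) (best Candidate, ok bool) {
	for _, c := range cands {
		if c.Conf < threshold {
			continue
		}
		if !ok || c.Conf > best.Conf {
			best, ok = c, true
		}
	}
	return best, ok
}

func main() {
	// Made-up confidences, for illustration only.
	cands := []Candidate{{"en-US", Low}, {"nl", High}, {"de", High}}
	if c, ok := Best(cands, High); ok {
		fmt.Println(c.Tag, c.Conf) // nl High
	}
}
```

With only four levels, ties are common, so the order of the candidate list carries real weight.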
260
261 Implementations in other languages often use more fine-grained, variable-scale
262 scoring.
263 We found that using coarse-grained scoring in the Go implementation ended up
264 simpler to implement, more maintainable, and faster, meaning that we could
265 handle more rules.
266
267 ** Displaying Supported Languages
268
269 The [[http://golang.org/x/text/language/display]] package allows naming language
270 tags in many languages.
271 It also contains a “Self” namer for displaying a tag in its own language.
272
273 For example:
274
275 .code -edit matchlang/display.go /START/,/END/
276
277 prints
278
279 English (English)
280 French (français)
281 Dutch (Nederlands)
282 Flemish (Vlaams)
283 Simplified Chinese (简体中文)
284 Traditional Chinese (繁體中文)
285 Russian (русский)
286
287 In the second column, note the differences in capitalization, reflecting the
288 rules of the respective language.
289
290 * Conclusion
291
292 At first glance, language tags look like nicely structured data, but because
293 they describe human languages, the structure of relationships between language
294 tags is actually quite complex.
295 It is often tempting, especially for English-speaking programmers, to write
296 ad-hoc language matching using nothing other than string manipulation of the
297 language tags.
298 As described above, this can produce awful results.
299
300 Go's [[http://golang.org/x/text/language]] package solves this complex problem
301 while still presenting a simple, easy-to-use API. Enjoy.