     1  Language and Locale Matching in Go
     2  09 Feb 2016
     3  Tags: language, locale, tag, BCP 47, matching
     4  
     5  Marcel van Lohuizen
     6  
     7  * Introduction
     8  
     9  Consider an application, such as a web site, with support for multiple languages
    10  in its user interface.
    11  When a user arrives with a list of preferred languages, the application must
    12  decide which language it should use in its presentation to the user.
    13  This requires finding the best match between the languages the application supports
    14  and those the user prefers.
    15  This post explains why this is a difficult decision and how Go can help.
    16  
    17  * Language Tags
    18  
    19  Language tags, also known as locale identifiers, are machine-readable
    20  identifiers for the language and/or dialect being used.
    21  The most common reference for them is the IETF BCP 47 standard, and that is the
    22  standard the Go libraries follow.
    23  Here are some examples of BCP 47 language tags and the language or dialect they
    24  represent.
    25  
    26  .html matchlang/tags.html
    27  
    28  The general form of the language tag is
    29  a language code (“en”, “cmn”, “zh”, “nl”, “az” above)
    30  followed by an optional subtag for script (“-Arab”),
    31  region (“-US”, “-BE”, “-419”),
    32  variants (“-oxendict” for Oxford English Dictionary spelling),
    33  and extensions (“-u-co-phonebk” for phone-book sorting).
    34  The most common form is assumed if a subtag is omitted, for instance
    35  “az-Latn-AZ” for “az”.
    36  
    37  The most common use of language tags is to select from a set of system-supported
    38  languages according to a list of the user's language preferences, for example
    39  deciding that a user who prefers Afrikaans would be best served (assuming
    40  Afrikaans is not available) by the system showing Dutch. Resolving such matches
    41  involves consulting data on mutual language comprehensibility.
    42  
    43  The tag resulting from this match is subsequently used to obtain
    44  language-specific resources such as translations, sorting order,
    45  and casing algorithms.
    46  This involves a different kind of matching. For example, as there is no specific
    47  sorting order for Portuguese, a collate package may fall back to the sorting
    48  order for the default, or “root”, language.
    49  
    50  * The Messy Nature of Matching Languages
    51  
    52  Handling language tags is tricky.
    53  This is partly because the boundaries of human languages are not well defined
    54  and partly because of the legacy of evolving language tag standards.
    55  In this section we will show some of the messy aspects of handling language tags.
    56  
    57  _Tags_with_different_language_codes_can_indicate_the_same_language_
    58  
    59  For historical and political reasons, many language codes have changed over
    60  time, leaving languages with an older legacy code as well as a new one.
    61  But even two current codes may refer to the same language.
    62  For example, the official language code for Mandarin is “cmn”, but “zh” is by
    63  far the most commonly used designator for this language.
    64  The code “zh” is officially reserved for a so-called macro language, identifying
    65  the group of Chinese languages.
    66  Tags for macro languages are often used interchangeably with the most-spoken
    67  language in the group.
    68  
    69  _Matching_language_code_alone_is_not_sufficient_
    70  
    71  Azerbaijani (“az”), for example, is written in different scripts depending on
    72  the country in which it is spoken: “az-Latn” for Latin (the default script),
    73  “az-Arab” for Arabic, and “az-Cyrl” for Cyrillic.
    74  If you replace “az-Arab” with just “az”, the result will be in Latin script and
    75  may not be understandable to a user who only knows the Arabic form.
    76  
    77  Different regions may also imply different scripts.
    78  For example, “zh-TW” and “zh-SG” respectively imply the use of Traditional and
    79  Simplified Han. As another example, “sr” (Serbian) defaults to Cyrillic script,
    80  but “sr-RU” (Serbian as written in Russia) implies the Latin script!
    81  A similar thing can be said for Kyrgyz and other languages.
    82  
    83  If you ignore subtags, you might as well present Greek to the user.
    84  
    85  _The_best_match_might_be_a_language_not_listed_by_the_user_
    86  
    87  The most common written form of Norwegian (“nb”) looks an awful lot like Danish.
    88  If Norwegian is not available, Danish may be a good second choice.
    89  Similarly, a user requesting Swiss German (“gsw”) will likely be happy to be
    90  presented German (“de”), though the converse is far from true.
    91  A user requesting Uygur may be happier to fall back to Chinese than to English.
    92  Other examples abound.
    93  If a user-requested language is not supported, falling back to English is often
    94  not the best thing to do.
    95  
    96  _The_choice_of_language_decides_more_than_translation_
    97  
    98  Suppose a user asks for Danish, with German as a second choice.
    99  If an application chooses German, it must not only use German translations
   100  but also use German (not Danish) collation.
   101  Otherwise, for example, a list of animals might sort “Bär” before “Äffin”.
   102  
   103  Selecting a supported language given the user’s preferred languages is like a
   104  handshaking algorithm: first you determine which protocol to communicate in (the
   105  language) and then you stick with this protocol for all communication for the
   106  duration of a session.
   107  
   108  _Using_a_“parent”_of_a_language_as_fallback_is_non-trivial_
   109  
   110  Suppose your application supports Angolan Portuguese (“pt-AO”).
   111  Packages in [[http://golang.org/x/text]], like collation and display, may not
   112  have specific support for this dialect.
   113  The correct course of action in such cases is to match the closest parent dialect.
   114  Languages are arranged in a hierarchy, with each specific language having a more
   115  general parent.
   116  For example, the parent of “en-GB-oxendict” is “en-GB”, whose parent is “en”,
   117  whose parent is the undefined language “und”, also known as the root language.
   118  In the case of collation, there is no specific collation order for Portuguese,
   119  so the collate package will select the sorting order of the root language.
   120  The closest parent to Angolan Portuguese supported by the display package is
   121  European Portuguese (“pt-PT”) and not the more obvious “pt”, which implies
   122  Brazilian.
   123  
   124  In general, parent relationships are non-trivial.
   125  To give a few more examples, the parent of “es-CL” is “es-419”, the parent of
   126  “zh-TW” is “zh-Hant”, and the parent of “zh-Hant” is “und”.
   127  If you compute the parent by simply removing subtags, you may select a “dialect”
   128  that is incomprehensible to the user.
   129  
   130  * Language Matching in Go
   131  
   132  The Go package [[http://golang.org/x/text/language]] implements the BCP 47
   133  standard for language tags and adds support for deciding which language to use
   134  based on data published in the Unicode Common Locale Data Repository (CLDR).
   135  
   136  Here is a sample program, explained below, matching a user's language
   137  preferences against an application's supported languages:
   138  
   139  .code -edit matchlang/complete.go
   140  
   141  ** Creating Language Tags
   142  
   143  The simplest way to create a language.Tag from a user-given language code string
   144  is with language.Make.
   145  It extracts meaningful information even from malformed input.
   146  For example, “en-USD” will result in “en” even though USD is not a valid subtag.
   147  
   148  Make doesn’t return an error.
   149  Because it is common practice to fall back to the default language when an error
   150  occurs, this makes Make more convenient. Use Parse to handle errors explicitly.
   151  
   152  The HTTP Accept-Language header is often used to pass a user’s desired
   153  languages.
   154  The ParseAcceptLanguage function parses it into a slice of language tags,
   155  ordered by preference.
   156  
   157  By default, the language package does not canonicalize tags.
   158  For example, it does not follow the BCP 47 recommendation of eliminating a script
   159  if it is the common choice in the “overwhelming majority”.
   160  It similarly ignores CLDR recommendations: “cmn” is not replaced by “zh” and
   161  “zh-Hant-HK” is not simplified to “zh-HK”.
   162  Canonicalizing tags may throw away useful information about user intent.
   163  Canonicalization is handled in the Matcher instead.
   164  A full array of canonicalization options is available for programmers who still
   165  wish to canonicalize tags themselves.
   166  
   167  ** Matching User-Preferred Languages to Supported Languages
   168  
   169  A Matcher matches user-preferred languages to supported languages.
   170  Users are strongly advised to use it if they don’t want to deal with all the
   171  intricacies of matching languages.
   172  
   173  The Match method may pass through user settings (from BCP 47 extensions) from
   174  the preferred tags to the selected supported tag.
   175  It is therefore important that the tag returned by Match is used to obtain
   176  language-specific resources.
   177  For example, “de-u-co-phonebk” requests phone-book ordering for German.
   178  The extension is ignored for matching, but is used by the collate package to
   179  select the respective sorting order variant.
   180  
   181  A Matcher is initialized with the languages supported by an application, which
   182  are usually the languages for which there are translations.
   183  This set is typically fixed, allowing a matcher to be created at startup.
   184  Matcher is optimized to improve the performance of Match at the expense of
   185  initialization cost.
   186  
   187  The language package provides a predefined set of the most commonly used
   188  language tags that can be used for defining the supported set.
   189  Users generally don’t have to worry about the exact tags to pick for supported
   190  languages.
   191  For example, AmericanEnglish (“en-US”) may be used interchangeably with the more
   192  common English (“en”), which defaults to American.
   193  It is all the same for the Matcher. An application may even add both, allowing
   194  for more specific American slang for “en-US”.
   195  
   196  ** Matching Example
   197  
   198  Consider the following Matcher and list of supported languages:
   199  
   200  	var supported = []language.Tag{
   201  		language.AmericanEnglish,    // en-US: first language is fallback
   202  		language.German,             // de
   203  		language.Dutch,              // nl
   204  		language.Portuguese,         // pt (defaults to Brazilian)
   205  		language.EuropeanPortuguese, // pt-PT
   206  		language.Romanian,           // ro
   207  		language.Serbian,            // sr (defaults to Cyrillic script)
   208  		language.SerbianLatin,       // sr-Latn
   209  		language.SimplifiedChinese,  // zh-Hans
   210  		language.TraditionalChinese, // zh-Hant
   211  	}
   212  	var matcher = language.NewMatcher(supported)
   213  
   214  Let's look at the matches against this list of supported languages for various
   215  user preferences.
   216  
   217  For a user preference of "he" (Hebrew), the best match is "en-US" (American
   218  English).
   219  There is no good match, so the matcher uses the fallback language (the first in
   220  the supported list).
   221  
   222  For a user preference of "hr" (Croatian), the best match is "sr-Latn" (Serbian
   223  with Latin script), because, once they are written in the same script, Serbian
   224  and Croatian are mutually intelligible.
   225  
   226  For a user preference of "ru, mo" (Russian, then Moldavian), the best match is
   227  "ro" (Romanian), because Moldavian is now canonically classified as "ro-MD"
   228  (Romanian in Moldova).
   229  
   230  For a user preference of "zh-TW" (Mandarin in Taiwan), the best match is
   231  "zh-Hant" (Mandarin written in Traditional Chinese), not "zh-Hans" (Mandarin
   232  written in Simplified Chinese).
   233  
   234  For a user preference of "af, ar" (Afrikaans, then Arabic), the best match is
   235  "nl" (Dutch). Neither preference is supported directly, but Dutch is a
   236  significantly closer match to Afrikaans than the fallback language English is to
   237  either.
   238  
   239  For a user preference of "pt-AO, id" (Angolan Portuguese, then Indonesian), the
   240  best match is "pt-PT" (European Portuguese), not "pt" (Brazilian Portuguese).
   241  
   242  For a user preference of "gsw-u-co-phonebk" (Swiss German with phone-book
   243  collation order), the best match is "de-u-co-phonebk" (German with phone-book
   244  collation order).
   245  German is the best match for Swiss German in the server's language list, and the
   246  option for phone-book collation order has been carried over.
   247  
   248  ** Confidence Scores
   249  
   250  Go uses coarse-grained confidence scoring with rule-based elimination.
   251  A match is classified as Exact, High (not exact, but no known ambiguity), Low
   252  (probably the correct match, but maybe not), or No.
   253  In case of multiple matches, there is a set of tie-breaking rules that are
   254  executed in order.
   255  The first match is returned in the case of multiple equal matches.
   256  These confidence scores may be useful, for example, to reject relatively weak
   257  matches.
   258  They are also used to score, for example, the most likely region or script from
   259  a language tag.
   260  
   261  Implementations in other languages often use more fine-grained, variable-scale
   262  scoring.
   263  We found that coarse-grained scoring made the Go implementation simpler to
   264  write, more maintainable, and faster, meaning that we could
   265  handle more rules.
   266  
   267  ** Displaying Supported Languages
   268  
   269  The [[http://golang.org/x/text/language/display]] package allows naming language
   270  tags in many languages.
   271  It also contains a “Self” namer for displaying a tag in its own language.
   272  
   273  For example:
   274  
   275  .code -edit matchlang/display.go /START/,/END/
   276  
   277  prints
   278  
   279  	English              (English)
   280  	French               (français)
   281  	Dutch                (Nederlands)
   282  	Flemish              (Vlaams)
   283  	Simplified Chinese   (简体中文)
   284  	Traditional Chinese  (繁體中文)
   285  	Russian              (русский)
   286  
   287  In the second column, note the differences in capitalization, reflecting the
   288  rules of the respective language.
   289  
   290  * Conclusion
   291  
   292  At first glance, language tags look like nicely structured data, but because
   293  they describe human languages, the structure of relationships between language
   294  tags is actually quite complex.
   295  It is often tempting, especially for English-speaking programmers, to write
   296  ad-hoc language matching using nothing other than string manipulation of the
   297  language tags.
   298  As described above, this can produce awful results.
   299  
   300  Go's [[http://golang.org/x/text/language]] package solves this complex problem
   301  while still presenting a simple, easy-to-use API. Enjoy.