github.com/cycloidio/terraform@v1.1.10-0.20220513142504-76d5c768dc63/docs/unicode.md

# How Terraform Uses Unicode

The Terraform language uses the Unicode standards as the basis of various
different features. The Unicode Consortium publishes new versions of those
standards periodically, and we aim to adopt those new versions in new
minor releases of Terraform in order to support additional characters added
in those new versions.

Unfortunately, because those features rely on a number of external libraries,
adopting a new version of Unicode is not as simple as just updating a version
number somewhere. This document describes the steps required to adopt a new
version of Unicode in Terraform.

We typically aim to be consistent across all of these dependencies as to which
major version of Unicode we currently conform to. The usual initial driver
for a Unicode upgrade is switching to a new version of the Go runtime library
that itself uses a new version of Unicode, because Go does not provide
any way to select Unicode versions independently from Go versions. Therefore
we typically upgrade to a new Unicode version only in conjunction with
upgrading to a new Go version.

## Unicode tables in the Go standard library

Several Terraform language features are implemented in terms of functions in
[the Go `strings` package](https://pkg.go.dev/strings),
[the Go `unicode` package](https://pkg.go.dev/unicode), and other supporting
packages in the Go standard library.

The Go team maintains the Go standard library features to support a particular
Unicode version for each Go version. The specific Unicode version for a
particular Go version is available in
[`unicode.Version`](https://pkg.go.dev/unicode#Version).

We adopt a new version of Go by editing the `.go-version` file in the root
of this repository. Although it's typically possible to build Terraform with
other versions of Go, that file documents the version we intend to use for
official releases and thus the primary version we use for development and
testing. Adopting a new Go version typically also implies other behavior
changes inherited from the Go standard library, so it's important to review the
relevant version changelog(s) to note any behavior changes we'll need to pass
on to our own users via the Terraform changelog.

The other subsystems described below should always be set up to match
`unicode.Version`. In some cases those libraries automatically try to align
themselves with `unicode.Version` and generate an error if they cannot, but
that isn't true of all of them.

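For libraries that don't check this themselves, one option is a small check of
our own. The following is a minimal sketch, not something Terraform is known
to do today: `tablesUnicodeMajor`, `majorOf`, and `checkAligned` are
hypothetical names, and the value `13` is only an example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tablesUnicodeMajor is a hypothetical constant standing in for the Unicode
// major version that a dependency's generated tables were built from.
const tablesUnicodeMajor = "13"

// majorOf returns the leading component of a dotted version string,
// so "13.0.0" becomes "13".
func majorOf(version string) string {
	return strings.SplitN(version, ".", 2)[0]
}

// checkAligned returns an error when the tables' major version differs from
// the major version of the given Go standard library Unicode version.
func checkAligned(tablesMajor, goUnicodeVersion string) error {
	if majorOf(goUnicodeVersion) != tablesMajor {
		return fmt.Errorf(
			"tables are for Unicode %s but Go uses Unicode %s",
			tablesMajor, goUnicodeVersion,
		)
	}
	return nil
}

func main() {
	if err := checkAligned(tablesUnicodeMajor, unicode.Version); err != nil {
		fmt.Println("mismatch:", err)
		return
	}
	fmt.Println("aligned on Unicode", unicode.Version)
}
```

A check like this could run as a unit test so that a Go upgrade that changes
`unicode.Version` fails loudly rather than silently mixing Unicode versions.
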
## Unicode Text Segmentation

_Text Segmentation_ (TR29) is a Unicode standards annex which describes
algorithms for breaking strings into smaller units such as sentences, words,
and grapheme clusters.

Several Terraform language features make use of the _grapheme cluster_
algorithm in particular, because it provides a practical definition of
individual visible characters, taking into account combining sequences such
as Latin letters with separate diacritics or Emoji characters with gender
presentation and skin tone modifiers.

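To illustrate why counting code points is not enough, the following sketch
uses only the Go standard library to show that a Latin letter with a separate
diacritic occupies two code points even though it is displayed as a single
character. `counts` is a hypothetical helper introduced for this example. The
standard library alone cannot count grapheme clusters, which is why a separate
segmentation implementation is needed.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the number of UTF-8 bytes and the number of Unicode code
// points (runes) in s. Neither number is a count of visible characters.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// "e" followed by U+0301 COMBINING ACUTE ACCENT is displayed as one
	// character, but it is encoded as two code points (three bytes).
	decomposed := "e\u0301"
	b, r := counts(decomposed)
	fmt.Printf("bytes=%d runes=%d\n", b, r)
}
```

For this input the helper reports three bytes and two runes, whereas a
grapheme cluster count would treat the sequence as one unit.
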
The text segmentation algorithms rely on supplementary data tables that are
not part of the core set encoded in the Go standard library's `unicode`
packages, and so instead we rely on the third-party module
[`github.com/apparentlymart/go-textseg`](http://pkg.go.dev/github.com/apparentlymart/go-textseg)
to provide those tables and a Go implementation of the grapheme cluster
segmentation algorithm in terms of the tables.

The `go-textseg` library is designed to allow calling programs to potentially
support multiple Unicode versions at once, by offering a separate module major
version for each Unicode major version. For example, the full module path for
the Unicode 13 implementation is `github.com/apparentlymart/go-textseg/v13`.

If that external library doesn't yet support the Unicode version we intend to
adopt, then we'll first need to open a pull request to contribute support for
that version. The details of how to do this will unfortunately vary
depending on how significantly the Text Segmentation annex has changed since
the most recently supported Unicode version, but in many cases it can be
just a matter of editing that library's `make_tables.go`, `make_test_tables.go`,
and `generate.go` files to point to the URLs where the Unicode Consortium
published the new tables and then running `go generate` to rebuild the files
derived from those data sources. As long as the new Unicode version has only
changed the data tables and not also changed the algorithm, often no further
changes are needed.

Once a new Unicode version is included, the maintainer of that library will
typically publish a new major version that we can depend on. Two different
codebases included in Terraform both depend directly on the `go-textseg` module
for parts of their functionality:

* [`hashicorp/hcl`](https://github.com/hashicorp/hcl) uses text
  segmentation as part of producing visual column offsets in source ranges
  returned by the tokenizer and parser. Terraform in turn uses that library
  for the underlying syntax of the Terraform language, and so it passes on
  those source ranges to the end-user as part of diagnostic messages.
* The third-party module [`github.com/zclconf/go-cty`](https://github.com/zclconf/go-cty)
  provides several of the Terraform language built-in functions, including
  functions like `substr` and `length` which need to count grapheme clusters
  as part of their implementation.

As part of upgrading Terraform's Unicode support we therefore typically also
open pull requests against these other codebases, and then adopt the new
versions that result. Terraform work often drives the adoption of new Unicode
versions in those codebases, with other dependencies following along when they
next upgrade.

At the time of writing Terraform itself doesn't _directly_ depend on
`go-textseg`, and so there are no specific changes required in this Terraform
codebase aside from the `go.sum` file update that always follows from
changes to transitive dependencies.

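The `go-textseg` library also has a separate "auto-version" mechanism which
selects an appropriate module version based on the current Go language version,
but neither HCL nor cty uses it. The auto-version package will not
compile for any Go version that doesn't have a corresponding Unicode version
explicitly recorded in that repository, which would be too harsh a
constraint for libraries like HCL: they have many callers, many of which don't
care strongly about Unicode support and may wish to upgrade Go before the
text segmentation library has been updated.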
   114  explicitly recorded in that repository, and so that would be too harsh a
   115  constraint for libraries like HCL which have many callers, many of which don't
   116  care strongly about Unicode support, that may wish to upgrade Go before the
   117  text segmentation library has been updated.