     1  Strings, bytes, runes and characters in Go
     2  23 Oct 2013
     3  Tags: strings, bytes, runes, characters
     4  
     5  Rob Pike
     6  
     7  * Introduction
     8  
     9  The [[http://blog.golang.org/slices][previous blog post]] explained how slices
    10  work in Go, using a number of examples to illustrate the mechanism behind
    11  their implementation.
    12  Building on that background, this post discusses strings in Go.
    13  At first, strings might seem too simple a topic for a blog post, but to use
    14  them well requires understanding not only how they work,
    15  but also the difference between a byte, a character, and a rune,
    16  the difference between Unicode and UTF-8,
    17  the difference between a string and a string literal,
    18  and other even more subtle distinctions.
    19  
    20  One way to approach this topic is to think of it as an answer to the frequently
    21  asked question, "When I index a Go string at position _n_, why don't I get the
    22  _nth_ character?"
    23  As you'll see, this question leads us to many details about how text works
    24  in the modern world.
    25  
    26  An excellent introduction to some of these issues, independent of Go,
    27  is Joel Spolsky's famous blog post,
    28  [[http://www.joelonsoftware.com/articles/Unicode.html][The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)]].
    29  Many of the points he raises will be echoed here.
    30  
    31  * What is a string?
    32  
    33  Let's start with some basics.
    34  
    35  In Go, a string is in effect a read-only slice of bytes.
    36  If you're at all uncertain about what a slice of bytes is or how it works,
    37  please read the [[http://blog.golang.org/slices][previous blog post]];
    38  we'll assume here that you have.
    39  
    40  It's important to state right up front that a string holds _arbitrary_ bytes.
    41  It is not required to hold Unicode text, UTF-8 text, or any other predefined format.
    42  As far as the content of a string is concerned, it is exactly equivalent to a
    43  slice of bytes.
    44  
    45  Here is a string literal (more about those soon) that uses the
    46  `\xNN` notation to define a string constant holding some peculiar byte values.
    47  (Of course, bytes range from hexadecimal values 00 through FF, inclusive.)
    48  
    49  .code strings/basic.go /const sample/
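
The included file is not reproduced here, so here is a minimal sketch of what such a constant might look like. It assumes one `\xNN` escape per byte, with the values taken from the byte dumps shown later in this post; the exact spelling in the original source file is an assumption.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sample is spelled with one \xNN escape per byte. The values match the
// byte dumps printed later in the post; the literal's exact spelling in
// the original file is an assumption.
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

func main() {
	// len reports the number of bytes in the string, not characters.
	fmt.Println(len(sample))
	// The string is a perfectly legal Go string even though its bytes
	// are not valid UTF-8 as a whole.
	fmt.Println(utf8.ValidString(sample))
}
```

Run as written, this should print `8` and then `false`.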
    50  
    51  * Printing strings
    52  
    53  Because some of the bytes in our sample string are not valid ASCII, not even
    54  valid UTF-8, printing the string directly will produce ugly output.
    55  The simple print statement
    56  
    57  .code strings/basic.go /println/,/println/
    58  
    59  produces this mess (whose exact appearance varies with the environment):
    60  
    61  	��=� ⌘
    62  
    63  To find out what that string really holds, we need to take it apart and examine the pieces.
    64  There are several ways to do this.
    65  The most obvious is to loop over its contents and pull out the bytes
    66  individually, as in this `for` loop:
    67  
    68  .code strings/basic.go /byte loop/,/byte loop/
    69  
    70  As implied up front, indexing a string accesses individual bytes, not
    71  characters. We'll return to that topic in detail below. For now, let's
    72  stick with just the bytes.
    73  This is the output from the byte-by-byte loop:
    74  
    75  	bd b2 3d bc 20 e2 8c 98 
    76  
    77  Notice how the individual bytes match the
    78  hexadecimal escapes that defined the string.
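
A byte-by-byte loop of this kind can be sketched as follows. The helper name `hexBytes` is hypothetical, and collecting the output into a string (rather than printing directly) is only a convenience for the sketch.

```go
package main

import "fmt"

const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// hexBytes (a hypothetical helper) indexes s one byte at a time and
// formats each byte in hexadecimal, followed by a space.
func hexBytes(s string) string {
	out := ""
	for i := 0; i < len(s); i++ {
		// s[i] is a byte, regardless of what text the bytes encode.
		out += fmt.Sprintf("%x ", s[i])
	}
	return out
}

func main() {
	fmt.Println(hexBytes(sample))
}
```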
    79  
    80  A shorter way to generate presentable output for a messy string
    81  is to use the `%x` (hexadecimal) format verb of `fmt.Printf`.
    82  It just dumps out the sequential bytes of the string as hexadecimal
    83  digits, two per byte.
    84  
    85  .code strings/basic.go /percent x/,/percent x/
    86  
    87  Compare its output to that above:
    88  
    89  	bdb23dbc20e28c98
    90  
    91  A nice trick is to use the "space" flag in that format, putting a
    92  space between the `%` and the `x`. Compare the format string
    93  used here to the one above,
    94  
    95  .code strings/basic.go /percent space x/,/percent space x/
    96  
    97  and notice how the bytes come
    98  out with spaces between, making the result a little less imposing:
    99  
   100  	bd b2 3d bc 20 e2 8c 98
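
As a sketch, the two formats differ only in the space flag. The helper names `compactHex` and `spacedHex` are hypothetical, and `fmt.Sprintf` stands in for `fmt.Printf` so the results can be compared.

```go
package main

import "fmt"

const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// compactHex formats the bytes of s as hexadecimal, two digits per byte.
func compactHex(s string) string {
	return fmt.Sprintf("%x", s)
}

// spacedHex does the same with the space flag, which separates the bytes.
func spacedHex(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	fmt.Println(compactHex(sample))
	fmt.Println(spacedHex(sample))
}
```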
   101  
   102  There's more. The `%q` (quoted) verb will escape any non-printable
   103  byte sequences in a string so the output is unambiguous.
   104  
   105  .code strings/basic.go /percent q/,/percent q/
   106  
   107  This technique is handy when much of the string is
   108  intelligible as text but there are peculiarities to root out; it produces:
   109  
   110  	"\xbd\xb2=\xbc ⌘"
   111  
   112  If we squint at that, we can see that buried in the noise is one ASCII equals sign,
   113  along with a regular space, and at the end appears the well-known Swedish "Place of Interest"
   114  symbol.
   115  That symbol has Unicode value U+2318, encoded as UTF-8 by the bytes
   116  after the space (hex value `20`): `e2` `8c` `98`.
   117  
   118  If we are unfamiliar with or confused by strange values in the string,
   119  we can use the "plus" flag to the `%q` verb. This flag causes the output to escape
   120  not only non-printable sequences, but also any non-ASCII bytes, all
   121  while interpreting UTF-8.
   122  The result is that it exposes the Unicode values of properly formatted UTF-8
   123  that represents non-ASCII data in the string:
   124  
   125  .code strings/basic.go /percent plus q/,/percent plus q/
   126  
   127  With that format, the Unicode value of the Swedish symbol shows up as a
   128  `\u` escape:
   129  
   130  	"\xbd\xb2=\xbc \u2318"
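
The two quoted forms can be compared side by side in a short sketch. The helper names `quoted` and `quotedASCII` are hypothetical.

```go
package main

import "fmt"

const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// quoted escapes only what cannot be printed, such as invalid UTF-8.
func quoted(s string) string {
	return fmt.Sprintf("%q", s)
}

// quotedASCII uses the plus flag, so valid non-ASCII code points are
// written as \u escapes as well.
func quotedASCII(s string) string {
	return fmt.Sprintf("%+q", s)
}

func main() {
	fmt.Println(quoted(sample))
	fmt.Println(quotedASCII(sample))
}
```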
   131  
   132  These printing techniques are good to know when debugging
   133  the contents of strings, and will be handy in the discussion that follows.
   134  It's worth pointing out as well that all these methods behave exactly the
   135  same for byte slices as they do for strings.
   136  
   137  Here's the full set of printing options we've listed, presented as
   138  a complete program you can run (and edit) right in the browser:
   139  
   140  .play -edit strings/basic.go /package/,/^}/
   141  
   142  [Exercise: Modify the examples above to use a slice of bytes
   143  instead of a string. Hint: Use a conversion to create the slice.]
   144  
   145  [Exercise: Loop over the string using the `%q` format on each byte.
   146  What does the output tell you?]
   147  
   148  * UTF-8 and string literals
   149  
   150  As we saw, indexing a string yields its bytes, not its characters: a string is just a
   151  bunch of bytes.
   152  That means that when we store a character value in a string,
   153  we store its byte-at-a-time representation.
   154  Let's look at a more controlled example to see how that happens.
   155  
   156  Here's a simple program that prints a string constant with a single character
   157  three different ways, once as a plain string, once as an ASCII-only quoted
   158  string, and once as individual bytes in hexadecimal.
   159  To avoid any confusion, we create a "raw string", enclosed by back quotes,
   160  so it can contain only literal text. (Regular strings, enclosed by double
   161  quotes, can contain escape sequences as we showed above.)
   162  
   163  .play -edit strings/utf8.go /^func/,/^}/
   164  
   165  The output is:
   166  
   167  	plain string: ⌘
   168  	quoted string: "\u2318"
   169  	hex bytes: e2 8c 98 
   170  
   171  which reminds us that the Unicode character value U+2318, the "Place
   172  of Interest" symbol ⌘, is represented by the bytes `e2` `8c` `98`, and
   173  that those bytes are the UTF-8 encoding of the hexadecimal
   174  value 2318.
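
The program itself is pulled in from a separate file, so here is a sketch that should produce the same three lines. The constant name `placeOfInterest` is an assumption, as is the choice of `%+q` for the ASCII-only quoted form.

```go
package main

import "fmt"

// A raw string literal: the bytes between the back quotes are taken
// verbatim from the UTF-8 source text.
const placeOfInterest = `⌘`

func main() {
	fmt.Printf("plain string: %s\n", placeOfInterest)

	// The plus flag forces ASCII-only output, exposing the code point.
	fmt.Printf("quoted string: %+q\n", placeOfInterest)

	// Index the string to get at its individual bytes.
	fmt.Printf("hex bytes: ")
	for i := 0; i < len(placeOfInterest); i++ {
		fmt.Printf("%x ", placeOfInterest[i])
	}
	fmt.Printf("\n")
}
```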
   175  
   176  It may be obvious or it may be subtle, depending on your familiarity with
   177  UTF-8, but it's worth taking a moment to explain how the UTF-8 representation
   178  of the string was created.
   179  The simple fact is: it was created when the source code was written.
   180  
   181  Source code in Go is _defined_ to be UTF-8 text; no other representation is
   182  allowed. That implies that when, in the source code, we write the text
   183  
   184  	`⌘`
   185  
   186  the text editor used to create the program places the UTF-8 encoding
   187  of the symbol ⌘ into the source text.
   188  When we print out the hexadecimal bytes, we're just dumping the
   189  data the editor placed in the file.
   190  
   191  In short, Go source code is UTF-8, so
   192  _the_source_code_for_the_string_literal_is_UTF-8_text_.
   193  If that string literal contains no escape sequences, which a raw
   194  string cannot, the constructed string will hold exactly the
   195  source text between the quotes.
   196  Thus by definition and
   197  by construction the raw string will always contain a valid UTF-8
   198  representation of its contents.
   199  Similarly, unless it contains UTF-8-breaking escapes like those
   200  from the previous section, a regular string literal will also always
   201  contain valid UTF-8.
   202  
   203  Some people think Go strings are always UTF-8, but they
   204  are not: only string literals are UTF-8.
   205  As we showed in the previous section, string _values_ can contain arbitrary
   206  bytes;
   207  as we showed in this one, string _literals_ always contain UTF-8 text
   208  as long as they have no byte-level escapes.
   209  
   210  To summarize, strings can contain arbitrary bytes, but when constructed
   211  from string literals, those bytes are (almost always) UTF-8.
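
One way to see the distinction is to ask `utf8.ValidString` about strings built in different ways. This is a sketch only; the names `clean`, `broken`, `truncated`, and `valid` are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// clean has no byte-level escapes, so it holds the UTF-8 source text.
const clean = "Place of Interest: ⌘"

// broken uses \x escapes that do not form a valid UTF-8 sequence.
const broken = "\xbd\xb2"

// truncated is a string value built at run time from the first two
// bytes of the three-byte encoding of ⌘.
var truncated = string([]byte{0xe2, 0x8c})

// valid reports whether s consists entirely of valid UTF-8.
func valid(s string) bool {
	return utf8.ValidString(s)
}

func main() {
	fmt.Println(valid(clean), valid(broken), valid(truncated))
}
```

This should print `true false false`: all three are legal Go strings, but only the first holds valid UTF-8.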
   212  
   213  * Code points, characters, and runes
   214  
   215  We've been very careful so far in how we use the words "byte" and "character".
   216  That's partly because strings hold bytes, and partly because the idea of "character"
   217  is a little hard to define.
   218  The Unicode standard uses the term "code point" to refer to the item represented
   219  by a single value.
   220  The code point U+2318, with hexadecimal value 2318, represents the symbol ⌘.
   221  (For lots more information about that code point, see
   222  [[http://unicode.org/cldr/utility/character.jsp?a=2318][its Unicode page]].)
   223  
   224  To pick a more prosaic example, the Unicode code point U+0061 is the lower
   225  case Latin letter 'A': a.
   226  
   227  But what about the lower case grave-accented letter 'A', à?
   228  That's a character, and it's also a code point (U+00E0), but it has other
   229  representations.
   230  For example we can use the "combining" grave accent code point, U+0300,
   231  and attach it to the lower case letter a, U+0061, to create the same character à.
   232  In general, a character may be represented by a number of different
   233  sequences of code points, and therefore different sequences of UTF-8 bytes.
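
To make that concrete, here is a sketch comparing the two spellings of à. The constant names are hypothetical; the code points are the ones named above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const (
	precomposed = "\u00e0"  // à as the single code point U+00E0
	combining   = "a\u0300" // a (U+0061) followed by the combining grave accent U+0300
)

func main() {
	// Both typically render as the same character, yet the bytes differ.
	fmt.Printf("% x: %d code point(s)\n", precomposed, utf8.RuneCountInString(precomposed))
	fmt.Printf("% x: %d code point(s)\n", combining, utf8.RuneCountInString(combining))

	// String comparison is byte-wise, so the two are not equal.
	fmt.Println(precomposed == combining)
}
```

The first line should show the bytes `c3 a0` with one code point, the second `61 cc 80` with two, and the comparison should print `false`.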
   234  
   235  The concept of character in computing is therefore ambiguous, or at least
   236  confusing, so we use it with care.
   237  To make things dependable, there are _normalization_ techniques that guarantee that
   238  a given character is always represented by the same code points, but that
   239  subject takes us too far off the topic for now.
   240  A later blog post will explain how the Go libraries address normalization.
   241  
   242  "Code point" is a bit of a mouthful, so Go introduces a shorter term for the
   243  concept: _rune_.
   244  The term appears in the libraries and source code, and means exactly
   245  the same as "code point", with one interesting addition.
   246  
   247  The Go language defines the word `rune` as an alias for the type `int32`, so
   248  programs can be clear when an integer value represents a code point.
   249  Moreover, what you might think of as a character constant is called a
   250  _rune_constant_ in Go.
   251  The type and value of the expression
   252  
   253  	'⌘'
   254  
   255  is `rune` with integer value `0x2318`.
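
A short sketch can confirm this with `Printf`. The variable name `symbol` is hypothetical; note that `%T` reports `int32`, since `rune` is an alias for that type.

```go
package main

import "fmt"

// symbol is initialized from a rune constant, so its type is rune.
var symbol = '⌘'

func main() {
	// %T shows the type, %#x the integer value, %U the code point,
	// and %c the character the code point represents.
	fmt.Printf("%T %#x %U %c\n", symbol, symbol, symbol, symbol)
}
```

This should print `int32 0x2318 U+2318 ⌘`.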
   256  
   257  To summarize, here are the salient points:
   258  
   259  - Go source code is always UTF-8.
   260  - A string holds arbitrary bytes.
   261  - A string literal, absent byte-level escapes, always holds valid UTF-8 sequences.
   262  - Those sequences represent Unicode code points, called runes.
   263  - No guarantee is made in Go that characters in strings are normalized.
   264  
   265  
   266  * Range loops
   267  
   268  Besides the axiomatic detail that Go source code is UTF-8,
   269  there's really only one way that Go treats UTF-8 specially, and that is when using
   270  a `for` `range` loop on a string.
   271  
   272  We've seen what happens with a regular `for` loop.
   273  A `for` `range` loop, by contrast, decodes one UTF-8-encoded rune on each
   274  iteration.
   275  Each time around the loop, the index of the loop is the starting position of the
   276  current rune, measured in bytes, and the code point is its value.
   277  Here's an example using yet another handy `Printf` format, `%#U`, which shows
   278  the code point's Unicode value and its printed representation:
   279  
   280  .play -edit strings/range.go /const/,/}/
   281  
   282  The output shows how each code point occupies multiple bytes:
   283  
   284  	U+65E5 '日' starts at byte position 0
   285  	U+672C '本' starts at byte position 3
   286  	U+8A9E '語' starts at byte position 6
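
A loop along these lines can be sketched as follows. The names `nihongo` and `describe` are assumptions; the sketch collects the lines in a slice so they are easy to inspect, where the program above simply prints them.

```go
package main

import "fmt"

const nihongo = "日本語"

// describe (a hypothetical helper) ranges over s, producing one line
// per decoded rune with the byte position where that rune starts.
func describe(s string) []string {
	var lines []string
	for index, runeValue := range s {
		lines = append(lines, fmt.Sprintf("%#U starts at byte position %d", runeValue, index))
	}
	return lines
}

func main() {
	for _, line := range describe(nihongo) {
		fmt.Println(line)
	}
}
```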
   287  
   288  [Exercise: Put an invalid UTF-8 byte sequence into the string. (How?)
   289  What happens to the iterations of the loop?]
   290  
   291  * Libraries
   292  
   293  Go's standard library provides strong support for interpreting UTF-8 text.
   294  If a `for` `range` loop isn't sufficient for your purposes,
   295  chances are the facility you need is provided by a package in the library.
   296  
   297  The most important such package is
   298  [[http://golang.org/pkg/unicode/utf8/][`unicode/utf8`]],
   299  which contains
   300  helper routines to validate, disassemble, and reassemble UTF-8 strings.
   301  Here is a program equivalent to the `for` `range` example above,
   302  but using the `DecodeRuneInString` function from that package to
   303  do the work.
   304  The return values from the function are the rune and its width in
   305  UTF-8-encoded bytes.
   306  
   307  .play -edit strings/encoding.go /const/,/}/
   308  
   309  Run it to see that it performs the same.
   310  The `for` `range` loop and `DecodeRuneInString` are defined to produce
   311  exactly the same iteration sequence.
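
For reference, a decoding loop of this shape might look like the sketch below. The helper name `decodeAll` is hypothetical, and the lines are collected in a slice rather than printed directly.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const nihongo = "日本語"

// decodeAll (a hypothetical helper) walks s by repeatedly decoding the
// rune at the front of the remaining text and advancing by its width.
func decodeAll(s string) []string {
	var lines []string
	for i := 0; i < len(s); {
		runeValue, width := utf8.DecodeRuneInString(s[i:])
		lines = append(lines, fmt.Sprintf("%#U starts at byte position %d", runeValue, i))
		i += width // width is the number of bytes the rune occupies
	}
	return lines
}

func main() {
	for _, line := range decodeAll(nihongo) {
		fmt.Println(line)
	}
}
```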
   312  
   313  Look at the
   314  [[http://golang.org/pkg/unicode/utf8/][documentation]]
   315  for the `unicode/utf8` package to see what
   316  other facilities it provides.
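
As a small taste, here is a sketch using a few other functions from that package: `RuneCountInString`, `ValidString`, `RuneLen`, and `EncodeRune`. The helper `encodeHex` is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeHex (a hypothetical helper) encodes r as UTF-8 and returns the
// resulting bytes as spaced hexadecimal.
func encodeHex(r rune) string {
	buf := make([]byte, utf8.UTFMax) // large enough for any rune
	n := utf8.EncodeRune(buf, r)
	return fmt.Sprintf("% x", buf[:n])
}

func main() {
	const nihongo = "日本語"
	fmt.Println(len(nihongo))                    // length in bytes
	fmt.Println(utf8.RuneCountInString(nihongo)) // length in runes
	fmt.Println(utf8.ValidString("\xbd\xb2"))    // is this valid UTF-8?
	fmt.Println(utf8.RuneLen('⌘'))               // bytes needed to encode the rune
	fmt.Println(encodeHex('⌘'))                  // the encoded bytes themselves
}
```

The five lines printed should be `9`, `3`, `false`, `3`, and `e2 8c 98`.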
   317  
   318  * Conclusion
   319  
   320  To answer the question posed at the beginning: Strings are built from bytes
   321  so indexing them yields bytes, not characters.
   322  A string might not even hold characters.
   323  In fact, the definition of "character" is ambiguous and it would
   324  be a mistake to try to resolve the ambiguity by defining that strings are made
   325  of characters.
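
The point can be illustrated by contrasting an index expression with a walk over the code points. This is only a sketch; `nthRune` is a hypothetical helper, and it counts code points, which (as discussed above) are not always the same thing as characters.

```go
package main

import "fmt"

const nihongo = "日本語"

// nthRune (a hypothetical helper) walks s with a range loop and returns
// the n-th code point, counting from 0. The boolean result is false if
// s holds fewer than n+1 code points.
func nthRune(s string, n int) (rune, bool) {
	for _, r := range s {
		if n == 0 {
			return r, true
		}
		n--
	}
	return 0, false
}

func main() {
	// Indexing yields a byte: here, the second byte of the encoding of 日.
	fmt.Printf("%x\n", nihongo[1])

	// Walking the code points yields the second rune instead.
	if r, ok := nthRune(nihongo, 1); ok {
		fmt.Printf("%c\n", r)
	}
}
```

Here the index expression should print `97`, while the walk prints `本`.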
   326  
   327  There's much more to say about Unicode, UTF-8, and the world of multilingual
   328  text processing, but it can wait for another post.
   329  For now, we hope you have a better understanding of how Go strings behave
   330  and that, although they may contain arbitrary bytes, UTF-8 is a central part
   331  of their design.