1 Strings, bytes, runes and characters in Go
2 23 Oct 2013
3 Tags: strings, bytes, runes, characters
4
5 Rob Pike
6
7 * Introduction
8
9 The [[http://blog.golang.org/slices][previous blog post]] explained how slices
10 work in Go, using a number of examples to illustrate the mechanism behind
11 their implementation.
12 Building on that background, this post discusses strings in Go.
13 At first, strings might seem too simple a topic for a blog post, but to use
14 them well requires understanding not only how they work,
15 but also the difference between a byte, a character, and a rune,
16 the difference between Unicode and UTF-8,
17 the difference between a string and a string literal,
18 and other even more subtle distinctions.
19
20 One way to approach this topic is to think of it as an answer to the frequently
21 asked question, "When I index a Go string at position _n_, why don't I get the
22 _nth_ character?"
23 As you'll see, this question leads us to many details about how text works
24 in the modern world.
25
26 An excellent introduction to some of these issues, independent of Go,
27 is Joel Spolsky's famous blog post,
28 [[http://www.joelonsoftware.com/articles/Unicode.html][The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)]].
29 Many of the points he raises will be echoed here.
30
31 * What is a string?
32
33 Let's start with some basics.
34
35 In Go, a string is in effect a read-only slice of bytes.
36 If you're at all uncertain about what a slice of bytes is or how it works,
37 please read the [[http://blog.golang.org/slices][previous blog post]];
38 we'll assume here that you have.
39
40 It's important to state right up front that a string holds _arbitrary_ bytes.
41 It is not required to hold Unicode text, UTF-8 text, or any other predefined format.
42 As far as the content of a string is concerned, it is exactly equivalent to a
43 slice of bytes.
44
45 Here is a string literal (more about those soon) that uses the
46 `\xNN` notation to define a string constant holding some peculiar byte values.
47 (Of course, bytes range from hexadecimal values 00 through FF, inclusive.)
48
49 .code strings/basic.go /const sample/
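
The directive above pulls the constant from `strings/basic.go`, which is not reproduced in this text. A minimal sketch that matches the byte values printed later in this post might look like the following; the exact spelling of the escapes in the original file is an assumption.

```go
package main

import "fmt"

// sample is reconstructed from the byte dump shown later in the post;
// the original file may write the escapes differently.
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

func main() {
	// len reports the number of bytes in the string, not characters.
	fmt.Println(len(sample)) // 8
}
```

Each `\xNN` escape contributes exactly one byte, so the eight escapes give a string of length 8.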
50
51 * Printing strings
52
53 Because some of the bytes in our sample string are not valid ASCII, not even
54 valid UTF-8, printing the string directly will produce ugly output.
55 The simple print statement
56
57 .code strings/basic.go /println/,/println/
58
59 produces this mess (whose exact appearance varies with the environment):
60
61 ��=� ⌘
62
63 To find out what that string really holds, we need to take it apart and examine the pieces.
64 There are several ways to do this.
65 The most obvious is to loop over its contents and pull out the bytes
66 individually, as in this `for` loop:
67
68 .code strings/basic.go /byte loop/,/byte loop/
69
70 As implied up front, indexing a string accesses individual bytes, not
71 characters. We'll return to that topic in detail below. For now, let's
72 stick with just the bytes.
73 This is the output from the byte-by-byte loop:
74
75 bd b2 3d bc 20 e2 8c 98
76
77 Notice how the individual bytes match the
78 hexadecimal escapes that defined the string.
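
For readers without the referenced file at hand, here is a sketch of what such a byte loop could look like. The helper name `byteLoop` is ours, and `sample` is reconstructed from the output above rather than copied from the original program.

```go
package main

import (
	"fmt"
	"strings"
)

// sample is reconstructed from the byte dump in the post (an assumption).
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// byteLoop formats every byte of s in hexadecimal, each followed by a space.
func byteLoop(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		// s[i] is a single byte, whatever encoding the string may hold.
		fmt.Fprintf(&b, "%x ", s[i])
	}
	return b.String()
}

func main() {
	fmt.Println(byteLoop(sample))
}
```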
79
80 A shorter way to generate presentable output for a messy string
81 is to use the `%x` (hexadecimal) format verb of `fmt.Printf`.
82 It just dumps out the sequential bytes of the string as hexadecimal
83 digits, two per byte.
84
85 .code strings/basic.go /percent x/,/percent x/
86
87 Compare its output to that above:
88
89 bdb23dbc20e28c98
90
91 A nice trick is to use the "space" flag in that format, putting a
92 space between the `%` and the `x`. Compare the format string
93 used here to the one above,
94
95 .code strings/basic.go /percent space x/,/percent space x/
96
97 and notice how the bytes come
98 out with spaces between, making the result a little less imposing:
99
100 bd b2 3d bc 20 e2 8c 98
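
The two format strings can be compared side by side in a short sketch. The helper `hexDump` is a hypothetical name, and `sample` is again reconstructed from the printed bytes.

```go
package main

import "fmt"

// sample is reconstructed from the byte dump in the post (an assumption).
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// hexDump returns the bytes of s as hex digits, optionally separated by spaces.
func hexDump(s string, spaced bool) string {
	if spaced {
		return fmt.Sprintf("% x", s) // "space" flag between % and x
	}
	return fmt.Sprintf("%x", s)
}

func main() {
	fmt.Println(hexDump(sample, false)) // bdb23dbc20e28c98
	fmt.Println(hexDump(sample, true))  // bd b2 3d bc 20 e2 8c 98
}
```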
101
102 There's more. The `%q` (quoted) verb will escape any non-printable
103 byte sequences in a string so the output is unambiguous.
104
105 .code strings/basic.go /percent q/,/percent q/
106
107 This technique is handy when much of the string is
108 intelligible as text but there are peculiarities to root out; it produces:
109
110 "\xbd\xb2=\xbc ⌘"
111
112 If we squint at that, we can see that buried in the noise is one ASCII equals sign,
113 along with a regular space, and at the end appears the well-known Swedish "Place of Interest"
114 symbol.
115 That symbol has Unicode value U+2318, encoded as UTF-8 by the bytes
116 after the space (hex value `20`): `e2` `8c` `98`.
117
118 If we are unfamiliar with or confused by strange values in the string,
119 we can use the "plus" flag to the `%q` verb. This flag causes the output to escape
120 not only non-printable sequences, but also any non-ASCII bytes, all
121 while interpreting UTF-8.
122 The result is that it exposes the Unicode values of properly formatted UTF-8
123 that represents non-ASCII data in the string:
124
125 .code strings/basic.go /percent plus q/,/percent plus q/
126
127 With that format, the Unicode value of the Swedish symbol shows up as a
128 `\u` escape:
129
130 "\xbd\xb2=\xbc \u2318"
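
Here is a small sketch that puts the two quoting verbs next to each other. The function names are ours, and `sample` is reconstructed from the output shown earlier.

```go
package main

import "fmt"

// sample is reconstructed from the byte dump in the post (an assumption).
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// quoted escapes only the byte sequences that cannot be printed as text.
func quoted(s string) string { return fmt.Sprintf("%q", s) }

// asciiQuoted also escapes valid non-ASCII text, showing its Unicode value.
func asciiQuoted(s string) string { return fmt.Sprintf("%+q", s) }

func main() {
	fmt.Println(quoted(sample))      // "\xbd\xb2=\xbc ⌘"
	fmt.Println(asciiQuoted(sample)) // "\xbd\xb2=\xbc \u2318"
}
```

The stray bytes stay as `\x` escapes in both cases because they are not part of any valid UTF-8 sequence; only the final three bytes decode to a code point.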
131
132 These printing techniques are good to know when debugging
133 the contents of strings, and will be handy in the discussion that follows.
134 It's worth pointing out as well that all these methods behave exactly the
135 same for byte slices as they do for strings.
136
137 Here's the full set of printing options we've listed, presented as
138 a complete program you can run (and edit) right in the browser:
139
140 .play -edit strings/basic.go /package/,/^}/
141
142 [Exercise: Modify the examples above to use a slice of bytes
143 instead of a string. Hint: Use a conversion to create the slice.]
144
145 [Exercise: Loop over the string using the `%q` format on each byte.
146 What does the output tell you?]
147
148 * UTF-8 and string literals
149
150 As we saw, indexing a string yields its bytes, not its characters: a string is just a
151 bunch of bytes.
152 That means that when we store a character value in a string,
153 we store its byte-at-a-time representation.
154 Let's look at a more controlled example to see how that happens.
155
156 Here's a simple program that prints a string constant with a single character
157 three different ways, once as a plain string, once as an ASCII-only quoted
158 string, and once as individual bytes in hexadecimal.
159 To avoid any confusion, we create a "raw string", enclosed by back quotes,
160 so it can contain only literal text. (Regular strings, enclosed by double
161 quotes, can contain escape sequences as we showed above.)
162
163 .play -edit strings/utf8.go /^func/,/^}/
164
165 The output is:
166
167 plain string: ⌘
168 quoted string: "\u2318"
169 hex bytes: e2 8c 98
170
171 which reminds us that the Unicode character value U+2318, the "Place
172 of Interest" symbol ⌘, is represented by the bytes `e2` `8c` `98`, and
173 that those bytes are the UTF-8 encoding of the hexadecimal
174 value 2318.
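
Since the program itself is only referenced by a directive here, the following is a sketch of one way to produce that output. The helper names are hypothetical; only the raw string literal and the three formats come from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// A raw string literal: it holds exactly the source text between the back quotes.
const placeOfInterest = `⌘`

// asciiQuoted returns s quoted with every non-ASCII code point escaped.
func asciiQuoted(s string) string { return fmt.Sprintf("%+q", s) }

// hexBytes returns each byte of s in hexadecimal, preceded by a space.
func hexBytes(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		fmt.Fprintf(&b, " %x", s[i])
	}
	return b.String()
}

func main() {
	fmt.Printf("plain string: %s\n", placeOfInterest)
	fmt.Printf("quoted string: %s\n", asciiQuoted(placeOfInterest))
	fmt.Printf("hex bytes:%s\n", hexBytes(placeOfInterest))
}
```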
175
176 It may be obvious or it may be subtle, depending on your familiarity with
177 UTF-8, but it's worth taking a moment to explain how the UTF-8 representation
178 of the string was created.
179 The simple fact is: it was created when the source code was written.
180
181 Source code in Go is _defined_ to be UTF-8 text; no other representation is
182 allowed. That implies that when, in the source code, we write the text
183
184 `⌘`
185
186 the text editor used to create the program places the UTF-8 encoding
187 of the symbol ⌘ into the source text.
188 When we print out the hexadecimal bytes, we're just dumping the
189 data the editor placed in the file.
190
191 In short, Go source code is UTF-8, so
192 _the_source_code_for_the_string_literal_is_UTF-8_text_.
193 If that string literal contains no escape sequences, which a raw
194 string cannot, the constructed string will hold exactly the
195 source text between the quotes.
196 Thus by definition and
197 by construction the raw string will always contain a valid UTF-8
198 representation of its contents.
199 Similarly, unless it contains UTF-8-breaking escapes like those
200 from the previous section, a regular string literal will also always
201 contain valid UTF-8.
202
203 Some people think Go strings are always UTF-8, but they
204 are not: only string literals are UTF-8.
205 As we showed in the previous section, string _values_ can contain arbitrary
206 bytes;
207 as we showed in this one, string _literals_ always contain UTF-8 text
208 as long as they have no byte-level escapes.
209
210 To summarize, strings can contain arbitrary bytes, but when constructed
211 from string literals, those bytes are (almost always) UTF-8.
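
The difference can be checked mechanically with `utf8.ValidString` from the `unicode/utf8` package. This is a sketch: the wrapper `isUTF8` is a name we made up, and `sample` is reconstructed from the bytes printed in the previous section.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const (
	sample          = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98" // built from byte-level escapes
	placeOfInterest = `⌘`                                // raw literal, no escapes possible
)

// isUTF8 reports whether s consists entirely of valid UTF-8 sequences.
func isUTF8(s string) bool { return utf8.ValidString(s) }

func main() {
	fmt.Println(isUTF8(sample))          // false: \xbd cannot start a sequence
	fmt.Println(isUTF8(placeOfInterest)) // true
}
```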
212
213 * Code points, characters, and runes
214
215 We've been very careful so far in how we use the words "byte" and "character".
216 That's partly because strings hold bytes, and partly because the idea of "character"
217 is a little hard to define.
218 The Unicode standard uses the term "code point" to refer to the item represented
219 by a single value.
220 The code point U+2318, with hexadecimal value 2318, represents the symbol ⌘.
221 (For lots more information about that code point, see
222 [[http://unicode.org/cldr/utility/character.jsp?a=2318][its Unicode page]].)
223
224 To pick a more prosaic example, the Unicode code point U+0061 is the lower
225 case Latin letter 'A': a.
226
227 But what about the lower case grave-accented letter 'A', à?
228 That's a character, and it's also a code point (U+00E0), but it has other
229 representations.
230 For example we can use the "combining" grave accent code point, U+0300,
231 and attach it to the lower case letter a, U+0061, to create the same character à.
232 In general, a character may be represented by a number of different
233 sequences of code points, and therefore different sequences of UTF-8 bytes.
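
A short sketch makes the two representations of à concrete. The constant and function names are ours; the code points are the ones named above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const (
	precomposed = "\u00e0"  // à as the single code point U+00E0
	decomposed  = "a\u0300" // a (U+0061) followed by combining grave accent U+0300
)

// describe shows the bytes of s with its byte and code point counts.
func describe(s string) string {
	return fmt.Sprintf("% x (%d bytes, %d code points)",
		s, len(s), utf8.RuneCountInString(s))
}

func main() {
	fmt.Println(describe(precomposed))     // c3 a0 (2 bytes, 1 code points)
	fmt.Println(describe(decomposed))      // 61 cc 80 (3 bytes, 2 code points)
	fmt.Println(precomposed == decomposed) // false
}
```

Both strings are meant to display as the same character, yet they compare unequal because string comparison works on bytes.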
234
235 The concept of character in computing is therefore ambiguous, or at least
236 confusing, so we use it with care.
237 To make things dependable, there are _normalization_ techniques that guarantee that
238 a given character is always represented by the same code points, but that
239 subject takes us too far off the topic for now.
240 A later blog post will explain how the Go libraries address normalization.
241
242 "Code point" is a bit of a mouthful, so Go introduces a shorter term for the
243 concept: _rune_.
244 The term appears in the libraries and source code, and means exactly
245 the same as "code point", with one interesting addition.
246
247 The Go language defines the word `rune` as an alias for the type `int32`, so
248 programs can be clear when an integer value represents a code point.
249 Moreover, what you might think of as a character constant is called a
250 _rune_constant_ in Go.
251 The type and value of the expression
252
253 '⌘'
254
255 is `rune` with integer value `0x2318`.
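
This is easy to confirm with `Printf`. In the sketch below, `describeRune` is a hypothetical helper; note that `%T` prints `int32`, because `rune` is only an alias for that type.

```go
package main

import "fmt"

// describeRune reports the type, hexadecimal value, and glyph of r.
func describeRune(r rune) string {
	return fmt.Sprintf("%T %#x %c", r, r, r)
}

func main() {
	fmt.Println(describeRune('⌘')) // int32 0x2318 ⌘
}
```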
256
257 To summarize, here are the salient points:
258
259 - Go source code is always UTF-8.
260 - A string holds arbitrary bytes.
261 - A string literal, absent byte-level escapes, always holds valid UTF-8 sequences.
262 - Those sequences represent Unicode code points, called runes.
263 - No guarantee is made in Go that characters in strings are normalized.
264
265
266 * Range loops
267
268 Besides the axiomatic detail that Go source code is UTF-8,
269 there's really only one way that Go treats UTF-8 specially, and that is when using
270 a `for` `range` loop on a string.
271
272 We've seen what happens with a regular `for` loop.
273 A `for` `range` loop, by contrast, decodes one UTF-8-encoded rune on each
274 iteration.
275 Each time around the loop, the index of the loop is the starting position of the
276 current rune, measured in bytes, and the code point is its value.
277 Here's an example using yet another handy `Printf` format, `%#U`, which shows
278 the code point's Unicode value and its printed representation:
279
280 .play -edit strings/range.go /const/,/}/
281
282 The output shows how each code point occupies multiple bytes:
283
284 U+65E5 '日' starts at byte position 0
285 U+672C '本' starts at byte position 3
286 U+8A9E '語' starts at byte position 6
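
For reference, a loop along these lines would produce that output. This is a sketch: the constant name `nihongo` and the helper `rangeLines` are assumptions, while the string and the format come from the output above.

```go
package main

import (
	"fmt"
	"strings"
)

const nihongo = "日本語"

// rangeLines walks s with a for range loop, one decoded rune per iteration.
func rangeLines(s string) string {
	var b strings.Builder
	for index, runeValue := range s {
		// index is the byte offset at which the rune starts.
		fmt.Fprintf(&b, "%#U starts at byte position %d\n", runeValue, index)
	}
	return b.String()
}

func main() {
	fmt.Print(rangeLines(nihongo))
}
```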
287
288 [Exercise: Put an invalid UTF-8 byte sequence into the string. (How?)
289 What happens to the iterations of the loop?]
290
291 * Libraries
292
293 Go's standard library provides strong support for interpreting UTF-8 text.
294 If a `for` `range` loop isn't sufficient for your purposes,
295 chances are the facility you need is provided by a package in the library.
296
297 The most important such package is
298 [[http://golang.org/pkg/unicode/utf8/][`unicode/utf8`]],
299 which contains
300 helper routines to validate, disassemble, and reassemble UTF-8 strings.
301 Here is a program equivalent to the `for` `range` example above,
302 but using the `DecodeRuneInString` function from that package to
303 do the work.
304 The return values from the function are the rune and its width in
305 UTF-8-encoded bytes.
306
307 .play -edit strings/encoding.go /const/,/}/
308
309 Run it to see that it performs the same.
310 The `for` `range` loop and `DecodeRuneInString` are defined to produce
311 exactly the same iteration sequence.
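
Here is a sketch of such a loop written by hand. The helper `decodeLines` and the constant name are ours; `utf8.DecodeRuneInString` is the library function described above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

const nihongo = "日本語"

// decodeLines walks s one rune at a time using utf8.DecodeRuneInString.
func decodeLines(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); {
		// Decode the rune that starts at byte offset i.
		runeValue, width := utf8.DecodeRuneInString(s[i:])
		fmt.Fprintf(&b, "%#U starts at byte position %d\n", runeValue, i)
		i += width // advance by the encoded width of the rune
	}
	return b.String()
}

func main() {
	fmt.Print(decodeLines(nihongo))
}
```

The loop has to advance by `width` itself, which is the bookkeeping a `for` `range` loop does for us.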
312
313 Look at the
314 [[http://golang.org/pkg/unicode/utf8/][documentation]]
315 for the `unicode/utf8` package to see what
316 other facilities it provides.
317
318 * Conclusion
319
320 To answer the question posed at the beginning: Strings are built from bytes
321 so indexing them yields bytes, not characters.
322 A string might not even hold characters.
323 In fact, the definition of "character" is ambiguous and it would
324 be a mistake to try to resolve the ambiguity by defining that strings are made
325 of characters.
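
To make the contrast concrete, here is a sketch comparing plain indexing with one common way of reaching the _nth_ code point, a conversion to `[]rune`. The conversion is not covered elsewhere in this post, the helper names are ours, and a code point is still not necessarily a full character.

```go
package main

import "fmt"

const nihongo = "日本語"

// nthByte returns the byte at index n of s.
func nthByte(s string, n int) byte { return s[n] }

// nthRune returns the n-th code point of s, decoding it as UTF-8.
// The conversion copies the whole string, so this is a sketch, not a fast path.
func nthRune(s string, n int) rune { return []rune(s)[n] }

func main() {
	fmt.Printf("%x\n", nthByte(nihongo, 1)) // 97, the second byte of 日
	fmt.Printf("%c\n", nthRune(nihongo, 1)) // 本
}
```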
326
327 There's much more to say about Unicode, UTF-8, and the world of multilingual
328 text processing, but it can wait for another post.
329 For now, we hope you have a better understanding of how Go strings behave
330 and that, although they may contain arbitrary bytes, UTF-8 is a central part
331 of their design.