     1  <!--
     2  Copyright 2011 The Go Authors.  All rights reserved.
     3  Use of this source code is governed by a BSD-style
     4  license that can be found in the LICENSE file.
     5  -->
     6  
     7  <codewalk title="Generating arbitrary text: a Markov chain algorithm">
     8  
     9  <step title="Introduction" src="doc/codewalk/markov.go:/Generating/,/line\./">
    10  	This codewalk describes a program that generates random text using
    11  	a Markov chain algorithm. The package comment describes the algorithm
    12  	and the operation of the program. Please read it before continuing.
    13  </step>
    14  
    15  <step title="Modeling Markov chains" src="doc/codewalk/markov.go:/	chain/">
    16  	A chain consists of a prefix and a suffix. Each prefix is a set
    17  	number of words, while a suffix is a single word.
    18  	A prefix can have an arbitrary number of suffixes.
    19  	To model this data, we use a <code>map[string][]string</code>.
    20  	Each map key is a prefix (a <code>string</code>) and its values are
    21  	lists of suffixes (a slice of strings, <code>[]string</code>).
    22  	<br/><br/>
    23  	Here is the example table from the package comment
    24  	as modeled by this data structure:
    25  	<pre>
    26  map[string][]string{
    27  	" ":          {"I"},
    28  	" I":         {"am"},
    29  	"I am":       {"a", "not"},
    30  	"a free":     {"man!"},
    31  	"am a":       {"free"},
    32  	"am not":     {"a"},
    33  	"a number!":  {"I"},
    34  	"number! I":  {"am"},
    35  	"not a":      {"number!"},
    36  }</pre>
    37  	While each prefix consists of multiple words, we
    38  	store prefixes in the map as a single <code>string</code>.
    39  	It would seem more natural to store the prefix as a
    40  	<code>[]string</code>, but we can't do this with a map because the
    41  	key type of a map must implement equality (and slices do not).
    42  	<br/><br/>
    43  	Therefore, in most of our code we will model prefixes as a
    44  	<code>[]string</code> and join the strings together with a space
    45  	to generate the map key:
    46  	<pre>
    47  Prefix               Map key
    48  
    49  []string{"", ""}     " "
    50  []string{"", "I"}    " I"
    51  []string{"I", "am"}  "I am"
    52  </pre>
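	As a minimal sketch of this mapping (the helper name <code>key</code>
	is hypothetical and used only for illustration), joining a prefix
	slice with spaces yields the keys shown in the table above:

```go
package main

import (
	"fmt"
	"strings"
)

// key joins the words of a prefix with single spaces to form a map key.
// The helper name is hypothetical, chosen only for this sketch.
func key(prefix []string) string {
	return strings.Join(prefix, " ")
}

func main() {
	chain := map[string][]string{}
	chain[key([]string{"", ""})] = []string{"I"}
	chain[key([]string{"", "I"})] = []string{"am"}
	chain[key([]string{"I", "am"})] = []string{"a", "not"}

	fmt.Printf("%q\n", key([]string{"", ""})) // prints " "
	fmt.Println(chain["I am"])                 // prints [a not]
}
```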
    53  </step>
    54  
    55  <step title="The Chain struct" src="doc/codewalk/markov.go:/type Chain/,/}/">
    56  	The complete state of the chain table consists of the table itself and
    57  	the word length of the prefixes. The <code>Chain</code> struct stores
    58  	this data.
    59  </step>
    60  
    61  <step title="The NewChain constructor function" src="doc/codewalk/markov.go:/func New/,/\n}/">
    62  	The <code>Chain</code> struct has two unexported fields (those that
    63  	do not begin with an upper case character), and so we write a
    64  	<code>NewChain</code> constructor function that initializes the
    65  	<code>chain</code> map with <code>make</code> and sets the
    66  	<code>prefixLen</code> field.
    67  	<br/><br/>
    68  	This constructor function is not strictly necessary as this entire
    69  	program is within a single package (<code>main</code>) and therefore
    70  	there is little practical difference between exported and unexported
    71  	fields. We could just as easily write out the contents of this function
    72  	when we want to construct a new Chain.
    73  	But using these unexported fields is good practice; it clearly denotes
    74  	that only methods of Chain and its constructor function should access
    75  	those fields. Also, structuring <code>Chain</code> like this means we
    76  	could easily move it into its own package at some later date.
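	<br/><br/>
	The following sketch reconstructs the struct and its constructor from
	the description above. The field names come from the text; the body of
	<code>NewChain</code> is one plausible way to write it and may differ
	in detail from the source shown alongside this step.

```go
package main

import "fmt"

// Chain holds the prefix-to-suffixes table and the prefix length in words.
type Chain struct {
	chain     map[string][]string
	prefixLen int
}

// NewChain returns a Chain whose prefixes are prefixLen words long.
func NewChain(prefixLen int) *Chain {
	c := new(Chain)
	c.chain = make(map[string][]string) // assigning into a nil map would panic
	c.prefixLen = prefixLen
	return c
}

func main() {
	c := NewChain(2)
	c.chain["I am"] = append(c.chain["I am"], "a")
	fmt.Println(c.prefixLen, len(c.chain)) // prints 2 1
}
```

	Without the call to <code>make</code>, the first write to the map
	would panic, which is one practical reason to route construction
	through a single function.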
    77  </step>
    78  
    79  <step title="The Prefix type" src="doc/codewalk/markov.go:/type Prefix/">
    80  	Since we'll be working with prefixes often, we define a
    81  	<code>Prefix</code> type with the underlying type <code>[]string</code>.
    82  	Defining a named type lets us be explicit about when we are
    83  	working with a prefix rather than just a <code>[]string</code>.
    84  	Also, in Go we can define methods on any named type (not just structs),
    85  	so we can add methods that operate on <code>Prefix</code> if we need to.
    86  </step>
    87  
    88  <step title="The String method" src="doc/codewalk/markov.go:/func[^\n]+String/,/}/">
    89  	The first method we define on <code>Prefix</code> is
    90  	<code>String</code>. It returns a <code>string</code> representation
    91  	of a <code>Prefix</code> by joining the slice elements together with
    92  	spaces. We will use this method to generate keys when working with
    93  	the chain map.
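	<br/><br/>
	A minimal, self-contained sketch of the type and this method follows.
	Because the method is named <code>String</code>, the type also
	satisfies <code>fmt.Stringer</code>, so the <code>fmt</code> package
	uses it when printing a <code>Prefix</code>.

```go
package main

import (
	"fmt"
	"strings"
)

// Prefix is a Markov chain prefix of one or more words.
type Prefix []string

// String returns the Prefix as a single string, usable as a map key.
func (p Prefix) String() string {
	return strings.Join(p, " ")
}

func main() {
	p := Prefix{"I", "am"}
	fmt.Println(p.String()) // prints I am
	// Prefix satisfies fmt.Stringer, so the %v verb calls String as well.
	fmt.Printf("%v\n", p) // prints I am
}
```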
    94  </step>
    95  
    96  <step title="Building the chain" src="doc/codewalk/markov.go:/func[^\n]+Build/,/\n}/">
    97  	The <code>Build</code> method reads text from an <code>io.Reader</code>
    98  	and parses it into prefixes and suffixes that are stored in the
    99  	<code>Chain</code>.
   100  	<br/><br/>
   101  	The <code><a href="/pkg/io/#Reader">io.Reader</a></code> is an
   102  	interface type that is widely used by the standard library and
   103  	other Go code. Our code uses the
   104  	<code><a href="/pkg/fmt/#Fscan">fmt.Fscan</a></code> function, which
   105  	reads space-separated values from an <code>io.Reader</code>.
   106  	<br/><br/>
   107  	The <code>Build</code> method returns once the <code>Reader</code>'s
   108  	<code>Read</code> method returns <code>io.EOF</code> (end of file)
   109  	or some other read error occurs.
   110  </step>
   111  
   112  <step title="Buffering the input" src="doc/codewalk/markov.go:/bufio\.NewReader/">
   113  	This function does many small reads, which can be inefficient for some
   114  	<code>Readers</code>. For efficiency we wrap the provided
   115  	<code>io.Reader</code> with
   116  	<code><a href="/pkg/bufio/">bufio.NewReader</a></code> to create a
   117  	new <code>io.Reader</code> that provides buffering.
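	<br/><br/>
	To see why buffering helps, the sketch below counts calls to the
	underlying <code>Read</code> method with and without the
	<code>bufio</code> wrapper. The <code>countingReader</code> type and
	the <code>readCalls</code> function are hypothetical helpers written
	for this illustration; exact counts depend on the input and on the
	<code>fmt</code> implementation, so none are shown.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// countingReader is a hypothetical helper that counts calls to Read.
type countingReader struct {
	r     io.Reader
	calls int
}

func (c *countingReader) Read(b []byte) (int, error) {
	c.calls++
	return c.r.Read(b)
}

// readCalls scans every word from text and reports how many times the
// underlying Read method was called, with or without a bufio wrapper.
func readCalls(text string, buffered bool) int {
	src := new(countingReader)
	src.r = strings.NewReader(text)
	var r io.Reader = src
	if buffered {
		r = bufio.NewReader(src)
	}
	word := new(string) // Fscan stores each word through this pointer
	for {
		if _, err := fmt.Fscan(r, word); err != nil {
			break
		}
	}
	return src.calls
}

func main() {
	text := "I am not a number! I am a free man!"
	fmt.Println("unbuffered Read calls:", readCalls(text, false))
	fmt.Println("buffered Read calls:  ", readCalls(text, true))
}
```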
   118  </step>
   119  
   120  <step title="The Prefix variable" src="doc/codewalk/markov.go:/make\(Prefix/">
   121  	At the top of the function we make a <code>Prefix</code> slice
   122  	<code>p</code> using the <code>Chain</code>'s <code>prefixLen</code>
   123  	field as its length.
   124  	We'll use this variable to hold the current prefix and mutate it with
   125  	each new word we encounter.
   126  </step>
   127  
   128  <step title="Scanning words" src="doc/codewalk/markov.go:/var s string/,/\n		}/">
   129  	In our loop we read words from the <code>Reader</code> into a
   130  	<code>string</code> variable <code>s</code> using
   131  	<code>fmt.Fscan</code>. Since <code>Fscan</code> uses space to
   132  	separate each input value, each call will yield just one word
   133  	(including punctuation), which is exactly what we need.
   134  	<br/><br/>
   135  	<code>Fscan</code> returns an error if it encounters a read error
   136  	(<code>io.EOF</code>, for example) or if it can't scan the requested
   137  	value (in our case, a single string). In either case we just want to
   138  	stop scanning, so we <code>break</code> out of the loop.
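	<br/><br/>
	The scanning loop can be tried in isolation. In this sketch the
	helpers <code>scanWords</code> and <code>scanString</code> are
	hypothetical names; the loop itself follows the description above.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// scanWords is a hypothetical helper that collects every
// whitespace-separated word from r, stopping at the first error.
func scanWords(r io.Reader) []string {
	br := bufio.NewReader(r)
	var words []string
	s := new(string) // Fscan stores each scanned word through this pointer
	for {
		if _, err := fmt.Fscan(br, s); err != nil {
			break // io.EOF or any other error ends the scan
		}
		words = append(words, *s)
	}
	return words
}

// scanString is a convenience wrapper for in-memory text.
func scanString(text string) []string {
	return scanWords(strings.NewReader(text))
}

func main() {
	words := scanString("I am not a number!\nI am a free man!")
	// Newlines count as space, and punctuation stays attached to its word.
	fmt.Println(len(words), words[4]) // prints 10 number!
}
```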
   139  </step>
   140  
   141  <step title="Adding a prefix and suffix to the chain" src="doc/codewalk/markov.go:/	key/,/key\], s\)">
   142  	The word stored in <code>s</code> is a new suffix. We add the new
   143  	prefix/suffix combination to the <code>chain</code> map by computing
   144  	the map key with <code>p.String</code> and appending the suffix
   145  	to the slice stored under that key.
   146  	<br/><br/>
   147  	The built-in <code>append</code> function appends elements to a slice
   148  	and allocates new storage when necessary. When the provided slice is
   149  	<code>nil</code>, <code>append</code> allocates a new slice.
   150  	This behavior conveniently ties in with the semantics of our map:
   151  	retrieving an unset key returns the zero value of the value type and
   152  	the zero value of <code>[]string</code> is <code>nil</code>.
   153  	When our program encounters a new prefix (yielding a <code>nil</code>
   154  	value in the map) <code>append</code> will allocate a new slice.
   155  	<br/><br/>
   156  	For more information about the <code>append</code> function and slices
   157  	in general see the
   158  	<a href="/doc/articles/slices_usage_and_internals.html">Slices: usage and internals</a> article.
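	<br/><br/>
	The interplay between a missing map key and <code>append</code> can
	be checked with a short sketch. The helper name
	<code>addSuffix</code> is hypothetical.

```go
package main

import "fmt"

// addSuffix is a hypothetical helper that records suffix under key.
func addSuffix(chain map[string][]string, key, suffix string) {
	// For an unseen key the lookup yields a nil slice, and append
	// allocates a fresh one, so no explicit initialization is needed.
	chain[key] = append(chain[key], suffix)
}

func main() {
	chain := make(map[string][]string)
	fmt.Println(chain["I am"] == nil) // prints true
	addSuffix(chain, "I am", "a")
	addSuffix(chain, "I am", "not")
	fmt.Println(chain["I am"]) // prints [a not]
}
```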
   159  </step>
   160  
   161  <step title="Pushing the suffix onto the prefix" src="doc/codewalk/markov.go:/p\.Shift/">
   162  	Before reading the next word our algorithm requires us to drop the
   163  	first word from the prefix and push the current suffix onto the prefix.
   164  	<br/><br/>
   165  	When in this state
   166  	<pre>
   167  p == Prefix{"I", "am"}
   168  s == "not" </pre>
   169  	the new value for <code>p</code> would be
   170  	<pre>
   171  p == Prefix{"am", "not"}</pre>
   172  	This operation is also required during text generation so we put
   173  	the code to perform this mutation of the slice inside a method on
   174  	<code>Prefix</code> named <code>Shift</code>.
   175  </step>
   176  
   177  <step title="The Shift method" src="doc/codewalk/markov.go:/func[^\n]+Shift/,/\n}/">
   178  	The <code>Shift</code> method uses the built-in <code>copy</code>
   179  	function to copy the last <code>len(p)-1</code> elements of <code>p</code> to
   180  	the start of the slice, effectively moving the elements
   181  	one index to the left (if you consider zero as the leftmost index).
   182  	<pre>
   183  p := Prefix{"I", "am"}
   184  copy(p, p[1:])
   185  // p == Prefix{"am", "am"}</pre>
   186  	We then assign the provided <code>word</code> to the last index
   187  	of the slice:
   188  	<pre>
   189  // word == "not"
   190  p[len(p)-1] = word
   191  // p == Prefix{"am", "not"}</pre>
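	Put together, the two statements give a method like the one in this
	runnable sketch. A value receiver is enough here because the slice
	header is copied but the backing array is shared, so the caller sees
	the change. Note that the sketch assumes a prefix of at least one
	word; an empty <code>Prefix</code> would panic.

```go
package main

import "fmt"

// Prefix is a Markov chain prefix of one or more words.
type Prefix []string

// Shift removes the first word from the Prefix and appends the given word.
func (p Prefix) Shift(word string) {
	copy(p, p[1:])     // move every element one index to the left
	p[len(p)-1] = word // overwrite the duplicated last element
}

func main() {
	p := Prefix{"I", "am"}
	p.Shift("not")
	fmt.Println(p) // prints [am not]
	p.Shift("a")
	fmt.Println(p) // prints [not a]
}
```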
   192  </step>
   193  
   194  <step title="Generating text" src="doc/codewalk/markov.go:/func[^\n]+Generate/,/\n}/">
   195  	The <code>Generate</code> method is similar to <code>Build</code>
   196  	except that instead of reading words from a <code>Reader</code>
   197  	and storing them in a map, it reads words from the map and
   198  	appends them to a slice (<code>words</code>).
   199  	<br/><br/>
   200  	<code>Generate</code> uses a conditional for loop to generate
   201  	up to <code>n</code> words.
   202  </step>
   203  
   204  <step title="Getting potential suffixes" src="doc/codewalk/markov.go:/choices/,/}\n/">
   205  	At each iteration of the loop we retrieve a list of potential suffixes
   206  	for the current prefix. We access the <code>chain</code> map at key
   207  	<code>p.String()</code> and assign its contents to <code>choices</code>.
   208  	<br/><br/>
   209  	If <code>len(choices)</code> is zero we break out of the loop as there
   210  	are no potential suffixes for that prefix.
   211  	This test also works if the key isn't present in the map at all:
   212  	in that case, <code>choices</code> will be <code>nil</code> and the
   213  	length of a <code>nil</code> slice is zero.
   214  </step>
   215  
   216  <step title="Choosing a suffix at random" src="doc/codewalk/markov.go:/next := choices/,/Shift/">
   217  	To choose a suffix we use the
   218  	<code><a href="/pkg/math/rand/#Intn">rand.Intn</a></code> function.
   219  	It returns a non-negative pseudo-random integer less than the provided
   220  	value. Passing in <code>len(choices)</code> gives us a random valid
   221  	index into the list.
   222  	<br/><br/>
   223  	We use that index to pick our new suffix, assign it to
   224  	<code>next</code> and append it to the <code>words</code> slice.
   225  	<br/><br/>
   226  	Next, we <code>Shift</code> the new suffix onto the prefix just as
   227  	we did in the <code>Build</code> method.
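	<br/><br/>
	The whole generation loop can be sketched as below. This is a
	reconstruction from the steps described so far, not a copy of the
	source: the loop condition is written against
	<code>len(words)</code>, and the small hand-built chain in
	<code>main</code> is made up for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

type Prefix []string

func (p Prefix) String() string { return strings.Join(p, " ") }

func (p Prefix) Shift(word string) {
	copy(p, p[1:])
	p[len(p)-1] = word
}

type Chain struct {
	chain     map[string][]string
	prefixLen int
}

// Generate returns a string of at most n words drawn from the chain.
func (c *Chain) Generate(n int) string {
	p := make(Prefix, c.prefixLen)
	var words []string
	for n > len(words) {
		choices := c.chain[p.String()]
		if len(choices) == 0 {
			break // unknown prefix or no suffixes: stop early
		}
		next := choices[rand.Intn(len(choices))] // index in [0, len(choices))
		words = append(words, next)
		p.Shift(next)
	}
	return strings.Join(words, " ")
}

func main() {
	c := new(Chain)
	c.prefixLen = 1
	c.chain = map[string][]string{
		"":  {"a"},
		"a": {"man", "plan", "canal"},
	}
	// The first word is always "a"; the second is chosen at random.
	fmt.Println(c.Generate(10))
}
```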
   228  </step>
   229  
   230  <step title="Returning the generated text" src="doc/codewalk/markov.go:/Join\(words/">
   231  	Before returning the generated text as a string, we use the
   232  	<code>strings.Join</code> function to join the elements of
   233  	the <code>words</code> slice together, separated by spaces.
   234  </step>
   235  
   236  <step title="Command-line flags" src="doc/codewalk/markov.go:/Register command-line flags/,/prefixLen/">
   237  	To make it easy to tweak the prefix and generated text lengths we
   238  	use the <code><a href="/pkg/flag/">flag</a></code> package to parse
   239  	command-line flags.
   240  	<br/><br/>
   241  	These calls to <code>flag.Int</code> register new flags with the
   242  	<code>flag</code> package. The arguments to <code>Int</code> are the
   243  	flag name, its default value, and a description. The <code>Int</code>
   244  	function returns a pointer to an integer that will contain the
   245  	user-supplied value (or the default value if the flag was omitted on
   246  	the command-line).
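	<br/><br/>
	The pointer-returning behavior is easiest to see with explicit
	arguments. The sketch below uses a private
	<code>flag.FlagSet</code>, whose <code>Int</code> and
	<code>Parse</code> methods mirror the package-level functions, so it
	can parse a fixed argument list. The helper name
	<code>parseArgs</code>, the usage strings, and the default values 100
	and 2 are assumptions made for this example.

```go
package main

import (
	"flag"
	"fmt"
)

// parseArgs is a hypothetical helper that parses args with a private FlagSet,
// mirroring what flag.Int and flag.Parse do on the default command-line set.
// The default values 100 and 2 are assumptions made for this sketch.
func parseArgs(args []string) (int, int, error) {
	fs := flag.NewFlagSet("markov", flag.ContinueOnError)
	words := fs.Int("words", 100, "maximum number of words to print")
	prefix := fs.Int("prefix", 2, "prefix length in words")
	if err := fs.Parse(args); err != nil {
		return 0, 0, err
	}
	// Int returned pointers; dereference them only after Parse has run.
	return *words, *prefix, nil
}

func main() {
	w, p, err := parseArgs([]string{"-prefix=1"})
	fmt.Println(w, p, err == nil) // prints 100 1 true
}
```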
   247  </step>
   248  
   249  <step title="Program set up" src="doc/codewalk/markov.go:/flag.Parse/,/rand.Seed/">
   250  	The <code>main</code> function begins by parsing the command-line
   251  	flags with <code>flag.Parse</code> and seeding the <code>rand</code>
   252  	package's random number generator with the current time.
   253  	<br/><br/>
   254  	If the command-line flags provided by the user are invalid the
   255  	<code>flag.Parse</code> function will print an informative usage
   256  	message and terminate the program.
   257  </step>
   258  
   259  <step title="Creating and building a new Chain" src="doc/codewalk/markov.go:/c := NewChain/,/c\.Build/">
   260  	To create the new <code>Chain</code> we call <code>NewChain</code>
   261  	with the value of the <code>prefix</code> flag.
   262  	<br/><br/>
   263  	To build the chain we call <code>Build</code> with
   264  	<code>os.Stdin</code> (which implements <code>io.Reader</code>) so
   265  	that it will read its input from standard input.
   266  </step>
   267  
   268  <step title="Generating and printing text" src="doc/codewalk/markov.go:/c\.Generate/,/fmt.Println/">
   269  	Finally, to generate text we call <code>Generate</code> with
   270  	the value of the <code>words</code> flag and assign the result
   271  	to the variable <code>text</code>.
   272  	<br/><br/>
   273  	Then we call <code>fmt.Println</code> to write the text to standard
   274  	output, followed by a newline.
   275  </step>
   276  
   277  <step title="Using this program" src="doc/codewalk/markov.go">
   278  	To use this program, first build it with the
   279  	<a href="/cmd/go/">go</a> command:
   280  	<pre>
   281  $ go build markov.go</pre>
   282  	And then execute it while piping in some input text:
   283  	<pre>
   284  $ echo "a man a plan a canal panama" \
   285  	| ./markov -prefix=1
   286  a plan a man a plan a canal panama</pre>
   287  	Here's a transcript of generating some text using the Go distribution's
   288  	README file as source material:
   289  	<pre>
   290  $ ./markov -words=10 &lt; $GOROOT/README
   291  This is the source code repository for the Go source
   292  $ ./markov -prefix=1 -words=10 &lt; $GOROOT/README
   293  This is the go directory (the one containing this README).
   294  $ ./markov -prefix=1 -words=10 &lt; $GOROOT/README
   295  This is the variable if you have just untarred a</pre>
   296  </step>
   297  
   298  <step title="An exercise for the reader" src="doc/codewalk/markov.go">
   299  	The <code>Generate</code> function does a lot of allocations when it
   300  	builds the <code>words</code> slice. As an exercise, modify it to
   301  	take an <code>io.Writer</code> to which it incrementally writes the
   302  	generated text with <code>Fprint</code>.
   303  	Aside from being more efficient this makes <code>Generate</code>
   304  	more symmetrical to <code>Build</code>.
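	<br/><br/>
	One possible shape for a solution is sketched below; treat it as a
	starting point rather than the intended answer. The signature of
	<code>Generate</code>, the <code>render</code> helper, and the sample
	chain are all assumptions. In the real program the writer would
	typically be <code>os.Stdout</code>.

```go
package main

import (
	"fmt"
	"io"
	"math/rand"
	"strings"
)

type Prefix []string

func (p Prefix) String() string { return strings.Join(p, " ") }

func (p Prefix) Shift(word string) {
	copy(p, p[1:])
	p[len(p)-1] = word
}

type Chain struct {
	chain     map[string][]string
	prefixLen int
}

// Generate writes at most n words from the chain to w, separated by spaces.
func (c *Chain) Generate(w io.Writer, n int) error {
	p := make(Prefix, c.prefixLen)
	for i := 0; n > i; i++ {
		choices := c.chain[p.String()]
		if len(choices) == 0 {
			break
		}
		next := choices[rand.Intn(len(choices))]
		sep := " "
		if i == 0 {
			sep = "" // no separator before the first word
		}
		// Fprint adds no space between operands when both are strings.
		if _, err := fmt.Fprint(w, sep, next); err != nil {
			return err
		}
		p.Shift(next)
	}
	return nil
}

// render is a hypothetical helper that collects the output in memory.
func render(c *Chain, n int) string {
	sb := new(strings.Builder)
	if err := c.Generate(sb, n); err != nil {
		return ""
	}
	return sb.String()
}

func main() {
	c := new(Chain)
	c.prefixLen = 1
	c.chain = map[string][]string{"": {"a"}, "a": {"b"}, "b": {"a"}}
	fmt.Println(render(c, 5)) // prints a b a b a
}
```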
   305  </step>
   306  
   307  </codewalk>