     1  # simdjson-go
     2  
     3  ## Introduction
     4  
     5  This is a Golang port of [simdjson](https://github.com/lemire/simdjson),
     6  a high performance JSON parser developed by Daniel Lemire and Geoff Langdale.
     7  It makes extensive use of SIMD instructions to achieve parsing performance of gigabytes of JSON per second.
     8  
     9  Performance wise, `simdjson-go` runs on average at about 40% to 60% of the speed of simdjson.
    10  Compared to Golang's standard package `encoding/json`, `simdjson-go` is about 10x faster.
    11  
    12  [![Documentation](https://godoc.org/github.com/minio/simdjson-go?status.svg)](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc)
    13  
    14  ## Features
    15  
`simdjson-go` is a validating parser, meaning that it validates and checks numerical values, booleans and other values while parsing.
These values are therefore available as the appropriate `int` and `float64` representations after parsing.
    18  
    19  Additionally `simdjson-go` has the following features:
    20  
    21  - No 4 GB object limit
    22  - Support for [ndjson](http://ndjson.org/) (newline delimited json)
    23  - Pure Go (no need for cgo)
    24  - Object search/traversal.
    25  - In-place value replacement.
    26  - Remove object/array members.
- Serialize parsed JSON as binary data.
    28  - Re-serialize parts as JSON.
    29  
    30  ## Requirements
    31  
    32  `simdjson-go` has the following requirements for parsing:
    33  
    34  A CPU with both AVX2 and CLMUL is required (Haswell from 2013 onwards should do for Intel, for AMD a Ryzen/EPYC CPU (Q1 2017) should be sufficient).
This can be checked using the provided [`SupportedCPU()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#SupportedCPU) function.
    36  
The package does not provide a fallback for unsupported CPUs, but serialized data can be deserialized on an unsupported CPU.

Using `gccgo` will also always report an unsupported CPU since it cannot compile assembly.
    40  
    41  ## Usage
    42  
Run the following command to install `simdjson-go`:
    44  
    45  ```bash
    46  go get -u github.com/minio/simdjson-go
    47  ```
    48  
    49  In order to parse a JSON byte stream, you either call [`simdjson.Parse()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Parse)
    50  or [`simdjson.ParseND()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParseND) for newline delimited JSON files.
    51  Both of these functions return a [`ParsedJson`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParsedJson)
    52  struct that can be used to navigate the JSON object by calling [`Iter()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParsedJson.Iter).
    53  
The easiest use is to call the [`ForEach()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParsedJson.ForEach) function of the returned `ParsedJson`.
    55  
```Go
package main

import (
	"fmt"
	"log"

	"github.com/minio/simdjson-go"
)

func main() {
	// Parse JSON:
	pj, err := simdjson.Parse([]byte(`{"Image":{"URL":"http://example.com/example.gif"}}`), nil)
	if err != nil {
		log.Fatal(err)
	}

	// Iterate each top level element.
	_ = pj.ForEach(func(i simdjson.Iter) error {
		fmt.Println("Got iterator for type:", i.Type())
		// Look up the nested key Image -> URL.
		element, err := i.FindElement(nil, "Image", "URL")
		if err == nil {
			value, _ := element.Iter.StringCvt()
			fmt.Println("Found element:", element.Name, "Type:", element.Type, "Value:", value)
		}
		return nil
	})

	// Output:
	// Got iterator for type: object
	// Found element: URL Type: string Value: http://example.com/example.gif
}
```
    80  
    81  ### Parsing with iterators
    82  
    83  Using the type [`Iter`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Iter) you can call
    84  [`Advance()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Iter.Advance) to iterate over the tape, like so:
    85  
    86  ```Go
    87  for {
    88      typ := iter.Advance()
    89  
    90      switch typ {
    91      case simdjson.TypeRoot:
    92          if typ, tmp, err = iter.Root(tmp); err != nil {
    93              return
    94          }
    95  
    96          if typ == simdjson.TypeObject {
    97              if obj, err = tmp.Object(obj); err != nil {
    98                  return
    99              }
   100  
   101              e := obj.FindKey(key, &elem)
   102              if e != nil && elem.Type == simdjson.TypeString {
   103                  v, _ := elem.Iter.StringBytes()
   104                  fmt.Println(string(v))
   105              }
   106          }
   107  
   108      default:
   109          return
   110      }
   111  }
   112  ```
   113  
   114  When you advance the Iter you get the next type currently queued.
   115  
   116  Each type then has helpers to access the data. When you get a type you can use these to access the data:
   117  
   118  | Type       | Action on Iter             |
   119  |------------|----------------------------|
   120  | TypeNone   | Nothing follows. Iter done |
   121  | TypeNull   | Null value                 |
   122  | TypeString | `String()`/`StringBytes()` |
   123  | TypeInt    | `Int()`/`Float()`          |
   124  | TypeUint   | `Uint()`/`Float()`         |
   125  | TypeFloat  | `Float()`                  |
   126  | TypeBool   | `Bool()`                   |
   127  | TypeObject | `Object()`                 |
   128  | TypeArray  | `Array()`                  |
   129  | TypeRoot   | `Root()`                   |
   130  
   131  You can also get the next value as an `interface{}` using the [Interface()](https://pkg.go.dev/github.com/minio/simdjson-go#Iter.Interface) method.
   132  
   133  Note that arrays and objects that are null are always returned as `TypeNull`.
   134  
The complex types return helpers that will help parse each of the underlying structures.
   136  
   137  It is up to you to keep track of the nesting level you are operating at.
   138  
   139  For any `Iter` it is possible to marshal the recursive content of the Iter using
   140  [`MarshalJSON()`](https://pkg.go.dev/github.com/minio/simdjson-go#Iter.MarshalJSON) or
   141  [`MarshalJSONBuffer(...)`](https://pkg.go.dev/github.com/minio/simdjson-go#Iter.MarshalJSONBuffer).
   142  
   143  Currently, it is not possible to unmarshal into structs.
   144  
   145  ### Search by path
   146  
   147  It is possible to search by path to find elements by traversing objects.
   148  
   149  For example:
   150  
   151  ```
   152  	// Find element in path.
   153  	elem, err := i.FindElement(nil, "Image", "URL")
   154  ```
   155  
   156  Will locate the field inside a json object with the following structure:
   157  
   158  ```
   159  {
   160      "Image": {
   161          "URL": "value"
   162      }
   163  }
   164  ```
   165  
   166  The values can be any type. The [Element](https://pkg.go.dev/github.com/minio/simdjson-go#Element)
   167  will contain the element information and an Iter to access the content.
   168  
   169  ## Parsing Objects
   170  
   171  If you are only interested in one key in an object you can use `FindKey` to quickly select it.
   172  
Use `ForEach(fn func(key []byte, i Iter), onlyKeys map[string]struct{})`
to get a callback for each element in the object.
   175  
   176  An object can be traversed manually by using `NextElement(dst *Iter) (name string, t Type, err error)`.
   177  The key of the element will be returned as a string and the type of the value will be returned
   178  and the provided `Iter` will contain an iterator which will allow access to the content.
   179  
   180  There is a `NextElementBytes` which provides the same, but without the need to allocate a string.
   181  
All elements of the object can be retrieved using a pretty lightweight [`Parse`](https://pkg.go.dev/github.com/minio/simdjson-go#Object.Parse)
which provides a map of all keys and all elements as a slice.
   184  
   185  All elements of the object can be returned as `map[string]interface{}` using the `Map` method on the object.
   186  This will naturally perform allocations for all elements.
   187  
   188  ## Parsing Arrays
   189  
   190  [Arrays](https://pkg.go.dev/github.com/minio/simdjson-go#Array) in JSON can have mixed types.
   191  
   192  It is possible to call `ForEach(fn func(i Iter))` to get each element.
   193  
   194  To iterate over the array with mixed types use the [`Iter`](https://pkg.go.dev/github.com/minio/simdjson-go#Array.Iter)
   195  method to get an iterator.
   196  
There are methods that allow you to retrieve all elements as a single type,
`[]int64`, `[]uint64`, `[]float64` and `[]string` with `AsInteger()`, `AsUint64()`, `AsFloat()` and `AsString()`.
   199  
   200  ## Number parsing
   201  
   202  Numbers in JSON are untyped and are returned by the following rules in order:
   203  
* If there is any floating point notation, like exponents or a dot notation, it is always returned as float.
* If the number is a pure integer and it fits within an int64 it is returned as such.
* If the number is a pure positive integer and fits within a uint64 it is returned as such.
* If the number is a valid number it is returned as float64.
   208  
   209  If the number was converted from integer notation to a float due to not fitting inside int64/uint64
   210  the `FloatOverflowedInteger` flag is set, which can be retrieved using `(Iter).FloatFlags()` method.
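
The rules above can be sketched with the standard library alone. The helper below is only an illustration of the decision order, not how `simdjson-go` parses numbers internally; `classifyNumber` is a hypothetical name and it assumes its input is already a syntactically valid JSON number.

```Go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// classifyNumber applies the rules above, in order, to a JSON number literal.
// Hypothetical helper for illustration; not part of the simdjson-go API.
func classifyNumber(s string) string {
	// Rule 1: a dot or an exponent always means float.
	if strings.ContainsAny(s, ".eE") {
		return "float"
	}
	// Rule 2: a pure integer that fits in an int64.
	if _, err := strconv.ParseInt(s, 10, 64); err == nil {
		return "int"
	}
	// Rule 3: a pure positive integer that fits in a uint64.
	if _, err := strconv.ParseUint(s, 10, 64); err == nil {
		return "uint"
	}
	// Rule 4: any other valid number becomes a float64. This is the case
	// in which the FloatOverflowedInteger flag would be set.
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return "float (overflowed integer)"
	}
	return "invalid"
}

func main() {
	for _, s := range []string{"42", "-7", "18446744073709551615", "1.5", "1e3", "18446744073709551616"} {
		fmt.Printf("%-22s -> %s\n", s, classifyNumber(s))
	}
}
```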
   211  
   212  JSON numbers follow JavaScript’s double-precision floating-point format.
   213  
   214  * Represented in base 10 with no superfluous leading zeros (e.g. 67, 1, 100).
   215  * Include digits between 0 and 9.
   216  * Can be a negative number (e.g. -10).
* Can be a fraction (e.g. 0.5).
   218  * Can also have an exponent of 10, prefixed by e or E with a plus or minus sign to indicate positive or negative exponentiation.
   219  * Octal and hexadecimal formats are not supported.
   220  * Can not have a value of NaN (Not A Number) or Infinity.
   221  
   222  ## Parsing NDJSON stream
   223  
   224  Newline delimited json is sent as packets with each line being a root element.
   225  
   226  Here is an example that counts the number of `"Make": "HOND"` in NDJSON similar to this:
   227  
   228  ```
   229  {"Age":20, "Make": "HOND"}
   230  {"Age":22, "Make": "TLSA"}
   231  ```
   232  
```Go
func findHondas(r io.Reader) {
	var nFound int

	// Communication
	reuse := make(chan *simdjson.ParsedJson, 10)
	res := make(chan simdjson.Stream, 10)

	simdjson.ParseNDStream(r, res, reuse)
	// Read results in blocks...
	for got := range res {
		if got.Error != nil {
			if got.Error == io.EOF {
				break
			}
			log.Fatal(got.Error)
		}

		var elem *simdjson.Element
		err := got.Value.ForEach(func(i simdjson.Iter) error {
			var err error
			elem, err = i.FindElement(elem, "Make")
			if err != nil {
				// No "Make" key in this root element; skip it.
				return nil
			}
			bts, _ := elem.Iter.StringBytes()
			if string(bts) == "HOND" {
				nFound++
			}
			return nil
		})
		if err != nil {
			log.Fatal(err)
		}
		// Hand the parsed block back so its memory can be reused.
		reuse <- got.Value
	}
	fmt.Println("Found", nFound, "Hondas")
}
```
   270  
   271  More examples can be found in the examples subdirectory and further documentation can be found at [godoc](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc).
   272  
   273  
   274  ### In-place Value Replacement
   275  
   276  It is possible to replace a few, basic internal values.
   277  This means that when re-parsing or re-serializing the parsed JSON these values will be output.
   278  
   279  Boolean (true/false) and null values can be freely exchanged.
   280  
   281  Numeric values (float, int, uint) can be exchanged freely.
   282  
   283  Strings can also be exchanged with different values.
   284  
Strings and numbers can be exchanged. However, note that there are no checks for numbers inserted as object keys,
so using this for keys can produce invalid JSON.

There is no way to modify objects or arrays, other than replacing the value types above inside each.
It is not possible to remove or add elements.

To replace the value referenced by an `Iter`, simply call `SetNull`, `SetBool`, `SetFloat`, `SetInt`, `SetUInt`,
`SetString` or `SetStringBytes`.
   293  
   294  ### Object & Array Element Deletion
   295  
   296  It is possible to delete one or more elements in an object.
   297  
`(*Object).DeleteElems(fn, onlyKeys)` will call back fn for each key+value.
   299  
   300  If true is returned, the key+value is deleted. A key filter can be provided for optional filtering.
   301  If the callback function is nil all elements matching the filter will be deleted.
   302  If both are nil all elements are deleted.
   303  
   304  Example:
   305  
```Go
	// The object we are modifying
	var obj *simdjson.Object

	// Delete all entries where the key is "unwanted":
	err = obj.DeleteElems(func(key []byte, i simdjson.Iter) bool {
		return string(key) == "unwanted"
	}, nil)

	// Alternative version with prefiltered keys:
	err = obj.DeleteElems(nil, map[string]struct{}{"unwanted": {}})
```
   318  
   319  `(*Array).DeleteElems(fn func(i Iter) bool)` will call back fn for each array value.
   320  If the function returns true the element is deleted in the array.
   321  
```Go
	// The array we are modifying
	var array *simdjson.Array

	// Delete all entries that are strings.
	array.DeleteElems(func(i simdjson.Iter) bool {
		return i.Type() == simdjson.TypeString
	})
```
   331  
   332  ## Serializing parsed json
   333  
   334  It is possible to serialize parsed JSON for more compact storage and faster load time.
   335  
To create a new serializer use [NewSerializer](https://pkg.go.dev/github.com/minio/simdjson-go#NewSerializer).
This serializer can be reused for several JSON blocks.

The serializer will provide string deduplication and compression of elements.
This can be fine-tuned using the [`CompressMode`](https://pkg.go.dev/github.com/minio/simdjson-go#Serializer.CompressMode) setting.
   341  
   342  To serialize a block of parsed data use the [`Serialize`](https://pkg.go.dev/github.com/minio/simdjson-go#Serializer.Serialize) method.
   343  
   344  To read back use the [`Deserialize`](https://pkg.go.dev/github.com/minio/simdjson-go#Serializer.Deserialize) method.
   345  For deserializing the compression mode does not need to match since it is read from the stream.
   346  
   347  Example of speed for serializer/deserializer on [`parking-citations-1M`](https://dl.minio.io/assets/parking-citations-1M.json.zst).
   348  
   349  | Compress Mode | % of JSON size | Serialize Speed | Deserialize Speed |
   350  |---------------|----------------|-----------------|-------------------|
   351  | None          | 177.26%        | 425.70 MB/s     | 2334.33 MB/s      |
   352  | Fast          | 17.20%         | 412.75 MB/s     | 1234.76 MB/s      |
   353  | Default       | 16.85%         | 411.59 MB/s     | 1242.09 MB/s      |
   354  | Best          | 10.91%         | 337.17 MB/s     | 806.23 MB/s       |
   355  
   356  In some cases the speed difference and compression difference will be bigger.
   357  
   358  ## Performance vs `encoding/json` and `json-iterator/go`
   359  
   360  Though simdjson provides different output than traditional unmarshal functions this can give
   361  an overview of the expected performance for reading specific data in JSON.
   362  
   363  Below is a performance comparison to Golang's standard package `encoding/json` based on the same set of JSON test files, unmarshal to `interface{}`.
   364  
   365  Comparisons with default settings:
   366  
   367  ```
   368  λ benchcmp enc-json.txt simdjson.txt
   369  benchmark                      old ns/op     new ns/op     delta
   370  BenchmarkApache_builds-32      1219080       142972        -88.27%
   371  BenchmarkCanada-32             38362219      13417193      -65.02%
   372  BenchmarkCitm_catalog-32       17051899      1359983       -92.02%
   373  BenchmarkGithub_events-32      603037        74042         -87.72%
   374  BenchmarkGsoc_2018-32          20777333      1259171       -93.94%
   375  BenchmarkInstruments-32        2626808       301370        -88.53%
   376  BenchmarkMarine_ik-32          56630295      14419901      -74.54%
   377  BenchmarkMesh-32               13411486      4206251       -68.64%
   378  BenchmarkMesh_pretty-32        18226803      4786081       -73.74%
   379  BenchmarkNumbers-32            2131951       909641        -57.33%
   380  BenchmarkRandom-32             7360966       1004387       -86.36%
   381  BenchmarkTwitter-32            6635848       588773        -91.13%
   382  BenchmarkTwitterescaped-32     6292856       972250        -84.55%
   383  BenchmarkUpdate_center-32      6396501       708717        -88.92%
   384  
   385  benchmark                      old MB/s     new MB/s     speedup
   386  BenchmarkApache_builds-32      104.40       890.21       8.53x
   387  BenchmarkCanada-32             58.68        167.77       2.86x
   388  BenchmarkCitm_catalog-32       101.29       1270.02      12.54x
   389  BenchmarkGithub_events-32      108.01       879.67       8.14x
   390  BenchmarkGsoc_2018-32          160.17       2642.88      16.50x
   391  BenchmarkInstruments-32        83.88        731.15       8.72x
   392  BenchmarkMarine_ik-32          52.68        206.90       3.93x
   393  BenchmarkMesh-32               53.95        172.03       3.19x
   394  BenchmarkMesh_pretty-32        86.54        329.57       3.81x
   395  BenchmarkNumbers-32            70.42        165.04       2.34x
   396  BenchmarkRandom-32             69.35        508.25       7.33x
   397  BenchmarkTwitter-32            95.17        1072.59      11.27x
   398  BenchmarkTwitterescaped-32     89.37        578.46       6.47x
   399  BenchmarkUpdate_center-32      83.35        752.31       9.03x
   400  
   401  benchmark                      old allocs     new allocs     delta
   402  BenchmarkApache_builds-32      9716           22             -99.77%
   403  BenchmarkCanada-32             392535         250            -99.94%
   404  BenchmarkCitm_catalog-32       95372          110            -99.88%
   405  BenchmarkGithub_events-32      3328           17             -99.49%
   406  BenchmarkGsoc_2018-32          58615          67             -99.89%
   407  BenchmarkInstruments-32        13336          33             -99.75%
   408  BenchmarkMarine_ik-32          614776         467            -99.92%
   409  BenchmarkMesh-32               149504         122            -99.92%
   410  BenchmarkMesh_pretty-32        149504         122            -99.92%
   411  BenchmarkNumbers-32            20025          28             -99.86%
   412  BenchmarkRandom-32             66083          76             -99.88%
   413  BenchmarkTwitter-32            31261          53             -99.83%
   414  BenchmarkTwitterescaped-32     31757          53             -99.83%
   415  BenchmarkUpdate_center-32      49074          58             -99.88%
   416  
   417  benchmark                      old bytes     new bytes     delta
   418  BenchmarkApache_builds-32      461556        965           -99.79%
   419  BenchmarkCanada-32             10943847      39793         -99.64%
   420  BenchmarkCitm_catalog-32       5122732       6089          -99.88%
   421  BenchmarkGithub_events-32      186148        802           -99.57%
   422  BenchmarkGsoc_2018-32          7032092       17215         -99.76%
   423  BenchmarkInstruments-32        882265        1310          -99.85%
   424  BenchmarkMarine_ik-32          22564413      189870        -99.16%
   425  BenchmarkMesh-32               7130934       15483         -99.78%
   426  BenchmarkMesh_pretty-32        7288661       12066         -99.83%
   427  BenchmarkNumbers-32            1066304       1280          -99.88%
   428  BenchmarkRandom-32             2787054       4096          -99.85%
   429  BenchmarkTwitter-32            2152260       2550          -99.88%
   430  BenchmarkTwitterescaped-32     2330548       3062          -99.87%
   431  BenchmarkUpdate_center-32      2729631       3235          -99.88%
   432  ```
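
For reading the tables, the `delta` and `speedup` columns can be reproduced from the `old` and `new` columns. The snippet below is a small sketch of that arithmetic using the `BenchmarkApache_builds-32` rows; the function names are made up for this example.

```Go
package main

import "fmt"

// delta returns the relative change from oldV to newV in percent.
func delta(oldV, newV float64) float64 {
	return (newV - oldV) / oldV * 100
}

// speedup returns the ratio of the new throughput to the old throughput.
func speedup(oldMBs, newMBs float64) float64 {
	return newMBs / oldMBs
}

func main() {
	// Values copied from the BenchmarkApache_builds-32 rows above.
	fmt.Printf("%.2f%%\n", delta(1219080, 142972)) // -88.27%
	fmt.Printf("%.2fx\n", speedup(104.40, 890.21)) // 8.53x
}
```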
   433  
   434  Here is another benchmark comparison to `json-iterator/go`, unmarshal to `interface{}`.
   435  
   436  ```
   437  λ benchcmp jsiter.txt simdjson.txt
   438  benchmark                      old ns/op     new ns/op     delta
   439  BenchmarkApache_builds-32      891370        142972        -83.96%
   440  BenchmarkCanada-32             52365386      13417193      -74.38%
   441  BenchmarkCitm_catalog-32       10154544      1359983       -86.61%
   442  BenchmarkGithub_events-32      398741        74042         -81.43%
   443  BenchmarkGsoc_2018-32          15584278      1259171       -91.92%
   444  BenchmarkInstruments-32        1858339       301370        -83.78%
   445  BenchmarkMarine_ik-32          49881479      14419901      -71.09%
   446  BenchmarkMesh-32               15038300      4206251       -72.03%
   447  BenchmarkMesh_pretty-32        17655583      4786081       -72.89%
   448  BenchmarkNumbers-32            2903165       909641        -68.67%
   449  BenchmarkRandom-32             6156849       1004387       -83.69%
   450  BenchmarkTwitter-32            4655981       588773        -87.35%
   451  BenchmarkTwitterescaped-32     5521004       972250        -82.39%
   452  BenchmarkUpdate_center-32      5540200       708717        -87.21%
   453  
   454  benchmark                      old MB/s     new MB/s     speedup
   455  BenchmarkApache_builds-32      142.79       890.21       6.23x
   456  BenchmarkCanada-32             42.99        167.77       3.90x
   457  BenchmarkCitm_catalog-32       170.09       1270.02      7.47x
   458  BenchmarkGithub_events-32      163.34       879.67       5.39x
   459  BenchmarkGsoc_2018-32          213.54       2642.88      12.38x
   460  BenchmarkInstruments-32        118.57       731.15       6.17x
   461  BenchmarkMarine_ik-32          59.81        206.90       3.46x
   462  BenchmarkMesh-32               48.12        172.03       3.58x
   463  BenchmarkMesh_pretty-32        89.34        329.57       3.69x
   464  BenchmarkNumbers-32            51.71        165.04       3.19x
   465  BenchmarkRandom-32             82.91        508.25       6.13x
   466  BenchmarkTwitter-32            135.64       1072.59      7.91x
   467  BenchmarkTwitterescaped-32     101.87       578.46       5.68x
   468  BenchmarkUpdate_center-32      96.24        752.31       7.82x
   469  
   470  benchmark                      old allocs     new allocs     delta
   471  BenchmarkApache_builds-32      13248          22             -99.83%
   472  BenchmarkCanada-32             665988         250            -99.96%
   473  BenchmarkCitm_catalog-32       118755         110            -99.91%
   474  BenchmarkGithub_events-32      4442           17             -99.62%
   475  BenchmarkGsoc_2018-32          90915          67             -99.93%
   476  BenchmarkInstruments-32        18776          33             -99.82%
   477  BenchmarkMarine_ik-32          692512         467            -99.93%
   478  BenchmarkMesh-32               184137         122            -99.93%
   479  BenchmarkMesh_pretty-32        204037         122            -99.94%
   480  BenchmarkNumbers-32            30037          28             -99.91%
   481  BenchmarkRandom-32             88091          76             -99.91%
   482  BenchmarkTwitter-32            45040          53             -99.88%
   483  BenchmarkTwitterescaped-32     47198          53             -99.89%
   484  BenchmarkUpdate_center-32      66757          58             -99.91%
   485  
   486  benchmark                      old bytes     new bytes     delta
   487  BenchmarkApache_builds-32      518350        965           -99.81%
   488  BenchmarkCanada-32             16189358      39793         -99.75%
   489  BenchmarkCitm_catalog-32       5571982       6089          -99.89%
   490  BenchmarkGithub_events-32      221631        802           -99.64%
   491  BenchmarkGsoc_2018-32          11771591      17215         -99.85%
   492  BenchmarkInstruments-32        991674        1310          -99.87%
   493  BenchmarkMarine_ik-32          25257277      189870        -99.25%
   494  BenchmarkMesh-32               7991707       15483         -99.81%
   495  BenchmarkMesh_pretty-32        8628570       12066         -99.86%
   496  BenchmarkNumbers-32            1226518       1280          -99.90%
   497  BenchmarkRandom-32             3167528       4096          -99.87%
   498  BenchmarkTwitter-32            2426730       2550          -99.89%
   499  BenchmarkTwitterescaped-32     2607198       3062          -99.88%
   500  BenchmarkUpdate_center-32      3052382       3235          -99.89%
   501  ```
   502  
   503  
   504  ### Inplace strings
   505  
   506  The best performance is obtained by keeping the JSON message fully mapped in memory and using the
   507  `WithCopyStrings(false)` option. This prevents duplicate copies of string values being made
   508  but mandates that the original JSON buffer is kept alive until the `ParsedJson` object is no longer needed
(i.e. iteration over the tape format has been completed).
   510  
   511  In case the JSON message buffer is freed earlier (or for streaming use cases where memory is reused)
   512  `WithCopyStrings(true)` should be used (which is the default behaviour).
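
Why the buffer must stay alive (and unmodified) without string copies can be illustrated with plain Go slices. This is a standard-library sketch of the aliasing effect only; it does not call `simdjson-go`, and `aliasVersusCopy` is a hypothetical name.

```Go
package main

import "fmt"

// aliasVersusCopy returns what a zero-copy view and a copied value
// contain after the input buffer has been reused for another message.
func aliasVersusCopy() (alias, copied string) {
	buf := []byte(`{"name":"alice"}`)

	// A zero-copy "string value": a sub-slice that shares memory with buf.
	view := buf[9:14]

	// A copied value owns its own memory.
	owned := append([]byte(nil), buf[9:14]...)

	// Reuse the buffer for the next message, as a streaming reader might.
	copy(buf, `{"name":"bobby"}`)

	return string(view), string(owned)
}

func main() {
	alias, copied := aliasVersusCopy()
	fmt.Println(alias)  // bobby
	fmt.Println(copied) // alice
}
```

The aliased view silently changes once the buffer is overwritten, which is the situation the default of copying strings avoids.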
   513  
The performance impact differs based on the input type, but these are the general differences:
   515  
   516  ```
   517  BenchmarkApache_builds/copy-32                	    8242	    142972 ns/op	 890.21 MB/s	     965 B/op	      22 allocs/op
   518  BenchmarkApache_builds/nocopy-32              	   10000	    111189 ns/op	1144.68 MB/s	     932 B/op	      22 allocs/op
   519  
   520  BenchmarkCanada/copy-32                       	      91	  13417193 ns/op	 167.77 MB/s	   39793 B/op	     250 allocs/op
   521  BenchmarkCanada/nocopy-32                     	      87	  13392401 ns/op	 168.08 MB/s	   41334 B/op	     250 allocs/op
   522  
   523  BenchmarkCitm_catalog/copy-32                 	     889	   1359983 ns/op	1270.02 MB/s	    6089 B/op	     110 allocs/op
   524  BenchmarkCitm_catalog/nocopy-32               	     924	   1268470 ns/op	1361.64 MB/s	    5582 B/op	     110 allocs/op
   525  
   526  BenchmarkGithub_events/copy-32                	   16092	     74042 ns/op	 879.67 MB/s	     802 B/op	      17 allocs/op
   527  BenchmarkGithub_events/nocopy-32              	   19446	     62143 ns/op	1048.10 MB/s	     794 B/op	      17 allocs/op
   528  
   529  BenchmarkGsoc_2018/copy-32                    	     948	   1259171 ns/op	2642.88 MB/s	   17215 B/op	      67 allocs/op
   530  BenchmarkGsoc_2018/nocopy-32                  	    1144	   1040864 ns/op	3197.18 MB/s	    9947 B/op	      67 allocs/op
   531  
   532  BenchmarkInstruments/copy-32                  	    3932	    301370 ns/op	 731.15 MB/s	    1310 B/op	      33 allocs/op
   533  BenchmarkInstruments/nocopy-32                	    4443	    271500 ns/op	 811.59 MB/s	    1258 B/op	      33 allocs/op
   534  
   535  BenchmarkMarine_ik/copy-32                    	      79	  14419901 ns/op	 206.90 MB/s	  189870 B/op	     467 allocs/op
   536  BenchmarkMarine_ik/nocopy-32                  	      79	  14176758 ns/op	 210.45 MB/s	  189867 B/op	     467 allocs/op
   537  
   538  BenchmarkMesh/copy-32                         	     288	   4206251 ns/op	 172.03 MB/s	   15483 B/op	     122 allocs/op
   539  BenchmarkMesh/nocopy-32                       	     285	   4207299 ns/op	 171.99 MB/s	   15615 B/op	     122 allocs/op
   540  
   541  BenchmarkMesh_pretty/copy-32                  	     248	   4786081 ns/op	 329.57 MB/s	   12066 B/op	     122 allocs/op
   542  BenchmarkMesh_pretty/nocopy-32                	     250	   4803647 ns/op	 328.37 MB/s	   12009 B/op	     122 allocs/op
   543  
   544  BenchmarkNumbers/copy-32                      	    1336	    909641 ns/op	 165.04 MB/s	    1280 B/op	      28 allocs/op
   545  BenchmarkNumbers/nocopy-32                    	    1321	    910493 ns/op	 164.88 MB/s	    1281 B/op	      28 allocs/op
   546  
   547  BenchmarkRandom/copy-32                       	    1201	   1004387 ns/op	 508.25 MB/s	    4096 B/op	      76 allocs/op
   548  BenchmarkRandom/nocopy-32                     	    1554	    773142 ns/op	 660.26 MB/s	    3198 B/op	      76 allocs/op
   549  
   550  BenchmarkTwitter/copy-32                      	    2035	    588773 ns/op	1072.59 MB/s	    2550 B/op	      53 allocs/op
   551  BenchmarkTwitter/nocopy-32                    	    2485	    475949 ns/op	1326.85 MB/s	    2029 B/op	      53 allocs/op
   552  
   553  BenchmarkTwitterescaped/copy-32               	    1189	    972250 ns/op	 578.46 MB/s	    3062 B/op	      53 allocs/op
   554  BenchmarkTwitterescaped/nocopy-32             	    1372	    874972 ns/op	 642.77 MB/s	    2518 B/op	      53 allocs/op
   555  
   556  BenchmarkUpdate_center/copy-32                	    1665	    708717 ns/op	 752.31 MB/s	    3235 B/op	      58 allocs/op
   557  BenchmarkUpdate_center/nocopy-32              	    2241	    536027 ns/op	 994.68 MB/s	    2130 B/op	      58 allocs/op
   558  ```
   559  
   560  ## Design
   561  
   562  `simdjson-go` follows the same two stage design as `simdjson`.
   563  During the first stage the structural elements (`{`, `}`, `[`, `]`, `:`, and `,`)
   564  are detected and forwarded as offsets in the message buffer to the second stage.
   565  The second stage builds a tape format of the structure of the JSON document.
   566  
   567  Note that in contrast to `simdjson`, `simdjson-go` outputs `uint32`
   568  increments (as opposed to absolute values) to the second stage.
   569  This allows arbitrarily large JSON files to be parsed (as long as a single (string) element does not surpass 4 GB...).
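
A minimal sketch of the idea, assuming ascending positions whose gaps each fit in 32 bits. The helper names are hypothetical and this is not the code used by `simdjson-go`.

```Go
package main

import "fmt"

// toIncrements converts ascending absolute byte positions into uint32
// distances from the previous position. Each gap must be below 4 GB.
func toIncrements(abs []uint64) []uint32 {
	inc := make([]uint32, len(abs))
	var prev uint64
	for i, p := range abs {
		inc[i] = uint32(p - prev)
		prev = p
	}
	return inc
}

// toAbsolute rebuilds the absolute positions by accumulating in 64 bits.
func toAbsolute(inc []uint32) []uint64 {
	abs := make([]uint64, len(inc))
	var pos uint64
	for i, d := range inc {
		pos += uint64(d)
		abs[i] = pos
	}
	return abs
}

func main() {
	// The last two positions lie beyond 4 GB and would not fit in a uint32
	// as absolute values, but every gap between neighbours does.
	abs := []uint64{0, 4000000000, 6000000000, 6000000012}
	inc := toIncrements(abs)
	fmt.Println(inc)
	fmt.Println(toAbsolute(inc))
}
```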
   570  
Also, for better performance,
both stages run concurrently as separate goroutines and a Go channel is used to communicate between the two stages.
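
The shape of such a pipeline can be sketched as follows. Both stage functions are simplified stand-ins invented for this example: the first one ignores strings and escapes, and the second one only tracks nesting depth instead of building a tape.

```Go
package main

import "fmt"

// stage1 stands in for the structural scan: it sends the index of every
// structural byte over the channel and closes it when done.
func stage1(msg []byte, out chan<- int) {
	defer close(out)
	for i, c := range msg {
		switch c {
		case '{', '}', '[', ']', ':', ',':
			out <- i
		}
	}
}

// stage2 stands in for tape construction: it consumes indexes as they
// arrive and here merely records the maximum nesting depth.
func stage2(msg []byte, in <-chan int) (maxDepth int) {
	depth := 0
	for i := range in {
		switch msg[i] {
		case '{', '[':
			depth++
			if depth > maxDepth {
				maxDepth = depth
			}
		case '}', ']':
			depth--
		}
	}
	return maxDepth
}

// run wires the two stages together with a buffered channel.
func run(msg []byte) int {
	ch := make(chan int, 64)
	go stage1(msg, ch)
	return stage2(msg, ch)
}

func main() {
	fmt.Println(run([]byte(`{"a":[1,2,{"b":null}]}`))) // 3
}
```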
   573  
   574  ### Stage 1
   575  
   576  Stage 1 has been converted from the original C code (containing the SIMD intrinsics) to Golang assembly using [c2goasm](https://github.com/minio/c2goasm).
   577  It essentially consists of five separate steps, being:
   578  
   579  - `find_odd_backslash_sequences`: detect backslash characters used to escape quotes
   580  - `find_quote_mask_and_bits`: generate a mask with bits turned on for characters between quotes
   581  - `find_whitespace_and_structurals`: generate a mask for whitespace plus a mask for the structural characters
   582  - `finalize_structurals`: combine the masks computed above into a final mask where each active bit represents the position of a structural character in the input message.
   583  - `flatten_bits_incremental`: output the active bits in the final mask as incremental offsets.
   584  
   585  For more details you can take a look at the various test cases in `find_subroutines_amd64_test.go` to see how
   586  the individual routines can be invoked (typically with a 64 byte input buffer that generates one or more 64-bit masks).
   587  
   588  There is one final routine, `find_structural_bits_in_slice`, that ties it all together and is
   589  invoked with a slice of the message buffer in order to find the incremental offsets.
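
To make the steps more concrete, here is a scalar sketch of the same idea for a single block of at most 64 bytes. It is written for this README as an illustration and is not the assembly used by `simdjson-go`: the function names are hypothetical, the prefix XOR replaces the carry-less multiplication, only `{ } [ ] : ,` are treated as structural, and offsets are taken relative to the block start.

```Go
package main

import (
	"fmt"
	"math/bits"
)

// prefixXor sets bit i of the result to the XOR of bits 0..i of x.
// Applied to the quote bits, it marks every byte from an opening quote
// up to, but not including, the closing quote.
func prefixXor(x uint64) uint64 {
	x ^= x << 1
	x ^= x << 2
	x ^= x << 4
	x ^= x << 8
	x ^= x << 16
	x ^= x << 32
	return x
}

// scanBlock returns the in-string mask and the structural mask for one
// block of at most 64 bytes.
func scanBlock(block []byte) (quoteMask, structurals uint64) {
	var quoteBits uint64
	backslashes := 0 // length of the backslash run directly before byte i
	for i, c := range block {
		bit := uint64(1) << uint(i)
		switch c {
		case '"':
			if backslashes%2 == 0 { // an even run means the quote is not escaped
				quoteBits |= bit
			}
		case '{', '}', '[', ']', ':', ',':
			structurals |= bit
		}
		if c == '\\' {
			backslashes++
		} else {
			backslashes = 0
		}
	}
	quoteMask = prefixXor(quoteBits)
	structurals &^= quoteMask // drop candidates that sit inside strings
	return quoteMask, structurals
}

// flattenIncremental emits, for every set bit, the distance from the
// previous set bit (or from the block start for the first one).
func flattenIncremental(mask uint64) []uint32 {
	var out []uint32
	prev := 0
	for mask != 0 {
		pos := bits.TrailingZeros64(mask)
		out = append(out, uint32(pos-prev))
		prev = pos
		mask &= mask - 1 // clear the lowest set bit
	}
	return out
}

func main() {
	_, structurals := scanBlock([]byte(`{"a,b":[1,"x\"]"]}`))
	fmt.Println(flattenIncremental(structurals)) // [0 6 1 2 7 1]
}
```

The comma inside `"a,b"` and the bracket after the escaped quote are masked out because they fall inside strings.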
   590  
   591  ### Stage 2
   592  
   593  During Stage 2 the tape structure is constructed.
   594  It is essentially a single function that jumps around as it finds the various structural characters
   595  and builds the hierarchy of the JSON document that it processes.
   596  The values of the JSON elements such as strings, integers, booleans etc. are parsed and written to the tape.
   597  
   598  Any errors (such as an array not being closed or a missing closing brace) are detected and reported back as errors to the client.
   599  
   600  ## Tape format
   601  
   602  Similarly to `simdjson`, `simdjson-go` parses the structure onto a 'tape' format.
   603  With this format it is possible to skip over arrays and (sub)objects as the sizes are recorded in the tape.
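
The skipping idea can be sketched on a toy tape. The layout below is an assumption made for illustration, loosely following the linked `simdjson` tape document (a tag in the top 8 bits, a 56-bit payload, and an opening bracket that stores the index just past its matching close); the real `simdjson-go` tape differs in details, and all names here are hypothetical.

```Go
package main

import "fmt"

const payloadMask = 1<<56 - 1

// entry packs a tag byte and a 56-bit payload into one tape word.
func entry(tag byte, payload uint64) uint64 {
	return uint64(tag)<<56 | payload&payloadMask
}

// exampleTape is a hand-built toy tape for [1,[true,false],null].
func exampleTape() []uint64 {
	return []uint64{
		entry('[', 9), // 0: outer array, next value starts at index 9
		entry('l', 0), // 1: integer tag
		1,             // 2: integer value
		entry('[', 7), // 3: inner array, next value starts at index 7
		entry('t', 0), // 4: true
		entry('f', 0), // 5: false
		entry(']', 3), // 6: close of inner array
		entry('n', 0), // 7: null
		entry(']', 0), // 8: close of outer array
	}
}

// skip returns the index of the tape word following the value at idx.
func skip(tape []uint64, idx int) int {
	tag, payload := byte(tape[idx]>>56), tape[idx]&payloadMask
	switch tag {
	case '{', '[':
		return int(payload) // jump over the whole container in one step
	case 'l':
		return idx + 2 // tag word plus one value word
	default:
		return idx + 1
	}
}

// countArrayElements counts the direct children of the array opened at
// index open without visiting the contents of nested containers.
func countArrayElements(tape []uint64, open int) int {
	n := 0
	for i := open + 1; byte(tape[i]>>56) != ']'; i = skip(tape, i) {
		n++
	}
	return n
}

func main() {
	tape := exampleTape()
	fmt.Println(countArrayElements(tape, 0)) // 3
	fmt.Println(countArrayElements(tape, 3)) // 2
}
```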
   604  
`simdjson-go` format is exactly the same as the `simdjson` [tape](https://github.com/lemire/simdjson/blob/master/doc/tape.md)
format with the following exceptions:

- In order to support ndjson, it is possible to have more than one root element on the tape.
  Also, to allow for fast navigation over root elements, a root points to the next root element
  (and as such the last root element points 1 index past the length of the tape).
- A "NOP" tag is added. The value contains the number of tape entries to skip forward for the next tag.
- Strings are handled differently: unlike `simdjson`, the string size is not prepended in the String buffer
  but is added as an additional element to the tape itself (much like integers and floats).
  - In case `WithCopyStrings(false)`: only strings that contain special characters are copied to the String buffer,
    in which case the payload from the tape is the offset into the String buffer.
    For string values without special characters the tape's payload points directly into the message buffer.
  - In case `WithCopyStrings(true)` (default): strings are always copied to the String buffer.
   620  
   621  For more information, see `TestStage2BuildTape` in `stage2_build_tape_test.go`.
   622  
   623  ## Fuzz Tests
   624  
   625  `simdjson-go` has been extensively fuzz tested to ensure that input cannot generate crashes and that output matches
   626  the standard library.
   627  
   628  The fuzz tests are included as Go 1.18+ compatible tests.
   629  
   630  ## License
   631  
   632  `simdjson-go` is released under the Apache License v2.0. You can find the complete text in the file LICENSE.
   633  
   634  ## Contributing
   635  
   636  Contributions are welcome, please send PRs for any enhancements.
   637  
If your PR includes parsing changes please run fuzz testers for a couple of hours.