# simdjson-go

## Introduction

This is a Golang port of [simdjson](https://github.com/lemire/simdjson),
a high performance JSON parser developed by Daniel Lemire and Geoff Langdale.
It makes extensive use of SIMD instructions to achieve parsing performance of gigabytes of JSON per second.

Performance-wise, `simdjson-go` runs on average at about 40% to 60% of the speed of simdjson.
Compared to Golang's standard package `encoding/json`, `simdjson-go` is about 10x faster.

[![Documentation](https://godoc.org/github.com/minio/simdjson-go?status.svg)](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc)

## Features

`simdjson-go` is a validating parser: among other things, it validates and checks numerical values, booleans, etc.
These values are therefore available as the appropriate `int` and `float64` representations after parsing.

Additionally `simdjson-go` has the following features:

- No 4 GB object limit
- Support for [ndjson](http://ndjson.org/) (newline delimited json)
- Pure Go (no need for cgo)

## Requirements

`simdjson-go` has the following requirements for parsing:

A CPU with both AVX2 and CLMUL is required (Haswell from 2013 onwards should do for Intel, for AMD a Ryzen/EPYC CPU (Q1 2017) should be sufficient).
This can be checked using the provided [`SupportedCPU()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#SupportedCPU) function.

The package does not provide a fallback for unsupported CPUs, but serialized data can be deserialized on an unsupported CPU.

When compiled with `gccgo`, the CPU is always reported as unsupported, since `gccgo` cannot compile the assembly.

## Usage

Run the following command to install `simdjson-go`:

```bash
go get -u github.com/minio/simdjson-go
```

In order to parse a JSON byte stream, you either call [`simdjson.Parse()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Parse)
or [`simdjson.ParseND()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParseND) for newline delimited JSON files.
Both of these functions return a [`ParsedJson`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParsedJson)
struct that can be used to navigate the JSON object by calling [`Iter()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParsedJson.Iter).

Using the type [`Iter`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Iter) you can call
[`Advance()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Iter.Advance) to iterate over the tape, like so:

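```Go
// Fragment: assumes iter comes from ParsedJson.Iter() and that tmp (*simdjson.Iter),
// obj (*simdjson.Object), elem (simdjson.Element), key (string) and err (error)
// are declared by the surrounding function.
for {
    typ := iter.Advance()

    switch typ {
    case simdjson.TypeRoot:
        // Step into the root element; tmp is reused as the destination.
        if typ, tmp, err = iter.Root(tmp); err != nil {
            return
        }

        if typ == simdjson.TypeObject {
            if obj, err = tmp.Object(obj); err != nil {
                return
            }

            // FindKey returns nil when the key is not present.
            e := obj.FindKey(key, &elem)
            if e != nil && elem.Type == simdjson.TypeString {
                v, _ := elem.Iter.StringBytes()
                fmt.Println(string(v))
            }
        }

    default:
        // Anything other than a root (including TypeNone at the end): stop.
        return
    }
}
```
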
When you advance the `Iter` you get the next type currently queued.

Each type then has helpers to access the data. When you get a type you can use these to access the data:

| Type       | Action on Iter             |
|------------|----------------------------|
| TypeNone   | Nothing follows. Iter done |
| TypeNull   | Null value                 |
| TypeString | `String()`/`StringBytes()` |
| TypeInt    | `Int()`/`Float()`          |
| TypeUint   | `Uint()`/`Float()`         |
| TypeFloat  | `Float()`                  |
| TypeBool   | `Bool()`                   |
| TypeObject | `Object()`                 |
| TypeArray  | `Array()`                  |
| TypeRoot   | `Root()`                   |

You can also get the next value as an `interface{}` using the [`Interface()`](https://pkg.go.dev/github.com/minio/simdjson-go#Iter.Interface) method.

Note that arrays and objects that are null are always returned as `TypeNull`.

The complex types return helpers that will help parse each of the underlying structures.

It is up to you to keep track of the nesting level you are operating at.

For any `Iter` it is possible to marshal the recursive content of the Iter using
[`MarshalJSON()`](https://pkg.go.dev/github.com/minio/simdjson-go#Iter.MarshalJSON) or
[`MarshalJSONBuffer(...)`](https://pkg.go.dev/github.com/minio/simdjson-go#Iter.MarshalJSONBuffer).

Currently, it is not possible to unmarshal into structs.

## Parsing Objects

If you are only interested in one key in an object you can use `FindKey` to quickly select it.

An object can be traversed manually by using `NextElement(dst *Iter) (name string, t Type, err error)`.
The key of the element is returned as a string together with the type of the value,
and the provided `Iter` will contain an iterator that gives access to the content.

There is also a `NextElementBytes`, which provides the same but without the need to allocate a string.

All elements of the object can be retrieved using a pretty lightweight [`Parse`](https://pkg.go.dev/github.com/minio/simdjson-go#Object.Parse),
which provides a map of all keys and all elements as a slice.

All elements of the object can be returned as `map[string]interface{}` using the `Map` method on the object.
This will naturally perform allocations for all elements.

## Parsing Arrays

[Arrays](https://pkg.go.dev/github.com/minio/simdjson-go#Array) in JSON can have mixed types.
To iterate over the array with mixed types use the [`Iter`](https://pkg.go.dev/github.com/minio/simdjson-go#Array.Iter)
method to get an iterator.

There are methods that allow you to retrieve all elements as a single type:
`[]int64`, `[]uint64`, `[]float64` and `[]string`.

## Number parsing

Numbers in JSON are untyped and are returned by the following rules in order:

* If there is any floating point notation, like exponents, or a dot notation, it is always returned as float.
* If the number is a pure integer and it fits within an int64 it is returned as such.
* If the number is a pure positive integer and fits within a uint64 it is returned as such.
* If the number is a valid number it is returned as float64.

If the number was converted from integer notation to a float due to not fitting inside int64/uint64
the `FloatOverflowedInteger` flag is set, which can be retrieved using the `(Iter).FloatFlags()` method.
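
The rules above can be approximated with the standard library alone. The following is a minimal sketch, not the parser's actual code: `classify` is a hypothetical helper, and it assumes the literal has already been validated as a JSON number (`strconv` accepts some inputs, such as `Infinity`, that JSON does not).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// classify applies the documented rules in order and reports the resulting
// kind plus whether an integer literal had to fall back to float64.
// Hypothetical helper for illustration; input is assumed to be a valid JSON number.
func classify(lit string) (kind string, overflowed bool) {
	// Rule 1: a dot or an exponent always means float.
	if strings.ContainsAny(lit, ".eE") {
		if _, err := strconv.ParseFloat(lit, 64); err != nil {
			return "invalid", false
		}
		return "float64", false
	}
	// Rule 2: pure integer that fits in int64.
	if _, err := strconv.ParseInt(lit, 10, 64); err == nil {
		return "int64", false
	}
	// Rule 3: pure positive integer that fits in uint64.
	if _, err := strconv.ParseUint(lit, 10, 64); err == nil {
		return "uint64", false
	}
	// Rule 4: integer notation that only fits in a float64.
	if _, err := strconv.ParseFloat(lit, 64); err == nil {
		return "float64", true
	}
	return "invalid", false
}

func main() {
	for _, lit := range []string{"42", "9223372036854775808", "18446744073709551616", "1e3"} {
		kind, overflowed := classify(lit)
		fmt.Println(lit, kind, overflowed)
	}
}
```

The second return value plays the role that the `FloatOverflowedInteger` flag plays in the library: it is only set when an integer literal ended up as a float.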

JSON numbers follow JavaScript’s double-precision floating-point format:

* Represented in base 10 with no superfluous leading zeros (e.g. 67, 1, 100).
* Include digits between 0 and 9.
* Can be a negative number (e.g. -10).
* Can be a fraction (e.g. 0.5); a leading digit is required, so `.5` is not valid.
* Can also have an exponent of 10, prefixed by e or E, with an optional plus or minus sign to indicate positive or negative exponentiation.
* Octal and hexadecimal formats are not supported.
* Cannot have a value of NaN (Not A Number) or Infinity.
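
These grammar rules can be checked quickly with `json.Valid` from the standard library. This is only a sketch for experimenting with literals; `validNumber` is a hypothetical helper and is not part of `simdjson-go`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// validNumber reports whether lit is a JSON number literal.
// It first checks that the literal starts like a number (so that "true" or
// "null" are rejected) and then lets encoding/json validate the grammar.
func validNumber(lit string) bool {
	if lit == "" {
		return false
	}
	c := lit[0]
	if c != '-' && (c < '0' || c > '9') {
		return false
	}
	return json.Valid([]byte(lit))
}

func main() {
	for _, lit := range []string{"67", "-10", "0.5", "1E+3", ".5", "012", "0x1F", "NaN", "Infinity"} {
		fmt.Printf("%-10s %v\n", lit, validNumber(lit))
	}
}
```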

## Parsing NDJSON stream

Newline delimited JSON is sent as packets with each line being a root element.

Here is an example that counts the number of `"Make": "HOND"` in NDJSON similar to this:

```
{"Age":20, "Make": "HOND"}
{"Age":22, "Make": "TLSA"}
```

```Go
import (
	"bytes"
	"fmt"
	"io"
	"log"

	"github.com/minio/simdjson-go"
)

func findHondas(r io.Reader) {
	// Temp values, reused across iterations to avoid allocations.
	var tmpO simdjson.Object
	var tmpE simdjson.Element
	var tmpI simdjson.Iter
	var nFound int

	// Communication
	reuse := make(chan *simdjson.ParsedJson, 10)
	res := make(chan simdjson.Stream, 10)

	simdjson.ParseNDStream(r, res, reuse)
	// Read results in blocks...
	for got := range res {
		if got.Error != nil {
			if got.Error == io.EOF {
				break
			}
			log.Fatal(got.Error)
		}

		all := got.Value.Iter()
		// NDJSON is a sequence of root objects.
		for all.Advance() == simdjson.TypeRoot {
			// Read inside root.
			t, i, err := all.Root(&tmpI)
			if err != nil {
				log.Println("got err", err)
				continue
			}
			if t != simdjson.TypeObject {
				log.Println("got type", t.String())
				continue
			}

			// Prepare object.
			obj, err := i.Object(&tmpO)
			if err != nil {
				log.Println("got err", err)
				continue
			}

			// Find Make key; FindKey returns nil if the key is missing.
			elem := obj.FindKey("Make", &tmpE)
			if elem == nil || elem.Type != simdjson.TypeString {
				log.Println("Make key missing or not a string")
				continue
			}

			// Get value as bytes.
			asB, err := elem.Iter.StringBytes()
			if err != nil {
				log.Println("got err", err)
				continue
			}
			if bytes.Equal(asB, []byte("HOND")) {
				nFound++
			}
		}
		// Hand the parsed block back so its memory can be reused.
		reuse <- got.Value
	}
	fmt.Println("Found", nFound, "Hondas")
}
```
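
When testing code like the example above, it can help to have a slow but simple baseline to compare counts against. The following sketch does the same count with only `bufio` and `encoding/json`; `countMake` and `countMakeIn` are hypothetical helpers, and the default `bufio.Scanner` line limit (64 KB) is assumed to be sufficient.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// countMake counts NDJSON records whose "Make" field equals want.
// Standard-library baseline only; it does not use simdjson-go.
func countMake(r io.Reader, want string) (int, error) {
	sc := bufio.NewScanner(r)
	n := 0
	for sc.Scan() {
		line := bytes.TrimSpace(sc.Bytes())
		if len(line) == 0 {
			continue // skip blank lines between records
		}
		var rec struct{ Make string }
		if err := json.Unmarshal(line, &rec); err != nil {
			return n, err
		}
		if rec.Make == want {
			n++
		}
	}
	return n, sc.Err()
}

// countMakeIn is a convenience wrapper for in-memory input.
func countMakeIn(ndjson, want string) (int, error) {
	return countMake(strings.NewReader(ndjson), want)
}

func main() {
	n, err := countMakeIn("{\"Age\":20, \"Make\": \"HOND\"}\n{\"Age\":22, \"Make\": \"TLSA\"}\n", "HOND")
	fmt.Println(n, err)
}
```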

More examples can be found in the examples subdirectory and further documentation can be found at [godoc](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc).

## Serializing parsed JSON

It is possible to serialize parsed JSON for more compact storage and faster load time.

To create a new serializer, use [NewSerializer](https://pkg.go.dev/github.com/minio/simdjson-go#NewSerializer).
This serializer can be reused for several JSON blocks.

The serializer will provide string deduplication and compression of elements.
This can be fine-tuned using the [`CompressMode`](https://pkg.go.dev/github.com/minio/simdjson-go#Serializer.CompressMode) setting.

To serialize a block of parsed data use the [`Serialize`](https://pkg.go.dev/github.com/minio/simdjson-go#Serializer.Serialize) method.

To read back use the [`Deserialize`](https://pkg.go.dev/github.com/minio/simdjson-go#Serializer.Deserialize) method.
For deserializing, the compression mode does not need to match since it is read from the stream.

Example of speed for serializer/deserializer on [`parking-citations-1M`](https://files.klauspost.com/compress/parking-citations-1M.json.zst).

| Compress Mode | % of JSON size | Serialize Speed | Deserialize Speed |
|---------------|----------------|-----------------|-------------------|
| None          | 177.26%        | 425.70 MB/s     | 2334.33 MB/s      |
| Fast          | 17.20%         | 412.75 MB/s     | 1234.76 MB/s      |
| Default       | 16.85%         | 411.59 MB/s     | 1242.09 MB/s      |
| Best          | 10.91%         | 337.17 MB/s     | 806.23 MB/s       |

In some cases the speed difference and compression difference will be bigger.

## Performance vs simdjson

Based on the same set of JSON test files, the graph below shows a comparison between `simdjson` and `simdjson-go`.

![simdjson-vs-go-comparison](chart/simdjson-vs-simdjson-go.png)

These numbers were measured on a MacBook Pro equipped with a 3.1 GHz Intel Core i7.
Also, to make it a fair comparison, the constant `GOLANG_NUMBER_PARSING` was set to `false` (default is `true`)
in order to use the same number parsing function (which is faster at the expense of some precision; see more below).

In addition the constant `ALWAYS_COPY_STRINGS` was set to `false` (default is `true`) for non-streaming use case
scenarios where the full JSON message is kept in memory (similar to the `simdjson` behaviour).

## Performance vs `encoding/json` and `json-iterator/go`

Below is a performance comparison to Golang's standard package `encoding/json` based on the same set of JSON test files.

```
$ benchcmp                    encoding_json.txt      simdjson-go.txt
benchmark                     old MB/s               new MB/s         speedup
BenchmarkApache_builds-8      106.77                  948.75           8.89x
BenchmarkCanada-8              54.39                  519.85           9.56x
BenchmarkCitm_catalog-8       100.44                 1565.28          15.58x
BenchmarkGithub_events-8      159.49                  848.88           5.32x
BenchmarkGsoc_2018-8          152.93                 2515.59          16.45x
BenchmarkInstruments-8         82.82                  811.61           9.80x
BenchmarkMarine_ik-8           48.12                  422.43           8.78x
BenchmarkMesh-8                49.38                  371.39           7.52x
BenchmarkMesh_pretty-8         73.10                  784.89          10.74x
BenchmarkNumbers-8            160.69                  434.85           2.71x
BenchmarkRandom-8              66.56                  615.12           9.24x
BenchmarkTwitter-8             79.05                 1193.47          15.10x
BenchmarkTwitterescaped-8      83.96                  536.19           6.39x
BenchmarkUpdate_center-8       73.92                  860.52          11.64x
```

`simdjson-go` also uses less additional memory and fewer allocations.

Here is another benchmark comparison to `json-iterator/go`:

```
$ benchcmp                    json-iterator.txt      simdjson-go.txt
benchmark                     old MB/s               new MB/s         speedup
BenchmarkApache_builds-8      154.65                  948.75           6.13x
BenchmarkCanada-8              40.34                  519.85          12.89x
BenchmarkCitm_catalog-8       183.69                 1565.28           8.52x
BenchmarkGithub_events-8      170.77                  848.88           4.97x
BenchmarkGsoc_2018-8          225.13                 2515.59          11.17x
BenchmarkInstruments-8        120.39                  811.61           6.74x
BenchmarkMarine_ik-8           61.71                  422.43           6.85x
BenchmarkMesh-8                50.66                  371.39           7.33x
BenchmarkMesh_pretty-8         90.36                  784.89           8.69x
BenchmarkNumbers-8             52.61                  434.85           8.27x
BenchmarkRandom-8              85.87                  615.12           7.16x
BenchmarkTwitter-8            139.57                 1193.47           8.55x
BenchmarkTwitterescaped-8     102.28                  536.19           5.24x
BenchmarkUpdate_center-8      101.41                  860.52           8.49x
```

## AVX512 Acceleration

Stage 1 has been optimized using AVX512 instructions. Under full CPU load (8 threads) the AVX512 code is about 1 GB/sec (15%) faster than the AVX2 code.

```
benchmark                                   AVX2 MB/s    AVX512 MB/s     speedup
BenchmarkFindStructuralBitsParallelLoop      7225.24      8302.96         1.15x
```

These benchmarks were generated on a c5.2xlarge EC2 instance with a Xeon Platinum 8124M CPU at 3.0 GHz.

## Design

`simdjson-go` follows the same two stage design as `simdjson`.
During the first stage the structural elements (`{`, `}`, `[`, `]`, `:`, and `,`)
are detected and forwarded as offsets in the message buffer to the second stage.
The second stage builds a tape format of the structure of the JSON document.

Note that in contrast to `simdjson`, `simdjson-go` outputs `uint32`
increments (as opposed to absolute values) to the second stage.
This allows arbitrarily large JSON files to be parsed (as long as a single (string) element does not surpass 4 GB).
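
To see why increments lift the size limit, consider this small sketch. It is not taken from the library: `toDeltas` and `toAbsolute` are hypothetical helpers, and the sketch assumes that consecutive positions are less than 4 GB apart.

```go
package main

import "fmt"

// toDeltas converts ascending absolute byte positions into uint32 increments.
// Assumes every gap between consecutive positions fits in 32 bits.
func toDeltas(positions []uint64) []uint32 {
	deltas := make([]uint32, len(positions))
	var prev uint64
	for i, p := range positions {
		deltas[i] = uint32(p - prev)
		prev = p
	}
	return deltas
}

// toAbsolute rebuilds the 64-bit positions by accumulating the increments.
func toAbsolute(deltas []uint32) []uint64 {
	abs := make([]uint64, len(deltas))
	var pos uint64
	for i, d := range deltas {
		pos += uint64(d)
		abs[i] = pos
	}
	return abs
}

func main() {
	// The last position lies beyond what a uint32 absolute offset could hold.
	positions := []uint64{0, 4294967290, 4294967300}
	deltas := toDeltas(positions)
	fmt.Println(deltas)
	fmt.Println(toAbsolute(deltas))
}
```

Each increment stays within 32 bits even though the running position does not, which is the property the design relies on.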

Also, for better performance, both stages run concurrently as separate goroutines
and a Go channel is used to communicate between the two stages.
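
The producer/consumer shape of this design can be sketched as follows. This is a heavily simplified illustration, not the library's code: `stage1` and `maxDepth` are hypothetical names, positions are sent one at a time rather than in batches, and quoted strings are ignored when looking for structural characters.

```go
package main

import "fmt"

// stage1 stands in for the SIMD scanner: it sends the position of every
// structural character over the channel and closes it when done.
// Simplification: characters inside strings are not excluded.
func stage1(buf []byte, out chan<- int) {
	defer close(out)
	for i, c := range buf {
		switch c {
		case '{', '}', '[', ']', ':', ',':
			out <- i
		}
	}
}

// maxDepth stands in for stage 2: it consumes positions while stage 1 is
// still running and tracks the nesting depth of the document.
func maxDepth(buf []byte) int {
	ch := make(chan int, 64)
	go stage1(buf, ch)

	depth, deepest := 0, 0
	for pos := range ch {
		switch buf[pos] {
		case '{', '[':
			depth++
			if depth > deepest {
				deepest = depth
			}
		case '}', ']':
			depth--
		}
	}
	return deepest
}

func main() {
	fmt.Println(maxDepth([]byte(`{"a":[1,{"b":2}]}`)))
}
```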

### Stage 1

Stage 1 has been converted from the original C code (containing the SIMD intrinsics) to Golang assembly using [c2goasm](https://github.com/minio/c2goasm).
It essentially consists of five separate steps, being:

- `find_odd_backslash_sequences`: detect backslash characters used to escape quotes
- `find_quote_mask_and_bits`: generate a mask with bits turned on for characters between quotes
- `find_whitespace_and_structurals`: generate a mask for whitespace plus a mask for the structural characters
- `finalize_structurals`: combine the masks computed above into a final mask where each active bit represents the position of a structural character in the input message.
- `flatten_bits_incremental`: output the active bits in the final mask as incremental offsets.

For more details you can take a look at the various test cases in `find_subroutines_amd64_test.go` to see how
the individual routines can be invoked (typically with a 64 byte input buffer that generates one or more 64-bit masks).

There is one final routine, `find_structural_bits_in_slice`, that ties it all together and is
invoked with a slice of the message buffer in order to find the incremental offsets.
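
A scalar sketch may help to picture what these masks look like for one 64-byte block. It is an illustration under simplifying assumptions, not a port of the assembly: escaped quotes and whitespace are ignored, the final mask contains only the six structural characters, and `prefixXor`, `masks` and `flatten` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/bits"
)

// prefixXor sets bit i to the XOR of bits 0..i of x. Applied to the quote
// bits this turns on every bit from an opening quote up to (not including)
// its closing quote. SIMD code can get the same result with a carry-less multiply.
func prefixXor(x uint64) uint64 {
	x ^= x << 1
	x ^= x << 2
	x ^= x << 4
	x ^= x << 8
	x ^= x << 16
	x ^= x << 32
	return x
}

// masks builds the in-string mask and the structural mask for up to 64 bytes.
// Simplification: backslash escapes are not handled.
func masks(block []byte) (inString, structurals uint64) {
	if len(block) > 64 {
		block = block[:64]
	}
	var quotes uint64
	for i, c := range block {
		switch c {
		case '"':
			quotes |= 1 << uint(i)
		case '{', '}', '[', ']', ':', ',':
			structurals |= 1 << uint(i)
		}
	}
	inString = prefixXor(quotes)
	structurals &^= inString // drop structural characters inside strings
	return inString, structurals
}

// flatten emits, for every set bit, the distance to the previously emitted position.
func flatten(mask uint64, last int) ([]uint32, int) {
	var deltas []uint32
	for mask != 0 {
		pos := bits.TrailingZeros64(mask)
		deltas = append(deltas, uint32(pos-last))
		last = pos
		mask &= mask - 1 // clear the lowest set bit
	}
	return deltas, last
}

func main() {
	inString, structurals := masks([]byte(`{"a:b":[1,2]}`))
	deltas, _ := flatten(structurals, 0)
	fmt.Printf("in-string   %064b\nstructurals %064b\ndeltas      %v\n", inString, structurals, deltas)
}
```

Note how the `:` inside the key `"a:b"` is removed from the structural mask because it falls within the in-string mask.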

### Stage 2

During Stage 2 the tape structure is constructed.
It is essentially a single function that jumps around as it finds the various structural characters
and builds the hierarchy of the JSON document that it processes.
The values of the JSON elements such as strings, integers, booleans etc. are parsed and written to the tape.

Any errors (such as an array not being closed or a missing closing brace) are detected and reported back as errors to the client.
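
The kind of structural error mentioned above can be illustrated with a small stack-based check over the structural characters. This is only a sketch of the idea; `checkNesting` is a hypothetical helper and the real stage 2 is organized differently (as a single state machine that also parses values).

```go
package main

import "fmt"

// checkNesting verifies that every '{' and '[' in the sequence of structural
// characters is closed by the matching '}' or ']'. Other characters are ignored.
func checkNesting(structurals []byte) error {
	var stack []byte
	for i, c := range structurals {
		switch c {
		case '{', '[':
			stack = append(stack, c)
		case '}', ']':
			if len(stack) == 0 {
				return fmt.Errorf("unexpected %q at index %d", c, i)
			}
			open := stack[len(stack)-1]
			stack = stack[:len(stack)-1]
			if (c == '}') != (open == '{') {
				return fmt.Errorf("%q at index %d does not close %q", c, i, open)
			}
		}
	}
	if len(stack) != 0 {
		return fmt.Errorf("%d unclosed object(s) or array(s)", len(stack))
	}
	return nil
}

func main() {
	for _, s := range []string{"{[]}", "{[", "{]", "]"} {
		fmt.Printf("%-5s %v\n", s, checkNesting([]byte(s)))
	}
}
```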

## Tape format

Similarly to `simdjson`, `simdjson-go` parses the structure onto a 'tape' format.
With this format it is possible to skip over arrays and (sub)objects as the sizes are recorded in the tape.

The `simdjson-go` format is exactly the same as the `simdjson` [tape](https://github.com/lemire/simdjson/blob/master/doc/tape.md)
format with the following 2 exceptions:

- In order to support ndjson, it is possible to have more than one root element on the tape.
  Also, to allow for fast navigation over root elements, a root points to the next root element
  (and as such the last root element points 1 index past the length of the tape).

- Strings are handled differently: unlike `simdjson`, the string size is not prepended in the String buffer
  but is added as an additional element to the tape itself (much like integers and floats).
  - In case `ALWAYS_COPY_STRINGS` is `false`: Only strings that contain special characters are copied to the String buffer,
    in which case the payload from the tape is the offset into the String buffer.
    For string values without special characters the tape's payload points directly into the message buffer.
  - In case `ALWAYS_COPY_STRINGS` is `true` (default): Strings are always copied to the String buffer.

For more information, see `TestStage2BuildTape` in `stage2_build_tape_test.go`.
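
The two exceptions can be pictured with a toy tape. The layout below is a simplified assumption for illustration only (one tag byte in the top 8 bits, a 56-bit payload, one root entry per document, a raw length word after each string entry); `entry`, `countRoots` and `stringAt` are hypothetical helpers, and the test file named above remains the authoritative reference for the real layout.

```go
package main

import "fmt"

const payloadMask = 1<<56 - 1

// entry packs a one-byte tag and a 56-bit payload into a single tape word.
func entry(tag byte, payload uint64) uint64 {
	return uint64(tag)<<56 | payload&payloadMask
}

// countRoots hops from root to root using the link stored in each root entry,
// without visiting the entries in between.
func countRoots(tape []uint64) int {
	n := 0
	for i := uint64(0); i < uint64(len(tape)); {
		if byte(tape[i]>>56) != 'r' {
			break
		}
		n++
		next := tape[i] & payloadMask
		if next <= i {
			break // malformed link; avoid looping forever
		}
		i = next
	}
	return n
}

// stringAt reads a string entry at index i: the payload is an offset into buf
// and the following tape word holds the length.
func stringAt(tape []uint64, i int, buf []byte) string {
	off := tape[i] & payloadMask
	length := tape[i+1]
	return string(buf[off : off+length])
}

func main() {
	buf := []byte("\"HOND\"\ntrue") // two NDJSON documents
	tape := []uint64{
		entry('r', 3), // root 0, next root at index 3
		entry('"', 1), // string, offset 1 into buf
		4,             // string length
		entry('r', 5), // root 1, link points past the end of the tape
		entry('t', 0), // true
	}
	fmt.Println(countRoots(tape), stringAt(tape, 1, buf))
}
```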

## Non streaming use cases

The best performance is obtained by keeping the JSON message fully mapped in memory and setting the
`ALWAYS_COPY_STRINGS` constant to `false`. This prevents duplicate copies of string values being made
but mandates that the original JSON buffer is kept alive until the `ParsedJson` object is no longer needed
(i.e. iteration over the tape format has been completed).

In case the JSON message buffer is freed earlier (or for streaming use cases where memory is reused)
`ALWAYS_COPY_STRINGS` should be set to `true` (which is the default behaviour).
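
The trade-off is the usual one between aliasing and copying a Go slice. The following standalone sketch (plain Go, no `simdjson-go` calls; `demo` is a hypothetical function) shows what happens to a zero-copy view when the underlying buffer is reused.

```go
package main

import "fmt"

// demo returns the contents of a zero-copy view and of a private copy after
// the message buffer they were taken from has been overwritten.
func demo() (aliased, copied string) {
	buf := []byte(`{"Make":"HOND"}`)

	view := buf[9:13]                          // points into buf, no allocation
	owned := append([]byte(nil), buf[9:13]...) // independent copy

	// Simulate a streaming reader reusing the buffer for the next message.
	copy(buf, `{"Make":"TLSA"}`)

	return string(view), string(owned)
}

func main() {
	aliased, copied := demo()
	fmt.Println("view:", aliased, "copy:", copied)
}
```

The view now reflects the new message while the copy still holds the original value, which is why strings must be copied whenever the buffer does not outlive the parsed result.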

## Fuzz Tests

`simdjson-go` has been extensively fuzz tested to ensure that input cannot generate crashes and that output matches
the standard library.

The fuzzers and corpus are contained in a separate repository at [github.com/minio/simdjson-fuzz](https://github.com/minio/simdjson-fuzz).

The repo contains information on how to run them.

## License

`simdjson-go` is released under the Apache License v2.0. You can find the complete text in the file LICENSE.

## Contributing

Contributions are welcome, please send PRs for any enhancements.

If your PR includes parsing changes, please run the fuzz testers for a couple of hours.