# simdjson-go

## Introduction

This is a Golang port of [simdjson](https://github.com/lemire/simdjson),
a high performance JSON parser developed by Daniel Lemire and Geoff Langdale.
It makes extensive use of SIMD instructions to achieve parsing performance of gigabytes of JSON per second.

Performance-wise, `simdjson-go` runs on average at about 40% to 60% of the speed of simdjson.
Compared to Golang's standard package `encoding/json`, `simdjson-go` is about 10x faster.

[![Documentation](https://godoc.org/github.com/minio/simdjson-go?status.svg)](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc)

## Features

`simdjson-go` is a validating parser, meaning that it validates and checks numerical values, booleans and other values while parsing.
Therefore these values are available as the appropriate `int` and `float64` representations after parsing.

Additionally `simdjson-go` has the following features:

- No 4 GB object limit
- Support for [ndjson](http://ndjson.org/) (newline delimited json)
- Pure Go (no need for cgo)

## Requirements

`simdjson-go` has the following requirements for parsing:

A CPU with both AVX2 and CLMUL is required (Haswell from 2013 onwards should do for Intel, for AMD a Ryzen/EPYC CPU (Q1 2017) should be sufficient).
This can be checked using the provided [`SupportedCPU()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#SupportedCPU) function.

The package does not provide a fallback for unsupported CPUs, but serialized data can be deserialized on an unsupported CPU.

Using `gccgo` will also always report an unsupported CPU since it cannot compile assembly.

## Usage

Run the following command in order to install `simdjson-go`

```bash
go get -u github.com/minio/simdjson-go
```

In order to parse a JSON byte stream, you either call [`simdjson.Parse()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Parse)
or [`simdjson.ParseND()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParseND) for newline delimited JSON files.
Both of these functions return a [`ParsedJson`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParsedJson)
struct that can be used to navigate the JSON object by calling [`Iter()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParsedJson.Iter).

Using the type [`Iter`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Iter) you can call
[`Advance()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Iter.Advance) to iterate over the tape, like so:

```Go
// Fragment: iter, tmp, obj, elem, key and err are assumed to be declared by the caller.
for {
    typ := iter.Advance()

    switch typ {
    case simdjson.TypeRoot:
        // Step into the root element.
        if typ, tmp, err = iter.Root(tmp); err != nil {
            return
        }

        if typ == simdjson.TypeObject {
            if obj, err = tmp.Object(obj); err != nil {
                return
            }

            // Look up a single key and print it if it is a string.
            e := obj.FindKey(key, &elem)
            if e != nil && elem.Type == simdjson.TypeString {
                v, _ := elem.Iter.StringBytes()
                fmt.Println(string(v))
            }
        }

    default:
        return
    }
}
```

When you advance the Iter you get the next type currently queued.

Each type then has helpers to access the data. When you get a type you can use these to access the data:

| Type       | Action on Iter             |
|------------|----------------------------|
| TypeNone   | Nothing follows. Iter done |
| TypeNull   | Null value                 |
| TypeString | `String()`/`StringBytes()` |
| TypeInt    | `Int()`/`Float()`          |
| TypeUint   | `Uint()`/`Float()`         |
| TypeFloat  | `Float()`                  |
| TypeBool   | `Bool()`                   |
| TypeObject | `Object()`                 |
| TypeArray  | `Array()`                  |
| TypeRoot   | `Root()`                   |

You can also get the next value as an `interface{}` using the [Interface()](https://pkg.go.dev/github.com/minio/simdjson-go#Iter.Interface) method.

Note that arrays and objects that are null are always returned as `TypeNull`.

The complex types return helpers that will help parse each of the underlying structures.

It is up to you to keep track of the nesting level you are operating at.

For any `Iter` it is possible to marshal the recursive content of the Iter using
[`MarshalJSON()`](https://pkg.go.dev/github.com/minio/simdjson-go#Iter.MarshalJSON) or
[`MarshalJSONBuffer(...)`](https://pkg.go.dev/github.com/minio/simdjson-go#Iter.MarshalJSONBuffer).

Currently, it is not possible to unmarshal into structs.

## Parsing Objects

If you are only interested in one key in an object you can use `FindKey` to quickly select it.

An object can be traversed manually by using `NextElement(dst *Iter) (name string, t Type, err error)`.
The key of the element is returned as a string together with the type of the value,
and the provided `Iter` will contain an iterator which allows access to the content.

There is a `NextElementBytes` which provides the same, but without the need to allocate a string.

All elements of the object can be retrieved using a pretty lightweight [`Parse`](https://pkg.go.dev/github.com/minio/simdjson-go#Object.Parse)
which provides a map of all keys and all elements as a slice.

All elements of the object can be returned as `map[string]interface{}` using the `Map` method on the object.
This will naturally perform allocations for all elements.

## Parsing Arrays

[Arrays](https://pkg.go.dev/github.com/minio/simdjson-go#Array) in JSON can have mixed types.
To iterate over the array with mixed types use the [`Iter`](https://pkg.go.dev/github.com/minio/simdjson-go#Array.Iter)
method to get an iterator.

There are methods that allow you to retrieve all elements as a single type:
`[]int64`, `[]uint64`, `[]float64` and `[]string`.

## Number parsing

Numbers in JSON are untyped and are returned by the following rules in order:

* If there is any floating point notation, like an exponent or a decimal point, it is always returned as float.
* If the number is a pure integer and it fits within an int64 it is returned as such.
* If the number is a pure positive integer and fits within a uint64 it is returned as such.
* If the number is otherwise a valid number it is returned as float64.

If the number was converted from integer notation to a float due to not fitting inside int64/uint64
the `FloatOverflowedInteger` flag is set, which can be retrieved using the `(Iter).FloatFlags()` method.

JSON numbers follow JavaScript’s double-precision floating-point format.

* Represented in base 10 with no superfluous leading zeros (e.g. 67, 1, 100).
* Include digits between 0 and 9.
* Can be a negative number (e.g. -10).
* Can be a fraction (e.g. 0.5).
* Can also have an exponent of 10, prefixed by e or E, with an optional plus or minus sign to indicate positive or negative exponentiation.
* Octal and hexadecimal formats are not supported.
* Cannot have a value of NaN (Not A Number) or Infinity.

## Parsing NDJSON stream

Newline delimited json is sent as packets with each line being a root element.

Here is an example that counts the number of `"Make": "HOND"` in NDJSON similar to this:

```
{"Age":20, "Make": "HOND"}
{"Age":22, "Make": "TLSA"}
```

```Go
import (
    "bytes"
    "fmt"
    "io"
    "log"

    "github.com/minio/simdjson-go"
)

func findHondas(r io.Reader) {
    // Temp values.
    var tmpO simdjson.Object
    var tmpE simdjson.Element
    var tmpI simdjson.Iter
    var nFound int

    // Communication
    reuse := make(chan *simdjson.ParsedJson, 10)
    res := make(chan simdjson.Stream, 10)

    simdjson.ParseNDStream(r, res, reuse)
    // Read results in blocks...
    for got := range res {
        if got.Error != nil {
            if got.Error == io.EOF {
                break
            }
            log.Fatal(got.Error)
        }

        all := got.Value.Iter()
        // NDJSON is a sequence of root elements.
        for all.Advance() == simdjson.TypeRoot {
            // Read inside root.
            t, i, err := all.Root(&tmpI)
            if err != nil {
                log.Println("got err", err)
                continue
            }
            if t != simdjson.TypeObject {
                log.Println("got type", t.String())
                continue
            }

            // Prepare object.
            obj, err := i.Object(&tmpO)
            if err != nil {
                log.Println("got err", err)
                continue
            }

            // Find Make key; FindKey returns nil when the key is absent.
            elem := obj.FindKey("Make", &tmpE)
            if elem == nil || elem.Type != simdjson.TypeString {
                log.Println("Make key missing or not a string")
                continue
            }

            // Get value as bytes.
            asB, err := elem.Iter.StringBytes()
            if err != nil {
                log.Println("got err", err)
                continue
            }
            if bytes.Equal(asB, []byte("HOND")) {
                nFound++
            }
        }
        // Hand the parsed block back so its memory can be reused.
        reuse <- got.Value
    }
    fmt.Println("Found", nFound, "Hondas")
}
```

More examples can be found in the examples subdirectory and further documentation can be found at [godoc](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc).

## Serializing parsed json

It is possible to serialize parsed JSON for more compact storage and faster load time.

To create a new serializer use [NewSerializer](https://pkg.go.dev/github.com/minio/simdjson-go#NewSerializer).
This serializer can be reused for several JSON blocks.

The serializer will provide string deduplication and compression of elements.
This can be fine-tuned using the [`CompressMode`](https://pkg.go.dev/github.com/minio/simdjson-go#Serializer.CompressMode) setting.

To serialize a block of parsed data use the [`Serialize`](https://pkg.go.dev/github.com/minio/simdjson-go#Serializer.Serialize) method.

To read back use the [`Deserialize`](https://pkg.go.dev/github.com/minio/simdjson-go#Serializer.Deserialize) method.
For deserializing the compression mode does not need to match since it is read from the stream.

Example of speed for serializer/deserializer on [`parking-citations-1M`](https://files.klauspost.com/compress/parking-citations-1M.json.zst).

| Compress Mode | % of JSON size | Serialize Speed | Deserialize Speed |
|---------------|----------------|-----------------|-------------------|
| None          | 177.26%        | 425.70 MB/s     | 2334.33 MB/s      |
| Fast          | 17.20%         | 412.75 MB/s     | 1234.76 MB/s      |
| Default       | 16.85%         | 411.59 MB/s     | 1242.09 MB/s      |
| Best          | 10.91%         | 337.17 MB/s     | 806.23 MB/s       |

In some cases the speed difference and compression difference will be bigger.

## Performance vs simdjson

Based on the same set of JSON test files, the graph below shows a comparison between `simdjson` and `simdjson-go`.

![simdjson-vs-go-comparison](chart/simdjson-vs-simdjson-go.png)

These numbers were measured on a MacBook Pro equipped with a 3.1 GHz Intel Core i7.
Also, to make it a fair comparison, the constant `GOLANG_NUMBER_PARSING` was set to `false` (default is `true`)
in order to use the same number parsing function (which is faster at the expense of some precision; see more below).

In addition the constant `ALWAYS_COPY_STRINGS` was set to `false` (default is `true`) for non-streaming use case
scenarios where the full JSON message is kept in memory (similar to the `simdjson` behaviour).

## Performance vs `encoding/json` and `json-iterator/go`

Below is a performance comparison to Golang's standard package `encoding/json` based on the same set of JSON test files.

```
$ benchcmp                    encoding_json.txt      simdjson-go.txt
benchmark                     old MB/s               new MB/s         speedup
BenchmarkApache_builds-8      106.77                  948.75           8.89x
BenchmarkCanada-8              54.39                  519.85           9.56x
BenchmarkCitm_catalog-8       100.44                 1565.28          15.58x
BenchmarkGithub_events-8      159.49                  848.88           5.32x
BenchmarkGsoc_2018-8          152.93                 2515.59          16.45x
BenchmarkInstruments-8         82.82                  811.61           9.80x
BenchmarkMarine_ik-8           48.12                  422.43           8.78x
BenchmarkMesh-8                49.38                  371.39           7.52x
BenchmarkMesh_pretty-8         73.10                  784.89          10.74x
BenchmarkNumbers-8            160.69                  434.85           2.71x
BenchmarkRandom-8              66.56                  615.12           9.24x
BenchmarkTwitter-8             79.05                 1193.47          15.10x
BenchmarkTwitterescaped-8      83.96                  536.19           6.39x
BenchmarkUpdate_center-8       73.92                  860.52          11.64x
```

`simdjson-go` also uses less additional memory and fewer allocations.

Here is another benchmark comparison to `json-iterator/go`:

```
$ benchcmp                    json-iterator.txt      simdjson-go.txt
benchmark                     old MB/s               new MB/s         speedup
BenchmarkApache_builds-8      154.65                  948.75           6.13x
BenchmarkCanada-8              40.34                  519.85          12.89x
BenchmarkCitm_catalog-8       183.69                 1565.28           8.52x
BenchmarkGithub_events-8      170.77                  848.88           4.97x
BenchmarkGsoc_2018-8          225.13                 2515.59          11.17x
BenchmarkInstruments-8        120.39                  811.61           6.74x
BenchmarkMarine_ik-8           61.71                  422.43           6.85x
BenchmarkMesh-8                50.66                  371.39           7.33x
BenchmarkMesh_pretty-8         90.36                  784.89           8.69x
BenchmarkNumbers-8             52.61                  434.85           8.27x
BenchmarkRandom-8              85.87                  615.12           7.16x
BenchmarkTwitter-8            139.57                 1193.47           8.55x
BenchmarkTwitterescaped-8     102.28                  536.19           5.24x
BenchmarkUpdate_center-8      101.41                  860.52           8.49x
```

## AVX512 Acceleration

Stage 1 has been optimized using AVX512 instructions. Under full CPU load (8 threads) the AVX512 code is about 1 GB/sec (15%) faster than the AVX2 code.

```
benchmark                                   AVX2 MB/s    AVX512 MB/s     speedup
BenchmarkFindStructuralBitsParallelLoop     7225.24      8302.96         1.15x
```

These benchmarks were generated on a c5.2xlarge EC2 instance with a Xeon Platinum 8124M CPU at 3.0 GHz.

## Design

`simdjson-go` follows the same two stage design as `simdjson`.
During the first stage the structural elements (`{`, `}`, `[`, `]`, `:`, and `,`)
are detected and forwarded as offsets in the message buffer to the second stage.
The second stage builds a tape format of the structure of the JSON document.

Note that in contrast to `simdjson`, `simdjson-go` outputs `uint32`
increments (as opposed to absolute values) to the second stage.
This allows arbitrarily large JSON files to be parsed (as long as a single (string) element does not surpass 4 GB).
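
To illustrate why increments lift the 4 GB limit, here is a minimal sketch of the idea. It is not the library's code: `toIncrements` and `toAbsolute` are hypothetical helpers invented for this example, and the offsets are made up.

```go
package main

import "fmt"

// toIncrements converts absolute byte offsets of structural characters into
// uint32 increments, i.e. the distance from the previous structural character.
// Hypothetical helper for illustration only; it assumes every gap fits in 32 bits.
func toIncrements(abs []uint64) []uint32 {
	inc := make([]uint32, 0, len(abs))
	var prev uint64
	for _, pos := range abs {
		inc = append(inc, uint32(pos-prev))
		prev = pos
	}
	return inc
}

// toAbsolute rebuilds the absolute offsets by accumulating the increments
// into a 64-bit position, as a consumer of the increments could do.
func toAbsolute(inc []uint32) []uint64 {
	abs := make([]uint64, 0, len(inc))
	var pos uint64
	for _, d := range inc {
		pos += uint64(d)
		abs = append(abs, pos)
	}
	return abs
}

func main() {
	// The last two offsets lie beyond 4 GB and would not fit in a uint32,
	// but every gap between neighbours does.
	abs := []uint64{0, 4000000000, 8000000000, 8000000012}
	inc := toIncrements(abs)
	fmt.Println(inc)
	fmt.Println(toAbsolute(inc))
}
```

Only the distance between two neighbouring structural characters has to fit in 32 bits, which is why the remaining limit applies to a single element rather than to the whole document.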

Also, for better performance,
both stages run concurrently as separate goroutines and a Go channel is used to communicate between the two stages.

### Stage 1

Stage 1 has been converted from the original C code (containing the SIMD intrinsics) to Golang assembly using [c2goasm](https://github.com/minio/c2goasm).
It essentially consists of five separate steps, being:

- `find_odd_backslash_sequences`: detect backslash characters used to escape quotes
- `find_quote_mask_and_bits`: generate a mask with bits turned on for characters between quotes
- `find_whitespace_and_structurals`: generate a mask for whitespace plus a mask for the structural characters
- `finalize_structurals`: combine the masks computed above into a final mask where each active bit represents the position of a structural character in the input message.
- `flatten_bits_incremental`: output the active bits in the final mask as incremental offsets.

For more details you can take a look at the various test cases in `find_subroutines_amd64_test.go` to see how
the individual routines can be invoked (typically with a 64 byte input buffer that generates one or more 64-bit masks).

There is one final routine, `find_structural_bits_in_slice`, that ties it all together and is
invoked with a slice of the message buffer in order to find the incremental offsets.

### Stage 2

During Stage 2 the tape structure is constructed.
It is essentially a single function that jumps around as it finds the various structural characters
and builds the hierarchy of the JSON document that it processes.
The values of the JSON elements such as strings, integers, booleans etc. are parsed and written to the tape.

Any errors (such as an array not being closed or a missing closing brace) are detected and reported back as errors to the client.
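
The two stages above can be pictured with a scalar sketch. This is an illustration under simplifying assumptions, not the SIMD implementation: `structuralMask` and `checkNesting` are hypothetical names, only a single block of up to 64 bytes is handled, no state is carried between blocks, and no tape is built.

```go
package main

import (
	"errors"
	"fmt"
)

// structuralMask is a scalar stand-in for Stage 1 on one block of up to 64
// bytes: bit i is set when block[i] is a structural character outside a string.
func structuralMask(block []byte) uint64 {
	var mask uint64
	inString, escaped := false, false
	for i, c := range block {
		switch {
		case escaped: // the character after a backslash is taken literally
			escaped = false
		case inString && c == '\\':
			escaped = true
		case c == '"':
			inString = !inString
		case !inString && (c == '{' || c == '}' || c == '[' || c == ']' || c == ':' || c == ','):
			mask |= 1 << uint(i)
		}
	}
	return mask
}

// checkNesting is a scalar stand-in for the Stage 2 error checks: it visits
// the structural characters and verifies that brackets and braces balance.
func checkNesting(msg []byte, mask uint64) error {
	var stack []byte
	for i := 0; i < len(msg) && i < 64; i++ {
		if mask&(1<<uint(i)) == 0 {
			continue
		}
		switch c := msg[i]; c {
		case '{', '[':
			stack = append(stack, c)
		case '}', ']':
			if len(stack) == 0 {
				return fmt.Errorf("unexpected %q at offset %d", c, i)
			}
			open := stack[len(stack)-1]
			stack = stack[:len(stack)-1]
			if (c == '}') != (open == '{') {
				return fmt.Errorf("mismatched %q at offset %d", c, i)
			}
		}
	}
	if len(stack) > 0 {
		return errors.New("unclosed array or object")
	}
	return nil
}

func main() {
	good := []byte(`{"a":"x,y","b":[1,2]}`)
	fmt.Println(checkNesting(good, structuralMask(good)))

	bad := []byte(`{"a":[1,2}`)
	fmt.Println(checkNesting(bad, structuralMask(bad)))
}
```

Note that the comma inside the string `"x,y"` does not set a bit in the mask, which is the purpose of the quote and backslash handling in Stage 1.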

## Tape format

Similarly to `simdjson`, `simdjson-go` parses the structure onto a 'tape' format.
With this format it is possible to skip over arrays and (sub)objects as the sizes are recorded in the tape.

`simdjson-go` format is exactly the same as the `simdjson` [tape](https://github.com/lemire/simdjson/blob/master/doc/tape.md)
format with the following 2 exceptions:

- In order to support ndjson, it is possible to have more than one root element on the tape.
  Also, to allow for fast navigation over root elements, a root points to the next root element
  (and as such the last root element points 1 index past the length of the tape).

- Strings are handled differently: unlike `simdjson`, the string size is not prepended in the String buffer
  but is added as an additional element to the tape itself (much like integers and floats).
  - In case `ALWAYS_COPY_STRINGS` is `false`: Only strings that contain special characters are copied to the String buffer
    in which case the payload from the tape is the offset into the String buffer.
    For string values without special characters the tape's payload points directly into the message buffer.
  - In case `ALWAYS_COPY_STRINGS` is `true` (default): Strings are always copied to the String buffer.

For more information, see `TestStage2BuildTape` in `stage2_build_tape_test.go`.

## Non streaming use cases

The best performance is obtained by keeping the JSON message fully mapped in memory and setting the
`ALWAYS_COPY_STRINGS` constant to `false`. This prevents duplicate copies of string values being made
but mandates that the original JSON buffer is kept alive until the `ParsedJson` object is no longer needed
(i.e. iteration over the tape format has been completed).
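
The reason for this requirement can be shown with plain Go slices. The sketch below does not use `simdjson-go` at all; `reuseDemo` is a hypothetical function that compares a zero-copy view into a buffer (analogous to `ALWAYS_COPY_STRINGS` being `false`) with an owned copy (analogous to the default) once the buffer is overwritten.

```go
package main

import "fmt"

// reuseDemo returns a zero-copy view and an owned copy of the same string
// value, after the underlying message buffer has been reused.
func reuseDemo() (view, owned []byte) {
	// buf plays the role of the JSON message buffer.
	buf := []byte(`{"Make":"HOND"}`)

	// A view shares memory with buf; nothing is copied.
	view = buf[9:13]

	// An owned copy has its own backing array.
	owned = append([]byte(nil), buf[9:13]...)

	// Reuse the buffer for the next message, as a streaming reader would.
	copy(buf, `{"Make":"TLSA"}`)
	return view, owned
}

func main() {
	view, owned := reuseDemo()
	fmt.Printf("view=%s owned=%s\n", view, owned) // prints "view=TLSA owned=HOND"
}
```

The view silently changes when the buffer is reused, while the copy keeps the original value at the cost of an extra allocation.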

In case the JSON message buffer is freed earlier (or for streaming use cases where memory is reused)
`ALWAYS_COPY_STRINGS` should be set to `true` (which is the default behaviour).

## Fuzz Tests

`simdjson-go` has been extensively fuzz tested to ensure that input cannot generate crashes and that output matches
the standard library.

The fuzzers and corpus are contained in a separate repository at [github.com/minio/simdjson-fuzz](https://github.com/minio/simdjson-fuzz)

The repo contains information on how to run them.

## License

`simdjson-go` is released under the Apache License v2.0. You can find the complete text in the file LICENSE.

## Contributing

Contributions are welcome, please send PRs for any enhancements.

If your PR includes parsing changes please run the fuzz testers for a couple of hours.