# simdjson-go

## Introduction

This is a Golang port of [simdjson](https://github.com/lemire/simdjson),
a high performance JSON parser developed by Daniel Lemire and Geoff Langdale.
It makes extensive use of SIMD instructions to achieve parsing performance of gigabytes of JSON per second.

Performance wise, `simdjson-go` runs on average at about 40% to 60% of the speed of simdjson.
Compared to Golang's standard package `encoding/json`, `simdjson-go` is about 10x faster.

[![Documentation](https://godoc.org/github.com/minio/simdjson-go?status.svg)](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc)

## Features

`simdjson-go` is a validating parser, meaning that, among other things, it validates numerical values, booleans, etc.
Therefore, these values are available as the appropriate `int` and `float64` representations after parsing.

Additionally `simdjson-go` has the following features:

- No 4 GB object limit
- Support for [ndjson](http://ndjson.org/) (newline delimited json)
- Pure Go (no need for cgo)
- Object search/traversal.
- In-place value replacement.
- Remove object/array members.
- Serialize parsed JSON as binary data.
- Re-serialize parts as JSON.

## Requirements

`simdjson-go` has the following requirements for parsing:

A CPU with both AVX2 and CLMUL is required (Haswell from 2013 onwards should do for Intel, for AMD a Ryzen/EPYC CPU (Q1 2017) should be sufficient).
This can be checked using the provided [`SupportedCPU()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#SupportedCPU) function.

The package does not provide a fallback for unsupported CPUs, but serialized data can be deserialized on an unsupported CPU.

Using `gccgo` will also always report an unsupported CPU, since it cannot compile the assembly.

## Usage

Run the following command in order to install `simdjson-go`

```bash
go get -u github.com/minio/simdjson-go
```

In order to parse a JSON byte stream, you either call [`simdjson.Parse()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Parse)
or [`simdjson.ParseND()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParseND) for newline delimited JSON files.
Both of these functions return a [`ParsedJson`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParsedJson)
struct that can be used to navigate the JSON object by calling [`Iter()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParsedJson.Iter).

The easiest use is to call the [`ForEach()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#ParsedJson.ForEach) function of the returned `ParsedJson`.

```Go
package main

import (
	"fmt"
	"log"

	"github.com/minio/simdjson-go"
)

func main() {
	// Parse JSON:
	pj, err := simdjson.Parse([]byte(`{"Image":{"URL":"http://example.com/example.gif"}}`), nil)
	if err != nil {
		log.Fatal(err)
	}

	// Iterate each top level element.
	_ = pj.ForEach(func(i simdjson.Iter) error {
		fmt.Println("Got iterator for type:", i.Type())
		element, err := i.FindElement(nil, "Image", "URL")
		if err == nil {
			value, _ := element.Iter.StringCvt()
			fmt.Println("Found element:", element.Name, "Type:", element.Type, "Value:", value)
		}
		return nil
	})

	// Output:
	// Got iterator for type: object
	// Found element: URL Type: string Value: http://example.com/example.gif
}
```

### Parsing with iterators

Using the type [`Iter`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Iter) you can call
[`Advance()`](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc#Iter.Advance) to iterate over the tape, like so:

```Go
for {
	typ := iter.Advance()

	switch typ {
	case simdjson.TypeRoot:
		if typ, tmp, err = iter.Root(tmp); err != nil {
			return
		}

		if typ == simdjson.TypeObject {
			if obj, err = tmp.Object(obj); err != nil {
				return
			}

			e := obj.FindKey(key, &elem)
			if e != nil && elem.Type == simdjson.TypeString {
				v, _ := elem.Iter.StringBytes()
				fmt.Println(string(v))
			}
		}

	default:
		return
	}
}
```

When you advance the Iter you get the next type currently queued.

Each type then has helpers to access the data. When you get a type you can use these to access the data:

| Type       | Action on Iter             |
|------------|----------------------------|
| TypeNone   | Nothing follows. Iter done |
| TypeNull   | Null value                 |
| TypeString | `String()`/`StringBytes()` |
| TypeInt    | `Int()`/`Float()`          |
| TypeUint   | `Uint()`/`Float()`         |
| TypeFloat  | `Float()`                  |
| TypeBool   | `Bool()`                   |
| TypeObject | `Object()`                 |
| TypeArray  | `Array()`                  |
| TypeRoot   | `Root()`                   |

You can also get the next value as an `interface{}` using the [Interface()](https://pkg.go.dev/github.com/minio/simdjson-go#Iter.Interface) method.

Note that arrays and objects that are null are always returned as `TypeNull`.

The complex types return helpers that will help parse each of the underlying structures.

It is up to you to keep track of the nesting level you are operating at.

For any `Iter` it is possible to marshal the recursive content of the Iter using
[`MarshalJSON()`](https://pkg.go.dev/github.com/minio/simdjson-go#Iter.MarshalJSON) or
[`MarshalJSONBuffer(...)`](https://pkg.go.dev/github.com/minio/simdjson-go#Iter.MarshalJSONBuffer).

Currently, it is not possible to unmarshal into structs.

### Search by path

It is possible to search by path to find elements by traversing objects.

For example:

```
// Find element in path.
elem, err := i.FindElement(nil, "Image", "URL")
```

Will locate the field inside a json object with the following structure:

```
{
    "Image": {
        "URL": "value"
    }
}
```

The values can be any type. The [Element](https://pkg.go.dev/github.com/minio/simdjson-go#Element)
will contain the element information and an Iter to access the content.

## Parsing Objects

If you are only interested in one key in an object you can use `FindKey` to quickly select it.

It is possible to use `ForEach(fn func(key []byte, i Iter), onlyKeys map[string]struct{})`,
which provides a callback for each element in the object.

An object can be traversed manually by using `NextElement(dst *Iter) (name string, t Type, err error)`.
The key of the element is returned as a string, the type of the value is returned,
and the provided `Iter` will contain an iterator that allows access to the content.

There is a `NextElementBytes` which provides the same, but without the need to allocate a string.
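
The `onlyKeys` filter mentioned above is a plain set of key names. The following is a minimal sketch of how such a filter can be applied to byte-slice keys, using only the standard library. The `pair` type and `forEach` function are hypothetical stand-ins for illustration and are not the library's implementation.

```go
package main

import "fmt"

// pair is a hypothetical stand-in for one object member.
type pair struct {
	key   []byte
	value string
}

// forEach calls fn for every member whose key is present in onlyKeys.
// A nil onlyKeys disables filtering, so every member is visited.
func forEach(members []pair, fn func(key []byte, value string), onlyKeys map[string]struct{}) {
	for _, m := range members {
		if onlyKeys != nil {
			// The gc compiler recognizes a map lookup keyed by string(b)
			// and performs it without allocating a new string.
			if _, ok := onlyKeys[string(m.key)]; !ok {
				continue
			}
		}
		fn(m.key, m.value)
	}
}

func main() {
	members := []pair{
		{key: []byte("Age"), value: "20"},
		{key: []byte("Make"), value: "HOND"},
	}
	only := map[string]struct{}{"Make": {}}

	// Prints: Make=HOND
	forEach(members, func(key []byte, value string) {
		fmt.Printf("%s=%s\n", key, value)
	}, only)
}
```

Keeping keys as `[]byte` until a string is actually needed is the same trade-off that distinguishes `NextElementBytes` from `NextElement`.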

All elements of the object can be retrieved using a pretty lightweight [`Parse`](https://pkg.go.dev/github.com/minio/simdjson-go#Object.Parse)
which provides a map of all keys and all elements as a slice.

All elements of the object can be returned as `map[string]interface{}` using the `Map` method on the object.
This will naturally perform allocations for all elements.

## Parsing Arrays

[Arrays](https://pkg.go.dev/github.com/minio/simdjson-go#Array) in JSON can have mixed types.

It is possible to call `ForEach(fn func(i Iter))` to get each element.

To iterate over the array with mixed types use the [`Iter`](https://pkg.go.dev/github.com/minio/simdjson-go#Array.Iter)
method to get an iterator.

There are methods that allow you to retrieve all elements as a single type,
`[]int64`, `[]uint64`, `[]float64` and `[]string` with `AsInteger()`, `AsUint64()`, `AsFloat()` and `AsString()`.

## Number parsing

Numbers in JSON are untyped and are returned by the following rules in order:

* If there is any floating point notation, like exponents, or a dot notation, it is always returned as float.
* If number is a pure integer and it fits within an int64 it is returned as such.
* If number is a pure positive integer and fits within a uint64 it is returned as such.
* If the number is a valid number it is returned as float64.

If the number was converted from integer notation to a float due to not fitting inside int64/uint64
the `FloatOverflowedInteger` flag is set, which can be retrieved using the `(Iter).FloatFlags()` method.

JSON numbers follow JavaScript’s double-precision floating-point format.

* Represented in base 10 with no superfluous leading zeros (e.g. 67, 1, 100).
* Include digits between 0 and 9.
* Can be a negative number (e.g. -10).
* Can be a fraction (e.g. 0.5). At least one digit is required before the decimal point.
* Can also have an exponent of 10, prefixed by e or E with a plus or minus sign to indicate positive or negative exponentiation.
* Octal and hexadecimal formats are not supported.
* Cannot have a value of NaN (Not A Number) or Infinity.

## Parsing NDJSON stream

Newline delimited json is sent as packets with each line being a root element.

Here is an example that counts the number of `"Make": "HOND"` in NDJSON similar to this:

```
{"Age":20, "Make": "HOND"}
{"Age":22, "Make": "TLSA"}
```

```Go
func findHondas(r io.Reader) {
	var nFound int

	// Communication
	reuse := make(chan *simdjson.ParsedJson, 10)
	res := make(chan simdjson.Stream, 10)

	simdjson.ParseNDStream(r, res, reuse)
	// Read results in blocks...
	for got := range res {
		if got.Error != nil {
			if got.Error == io.EOF {
				break
			}
			log.Fatal(got.Error)
		}

		var elem *simdjson.Element
		err := got.Value.ForEach(func(i simdjson.Iter) error {
			var err error
			elem, err = i.FindElement(elem, "Make")
			if err != nil {
				// No "Make" field in this root element; skip it.
				return nil
			}
			bts, _ := elem.Iter.StringBytes()
			if string(bts) == "HOND" {
				nFound++
			}
			return nil
		})
		if err != nil {
			log.Fatal(err)
		}
		// Hand the parsed block back so its memory can be reused.
		reuse <- got.Value
	}
	fmt.Println("Found", nFound, "Hondas")
}
```

More examples can be found in the examples subdirectory and further documentation can be found at [godoc](https://pkg.go.dev/github.com/minio/simdjson-go?tab=doc).


### In-place Value Replacement

It is possible to replace a few, basic internal values.
This means that when re-parsing or re-serializing the parsed JSON these values will be output.

Boolean (true/false) and null values can be freely exchanged.

Numeric values (float, int, uint) can be exchanged freely.

Strings can also be exchanged with different values.

Strings and numbers can be exchanged.
However, note that there are no checks for numbers inserted as object keys,
so if used for this, invalid JSON is possible.

There is no way to modify objects or arrays, other than the value types above inside each.
Value replacement itself cannot remove or add elements.

To replace a value referenced by an `Iter`, simply call `SetNull`, `SetBool`, `SetFloat`, `SetInt`, `SetUInt`,
`SetString` or `SetStringBytes`.

### Object & Array Element Deletion

It is possible to delete one or more elements in an object.

`(*Object).DeleteElems(fn, onlyKeys)` will call back fn for each key+value.

If true is returned, the key+value is deleted. A key filter can be provided for optional filtering.
If the callback function is nil all elements matching the filter will be deleted.
If both are nil all elements are deleted.

Example:

```Go
	// The object we are modifying
	var obj *simdjson.Object

	// Delete all entries where the key is "unwanted":
	err := obj.DeleteElems(func(key []byte, i simdjson.Iter) bool {
		return string(key) == "unwanted"
	}, nil)

	// Alternative version with prefiltered keys:
	err = obj.DeleteElems(nil, map[string]struct{}{"unwanted": {}})
```

`(*Array).DeleteElems(fn func(i Iter) bool)` will call back fn for each array value.
If the function returns true the element is deleted in the array.

```Go
	// The array we are modifying
	var array *simdjson.Array

	// Delete all entries that are strings.
	array.DeleteElems(func(i simdjson.Iter) bool {
		return i.Type() == simdjson.TypeString
	})
```

## Serializing parsed json

It is possible to serialize parsed JSON for more compact storage and faster load time.

To create a new serializer use [NewSerializer](https://pkg.go.dev/github.com/minio/simdjson-go#NewSerializer).
This serializer can be reused for several JSON blocks.

The serializer will provide string deduplication and compression of elements.
This can be fine-tuned using the [`CompressMode`](https://pkg.go.dev/github.com/minio/simdjson-go#Serializer.CompressMode) setting.

To serialize a block of parsed data use the [`Serialize`](https://pkg.go.dev/github.com/minio/simdjson-go#Serializer.Serialize) method.

To read back use the [`Deserialize`](https://pkg.go.dev/github.com/minio/simdjson-go#Serializer.Deserialize) method.
For deserializing, the compression mode does not need to match since it is read from the stream.

Example of speed for serializer/deserializer on [`parking-citations-1M`](https://dl.minio.io/assets/parking-citations-1M.json.zst).

| Compress Mode | % of JSON size | Serialize Speed | Deserialize Speed |
|---------------|----------------|-----------------|-------------------|
| None          | 177.26%        | 425.70 MB/s     | 2334.33 MB/s      |
| Fast          | 17.20%         | 412.75 MB/s     | 1234.76 MB/s      |
| Default       | 16.85%         | 411.59 MB/s     | 1242.09 MB/s      |
| Best          | 10.91%         | 337.17 MB/s     | 806.23 MB/s       |

In some cases the speed difference and compression difference will be bigger.

## Performance vs `encoding/json` and `json-iterator/go`

Though simdjson provides different output than traditional unmarshal functions, this can give
an overview of the expected performance for reading specific data in JSON.

Below is a performance comparison to Golang's standard package `encoding/json` based on the same set of JSON test files, unmarshal to `interface{}`.

Comparisons with default settings:

```
λ benchcmp enc-json.txt simdjson.txt
benchmark                      old ns/op     new ns/op     delta
BenchmarkApache_builds-32      1219080       142972        -88.27%
BenchmarkCanada-32             38362219      13417193      -65.02%
BenchmarkCitm_catalog-32       17051899      1359983       -92.02%
BenchmarkGithub_events-32      603037        74042         -87.72%
BenchmarkGsoc_2018-32          20777333      1259171       -93.94%
BenchmarkInstruments-32        2626808       301370        -88.53%
BenchmarkMarine_ik-32          56630295      14419901      -74.54%
BenchmarkMesh-32               13411486      4206251       -68.64%
BenchmarkMesh_pretty-32        18226803      4786081       -73.74%
BenchmarkNumbers-32            2131951       909641        -57.33%
BenchmarkRandom-32             7360966       1004387       -86.36%
BenchmarkTwitter-32            6635848       588773        -91.13%
BenchmarkTwitterescaped-32     6292856       972250        -84.55%
BenchmarkUpdate_center-32      6396501       708717        -88.92%

benchmark                      old MB/s     new MB/s     speedup
BenchmarkApache_builds-32      104.40       890.21       8.53x
BenchmarkCanada-32             58.68        167.77       2.86x
BenchmarkCitm_catalog-32       101.29       1270.02      12.54x
BenchmarkGithub_events-32      108.01       879.67       8.14x
BenchmarkGsoc_2018-32          160.17       2642.88      16.50x
BenchmarkInstruments-32        83.88        731.15       8.72x
BenchmarkMarine_ik-32          52.68        206.90       3.93x
BenchmarkMesh-32               53.95        172.03       3.19x
BenchmarkMesh_pretty-32        86.54        329.57       3.81x
BenchmarkNumbers-32            70.42        165.04       2.34x
BenchmarkRandom-32             69.35        508.25       7.33x
BenchmarkTwitter-32            95.17        1072.59      11.27x
BenchmarkTwitterescaped-32     89.37        578.46       6.47x
BenchmarkUpdate_center-32      83.35        752.31       9.03x

benchmark                      old allocs     new allocs     delta
BenchmarkApache_builds-32      9716           22             -99.77%
BenchmarkCanada-32             392535         250            -99.94%
BenchmarkCitm_catalog-32       95372          110            -99.88%
BenchmarkGithub_events-32      3328           17             -99.49%
BenchmarkGsoc_2018-32          58615          67             -99.89%
BenchmarkInstruments-32        13336          33             -99.75%
BenchmarkMarine_ik-32          614776         467            -99.92%
BenchmarkMesh-32               149504         122            -99.92%
BenchmarkMesh_pretty-32        149504         122            -99.92%
BenchmarkNumbers-32            20025          28             -99.86%
BenchmarkRandom-32             66083          76             -99.88%
BenchmarkTwitter-32            31261          53             -99.83%
BenchmarkTwitterescaped-32     31757          53             -99.83%
BenchmarkUpdate_center-32      49074          58             -99.88%

benchmark                      old bytes     new bytes     delta
BenchmarkApache_builds-32      461556        965           -99.79%
BenchmarkCanada-32             10943847      39793         -99.64%
BenchmarkCitm_catalog-32       5122732       6089          -99.88%
BenchmarkGithub_events-32      186148        802           -99.57%
BenchmarkGsoc_2018-32          7032092       17215         -99.76%
BenchmarkInstruments-32        882265        1310          -99.85%
BenchmarkMarine_ik-32          22564413      189870        -99.16%
BenchmarkMesh-32               7130934       15483         -99.78%
BenchmarkMesh_pretty-32        7288661       12066         -99.83%
BenchmarkNumbers-32            1066304       1280          -99.88%
BenchmarkRandom-32             2787054       4096          -99.85%
BenchmarkTwitter-32            2152260       2550          -99.88%
BenchmarkTwitterescaped-32     2330548       3062          -99.87%
BenchmarkUpdate_center-32      2729631       3235          -99.88%
```

Here is another benchmark comparison to `json-iterator/go`, unmarshal to `interface{}`.

```
λ benchcmp jsiter.txt simdjson.txt
benchmark                      old ns/op     new ns/op     delta
BenchmarkApache_builds-32      891370        142972        -83.96%
BenchmarkCanada-32             52365386      13417193      -74.38%
BenchmarkCitm_catalog-32       10154544      1359983       -86.61%
BenchmarkGithub_events-32      398741        74042         -81.43%
BenchmarkGsoc_2018-32          15584278      1259171       -91.92%
BenchmarkInstruments-32        1858339       301370        -83.78%
BenchmarkMarine_ik-32          49881479      14419901      -71.09%
BenchmarkMesh-32               15038300      4206251       -72.03%
BenchmarkMesh_pretty-32        17655583      4786081       -72.89%
BenchmarkNumbers-32            2903165       909641        -68.67%
BenchmarkRandom-32             6156849       1004387       -83.69%
BenchmarkTwitter-32            4655981       588773        -87.35%
BenchmarkTwitterescaped-32     5521004       972250        -82.39%
BenchmarkUpdate_center-32      5540200       708717        -87.21%

benchmark                      old MB/s     new MB/s     speedup
BenchmarkApache_builds-32      142.79       890.21       6.23x
BenchmarkCanada-32             42.99        167.77       3.90x
BenchmarkCitm_catalog-32       170.09       1270.02      7.47x
BenchmarkGithub_events-32      163.34       879.67       5.39x
BenchmarkGsoc_2018-32          213.54       2642.88      12.38x
BenchmarkInstruments-32        118.57       731.15       6.17x
BenchmarkMarine_ik-32          59.81        206.90       3.46x
BenchmarkMesh-32               48.12        172.03       3.58x
BenchmarkMesh_pretty-32        89.34        329.57       3.69x
BenchmarkNumbers-32            51.71        165.04       3.19x
BenchmarkRandom-32             82.91        508.25       6.13x
BenchmarkTwitter-32            135.64       1072.59      7.91x
BenchmarkTwitterescaped-32     101.87       578.46       5.68x
BenchmarkUpdate_center-32      96.24        752.31       7.82x

benchmark                      old allocs     new allocs     delta
BenchmarkApache_builds-32      13248          22             -99.83%
BenchmarkCanada-32             665988         250            -99.96%
BenchmarkCitm_catalog-32       118755         110            -99.91%
BenchmarkGithub_events-32      4442           17             -99.62%
BenchmarkGsoc_2018-32          90915          67             -99.93%
BenchmarkInstruments-32        18776          33             -99.82%
BenchmarkMarine_ik-32          692512         467            -99.93%
BenchmarkMesh-32               184137         122            -99.93%
BenchmarkMesh_pretty-32        204037         122            -99.94%
BenchmarkNumbers-32            30037          28             -99.91%
BenchmarkRandom-32             88091          76             -99.91%
BenchmarkTwitter-32            45040          53             -99.88%
BenchmarkTwitterescaped-32     47198          53             -99.89%
BenchmarkUpdate_center-32      66757          58             -99.91%

benchmark                      old bytes     new bytes     delta
BenchmarkApache_builds-32      518350        965           -99.81%
BenchmarkCanada-32             16189358      39793         -99.75%
BenchmarkCitm_catalog-32       5571982       6089          -99.89%
BenchmarkGithub_events-32      221631        802           -99.64%
BenchmarkGsoc_2018-32          11771591      17215         -99.85%
BenchmarkInstruments-32        991674        1310          -99.87%
BenchmarkMarine_ik-32          25257277      189870        -99.25%
BenchmarkMesh-32               7991707       15483         -99.81%
BenchmarkMesh_pretty-32        8628570       12066         -99.86%
BenchmarkNumbers-32            1226518       1280          -99.90%
BenchmarkRandom-32             3167528       4096          -99.87%
BenchmarkTwitter-32            2426730       2550          -99.89%
BenchmarkTwitterescaped-32     2607198       3062          -99.88%
BenchmarkUpdate_center-32      3052382       3235          -99.89%
```


### Inplace strings

The best performance is obtained by keeping the JSON message fully mapped in memory and using the
`WithCopyStrings(false)` option. This prevents duplicate copies of string values being made
but mandates that the original JSON buffer is kept alive until the `ParsedJson` object is no longer needed
(i.e. iteration over the tape format has been completed).

In case the JSON message buffer is freed earlier (or for streaming use cases where memory is reused)
`WithCopyStrings(true)` should be used (which is the default behaviour).
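
The lifetime requirement can be illustrated without the library. The following is a minimal sketch using only the standard library; the `extract` helper and its `copyStrings` parameter are hypothetical names chosen for illustration, and this is not how `simdjson-go` is implemented internally. It shows why a value that merely references the input buffer changes when the buffer is reused, while a copied value stays stable.

```go
package main

import "fmt"

// extract returns buf[start:end] either as a view into buf (no allocation,
// but tied to the current contents of buf) or as an independent copy.
// Hypothetical helper, for illustration only.
func extract(buf []byte, start, end int, copyStrings bool) []byte {
	if copyStrings {
		return append([]byte(nil), buf[start:end]...)
	}
	return buf[start:end]
}

func main() {
	buf := []byte(`{"Make":"HOND"}`)
	view := extract(buf, 9, 13, false)  // references buf directly
	copied := extract(buf, 9, 13, true) // owns its own memory

	// Simulate a streaming reader that reuses the same buffer for the next message.
	copy(buf, `{"Make":"TSLA"}`)

	// Prints: TSLA HOND
	fmt.Println(string(view), string(copied))
}
```

The view silently picks up the new message, which is the failure mode to expect if the buffer is reused while values obtained without copying are still in use.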

The performance impact differs based on the input type, but these are the general differences:

```
BenchmarkApache_builds/copy-32                  8242            142972 ns/op         890.21 MB/s         965 B/op         22 allocs/op
BenchmarkApache_builds/nocopy-32               10000            111189 ns/op        1144.68 MB/s         932 B/op         22 allocs/op

BenchmarkCanada/copy-32                           91          13417193 ns/op         167.77 MB/s       39793 B/op        250 allocs/op
BenchmarkCanada/nocopy-32                         87          13392401 ns/op         168.08 MB/s       41334 B/op        250 allocs/op

BenchmarkCitm_catalog/copy-32                    889           1359983 ns/op        1270.02 MB/s        6089 B/op        110 allocs/op
BenchmarkCitm_catalog/nocopy-32                  924           1268470 ns/op        1361.64 MB/s        5582 B/op        110 allocs/op

BenchmarkGithub_events/copy-32                 16092             74042 ns/op         879.67 MB/s         802 B/op         17 allocs/op
BenchmarkGithub_events/nocopy-32               19446             62143 ns/op        1048.10 MB/s         794 B/op         17 allocs/op

BenchmarkGsoc_2018/copy-32                       948           1259171 ns/op        2642.88 MB/s       17215 B/op         67 allocs/op
BenchmarkGsoc_2018/nocopy-32                    1144           1040864 ns/op        3197.18 MB/s        9947 B/op         67 allocs/op

BenchmarkInstruments/copy-32                    3932            301370 ns/op         731.15 MB/s        1310 B/op         33 allocs/op
BenchmarkInstruments/nocopy-32                  4443            271500 ns/op         811.59 MB/s        1258 B/op         33 allocs/op

BenchmarkMarine_ik/copy-32                        79          14419901 ns/op         206.90 MB/s      189870 B/op        467 allocs/op
BenchmarkMarine_ik/nocopy-32                      79          14176758 ns/op         210.45 MB/s      189867 B/op        467 allocs/op

BenchmarkMesh/copy-32                            288           4206251 ns/op         172.03 MB/s       15483 B/op        122 allocs/op
BenchmarkMesh/nocopy-32                          285           4207299 ns/op         171.99 MB/s       15615 B/op        122 allocs/op

BenchmarkMesh_pretty/copy-32                     248           4786081 ns/op         329.57 MB/s       12066 B/op        122 allocs/op
BenchmarkMesh_pretty/nocopy-32                   250           4803647 ns/op         328.37 MB/s       12009 B/op        122 allocs/op

BenchmarkNumbers/copy-32                        1336            909641 ns/op         165.04 MB/s        1280 B/op         28 allocs/op
BenchmarkNumbers/nocopy-32                      1321            910493 ns/op         164.88 MB/s        1281 B/op         28 allocs/op

BenchmarkRandom/copy-32                         1201           1004387 ns/op         508.25 MB/s        4096 B/op         76 allocs/op
BenchmarkRandom/nocopy-32                       1554            773142 ns/op         660.26 MB/s        3198 B/op         76 allocs/op

BenchmarkTwitter/copy-32                        2035            588773 ns/op        1072.59 MB/s        2550 B/op         53 allocs/op
BenchmarkTwitter/nocopy-32                      2485            475949 ns/op        1326.85 MB/s        2029 B/op         53 allocs/op

BenchmarkTwitterescaped/copy-32                 1189            972250 ns/op         578.46 MB/s        3062 B/op         53 allocs/op
BenchmarkTwitterescaped/nocopy-32               1372            874972 ns/op         642.77 MB/s        2518 B/op         53 allocs/op

BenchmarkUpdate_center/copy-32                  1665            708717 ns/op         752.31 MB/s        3235 B/op         58 allocs/op
BenchmarkUpdate_center/nocopy-32                2241            536027 ns/op         994.68 MB/s        2130 B/op         58 allocs/op
```

## Design

`simdjson-go` follows the same two stage design as `simdjson`.
During the first stage the structural elements (`{`, `}`, `[`, `]`, `:`, and `,`)
are detected and forwarded as offsets in the message buffer to the second stage.
The second stage builds a tape format of the structure of the JSON document.

Note that in contrast to `simdjson`, `simdjson-go` outputs `uint32`
increments (as opposed to absolute values) to the second stage.
This allows arbitrarily large JSON files to be parsed (as long as a single (string) element does not surpass 4 GB...).

Also, for better performance,
both stages run concurrently as separate goroutines and a Go channel is used to communicate between the two stages.

### Stage 1

Stage 1 has been converted from the original C code (containing the SIMD intrinsics) to Golang assembly using [c2goasm](https://github.com/minio/c2goasm).
It essentially consists of five separate steps, being:

- `find_odd_backslash_sequences`: detect backslash characters used to escape quotes
- `find_quote_mask_and_bits`: generate a mask with bits turned on for characters between quotes
- `find_whitespace_and_structurals`: generate a mask for whitespace plus a mask for the structural characters
- `finalize_structurals`: combine the masks computed above into a final mask where each active bit represents the position of a structural character in the input message.
- `flatten_bits_incremental`: output the active bits in the final mask as incremental offsets.

For more details you can take a look at the various test cases in `find_subroutines_amd64_test.go` to see how
the individual routines can be invoked (typically with a 64 byte input buffer that generates one or more 64-bit masks).

There is one final routine, `find_structural_bits_in_slice`, that ties it all together and is
invoked with a slice of the message buffer in order to find the incremental offsets.

### Stage 2

During Stage 2 the tape structure is constructed.
It is essentially a single function that jumps around as it finds the various structural characters
and builds the hierarchy of the JSON document that it processes.
The values of the JSON elements such as strings, integers, booleans etc. are parsed and written to the tape.

Any errors (such as an array not being closed or a missing closing brace) are detected and reported back as errors to the client.

## Tape format

Similarly to `simdjson`, `simdjson-go` parses the structure onto a 'tape' format.
With this format it is possible to skip over arrays and (sub)objects as the sizes are recorded in the tape.
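
To make the skipping idea concrete, here is a minimal sketch of a toy tape, using only the standard library. It assumes a simplified layout that is not the exact `simdjson-go` format: every entry is a single 64-bit word with a one-byte tag in the top bits and a 56-bit payload, scalars occupy one entry, and the payload of an opening `[` or `{` is the index of the entry that follows the matching close. The helpers `entry`, `tag`, `payload` and `skip` are hypothetical names.

```go
package main

import "fmt"

const payloadMask = 1<<56 - 1

// entry packs a one-byte tag and a 56-bit payload into one tape word.
func entry(t byte, p uint64) uint64 { return uint64(t)<<56 | p&payloadMask }

// tag and payload unpack a tape word again.
func tag(e uint64) byte       { return byte(e >> 56) }
func payload(e uint64) uint64 { return e & payloadMask }

// skip returns the index of the entry that follows the value starting at idx.
// Containers are skipped in one step because their end is recorded on the tape.
func skip(tape []uint64, idx int) int {
	switch tag(tape[idx]) {
	case '{', '[':
		return int(payload(tape[idx]))
	default:
		return idx + 1
	}
}

func main() {
	// Toy tape for the document [[1,2],true].
	tape := []uint64{
		entry('[', 7), // 0: outer array, next value after it is at index 7
		entry('[', 5), // 1: inner array, next value after it is at index 5
		entry('l', 1), // 2: integer 1 (single entry in this toy layout)
		entry('l', 2), // 3: integer 2
		entry(']', 1), // 4: closes the inner array, points back at its opener
		entry('t', 0), // 5: true
		entry(']', 0), // 6: closes the outer array
	}

	// Jump over the inner array without visiting its elements.
	next := skip(tape, 1)
	// Prints: 5 t
	fmt.Println(next, string(tag(tape[next])))
}
```

The cost of skipping a container is constant regardless of how many elements it holds, which is what makes searching for a single key in a large document cheap.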

The `simdjson-go` format is exactly the same as the `simdjson` [tape](https://github.com/lemire/simdjson/blob/master/doc/tape.md)
format with the following exceptions:

- In order to support ndjson, it is possible to have more than one root element on the tape.
  Also, to allow for fast navigation over root elements, a root points to the next root element
  (and as such the last root element points 1 index past the length of the tape).

- A "NOP" tag is added. The value contains the number of tape entries to skip forward for the next tag.

- Strings are handled differently: unlike `simdjson`, the string size is not prepended in the String buffer
  but is added as an additional element to the tape itself (much like integers and floats).
  - In case `WithCopyStrings(false)`: only strings that contain special characters are copied to the String buffer,
    in which case the payload from the tape is the offset into the String buffer.
    For string values without special characters the tape's payload points directly into the message buffer.
  - In case `WithCopyStrings(true)` (default): strings are always copied to the String buffer.

For more information, see `TestStage2BuildTape` in `stage2_build_tape_test.go`.

## Fuzz Tests

`simdjson-go` has been extensively fuzz tested to ensure that input cannot generate crashes and that output matches
the standard library.

The fuzz tests are included as Go 1.18+ compatible tests.

## License

`simdjson-go` is released under the Apache License v2.0. You can find the complete text in the file LICENSE.

## Contributing

Contributions are welcome, please send PRs for any enhancements.

If your PR includes parsing changes, please run the fuzz testers for a couple of hours.