     1  # zstd 
     2  
     3  [Zstandard](https://facebook.github.io/zstd/) is a real-time compression algorithm, providing high compression ratios. 
     4  It offers a very wide range of compression/speed trade-offs, while being backed by a very fast decoder.
     5  This package implements a high-performance compressor that is, for now, focused on speed.
     6  
     7  This package provides [compression](#Compressor) to and [decompression](#Decompressor) of Zstandard content. 
     8  
     9  This package is pure Go and without use of "unsafe". 
    10  
    11  The `zstd` package is provided as open source software using a Go standard license.
    12  
    13  Currently the package is heavily optimized for 64 bit processors and will be significantly slower on 32 bit processors.
    14  
    15  For seekable zstd streams, see [this excellent package](https://github.com/SaveTheRbtz/zstd-seekable-format-go).
    16  
    17  ## Installation
    18  
    19  Install using `go get -u github.com/klauspost/compress`. The package is located in `github.com/klauspost/compress/zstd`.
    20  
    21  [![Go Reference](https://pkg.go.dev/badge/github.com/klauspost/compress/zstd.svg)](https://pkg.go.dev/github.com/klauspost/compress/zstd)
    22  
    23  ## Compressor
    24  
    25  ### Status: 
    26  
    27  STABLE - there may always be subtle bugs, but a wide variety of content has been tested and the library is actively 
    28  used by several projects. This library is being [fuzz-tested](https://github.com/klauspost/compress-fuzz) for all updates.
    29  
    30  There may still be specific combinations of data types/size/settings that could lead to edge cases, 
    31  so as always, testing is recommended.  
    32  
    33  The following pre-defined compression levels have been implemented:
    34  
    35  * The "Fastest" compression ratio is roughly equivalent to zstd level 1. 
    36  * The "Default" compression ratio is roughly equivalent to zstd level 3 (default).
    37  * The "Better" compression ratio is roughly equivalent to zstd level 7.
    38  * The "Best" compression ratio is roughly equivalent to zstd level 11.
    39  
    40  In its fastest mode, it is typically 2x as fast as the stdlib deflate/gzip.
    41  The compression ratio is comparable to stdlib level 3, while usually being 3x as fast.
    42  
    43   
    44  ### Usage
    45  
    46  An Encoder can be used for either compressing a stream via the
    47  `io.WriteCloser` interface supported by the Encoder or as multiple independent
    48  tasks via the `EncodeAll` function.
    49  For smaller encodes, using the `EncodeAll` function is encouraged.
    50  Use `NewWriter` to create a new instance that can be used for both.
    51  
    52  To create a writer with default options:
    53  
```Go
import (
    "io"

    "github.com/bir3/gocompiler/src/cmd/gocmd/compress/zstd"
)

// Compress input to output.
func Compress(in io.Reader, out io.Writer) error {
    enc, err := zstd.NewWriter(out)
    if err != nil {
        return err
    }
    _, err = io.Copy(enc, in)
    if err != nil {
        // Close even on failure to release resources.
        enc.Close()
        return err
    }
    // Close flushes remaining data and finishes the stream.
    return enc.Close()
}
```
    69  
    70  Now you can encode by writing data to `enc`. The output will be finished writing when `Close()` is called.
    71  Even if your encode fails, you should still call `Close()` to release any resources that may be held up.  
    72  
    73  The above is fine for big encodes. However, whenever possible try to *reuse* the writer.
    74  
    75  To reuse the encoder, you can use the `Reset(io.Writer)` function to change to another output. 
    76  This will allow the encoder to reuse all resources and avoid wasteful allocations. 
    77  
    78  Currently stream encoding has 'light' concurrency, meaning up to 2 goroutines can be working on part 
    79  of a stream. This is independent of the `WithEncoderConcurrency(n)` option, but that is likely to change 
    80  in the future. So if you want to limit concurrency for future updates, specify the concurrency
    81  you would like.
    82  
    83  If you would like stream encoding to be done without spawning async goroutines, use `WithEncoderConcurrency(1)`
    84  which will compress input as each block is completed, blocking on writes until each has completed.
    85  
    86  You can specify your desired compression level using `WithEncoderLevel()` option. Currently only pre-defined 
    87  compression settings can be specified.
    88  
    89  #### Future Compatibility Guarantees
    90  
    91  This will be an evolving project. When using this package it is important to note that both the compression efficiency and speed may change.
    92  
    93  The goal will be to keep the default efficiency at the default zstd (level 3). 
    94  However the encoding should never be assumed to remain the same, 
    95  and you should not use hashes of compressed output for similarity checks.
    96  
    97  The Encoder can be assumed to produce the same output from the exact same code version.
    98  However, there may be modes in the future that break this, 
    99  although they will not be enabled without an explicit option.   
   100  
   101  This encoder is not designed to (and will probably never) output the exact same bitstream as the reference encoder.
   102  
   103  Also note that the cgo decompressor currently does not [report all errors on invalid input](https://github.com/DataDog/zstd/issues/59),
   104  [omits error checks](https://github.com/DataDog/zstd/issues/61), [ignores checksums](https://github.com/DataDog/zstd/issues/43) 
   105  and seems to ignore concatenated streams, even though [it is part of the spec](https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#frames).
   106  
   107  #### Blocks
   108  
   109  For compressing small blocks, the returned encoder has a function called `EncodeAll(src, dst []byte) []byte`.
   110  
   111  `EncodeAll` will encode all input in src and append it to dst.
   112  This function can be called concurrently. 
   113  Each call will only run on the same goroutine as the caller.
   114  
   115  Encoded blocks can be concatenated and the result will be the combined input stream.
   116  Data compressed with EncodeAll can be decoded with the Decoder, using either a stream or `DecodeAll`.
   117  
   118  Especially when encoding blocks you should take special care to reuse the encoder. 
   119  This will effectively make it run without allocations after a warmup period. 
   120  To make it run completely without allocations, supply a destination buffer with space for all content.   
   121  
```Go
import "github.com/bir3/gocompiler/src/cmd/gocmd/compress/zstd"

// Create a writer that caches compressors.
// For this operation type we supply a nil Writer.
var encoder, _ = zstd.NewWriter(nil)

// Compress a buffer.
// If you have a destination buffer, the allocation in the call can also be eliminated.
func Compress(src []byte) []byte {
    return encoder.EncodeAll(src, make([]byte, 0, len(src)))
}
```
   135  
   136  You can control the maximum number of concurrent encodes using the `WithEncoderConcurrency(n)` 
   137  option when creating the writer.
   138  
   139  Using the Encoder for both a stream and individual blocks concurrently is safe. 
   140  
   141  ### Performance
   142  
   143  I have collected some speed examples to compare speed and compression against other compressors.
   144  
   145  * `file` is the input file.
   146  * `out` is the compressor used. `zskp` is this package. `zstd` is the Datadog cgo library. `gzstd` is the standard library gzip and `gzkp` is the gzip implementation from this library.
   147  * `level` is the compression level used. For `zskp` level 1 is "fastest", level 2 is "default"; 3 is "better", 4 is "best".
   148  * `insize`/`outsize` is the input/output size.
   149  * `millis` is the number of milliseconds used for compression.
   150  * `mb/s` is megabytes (2^20 bytes) per second.
   151  
   152  ```
   153  Silesia Corpus:
   154  http://sun.aei.polsl.pl/~sdeor/corpus/silesia.zip
   155  
   156  This package:
   157  file    out     level   insize      outsize     millis  mb/s
   158  silesia.tar zskp    1   211947520   73821326    634     318.47
   159  silesia.tar zskp    2   211947520   67655404    1508    133.96
   160  silesia.tar zskp    3   211947520   64746933    3000    67.37
   161  silesia.tar zskp    4   211947520   60073508    16926   11.94
   162  
   163  cgo zstd:
   164  silesia.tar zstd    1   211947520   73605392    543     371.56
   165  silesia.tar zstd    3   211947520   66793289    864     233.68
   166  silesia.tar zstd    6   211947520   62916450    1913    105.66
   167  silesia.tar zstd    9   211947520   60212393    5063    39.92
   168  
   169  gzip, stdlib/this package:
   170  silesia.tar gzstd   1   211947520   80007735    1498    134.87
   171  silesia.tar gzkp    1   211947520   80088272    1009    200.31
   172  
   173  GOB stream of binary data. Highly compressible.
   174  https://files.klauspost.com/compress/gob-stream.7z
   175  
   176  file        out     level   insize  outsize     millis  mb/s
   177  gob-stream  zskp    1   1911399616  233948096   3230    564.34
   178  gob-stream  zskp    2   1911399616  203997694   4997    364.73
   179  gob-stream  zskp    3   1911399616  173526523   13435   135.68
   180  gob-stream  zskp    4   1911399616  162195235   47559   38.33
   181  
   182  gob-stream  zstd    1   1911399616  249810424   2637    691.26
   183  gob-stream  zstd    3   1911399616  208192146   3490    522.31
   184  gob-stream  zstd    6   1911399616  193632038   6687    272.56
   185  gob-stream  zstd    9   1911399616  177620386   16175   112.70
   186  
   187  gob-stream  gzstd   1   1911399616  357382013   9046    201.49
   188  gob-stream  gzkp    1   1911399616  359136669   4885    373.08
   189  
   190  The test data for the Large Text Compression Benchmark is the first
   191  10^9 bytes of the English Wikipedia dump on Mar. 3, 2006.
   192  http://mattmahoney.net/dc/textdata.html
   193  
   194  file    out level   insize      outsize     millis  mb/s
   195  enwik9  zskp    1   1000000000  343833605   3687    258.64
   196  enwik9  zskp    2   1000000000  317001237   7672    124.29
   197  enwik9  zskp    3   1000000000  291915823   15923   59.89
   198  enwik9  zskp    4   1000000000  261710291   77697   12.27
   199  
   200  enwik9  zstd    1   1000000000  358072021   3110    306.65
   201  enwik9  zstd    3   1000000000  313734672   4784    199.35
   202  enwik9  zstd    6   1000000000  295138875   10290   92.68
   203  enwik9  zstd    9   1000000000  278348700   28549   33.40
   204  
   205  enwik9  gzstd   1   1000000000  382578136   8608    110.78
   206  enwik9  gzkp    1   1000000000  382781160   5628    169.45
   207  
   208  Highly compressible JSON file.
   209  https://files.klauspost.com/compress/github-june-2days-2019.json.zst
   210  
   211  file                        out level   insize      outsize     millis  mb/s
   212  github-june-2days-2019.json zskp    1   6273951764  697439532   9789    611.17
   213  github-june-2days-2019.json zskp    2   6273951764  610876538   18553   322.49
   214  github-june-2days-2019.json zskp    3   6273951764  517662858   44186   135.41
   215  github-june-2days-2019.json zskp    4   6273951764  464617114   165373  36.18
   216  
   217  github-june-2days-2019.json zstd    1   6273951764  766284037   8450    708.00
   218  github-june-2days-2019.json zstd    3   6273951764  661889476   10927   547.57
   219  github-june-2days-2019.json zstd    6   6273951764  642756859   22996   260.18
   220  github-june-2days-2019.json zstd    9   6273951764  601974523   52413   114.16
   221  
   222  github-june-2days-2019.json gzstd   1   6273951764  1164397768  26793   223.32
   223  github-june-2days-2019.json gzkp    1   6273951764  1120631856  17693   338.16
   224  
   225  VM Image, Linux mint with a few installed applications:
   226  https://files.klauspost.com/compress/rawstudio-mint14.7z
   227  
   228  file                    out level   insize      outsize     millis  mb/s
   229  rawstudio-mint14.tar    zskp    1   8558382592  3718400221  18206   448.29
   230  rawstudio-mint14.tar    zskp    2   8558382592  3326118337  37074   220.15
   231  rawstudio-mint14.tar    zskp    3   8558382592  3163842361  87306   93.49
   232  rawstudio-mint14.tar    zskp    4   8558382592  2970480650  783862  10.41
   233  
   234  rawstudio-mint14.tar    zstd    1   8558382592  3609250104  17136   476.27
   235  rawstudio-mint14.tar    zstd    3   8558382592  3341679997  29262   278.92
   236  rawstudio-mint14.tar    zstd    6   8558382592  3235846406  77904   104.77
   237  rawstudio-mint14.tar    zstd    9   8558382592  3160778861  140946  57.91
   238  
   239  rawstudio-mint14.tar    gzstd   1   8558382592  3926234992  51345   158.96
   240  rawstudio-mint14.tar    gzkp    1   8558382592  3960117298  36722   222.26
   241  
   242  CSV data:
   243  https://files.klauspost.com/compress/nyc-taxi-data-10M.csv.zst
   244  
   245  file                    out level   insize      outsize     millis  mb/s
   246  nyc-taxi-data-10M.csv   zskp    1   3325605752  641319332   9462    335.17
   247  nyc-taxi-data-10M.csv   zskp    2   3325605752  588976126   17570   180.50
   248  nyc-taxi-data-10M.csv   zskp    3   3325605752  529329260   32432   97.79
   249  nyc-taxi-data-10M.csv   zskp    4   3325605752  474949772   138025  22.98
   250  
   251  nyc-taxi-data-10M.csv   zstd    1   3325605752  687399637   8233    385.18
   252  nyc-taxi-data-10M.csv   zstd    3   3325605752  598514411   10065   315.07
   253  nyc-taxi-data-10M.csv   zstd    6   3325605752  570522953   20038   158.27
   254  nyc-taxi-data-10M.csv   zstd    9   3325605752  517554797   64565   49.12
   255  
   256  nyc-taxi-data-10M.csv   gzstd   1   3325605752  928654908   21270   149.11
   257  nyc-taxi-data-10M.csv   gzkp    1   3325605752  922273214   13929   227.68
   258  ```
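
For readers who want to check or extend the tables, the `mb/s` column can be approximated from `insize` and `millis` using the definition given above (2^20 bytes per megabyte). The sketch below uses the first `silesia.tar` row; the function names are hypothetical. Since `millis` is rounded in the table, the computed value is close to, but not exactly, the listed figure.

```Go
package main

import "fmt"

// throughputMiB returns megabytes (2^20 bytes) per second for
// insize bytes processed in millis milliseconds.
func throughputMiB(insize, millis int64) float64 {
	return float64(insize) / (1 << 20) / (float64(millis) / 1000)
}

// ratioPercent returns the output size as a percentage of the input size.
func ratioPercent(insize, outsize int64) float64 {
	return 100 * float64(outsize) / float64(insize)
}

func main() {
	// silesia.tar, zskp level 1, values taken from the table above.
	const insize, outsize, millis = 211947520, 73821326, 634
	fmt.Printf("%.2f MB/s, output is %.2f%% of input\n",
		throughputMiB(insize, millis), ratioPercent(insize, outsize))
}
```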
   259  
   260  ## Decompressor
   261  
   262  Status: STABLE - there may still be subtle bugs, but a wide variety of content has been tested.
   263  
   264  This library is being continuously [fuzz-tested](https://github.com/klauspost/compress-fuzz),
   265  kindly supplied by [fuzzit.dev](https://fuzzit.dev/). 
   266  The main purpose of the fuzz testing is to ensure that it is not possible to crash the decoder, 
   267  or run it past its limits with ANY input provided.  
   268   
   269  ### Usage
   270  
   271  The package has been designed for two main usages: big streams of data and smaller in-memory buffers.
   272  Both of them are accessed by creating a `Decoder`.
   273  
   274  For streaming use a simple setup could look like this:
   275  
```Go
import (
    "io"

    "github.com/bir3/gocompiler/src/cmd/gocmd/compress/zstd"
)

// Decompress input to output.
func Decompress(in io.Reader, out io.Writer) error {
    d, err := zstd.NewReader(in)
    if err != nil {
        return err
    }
    // Close stops any goroutines started by the decoder.
    defer d.Close()

    // Copy content...
    _, err = io.Copy(out, d)
    return err
}
```
   291  
   292  It is important to call `Close()` when you no longer need the Reader, since this stops the goroutines 
   293  that are running with default settings.
   294  Goroutines will exit once an error has been returned, including `io.EOF` at the end of a stream.
   295  
   296  Streams are decoded concurrently in 4 asynchronous stages to give the best possible throughput.
   297  However, if you prefer synchronous decompression, use `WithDecoderConcurrency(1)` which will decompress data 
   298  as it is being requested only.
   299  
   300  For decoding buffers, it could look something like this:
   301  
```Go
import "github.com/bir3/gocompiler/src/cmd/gocmd/compress/zstd"

// Create a reader that caches decompressors.
// For this operation type we supply a nil Reader.
var decoder, _ = zstd.NewReader(nil, zstd.WithDecoderConcurrency(0))

// Decompress a buffer. We don't supply a destination buffer,
// so it will be allocated by the decoder.
func Decompress(src []byte) ([]byte, error) {
    return decoder.DecodeAll(src, nil)
}
```
   315  
   316  Both of these cases should provide the functionality needed. 
   317  The decoder can be used for *concurrent* decompression of multiple buffers.
   318  By default 4 decompressors will be created. 
   319  
   320  It will only allow a certain number of concurrent operations to run. 
   321  To tweak that yourself use the `WithDecoderConcurrency(n)` option when creating the decoder.
   322  It is possible to use `WithDecoderConcurrency(0)` to create GOMAXPROCS decoders.
   323  
   324  ### Dictionaries
   325  
   326  Data compressed with [dictionaries](https://github.com/facebook/zstd#the-case-for-small-data-compression) can be decompressed.
   327  
   328  Dictionaries are added individually to Decoders.
   329  Dictionaries are generated by the `zstd --train` command and contain an initial state for the decoder.
   330  To add a dictionary use the `WithDecoderDicts(dicts ...[]byte)` option with the dictionary data.
   331  Several dictionaries can be added at once.
   332  
   333  Dictionaries will be used automatically for the data that specifies them.
   334  A re-used Decoder will still contain the dictionaries registered.
   335  
   336  When registering multiple dictionaries with the same ID, the last one will be used.
   337  
   338  It is possible to use dictionaries when compressing data.
   339  
   340  To enable a dictionary use `WithEncoderDict(dict []byte)`. Here only one dictionary will be used 
   341  and it will likely be used even if it doesn't improve compression. 
   342  
   343  The same dictionary must be used to decompress the content.
   344  
   345  For any real gains, the dictionary should be built with similar data. 
   346  If an unsuitable dictionary is used the output may be slightly larger than using no dictionary.
   347  Use the [zstd commandline tool](https://github.com/facebook/zstd/releases) to build a dictionary from sample data.
   348  For information see [zstd dictionary information](https://github.com/facebook/zstd#the-case-for-small-data-compression). 
   349  
   350  For now there is a fixed startup performance penalty for compressing content with dictionaries. 
   351  This will likely be improved over time. Be sure to test performance when implementing.  
   352  
   353  ### Allocation-less operation
   354  
   355  The decoder has been designed to operate without allocations after a warmup. 
   356  
   357  This means that you should *store* the decoder for best performance. 
   358  To re-use a stream decoder, use the `Reset(r io.Reader) error` to switch to another stream.
   359  A decoder can safely be re-used even if the previous stream failed.
   360  
   361  To release the resources, you must call the `Close()` function on a decoder.
   362  After this it can *no longer be reused*, but all running goroutines will be stopped.
   363  So you *must* use this if you will no longer need the Reader.
   364  
   365  For decompressing smaller buffers a single decoder can be used.
   366  When decoding buffers, you can supply a destination slice with length 0 and your expected capacity.
   367  In this case no unneeded allocations should be made. 
   368  
   369  ### Concurrency
   370  
   371  The buffer decoder does everything on the same goroutine and does nothing concurrently.
   372  It can however decode several buffers concurrently. Use `WithDecoderConcurrency(n)` to limit that.
   373  
   374  The stream decoder will create goroutines that handle:
   375  
   376  1) Reading input and splitting it into blocks.
   377  2) Decompression of literals.
   378  3) Decompression of sequences.
   379  4) Reconstruction of output stream.
   380  
   381  So effectively this also means the decoder will "read ahead" and prepare data to always be available for output.
   382  
   383  The concurrency level will, for streams, determine how many blocks ahead the decompression will start.
   384  
   385  Since "blocks" are quite dependent on the output of the previous block, stream decoding will only have limited concurrency.
   386  
   387  In practice this means that concurrency is often limited to utilizing about 3 cores effectively.
   388    
   389  ### Benchmarks
   390  
   391  The first two are streaming decodes and the rest are smaller inputs. 
   392  
   393  Running on AMD Ryzen 9 3950X 16-Core Processor. AMD64 assembly used.
   394  
   395  ```
   396  BenchmarkDecoderSilesia-32    	                   5	 206878840 ns/op	1024.50 MB/s	   49808 B/op	      43 allocs/op
   397  BenchmarkDecoderEnwik9-32                          1	1271809000 ns/op	 786.28 MB/s	   72048 B/op	      52 allocs/op
   398  
   399  Concurrent blocks, performance:
   400  
   401  BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-32         	   67356	     17857 ns/op	10321.96 MB/s	        22.48 pct	     102 B/op	       0 allocs/op
   402  BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-32     	  266656	      4421 ns/op	26823.21 MB/s	        11.89 pct	      19 B/op	       0 allocs/op
   403  BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-32      	   20992	     56842 ns/op	8477.17 MB/s	        39.90 pct	     754 B/op	       0 allocs/op
   404  BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-32        	   27456	     43932 ns/op	9714.01 MB/s	        33.27 pct	     524 B/op	       0 allocs/op
   405  BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-32      	   78432	     15047 ns/op	8319.15 MB/s	        40.34 pct	      66 B/op	       0 allocs/op
   406  BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-32       	   65800	     18436 ns/op	8249.63 MB/s	        37.75 pct	      88 B/op	       0 allocs/op
   407  BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-32          	  102993	     11523 ns/op	35546.09 MB/s	         3.637 pct	     143 B/op	       0 allocs/op
   408  BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-32    	 1000000	      1070 ns/op	95720.98 MB/s	        80.53 pct	       3 B/op	       0 allocs/op
   409  BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-32    	  749802	      1752 ns/op	70272.35 MB/s	       100.0 pct	       5 B/op	       0 allocs/op
   410  BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-32          	   22640	     52934 ns/op	13263.37 MB/s	        26.25 pct	    1014 B/op	       0 allocs/op
   411  BenchmarkDecoder_DecodeAllParallel/html.zst-32              	  226412	      5232 ns/op	19572.27 MB/s	        14.49 pct	      20 B/op	       0 allocs/op
   412  BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-32     	  923041	      1276 ns/op	3194.71 MB/s	        31.26 pct	       0 B/op	       0 allocs/op
   413  ```
   414  
   415  This reflects the performance around May 2022, but this may be out of date.
   416  
   417  ## Zstd inside ZIP files
   418  
   419  It is possible to use Zstandard to compress individual files inside zip archives.
   420  While this isn't widely supported, it can be useful for internal files.
   421  
   422  To support the compression and decompression of these files you must register a compressor and decompressor.
   423  
   424  It is highly recommended to register the (de)compressors on individual zip Reader/Writer instances and NOT to
   425  use the global registration functions. The main reason for this is that two registrations from 
   426  different packages will result in a panic.
   427  
   428  It is a good idea to only have a single compressor and decompressor, since they can be used for multiple zip
   429  files concurrently, and using a single instance will allow reusing some resources.
   430  
   431  See [this example](https://pkg.go.dev/github.com/klauspost/compress/zstd#example-ZipCompressor) for 
   432  how to compress and decompress files inside zip archives.
   433  
   434  # Contributions
   435  
   436  Contributions are always welcome. 
   437  For new features/fixes, remember to add tests and for performance enhancements include benchmarks.
   438  
   439  For general feedback and experience reports, feel free to open an issue or write me on [Twitter](https://twitter.com/sh0dan).
   440  
   441  This package includes the excellent [`github.com/cespare/xxhash`](https://github.com/cespare/xxhash) package Copyright (c) 2016 Caleb Spare.