page_title: TarSum checksum specification
page_description: Documentation for algorithms used in the TarSum checksum calculation
page_keywords: docker, checksum, validation, tarsum

# TarSum Checksum Specification

## Abstract

This document describes the algorithms used in performing the TarSum checksum
calculation on filesystem layers, the need for this method over existing
methods, and the versioning of this calculation.

## Warning

This checksum algorithm is for best-effort comparison of file trees with fuzzy logic.

This is _not_ a cryptographic attestation, and should not be considered secure.

## Introduction

Docker transports filesystems as tar(1) archives. There are a variety of tar
serialization formats [2], and a key concern here is ensuring a repeatable
checksum given a set of inputs from a generic tar archive. Types of
transportation include distribution to and from a registry endpoint, saving
and loading through commands or Docker daemon APIs, transferring the build
context from client to Docker daemon, and committing the filesystem of a
container to become an image.

As tar archives are used for transit, but not preserved in many situations, the
focus of the algorithm is to ensure the integrity of the preserved filesystem
while keeping the result deterministic. The algorithm neither constrains the
ordering or manipulation of the files during the creation or unpacking of the
archive, nor includes additional metadata state about the filesystem
attributes.

## Intended Audience

This document outlines the methods used for consistent checksum calculation
for filesystems transported via tar archives.

Auditing these methodologies is an open and iterative process. This document
should accommodate the review of source code. Ultimately, this document should
be the starting point of further refinements to the algorithm and its future
versions.

## Concept

The checksum mechanism must ensure the integrity and assurance of the
filesystem payload.

## Checksum Algorithm Profile

A checksum mechanism must define the following operations and attributes:

* Associated hashing cipher - used to checksum each file payload and attribute
  information.
* Checksum list - each file of the filesystem archive has its checksum
  calculated from the payload and attributes of the file. The final checksum is
  calculated from this list, with specific ordering.
* Version - as the algorithm adapts to requirements, some of its behaviors
  are managed by versioning.
* Archive being calculated - the tar archive having its checksum calculated.

## Elements of TarSum checksum

The calculated sum output is a text string. The elements included in the output
of the calculated sum comprise the information needed for validation of the sum
(TarSum version and hashing cipher used) and the expected checksum in hexadecimal
form.

There are two delimiters used:
* '+' separates TarSum version from hashing cipher
* ':' separates calculation mechanics from expected hash

Example:

```
	"tarsum.v1+sha256:220a60ecd4a3c32c282622a625a54db9ba0ff55b5ba9c29c7064a2bc358b6a3e"
	|         |       \                                                               |
	|         |        \                                                              |
	|_version_|_cipher__|__                                                           |
	|                      \                                                          |
	|_calculation_mechanics_|______________________expected_sum_______________________|
```
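
The two delimiters make the string straightforward to take apart. The following is a minimal sketch in Python, not part of the TarSum implementation; `TarSumLabel` and `parse_tarsum` are hypothetical names introduced here for illustration.

```python
import string
from typing import NamedTuple


class TarSumLabel(NamedTuple):
    """Parsed pieces of a TarSum string (hypothetical helper type)."""
    version: str  # e.g. "tarsum.v1"
    cipher: str   # e.g. "sha256"
    digest: str   # expected checksum in hexadecimal form


def parse_tarsum(label: str) -> TarSumLabel:
    """Split a TarSum string on its two delimiters, ':' and '+'."""
    # ':' separates the calculation mechanics from the expected hash.
    mechanics, sep, digest = label.partition(":")
    if not sep or not digest:
        raise ValueError(f"missing ':' delimiter or digest in {label!r}")
    # '+' separates the TarSum version from the hashing cipher.
    version, sep, cipher = mechanics.partition("+")
    if not version or not sep or not cipher:
        raise ValueError(f"missing version, '+' delimiter or cipher in {label!r}")
    if any(c not in string.hexdigits for c in digest):
        raise ValueError(f"digest is not hexadecimal in {label!r}")
    return TarSumLabel(version, cipher, digest)
```

For the example string above, `parse_tarsum` yields the version `tarsum.v1`, the cipher `sha256`, and the 64-character hexadecimal digest.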

## Versioning

Versioning was introduced [0] to accommodate needed differences in the
calculation while maintaining backward compatibility.

The general algorithm is described further in the 'Calculation' section.

### Version0

This is the initial version of TarSum.

Its element in the TarSum checksum string is `tarsum`.

### Version1

Its element in the TarSum checksum is `tarsum.v1`.

The notable changes in this version:
* Exclusion of file `mtime` from the file information headers, in each file
  checksum calculation
* Inclusion of extended attributes (`xattrs`, also seen as `SCHILY.xattr.`-prefixed
  pax tar file info headers) keys and values in each file checksum calculation

### VersionDev

*Do not use unless validating refinements to the checksum algorithm*

Its element in the TarSum checksum is `tarsum.dev`.

This is a floating placeholder for the next version and grounds for testing
changes. The methods used for calculation are subject to change without notice,
and this version is for testing and not for production use.

## Ciphers

The official default and standard hashing cipher used in the calculation mechanic
is `sha256`. This refers to the SHA256 hash algorithm as defined in FIPS 180-4.

The TarSum algorithm itself is not exclusively bound to the single hashing
cipher `sha256`; support for alternate hashing ciphers was added later [1].
Use cases for an alternate cipher could include future-proofing the TarSum
checksum format and using faster cipher hashes for tar filesystem checksums.
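
As an illustration, the cipher name taken from a TarSum string can be mapped to a hash constructor. This is a minimal sketch using Python's `hashlib`; `new_hash` is a hypothetical helper, and it is an assumption that the cipher names used by TarSum line up with the algorithm names `hashlib` recognizes (this holds for `sha256`).

```python
import hashlib


def new_hash(cipher: str = "sha256"):
    """Return a fresh hash object for a cipher name from a TarSum string.

    Assumption: the cipher name matches a hashlib algorithm name.
    """
    if cipher not in hashlib.algorithms_available:
        raise ValueError(f"unsupported hashing cipher: {cipher!r}")
    return hashlib.new(cipher)
```

Defaulting to `sha256` mirrors the standard cipher named above; any other name has to be requested explicitly.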

## Calculation

### Requirement

As mentioned earlier, the calculation takes into consideration the lifecycle
of the tar archive: the archive is not an immutable, permanent artifact. If it
were, options like relying on a known hashing cipher checksum of the archive
itself would be reliable enough. The tar archive of the filesystem is used as
a transportation medium for Docker images, and the archive is discarded once
its contents are extracted. Therefore, items such as the order of files in the
tar archive and timestamps are subject to change once an image is received,
and consistent validation cannot depend on them.
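
The following sketch shows why a checksum of the raw archive is not enough: two archives holding the same file content, differing only in the recorded `mtime`, hash differently. It uses Python's `tarfile` module purely for illustration; `archive_digest` and the path `etc/hostname` are made up for this example.

```python
import hashlib
import io
import tarfile


def archive_digest(mtime: int) -> str:
    """Build a one-file tar archive in memory and hash its raw bytes."""
    payload = b"example\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name="etc/hostname")
        info.size = len(payload)
        info.mtime = mtime  # the only value that varies between calls
        tar.addfile(info, io.BytesIO(payload))
    return hashlib.sha256(buf.getvalue()).hexdigest()


# Same file content, different timestamps: the archive checksums differ.
same_content_differs = archive_digest(0) != archive_digest(1)
```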

### Process

The method is typically iterative due to reading tar info headers from the
archive stream, though this is not a strict requirement.

#### Files

Each file in the tar archive has its contents (headers and body) checksummed
individually using the associated hashing cipher. The ordered headers of the
file are written to the checksum calculation first, and then the payload of
the file body.

The resulting checksum of the file is appended to the list of file sums. The
sum is encoded as a string of the hexadecimal digest. Additionally, the file
name and position in the archive are kept as reference for special ordering.

#### Headers

The following headers are read, in this order (with the corresponding
representation of each value):
* 'name' - string
* 'mode' - string of the base10 integer
* 'uid' - string of the integer
* 'gid' - string of the integer
* 'size' - string of the integer
* 'mtime' (_Version0 only_) - string of integer of the seconds since 1970-01-01 00:00:00 UTC
* 'typeflag' - string of the char
* 'linkname' - string
* 'uname' - string
* 'gname' - string
* 'devmajor' - string of the integer
* 'devminor' - string of the integer

For >= Version1, the extended attribute headers ("SCHILY.xattr." prefixed pax
headers) are included after the above list. These xattrs key/values are first
sorted by the keys.
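
The header list above can be sketched as a function over Python's `tarfile.TarInfo`. This is an illustration under stated assumptions, not the reference implementation: `ordered_headers` is a hypothetical name, the mapping from the spec's header names to `TarInfo` attributes is this example's choice, and it is an assumption that the xattr key is kept with its `SCHILY.xattr.` prefix, since the text does not say whether the prefix is retained.

```python
import tarfile

XATTR_PREFIX = "SCHILY.xattr."


def ordered_headers(info: tarfile.TarInfo, version: str = "tarsum.v1"):
    """Return (key, value) header pairs for one tar entry, in spec order."""
    headers = [
        ("name", info.name),
        ("mode", str(info.mode)),  # base10 string of the integer
        ("uid", str(info.uid)),
        ("gid", str(info.gid)),
        ("size", str(info.size)),
    ]
    if version == "tarsum":  # Version0 only: mtime is part of the sum
        headers.append(("mtime", str(int(info.mtime))))
    headers += [
        ("typeflag", info.type.decode("ascii")),
        ("linkname", info.linkname),
        ("uname", info.uname),
        ("gname", info.gname),
        ("devmajor", str(info.devmajor)),
        ("devminor", str(info.devminor)),
    ]
    if version != "tarsum":  # Version1 and later: xattrs, sorted by key
        xattrs = {
            key: value
            for key, value in info.pax_headers.items()
            if key.startswith(XATTR_PREFIX)
        }
        headers += sorted(xattrs.items())
    return headers
```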

#### Header Format

The ordered headers are written to the hash in the format of

	"{.key}{.value}"

with no newline.

#### Body

After the ordered headers of the file have been added to the checksum for the
file, the body of the file is written to the hash.
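
Putting the header format and the body together, a per-file sum could be computed roughly as follows. This is a minimal sketch: `file_sum` is a hypothetical helper, the headers are passed in as already ordered `(key, value)` string pairs, and UTF-8 encoding of the header text is an assumption of this example.

```python
import hashlib


def file_sum(headers, body: bytes, cipher: str = "sha256") -> str:
    """Hash one file: ordered headers first, then the file body."""
    h = hashlib.new(cipher)
    for key, value in headers:
        # Key immediately followed by value, with no separator or newline.
        h.update(f"{key}{value}".encode("utf-8"))
    h.update(body)
    # The file sum is kept as the hexadecimal string of the digest.
    return h.hexdigest()
```

For example, `file_sum([("name", "a.txt"), ("size", "3")], b"abc")` hashes the bytes `namea.txtsize3abc`.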

#### List of file sums

The list of file sums is sorted by the string of the hexadecimal digest.

If there are two files in the tar with matching paths, the order of occurrence
for that path is reflected for the sums of the corresponding file header and
body.

#### Final Checksum

Begin with a fresh or initial state of the associated hash cipher. If there is
additional payload to include in the TarSum calculation for the archive, it is
written first. Then each checksum from the ordered list of file sums is written
to the hash.

The resulting digest is formatted per the Elements of TarSum checksum,
including the TarSum version, the associated hash cipher and the hexadecimal
encoded checksum digest.
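
The final step can be sketched as below. This is an illustration only: `final_tarsum` is a hypothetical name, each file sum is assumed to be written to the hash as its hexadecimal string, and the special ordering for files with matching paths is deliberately left out, so the sketch sorts purely by digest string.

```python
import hashlib


def final_tarsum(file_sums, version="tarsum.v1", cipher="sha256", extra=b""):
    """Combine per-file hex sums into a TarSum string.

    Simplification: entries with matching paths get no special ordering.
    """
    h = hashlib.new(cipher)  # fresh state of the associated hash
    h.update(extra)          # additional payload, if any, is written first
    for hex_sum in sorted(file_sums):
        h.update(hex_sum.encode("ascii"))
    return f"{version}+{cipher}:{h.hexdigest()}"
```

Because the sums are sorted before being written, the result does not depend on the order in which files appeared in the archive.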

## Security Considerations

The initial version of TarSum has undergone one update that could invalidate
handcrafted tar archives. The tar archive format supports appending files
with the same names as prior files in the archive. The latter file will clobber
the prior file of the same path. Because of this, the algorithm now accounts
for files with matching paths, and orders the list of file sums accordingly [3].
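
To make the name collision concrete, the sketch below builds an archive in which the same path appears twice and lists each entry with its position. It uses Python's `tarfile` for illustration; the helper names and the path `etc/motd` are made up for this example.

```python
import io
import tarfile


def build_archive_with_duplicate(path: str = "etc/motd") -> io.BytesIO:
    """Create an in-memory tar archive containing the same path twice."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for payload in (b"first\n", b"second\n"):
            info = tarfile.TarInfo(name=path)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def entries_by_position(archive: io.BytesIO):
    """Return (position, name, body) for every entry, in archive order."""
    with tarfile.open(fileobj=archive, mode="r") as tar:
        return [
            (pos, member.name, tar.extractfile(member).read())
            for pos, member in enumerate(tar.getmembers())
        ]
```

Both entries share one name but carry different bodies, so a checksum that tracks files by name alone cannot tell which of them ends up on disk; keeping the position alongside the name makes the order of occurrence available.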

## Footnotes

* [0] Versioning https://github.com/demonoid81/moby/commit/747f89cd327db9d50251b17797c4d825162226d0
* [1] Alternate ciphers https://github.com/demonoid81/moby/commit/4e9925d780665149b8bc940d5ba242ada1973c4e
* [2] Tar http://en.wikipedia.org/wiki/Tar_%28computing%29
* [3] Name collision https://github.com/demonoid81/moby/commit/c5e6362c53cbbc09ddbabd5a7323e04438b57d31

## Acknowledgments

Joffrey F (shin-) and Guillaume J. Charmes (creack) for the initial work on the
TarSum calculation.