page_title: TarSum checksum specification
page_description: Documentation for algorithms used in the TarSum checksum calculation
page_keywords: docker, checksum, validation, tarsum

# TarSum Checksum Specification

## Abstract

This document describes the algorithms used in performing the TarSum checksum
calculation on filesystem layers, the need for this method over existing
methods, and the versioning of this calculation.

## Introduction

Docker transports filesystems as tar(1) archives. There are a variety of tar
serialization formats [2], and a key concern here is ensuring a repeatable
checksum given a set of inputs from a generic tar archive. Types of
transportation include distribution to and from a registry endpoint, saving
and loading through commands or Docker daemon APIs, transferring the build
context from client to Docker daemon, and committing the filesystem of a
container to become an image.

As tar archives are used for transit, but not preserved in many situations, the
focus of the algorithm is to ensure the integrity of the preserved filesystem
while keeping the calculation deterministic. The algorithm therefore neither
constrains the ordering or manipulation of the files during the creation or
unpacking of the archive, nor includes additional metadata state about the
file system attributes.

## Intended Audience

This document outlines the methods used for consistent checksum calculation
for filesystems transported via tar archives.

Auditing these methodologies is an open and iterative process. This document
should accommodate the review of source code. Ultimately, this document should
be the starting point of further refinements to the algorithm and its future
versions.

## Concept

The checksum mechanism must ensure the integrity and assurance of the
filesystem payload.

## Checksum Algorithm Profile

A checksum mechanism must define the following operations and attributes:

* Associated hashing cipher - used to checksum each file payload and attribute
  information.
* Checksum list - each file of the filesystem archive has its checksum
  calculated from the payload and attributes of the file. The final checksum is
  calculated from this list, with specific ordering.
* Version - as the algorithm adapts to requirements, there are behaviors of the
  algorithm to manage by versioning.
* Archive being calculated - the tar archive having its checksum calculated.

## Elements of TarSum checksum

The calculated sum output is a text string. The elements included in the output
of the calculated sum comprise the information needed for validation of the sum
(TarSum version and hashing cipher used) and the expected checksum in hexadecimal
form.

There are two delimiters used:
* '+' separates TarSum version from hashing cipher
* ':' separates calculation mechanics from expected hash

Example:

```
	"tarsum.v1+sha256:220a60ecd4a3c32c282622a625a54db9ba0ff55b5ba9c29c7064a2bc358b6a3e"
	|         |       \                                                               |
	|         |        \                                                              |
	|_version_|_cipher__|__                                                           |
	|                      \                                                          |
	|_calculation_mechanics_|______________________expected_sum_______________________|
```
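As a minimal sketch of how the two delimiters split such a string, the following Go snippet separates the calculation mechanics from the expected sum at ':' and then the version from the cipher at '+'. The `parsedSum` type and `parseTarSum` function are hypothetical names chosen for illustration; they are not taken from the TarSum source code.

```go
package main

import (
	"fmt"
	"strings"
)

// parsedSum holds the three elements of a TarSum string.
// The type and its field names are assumptions made for this sketch.
type parsedSum struct {
	Version string // for example "tarsum.v1"
	Cipher  string // for example "sha256"
	Digest  string // expected checksum in hexadecimal form
}

// parseTarSum splits on ':' first (calculation mechanics vs. expected sum),
// then splits the mechanics on '+' (TarSum version vs. hashing cipher).
func parseTarSum(s string) (parsedSum, error) {
	i := strings.Index(s, ":")
	if i < 0 {
		return parsedSum{}, fmt.Errorf("missing ':' in %q", s)
	}
	mechanics, digest := s[:i], s[i+1:]
	j := strings.Index(mechanics, "+")
	if j < 0 {
		return parsedSum{}, fmt.Errorf("missing '+' in %q", s)
	}
	return parsedSum{
		Version: mechanics[:j],
		Cipher:  mechanics[j+1:],
		Digest:  digest,
	}, nil
}

func main() {
	p, err := parseTarSum("tarsum.v1+sha256:220a60ecd4a3c32c282622a625a54db9ba0ff55b5ba9c29c7064a2bc358b6a3e")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("version=%s cipher=%s digest=%s\n", p.Version, p.Cipher, p.Digest)
}
```

A validator would typically compare the parsed version and cipher against the ones it supports before recomputing the digest.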

## Versioning

Versioning was introduced [0] to accommodate differences in the calculation
and to maintain backward compatibility.

The general algorithm is described further in the 'Calculation' section.

### Version0

This is the initial version of TarSum.

Its element in the TarSum checksum string is `tarsum`.

### Version1

Its element in the TarSum checksum is `tarsum.v1`.

The notable changes in this version:
* Exclusion of file `mtime` from the file information headers, in each file
  checksum calculation
* Inclusion of extended attribute (`xattrs`) keys and values in each file
  checksum calculation. These also appear as `SCHILY.xattr.` prefixed pax tar
  file info headers

### VersionDev

*Do not use unless validating refinements to the checksum algorithm*

Its element in the TarSum checksum is `tarsum.dev`.

This is a floating placeholder for the next version and grounds for testing
changes. The methods used for calculation are subject to change without notice,
and this version is for testing and not for production use.

## Ciphers

The official default and standard hashing cipher used in the calculation mechanic
is `sha256`. This refers to the SHA256 hash algorithm as defined in FIPS 180-4.

The TarSum algorithm itself is not exclusively bound to the single hashing
cipher `sha256`; support for alternate hashing ciphers was added later [1].
Use cases for an alternate cipher could include future-proofing the TarSum
checksum format and using faster cipher hashes for tar filesystem checksums.
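To make the cipher element concrete, here is a small sketch that looks up a hash constructor by the name used in the TarSum string and returns the hexadecimal digest. The `ciphers` map and `hexSum` function are hypothetical, and listing `sha512` as an alternate is an assumption made for illustration only; this document names `sha256` alone.

```go
package main

import (
	"crypto/sha256"
	"crypto/sha512"
	"encoding/hex"
	"fmt"
	"hash"
)

// ciphers maps a cipher name, as it would appear in a TarSum string, to a
// hash constructor. The "sha512" entry is an assumed example of an alternate.
var ciphers = map[string]func() hash.Hash{
	"sha256": sha256.New,
	"sha512": sha512.New,
}

// hexSum hashes data with the named cipher and returns the hex digest.
func hexSum(cipher string, data []byte) (string, error) {
	newHash, ok := ciphers[cipher]
	if !ok {
		return "", fmt.Errorf("unknown cipher %q", cipher)
	}
	h := newHash()
	h.Write(data) // hash.Hash writes never return an error
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	sum, err := hexSum("sha256", []byte("example payload"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("sha256:" + sum)
}
```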

## Calculation

### Requirement

As mentioned earlier, the calculation takes into consideration the lifecycle
of the tar archive: the archive is not an immutable, permanent artifact.
Otherwise, relying on a known hashing cipher checksum of the archive itself
would be reliable enough. The tar archive of the filesystem is used as a
transportation medium for Docker images, and the archive is discarded once its
contents are extracted. Consequently, details such as the order of files in
the tar archive and their time stamps are subject to change once an image is
received, and a consistent validation cannot depend on them.

### Process

The method is typically iterative due to reading tar info headers from the
archive stream, though this is not a strict requirement.

#### Files

Each file in the tar archive has its contents (headers and body) checksummed
individually using the designated associated hashing cipher. The ordered
headers of the file are written to the checksum calculation first, and then the
payload of the file body.

The resulting checksum of the file is appended to the list of file sums. The
sum is encoded as a string of the hexadecimal digest. Additionally, the file
name and position in the archive are kept as reference for special ordering.

#### Headers

The following headers are read, in this order (with the corresponding
representation of each value):
* 'name' - string
* 'mode' - string of the base10 integer
* 'uid' - string of the integer
* 'gid' - string of the integer
* 'size' - string of the integer
* 'mtime' (_Version0 only_) - string of integer of the seconds since 1970-01-01 00:00:00 UTC
* 'typeflag' - string of the char
* 'linkname' - string
* 'uname' - string
* 'gname' - string
* 'devmajor' - string of the integer
* 'devminor' - string of the integer

For >= Version1, the extended attribute headers ("SCHILY.xattr." prefixed pax
headers) are included after the above list. These xattrs key/values are first
sorted by the keys.

#### Header Format

The ordered headers are written to the hash in the format of

	"{.key}{.value}"

with no newline.

#### Body

After the ordered headers of the file have been added to the checksum for the
file, the body of the file is written to the hash.

#### List of file sums

The list of file sums is sorted by the string of the hexadecimal digest.

If there are two files in the tar with matching paths, the order of occurrence
for that path is reflected for the sums of the corresponding file header and
body.
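The sort by hexadecimal digest can be sketched as below. The `fileInfoSum` type and `sortBySum` function are hypothetical. The treatment of matching paths is only loosely described above, so this sketch makes an assumption: it uses a stable sort, so entries whose digests are equal keep their order of occurrence in the archive. It does not claim to reproduce the exact collision handling of the reference implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// fileInfoSum records what is kept per file: its name, its hex digest,
// and its position in the archive (hypothetical type for this sketch).
type fileInfoSum struct {
	Name string
	Sum  string
	Pos  int
}

// sortBySum orders the list by the string of the hexadecimal digest.
// The list is expected in archive order; the stable sort then preserves
// that order among entries with equal digests.
func sortBySum(list []fileInfoSum) {
	sort.SliceStable(list, func(i, j int) bool {
		return list[i].Sum < list[j].Sum
	})
}

func main() {
	list := []fileInfoSum{
		{Name: "b.txt", Sum: "ff01", Pos: 0},
		{Name: "a.txt", Sum: "0a9c", Pos: 1},
		{Name: "b.txt", Sum: "ff01", Pos: 2},
	}
	sortBySum(list)
	for _, f := range list {
		fmt.Println(f.Sum, f.Name, f.Pos)
	}
}
```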

#### Final Checksum

Begin with a fresh or initial state of the associated hash cipher. If there is
additional payload to include in the TarSum calculation for the archive, it is
written first. Then each checksum from the ordered list of file sums is written
to the hash.

The resulting digest is formatted per the Elements of TarSum checksum,
including the TarSum version, the associated hash cipher and the hexadecimal
encoded checksum digest.

## Security Considerations

The initial version of TarSum has undergone one update that could invalidate
handcrafted tar archives. The tar archive format supports appending files
with the same names as prior files in the archive. The latter file will clobber
the prior file of the same path. Because of this, the algorithm now accounts
for files with matching paths, and orders the list of file sums accordingly [3].

## Footnotes

* [0] Versioning https://github.com/docker/docker/commit/747f89cd327db9d50251b17797c4d825162226d0
* [1] Alternate ciphers https://github.com/docker/docker/commit/4e9925d780665149b8bc940d5ba242ada1973c4e
* [2] Tar http://en.wikipedia.org/wiki/Tar_%28computing%29
* [3] Name collision https://github.com/docker/docker/commit/c5e6362c53cbbc09ddbabd5a7323e04438b57d31

## Acknowledgements

Thanks to Joffrey F (shin-) and Guillaume J. Charmes (creack) for the initial
work on the TarSum calculation.