page_title: TarSum checksum specification
page_description: Documentation for algorithms used in the TarSum checksum calculation
page_keywords: docker, checksum, validation, tarsum

# TarSum Checksum Specification

## Abstract

This document describes the algorithms used in performing the TarSum checksum
calculation on filesystem layers, the need for this method over existing
methods, and the versioning of this calculation.

## Warning

This checksum algorithm is for best-effort comparison of file trees with fuzzy logic.

This is _not_ a cryptographic attestation, and should not be considered secure.

## Introduction

The transportation of filesystems, regarding Docker, is done with tar(1)
archives. There are a variety of tar serialization formats [2], and a key
concern here is ensuring a repeatable checksum given a set of inputs from a
generic tar archive. Types of transportation include distribution to and from a
registry endpoint, saving and loading through commands or Docker daemon APIs,
transferring the build context from client to Docker daemon, and committing the
filesystem of a container to become an image.

As tar archives are used for transit, but not preserved in many situations, the
focus of the algorithm is to ensure the integrity of the preserved filesystem
while keeping the result deterministic. This means neither constraining the
ordering or manipulation of the files during the creation or unpacking of the
archive, nor including additional metadata state about the file system
attributes.

## Intended Audience

This document outlines the methods used for consistent checksum calculation
for filesystems transported via tar archives.

Auditing these methodologies is an open and iterative process.
This document should accommodate the review of source code. Ultimately, this
document should be the starting point of further refinements to the algorithm
and its future versions.

## Concept

The checksum mechanism must ensure the integrity and assurance of the
filesystem payload.

## Checksum Algorithm Profile

A checksum mechanism must define the following operations and attributes:

* Associated hashing cipher - used to checksum each file payload and attribute
  information.
* Checksum list - each file of the filesystem archive has its checksum
  calculated from the payload and attributes of the file. The final checksum is
  calculated from this list, with specific ordering.
* Version - as the algorithm adapts to requirements, there are behaviors of the
  algorithm to manage by versioning.
* Archive being calculated - the tar archive having its checksum calculated.

## Elements of TarSum checksum

The calculated sum output is a text string. The elements included in the output
of the calculated sum comprise the information needed for validation of the sum
(TarSum version and hashing cipher used) and the expected checksum in hexadecimal
form.

There are two delimiters used:
* '+' separates TarSum version from hashing cipher
* ':' separates calculation mechanics from expected hash

Example:

```
"tarsum.v1+sha256:220a60ecd4a3c32c282622a625a54db9ba0ff55b5ba9c29c7064a2bc358b6a3e"
|         |       \                                                               |
|         |        \                                                              |
|_version_|_cipher__|__                                                           |
|                      \                                                          |
|_calculation_mechanics_|______________________expected_sum_______________________|
```

## Versioning

Versioning was introduced [0] to accommodate differences in the calculation as
needs change, and to maintain backward compatibility.

The general algorithm is described further in the 'Calculation' section.

### Version0

This is the initial version of TarSum.

Its element in the TarSum checksum string is `tarsum`.

### Version1

Its element in the TarSum checksum is `tarsum.v1`.

The notable changes in this version:
* Exclusion of file `mtime` from the file information headers, in each file
  checksum calculation
* Inclusion of extended attribute (`xattrs`, also seen as `SCHILY.xattr.`-prefixed
  Pax tar file info headers) keys and values in each file checksum calculation

### VersionDev

*Do not use unless validating refinements to the checksum algorithm*

Its element in the TarSum checksum is `tarsum.dev`.

This is a floating placeholder for a next version and grounds for testing
changes. The methods used for calculation are subject to change without notice,
and this version is for testing and not for production use.

## Ciphers

The official default and standard hashing cipher used in the calculation mechanic
is `sha256`. This refers to the SHA-256 hash algorithm as defined in FIPS 180-4.

The TarSum algorithm itself is not exclusively bound to the single hashing
cipher `sha256`; support for alternate hashing ciphers was added later [1]. Use
cases for an alternate cipher could include future-proofing the TarSum checksum
format and using faster hashes for tar filesystem checksums.

## Calculation

### Requirement

As mentioned earlier, the calculation takes into consideration the lifecycle of
the tar archive: the tar archive is not an immutable, permanent artifact. If it
were, options like relying on a known hashing cipher checksum of the archive
itself would be reliable enough. The tar archive of the filesystem is used as a
transportation medium for Docker images, and the archive is discarded once its
contents are extracted.
Therefore, consistent validation cannot depend on items such as the order of
files in the tar archive and time stamps, which are subject to change once an
image is received.

### Process

The method is typically iterative due to reading tar info headers from the
archive stream, though this is not a strict requirement.

#### Files

Each file in the tar archive has its contents (headers and body) checksummed
individually using the designated associated hashing cipher. The ordered
headers of the file are written to the checksum calculation first, and then the
payload of the file body.

The resulting checksum of the file is appended to the list of file sums. The
sum is encoded as a string of the hexadecimal digest. Additionally, the file
name and position in the archive are kept as reference for special ordering.

#### Headers

The following headers are read, in this order (with the corresponding
representation of each value):
* 'name' - string
* 'mode' - string of the base10 integer
* 'uid' - string of the integer
* 'gid' - string of the integer
* 'size' - string of the integer
* 'mtime' (_Version0 only_) - string of integer of the seconds since 1970-01-01 00:00:00 UTC
* 'typeflag' - string of the char
* 'linkname' - string
* 'uname' - string
* 'gname' - string
* 'devmajor' - string of the integer
* 'devminor' - string of the integer

For >= Version1, the extended attribute headers ("SCHILY.xattr." prefixed pax
headers) are included after the above list. These xattr key/value pairs are
first sorted by key.

#### Header Format

The ordered headers are written to the hash in the format of

    "{.key}{.value}"

with no newline.

#### Body

After the ordered headers of the file have been added to the checksum for the
file, the body of the file is written to the hash.

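
The per-file steps above (ordered headers, `{.key}{.value}` formatting, then the
body) can be sketched as follows. This is a minimal illustration, not the
reference implementation: the `file_sum` helper, the `headers` dictionary layout,
and the example values are hypothetical, and the sketch assumes that xattr keys
are written exactly as supplied by the caller. It is not guaranteed to be
byte-compatible with any existing TarSum implementation.

```python
import hashlib

# Header keys in the order listed above. "mtime" is written for Version0 only.
HEADER_ORDER = [
    "name", "mode", "uid", "gid", "size", "mtime", "typeflag",
    "linkname", "uname", "gname", "devmajor", "devminor",
]


def file_sum(headers, body, xattrs=None, version1=True):
    """Return the hex digest for one file (hypothetical helper).

    headers: dict mapping each key in HEADER_ORDER to its value
    body:    file payload as bytes
    xattrs:  dict of extended attribute name -> bytes value (Version1 only)
    """
    h = hashlib.sha256()
    for key in HEADER_ORDER:
        if key == "mtime" and version1:
            continue  # Version1 excludes mtime
        # "{.key}{.value}" with no separator and no newline
        h.update((key + str(headers[key])).encode("utf-8"))
    if version1 and xattrs:
        # xattr pairs follow the fixed headers, sorted by key
        # (assumption: keys are hashed exactly as passed in)
        for key in sorted(xattrs):
            h.update(key.encode("utf-8") + xattrs[key])
    h.update(body)  # the payload is written after all headers
    return h.hexdigest()


# Illustrative values only; mode 420 is octal 0644 written in base 10.
example_headers = {
    "name": "etc/hostname", "mode": 420, "uid": 0, "gid": 0, "size": 5,
    "mtime": 1400000000, "typeflag": "0", "linkname": "",
    "uname": "root", "gname": "root", "devmajor": 0, "devminor": 0,
}
print(file_sum(example_headers, b"moby\n"))
```

With `version1=True`, changing only `mtime` in `example_headers` leaves the
digest unchanged, which is the behavior the Version1 notes describe. With
`version1=False`, the same change produces a different digest.
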
#### List of file sums

The list of file sums is sorted by the string of the hexadecimal digest.

If there are two files in the tar with matching paths, the order of occurrence
for that path is reflected for the sums of the corresponding file header and
body.

#### Final Checksum

Begin with a fresh or initial state of the associated hash cipher. If there is
additional payload to include in the TarSum calculation for the archive, it is
written first. Then each checksum from the ordered list of file sums is written
to the hash.

The resulting digest is formatted per the 'Elements of TarSum checksum' section,
including the TarSum version, the associated hash cipher and the hexadecimal
encoded checksum digest.

## Security Considerations

The initial version of TarSum has undergone one update that could invalidate
handcrafted tar archives. The tar archive format supports appending files
with the same names as prior files in the archive. The latter file will clobber
the prior file of the same path. Because of this, the algorithm now accounts for
files with matching paths, and orders the list of file sums accordingly [3].

## Footnotes

* [0] Versioning https://github.com/demonoid81/moby/commit/747f89cd327db9d50251b17797c4d825162226d0
* [1] Alternate ciphers https://github.com/demonoid81/moby/commit/4e9925d780665149b8bc940d5ba242ada1973c4e
* [2] Tar http://en.wikipedia.org/wiki/Tar_%28computing%29
* [3] Name collision https://github.com/demonoid81/moby/commit/c5e6362c53cbbc09ddbabd5a7323e04438b57d31

## Acknowledgments

Joffrey F (shin-) and Guillaume J. Charmes (creack) for the initial work on the
TarSum calculation.