page_title: TarSum checksum specification
page_description: Documentation for algorithms used in the TarSum checksum calculation
page_keywords: docker, checksum, validation, tarsum

# TarSum Checksum Specification

## Abstract

This document describes the algorithms used in performing the TarSum checksum
calculation on filesystem layers, the need for this method over existing
methods, and the versioning of this calculation.


## Introduction

In Docker, filesystems are transported as tar(1) archives. There are a variety
of tar serialization formats [2], and a key concern here is ensuring a
repeatable checksum given a set of inputs from a generic tar archive. Types of
transportation include distribution to and from a registry endpoint, saving and
loading through commands or Docker daemon APIs, transferring the build context
from client to Docker daemon, and committing the filesystem of a container to
become an image.

As tar archives are used for transit, but not preserved in many situations, the
focus of the algorithm is to ensure the integrity of the preserved filesystem,
while maintaining a deterministic accountability. The algorithm neither
constrains the ordering or manipulation of the files during the creation or
unpacking of the archive, nor includes additional metadata state about the file
system attributes.

## Intended Audience

This document outlines the methods used for consistent checksum calculation
for filesystems transported via tar archives.

Auditing these methodologies is an open and iterative process. This document
should accommodate the review of source code. Ultimately, this document should
be the starting point of further refinements to the algorithm and its future
versions.

## Concept

The checksum mechanism must ensure the integrity and assurance of the
filesystem payload.

## Checksum Algorithm Profile

A checksum mechanism must define the following operations and attributes:

* Associated hashing cipher - used to checksum each file payload and attribute
  information.
* Checksum list - each file of the filesystem archive has its checksum
  calculated from the payload and attributes of the file. The final checksum is
  calculated from this list, with specific ordering.
* Version - as the algorithm adapts to requirements, there are behaviors of the
  algorithm to manage by versioning.
* Archive being calculated - the tar archive having its checksum calculated

## Elements of TarSum checksum

The calculated sum output is a text string. The elements included in the output
of the calculated sum comprise the information needed for validation of the sum
(TarSum version and hashing cipher used) and the expected checksum in hexadecimal
form.

There are two delimiters used:
* '+' separates TarSum version from hashing cipher
* ':' separates calculation mechanics from expected hash

Example:

```
"tarsum.v1+sha256:220a60ecd4a3c32c282622a625a54db9ba0ff55b5ba9c29c7064a2bc358b6a3e"
|         |       \                                                               |
|         |        \                                                              |
|_version_|_cipher__|__                                                           |
|                      \                                                          |
|_calculation_mechanics_|______________________expected_sum_______________________|
```

## Versioning

Versioning was introduced [0] to accommodate needed differences in the
calculation while maintaining backward compatibility.

The general algorithm is described further in the 'Calculation' section.

### Version0

This is the initial version of TarSum.

Its element in the TarSum checksum string is `tarsum`.

### Version1

Its element in the TarSum checksum is `tarsum.v1`.

The notable changes in this version:
* Exclusion of file `mtime` from the file information headers, in each file
  checksum calculation
* Inclusion of extended attribute (`xattrs`, also seen as `SCHILY.xattr.`
  prefixed Pax tar file info headers) keys and values in each file checksum
  calculation

### VersionDev

*Do not use unless validating refinements to the checksum algorithm*

Its element in the TarSum checksum is `tarsum.dev`.

This is a floating place holder for a next version and grounds for testing
changes. The methods used for calculation are subject to change without notice,
and this version is for testing and not for production use.

## Ciphers

The official default and standard hashing cipher used in the calculation mechanic
is `sha256`. This refers to the SHA256 hash algorithm as defined in FIPS 180-4.

The TarSum algorithm itself is not exclusively bound to the single hashing
cipher `sha256`; support for alternate hashing ciphers was later added [1]. Use
cases for an alternate cipher could include future-proofing the TarSum checksum
format and using faster cipher hashes for tar filesystem checksums.

## Calculation

### Requirement

As mentioned earlier, the calculation takes into consideration the lifecycle of
the tar archive: the archive is not an immutable, permanent artifact. Otherwise,
options like relying on a known hashing cipher checksum of the archive itself
would be reliable enough. The tar archive of the filesystem is used as a
transportation medium for Docker images, and the archive is discarded once its
contents are extracted. Therefore, for consistent validation, note that items
such as the order of files in the tar archive and time stamps are subject to
change once an image is received.
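
The requirement above can be illustrated with a minimal Python sketch. It is
not part of TarSum itself: `build_tar` is a hypothetical helper introduced only
for this example. The sketch shows that a plain `sha256` over the raw archive
bytes changes when only the entry order or the time stamp changes, even though
the file contents are identical.

```python
import hashlib
import io
import tarfile


def build_tar(entries, mtime=0):
    """Hypothetical helper: return the bytes of an uncompressed tar archive
    holding the given (name, data) pairs, written in the given order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in entries:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


files = [("a.txt", b"alpha"), ("b.txt", b"beta")]

# Same files and contents, hashed as raw archive bytes.
original = hashlib.sha256(build_tar(files)).hexdigest()
# Only the order of the entries differs.
reordered = hashlib.sha256(build_tar(files[::-1])).hexdigest()
# Only the time stamp of the entries differs.
retimed = hashlib.sha256(build_tar(files, mtime=1)).hexdigest()

print(original != reordered, original != retimed)
```

Both comparisons report a difference, which is why a checksum of the archive
itself is not enough for filesystems that are re-archived after receipt.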

### Process

The method is typically iterative due to reading tar info headers from the
archive stream, though this is not a strict requirement.

#### Files

Each file in the tar archive has its contents (headers and body) checksummed
individually using the designated associated hashing cipher. The ordered
headers of the file are written to the checksum calculation first, and then the
payload of the file body.

The resulting checksum of the file is appended to the list of file sums. The
sum is encoded as a string of the hexadecimal digest. Additionally, the file
name and position in the archive are kept as reference for special ordering.

#### Headers

The following headers are read, in this order (and the corresponding
representation of each value):
* 'name' - string
* 'mode' - string of the base10 integer
* 'uid' - string of the integer
* 'gid' - string of the integer
* 'size' - string of the integer
* 'mtime' (_Version0 only_) - string of integer of the seconds since 1970-01-01 00:00:00 UTC
* 'typeflag' - string of the char
* 'linkname' - string
* 'uname' - string
* 'gname' - string
* 'devmajor' - string of the integer
* 'devminor' - string of the integer

For >= Version1, the extended attribute headers ("SCHILY.xattr." prefixed pax
headers) are included after the above list. These xattrs key/values are first
sorted by the keys.

#### Header Format

The ordered headers are written to the hash in the format of

    "{.key}{.value}"

with no newline.

#### Body

After the ordered headers of the file have been added to the checksum for the
file, the body of the file is written to the hash.

#### List of file sums

The list of file sums is sorted by the string of the hexadecimal digest.
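
The per-file steps described so far (ordered headers, then body, then sorting
the sums) can be sketched in Python as follows. This is an illustration under
stated assumptions, not Docker's implementation, and it is not claimed to
reproduce Docker's digests byte for byte: the names `ordered_headers`,
`file_sums` and `build_tar` are hypothetical, the full pax key (including the
`SCHILY.xattr.` prefix) is used as the xattr key, `sha256` is the only cipher,
and only regular files contribute a body.

```python
import hashlib
import io
import tarfile

XATTR_PREFIX = "SCHILY.xattr."


def ordered_headers(info, version="tarsum.v1"):
    """Return (key, value) string pairs in the order listed under 'Headers'."""
    headers = [
        ("name", info.name),
        ("mode", str(info.mode)),
        ("uid", str(info.uid)),
        ("gid", str(info.gid)),
        ("size", str(info.size)),
    ]
    if version == "tarsum":  # Version0 only: mtime is part of the sum
        headers.append(("mtime", str(int(info.mtime))))
    headers += [
        ("typeflag", info.type.decode("ascii")),
        ("linkname", info.linkname),
        ("uname", info.uname),
        ("gname", info.gname),
        ("devmajor", str(info.devmajor)),
        ("devminor", str(info.devminor)),
    ]
    if version != "tarsum":  # >= Version1: xattrs, sorted by key
        # Assumption: the key keeps its "SCHILY.xattr." prefix.
        xattrs = {k: v for k, v in info.pax_headers.items()
                  if k.startswith(XATTR_PREFIX)}
        headers += sorted(xattrs.items())
    return headers


def file_sums(archive_bytes, version="tarsum.v1"):
    """Return (hex digest, name, position) tuples sorted by hex digest."""
    sums = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:") as tar:
        for position, info in enumerate(tar):
            h = hashlib.sha256()
            # Headers are written as "{key}{value}" with no newline.
            for key, value in ordered_headers(info, version):
                h.update((key + value).encode("utf-8", "surrogateescape"))
            # Then the body of the file is written to the same hash.
            if info.isreg():
                body = tar.extractfile(info)
                for chunk in iter(lambda: body.read(65536), b""):
                    h.update(chunk)
            sums.append((h.hexdigest(), info.name, position))
    return sorted(sums, key=lambda s: s[0])


def build_tar(entries, mtime=0):
    """Hypothetical helper: build an uncompressed tar from (name, data) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in entries:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

With this sketch, two archives that hold the same files in a different order
and with different time stamps yield the same sorted digests under
`tarsum.v1`, while the `tarsum` (Version0) digests differ because `mtime` is
included.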

If there are two files in the tar with matching paths, the order of occurrence
for that path is reflected for the sums of the corresponding file header and
body.

#### Final Checksum

Begin with a fresh or initial state of the associated hash cipher. If there is
additional payload to include in the TarSum calculation for the archive, it is
written first. Then each checksum from the ordered list of file sums is written
to the hash.

The resulting digest is formatted per the Elements of TarSum checksum,
including the TarSum version, the associated hash cipher and the hexadecimal
encoded checksum digest.

## Security Considerations

The initial version of TarSum has undergone one update that could invalidate
handcrafted tar archives. The tar archive format supports appending of files
with the same names as prior files in the archive. The latter file will clobber
the prior file of the same path. Due to this, the algorithm now accounts for
files with matching paths, and orders the list of file sums accordingly [3].

## Footnotes

* [0] Versioning https://github.com/docker/docker/commit/747f89cd327db9d50251b17797c4d825162226d0
* [1] Alternate ciphers https://github.com/docker/docker/commit/4e9925d780665149b8bc940d5ba242ada1973c4e
* [2] Tar http://en.wikipedia.org/wiki/Tar_%28computing%29
* [3] Name collision https://github.com/docker/docker/commit/c5e6362c53cbbc09ddbabd5a7323e04438b57d31

## Acknowledgements

Joffrey F (shin-) and Guillaume J. Charmes (creack) on the initial work of the
TarSum calculation.