// Copyright 2018 The Kythe Authors. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

Kythe Compilation ZIP Format (.kzip)
====================================
Michael J. Fromberger <fromberger@google.com>
v.0.1.0, 18-May-2018: Draft
:toc:
:toclevels: 3
:priority: 500

== Summary

This document specifies a compact persistent storage representation for
compilation records, suitable for use by Kythe to generate cross-reference data
and to apply other static analysis tools to source files.

== Background

To generate cross-references, Kythe captures a record of each compilation that
is to be indexed (_e.g.,_ a library or binary) with enough information to
enable us to replay the compilation to the front-end of the compiler. This
record consists of a
`CompilationUnit` link:https://developers.google.com/protocol-buffers[protobuf]
message, together with the content of all the source files and other inputs the
compiler needs to process the compilation (_e.g.,_ header files or type
snapshots from dependencies).

== Kythe ZIP Format (.kzip)

To store compilation records compactly, we use a specially formatted ZIP
archive that we call a *kzip* file, conventionally given the file extension
`.kzip`.
A kzip file consists of the following directory structure:

[literal]
root/              # Any valid non-empty directory name
  units/
    abcd1234       # Compilation unit (JSON format, see below for details)
    ...            # (name is hex-coded SHA256 of record content)
  files/
    1a2b3c4e       # File contents, uncompressed
    ...            # (name is hex-coded SHA256 of uncompressed file content)

This organization separates the compilation unit descriptions from their file
data, which are shared among multiple compilations.

A `.kzip` file can use an alternate structure, in which compilation units are
stored in another encoding. Currently the only other encoding defined and
supported is "proto". In this case, the directory structure looks like:

[literal]
root/              # Any valid non-empty directory name
  pbunits/
    abcd1234       # Compilation unit (proto format, see below for details)
    ...            # (name is hex-coded SHA256 of record content)
  files/
    1a2b3c4e       # File contents, uncompressed
    ...            # (name is hex-coded SHA256 of uncompressed file content)

=== Directory and File Layout

A kzip is a ZIP file containing a top-level root directory that contains two
subdirectories, one named `units` (or `pbunits` for the proto encoding) and one
named `files`.

* The `units` (or `pbunits`) subdirectory may contain only unit files.

* The `files` subdirectory may contain only data files.

* Other files or directories inside the `units` or `files` subdirectories
  should cause a tool to consider the kzip file invalid.

* Other files or subdirectories in the root or other subdirectories should be
  ignored by a tool processing the kzip file.

A *unit file* is a file containing a compilation unit description.
The name of a unit file is computed by digesting the compilation unit
with SHA256, and encoding the resulting hash as a string of lowercase
ASCII hexadecimal digits. This string becomes the filename of the unit
file.
Note that the digest should only process the `CompilationUnit`
itself, and should not include the other contents of the wrapper
message. Details on the digesting algorithm are described below.

A *data file* is a file containing an unstructured blob of raw (uncompressed)
file data. The name of a data file is computed by hashing the file
contents with SHA256, and encoding the resulting hash as a string of lowercase
ASCII hexadecimal digits. This string becomes the filename of the data file.

The *root directory* must be the first entry in the ZIP file, and its name must
not be empty.

=== Compilation Unit Description Format

The content of a unit file is the canonical JSON encoding of a
`kythe.proto.IndexedCompilation` protobuf message.

[source,javascript]
{
  "unit": <encoded kythe.proto.CompilationUnit>,
  "index": {
    "revision": ["123", "456", "789"]
  }
}

The `"unit"` key is required, and must contain the canonical JSON encoding of a
`kythe.proto.CompilationUnit` protobuf message. The `"index"` key is optional,
but if set must contain the canonical JSON encoding of an `Index` message.

For the `proto` encoding of compilation units, the content of the unit
is the standard wire-encoding of the `kythe.proto.IndexedCompilation`
protobuf message.

=== Computing the digest for a Compilation Unit

The representative implementation can be found in
`kythe/go/platform/kcd/kythe/units.go`.

==== Definitions

`NULL`::
  the one-byte value corresponding to ASCII NUL (`\x00`)
`NL`::
  the one-byte value corresponding to ASCII newline (`\x0a`)

All strings are emitted in UTF-8 encoding.

==== Canonical form
For the purposes of computing the digest, a compilation unit should
be in *_canonical form_*.
This is defined as:

- the field `required_input` is deduplicated and sorted according to
  `cu.required_input.path`
- the field `environment` is sorted by `environment.name`
- the field `source_file` is sorted
- the field `details` is sorted by `details.type_url`

==== Digest computation
Let `v` be a `kythe.proto.VName`. Then a digest of `v` is

....
v.signature NULL v.corpus NULL v.root NULL v.path NULL v.language NULL
....

Let `cu` be an instance of `kythe.proto.CompilationUnit`. Then a
digest of `cu` is computed as the sequence

....
"CU" NL cu.vname NULL
....

followed by, for each required input:

....
"RI" NL cu.required_input[i].vname NULL
"IN" NL cu.required_input[i].info.path NULL cu.required_input[i].info.digest NULL
....

followed by:

....
"ARG" NL cu.argument[0] NULL cu.argument[1] NULL ...
"OUT" NL cu.output_key NULL
"SRC" NL cu.source_file[0] NULL cu.source_file[1] NULL ...
"CWD" NL cu.working_directory NULL
"CTX" NL cu.entry_context NULL
....

followed by, for each `cu.environment`:

....
"ENV" NL cu.environment[i].name NULL cu.environment[i].value NULL
....

finally followed by, for each `cu.details`:

....
"DET" NL cu.details[i].type_url NULL cu.details[i].value NULL
....

For the `cu.details.value`, this is the sequence of bytes of the
wire-encoding proto representation of the value.
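
The byte sequence described above can be sketched in a few lines of Python.
This is an illustration only, not the reference implementation (which is the Go
code in `kythe/go/platform/kcd/kythe/units.go`). It makes several assumptions
that are not part of this specification: the compilation unit is given as a
plain `dict` keyed by the field names used above (a hypothetical stand-in for
the real protobuf message), the unit is already in canonical form, missing
fields are treated as empty, and `cu.vname NULL` is read as the VName digest
itself (which already ends in `NULL`) with no extra terminator. The helper
names `digest_input`, `unit_file_name`, and `data_file_name` are invented for
this example.

```python
import hashlib

NULL = b"\x00"  # ASCII NUL terminator
NL = b"\x0a"    # ASCII newline


def _field(tag, *values):
    """Emit `tag NL`, then each value followed by NULL."""
    out = tag.encode("utf-8") + NL
    for value in values:
        if isinstance(value, str):
            value = value.encode("utf-8")  # all strings are emitted as UTF-8
        out += value + NULL
    return out


def _vname(v):
    """Return the five VName components in digest order."""
    return tuple(
        v.get(key, "")
        for key in ("signature", "corpus", "root", "path", "language")
    )


def digest_input(cu):
    """Build the byte sequence that is hashed for a canonical unit (a dict)."""
    out = _field("CU", *_vname(cu.get("vname", {})))
    for ri in cu.get("required_input", []):
        out += _field("RI", *_vname(ri.get("vname", {})))
        info = ri.get("info", {})
        out += _field("IN", info.get("path", ""), info.get("digest", ""))
    out += _field("ARG", *cu.get("argument", []))
    out += _field("OUT", cu.get("output_key", ""))
    out += _field("SRC", *cu.get("source_file", []))
    out += _field("CWD", cu.get("working_directory", ""))
    out += _field("CTX", cu.get("entry_context", ""))
    for env in cu.get("environment", []):
        out += _field("ENV", env.get("name", ""), env.get("value", ""))
    for det in cu.get("details", []):
        # Assumption: "value" already holds the wire-encoded bytes.
        out += _field("DET", det.get("type_url", ""), det.get("value", b""))
    return out


def unit_file_name(cu):
    """Lowercase hex SHA256 of the unit digest input."""
    return hashlib.sha256(digest_input(cu)).hexdigest()


def data_file_name(content):
    """Lowercase hex SHA256 of the raw (uncompressed) file content."""
    return hashlib.sha256(content).hexdigest()
```

Under these assumptions, `unit_file_name(cu)` would give the entry name for a
unit under `units/` (or `pbunits/`), and `data_file_name(content)` the entry
name for a file under `files/`. Any tool that needs byte-for-byte agreement
with Kythe should be checked against the reference implementation.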