// Copyright 2018 The Kythe Authors. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

Kythe Compilation ZIP Format (.kzip)
====================================
Michael J. Fromberger <fromberger@google.com>
v.0.1.0, 18-May-2018: Draft
:toc:
:toclevels: 3
:priority: 500
 
== Summary

This document specifies a compact persistent storage representation for
compilation records, suitable for use by Kythe to generate cross-reference data
and to apply other static analysis tools to source files.

== Background

To generate cross-references, Kythe captures a record of each compilation that
is to be indexed (_e.g.,_ a library or binary) with enough information to
enable us to replay the compilation to the front-end of the compiler.  This
record consists of a
`CompilationUnit` link:https://developers.google.com/protocol-buffers[protobuf]
message, together with the content of all the source files and other inputs the
compiler needs to process the compilation (_e.g.,_ header files or type
snapshots from dependencies).

== Kythe ZIP Format (.kzip)

To store compilation records compactly, we use a specially formatted ZIP
archive that we call a *kzip* file, conventionally given the file extension
`.kzip`. A kzip file consists of the following directory structure:

[literal]
root/           # Any valid non-empty directory name
   units/
     abcd1234   # Compilation unit (JSON format, see below for details)
     ...        # (name is hex-coded SHA256 of record content)
   files/
     1a2b3c4e   # File contents, uncompressed
     ...        # (name is hex-coded SHA256 of uncompressed file content)

This organization separates the compilation unit descriptions from their file
data, which are shared among multiple compilations.
 
A kzip file may instead store its compilation units in another encoding.
Currently the only other encoding defined and supported is `proto`, in which
case the directory structure is:

[literal]
root/           # Any valid non-empty directory name
   pbunits/
     abcd1234   # Compilation unit (proto format, see below for details)
     ...        # (name is hex-coded SHA256 of record content)
   files/
     1a2b3c4e   # File contents, uncompressed
     ...        # (name is hex-coded SHA256 of uncompressed file content)
 
=== Directory and File Layout

A kzip is a ZIP file containing a top-level root directory that contains two
subdirectories: one named `units` (or `pbunits`, for the `proto` encoding) and
one named `files`.

 * The `units` (or `pbunits`) subdirectory may contain only unit files.

 * The `files` subdirectory may contain only data files.

 * Other files or directories inside the `units` or `files` subdirectories
   should cause a tool to consider the kzip file invalid.

 * Other files or subdirectories in the root or other subdirectories should be
   ignored by a tool processing the kzip file.

A *unit file* is a file containing a compilation unit description.
The name of a unit file is computed by digesting the compilation unit
with SHA256, and encoding the resulting hash as a string of lowercase
ASCII hexadecimal digits. This string becomes the filename of the unit
file. Note that the digest should only process the `CompilationUnit`
itself, and should not include the other contents of the wrapper
message. Details on the digesting algorithm are described below.

A *data file* is a file containing an unstructured blob of raw (uncompressed)
file data.  The name of a data file is computed by hashing the file
contents with SHA256, and encoding the resulting hash as a string of lowercase
ASCII hexadecimal digits. This string becomes the filename of the data file.
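 
As a minimal sketch of the data file naming rule, the following Python snippet
hashes a blob with SHA256 and renders the digest in lowercase hexadecimal. The
helper name `data_file_name` is hypothetical and is not part of any Kythe tool.

```python
import hashlib


def data_file_name(content: bytes) -> str:
    """Return the data file name for the given uncompressed content.

    Hypothetical helper for illustration; not a Kythe API.
    """
    # hexdigest() already yields lowercase ASCII hexadecimal digits.
    return hashlib.sha256(content).hexdigest()


name = data_file_name(b"int main() { return 0; }\n")
```

Because the name depends only on the content, identical inputs used by several
compilations map to a single entry under `files`.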
 
The *root directory* must be the first entry in the ZIP file, and its name must
not be empty.
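 
The layout rules above can be sketched with Python's standard `zipfile`
module. This is an illustration under stated assumptions, not a Kythe tool:
the helper names `write_kzip` and `check_layout` are hypothetical, unit names
are taken as precomputed digests, and the only invalid content checked for is
an entry nested below `units`, `pbunits`, or `files`.

```python
import hashlib
import io
import zipfile


def write_kzip(units: dict, files: list, root: str = "root") -> bytes:
    """Build an in-memory kzip using the JSON (`units`) layout.

    `units` maps precomputed unit digests to encoded unit content and
    `files` lists raw file contents. Hypothetical helper, not a Kythe API.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The root directory entry is written first, as the format requires.
        zf.writestr(root + "/", b"")
        for name, content in units.items():
            zf.writestr(f"{root}/units/{name}", content)
        # dict.fromkeys drops repeated contents while keeping their order.
        for content in dict.fromkeys(files):
            name = hashlib.sha256(content).hexdigest()
            zf.writestr(f"{root}/files/{name}", content)
    return buf.getvalue()


def check_layout(kzip_bytes: bytes) -> str:
    """Return the root directory name, or raise ValueError if invalid."""
    with zipfile.ZipFile(io.BytesIO(kzip_bytes)) as zf:
        names = zf.namelist()
    if not names:
        raise ValueError("archive is empty")
    root = names[0].rstrip("/")
    if not root or "/" in root or not names[0].endswith("/"):
        raise ValueError("first entry must be a non-empty root directory")
    for name in names[1:]:
        parts = name.split("/")
        if len(parts) < 3 or parts[0] != root:
            continue  # entries elsewhere are ignored
        # Assumption: anything nested deeper than one level is "other" content.
        if parts[1] in ("units", "pbunits", "files") and len(parts) != 3:
            raise ValueError(f"unexpected entry: {name}")
    return root


kzip_bytes = write_kzip({"abcd1234": b"{}"}, [b"hello", b"hello"])
root_name = check_layout(kzip_bytes)
```

The unit name `abcd1234` is a placeholder; a real writer would derive it from
the compilation unit digest.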
 
=== Compilation Unit Description Format

The content of a unit file is the canonical JSON encoding of a
`kythe.proto.IndexedCompilation` protobuf message.

[source,javascript]
{
   "unit": <encoded kythe.proto.CompilationUnit>,
   "index": {
      "revision": ["123", "456", "789"]
   }
}

The `"unit"` key is required, and must contain the canonical JSON encoding of a
`kythe.proto.CompilationUnit` protobuf message.  The `"index"` key is optional,
but if set must contain the canonical JSON encoding of an `Index` message.

For the `proto` encoding of compilation units, the content of the unit
is the standard wire-encoding of the `kythe.proto.IndexedCompilation`
protobuf message.
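 
As a rough sketch of how a tool might read a JSON-encoded unit file, the
following Python snippet decodes the wrapper and enforces only the rule that
`"unit"` is required while `"index"` is optional. The helper name
`parse_unit_file` and the sample content are hypothetical; a real reader would
decode the payload into protobuf messages rather than plain dictionaries.

```python
import json


def parse_unit_file(content: bytes) -> dict:
    """Decode a JSON unit file and check its top-level keys.

    Hypothetical helper for illustration; not a Kythe API.
    """
    record = json.loads(content.decode("utf-8"))
    if not isinstance(record, dict) or "unit" not in record:
        raise ValueError('unit file is missing the required "unit" key')
    return record


# Illustrative input only; real units carry many more fields.
sample = b'{"unit": {"argument": ["-c", "foo.c"]}, "index": {"revision": ["123"]}}'
record = parse_unit_file(sample)
# The "index" key may be absent, so fall back to an empty list.
revisions = record.get("index", {}).get("revision", [])
```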
 
=== Computing the digest for a Compilation Unit

The representative implementation can be found in
`kythe/go/platform/kcd/kythe/units.go`.

==== Definitions

`NULL`::
the one-byte value corresponding to ASCII NULL (`\x00`)
`NL`::
the one-byte value corresponding to ASCII newline (`\x0a`)

All strings are emitted in UTF-8 encoding.

==== Canonical form
For the purposes of computing the digest, a compilation unit should
be in *_canonical form_*. This is defined as:

- the field `required_input` is deduplicated and sorted according to
  `cu.required_input.path`
- the field `environment` is sorted by `environment.name`
- the field `source_file` is sorted
- the field `details` is sorted by `details.type_url`
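 
The canonicalization rules can be sketched as follows. This is an
illustration, not the reference implementation: the compilation unit is
modelled as a plain Python dictionary keyed by the field names used in this
document, the sort key for required inputs is assumed to be `info.path`, and
two required inputs are assumed to be duplicates only when all of their fields
match. The function name `canonicalize` is hypothetical.

```python
import json


def canonicalize(cu: dict) -> dict:
    """Return a shallow copy of `cu` with its repeated fields in canonical order.

    `cu` is a plain-dict stand-in for a CompilationUnit message.
    """
    out = dict(cu)

    # Deduplicate required inputs (assumption: by full equality), then sort
    # them by path (assumption: the path stored under `info`).
    seen = set()
    inputs = []
    for ri in cu.get("required_input", []):
        key = json.dumps(ri, sort_keys=True)
        if key not in seen:
            seen.add(key)
            inputs.append(ri)
    out["required_input"] = sorted(
        inputs, key=lambda ri: ri.get("info", {}).get("path", "")
    )

    out["environment"] = sorted(
        cu.get("environment", []), key=lambda env: env.get("name", "")
    )
    out["source_file"] = sorted(cu.get("source_file", []))
    out["details"] = sorted(
        cu.get("details", []), key=lambda det: det.get("type_url", "")
    )
    return out


unit = {
    "required_input": [
        {"info": {"path": "b.h", "digest": "2"}},
        {"info": {"path": "a.h", "digest": "1"}},
        {"info": {"path": "b.h", "digest": "2"}},
    ],
    "source_file": ["z.c", "a.c"],
}
canonical = canonicalize(unit)
```

Sorting first means that two descriptions of the same compilation that differ
only in the order of these fields produce the same digest.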
 
==== Digest computation
Let `v` be a `kythe.proto.VName`. Then a digest of `v` is

....
   v.signature NULL v.corpus NULL v.root NULL v.path NULL v.language NULL
....

Let `cu` be an instance of `kythe.proto.CompilationUnit`. Then a
digest of `cu` is computed as the sequence

....
   "CU" NL cu.vname NULL
....

followed by, for each required input:

....
   "RI" NL cu.required_input[i].vname NULL
   "IN" NL cu.required_input[i].info.path NULL cu.required_input[i].info.digest NULL
....

followed by:

....
   "ARG" NL cu.argument[0] NULL cu.argument[1] NULL ...
   "OUT" NL cu.output_key NULL
   "SRC" NL cu.source_file[0] NULL cu.source_file[1] NULL ...
   "CWD" NL cu.working_directory NULL
   "CTX" NL cu.entry_context NULL
....

followed by, for each `cu.environment`:

....
   "ENV" NL cu.environment[i].name NULL cu.environment[i].value NULL
....

finally followed by, for each `cu.details`:

....
   "DET" NL cu.details[i].type_url NULL cu.details[i].value NULL
....

The `cu.details[i].value` emitted here is the sequence of bytes of the
wire-format protobuf encoding of the value.
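 
The digest steps above can be sketched in Python as follows. This is an
illustration, not the reference implementation. It assumes that the
compilation unit is a plain dictionary keyed by the field names used in this
document, that it is already in canonical form, and that `cu.vname NULL` means
the five VName fields each followed by `NULL` with no extra terminator. The
final SHA256 step follows the unit file naming rule given earlier, and all
helper names are hypothetical.

```python
import hashlib

NULL = b"\x00"
NL = b"\n"


def _put(out: bytearray, tag: str, *values) -> None:
    """Append a tag line, then each value followed by NULL."""
    out.extend(tag.encode("utf-8") + NL)
    for value in values:
        # Strings are emitted as UTF-8; bytes (details values) pass through.
        raw = value if isinstance(value, bytes) else value.encode("utf-8")
        out.extend(raw + NULL)


def _vname_fields(v: dict) -> list:
    """Return the VName fields in digest order, defaulting to empty strings."""
    return [v.get(k, "") for k in ("signature", "corpus", "root", "path", "language")]


def digest_input(cu: dict) -> bytes:
    """Serialize `cu` (assumed canonical) into the byte sequence to be hashed."""
    out = bytearray()
    # Assumption: no additional NULL follows the five VName fields.
    _put(out, "CU", *_vname_fields(cu.get("vname", {})))
    for ri in cu.get("required_input", []):
        _put(out, "RI", *_vname_fields(ri.get("vname", {})))
        info = ri.get("info", {})
        _put(out, "IN", info.get("path", ""), info.get("digest", ""))
    _put(out, "ARG", *cu.get("argument", []))
    _put(out, "OUT", cu.get("output_key", ""))
    _put(out, "SRC", *cu.get("source_file", []))
    _put(out, "CWD", cu.get("working_directory", ""))
    _put(out, "CTX", cu.get("entry_context", ""))
    for env in cu.get("environment", []):
        _put(out, "ENV", env.get("name", ""), env.get("value", ""))
    for det in cu.get("details", []):
        # `value` is expected to hold the wire-encoded bytes of the detail.
        _put(out, "DET", det.get("type_url", ""), det.get("value", b""))
    return bytes(out)


def unit_file_name(cu: dict) -> str:
    """Return the lowercase hex SHA256 of the digest input."""
    return hashlib.sha256(digest_input(cu)).hexdigest()


example = {"vname": {"corpus": "c", "language": "go"}, "argument": ["a", "b"]}
example_name = unit_file_name(example)
```

Where this sketch and `kythe/go/platform/kcd/kythe/units.go` disagree, the Go
implementation should be treated as correct.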