# Background

The current Kythe serving data format is a simple key-value store from string keys (usually
[Kythe tickets](https://kythe.io/docs/kythe-uri-spec.html)) to ProtocolBuffer message values.  This
format requires the Kythe PostProcessor to perform a complex reduction over a potentially massive
number of decorations/references/edges/etc. to produce a single lookup value per node.  A set of
CrossReferences can be so large that it already requires manual paging, another complex
transformation in the PostProcessor.  These operations tend to create hot shards and stragglers,
leading to long post-processing times and potential failures due to memory exhaustion.  Likewise,
the server is forced to slowly decode large ProtocolBuffer messages even when only a subset of the
data is required, and the large value sizes can cause issues with some backend stores.

# Proposal

Split serving table ProtocolBuffer values into their component fields, making each an independent
key-value entry.  Single lookup API calls become scans over a key prefix; paging becomes trivial.
Due to the characteristics of LevelDB/SSTables, the split key-value entries will most likely be read
together (see SSTable block encoding), keys will be prefix-encoded, and there should be no
significant performance hit for reads.  On the contrary, since the Kythe API gives users the ability
to request partial data, the split key-value entries allow the server to skip over large
segments of data while scanning and avoid the cost of decoding them, which improves performance.
Examples include a user requesting only a small span of a file's decorations or a user requesting
only a single kind of cross-reference.
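As an illustration of how a single lookup turns into a prefix scan with simple paging, here is a minimal Python sketch.  It assumes an in-memory sorted list standing in for a LevelDB iterator; the tuple keys and the `prefix_scan` helper are hypothetical and not part of Kythe.

```python
import bisect

def prefix_scan(sorted_keys, prefix, start_after=None, limit=None):
    """Return up to `limit` keys that begin with `prefix`.

    `start_after` is the last key of the previous page, which makes paging a
    matter of remembering one key instead of precomputing pages.
    """
    if start_after is None:
        lo = bisect.bisect_left(sorted_keys, prefix)
    else:
        lo = bisect.bisect_right(sorted_keys, start_after)
    page = []
    for key in sorted_keys[lo:]:
        if key[:len(prefix)] != prefix:
            break  # keys are sorted, so the prefix range has ended
        if limit is not None and len(page) == limit:
            break
        page.append(key)
    return page

# Hypothetical keys: (table, file, group, start, end, kind, target).
keys = sorted([
    ("fd", "a.go"),
    ("fd", "a.go", "10", 0, 4, "ref", "t1"),
    ("fd", "a.go", "10", 10, 14, "ref", "t2"),
    ("fd", "a.go", "10", 20, 24, "ref", "t1"),
    ("fd", "a.go", "30", "t1"),
    ("fd", "a.go", "30", "t2"),
    ("fd", "b.go"),
])

decor_prefix = ("fd", "a.go", "10")
page1 = prefix_scan(keys, decor_prefix, limit=2)
page2 = prefix_scan(keys, decor_prefix, start_after=page1[-1], limit=2)
```

Here `page1` holds the first two decorations and `page2` the remaining one; scanning with the shorter prefix `("fd", "a.go")` would return every entry for the file, including the index entry.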

More importantly, keeping serving data separate lets the Kythe PostProcessor do fewer large joins.
There will no longer need to be a join of all file decorations to create the single
`FileDecorations` proto (likewise for CrossReferences).  Fewer large joins avoid stragglers in the
PostProcessor and remove the need for many "limiters" (i.e. cutoff values for the size of outputs).

The columnar format is designed primarily to accommodate NoSQL/LevelDB constraints (e.g. unique,
ordered keys with arbitrary values), but it could be general enough to fit other storage backends.
For instance, the keys could be split and put into a relational database alongside their values.

# Implementation

Essentially, each field of an existing ProtocolBuffer value in the serving data becomes zero or more
key-value entries depending on whether it's unset, set, or a repeated field.  The bulk of the field
data will be represented in the key, and the keys will be encoded as
[orderedcodes](https://godoc.org/github.com/google/orderedcode).  It's important that each key be
unique to fit storage models like LevelDB which are not multimaps (unlike the underlying SSTables).
Tags within keys will indicate how to decode the values themselves and help order the key-value
entries.  The ordering of the keys will be chosen to best fit a performant server, allowing for
easy, single-pass scans to retrieve all possible data.
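The property that matters for these keys is that byte-wise order matches the order of the decoded components.  The Python sketch below is a simplified stand-in and not the `orderedcode` wire format (in particular, the fixed-width integer encoding is an assumption made for brevity); `encode_key` is a hypothetical helper.

```python
import struct

def encode_key(*parts):
    """Encode str/int parts so that byte order matches part-wise order."""
    out = bytearray()
    for part in parts:
        if isinstance(part, str):
            raw = part.encode("utf-8")
            # Escape 0x00 so the terminator cannot appear inside a string,
            # then terminate with a sequence that sorts below any content byte.
            out += raw.replace(b"\x00", b"\x00\xff") + b"\x00\x01"
        elif isinstance(part, int):
            # Fixed-width big-endian keeps numeric order for non-negative ints.
            out += struct.pack(">Q", part)
        else:
            raise TypeError(f"unsupported key part: {part!r}")
    return bytes(out)

# Hypothetical (table, file, group, start) components.
components = [
    ("fd", "a.go", 10, 20),
    ("fd", "a.go", 10, 3),
    ("fd", "a.go", 9, 100),
    ("fd", "a", 10, 0),
    ("fd", "a.go.bak", 0, 0),
]
by_bytes = sorted(components, key=lambda c: encode_key(*c))
```

Because each part is appended in order, the encoding of a shorter key is a byte prefix of every key that extends it, which is what makes prefix scans possible.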

For message-valued fields, further deconstruction of their component fields is necessary to produce
key-value entries.  Many of their component field values are moved into an entry's key so that it is
guaranteed unique.  The leftover data are made into the entry's value.  Sometimes a single field may
be transformed into multiple entries to remove data redundancy and to allow entries to be
independently processed.  A good example is the `FileDecorations.decoration` field.  A single
`Decoration` is actually decomposed into two separate entries, one storing what can be thought of as
the "actual" reference within the file (i.e. `start-end-kind-target` with no value) and one to store
the `target_definition`.
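A rough Python sketch of that decomposition follows.  The `Decoration` dataclass, the tuple keys, and the byte-string value are illustrative stand-ins for the real protos and encoded keys, and `decoration_entries` is a hypothetical helper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Decoration:
    start: int
    end: int
    kind: str
    target: str
    target_definition: str = ""

def decoration_entries(file, d):
    """Yield (key, value) pairs for one Decoration."""
    # The reference itself lives entirely in the key, so its value is empty.
    yield ("fd", file, "10", d.start, d.end, d.kind, d.target), b""
    if d.target_definition:
        # One definition entry per target, shared by every reference to it.
        yield ("fd", file, "40", d.target), d.target_definition.encode("utf-8")

decorations = [
    Decoration(0, 4, "ref", "t1", "def-of-t1"),
    Decoration(20, 24, "ref", "t1", "def-of-t1"),
]

table = {}
for d in decorations:
    table.update(decoration_entries("a.go", d))
```

Two references to the same target produce three entries rather than four, since both write the same definition key; each decoration can also be converted without looking at any other.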

As most Kythe ProtocolBuffer messages are relatively flat and consist of basic types (strings,
integers, and floats), most of the below format is trivial.  More complex types can also be handled
as long as a finite, unique key can be constructed from the aforementioned basic values for each
value.  For instance, enums may be converted to integers and arbitrary bytes may be base64-encoded.
Even recursive types such as `MarkedSource` may be stored as values since they are associated with
`VName` keys.

The scope of this proposal only includes FileDecorations, CrossReferences, and Edges.  Other
services could be made to use similar formats, but the greatest improvement will be gained from
these three APIs.

## FileDecorations

### Existing ProtocolBuffer definition:

```protobuf
// Simplified FileDecorations
message FileDecorations {
  File file = 1;

  repeated Decoration decoration = 2;
  repeated ExpandedAnchor target_definitions = 3;
  repeated Node target = 4;
  repeated Override target_override = 5;
  repeated kythe.proto.common.Diagnostic diagnostic = 6;

  message Decoration {
    RawAnchor anchor = 1;
    string kind = 2;
    string target = 5;
    string target_definition = 4;
  }
  message Override {
    enum Kind {
      OVERRIDES  = 0;
      EXTENDS    = 1;
    }
    string overriding = 1;
    string overridden = 2;
    string overridden_definition = 5;
    Kind kind = 3;
    kythe.proto.common.MarkedSource marked_source = 4;
  }
}
```

### Key-value format:

| key | value | kind |
|---|---|---|
| `"fd"-file` | `FileDecorationsIndex` | file metadata |
| `"fd"-file-00-start-end` | bytes | file contents for span |
| `"fd"-file-10-start-end-kind-target` | `<empty>` | decoration targets |
| `"fd"-file-20-target-kind-override` | `<empty>` | override per target |
| `"fd"-file-30-target` | `srvpb.Node` | target node facts |
| `"fd"-file-40-target` | `spb.VName` | definition for each target |
| `"fd"-file-50-def` | `srvpb.ExpandedAnchor` | each definition's location |
| `"fd"-file-60-override` | `MarkedSource` | `MarkedSource` per override |
| `"fd"-file-70-start-end-hash` | `Diagnostic` | `Diagnostic` per span (use `-1:-1` span for entire file) |

Notes:

* Each value will be contained within a wrapper message to support evolution of the table format
* All file decoration entries are prefixed by "fd" and the file's `VName`
* `VNames` will be stored instead of [Kythe tickets](https://kythe.io/docs/kythe-uri-spec.html)
* The `FileDecorationsIndex` is used to store metadata for the rest of the entries and acts as an existence check
* File text is chunked to support very large files (most files should be a single entry)
* Decorations are ordered by span to support a narrow `FileDecorationsRequest` (group 10)
* Node facts/definitions (groups 30 and 40) are stored after decorations and overrides to allow the reader to skip over unneeded nodes (due to a narrow request)
* Each definition is stored uniquely (group 50) and is related to nodes separately (group 40)
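To show how this layout could serve a narrow request, here is a hedged Python sketch.  It uses a plain dict with tuple keys in place of an ordered store, sorts the keys up front as a stand-in for an ordered iterator, and only considers decorations that start inside the requested span; `narrow_decorations` and the sample data are hypothetical.

```python
import bisect

def narrow_decorations(table, file, span_start, span_end):
    """Return decorations starting in [span_start, span_end) and their node facts."""
    keys = sorted(table)  # stand-in for the store's ordered iteration
    prefix = ("fd", file, "10")
    # Seek directly to the first decoration at or after span_start.
    lo = bisect.bisect_left(keys, prefix + (span_start,))
    decorations, targets = [], set()
    for key in keys[lo:]:
        if key[:3] != prefix or key[3] >= span_end:
            break  # left group 10 or passed the end of the span
        decorations.append(key[3:])  # (start, end, kind, target)
        targets.add(key[6])
    # Only the node facts (group 30) for targets actually seen are read.
    nodes = {}
    for target in sorted(targets):
        node_key = ("fd", file, "30", target)
        if node_key in table:
            nodes[target] = table[node_key]
    return decorations, nodes

table = {
    ("fd", "a.go"): "index",
    ("fd", "a.go", "10", 0, 4, "ref", "t1"): b"",
    ("fd", "a.go", "10", 10, 14, "ref", "t2"): b"",
    ("fd", "a.go", "10", 20, 24, "ref", "t3"): b"",
    ("fd", "a.go", "30", "t1"): {"/kythe/node/kind": "function"},
    ("fd", "a.go", "30", "t2"): {"/kythe/node/kind": "variable"},
    ("fd", "a.go", "30", "t3"): {"/kythe/node/kind": "record"},
}
decorations, nodes = narrow_decorations(table, "a.go", 5, 20)
```

Only the decoration at offsets 10 to 14 and the facts for `t2` are touched; the entries for `t1` and `t3` are never decoded.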

## CrossReferences

### Existing ProtocolBuffer definition:

```protobuf
// Simplified PagedCrossReferences (no paging, removed deprecated features)
message PagedCrossReferences {
  string source_ticket = 1;
  kythe.proto.common.MarkedSource marked_source = 6;
  repeated string merge_with = 7;

  repeated Group group = 2;
  Node source_node = 8;

  message Group {
    string kind = 1;
    repeated ExpandedAnchor anchor = 2;
    repeated RelatedNode related_node = 3;
    repeated Caller caller = 4;
  }
  message RelatedNode {
    Node node = 1;
    int32 ordinal = 2;
  }
  message Caller {
    ExpandedAnchor caller = 1;
    string semantic_caller = 2;
    kythe.proto.common.MarkedSource marked_source = 3;
    repeated ExpandedAnchor callsite = 4;
  }
}
```

### Key-value format:

| key                                    | value                                 | kind                     |
|----------------------------------------|---------------------------------------|--------------------------|
| `"xr"-source`                          | `CrossReferencesIndex`                |                          |
| `"xr"-source-00-kind-file-start-end`   | `srvpb.ExpandedAnchor`                | Regular cross-reference  |
| `"xr"-source-10-kind-ordinal-node`     |                                       | Non-ref relations        |
| `"xr"-source-20-caller`                | `srvpb.ExpandedAnchor`+`MarkedSource` | Definition for Caller    |
| `"xr"-source-20-caller-file-start-end` | `srvpb.ExpandedAnchor`                | Callsite                 |
| `"xr"-source-30-node`                  | `srvpb.Node`                          | Related nodes            |
| `"xr"-source-40-node`                  | `srvpb.ExpandedAnchor`                | Related node definitions |

Notes:

* Each value will be contained within a wrapper message to support evolution of the table format
* All cross-reference entries are prefixed by "xr" and the node's `VName`
* `VNames` will be stored instead of [Kythe tickets](https://kythe.io/docs/kythe-uri-spec.html)
* The `CrossReferencesIndex` is used to store metadata for the rest of the entries (e.g. counts, node facts, `merge_with`) and acts as an existence check
* `Group`s are not used (each `Group` message is spread across multiple KV entries with a shared prefix)
* `kind` fields in keys include a "priority" to achieve desired sorting order (this could alternatively be included in the group number)
* After their `kind`, cross-references are ordered by file location
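The effect of a "priority" inside the `kind` component can be sketched as follows.  The priority values, the fallback of 99, and the `xref_key` helper are assumptions for illustration only; tuples again stand in for encoded keys.

```python
# Hypothetical priorities (lower sorts first); these values are not from Kythe.
KIND_PRIORITY = {
    "/kythe/edge/defines/binding": 0,
    "/kythe/edge/ref": 1,
    "/kythe/edge/documents": 2,
}

def xref_key(source, kind, file, start, end):
    """Build a group-00 key whose kind component is prefixed by its priority."""
    priority = KIND_PRIORITY.get(kind, 99)  # unknown kinds sort last
    return ("xr", source, "00", priority, kind, file, start, end)

keys = sorted([
    xref_key("n1", "/kythe/edge/ref", "b.go", 7, 9),
    xref_key("n1", "/kythe/edge/documents", "a.go", 0, 20),
    xref_key("n1", "/kythe/edge/ref", "a.go", 30, 32),
    xref_key("n1", "/kythe/edge/defines/binding", "a.go", 5, 7),
])
kinds_in_scan_order = [key[4] for key in keys]
```

Without the priority, `/kythe/edge/documents` would sort before `/kythe/edge/ref` alphabetically; with it, a scan yields definitions, then references ordered by file location, then documentation.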

## Edges

### Existing ProtocolBuffer definition:

```protobuf
// Simplified PagedEdgeSet (no paging, removed deprecated features)
message PagedEdgeSet {
  message EdgeGroup {
    message Edge {
      Node target = 1;
      int32 ordinal = 2;
    }
    string kind = 1;
    repeated Edge edge = 2;
  }

  Node source = 1;
  repeated EdgeGroup group = 2;
}
```

### Key-value format:

| key                                  | value                  | kind                          |
|--------------------------------------|------------------------|-------------------------------|
| `"eg"-source`                        | `EdgesIndex`           |                               |
| `"eg"-source-10-kind-ordinal-target` | `srvpb.EdgeGroup.Edge` | Single edge                   |
| `"eg"-source-20-target`              | `srvpb.Node`           | Node facts for an edge target |

Notes:

* Each value will be contained within a wrapper message to support evolution of the table format
* All edge entries are prefixed by "eg" and the node's `VName`
* `VNames` will be stored instead of [Kythe tickets](https://kythe.io/docs/kythe-uri-spec.html)
* Each `EdgesIndex` will include the source node's facts (for the `Nodes` API)
* Node facts for edge targets are separated from the edge relations to remove duplication
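A small Python sketch of how one source node's edges might be split into these entries is shown below.  The tuple keys, the `None` placeholder for the edge value, and the `edge_entries` helper are hypothetical; real values would be the wrapped protos from the table.

```python
def edge_entries(source, source_facts, edges):
    """Yield (key, value) entries for one source node.

    `edges` is an iterable of (kind, ordinal, target, target_facts) tuples.
    """
    # Index entry: carries the source node's facts and acts as an existence check.
    yield ("eg", source), source_facts
    seen = set()
    for kind, ordinal, target, facts in edges:
        # Group 10: one entry per edge (None stands in for the edge value).
        yield ("eg", source, "10", kind, ordinal, target), None
        if target not in seen:
            # Group 20: node facts are written once per distinct target.
            seen.add(target)
            yield ("eg", source, "20", target), facts

edges = [
    ("/kythe/edge/childof", 0, "t1", {"/kythe/node/kind": "record"}),
    ("/kythe/edge/param", 0, "t2", {"/kythe/node/kind": "variable"}),
    ("/kythe/edge/param", 1, "t1", {"/kythe/node/kind": "record"}),
]
entries = list(edge_entries("n1", {"/kythe/node/kind": "function"}, edges))
node_fact_keys = [key for key, _ in entries if len(key) > 2 and key[2] == "20"]
```

Three edges that point at two distinct targets yield one index entry, three edge entries, and two node-fact entries.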

## Beam Pipeline

The changes to the Beam workflow will be relatively straightforward.  The current
[combineDecorPieces](https://github.com/kythe/kythe/blob/7f5ba6a/kythe/go/serving/pipeline/beam.go#L178)
and
[groupCrossRefs](https://github.com/kythe/kythe/blob/7f5ba6a/kythe/go/serving/pipeline/beam.go#L116)
`beam.CombinePerKey` operations will each be replaced by a single `beam.ParDo` that converts the
`*ppb.Reference`/`*ppb.DecorationPiece` messages to `([]byte, []byte)` key-value entries as per the
formats above.

## Serving Table Reader

As opposed to the post-processor changes, the changes to
[serving/xrefs/xrefs.go](https://github.com/kythe/kythe/blob/7f5ba6a28370d6e3e2530b4750ec56e07888ea41/kythe/go/serving/xrefs/xrefs.go)
will necessarily be extensive.  The current implementation is intimately tied to the current format.
During the transition, the columnar format will be added alongside the current `CombinedTable`
implementation so that either it or the new table format will be accepted by the `http_server` until
the legacy format is deprecated and removed.