# Background

The current Kythe serving data format is just a simple key-value store from string keys (usually
[Kythe tickets](https://kythe.io/docs/kythe-uri-spec.html)) to ProtocolBuffer message values. This
format requires the Kythe PostProcessor to perform a complex reduction over a potentially massive
number of decorations/references/edges/etc. to produce a single lookup value per node. A set of
CrossReferences can be so large that it already requires manual paging, another complex
transformation in the PostProcessor. These operations tend to create hot shards and stragglers,
leading to long post-processing times and potentially failures due to memory exhaustion. Likewise,
the server is forced to slowly decode large ProtocolBuffer messages even when only a subset of the
data is required, and the large value sizes can cause issues with some backend stores.

# Proposal

Split serving table ProtocolBuffer values into their component fields, making each an independent
key-value entry. Single lookup API calls become scans over a key prefix; paging becomes trivial.
Due to the characteristics of LevelDB/SSTables, the split key-value entries will most likely be
read together (see SSTable block encoding), keys will be prefix-encoded, and there should be no
significant performance hit for reads. On the contrary, since the Kythe API gives users the ability
to request partial data, the split key-value entries allow the server to skip over large segments
of data while scanning and avoid the cost of decoding them, which should improve performance.
Examples include a user requesting only a small span of a file's decorations or a user requesting
only a single kind of cross-reference.

More importantly, keeping serving data separate lets the Kythe PostProcessor do fewer large joins.
There will no longer need to be a join of all file decorations to create the single
`FileDecorations` proto (likewise for CrossReferences). Fewer large joins avoid stragglers in the
PostProcessor and remove the need for many "limiters" (i.e. cutoff values for the size of outputs).

The columnar format is designed primarily to accommodate NoSQL/LevelDB constraints (e.g. unique,
ordered keys with arbitrary values), but it could be general enough to fit other storage backends.
For instance, the keys could be split and put into a relational database alongside their values.

# Implementation

Essentially each field of an existing ProtocolBuffer value in the serving data becomes zero or more
key-value entries depending on whether it's unset, set, or a repeated field. The bulk of the field
data will be represented in the key and the keys will be encoded as
[orderedcodes](https://godoc.org/github.com/google/orderedcode). It's important that each key be
unique to fit storage models like LevelDB, which are not multimaps (unlike the underlying SSTables).
Tags within keys will indicate how to decode the values themselves and help order the key-value
entries. The ordering of the keys will be chosen to best fit a performant server, allowing for
easy, single-pass scans to retrieve all possible data.

For message-valued fields, further deconstruction of their component fields is necessary to produce
key-value entries. Most of their component field values are moved into an entry's key so that it is
guaranteed unique. The leftover data are made into the entry's value. Sometimes a single field may
be transformed into multiple entries to remove data redundancy and to allow entries to be
independently processed. A good example is the `FileDecorations.decoration` field.
A single `Decoration` is actually decomposed into two separate entries, one storing what can be
thought of as the "actual" reference within the file (i.e. `start-end-kind-target` with no value)
and one to store the `target_definition`.

As most Kythe ProtocolBuffer messages are relatively flat and consist of basic types (strings,
integers, and floats), most of the below format is trivial. More complex types can also be handled
as long as a finite, unique key can be constructed of the aforementioned basic values for each
value. For instance, enums may be converted to integers and arbitrary bytes may be base64-encoded.
Even recursive types such as `MarkedSource` may be stored as values since they are associated with
`VName` keys.

The scope of this proposal only includes FileDecorations, CrossReferences, and Edges. Other
services could be made to use similar formats, but the greatest improvement will be gained from
these three APIs.

## FileDecorations

### Existing ProtocolBuffer definition:

```protobuf
// Simplified FileDecorations
message FileDecorations {
  File file = 1;

  repeated Decoration decoration = 2;
  repeated ExpandedAnchor target_definitions = 3;
  repeated Node target = 4;
  repeated Override target_override = 5;
  repeated kythe.proto.common.Diagnostic diagnostic = 6;

  message Decoration {
    RawAnchor anchor = 1;
    string kind = 2;
    string target = 5;
    string target_definition = 4;
  }
  message Override {
    enum Kind {
      OVERRIDES = 0;
      EXTENDS = 1;
    }
    string overriding = 1;
    string overridden = 2;
    string overridden_definition = 5;
    Kind kind = 3;
    kythe.proto.common.MarkedSource marked_source = 4;
  }
}
```

### Key-value format:

| key | value | kind |
|---|---|---|
| `"fd"-file` | `FileDecorationsIndex` | file metadata |
| `"fd"-file-00-start-end` | bytes | file contents for span |
| `"fd"-file-10-start-end-kind-target` | `<empty>` | decoration targets |
| `"fd"-file-20-target-kind-override` | `<empty>` | override per target |
| `"fd"-file-30-target` | `srvpb.Node` | target node facts |
| `"fd"-file-40-target` | `spb.VName` | definition for each target |
| `"fd"-file-50-def` | `srvpb.ExpandedAnchor` | each definition's location |
| `"fd"-file-60-override` | `MarkedSource` | `MarkedSource` per override |
| `"fd"-file-70-start-end-hash` | `Diagnostic` | `Diagnostic` per span (use `-1:-1` span for entire file) |


Notes:

* Each value will be contained within a wrapper message to support evolution of the table format
* All file decoration entries are prefixed by "fd" and the file's `VName`
* `VNames` will be stored instead of [Kythe tickets](https://kythe.io/docs/kythe-uri-spec.html)
* The `FileDecorationsIndex` is used to store metadata for the rest of the entries and acts as an existence check
* File text is chunked to support very large files (most files should be a single entry)
* Decorations are ordered by span to support a narrow `FileDecorationsRequest` (group 10)
* Node facts/definitions (groups 30 and 40) are stored after decorations and overrides to allow the reader to skip over unneeded nodes (due to a narrow request)
* Each definition is stored uniquely (group 50) and is related to nodes separately (group 40)

## CrossReferences

### Existing ProtocolBuffer definition:

```protobuf
// Simplified PagedCrossReferences (no paging, removed deprecated features)
message PagedCrossReferences {
  string source_ticket = 1;
  kythe.proto.common.MarkedSource marked_source = 6;
  repeated string merge_with = 7;

  repeated Group group = 2;
  Node source_node = 8;

  message Group {
    string kind = 1;
    repeated ExpandedAnchor anchor = 2;
    repeated RelatedNode related_node = 3;
    repeated Caller caller = 4;
  }
  message RelatedNode {
    Node node = 1;
    int32 ordinal = 2;
  }
  message Caller {
    ExpandedAnchor caller = 1;
    string semantic_caller = 2;
    kythe.proto.common.MarkedSource marked_source = 3;
    repeated ExpandedAnchor callsite = 4;
  }
}
```

### Key-value format:

| key                                    | value                                 | kind                     |
|----------------------------------------|---------------------------------------|--------------------------|
| `"xr"-source`                          | `CrossReferencesIndex`                |                          |
| `"xr"-source-00-kind-file-start-end`   | `srvpb.ExpandedAnchor`                | Regular cross-reference  |
| `"xr"-source-10-kind-ordinal-node`     |                                       | Non-ref relations        |
| `"xr"-source-20-caller`                | `srvpb.ExpandedAnchor`+`MarkedSource` | Definition for Caller    |
| `"xr"-source-20-caller-file-start-end` | `srvpb.ExpandedAnchor`                | Callsite                 |
| `"xr"-source-30-node`                  | `srvpb.Node`                          | Related nodes            |
| `"xr"-source-40-node`                  | `srvpb.ExpandedAnchor`                | Related node definitions |


Notes:

* Each value will be contained within a wrapper message to support evolution of the table format
* All cross-reference entries are prefixed by "xr" and the node's `VName`
* `VNames` will be stored instead of [Kythe tickets](https://kythe.io/docs/kythe-uri-spec.html)
* The `CrossReferencesIndex` is used to store metadata for the rest of the entries (e.g. counts, node facts, `merge_with`) and acts as an existence check
* `Group`s are not used (each `Group` message is spread across multiple KV entries with a shared prefix)
* `kind` fields in keys include a "priority" to achieve the desired sorting order (this could alternatively be included in the group number)
* After their `kind`, cross-references are ordered by file location

## Edges

### Existing ProtocolBuffer definition:

```protobuf
// Simplified PagedEdgeSet (no paging, removed deprecated features)
message PagedEdgeSet {
  message EdgeGroup {
    message Edge {
      Node target = 1;
      int32 ordinal = 2;
    }
    string kind = 1;
    repeated Edge edge = 2;
  }

  Node source = 1;
  repeated EdgeGroup group = 2;
}
```

### Key-value format:

| key                                  | value                  | kind                          |
|--------------------------------------|------------------------|-------------------------------|
| `"eg"-source`                        | `EdgesIndex`           |                               |
| `"eg"-source-10-kind-ordinal-target` | `srvpb.EdgeGroup.Edge` | Single edge                   |
| `"eg"-source-20-target`              | `srvpb.Node`           | Node facts for an edge target |


Notes:

* Each value will be contained within a wrapper message to support evolution of the table format
* All edge entries are prefixed by "eg" and the node's `VName`
* `VNames` will be stored instead of [Kythe tickets](https://kythe.io/docs/kythe-uri-spec.html)
* Each `EdgesIndex` will include the source node's facts (for the `Nodes` API)
* Node facts for edge targets are separated from the edge relations to remove duplication

## Beam Pipeline

The changes to the Beam workflow will be relatively straightforward.
The current
[combineDecorPieces](https://github.com/kythe/kythe/blob/7f5ba6a/kythe/go/serving/pipeline/beam.go#L178)
and
[groupCrossRefs](https://github.com/kythe/kythe/blob/7f5ba6a/kythe/go/serving/pipeline/beam.go#L116)
`beam.CombinePerKey` operations will each be replaced by a single `beam.ParDo` that converts the
`*ppb.Reference`/`*ppb.DecorationPiece` messages to `([]byte, []byte)` key-value entries as per the
formats above.

## Serving Table Reader

As opposed to the post-processor changes, the changes to
[serving/xrefs/xrefs.go](https://github.com/kythe/kythe/blob/7f5ba6a28370d6e3e2530b4750ec56e07888ea41/kythe/go/serving/xrefs/xrefs.go)
will necessarily be extensive. The current implementation is intimately tied to the current format.
Transitionally, the columnar format will be added alongside the current `CombinedTable`
implementation so that either it or the new table format will be accepted by the `http_server` until
the legacy format is deprecated and removed.
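
To make the reader-side idea more concrete, here is a minimal sketch of how a narrow decorations request might be answered with a single ordered scan over the `"fd"` key layout. It is an illustration under stated assumptions, not Kythe's implementation: keys are modeled as Python tuples rather than orderedcode byte strings, a sorted in-memory list stands in for LevelDB, values are placeholder bytes, and the names `scan_prefix`, `decorations_in_span`, and `table` are hypothetical.

```python
import bisect

# Group tags taken from the FileDecorations key table; only three are modeled.
TEXT, DECOR, NODE = 0, 10, 30


def scan_prefix(table, prefix):
    """Yield (key, value) entries whose key starts with `prefix`, in key order.

    `table` is a list of (key, value) pairs sorted by key; it stands in for an
    ordered key-value store (an assumption made for this sketch).
    """
    # (prefix,) sorts immediately before any entry whose key is >= prefix,
    # so the entry values themselves are never compared.
    i = bisect.bisect_left(table, (prefix,))
    while i < len(table) and table[i][0][:len(prefix)] == prefix:
        yield table[i]
        i += 1


def decorations_in_span(table, file, start, end):
    """Return (start, end, kind, target) decorations lying within [start, end)."""
    out = []
    for key, _ in scan_prefix(table, ("fd", file, DECOR)):
        d_start, d_end, kind, target = key[3:]
        if d_start >= end:
            break  # keys are ordered by start offset; nothing later can match
        if d_start >= start and d_end <= end:
            out.append((d_start, d_end, kind, target))
    return out


# Hypothetical entries for two files; values are placeholders.
table = sorted(
    [
        (("fd", "a.go"), b"index"),
        (("fd", "a.go", TEXT, 0, 40), b"file text"),
        (("fd", "a.go", DECOR, 8, 9, "ref", "pkg"), b""),
        (("fd", "a.go", DECOR, 20, 23, "defines/binding", "Foo"), b""),
        (("fd", "a.go", DECOR, 30, 33, "ref", "Foo"), b""),
        (("fd", "a.go", NODE, "Foo"), b"node facts"),
        (("fd", "b.go"), b"index"),
    ],
    key=lambda entry: entry[0],
)

print(decorations_in_span(table, "a.go", 15, 35))
```

Because the decoration group is keyed by `start-end-kind-target`, the scan can stop at the first entry that starts past the requested span, and the file-text and node-fact groups are never touched. A real reader would seek to the encoded prefix in the backing store rather than bisecting a list.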