// Copyright 2014 The Kythe Authors. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

Kythe URI Specification
=======================
Michael J. Fromberger <fromberger@google.com>
v0.1.1, 29-Oct-2014: Draft

This document defines the schema for Kythe uniform resource identifiers ("Kythe URI").

The primary purpose of a Kythe URI is to provide a textual encoding of a Kythe
VName, which is a unique identifier for a node in the semantic graph generated
by Kythe-compatible tools. A Kythe URI may also be extended to encode simple
queries about a particular VName in a transportable format.

The identifiers described in this document are compatible with the grammar
given in http://tools.ietf.org/html/rfc3987[RFC 3987] (Internationalized
Resource Identifiers) and thereby also with the underlying grammar from
http://tools.ietf.org/html/rfc3986[RFC 3986] (Uniform Resource Identifiers).

== Scheme Label

The scheme label for Kythe URIs will be "`kythe:`".

== Character Set

A Kythe URI is a string of UCS (Unicode) characters. For storage and
transmission, a Kythe URI will be encoded as UTF-8 with no byte-order mark,
using Normalization Form NFKC.

Except as restricted by the syntax, all UCS characters are valid in a Kythe URI.
Reserved characters (_e.g._, "/", "?") and whitespace must be percent-escaped
per Section 2.1 of RFC 3986, e.g., " " becomes "`%20`".

== Syntax

The following grammar defines the syntax of a Kythe URI. Some productions have
provisional values and will change as the Kythe schema evolves.

----
kythe-uri    = "kythe:" [corpus] attrs ["#" signature]
corpus       = "//" label 0*{"/" path-segment}
label        = ireg-name                           -- RFC 3987
attrs        = ["?" lang-attr] ["?" path-attr] ["?" root-attr]
lang-attr    = "lang=" language
path-attr    = "path=" path-segment 0*{"/" path-segment}
root-attr    = "root=" root-segment 0*{"/" root-segment}
language     = 1*ipchar                            -- RFC 3987
signature    = 1*ipchar                            -- RFC 3987
root-segment = 1*ipchar                            -- RFC 3987
path-segment = 1*{unreserved | pct-encoded | "/"}  -- RFC 3987
----

Note that the order of the attributes (the `attrs` production) is fixed, to
ensure that a Kythe URI has a canonical string encoding.

For queries, path-segment is resolved as specified in
http://tools.ietf.org/html/rfc3986#section-5.2.4[RFC 3986 Section 5.2.4 (Remove Dot Segments)].
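
As an illustration of the fixed attribute order and the escaping rule, the
following Python sketch assembles a Kythe URI from the fields of a VName. The
function name `encode_kythe_uri` and its parameter names are hypothetical and
not part of this specification. The sketch assumes that every character
outside the RFC 3986 unreserved set is percent-escaped, with "/" left intact
in the corpus, path, and root; the grammar may permit more characters
unescaped than this conservative choice does.

```python
from urllib.parse import quote


def encode_kythe_uri(corpus="", language="", path="", root="", signature=""):
    """Build a Kythe URI string; empty fields are omitted (hypothetical helper)."""
    uri = "kythe:"
    if corpus:
        # The corpus may contain "/" separators, so they are left unescaped.
        uri += "//" + quote(corpus, safe="/")
    # Attributes are always emitted in the order lang, path, root,
    # matching the `attrs` production.
    if language:
        uri += "?lang=" + quote(language, safe="")
    if path:
        uri += "?path=" + quote(path, safe="/")
    if root:
        uri += "?root=" + quote(root, safe="/")
    if signature:
        uri += "#" + quote(signature, safe="")
    return uri


print(encode_kythe_uri(corpus="corpusname", language="c++",
                       path="file/base/file.h", signature="class-Foo"))
# prints kythe://corpusname?lang=c%2B%2B?path=file/base/file.h#class-Foo
```

Because the attribute order is hard-coded rather than taken from the caller,
two calls with the same field values always produce the same string.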

See also link:kythe-storage.html#TermVName[Vector-Name (*VName*)]

Examples (subject to change):

* Empty (no fields): `kythe:`
* Signature only: `kythe:#loc-a90320dafd60`
* Ad-hoc corpus (signature, corpus, path, language): `kythe://corpusname?lang=c%2B%2B?path=file/base/file.h#class-Foo`
* Bitbucket (corpus, path): `kythe://bitbucket.org/creachadair/stringset?path=README.md`
* Maven (corpus, path, language): `kythe://maven.org/central/org/apache/thrift?lang=java?path=libthrift/0.9.1`
* Language, path, signature: `kythe:?lang=go?path=mapreduce/go/contrib/plan.go#MR`
* Corpus, path, language: `kythe://code.google.com/p/go.tools?lang=go?path=cmd/godoc/doc.go`
* Alternate root: `kythe://chromium.org/chrome?path=openssl/crypto/bf/bf_pi.h?root=third_party/openssl/1650`

=== Rationale

The grammar for `kythe-uri` is compatible with the generic URI syntax defined
in RFC 3986, to the extent that a fairly naive parser should be able to handle
parsing a Kythe URI into its high-level components: The "hostname" and "path"
components of the generic URI will represent the `corpus`, the "query"
component will capture the `attrs`, and the "fragment" component will capture
the `signature`.

The meaning of the strings generated by the `corpus` production is not defined
in this specification; the intent is to allow a corpus to behave like a
hostname, so that a server providing Kythe data can use the corpus string to
locate the data for that corpus. For services that support many independent
corpora (e.g., github.com, bitbucket.org, code.google.com), the corpus field
will probably include information about the project directly (e.g.,
"code.google.com/p/go.text"). In cases where there is only a single corpus
with a body of different branches or subdivisions, some of that context may
be stored in the `root` attribute instead.

The decision about which representation to choose is mainly controlled by
whether the "project" label is likely to vary. A github.com repo will not
frequently change name, so it makes sense to include the repo name as part of
the corpus, and reserve the `root` field for branches. The encoding of the URI
is agnostic to the decision.
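
The mapping onto generic URI components described at the start of this
rationale can be sketched with an off-the-shelf URI splitter. The example below
uses `urlsplit` from Python's standard library; the function name
`parse_kythe_uri` and the dictionary keys it returns are hypothetical. It is a
lenient sketch: it does not check that attributes appear in canonical order,
and it does not apply dot-segment removal.

```python
from urllib.parse import unquote, urlsplit

# Maps attribute names in the URI to field names in the result (hypothetical keys).
_ATTR_FIELDS = {"lang": "language", "path": "path", "root": "root"}


def parse_kythe_uri(uri):
    """Split a Kythe URI into its fields using a generic URI parser."""
    parts = urlsplit(uri)
    if parts.scheme != "kythe":
        raise ValueError(f"not a Kythe URI: {uri!r}")
    fields = {
        # Generic "hostname" plus "path" together form the corpus.
        "corpus": unquote(parts.netloc + parts.path),
        "language": "",
        "path": "",
        "root": "",
        # The generic "fragment" is the signature.
        "signature": unquote(parts.fragment),
    }
    # The generic "query" holds the attrs, separated by "?" rather than "&".
    for attr in filter(None, parts.query.split("?")):
        name, _, value = attr.partition("=")
        if name not in _ATTR_FIELDS:
            raise ValueError(f"unknown attribute: {name!r}")
        fields[_ATTR_FIELDS[name]] = unquote(value)
    return fields


result = parse_kythe_uri("kythe://bitbucket.org/creachadair/stringset?path=README.md")
print(result["corpus"])  # prints bitbucket.org/creachadair/stringset
```

Note that the generic parser cannot tell where the corpus label ends and its
trailing segments begin; the sketch simply rejoins them, which is consistent
with the corpus being an opaque string in this specification.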