// Copyright 2014 The Kythe Authors. All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

Kythe URI Specification
=======================
Michael J. Fromberger <fromberger@google.com>
v0.1.1, 29-Oct-2014: Draft

This document defines the schema for Kythe uniform resource identifiers ("Kythe URI").

The primary purpose of a Kythe URI is to provide a textual encoding of a Kythe
VName, which is a unique identifier for a node in the semantic graph generated
by Kythe-compatible tools.  A Kythe URI may also be extended to encode simple
queries about a particular VName in a transportable format.

The identifiers described in this document are compatible with the grammar
given in http://tools.ietf.org/html/rfc3987[RFC 3987] (Internationalized
Resource Identifiers) and thereby also with the underlying grammar from
http://tools.ietf.org/html/rfc3986[RFC 3986] (Uniform Resource Identifiers).

== Scheme Label

The scheme label for Kythe URIs will be "`kythe:`".

== Character Set

A Kythe URI is a string of UCS (Unicode) characters. For storage and
transmission, a Kythe URI will be encoded as UTF-8 with no byte-order mark,
using Normalization Form NFKC.

Except as restricted by the syntax, all UCS characters are valid in a Kythe URI.
Reserved characters (_e.g._, "/", "?") and whitespace must be percent-escaped
per Section 2.1 of RFC 3986, e.g., " " becomes "`%20`".
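The two rules above (NFKC normalization, then percent-escaping) can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the helper name `escape_component` and its `safe` parameter are assumptions made for this example. Note also that `urllib.parse.quote` escapes non-ASCII characters as UTF-8 bytes, which is stricter than the text above requires.

```python
import unicodedata
from urllib.parse import quote


def escape_component(text: str, safe: str = "") -> str:
    """Normalize text to NFKC, then percent-escape it (hypothetical helper).

    `safe` lists extra characters to leave unescaped, for example "/" when
    the component is a path whose separators must be preserved.
    """
    normalized = unicodedata.normalize("NFKC", text)
    # quote() leaves ASCII letters, digits, and "_.-~" alone and escapes
    # everything else as the UTF-8 bytes of the character.
    return quote(normalized, safe=safe, encoding="utf-8")


print(escape_component("c++"))            # c%2B%2B
print(escape_component("a b"))            # a%20b
print(escape_component("a/b", safe="/"))  # a/b
```

Passing `safe="/"` is one way to keep path separators readable; whether a given component should keep them is a choice left to the caller in this sketch.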
 
== Syntax

The following grammar defines the syntax of a Kythe URI.  Some productions have
provisional values and will change as the Kythe schema evolves.

----
kythe-uri    = "kythe:" [corpus] attrs ["#" signature]
corpus       = "//" label 0*{"/" path-segment}
label        = ireg-name -- RFC 3987
attrs        = ["?" lang-attr] ["?" path-attr] ["?" root-attr]
lang-attr    = "lang=" language
path-attr    = "path=" path-segment 0*{"/" path-segment}
root-attr    = "root=" root-segment 0*{"/" root-segment}
language     = 1*ipchar  -- RFC 3987
signature    = 1*ipchar  -- RFC 3987
root-segment = 1*ipchar  -- RFC 3987
path-segment = 1*{unreserved | pct-encoded | "/"}  -- RFC 3987
----

Note that the order of the attributes (the `attrs` production) is fixed, to
ensure that a Kythe URI has a canonical string encoding.
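To make the fixed attribute order concrete, here is a small Python sketch that assembles a URI string from the five VName-like fields. The function name `encode_kythe_uri` and its keyword arguments are hypothetical, and the escaping choices (keeping "/" unescaped in the corpus, path, and root) are assumptions of this example rather than requirements stated by the grammar.

```python
from urllib.parse import quote


def encode_kythe_uri(corpus="", language="", path="", root="", signature=""):
    """Build a Kythe URI string, emitting attributes in grammar order."""
    uri = "kythe:"
    if corpus:
        uri += "//" + quote(corpus, safe="/")
    # The attrs production fixes the order: lang, then path, then root.
    if language:
        uri += "?lang=" + quote(language, safe="")
    if path:
        uri += "?path=" + quote(path, safe="/")
    if root:
        uri += "?root=" + quote(root, safe="/")
    if signature:
        uri += "#" + quote(signature, safe="")
    return uri


print(encode_kythe_uri(corpus="corpusname", language="c++",
                       path="file/base/file.h", signature="class-Foo"))
# kythe://corpusname?lang=c%2B%2B?path=file/base/file.h#class-Foo
```

Because empty fields are simply omitted and the order never varies, two callers that pass the same field values get byte-identical strings.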
 
For queries, `path-segment` is resolved as specified in
http://tools.ietf.org/html/rfc3986#section-5.2.4[RFC 3986 Section 5.2.4 (Remove Dot Segments)].
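For readers unfamiliar with that section, the following Python sketch follows the step-by-step procedure of RFC 3986 Section 5.2.4. The function name `remove_dot_segments` is chosen for this example; how and when a Kythe service applies the procedure to a query is not specified here.

```python
def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments per RFC 3986 Section 5.2.4."""
    output = []  # completed segments, each keeping its leading "/"
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = "/" + path[3:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = "/" + path[4:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:end])
                path = path[end:]
    return "".join(output)


print(remove_dot_segments("a/b/../c/./d"))  # a/c/d
```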
 
See also link:kythe-storage.html#TermVName[Vector-Name (*VName*)].

Examples (subject to change):

* Empty (no fields): `kythe:`
* Signature only: `kythe:#loc-a90320dafd60`
* Ad-hoc corpus (signature, corpus, path, language): `kythe://corpusname?lang=c%2B%2B?path=file/base/file.h#class-Foo`
* Bitbucket (corpus, path): `kythe://bitbucket.org/creachadair/stringset?path=README.md`
* Maven (corpus, path, language): `kythe://maven.org/central/org/apache/thrift?lang=java?path=libthrift/0.9.1`
* Language, path, signature: `kythe:?lang=go?path=mapreduce/go/contrib/plan.go#MR`
* Corpus, path, language: `kythe://code.google.com/p/go.tools?lang=go?path=cmd/godoc/doc.go`
* Alternate root: `kythe://chromium.org/chrome?path=openssl/crypto/bf/bf_pi.h?root=third_party/openssl/1650`
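The examples above can be taken apart with a short parser that follows the grammar directly. The sketch below is illustrative only: the name `parse_kythe_uri` and the dictionary keys are assumptions of this example, and it checks attribute names and order but does not validate the characters within each field.

```python
from urllib.parse import unquote

ATTR_ORDER = ["lang", "path", "root"]  # order fixed by the attrs production


def parse_kythe_uri(uri: str) -> dict:
    """Split a Kythe URI into corpus, lang, path, root, and signature."""
    if not uri.startswith("kythe:"):
        raise ValueError("missing kythe: scheme")
    rest, _, signature = uri[len("kythe:"):].partition("#")
    head, *attrs = rest.split("?")
    fields = {"corpus": "", "lang": "", "path": "", "root": "",
              "signature": unquote(signature)}
    if head.startswith("//"):
        fields["corpus"] = unquote(head[2:])
    elif head:
        raise ValueError("corpus must start with //")
    last = -1
    for attr in attrs:
        name, sep, value = attr.partition("=")
        # Reject unknown, repeated, or out-of-order attributes.
        if not sep or name not in ATTR_ORDER or ATTR_ORDER.index(name) <= last:
            raise ValueError("bad attribute: " + attr)
        last = ATTR_ORDER.index(name)
        fields[name] = unquote(value)
    return fields


print(parse_kythe_uri("kythe:?lang=go?path=mapreduce/go/contrib/plan.go#MR"))
```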
 
=== Rationale

The grammar for `kythe-uri` is compatible with the generic URI syntax defined
in RFC 3986, to the extent that a fairly naive parser should be able to handle
parsing a Kythe URI into its high-level components: The "hostname" and "path"
components of the generic URI will represent the `corpus`, the "query"
component will capture the `attrs`, and the "fragment" component will capture
the `signature`.
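As a rough check of that claim, Python's general-purpose `urllib.parse.urlsplit` can stand in for the "fairly naive parser". The wrapper name `split_generic` and the returned keys are assumptions of this sketch; percent-decoding is deliberately left out.

```python
from urllib.parse import urlsplit


def split_generic(uri: str) -> dict:
    """Map generic URI components onto the Kythe URI productions."""
    parts = urlsplit(uri)
    return {
        # "hostname" plus "path" of the generic URI form the corpus.
        "corpus": parts.netloc + parts.path,
        # The generic "query" holds every attribute, separated by "?".
        "attrs": parts.query.split("?") if parts.query else [],
        # The generic "fragment" is the signature.
        "signature": parts.fragment,
    }


print(split_generic(
    "kythe://maven.org/central/org/apache/thrift?lang=java?path=libthrift/0.9.1"))
```

The generic parser treats everything between the first "?" and the "#" as a single query string, so the second split on "?" is what recovers the individual attributes.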
 
The meaning of the strings generated by the `corpus` production is not defined
in this specification; the intent is to allow a corpus to behave like a
hostname, so that a server providing Kythe data can use the corpus string to
locate the data for that corpus.  For services that support many independent
corpora (e.g., github.com, bitbucket.org, code.google.com), the corpus field
will probably include information about the project directly (e.g.,
"code.google.com/p/go.text").  In cases where there is only a single corpus
with a body of different branches or subdivisions, some of that context may
be stored in the `root` attribute instead.

The decision about which representation to choose is mainly controlled by
whether the "project" label is likely to vary.  A github.com repo will not
frequently change name, so it makes sense to include the repo name as part of
the corpus, and reserve the `root` field for branches.  The encoding of the URI
is agnostic to the decision.