# Documents

Two files are used to represent the documents in a segment. The data file contains the
data for each document in the segment. The index file contains, for each document, its
corresponding offset in the data file.

## Data File

The data file contains the fields for each document. The documents are stored serially.

```
┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │      Document 1       │ │
│ ├───────────────────────┤ │
│ │          ...          │ │
│ ├───────────────────────┤ │
│ │      Document n       │ │
│ └───────────────────────┘ │
└───────────────────────────┘
```

### Document

Each document is composed of an ID and its fields. The ID is a sequence of valid UTF-8 bytes
and it is encoded first by encoding the length of the ID, in bytes, as a variable-sized
unsigned integer and then encoding the actual bytes which comprise the ID. Following the ID
are the fields. The number of fields in the document is encoded first as a variable-sized
unsigned integer and then the fields themselves are encoded.

```
┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │     Length of ID      │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │          ID           │ │
│ │        (bytes)        │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │   Number of Fields    │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │        Field 1        │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │          ...          │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │        Field n        │ │
│ │                       │ │
│ └───────────────────────┘ │
└───────────────────────────┘
```

#### Field

Each field is composed of a name and a value. The name and value are a sequence of valid
UTF-8 bytes and they are stored by encoding the length of the name (value), in bytes, as a
variable-sized unsigned integer and then encoding the actual bytes which comprise the name
(value). The name is encoded first and the value second.

```
┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │ Length of Field Name  │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │      Field Name       │ │
│ │        (bytes)        │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │ Length of Field Value │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │      Field Value      │ │
│ │        (bytes)        │ │
│ │                       │ │
│ └───────────────────────┘ │
└───────────────────────────┘
```

## Index File

The index file contains, for each postings ID in the segment, the offset of the corresponding
document in the data file. The base postings ID is stored at the start of the file as a
little-endian `uint64`. Following it are the actual offsets.

```
┌───────────────────────────┐
│           Base            │
│         (uint64)          │
├───────────────────────────┤
│                           │
│                           │
│          Offsets          │
│                           │
│                           │
└───────────────────────────┘
```

### Offsets

The offsets are stored serially starting from the offset for the base postings ID. Each
offset is a little-endian `uint64`. Since each offset is of a fixed-size we can access
the offset for a given postings ID by calculating its index relative to the start of
the offsets. An offset equal to the maximum value for a uint64 indicates that there is
no corresponding document for a given postings ID.

```
┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │       Offset 1        │ │
│ │       (uint64)        │ │
│ ├───────────────────────┤ │
│ │          ...          │ │
│ ├───────────────────────┤ │
│ │       Offset n        │ │
│ │       (uint64)        │ │
│ └───────────────────────┘ │
└───────────────────────────┘
```
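
## Example

The layouts described above can be illustrated with a minimal Go sketch. It is not the encoder or reader used by this package: the `Field` and `Document` types and the functions `encodeDocument`, `decodeDocument`, and `lookupOffset` are hypothetical names introduced here. The sketch also assumes that the "variable-sized unsigned integer" is the uvarint encoding implemented by Go's `encoding/binary`.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

// Field and Document are hypothetical types used only for this sketch.
type Field struct{ Name, Value []byte }

type Document struct {
	ID     []byte
	Fields []Field
}

var errShort = errors.New("buffer too short")

// appendUvarint appends v to dst as a variable-sized unsigned integer.
func appendUvarint(dst []byte, v uint64) []byte {
	var buf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(buf[:], v)
	return append(dst, buf[:n]...)
}

// appendBytes appends a uvarint length prefix followed by the raw bytes.
func appendBytes(dst, b []byte) []byte {
	return append(appendUvarint(dst, uint64(len(b))), b...)
}

// encodeDocument appends the ID, the field count, and then each field
// (name first, value second) to dst.
func encodeDocument(dst []byte, d Document) []byte {
	dst = appendBytes(dst, d.ID)
	dst = appendUvarint(dst, uint64(len(d.Fields)))
	for _, f := range d.Fields {
		dst = appendBytes(dst, f.Name)
		dst = appendBytes(dst, f.Value)
	}
	return dst
}

// readBytes reads one length-prefixed byte sequence and returns the remainder.
func readBytes(src []byte) (b, rest []byte, err error) {
	n, w := binary.Uvarint(src)
	if w <= 0 || uint64(len(src)-w) < n {
		return nil, nil, errShort
	}
	end := w + int(n)
	return src[w:end], src[end:], nil
}

// decodeDocument reads a single document from the start of src.
func decodeDocument(src []byte) (Document, error) {
	var d Document
	var err error
	if d.ID, src, err = readBytes(src); err != nil {
		return d, err
	}
	n, w := binary.Uvarint(src)
	if w <= 0 {
		return d, errShort
	}
	src = src[w:]
	for i := uint64(0); i < n; i++ {
		var f Field
		if f.Name, src, err = readBytes(src); err != nil {
			return d, err
		}
		if f.Value, src, err = readBytes(src); err != nil {
			return d, err
		}
		d.Fields = append(d.Fields, f)
	}
	return d, nil
}

// lookupOffset returns the data file offset for a postings ID, given the raw
// index file bytes. The second result is false when the ID is out of range
// or its offset is the "no document" sentinel (the maximum uint64).
func lookupOffset(index []byte, id uint64) (uint64, bool) {
	if len(index) < 8 {
		return 0, false
	}
	base := binary.LittleEndian.Uint64(index)
	if id < base {
		return 0, false
	}
	i := id - base // position relative to the start of the offsets
	if i >= uint64(len(index)-8)/8 {
		return 0, false
	}
	off := binary.LittleEndian.Uint64(index[8+8*i:])
	return off, off != math.MaxUint64
}

func main() {
	doc := Document{ID: []byte("doc-1"), Fields: []Field{{Name: []byte("city"), Value: []byte("nyc")}}}
	got, err := decodeDocument(encodeDocument(nil, doc))
	fmt.Println(string(got.ID), len(got.Fields), err)
}
```

Because every offset occupies exactly eight bytes, `lookupOffset` needs no scan: it subtracts the base postings ID and reads directly at byte position `8 + 8*i`.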