# Documents

Two files are used to represent the documents in a segment. The data file contains the
data for each document in the segment. The index file contains, for each document, its
corresponding offset in the data file.

## Data File

The data file contains the fields for each document. The documents are stored serially,
one after another.

```
┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │      Document 1       │ │
│ ├───────────────────────┤ │
│ │          ...          │ │
│ ├───────────────────────┤ │
│ │      Document n       │ │
│ └───────────────────────┘ │
└───────────────────────────┘
```

### Document

Each document is composed of an ID and its fields. The ID is a sequence of valid UTF-8 bytes.
It is encoded as its length in bytes, written as a variable-sized unsigned integer (uvarint),
followed by the bytes of the ID itself. The fields follow the ID: the number of fields in the
document is encoded as a uvarint, and then the fields themselves are encoded one after another.

```
┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │     Length of ID      │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │          ID           │ │
│ │        (bytes)        │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │   Number of Fields    │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │        Field 1        │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │          ...          │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │        Field n        │ │
│ │                       │ │
│ └───────────────────────┘ │
└───────────────────────────┘
```

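The document layout above can be sketched in Go as follows. This is a minimal illustration, not the m3ninx encoder: the `Field` and `Document` types and the `appendUvarint`, `appendBytes`, and `encodeDocument` helpers are hypothetical names, and the sketch assumes that "uvarint" means the unsigned varint encoding implemented by Go's `encoding/binary` package.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Field and Document are hypothetical types used only for this sketch.
type Field struct {
	Name, Value []byte
}

type Document struct {
	ID     []byte
	Fields []Field
}

// appendUvarint appends v to dst as a variable-sized unsigned integer.
func appendUvarint(dst []byte, v uint64) []byte {
	var buf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(buf[:], v)
	return append(dst, buf[:n]...)
}

// appendBytes appends a uvarint length prefix followed by the raw bytes.
func appendBytes(dst, b []byte) []byte {
	dst = appendUvarint(dst, uint64(len(b)))
	return append(dst, b...)
}

// encodeDocument appends the ID, the number of fields, and then each field
// (name first, value second) to dst.
func encodeDocument(dst []byte, d Document) []byte {
	dst = appendBytes(dst, d.ID)
	dst = appendUvarint(dst, uint64(len(d.Fields)))
	for _, f := range d.Fields {
		dst = appendBytes(dst, f.Name)
		dst = appendBytes(dst, f.Value)
	}
	return dst
}

func main() {
	doc := Document{
		ID:     []byte("doc1"),
		Fields: []Field{{Name: []byte("city"), Value: []byte("nyc")}},
	}
	// Prints the encoded bytes in hex, separated by spaces.
	fmt.Printf("% x\n", encodeDocument(nil, doc))
}
```

Appending to a caller-supplied slice lets several documents be written back to back into one buffer, which mirrors the serial layout of the data file.
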
#### Field

Each field is composed of a name and a value, each of which is a sequence of valid UTF-8
bytes. Both are stored the same way: the length in bytes is encoded as a variable-sized
unsigned integer (uvarint), followed by the bytes themselves. The name is encoded first and
the value second.

```
┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │  Length of Field Name │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │      Field Name       │ │
│ │        (bytes)        │ │
│ │                       │ │
│ ├───────────────────────┤ │
│ │ Length of Field Value │ │
│ │       (uvarint)       │ │
│ ├───────────────────────┤ │
│ │                       │ │
│ │      Field Value      │ │
│ │        (bytes)        │ │
│ │                       │ │
│ └───────────────────────┘ │
└───────────────────────────┘
```

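Reading a field back follows the same layout in reverse. The sketch below is illustrative only: `readBytes`, `decodeField`, and `errTruncated` are hypothetical names, and it again assumes the uvarint encoding of Go's `encoding/binary` package. It does not validate that the bytes are UTF-8.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// errTruncated is a hypothetical error for input that ends too early or
// carries a malformed length prefix.
var errTruncated = errors.New("truncated or malformed input")

// readBytes reads a uvarint length prefix and then that many bytes. It
// returns the bytes read and the remaining input.
func readBytes(src []byte) (b, rest []byte, err error) {
	n, size := binary.Uvarint(src)
	if size <= 0 {
		return nil, nil, errTruncated
	}
	src = src[size:]
	if n > uint64(len(src)) {
		return nil, nil, errTruncated
	}
	return src[:n], src[n:], nil
}

// decodeField reads a field name followed by a field value and returns
// whatever input is left after the field.
func decodeField(src []byte) (name, value, rest []byte, err error) {
	if name, src, err = readBytes(src); err != nil {
		return nil, nil, nil, err
	}
	if value, src, err = readBytes(src); err != nil {
		return nil, nil, nil, err
	}
	return name, value, src, nil
}

func main() {
	// Length 4, "city", length 3, "nyc".
	in := []byte{4, 'c', 'i', 't', 'y', 3, 'n', 'y', 'c'}
	name, value, rest, err := decodeField(in)
	fmt.Printf("%s=%s rest=%d err=%v\n", name, value, len(rest), err)
}
```

Returning the remaining input makes it straightforward to call `decodeField` in a loop once the number of fields has been read from the document header.
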
## Index File

The index file contains, for each postings ID in the segment, the offset of the corresponding
document in the data file. The base postings ID is stored at the start of the file as a
little-endian `uint64`. The offsets follow it.

```
┌───────────────────────────┐
│            Base           │
│          (uint64)         │
├───────────────────────────┤
│                           │
│                           │
│          Offsets          │
│                           │
│                           │
└───────────────────────────┘
```

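As a rough sketch of this layout, the following Go function builds the bytes of an index file from a base postings ID and a list of offsets. The name `encodeIndex` is hypothetical, and the sketch relies on the Offsets section of this document for the width of each offset (a little-endian `uint64`).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeIndex returns the base postings ID followed by one 8-byte
// little-endian offset per postings ID, starting at the base.
func encodeIndex(base uint64, offsets []uint64) []byte {
	buf := make([]byte, 8*(1+len(offsets)))
	binary.LittleEndian.PutUint64(buf, base)
	for i, off := range offsets {
		binary.LittleEndian.PutUint64(buf[8*(i+1):], off)
	}
	return buf
}

func main() {
	// Base postings ID 10, with documents at data file offsets 0 and 27.
	fmt.Printf("% x\n", encodeIndex(10, []uint64{0, 27}))
}
```
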
### Offsets

The offsets are stored serially, starting from the offset for the base postings ID. Each
offset is a little-endian `uint64`. Because every offset has the same fixed size, the offset
for a given postings ID can be read directly by computing its index relative to the start of
the offsets: the entry for postings ID `p` begins at byte `8 + 8 * (p - base)` of the file.
An offset equal to the maximum value of a `uint64` indicates that there is no corresponding
document for that postings ID.

```
┌───────────────────────────┐
│ ┌───────────────────────┐ │
│ │       Offset 1        │ │
│ │       (uint64)        │ │
│ ├───────────────────────┤ │
│ │          ...          │ │
│ ├───────────────────────┤ │
│ │       Offset n        │ │
│ │       (uint64)        │ │
│ └───────────────────────┘ │
└───────────────────────────┘
```
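
The lookup described above can be sketched as follows. This is an illustration under stated assumptions rather than the m3ninx reader: `lookupOffset`, `exampleIndex`, `errOutOfRange`, and `errNoDocument` are hypothetical names, and the whole index file is assumed to be available as a byte slice.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

const (
	baseSize   = 8 // bytes used by the base postings ID
	offsetSize = 8 // bytes used by each offset
)

// Hypothetical errors for this sketch.
var (
	errOutOfRange = errors.New("postings ID out of range")
	errNoDocument = errors.New("no document for postings ID")
)

// lookupOffset returns the data file offset stored for the postings ID id.
func lookupOffset(index []byte, id uint64) (uint64, error) {
	if len(index) < baseSize {
		return 0, errOutOfRange
	}
	base := binary.LittleEndian.Uint64(index)
	count := uint64(len(index)-baseSize) / offsetSize
	if id < base || id-base >= count {
		return 0, errOutOfRange
	}
	// Fixed-size entries allow direct addressing by index.
	pos := baseSize + (id-base)*offsetSize
	off := binary.LittleEndian.Uint64(index[pos:])
	if off == math.MaxUint64 {
		// The maximum uint64 marks a postings ID with no document.
		return 0, errNoDocument
	}
	return off, nil
}

// exampleIndex builds an index with base 10 and offsets for IDs 10, 11, 12,
// where ID 11 has no document.
func exampleIndex() []byte {
	index := make([]byte, baseSize+3*offsetSize)
	binary.LittleEndian.PutUint64(index[0:], 10)
	binary.LittleEndian.PutUint64(index[8:], 0)
	binary.LittleEndian.PutUint64(index[16:], math.MaxUint64)
	binary.LittleEndian.PutUint64(index[24:], 27)
	return index
}

func main() {
	index := exampleIndex()
	for _, id := range []uint64{10, 11, 12, 13} {
		off, err := lookupOffset(index, id)
		fmt.Println(id, off, err)
	}
}
```

Because no scanning is involved, the cost of a lookup does not depend on how many postings IDs the segment contains.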