# Storage

This directory contains a work-in-progress storage system for use with
the ArmoredWitness.
It should be considered experimental and subject to change!

Some details on the requirements and design of the storage system are below.

### Requirements

*   Allow the witness unikernel to persist small amounts of data: multiple independent records of up to a few MB each.
*   Use the eMMC as storage.
*   Avoid artificially shortening the life of the storage hardware (flash).
*   Persisted state should be resilient to corruption from power failure/reboot during writes.

#### Nice-to-haves

*   Be somewhat reusable for other ArmoredWitness use cases we may have.
    *   This probably means being able to store different types of data in specified locations.

#### Non-requirements

*   While we're ultimately limited by the performance of the storage hardware, it's not a priority to achieve the lowest possible latency or highest possible throughput for writes.
*   Integration with Go's `os.Open()` style APIs (this _would_ be great, but would require upstream work in TamaGo so is explicitly out of scope for now).

#### Out-of-scope

Some things are explicitly out of scope for this design:

*   Protecting against an attacker modifying the data on the storage in some out-of-band fashion.
*   Hardware failure resulting in previously readable data becoming unreadable/corrupted.
*   Supporting easy discovery / enumeration of data on disk, or preventing duplicate data from being written. Higher level code should be responsible for understanding what data should be in which slots.

### Design

The design is a relatively simple storage API which offers a fixed number of "storage slots" to which a representation of state can be written. Slot storage is allocated a range of the underlying storage, starting at a known byte offset and with a known length. It is also preconfigured with the number of slots that it should allocate (or, equivalently, the number of bytes to be reserved per slot).

Each slot is backed by a fixed-size "journal" stored across _N_ eMMC blocks.
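
As a rough illustration of the layout described above, the byte range backing each slot's journal can be derived from the region's start, the slot count and the journal length. This is only a sketch: the `geometry` type, the `slotRange` method and the 512-byte block size are assumptions made for the example, not names or values taken from the implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// blockSize is an assumed eMMC block size in bytes, for illustration only.
const blockSize = 512

// geometry is a hypothetical description of the region given to slot storage.
type geometry struct {
	startBlock    uint64 // first block of the region
	numSlots      uint64 // number of slots to allocate
	blocksPerSlot uint64 // journal length N, in blocks
}

// slotRange returns the byte offset and byte length of the journal backing
// the given slot, or an error if the slot index is out of bounds.
func (g geometry) slotRange(slot uint64) (offset, length uint64, err error) {
	if slot >= g.numSlots {
		return 0, 0, errors.New("slot out of bounds")
	}
	length = g.blocksPerSlot * blockSize
	offset = g.startBlock*blockSize + slot*length
	return offset, length, nil
}

func main() {
	g := geometry{startBlock: 1024, numSlots: 4, blocksPerSlot: 9}
	off, n, err := g.slotRange(2)
	fmt.Println(off, n, err)
}
```

Because every journal has the same fixed length, locating a slot is a constant-time calculation and no on-disk index is needed.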

Logically it can be thought of like so:

![image showing logical layout](images/logical_layout.png)

Physically it may look like this on the MMC block device itself (9 blocks per journal is just an example):

![image showing physical layout](images/physical_layout.png)

#### API

The API tries to be as simple as possible to use and implement for now - e.g. since we're only intending this to be used for O(MB) of data, it's probably fine to pass this to/from the storage layer as a straight `[]byte` slice.

However, if necessary, we could try to make the API more like Go's `io` package, with `Reader`s and `Writer`s.

```go
// Partition describes the extent and layout of a single contiguous region
// of the underlying block storage.
type Partition struct{}

// Open opens the specified slot, returns an error if the slot is out of bounds.
func (p *Partition) Open(slot int) (*Slot, error)

// Slot represents the current data in a slot.
type Slot struct{}

// Read returns the last data successfully written to the slot, along with
// a token which can be used with CheckAndWrite.
func (s *Slot) Read() ([]byte, uint32, error)

// Write stores the provided data to the slot.
// Upon successful completion, this data will be returned by future calls
// to Read until another successful Write call is made.
// If the call to Write fails, future calls to Read will return the
// previous successfully written data, if any.
func (s *Slot) Write(p []byte) error

// CheckAndWrite behaves like Write, with the exception that it will
// immediately return an error if the slot has been successfully written
// to since the Read call which produced the passed-in token.
func (s *Slot) CheckAndWrite(token uint32, p []byte) error
```
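
To make the intended `Read` / `CheckAndWrite` contract concrete, here is a minimal sketch using an in-memory stand-in for a slot. The `memSlot` type is hypothetical and exists only for this example; in particular, using the write count as the token is an assumption, and the real `Slot` is backed by the journal rather than a byte slice in RAM.

```go
package main

import (
	"errors"
	"fmt"
)

// memSlot is a hypothetical in-memory stand-in for Slot, used only to
// illustrate the Read / Write / CheckAndWrite contract.
type memSlot struct {
	data []byte
	rev  uint32 // number of successful writes; doubles as the Read token
}

// Read returns a copy of the last written data and the current token.
func (s *memSlot) Read() ([]byte, uint32, error) {
	return append([]byte(nil), s.data...), s.rev, nil
}

// Write replaces the stored data and advances the token.
func (s *memSlot) Write(p []byte) error {
	s.data = append([]byte(nil), p...)
	s.rev++
	return nil
}

// CheckAndWrite writes only if no write has happened since token was issued.
func (s *memSlot) CheckAndWrite(token uint32, p []byte) error {
	if token != s.rev {
		return errors.New("slot was written to since the token was issued")
	}
	return s.Write(p)
}

func main() {
	s := &memSlot{}
	_, tok, _ := s.Read()
	fmt.Println(s.CheckAndWrite(tok, []byte("checkpoint 1"))) // accepted
	fmt.Println(s.CheckAndWrite(tok, []byte("stale update"))) // rejected, token is stale
	d, _, _ := s.Read()
	fmt.Printf("%s\n", d)
}
```

A caller which wants to update state derived from the previous contents would `Read`, compute the new value, and then `CheckAndWrite` with the token it was given, retrying from the `Read` if the write is rejected.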

#### Internal structures

Data stored in the slot is represented by an _"update record"_ written to the journal.

The update record contains:

Field Name   | Type                        | Notes
-------------|-----------------------------|-------------------------
`Magic`      |`[4]byte{'T', 'F', 'J', '0'}`| Magic record header, v0
`Revision`   |`uint32`                     | Incremented with each write to slot
`DataLen`    |`uint64`                     | `len(RecordData)`
`Checksum`   |`[32]byte{}`                 | `SHA256` of `RecordData`
`RecordData` |`[DataLen]byte{}`            | Application data

An update record is considered _valid_ if its:

*   `Magic` is correct
*   `Checksum` is correct for the data in `RecordData[:DataLen]`
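
The record layout and validity rules above could be sketched roughly as follows. The field order and sizes come from the table, giving a 48-byte header; the big-endian byte order, the absence of padding between fields, and the names `marshal` and `unmarshal` are assumptions made for this example.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"errors"
	"fmt"
)

// headerLen is Magic + Revision + DataLen + Checksum = 48 bytes.
const headerLen = 4 + 4 + 8 + sha256.Size

var magic = [4]byte{'T', 'F', 'J', '0'}

// marshal serialises an update record. Big-endian order is an assumption.
func marshal(rev uint32, data []byte) []byte {
	buf := make([]byte, headerLen, headerLen+len(data))
	copy(buf[0:4], magic[:])
	binary.BigEndian.PutUint32(buf[4:8], rev)
	binary.BigEndian.PutUint64(buf[8:16], uint64(len(data)))
	sum := sha256.Sum256(data)
	copy(buf[16:headerLen], sum[:])
	return append(buf, data...)
}

// unmarshal returns the revision and data of a valid record, or an error if
// the magic is wrong, the record is truncated, or the checksum mismatches.
func unmarshal(b []byte) (uint32, []byte, error) {
	if len(b) < headerLen || !bytes.Equal(b[0:4], magic[:]) {
		return 0, nil, errors.New("bad magic")
	}
	rev := binary.BigEndian.Uint32(b[4:8])
	n := binary.BigEndian.Uint64(b[8:16])
	if n > uint64(len(b)-headerLen) {
		return 0, nil, errors.New("truncated record")
	}
	data := b[headerLen : headerLen+int(n)]
	if sum := sha256.Sum256(data); !bytes.Equal(sum[:], b[16:headerLen]) {
		return 0, nil, errors.New("bad checksum")
	}
	return rev, data, nil
}

func main() {
	rec := marshal(7, []byte("hello"))
	rev, data, err := unmarshal(rec)
	fmt.Println(rev, string(data), err)

	rec[len(rec)-1] ^= 0xff // simulate a corrupted or partially written record
	_, _, err = unmarshal(rec)
	fmt.Println(err)
}
```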

The first time `Open` is called for a given slot, the slot's journal will be scanned from the beginning to look for the valid update record with the largest `Revision`. The `RecordData` from this record is the data associated with the slot. It could potentially be cached in RAM at this point if it's small enough.

If no such record exists, then the slot has not yet been successfully written to and there is no data associated with the slot.

An update to the slot causes an update record to be written to the journal starting at either:

*   The first byte of the block following the extent of the "current" update record (i.e. each block contains header/data for at most 1 record), if there is sufficient space remaining in the journal to accommodate the entire update record without wrapping around to the first blocks, or
*   The first byte of the first block in the journal, if there is no current record or the update record will not fit in the remaining journal space.
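
The two placement rules above can be expressed as a small function working in whole blocks. This is a sketch only; `nextStart` and its parameters are hypothetical names chosen for the example.

```go
package main

import "fmt"

// nextStart returns the block index at which a new update record occupying
// newBlocks blocks should start. curStart and curBlocks describe the current
// record; curBlocks == 0 means the slot has no current record.
func nextStart(journalBlocks, curStart, curBlocks, newBlocks uint) uint {
	if curBlocks == 0 {
		return 0 // no current record: start at the first block
	}
	next := curStart + curBlocks // first block after the current record
	if next+newBlocks > journalBlocks {
		return 0 // would not fit without wrapping: restart at the first block
	}
	return next
}

func main() {
	// A 10-block journal receiving records of 3, 2, 3 and 3 blocks in turn.
	fmt.Println(nextStart(10, 0, 0, 3)) // first record
	fmt.Println(nextStart(10, 0, 3, 2)) // follows the 3-block record
	fmt.Println(nextStart(10, 3, 2, 3)) // follows the 2-block record
	fmt.Println(nextStart(10, 5, 3, 3)) // only 2 blocks remain, so restart
}
```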

Following a successful write to storage, the metadata associated with the slot (i.e. `Revision`, current header location, location for next write, etc.) is updated.

The diagram below shows a sequence of several update record writes of varying data sizes. These writes are taking place in a single journal, which you'll remember comprises several blocks.

Green squares represent blocks holding the latest successful write; the remaining squares represent blocks which are either unwritten or contain old/previous data:

```
⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛ - Initial state, nothing written
🟩🟩🟩⬛⬛⬛⬛⬛⬛⬛ - First record (rev=1) has been successfully stored
⬜⬜⬜🟩🟩⬛⬛⬛⬛⬛ - Next record (rev=2) is stored starting at the next available block
⬜⬜⬜⬜⬜🟩🟩🟩⬛⬛ - Same again.
🟩🟩🟩⬜⬜⬜⬜⬜⬛⬛ - The 4th record will not fit in the remaining space, so is written starting at the zeroth block, overwriting old revision(s) - note it does not wrap around.
⬜⬜⬜🟩🟩🟩⬜⬜⬛⬛ - Subsequent revisions continue in this vein.
```

Since record revisions should always be increasing as we scan left-to-right through the slot storage, we can assume we've found the newest update record when we've either reached the end of the storage space or, after having read at least 1 _good_ update record, we find a record with a lower `Revision` than the previous record, or one with an invalid `Magic` or `Checksum`.
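
The stopping rule above might be sketched as follows. To keep the example short, the journal is modelled as a slice with one entry per block, where an entry is non-nil only if a valid update record starts at that block; the `record` type and `newest` function are hypothetical, and real code would parse and validate headers read from storage instead.

```go
package main

import "fmt"

// record is a hypothetical summary of a valid update record found on storage.
type record struct {
	rev    uint32 // Revision from the header
	blocks int    // number of blocks the record occupies (assumed >= 1)
}

// newest scans a journal from its first block. journal[i] is the valid record
// whose header starts at block i, or nil if block i holds no valid header.
// It returns the start block and record of the newest write, or (-1, nil).
func newest(journal []*record) (int, *record) {
	start, cur := -1, (*record)(nil)
	for i := 0; i < len(journal); {
		r := journal[i]
		switch {
		case r == nil && cur == nil:
			i++ // no good record seen yet, keep looking
		case r == nil || (cur != nil && r.rev < cur.rev):
			return start, cur // invalid or older record: cur is the newest
		default:
			start, cur = i, r
			i += r.blocks // jump to the block after this record
		}
	}
	return start, cur
}

func main() {
	// State after the 4th write in the first diagram: rev=4 has been written
	// from block 0, while rev=2 and rev=3 are still intact further along.
	j := make([]*record, 10)
	j[0] = &record{rev: 4, blocks: 3}
	j[3] = &record{rev: 2, blocks: 2}
	j[5] = &record{rev: 3, blocks: 3}
	start, rec := newest(j)
	fmt.Println(start, rec.rev)
}
```

Skipping forward one block at a time until the first good record is found lets the scan recover when the blocks at the start of the journal hold a failed write.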

#### Failed/interrupted writes

For a failed write to the storage to have any permanent effect at all, it must have succeeded in writing at least the 1st block of the update record (the block containing the header). Since the write did not complete, the `Checksum` stored in that header will not match the `RecordData` actually present on storage, which allows the failure to be detected with high probability when reading back.

The maximum permitted `RecordData` size is restricted to `(TotalSlotSize/3) - len(Header)`; this prevents a failed write from obliterating all or part of the previous successful write, so unless the failed write is the first attempt to write to the slot, there will always be a valid previous record available (modulo storage fabric failure).
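
As a worked example of the size limit, the sketch below computes the largest permitted `RecordData` for a journal of 9 blocks. The 512-byte block size and the `maxRecordData` helper are assumptions made for illustration; the 48-byte header length follows from the field sizes in the update record table.

```go
package main

import "fmt"

// headerLen is Magic (4) + Revision (4) + DataLen (8) + Checksum (32) bytes.
const headerLen = 4 + 4 + 8 + 32

// maxRecordData returns the largest RecordData size, in bytes, permitted for
// a slot whose journal is totalSlotSize bytes long.
func maxRecordData(totalSlotSize uint64) uint64 {
	limit := totalSlotSize / 3
	if limit < headerLen {
		return 0 // journal too small to hold even a header within the limit
	}
	return limit - headerLen
}

func main() {
	// 9 blocks of an assumed 512 bytes each: 4608/3 - 48 bytes.
	fmt.Println(maxRecordData(9 * 512))
}
```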

Adding records with failed writes (red squares represent blocks touched by a failed write):

```
⬛⬛⬛⬛⬛⬛⬛⬛⬛ - Initial state, nothing written
🟩🟩⬜⬜⬜⬜⬜⬜⬛ - First record (rev=1) stored successfully
⬜⬜🟩🟩🟩⬜⬜⬜⬛ - Second write (rev=2) is successful too.
⬜⬜⬜⬜⬜🟥🟥🟥⬛ - Third write fails
⬜⬜⬜⬜⬜🟩🟩🟩⬛ - Application retries, record (rev=3) is written successfully this time.
🟩🟩⬜⬜⬜⬜⬜⬜⬛ - Application successfully writes (rev=4)
⬜⬜🟩🟩🟩⬜⬜⬜⬛ - and (rev=5)
⬜⬜⬜⬜⬜🟩🟩🟩⬛ - and (rev=6), too
🟥🟥🟥⬜⬜🟩🟩🟩⬛ - Attempt to write (rev=7), located at the zeroth block, fails, corrupting (rev=4) and (rev=5), but (rev=6), the current good record, is intact.
```

#### Other properties

This journal-type approach affords a couple of additional nice properties given the environment and use case:

1. The API can provide _check-and-set_ semantics: _"Write an update record with revision X, iff the current record is revision X-1"_.
2. A very basic notion of "wear levelling" is provided since writes are spread out across most blocks. Note that this is less important here as the ArmoredWitness has eMMC storage, which mandates that the integrated controller performs wear levelling transparently.