# Log Cloner

This directory contains a library, database, and tools for cloning transparency logs.
The core library and database are log-agnostic, and each tool tailors this generic library to a specific log.

## Design Considerations

The core library attempts to balance optimization of the following goals:
  1. Downloading as quickly as possible
  2. Backing off when requested by the log (i.e. not DoSing the log)
  3. Simple local state / recovery
  4. Reusability across all verifiable logs

This is achieved by:
  1. Downloading batches of leaves in parallel
  2. Using exponential backoff on transient failures
  3. Writing leaves to the local database strictly in sequence
     1. This ensures there are no missing ranges, which keeps state tracking simple
  4. Treating all leaf data as binary blobs, with no attempt to parse them

## How it Works

The goal is to download data as quickly as possible from the log, and then persist verified data locally.
A single download session looks like this:

1. Get a checkpoint from the log and store this _in memory_
   1. We will refer to the size of this checkpoint (i.e. the number of leaves it commits to) as `N`
2. Read the last checkpoint persisted in the local database in order to determine the local size, `M`
   1. If no previous checkpoint is stored then `M` is 0
3. Download all leaves in the range `[M, N)` from the log
   1. Leaves are fetched in batches, in parallel, and pooled in memory temporarily
   2. Leaves are written from this memory pool to the `leaves` table of the database, strictly _in order_ of their index
4. Once `N` leaves have been written to the database, calculate the Merkle root of all of these leaves
5. If, and only if, the Merkle root matches the checkpoint downloaded in (1), write this checkpoint to the `checkpoints` table of the database
   1. A compact representation of the Merkle tree is also stored along with this checkpoint in the form of a [compact range](https://github.com/transparency-dev/merkle/tree/main/compact)

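For logs that use RFC 6962-style hashing (such as CT), the root calculation in step (4) can be sketched with just the standard library. The real library uses the compact ranges linked above rather than this naive recursion; this is only an illustration of the hashing scheme.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// leafHash computes the RFC 6962 leaf hash: SHA-256(0x00 || leaf).
func leafHash(leaf []byte) [32]byte {
	return sha256.Sum256(append([]byte{0x00}, leaf...))
}

// nodeHash computes the RFC 6962 inner-node hash: SHA-256(0x01 || left || right).
func nodeHash(left, right [32]byte) [32]byte {
	data := append([]byte{0x01}, left[:]...)
	data = append(data, right[:]...)
	return sha256.Sum256(data)
}

// merkleRoot computes the RFC 6962 Merkle Tree Hash over leaves: the left
// subtree covers the largest power of two strictly less than len(leaves).
func merkleRoot(leaves [][]byte) [32]byte {
	if len(leaves) == 0 {
		return sha256.Sum256(nil)
	}
	if len(leaves) == 1 {
		return leafHash(leaves[0])
	}
	k := 1
	for k*2 < len(leaves) {
		k *= 2
	}
	return nodeHash(merkleRoot(leaves[:k]), merkleRoot(leaves[k:]))
}

func main() {
	leaves := [][]byte{[]byte("a"), []byte("b"), []byte("c")}
	fmt.Printf("%x\n", merkleRoot(leaves))
}
```

Only if this recomputed root matches the in-memory checkpoint from step (1) is the checkpoint persisted, which is what makes the locally stored leaves trustworthy.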
Note that this means that until a download session completes successfully, the database may contain unverified leaves with an index greater than that stored in the latest checkpoint.
Leaves must not be trusted if their index is greater than or equal to the size of the latest checkpoint.

## Custom Processing

This library was designed to form the first part of a local data pipeline, i.e. downstream tooling can be written that reads from this local mirror of the log.
Such tooling MUST only trust leaves that are committed to by a checkpoint; reading leaves with an index greater than or equal to the current checkpoint size is possible, but such data is unverified, and using it defeats the purpose of using verifiable data structures.

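The trust rule above amounts to a simple filter on the leaf index. `Leaf` and `trusted` below are hypothetical names for illustration, not part of this library:

```go
package main

import "fmt"

// Leaf is a hypothetical row from the leaves table.
type Leaf struct {
	ID   int64  // position in the log
	Data []byte // raw leaf blob
}

// trusted returns only the leaves committed to by a checkpoint of the
// given size, i.e. those with index strictly less than checkpointSize.
func trusted(leaves []Leaf, checkpointSize int64) []Leaf {
	var out []Leaf
	for _, l := range leaves {
		if l.ID < checkpointSize {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	leaves := []Leaf{{ID: 0}, {ID: 1}, {ID: 2}}
	// With a checkpoint of size 2, only leaves 0 and 1 are committed.
	fmt.Println(len(trusted(leaves, 2))) // → 2
}
```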
The `leaves` table records the leaf data as blobs; this accurately reflects what the log has committed to, but does not enable efficient SQL queries into the data.
A common usage pattern for a specific log ecosystem is to have the first stage of the local pipeline parse the leaf data and break out the contents into a table with an appropriate schema for the parsed data.

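A first pipeline stage of this kind might look like the sketch below. The JSON leaf format here is entirely hypothetical; real leaf formats are log-specific and typically binary (e.g. TLS-encoded entries for CT).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Record is a hypothetical parsed form of a leaf blob, chosen only to
// illustrate the parse-then-store pattern.
type Record struct {
	Module  string `json:"module"`
	Version string `json:"version"`
}

// parseLeaf turns one raw blob from the leaves table into a structured
// row suitable for inserting into a schema-specific table.
func parseLeaf(data []byte) (Record, error) {
	var r Record
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	blob := []byte(`{"module":"example.com/m","version":"v1.0.0"}`)
	r, err := parseLeaf(blob)
	if err != nil {
		panic(err)
	}
	fmt.Println(r.Module, r.Version) // → example.com/m v1.0.0
}
```

Once the contents are broken out into typed columns, downstream queries can use ordinary SQL indexes instead of scanning opaque blobs.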
## Database Setup

In MariaDB, create a database and user. Below is an example of doing this for MariaDB 10.6, creating a database `google_xenon2022` and a user `clonetool` with password `letmein`.

```
MariaDB [(none)]> CREATE DATABASE google_xenon2022;
MariaDB [(none)]> CREATE USER 'clonetool'@localhost IDENTIFIED BY 'letmein';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON google_xenon2022.* TO 'clonetool'@localhost;
MariaDB [(none)]> FLUSH PRIVILEGES;
```

## Tuning

The clone library logs information as it runs using `glog`.
Provided `--alsologtostderr` is passed to any tool using the library, you should see output such as the following during the cloning process:

```
I0824 11:09:23.517796 2881011 clone.go:71] Fetching [4459168, 95054738): Remote leaves: 95054738. Local leaves: 4459168 (0 verified).
I0824 11:09:28.519257 2881011 clone.go:177] 1202.8 leaves/s, last leaf=4459168 (remaining: 90595569, ETA: 20h55m21s), time working=24.2%
I0824 11:09:33.518444 2881011 clone.go:177] 1049.6 leaves/s, last leaf=4465183 (remaining: 90589554, ETA: 23h58m29s), time working=23.0%
I0824 11:09:38.518542 2881011 clone.go:177] 1024.0 leaves/s, last leaf=4470430 (remaining: 90584307, ETA: 24h34m21s), time working=23.3%
```

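The ETA in these progress lines is simply the remaining leaf count divided by the current rate; a quick sketch reproducing the estimate from the first progress line:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// eta estimates time to completion from the values reported in the
// progress lines: remaining leaves divided by the leaves/s rate.
func eta(remaining int64, leavesPerSecond float64) time.Duration {
	secs := math.Round(float64(remaining) / leavesPerSecond)
	return time.Duration(secs) * time.Second
}

func main() {
	// Values taken from the first progress line above.
	fmt.Println(eta(90595569, 1202.8)) // → 20h55m21s
}
```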
When tuning, it is recommended to also pass `--v=1` to see more verbose output.
In particular, this will show whether the tool is encountering errors from the log server (such as being told to back off), e.g. lines such as `Retryable error getting data` in the output.

Assuming you aren't being rate limited, optimization goes as follows:
  1. Get the `working` percentage consistently close to 100%: this measures how much of the time is spent writing to the database. To do this, increase the number of `workers` to ensure that data is always available to the database writer
  2. Once `working` is around 100%, increasing the DB write batch size will increase throughput, up to a point

Finding what works for your setup is somewhat iterative.
It depends on many variables such as log latency and rate limiting, the database, the machine running the clone tool, etc.

## Download Clients

Download clients are provided for:
  * [CT](cmd/ctclone/)
  * [sum.golang.org](cmd/sumdbclone/)
  * [serverless HTTP logs](cmd/serverlessclone/)

See the documentation for each of these tools for specifics.

## Using Cloned Data

The data is stored in a simple database table named `leaves`, where each leaf in the log
is identified by its position in the log (column `id`) and the data at that position is a
blob (column `data`). The expectation is that clients will write custom tooling that reads
from this SQL table in order to perform custom tasks, e.g. verification of data, searching
to find relevant records, etc.

An example query that returns the first 5 leaves in the DB:
```
SELECT * FROM leaves WHERE id < 5;
```

## Quick Start (Mac)

On a Mac with [Homebrew](https://brew.sh) already installed, installing MariaDB
and connecting to it in order to run the setup above is simple.
At the time of writing, this installed `10.8.3-MariaDB Homebrew`, which works well with
the cloning tool.

```
brew install mariadb
brew services restart mariadb
mysql
```