## chifra scrape

The `chifra scrape` application creates TrueBlocks' chunked index of address appearances -- the
fundamental data structure of the entire system. It also, optionally, pins each chunk of the index
to IPFS.

`chifra scrape` is a long-running process, so we advise running it as a service or in a terminal
multiplexer such as `tmux`. You may start and stop `chifra scrape` as needed, but while it is stopped
the scraper does not keep up with the front of the blockchain. The next time it starts, it will
have to catch up to the chain, a process that may take several hours depending on how long ago it
was last run. See the sections below and the "Papers" section of our website for more information
on how the scraping process works and the prerequisites for its proper operation.

You may adjust the speed of the index creation with the `--sleep` and `--block_cnt` options. On
some machines, or when running against some EVM node software, the scraper may overburden the
hardware. Increasing `--sleep` or lowering `--block_cnt` slows the scraper and eases that load.
Finally, you may optionally `--pin` each new chunk to IPFS, which naturally shards the database
among all users. By default, pinning is against a locally running IPFS node, but the `--remote`
option allows pinning to an IPFS pinning service such as Pinata.
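
As a rough illustration of the throttling described above, the hypothetical shell helper below (the name `scrape_gently` is ours, not part of `chifra`) wraps the two documented flags so that gentler settings are easy to reuse. The example values are assumptions; suitable numbers depend on your hardware and node software.

```shell
# scrape_gently SLEEP_SECS BLOCK_CNT [extra flags...]
# Hypothetical wrapper: runs the scraper with a caller-chosen pause between
# passes and a caller-chosen maximum number of blocks per pass.
# (The documented defaults are 14 seconds and 2000 blocks.)
scrape_gently() {
    sleep_secs="$1"
    block_cnt="$2"
    shift 2
    # Any remaining arguments (for example --dry_run) are passed through as-is.
    chifra scrape --sleep "$sleep_secs" --block_cnt "$block_cnt" "$@"
}
```

For example, `scrape_gently 30 500 --dry_run` would request a 30 second pause and passes of at most 500 blocks; with `--dry_run`, the scraper only shows the configuration that would be applied.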

```[plaintext]
Purpose:
  Scan the chain and update the TrueBlocks index of appearances.

Usage:
  chifra scrape [flags]

Flags:
  -n, --block_cnt uint   maximum number of blocks to process per pass (default 2000)
  -s, --sleep float      seconds to sleep between scraper passes (default 14)
  -l, --touch uint       first block to visit when scraping (snapped back to most recent snap_to_grid mark)
  -u, --run_count uint   run the scraper this many times, then quit
  -d, --dry_run          show the configuration that would be applied if run,no changes are made
  -o, --notify           enable the notify feature
  -v, --verbose          enable verbose output
  -h, --help             display this help screen

Notes:
  - The --touch option may only be used for blocks after the latest scraped block (if any). It will be snapped back to the latest snap_to block.
  - This command requires your RPC to provide trace data. See the README for more information.
  - The --notify option requires proper configuration. Additionally, IPFS must be running locally. See the README.md file.
```
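
The snapping mentioned in the notes for `--touch` can be pictured as rounding a block number down to a grid mark. The sketch below assumes a mark is the largest multiple of `snapToGrid` (250000 by default) at or below the requested block. The helper name `snap_back` is hypothetical, and the scraper itself may apply further rules (such as `firstSnap`) that are not modeled here.

```shell
# snap_back BLOCK [GRID]
# Assumption: a snap-to-grid mark is a multiple of GRID, so snapping back
# means subtracting the remainder of BLOCK divided by GRID.
snap_back() {
    block="$1"
    grid="${2:-250000}"
    echo $(( block - block % grid ))
}
```

Under that assumption, `snap_back 19123456` prints `19000000`.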

Data models produced by this tool:

- [chunkrecord](/data-model/admin/#chunkrecord)
- [manifest](/data-model/admin/#manifest)
- [message](/data-model/other/#message)

### configuration

The following additional configurable command line options are available.

**Configuration file:** `trueBlocks.toml`  
**Configuration group:** `[scrape.<chain>]`

| Item         | Type   | Default | Description                                                                                                              |
| ------------ | ------ | ------- | ------------------------------------------------------------------------------------------------------------------------ |
| appsPerChunk | uint64 | 2000000 | the number of appearances to build into a chunk before consolidating it                                                  |
| snapToGrid   | blknum | 250000  | an override to apps_per_chunk to snap-to-grid at every modulo of this value; this allows easier corrections to the index |
| firstSnap    | blknum | 2000000 | the first block at which snap_to_grid is enabled                                                                         |
| unripeDist   | blknum | 28      | the distance (in blocks) from the front of the chain under which (inclusive) a block is considered unripe                |
| channelCount | uint64 | 20      | number of concurrent processing channels                                                                                 |
| allowMissing | bool   | false   | do not report errors for blockchains that contain blocks with zero addresses                                             |

Note that for Ethereum mainnet, the default values for appsPerChunk and firstSnap are 2,000,000 and 2,300,000 respectively. See the specification for a justification of these values.
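
One way to read the `unripeDist` setting is that a block counts as unripe while it sits within `unripeDist` blocks of the chain head, with the boundary included. The sketch below encodes that reading only; the helper name `is_unripe` is hypothetical, and the scraper's actual test may differ in detail.

```shell
# is_unripe HEAD_BLOCK BLOCK [UNRIPE_DIST]
# Succeeds (exit status 0) when BLOCK is at most UNRIPE_DIST blocks behind
# HEAD_BLOCK. The inclusive comparison is our interpretation of the table.
is_unripe() {
    head_block="$1"
    block="$2"
    dist="${3:-28}"
    [ $(( head_block - block )) -le "$dist" ]
}
```

With the default distance of 28, `is_unripe 1000 980 && echo unripe` prints `unripe`, because block 980 is only 20 blocks behind the head.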

These items may be set in three ways, each overriding the preceding method:

- in the above configuration file under the `[scrape.<chain>]` group,
- in the environment, by exporting the configuration item in upper case (with underbars removed) and prefixed with `TB_SCRAPE_<CHAIN>` (underbars included), or
- on the command line, using the configuration item with leading dashes and in snake case (i.e., `--snake_case`).
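
The naming rules in the list above can be sketched as two small shell helpers. Both function names are hypothetical. The sketch also assumes that `CHAIN` stands for the upper-cased chain name and that it is joined to the item with an underbar; check this against your own installation before relying on it.

```shell
# env_name CHAIN ITEM
# e.g. "mainnet" "channelCount" -> TB_SCRAPE_MAINNET_CHANNELCOUNT
# (assumed layout: prefix, upper-cased chain, underbar, upper-cased item)
env_name() {
    chain=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    item=$(printf '%s' "$2" | tr -d '_' | tr '[:lower:]' '[:upper:]')
    printf '%s\n' "TB_SCRAPE_${chain}_${item}"
}

# flag_name ITEM
# e.g. "channelCount" -> --channel_count (camelCase to snake_case)
flag_name() {
    snake=$(printf '%s' "$1" | sed 's/\([[:upper:]]\)/_\1/g' | tr '[:upper:]' '[:lower:]')
    printf '%s\n' "--${snake}"
}
```

Because later methods override earlier ones, a value given on the command line through the flag from `flag_name` would take precedence over the variable named by `env_name`, which in turn would take precedence over the configuration file.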

### further information

Each time `chifra scrape` runs, it begins at the last block it completed processing (plus one). With
each pass, the scraper descends into each block's complete data. (This is why TrueBlocks requires
a `--tracing` node.) As the scraper encounters appearances of addresses in the
block's data, it adds those appearances to a growing index. Periodically (after processing the
block that contains the 2,000,000th appearance), the system consolidates an **index chunk**.
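
The consolidation rule in the paragraph above can be illustrated with a toy simulation. It assumes the running count restarts at zero after each consolidation and ignores the snap-to-grid override; the function name `chunk_boundaries` and the tiny numbers are purely illustrative.

```shell
# chunk_boundaries APPS_PER_CHUNK COUNT...
# Each COUNT is the number of appearances found in one block, in order.
# Prints the position (starting at 0) of every block after which a chunk
# would be consolidated, i.e. the block in which the running total first
# reaches APPS_PER_CHUNK.
chunk_boundaries() {
    limit="$1"
    shift
    total=0
    block=0
    for apps in "$@"; do
        total=$(( total + apps ))
        if [ "$total" -ge "$limit" ]; then
            echo "$block"
            total=0   # assumption: counting restarts with the next block
        fi
        block=$(( block + 1 ))
    done
}
```

With a limit of 10, `chunk_boundaries 10 4 5 3 2 9 1` prints `2` and then `4`. The chunks finish whole blocks, which is why a real chunk holds approximately, rather than exactly, the configured number of appearances.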

An **index chunk** is a portion of the index containing approximately 2,000,000 records (although
this number is adjustable for different chains). As part of the consolidation, the scraper creates
a Bloom filter representing the set membership in the associated index portion. The Bloom filters
are an order of magnitude smaller than the index chunks. The system then pushes both the index
chunk and the Bloom filter to IPFS. In this way, TrueBlocks creates an immutable, uncapturable
index of appearances that can be used not only by TrueBlocks, but by any member of the community who
needs it. (Hint: We all need it.)

Users of any of the TrueBlocks applications (or anyone else's applications) may subsequently
download the Bloom filters, query them to determine which **index chunks** need to be downloaded,
and thereby build a historical list of transactions for a given address. All of this imposes only
a minimal resource burden on the end user's machine.
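
To make the query step above concrete, here is a deliberately tiny Bloom filter written in shell. It is a teaching sketch only and does not reflect TrueBlocks' on-disk format: the filter is a text file of set bit positions, the three "hash functions" are salted `cksum` values, and all names (`BLOOM_BITS`, `bloom_add`, `bloom_maybe_contains`) are hypothetical.

```shell
BLOOM_BITS=1024   # toy filter size; real filters are far larger

# bloom_positions ITEM: print three bit positions from salted checksums.
bloom_positions() {
    for salt in 1 2 3; do
        sum=$(printf '%s:%s' "$salt" "$1" | cksum | cut -d ' ' -f 1)
        echo $(( sum % BLOOM_BITS ))
    done
}

# bloom_add FILTER_FILE ITEM: record the item's bit positions as set.
bloom_add() {
    bloom_positions "$2" >> "$1"
}

# bloom_maybe_contains FILTER_FILE ITEM: succeed only if every position is set.
# Success means "possibly present" (false positives can occur); failure means
# "definitely absent", so the matching chunk need not be downloaded.
bloom_maybe_contains() {
    for pos in $(bloom_positions "$2"); do
        grep -qx "$pos" "$1" || return 1
    done
    return 0
}
```

A client would run the equivalent of `bloom_maybe_contains` against each chunk's filter and fetch only the chunks for which it succeeds. Those chunks are then searched to confirm the address's appearances.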

Recently, we enabled the ability for the end user to pin these downloaded index chunks and blooms
on their own machines. The user needs the data for the software to operate -- sharing requires
minimal effort and makes the data available to other people. Everyone is better off. A
naturally occurring network effect.

### tracing

The `chifra scrape` command requires your node to provide the `trace_block` (and related) RPC endpoints. Please see the
README file for the `chifra traces` command for more information.
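
Before starting a long scrape, it may be worth checking that your node answers `trace_block` at all. The sketch below builds a standard JSON-RPC request for that method and posts it with `curl`. The helper names are hypothetical, and the RPC URL is whatever your own node exposes.

```shell
# trace_block_payload [BLOCK]: print a JSON-RPC request body for trace_block.
# BLOCK is a hex block number or a tag such as "latest" (the default).
trace_block_payload() {
    printf '{"jsonrpc":"2.0","id":1,"method":"trace_block","params":["%s"]}\n' "${1:-latest}"
}

# probe_tracing RPC_URL: send the request and print the node's raw reply.
probe_tracing() {
    curl -s -X POST -H 'Content-Type: application/json' \
        --data "$(trace_block_payload latest)" "$1"
}
```

For instance, `probe_tracing http://localhost:8545` (an assumed local endpoint) prints the node's reply. A reply carrying a `result` field suggests tracing is available, while an `error` field (typically "method not found") suggests the node cannot serve the scraper.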

### prerequisites

`chifra scrape` works with any EVM-based blockchain, but does not currently work without a "tracing,
archive" RPC endpoint. The Erigon and Reth blockchain nodes, given their minimal disk footprint for an
archive node and their support of the required `trace_` endpoint routines, are recommended.

Please [see this article](https://trueblocks.io/blog/a-long-winded-explanation-of-trueblocks/) for
more information about running the scraper and building and sharing the index of appearances.

### notifications

The `chifra scrape` command provides a notification feature which is used primarily for `trueblocks-key`.
To configure it, you must edit the `trueBlocks.toml` file. You may edit the configuration file with
`chifra config edit`. Add the following configuration items to the `[settings]` group:

```toml
[settings.notify]
    url = "http://localhost:5555" # or other
    author = "TrueBlocks" # optional
```

In addition, you must enable the feature by adding the `--notify` option to the command line.
### Other Options

All tools accept the following additional flags, although in some cases, they have no meaning.

```[plaintext]
  -v, --version         display the current version of the tool
      --output string   write the results to file 'fn' and return the filename
      --append          for --output command only append to instead of replace contents of file
      --file string     specify multiple sets of command line options in a file
```

**Note:** For the `--file string` option, you may place a series of valid command lines in a file using any
valid flags. In some cases, this may significantly improve performance. A semi-colon at the start
of any line makes it a comment.

**Note:** If you use the `--output` and `--append` options together with the `--file` option, you may not switch
export formats in the command file. For example, a command file with two different commands, one with `--fmt csv`
and the other with `--fmt json`, will produce both invalid CSV and invalid JSON.

*Copyright (c) 2024, TrueBlocks, LLC. All rights reserved. Generated with goMaker.*