github.com/TrueBlocks/trueblocks-core/src/apps/chifra@v0.0.0-20241022031540-b362680128f7/internal/scrape/README.md

## chifra scrape

The `chifra scrape` application creates TrueBlocks' chunked index of address appearances -- the
fundamental data structure of the entire system. It also, optionally, pins each chunk of the index
to IPFS.

`chifra scrape` is a long-running process, so we advise running it as a service or in a terminal
multiplexer such as `tmux`. You may start and stop `chifra scrape` as needed, but while it is stopped
the scraper does not keep up with the front of the blockchain. The next time it starts, it will
have to catch up to the chain, a process that may take several hours depending on how long ago it
was last run. See the sections below and the "Papers" section of our website for more information
on how the scraping process works and the prerequisites for its proper operation.

You may adjust the speed of the index creation with the `--sleep` and `--block_cnt` options. On
some machines, or when running against some EVM node software, the scraper may overburden the
hardware. Slowing things down helps ensure proper operation. Finally, you may optionally `--pin`
each new chunk to IPFS, which naturally shards the database among all users. By default, pinning
is against a locally running IPFS node, but the `--remote` option allows pinning to an IPFS
pinning service such as Pinata.

```[plaintext]
Purpose:
  Scan the chain and update the TrueBlocks index of appearances.

Usage:
  chifra scrape [flags]

Flags:
  -n, --block_cnt uint   maximum number of blocks to process per pass (default 2000)
  -s, --sleep float      seconds to sleep between scraper passes (default 14)
  -l, --touch uint       first block to visit when scraping (snapped back to most recent snap_to_grid mark)
  -u, --run_count uint   run the scraper this many times, then quit
  -d, --dry_run          show the configuration that would be applied if run, no changes are made
  -o, --notify           enable the notify feature
  -v, --verbose          enable verbose output
  -h, --help             display this help screen

Notes:
  - The --touch option may only be used for blocks after the latest scraped block (if any). It will be snapped back to the latest snap_to block.
  - This command requires your RPC to provide trace data. See the README for more information.
  - The --notify option requires proper configuration. Additionally, IPFS must be running locally. See the README.md file.
```
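
As a rough illustration of how the flags listed above combine, the following Python sketch assembles a `chifra scrape` command line that slows the scraper down. The helper name `scrape_command` and the chosen values are hypothetical; only the flags themselves come from the help text. The sketch builds and prints the argument list without running it.

```python
import shlex


def scrape_command(block_cnt=2000, sleep=14.0, run_count=None, dry_run=False):
    """Build an argument list for `chifra scrape` (hypothetical helper)."""
    args = ["chifra", "scrape", "--block_cnt", str(block_cnt), "--sleep", str(sleep)]
    if run_count is not None:
        args += ["--run_count", str(run_count)]
    if dry_run:
        args.append("--dry_run")
    return args


# Fewer blocks per pass and a longer pause between passes ease the load on the node.
gentle = scrape_command(block_cnt=500, sleep=30, dry_run=True)
print(shlex.join(gentle))

# To actually start the scraper, pass the list to subprocess.run(gentle, check=True).
```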

Data models produced by this tool:

- [chunkrecord](/data-model/admin/#chunkrecord)
- [manifest](/data-model/admin/#manifest)
- [message](/data-model/other/#message)

### configuration

The following additional configuration items are available.

**Configuration file:** `trueBlocks.toml`
**Configuration group:** `[scrape.<chain>]`

| Item | Type | Default | Description |
| ------------ | ------ | ------- | ------------------------------------------------------------------------------------------------------------------------ |
| appsPerChunk | uint64 | 2000000 | the number of appearances to build into a chunk before consolidating it |
| snapToGrid | blknum | 250000 | an override to apps_per_chunk to snap-to-grid at every multiple of this value; this allows easier corrections to the index |
| firstSnap | blknum | 2000000 | the first block at which snap_to_grid is enabled |
| unripeDist | blknum | 28 | the distance (in blocks) from the front of the chain under which (inclusive) a block is considered unripe |
| channelCount | uint64 | 20 | number of concurrent processing channels |
| allowMissing | bool | false | do not report errors for blockchains that contain blocks with zero addresses |

Note that for Ethereum mainnet, the default values for `appsPerChunk` and `firstSnap` are 2,000,000 and 2,300,000 respectively. See the specification for a justification of these values.
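
To make `snapToGrid` and `unripeDist` more concrete, here is a minimal Python sketch of the arithmetic the table describes. It is not taken from the scraper's source: the function names are hypothetical, snapping is assumed to mean rounding down to the nearest multiple of `snapToGrid`, and the effect of `firstSnap` is deliberately left out.

```python
def snap_back(block, snap_to_grid=250_000):
    """Return the most recent grid mark at or before `block`.

    Assumption: a grid mark is any block number that is a multiple of snap_to_grid.
    """
    return block - (block % snap_to_grid)


def is_unripe(block, chain_head, unripe_dist=28):
    """A block within unripe_dist blocks of the chain head (inclusive) counts as unripe."""
    return chain_head - block <= unripe_dist


print(snap_back(2_612_345))              # most recent grid mark before block 2,612,345
print(is_unripe(972, chain_head=1_000))  # exactly 28 blocks behind the head
print(is_unripe(971, chain_head=1_000))  # 29 blocks behind the head
```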

These items may be set in three ways, each overriding the preceding method:

- in the above configuration file under the `[scrape.<chain>]` group,
- in the environment, by exporting the configuration item in upper case (with underbars removed) prefixed with `TB_SCRAPE_<CHAIN>` (underbars included), or
- on the command line using the configuration item with leading dashes and in snake case (i.e., `--snake_case`).
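
The naming rules and the override order above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the underbar joining the `TB_SCRAPE_<CHAIN>` prefix to the item name is an assumption, and the real tool may parse and validate values differently.

```python
import re


def env_var_name(chain, item):
    """Upper-case the item, drop its underbars, and prefix it with TB_SCRAPE_<CHAIN>."""
    # Assumption: the prefix and the item name are joined by an underbar.
    return f"TB_SCRAPE_{chain.upper()}_{item.replace('_', '').upper()}"


def cli_flag(item):
    """Convert a camelCase item name to its --snake_case command line flag."""
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", item).lower()
    return f"--{snake}"


def resolve(item, chain, default, file_cfg, env, cli):
    """Apply the three sources in order; each later source overrides the earlier ones."""
    value = default
    sources = (
        (file_cfg, item),                   # 1. configuration file
        (env, env_var_name(chain, item)),   # 2. environment
        (cli, cli_flag(item)),              # 3. command line
    )
    for source, key in sources:
        if key in source:
            value = source[key]
    return value


print(env_var_name("mainnet", "appsPerChunk"))
print(cli_flag("appsPerChunk"))
print(resolve("unripeDist", "mainnet", 28, {"unripeDist": 30}, {}, {"--unripe_dist": 50}))
```

Here the configuration file, the environment, and the command line are modeled as plain dictionaries so the precedence is easy to follow.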

### further information

Each time `chifra scrape` runs, it begins at the block after the last one it finished processing. With
each pass, the scraper descends into each block's complete data. (This is why TrueBlocks requires
a `--tracing` node.) As the scraper encounters appearances of addresses in the
block's data, it adds those appearances to a growing index. Periodically (after processing the
block that contains the 2,000,000th appearance), the system consolidates an **index chunk**.
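
The consolidation rule described above (close a chunk after the block that contains the N-th appearance) can be sketched as follows. This is a simplified model, not the scraper's code: the function name and the input shape, a list of `(block_number, appearance_count)` pairs, are hypothetical, and snap-to-grid is ignored.

```python
def chunk_ranges(appearances_per_block, apps_per_chunk=2_000_000):
    """Group consecutive blocks into chunks, closing each chunk on a block boundary.

    A chunk is closed after the block in which the running appearance count
    reaches apps_per_chunk, so a chunk may hold slightly more than that many records.
    """
    chunks = []
    first_block = None
    count = 0
    for block, n_appearances in appearances_per_block:
        if first_block is None:
            first_block = block
        count += n_appearances
        if count >= apps_per_chunk:
            chunks.append((first_block, block, count))
            first_block, count = None, 0
    # Blocks left over after the last closed chunk remain pending until more arrive.
    return chunks


blocks = [(1, 3), (2, 4), (3, 5), (4, 1), (5, 9), (6, 2)]
print(chunk_ranges(blocks, apps_per_chunk=10))
```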

An **index chunk** is a portion of the index containing approximately 2,000,000 records (although
this number is adjustable for different chains). As part of the consolidation, the scraper creates
a Bloom filter representing the set membership in the associated index portion. The Bloom filters
are an order of magnitude smaller than the index chunks. The system then pushes both the index
chunk and the Bloom filter to IPFS. In this way, TrueBlocks creates an immutable, uncapturable
index of appearances that can be used not only by TrueBlocks, but by any member of the community who
needs it. (Hint: We all need it.)

Users of any of the TrueBlocks applications (or anyone else's applications) may subsequently
download the Bloom filters, query them to determine which **index chunks** need to be downloaded,
and thereby build a historical list of transactions for a given address. This imposes only
minimal resource requirements on the end user's machine.
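
The idea of querying Bloom filters to decide which chunks to fetch can be illustrated with a toy example. The sketch below is not TrueBlocks' Bloom filter format: the class, the filter size, the number of hashes, the use of SHA-256, and the chunk names are all assumptions chosen for brevity. It shows only the general property that a Bloom filter never misses a member but may occasionally report a false positive.

```python
import hashlib


class BloomFilter:
    """A toy Bloom filter over address strings (illustrative parameters only)."""

    def __init__(self, n_bits=1024, n_hashes=5):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, address):
        data = address.lower().encode()
        for i in range(self.n_hashes):
            # Derive several bit positions by salting the hash with the index i.
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, address):
        for pos in self._positions(address):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, address):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(address))


def chunks_to_download(address, blooms):
    """Return the names of chunks whose Bloom filter reports a possible hit."""
    return [name for name, bloom in blooms.items() if bloom.might_contain(address)]


address = "0x" + "ab" * 20  # arbitrary placeholder address
hit = BloomFilter()
hit.add(address)
blooms = {"chunk-a": hit, "chunk-b": BloomFilter()}  # chunk-b's filter is empty
print(chunks_to_download(address, blooms))
```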

Recently, we enabled the ability for the end user to pin these downloaded index chunks and blooms
on their own machines. Because the user already needs the data for the software to operate, sharing
it requires minimal effort and makes the data available to other people. Everyone is better off. A
naturally occurring network effect.

### tracing

The `chifra scrape` command requires your node to provide the `trace_block` (and related) RPC endpoints. Please see the
README file for the `chifra traces` command for more information.

### prerequisites

`chifra scrape` works with any EVM-based blockchain, but does not currently work without a "tracing,
archive" RPC endpoint. The Erigon and Reth blockchain nodes, given their minimal disk footprint for an
archive node and their support of the required `trace_` endpoint routines, are recommended.

Please [see this article](https://trueblocks.io/blog/a-long-winded-explanation-of-trueblocks/) for
more information about running the scraper and building and sharing the index of appearances.

### notifications

The `chifra scrape` command provides a notification feature which is used primarily for `trueblocks-key`.
To configure it, you must edit the `trueBlocks.toml` file. You may edit the configuration file with
`chifra config edit`. Add the following items under the `[settings.notify]` group:

```toml
[settings.notify]
url = "http://localhost:5555" # or other
author = "TrueBlocks" # optional
```

In addition, you must enable the feature by adding the `--notify` option to the command line.

### Other Options

All tools accept the following additional flags, although in some cases, they have no meaning.

```[plaintext]
  -v, --version         display the current version of the tool
      --output string   write the results to file 'fn' and return the filename
      --append          for --output command only append to instead of replace contents of file
      --file string     specify multiple sets of command line options in a file
```

**Note:** For the `--file string` option, you may place a series of valid command lines in a file using any
valid flags. In some cases, this may significantly improve performance. A semicolon at the start
of any line makes it a comment.

**Note:** If you use the `--output` and `--append` options together with the `--file` option, you may not switch
export formats within the command file. For example, a command file with two different commands, one with `--fmt csv`
and the other with `--fmt json`, will produce both invalid CSV and invalid JSON.

*Copyright (c) 2024, TrueBlocks, LLC. All rights reserved. Generated with goMaker.*