## JobQueue Design Goal

The jobqueue package implements a reusable job queue system for asynchronous message processing.

The most common use case is to process each finalized block asynchronously.

For instance, verification nodes must verify each finalized block. This needs to happen asynchronously; otherwise a verification node might get overwhelmed during periods when a large number of blocks are finalized quickly (e.g. when a node comes back online and is catching up from other peers).

So the goals for the jobqueue system are:
1. guarantee each job (i.e. finalized block) will be processed eventually
2. in the event of a crash failure, the jobqueue state is persisted and workers can be rescheduled so that no job is skipped
3. allow concurrent processing of multiple jobs
4. the number of concurrent workers is configurable so that the node won't get overwhelmed when too many jobs are created (i.e. too many blocks are finalized in a short period of time)

## JobQueue components
To achieve the above goals, the jobqueue system contains the following components/interfaces:
1. A `Jobs` module to find jobs by job index
2. A `storage.ConsumerProgress` to store job processing progress
3. A `Worker` module to process jobs and report job completion
4. A `Consumer` that orchestrates the job processing by finding new jobs, creating workers for each job using the above modules, and managing job processing status internally

### Using module.Jobs to find jobs
There is no JobProducer in the jobqueue design. The job queue assumes each job can be indexed by a uint64 value, just like each finalized block (or sealed block) can be indexed by block height.

Let's call this uint64 value the "job index", or simply the index.

So if we iterate through each index from low to high and find the job at each index, then we are able to iterate through every job.

Therefore, the module.Jobs interface abstracts this into a method, `AtIndex`, which returns the job at any given index.

The job consumer relies on module.Jobs to find jobs. However, module.Jobs doesn't provide a way to send a notification as soon as a new job is available. So it's the consumer's responsibility to keep track of the values returned by module.Jobs's `Head` method and find the jobs that are new.

### Using the Check method to notify the job consumer about new jobs
The job consumer provides the `Check` method for users to notify it that new jobs are available.

Once called, the job consumer iterates through each index with the `AtIndex` method. It stops when one of the following conditions is true:
1. no job was found at an index
2. no more workers are available to work on jobs, which is limited by the config item `maxProcessing`

The `Check` method is concurrency-safe, meaning that even if the job consumer is notified concurrently about new jobs, it will check at most once to find new jobs.

Whenever a worker finishes a job, the job consumer will also call `Check` internally.

### Storing job consumption progress in storage.ConsumerProgress
The job consumer stores the last processed job index in `storage.ConsumerProgress`, so that on startup it can read the last processed job index from storage and compare it with the last available job index from module.Jobs's `Head` method to resume job processing.

This ensures each job will be processed at least once.

Note: given the at-least-once execution, the `Worker` should gracefully handle duplicate runs of the same job.
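The pieces described so far can be summarized in the following sketch. It is illustrative only: the method sets follow the prose in this README (e.g. `ID()` is shown returning a plain string), while the authoritative definitions live in the `module` and `storage` packages of this repository.

```go
// Package jobqueuesketch is an illustrative summary of the interfaces
// described in this README, not the repository's actual declarations.
package jobqueuesketch

// Job is a unit of work. Jobs are located by a uint64 index and carry
// an identifier that is useful for logging and completion reporting.
type Job interface {
	ID() string
}

// Jobs is a read-only view of the job source. There is no producer-side
// push: the consumer pulls jobs out of this interface.
type Jobs interface {
	// AtIndex returns the job at the given index, or an error if no job
	// is available there yet.
	AtIndex(index uint64) (Job, error)

	// Head returns the highest index at which a job is available.
	Head() (uint64, error)
}

// ConsumerProgress persists the last processed job index, so that a
// restarted consumer can resume where it left off.
type ConsumerProgress interface {
	ProcessedIndex() (uint64, error)
	SetProcessedIndex(processed uint64) error
}

// Worker processes a single job and reports completion back to the
// consumer. Because of the at-least-once guarantee, it must tolerate
// being handed the same job more than once.
type Worker interface {
	Run(job Job)
}
```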
### Using Workers to work on each job

When the job consumer finds a new job, it uses an implementation of the `Worker` interface to process it. The `Worker`'s `Run` method accepts a `module.Job` interface, so it's the user's responsibility to handle the conversion between `module.Job` and the underlying data type.

In the scenario of processing finalized blocks, implementing symmetric conversion functions such as BlockToJob and JobToBlock is recommended (see the sketch at the end of this document).

In order to report job completion, the worker needs to call the job consumer's `NotifyJobIsDone` method.

### Error handling
The job queue doesn't allow a job to fail, because it has to guarantee that every job below the last processed job index has finished successfully. Leaving a gap is not acceptable.

Therefore, if a worker fails to process a job, it should either retry by itself or simply crash.

Note: a worker should not log the error and report the job as completed, because that would advance the last processed job index, and there would be no chance to retry that job.

## Pipeline Pattern
Multiple jobqueues can be combined to form a pipeline. This is useful in the scenario where the first job queue processes each finalized block and creates jobs for data derived from the block, and a second job queue processes each job created by the workers of the first job queue.

For instance, the verification node uses a 2-jobqueue pipeline: the first job queue finds chunks from each finalized block and creates jobs if the block has chunks that the node needs to verify, and the second job queue lets the verification node verify each chunk with a configurable maximum number of workers.

## Considerations

### Push vs Pull
The jobqueue architecture is optimized for "pull"-style processing, where the job producer simply notifies the job consumer about new jobs without creating any job itself, and the job consumer pulls jobs from a source when workers are available. All current implementations use this pull style, since it lends itself well to asynchronously processing jobs based on block heights.

Some use cases might require "push"-style jobs, where a job producer creates new jobs and a consumer processes the work from the producer. This is possible with the jobqueue, but it requires the producer to persist the jobs to a database and then implement the `Head` and `AtIndex` methods that allow accessing jobs by sequential `uint64` indexes.

### TODOs
1. Jobs at different indexes are processed in parallel, so it's possible that one job takes a long time, causing too many completed jobs to be cached in memory before they can be used to advance the last processed job index.
   `maxSearchAhead` allows the job consumer to stop consuming more jobs if too many jobs have completed while the job at index `lastProcessed + 1` has not been processed yet.
   The difference between `maxSearchAhead` and `maxProcessing` is: `maxProcessing` allows at most `maxProcessing` workers to process jobs concurrently. However, even if a worker is available, it might not be assigned a job: as long as the job at index `lastProcessed + 1` is not done, the consumer won't work on any job with an index higher than `lastProcessed + maxSearchAhead`.
2. Accept a callback to get notified when a consecutive job index is finished.
3. Implement the ReadyDoneAware interface.
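To make the worker-side contract concrete, here is a minimal, hypothetical worker for the finalized-block use case. The `Block`, `BlockJob`, and `Consumer` types, the conversion helpers, and the retry policy are all assumptions made for illustration, not the exact types used in this repository.

```go
package example

import (
	"fmt"
	"time"
)

// Block stands in for the node's block type; a real worker would use the
// repository's own block type. It is an assumption made for this sketch.
type Block struct {
	Height uint64
}

// Job mirrors the interface sketch earlier in this README.
type Job interface {
	ID() string
}

// BlockJob wraps a Block so it can travel through the job queue.
type BlockJob struct {
	Block Block
}

func (j *BlockJob) ID() string { return fmt.Sprintf("block-%d", j.Block.Height) }

// BlockToJob and JobToBlock are the symmetric conversion helpers the
// README recommends.
func BlockToJob(b Block) Job { return &BlockJob{Block: b} }

func JobToBlock(job Job) (Block, error) {
	bj, ok := job.(*BlockJob)
	if !ok {
		return Block{}, fmt.Errorf("unexpected job type %T", job)
	}
	return bj.Block, nil
}

// Consumer is the subset of the job consumer that the worker needs:
// a way to report completion.
type Consumer interface {
	NotifyJobIsDone(jobID string)
}

// BlockWorker verifies blocks. Run never reports failure as completion:
// it retries until the job succeeds, preserving the guarantee that every
// index below the last processed one has truly finished.
type BlockWorker struct {
	consumer Consumer
	verify   func(Block) error // the actual (idempotent) work
}

func (w *BlockWorker) Run(job Job) {
	block, err := JobToBlock(job)
	if err != nil {
		// A malformed job can never succeed; crashing is preferable to
		// silently skipping it and leaving a gap.
		panic(err)
	}

	// Retry on failure instead of reporting a failed job as done.
	for {
		if err := w.verify(block); err == nil {
			break
		}
		time.Sleep(time.Second)
	}

	// Only report completion once the job has actually succeeded.
	w.consumer.NotifyJobIsDone(job.ID())
}
```

The key property of this sketch is that `NotifyJobIsDone` is called only after the work has actually succeeded; a failed job is retried (or the process crashes) rather than being reported as complete, matching the error-handling rules above.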