     1  # Renter
     2  The Renter is responsible for tracking and actively maintaining all of the files
     3  that a user has uploaded to Sia. This includes the location and health of these
     4  files. The Renter, via the HostDB and the Contractor, is also responsible for
     5  picking hosts and maintaining the relationship with them.
     6  
     7  The renter is unique for having two different logs. The first is a general
     8  renter activity log, and the second is a repair log. The repair log is intended
     9  to be a high-signal log that tells users what files are being repaired, and
    10  whether the repair jobs have been successful. Where there are failures, the
repair log should try to document what those failures were. Every message in
the repair log should be interesting and useful to a power user; there should be
no logspam and no messages that only make sense to siad developers.
    14  
    15  ## Testing
Testing the Renter module follows these guidelines:
    17  1. `file.go` will have unit tests in `file_test.go`
    18  1. In `file_test.go` there will be one main test named `TestFile`. `TestFile`
    19     will have subtests for specific methods, conditions, etc., such as
    20     `TestFile/method1` or `TestFile/condition1`.
    21  
    22  Since tests are run by providing the package, it is already clear that the tests
correspond to the Renter, so including Renter in the test name is redundant.
    24  
    25  An example of a simple test file can be found in
    26  [`refreshpaths_test.go`](./refreshpaths_test.go).
    27  
    28  An example of a test with a number of subtests can be found in
    29  [`uploadheap_test.go`](./uploadheap_test.go).
    30  
    31  ## Submodules
    32  The Renter has several submodules that each perform a specific function for the
    33  Renter. This README will provide brief overviews of the submodules, but for more
    34  detailed descriptions of the inner workings of the submodules the respective
    35  README files should be reviewed.
    36   - Contractor
    37   - Filesystem
    38   - HostDB
    39   - Proto
    40   - Skynet Blocklist
    41   - Skynet Portals
    42  
    43  ### Contractor
    44  The Contractor manages the Renter's contracts and is responsible for all
    45  contract actions such as new contract formation and contract renewals. The
    46  Contractor determines which contracts are GoodForUpload and GoodForRenew and
    47  marks them accordingly.
    48  
    49  ### Filesystem
    50  The Filesystem is responsible for ensuring that all of its supported file
    51  formats can be accessed in a threadsafe manner. It doesn't handle any
    52  persistence directly but instead relies on the underlying format's package to
    53  handle that itself.
    54  
    55  ### HostDB
    56  The HostDB curates and manages a list of hosts that may be useful for the renter
    57  in storing various types of data. The HostDB is responsible for scoring and
sorting the hosts so that when hosts are needed for contracts, high-quality
hosts are provided.
    60  
    61  ### Proto
    62  The proto module implements the renter's half of the renter-host protocol,
    63  including contract formation and renewal RPCs, uploading and downloading,
    64  verifying Merkle proofs, and synchronizing revision states. It is a low-level
    65  module whose functionality is largely wrapped by the Contractor.
    66  
    67  ### Skynet Blocklist
    68  The Skynet Blocklist module manages the list of skylinks that the Renter wants
    69  blocked. It also manages persisting the blocklist in an ACID and performant
    70  manner.
    71  
    72  ### Skynet Portals
    73  The Skynet Portals module manages the list of known Skynet portals that the
    74  Renter wants to keep track of. It also manages persisting the list in an ACID
    75  and performant manner.
    76  
    77  ## Subsystems
    78  The Renter has the following subsystems that help carry out its
    79  responsibilities.
    80   - [Backup Subsystem](#backup-subsystem)
    81   - [Bubble Subsystem](#bubble-subsystem)
    82   - [Download Project Subsystem](#download-project-subsystem)
    83   - [Download Streaming Subsystem](#download-streaming-subsystem)
    84   - [Download Subsystem](#download-subsystem)
    85   - [Filesystem Controllers](#filesystem-controllers)
    86   - [Fuse Manager Subsystem](#fuse-manager-subsystem)
    87   - [Fuse Subsystem](#fuse-subsystem)
    88   - [Health and Repair Subsystem](#health-and-repair-subsystem)
    89   - [Memory Subsystem](#memory-subsystem)
    90   - [Persistence Subsystem](#persistence-subsystem)
    91   - [Refresh Paths Subsystem](#refresh-paths-subsystem)
    92   - [Skyfile Subsystem](#skyfile-subsystem)
    93   - [Skylink Manager Subsystem](#skylink-manager-subsystem)
    94   - [Stream Buffer Subsystem](#stream-buffer-subsystem)
    95   - [Upload Streaming Subsystem](#upload-streaming-subsystem)
    96   - [Upload Subsystem](#upload-subsystem)
    97   - [Worker Subsystem](#worker-subsystem)
    98  
    99  **TODO** Subsystems need to be alphabetized below to match above list
   100  
   101  ### Bubble Subsystem
   102  **Key Files**
   103   - [bubble.go](./bubble.go)
   104  
The bubble subsystem is responsible for making sure that updates to the file system's
   106  metadata are propagated up to the root directory. A bubble is the process of
   107  updating the filesystem metadata for the renter. It is called bubble because
   108  when a directory's metadata is updated, a call to update the parent directory
   109  will be made. This process continues until the root directory is reached. This
   110  results in any changes in metadata being "bubbled" to the top so that the root
   111  directory's metadata reflects the status of the entire filesystem.
   112  
   113  If during a bubble a file is found that meets the threshold health for repair,
   114  a signal is sent to the repair loop. If a stuck chunk is found then a signal is
   115  sent to the stuck loop. 
   116  
Since we are updating the metadata on disk during the bubble calls, we want to
ensure that only one bubble is being called on a directory at a time. We do this
through `callQueueBubbleUpdate` and `managedCompleteBubbleUpdate`. The
`bubbleScheduler` has a `bubbleUpdates` field that tracks all the bubbles and
their `bubbleStatus`. Bubbles can be either queued, active, or pending.
   122  
   123  When bubble is called on a directory, `callQueueBubbleUpdate` will check to see
   124  if there are any queued, active or pending bubbles for the directory. If there
   125  are no bubbles being tracked for that directory then the bubble update is queued
   126  and added to the fifo queue. If there is a bubble currently queued or pending
   127  for the directory then the update is ignored. If there is a bubble update that
   128  is active then the status will be updated to pending. 
   129  
   130  The `bubbleScheduler` works through the queued bubble updates in
   131  `callThreadedProcessBubbleUpdates`. When a bubble update is popped from the
   132  queue its status is set to active while the bubble is being performed. When the
   133  bubble is complete, `managedCompleteBubbleUpdate` is called.
   134  
   135  When `managedCompleteBubbleUpdate` is called, if the status is active then the
   136  update is complete and it is removed from the `bubbleScheduler`. If the status
   137  is pending then the update is added back to the fifo queue with a status of
   138  queued.
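
Below is a minimal sketch of this queued/active/pending bookkeeping. Only
`bubbleScheduler`, `bubbleUpdates`, `bubbleStatus`, and the three states come
from the description above; the field types, constructor, and helper names are
illustrative stand-ins for the real code in [bubble.go](./bubble.go).

```go
package renter

import "sync"

// bubbleStatus mirrors the three states described above.
type bubbleStatus int

const (
	bubbleQueued bubbleStatus = iota
	bubbleActive
	bubblePending
)

type bubbleScheduler struct {
	mu            sync.Mutex
	bubbleUpdates map[string]bubbleStatus // keyed by directory SiaPath
	fifo          []string                // directories waiting to be bubbled
}

func newBubbleScheduler() *bubbleScheduler {
	return &bubbleScheduler{bubbleUpdates: make(map[string]bubbleStatus)}
}

// queueBubble plays the role of callQueueBubbleUpdate: new directories are
// queued, duplicates of queued or pending updates are ignored, and active
// updates are marked pending so they run again after they finish.
func (bs *bubbleScheduler) queueBubble(dir string) {
	bs.mu.Lock()
	defer bs.mu.Unlock()
	status, exists := bs.bubbleUpdates[dir]
	switch {
	case !exists:
		bs.bubbleUpdates[dir] = bubbleQueued
		bs.fifo = append(bs.fifo, dir)
	case status == bubbleActive:
		bs.bubbleUpdates[dir] = bubblePending
	}
}

// completeBubble plays the role of managedCompleteBubbleUpdate: finished
// updates are dropped, while pending updates are re-queued.
func (bs *bubbleScheduler) completeBubble(dir string) {
	bs.mu.Lock()
	defer bs.mu.Unlock()
	if bs.bubbleUpdates[dir] == bubblePending {
		bs.bubbleUpdates[dir] = bubbleQueued
		bs.fifo = append(bs.fifo, dir)
		return
	}
	delete(bs.bubbleUpdates, dir)
}
```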
   139  
   140  When a directory is bubbled, the metadata information is
   141  recalculated and saved to disk and then bubble is called on the parent directory
   142  until the top level directory is reached. During this calculation, every file in
   143  the directory is opened, modified, and fsync'd individually. 
   144  
   145  See benchmark results:
   146  
   147  ```
   148  BenchmarkBubbleMetadata runs a benchmark on the perform bubble metadata method
   149  
   150  Results (goos, goarch, CPU: Benchmark Output: date)
   151  
   152  linux, amd64, Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz:  6 | 180163684 ns/op | 249937 B/op | 1606 allocs/op: 03/19/2020
   153  linux, amd64, Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz: 34 |  34416443 ns/op                                 11/10/2020
   154  linux, amd64, Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz: 15 |  75880486 ns/op                                 02/26/2021
   155  ```
   156  
   157  ### Exports
   158   - `BubbleMetadata`
   159  
   160  #### Inbound Complexities
   161   - `callQueueBubbleUpdate` is used by external subsystems to trigger a bubble
   162       update on a directory.
   163   - `callThreadedProcessBubbleUpdates` is called by the Renter on startup to
   164       launch the background thread that processes the queued bubble updates.
   165  
   166  #### Outbound Complexities
   167   - `BubbleMetadata` calls `callPrepareForBubble` on a directory when the
   168       `recursive` flag is set to `true` and then calls `callRefreshAll` to
   169       execute the bubble updates.
   170   - `managedPerformBubbleMetadata` calls `callRenterContractsAndUtilities` to get
    the contract and utility maps before calling `callCalculateDirectoryMetadata`.
   172   - The Bubble subsystem triggers the Repair Loop when unhealthy files are found. This
   173     is done by `managedPerformBubbleMetadata` signaling the
   174     `r.uploadHeap.repairNeeded` channel when it is at the root directory and the
   175     `AggregateHealth` is above the `RepairThreshold`.
   176   - The Bubble subsystem triggers the Stuck Loop when stuck files are found. This is
   177     done by `managedPerformBubbleMetadata` signaling the
   178     `r.uploadHeap.stuckChunkFound` channel when it is at the root directory and
   179     `AggregateNumStuckChunks` is greater than zero.
   180  
   181  ### Filesystem Controllers
   182  **Key Files**
   183   - [dirs.go](./dirs.go)
   184   - [files.go](./files.go)
   185  
   186  *TODO* 
   187    - fill out subsystem explanation
   188  
   189  #### Outbound Complexities
   190   - `DeleteFile` calls `callThreadedBubbleMetadata` after the file is deleted
   191   - `RenameFile` calls `callThreadedBubbleMetadata` on the current and new
   192     directories when a file is renamed
   193  
   194  ### Fuse Subsystem
   195  **Key Files**
   196   - [fuse.go](./fuse.go)
   197  
   198  The fuse subsystem enables mounting the renter as a virtual filesystem. When
   199  mounted, the kernel forwards I/O syscalls on files and folders to the userland
   200  code in this subsystem. For example, the `read` syscall is implemented by
   201  downloading data from Sia hosts.
   202  
   203  Fuse is implemented using the `hanwen/go-fuse/v2` series of packages, primarily
   204  `fs` and `fuse`. The fuse package recognizes a single node interface for files
   205  and folders, but the renter has two structs, one for files and another for
folders. Both the fuseDirnode and the fuseFilenode implement the same Node
interfaces.
   208  
   209  The fuse implementation is remarkably sensitive to small details. UID mistakes,
   210  slow load times, or missing/incorrect method implementations can often destroy
   211  an external application's ability to interact with fuse. Currently we use
   212  ranger, Nautilus, vlc/mpv, and siastream when testing if fuse is still working
   213  well. More programs may be added to this list as we discover more programs that
   214  have unique requirements for working with the fuse package.
   215  
The siatest/renter suite has two test files which are useful for testing fuse. The
   217  first is [fuse\_test.go](../../siatest/renter/fuse_test.go), and the second is
   218  [fusemock\_test.go](../../siatest/renter/fusemock_test.go). The first file
   219  leverages a testgroup with a renter, a miner, and several hosts to mimic the Sia
   220  network, and then mounts a fuse folder which uses the full fuse implementation.
   221  The second file contains a hand-rolled implementation of a fake filesystem which
   222  implements the fuse interfaces. Both have a commented out sleep at the end of
   223  the test which, when uncommented, allows a developer to explore the final
   224  mounted fuse folder with any system application to see if things are working
   225  correctly.
   226  
   227  The mocked fuse is useful for debugging issues related to the fuse
   228  implementation. When using the renter implementation, it can be difficult to
   229  determine whether something is not working because there is a bug in the renter
   230  code, or whether something is not working because the fuse libraries are being
   231  used incorrectly. The mocked fuse is an easy way to replicate any desired
   232  behavior and check for misunderstandings that the programmer may have about how
the fuse libraries are meant to be used.
   234  
   235  ### Fuse Manager Subsystem
   236  **Key Files**
   237   - [fusemanager.go](./fusemanager.go)
   238  
   239  The fuse manager subsystem keeps track of multiple fuse directories that are
mounted at the same time. It maintains a mapping from each mountpoint to the
fuse filesystem object that is mounted there. Only one folder can be
   242  mounted at each mountpoint, but the same folder can be mounted at many
   243  mountpoints.
   244  
   245  When debugging fuse, it can be helpful to enable the 'Debug' option when
   246  mounting a filesystem. This option is commented out in the fuse manager in
   247  production, but searching for 'Debug:' in the file will reveal the line that can
   248  be uncommented to enable debugging. Be warned that when debugging is enabled,
   249  fuse becomes incredibly verbose.
   250  
   251  Upon shutdown, the fuse manager will only attempt to unmount each folder one
   252  time. If the folder is busy or otherwise in use by another application, the
   253  unmount will fail and the user will have to manually unmount using `fusermount`
   254  or `umount` before that folder becomes available again. To the best of our
   255  current knowledge, there is no way to force an unmount.
   256  
   257  ### Persistence Subsystem
   258  **Key Files**
   259   - [persist_compat.go](./persist_compat.go)
   260   - [persist.go](./persist.go)
   261  
   262  *TODO* 
   263    - fill out subsystem explanation
   264  
   265  ### Memory Subsystem
   266  **Key Files**
   267   - [memory.go](./memory.go)
   268  
   269  The memory subsystem acts as a limiter on the total amount of memory that the
renter can use. The memory subsystem does not manage actual memory; it's really
   271  just a counter. When some process in the renter wants to allocate memory, it
   272  uses the 'Request' method of the memory manager. The memory manager will block
   273  until enough memory has been returned to allow the request to be granted. The
   274  process is then responsible for calling 'Return' on the memory manager when it
   275  is done using the memory.
   276  
   277  The memory manager is initialized with a base amount of memory. If a request is
   278  made for more than the base memory, the memory manager will block until all
   279  memory has been returned, at which point the memory manager will unblock the
request. No other memory requests will be unblocked until enough of that large
request's memory has been returned.
   282  
   283  Because 'Request' and 'Return' are just counters, they can be called as many
   284  times as necessary in whatever sizes are convenient.
   285  
   286  When calling 'Request', a process should be sure to request all necessary memory
   287  at once, because if a single process calls 'Request' multiple times before
   288  returning any memory, this can cause a deadlock between multiple processes that
   289  are stuck waiting for more memory before they release memory.
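
As a rough illustration of this counter-based design (the struct, fields, and
constructor here are assumptions for the sketch; the real manager in
[memory.go](./memory.go) also handles priority classes and differs in detail):

```go
package renter

import "sync"

// memoryManager is a counting limiter; it hands out permission, not memory.
type memoryManager struct {
	mu        sync.Mutex
	cond      *sync.Cond
	available int64 // may go negative after an oversized request is granted
	base      int64
}

func newMemoryManager(base int64) *memoryManager {
	mm := &memoryManager{available: base, base: base}
	mm.cond = sync.NewCond(&mm.mu)
	return mm
}

// Request blocks until the requested amount can be granted. A request larger
// than the base allowance waits until every byte has been returned and is then
// granted by driving the counter negative, which keeps all other requests
// blocked until enough of it has been returned.
func (mm *memoryManager) Request(amount int64) {
	mm.mu.Lock()
	defer mm.mu.Unlock()
	for amount > mm.available && !(amount > mm.base && mm.available == mm.base) {
		mm.cond.Wait()
	}
	mm.available -= amount
}

// Return releases memory back to the manager and wakes blocked requesters.
func (mm *memoryManager) Return(amount int64) {
	mm.mu.Lock()
	mm.available += amount
	mm.cond.Broadcast()
	mm.mu.Unlock()
}
```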
   290  
   291  ### Worker Subsystem
   292  **Key Files**
   293   - [worker.go](./worker.go)
   294   - [workerdownload.go](./workerdownload.go)
   295   - [workerpool.go](./workerpool.go)
   296   - [workerupload.go](./workerupload.go)
   297  
   298  The worker subsystem is the interface between the renter and the hosts. All
   299  actions (with the exception of some legacy actions that are currently being
   300  updated) that involve working with hosts will pass through the worker subsystem.
   301  
   302  #### The Worker Pool
   303  
   304  The heart of the worker subsystem is the worker pool, implemented in
   305  [workerpool.go](./workerpool.go). The worker pool contains the set of workers
   306  that can be used to communicate with the hosts, one worker per host. The
   307  function `callWorker` can be used to retrieve a specific worker from the pool,
   308  and the function `callUpdate` can be used to update the set of workers in the
   309  worker pool. `callUpdate` will create new workers for any new contracts, will
   310  update workers for any contracts that changed, and will kill workers for any
   311  contracts that are no longer useful.
   312  
   313  ##### Inbound Complexities
   314  
 - `callUpdate` should be called on the worker pool any time that the set
   316     of contracts changes or has updates which would impact what actions a worker
   317     can take. For example, if a contract's utility changes or if a contract is
   318     cancelled.
   319     - `Renter.SetSettings` calls `callUpdate` after changing the settings of the
   320  	 renter. This is probably incorrect, as the actual contract set is updated
   321  	 by the contractor asynchronously, and really `callUpdate` should be
   322  	 triggered by the contractor as the set of hosts is changed.
   323     - `Renter.threadedDownloadLoop` calls `callUpdate` on each iteration of the
   324  	 outer download loop to ensure that it is always working with the most
   325  	 recent set of hosts. If the contractor is updated to be able to call
   326  	 `callUpdate` during maintenance, this call becomes unnecessary.
   327     - `Renter.managedRefreshHostsAndWorkers` calls `callUpdate` so that the
   328  	 renter has the latest list of hosts when performing uploads.
   329  	 `Renter.managedRefreshHostsAndWorkers` is itself called in many places,
   330  	 which means there's substantial complexity between the upload subsystem and
   331  	 the worker subsystem. This complexity can be eliminated by having the
	 contractor be responsible for updating the worker pool as it changes the
	 set of hosts, and also by having the worker pool store the host map, which is
   334  	 one of the key reasons `Renter.managedRefreshHostsAndWorkers` is called so
   335  	 often - this function returns the set of hosts in addition to updating the
   336  	 worker pool.
   337   - `callWorker` can be used to fetch a worker and queue work into the worker.
   338     The worker can be killed after `callWorker` has been called but before the
   339     returned worker has been used in any way.
   340     - `renter.BackupsOnHost` will use `callWorker` to retrieve a worker that can
   341  	 be used to pull the backups off of a host.
   342   - `callWorkers` can be used to fetch the list of workers from the worker pool.
   343     It should be noted that it is not safe to lock the worker pool, iterate
   344     through the workers, and then call locking functions on the workers. The
   worker pool must be unlocked if the workers are going to be acquiring locks,
   which means functions that loop over the list of workers must fetch that list
   separately.
   348  
   349  #### The Worker
   350  
   351  Each worker in the worker pool is responsible for managing communications with a
   352  single host. The worker has an infinite loop where it checks for work, performs
any outstanding work, and then sleeps until it receives a wake, kill, or shutdown signal. The
   354  implementation for the worker is primarily in [worker.go](./worker.go) and
   355  [workerloop.go](./workerloop.go).
   356  
   357  Each type of work that the worker can perform has a queue. A unit of work is
called a job. The worker queue and job structure have been rewritten multiple
times, and not every job has been ported yet to the latest structure. Using
the latest structure, you can call `queue.callAdd()` to add a job to a queue.
The worker loop will make all of the decisions around when to execute the job.
Jobs are split into two types, serial and async. Serial jobs are anything that
requires exclusive access to the file contract with the host; the worker will
ensure that only one of these is running at a time. Async jobs are any jobs that
don't require exclusive access to a resource; the worker will run multiple of
these in parallel.
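
A hedged sketch of the queue/job split is shown below. Only `queue.callAdd()`
is named above; the `workerJob` interface, the queue fields, and the `callNext`
helper are assumptions, and the real generic implementation lives in
[workerjobgeneric.go](./workerjobgeneric.go).

```go
package renter

import "sync"

// workerJob is the unit of work; real jobs also carry cancel channels and
// response channels.
type workerJob interface {
	callExecute()
}

// jobQueue holds queued jobs of one type. A worker keeps one queue per job
// type; the work loop drains serial queues one job at a time and launches
// async jobs in their own goroutines.
type jobQueue struct {
	mu     sync.Mutex
	jobs   []workerJob
	killed bool
}

// callAdd queues a job for the worker loop to pick up later. It returns false
// if the queue has been killed so callers can fail their jobs gracefully.
func (jq *jobQueue) callAdd(j workerJob) bool {
	jq.mu.Lock()
	defer jq.mu.Unlock()
	if jq.killed {
		return false
	}
	jq.jobs = append(jq.jobs, j)
	return true
}

// callNext pops the next job, or returns nil if the queue is empty.
func (jq *jobQueue) callNext() workerJob {
	jq.mu.Lock()
	defer jq.mu.Unlock()
	if len(jq.jobs) == 0 {
		return nil
	}
	j := jq.jobs[0]
	jq.jobs = jq.jobs[1:]
	return j
}
```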
   367  
   368  When a worker wakes or otherwise begins the work loop, the worker will check for
   369  each type of work in a specific order, therefore giving certain types of work
   370  priority over other types of work. For example, downloads are given priority
   371  over uploads. When the worker performs a piece of work, it will jump back to the
   372  top of the loop, meaning that a continuous stream of higher priority work can
   373  stall out all lower priority work.
   374  
   375  When a worker is killed, the worker is responsible for going through the list of
   376  jobs that have been queued and gracefully terminating the jobs, returning or
   377  signaling errors where appropriate.
   378  
   379  [workerjobgeneric.go](./workerjobgeneric.go) and
   380  [workerjobgeneric_test.go](./workerjobgeneric_test.go) contain all of the
   381  generic code and a basic reference implementation for building a job.
   382  
   383  ##### Inbound Complexities
   384   - `callQueueDownloadChunk` can be used to schedule a job to participate in a
   385     chunk download
   386     - `Renter.managedDistributeDownloadChunkToWorkers` will use this method to
   387  	 issue a brand new download project to all of the workers.
   388     - `unfinishedDownloadChunk.managedCleanUp` will use this method to re-issue
   389  	 work to workers that are known to have passed on a job previously, but may
   390  	 be required now.
   391   - `callQueueUploadChunk` can be used to schedule a job to participate in a
   392     chunk upload
   393     - `Renter.managedDistributeChunkToWorkers` will use this method to distribute
   394  	 a brand new upload project to all of the workers.
   395     - `unfinishedUploadChunk.managedNotifyStandbyWorkers` will use this method to
   396  	 re-issue work to workers that are known to have passed on a job previously,
   397  	 but may be required now.
   398  
   399  ##### Outbound Complexities
   400   - `managedPerformDownloadChunkJob` is a mess of complexities and needs to be
   401     refactored to be compliant with the new subsystem format.
   402   - `managedPerformUploadChunkJob` is a mess of complexities and needs to be
   403     refactored to be compliant with the new subsystem format.
   404  
   405  ### Download Subsystem
   406  **Key Files**
   407   - [download.go](./download.go)
   408   - [downloadchunk.go](./downloadchunk.go)
   409   - [downloaddestination.go](./downloaddestination.go)
   410   - [downloadheap.go](./downloadheap.go)
   411   - [workerdownload.go](./workerdownload.go)
   412  
   413  *TODO* 
   414    - expand subsystem description
   415  
The download code follows a clean, intuitive flow for achieving a high degree of
computationally efficient parallelism on downloads. When a download is
   418  requested, it gets split into its respective chunks (which are downloaded
   419  individually) and then put into the download heap and download history as a
   420  struct of type `download`.
   421  
   422  A `download` contains the shared state of a download with all the information
required for workers to complete it, additional information useful to users,
and completion functions which are executed upon download completion.
   425  
The download history contains a mapping from each download's UID, which is
randomly assigned upon initialization, to its corresponding `download`
struct. Unless cleared, users can retrieve information about ongoing and
   429  completed downloads by either retrieving the full history or a specific
   430  download from the history using the API.
   431  
   432  The primary purpose of the download heap is to keep downloads on standby
   433  until there is enough memory available to send the downloads off to the
workers. The heap is sorted first by priority, and then by a few other criteria
as well.
   436  
   437  Some downloads, in particular downloads issued by the repair code, have
   438  already had their memory allocated. These downloads get to skip the heap and
   439  go straight for the workers.
   440  
   441  Before we distribute a download to workers, we check the `localPath` of the
file to see if it is available on disk. If it is, and `disableLocalFetch` isn't
   443  set, we load the download from disk instead of distributing it to workers.
   444  
   445  When a download is distributed to workers, it is given to every single worker
   446  without checking whether that worker is appropriate for the download. Each
worker has its own queue, which is bottlenecked by the fact that a worker
   448  can only process one item at a time. When the worker gets to a download
   449  request, it determines whether it is suited for downloading that particular
   450  file. The criteria it uses include whether or not it has a piece of that
   451  chunk, how many other workers are currently downloading pieces or have
   452  completed pieces for that chunk, and finally things like worker latency and
   453  worker price.
   454  
   455  If the worker chooses to download a piece, it will register itself with that
   456  piece, so that other workers know how many workers are downloading each
   457  piece. This keeps everything cleanly coordinated and prevents too many
workers from downloading a given piece, while at the same time avoiding a giant
messy coordinator that tracks everything. If a worker chooses not to
   460  download a piece, it will add itself to the list of standby workers, so that
   461  in the event of a failure, the worker can be returned to and used again as a
   462  backup worker. The worker may also decide that it is not suitable at all (for
   463  example, if the worker has recently had some consecutive failures, or if the
   464  worker doesn't have access to a piece of that chunk), in which case it will
   465  mark itself as unavailable to the chunk.
   466  
   467  As workers complete, they will release memory and check on the overall state
   468  of the chunk. If some workers fail, they will enlist the standby workers to
   469  pick up the slack.
   470  
   471  When the final required piece finishes downloading, the worker who completed
   472  the final piece will spin up a separate thread to decrypt, decode, and write
   473  out the download. That thread will then clean up any remaining resources, and
   474  if this was the final unfinished chunk in the download, it'll mark the
   475  download as complete.
   476  
   477  The download process has a slightly complicating factor, which is overdrive
   478  workers. Traditionally, if you need 10 pieces to recover a file, you will use
   479  10 workers. But if you have an overdrive of '2', you will actually use 12
   480  workers, meaning you download 2 more pieces than you need. This means that up
   481  to two of the workers can be slow or fail and the download can still complete
   482  quickly. This complicates resource handling, because not all memory can be
   483  released as soon as a download completes - there may be overdrive workers
   484  still out fetching the file. To handle this, a catchall 'cleanUp' function is
   485  used which gets called every time a worker finishes, and every time recovery
   486  completes. The result is that memory gets cleaned up as required, and no
   487  overarching coordination is needed between the overdrive workers (who do not
   488  even know that they are overdrive workers) and the recovery function.
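
A simplified sketch of that catch-all pattern follows. The fields and the
single end-of-download release are assumptions made for brevity; the real
cleanup logic in [downloadchunk.go](./downloadchunk.go) also releases memory
incrementally as individual workers report back.

```go
package renter

import "sync"

// unfinishedDownloadChunk sketch; the real struct has many more fields.
type unfinishedDownloadChunk struct {
	mu               sync.Mutex
	workersRemaining int  // launched workers (including overdrive) still outstanding
	recoveryComplete bool // set once decode/recovery has finished or failed
	memoryReleased   bool // ensures the release only happens once
	memoryAllocated  uint64
	returnMemory     func(uint64) // stand-in for the renter's memory manager
}

// managedCleanUp is called after every worker completion and after recovery.
// This simplified version releases everything only once no launched workers
// remain and recovery is done, so no coordination between overdrive workers
// and the recovery thread is required.
func (udc *unfinishedDownloadChunk) managedCleanUp() {
	udc.mu.Lock()
	defer udc.mu.Unlock()
	if udc.workersRemaining > 0 || !udc.recoveryComplete || udc.memoryReleased {
		return
	}
	udc.memoryReleased = true
	udc.returnMemory(udc.memoryAllocated)
}
```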
   489  
   490  By default, the download code organizes itself around having maximum possible
   491  throughput. That is, it is highly parallel, and exploits that parallelism as
   492  efficiently and effectively as possible. The hostdb does a good job of selecting
   493  for hosts that have good traits, so we can generally assume that every host
or worker at our disposal is reasonably effective in all dimensions, and
   495  that the overall selection is generally geared towards the user's
   496  preferences.
   497  
   498  We can leverage the standby workers in each unfinishedDownloadChunk to
   499  emphasize various traits. For example, if we want to prioritize latency,
   500  we'll put a filter in the 'managedProcessDownloadChunk' function that has a
worker go standby instead of accepting a chunk if the latency is higher than the
   502  targeted latency. These filters can target other traits as well, such as
   503  price and total throughput.
   504  
   505  ### Download Streaming Subsystem
   506  **Key Files**
   507   - [downloadstreamer.go](./downloadstreamer.go)
   508  
   509  *TODO* 
   510    - fill out subsystem explanation
   511  
   512  ### Download Project Subsystem
   513  **Key Files**
   514   - [projectchunkworkerset.go](./projectchunkworkerset.go)
   515   - [projectdownloadchunk.go](./projectdownloadchunk.go)
   516   - [projectdownloadinit.go](./projectdownloadinit.go)
   517   - [projectdownloadoverdrive.go](./projectdownloadoverdrive.go)
   518  
   519  The download project subsystem contains all the necessary logic to download a
   520  single chunk. Such a project can be initialized with a set of roots, which is
   521  what happens for Skynet downloads, or with a Siafile, where we already know what
   522  hosts have what roots.
   523  
   524  The project will, immediately after it has been initialized, spin up a bunch of
   525  jobs that will locate what hosts have what sectors. This is accomplished through
   526  'HasSector' worker jobs. The result of this initial scan is saved in the
project's worker state. Every so often this state is recalculated to ensure we
stay up to date on the best way to retrieve the file from the network.
   529  
Once the project has been initialized it can be used to download data. Because
the project keeps track of the network state, it is beneficial to reuse these
objects, as doing so saves the time it takes to scan the network. Downloading
data happens
   533  through a different download project called the 'ProjectDownloadChunk', or PDC
   534  for short.
   535  
   536  The PDC will use the network scan performed earlier to launch download jobs on
workers that should be able to retrieve the piece. This process consists of two
stages, namely the initial launch stage and the overdrive stage. The download
code is careful when selecting the initial set of workers to launch: it takes
into account historical job timings and tries to make good estimates of how
long a worker should take to retrieve the data from its host. This, in
combination with a caller-configurable parameter called 'price per
millisecond', is used to construct the set of workers best suited for the
download job. Once these workers have been launched, the second stage kicks
in. This overdrive stage makes sure that additional workers are launched
should a worker in the initial set fail or be late.
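
As a rough sketch of how a 'price per millisecond' parameter can fold cost into
that initial worker selection (the candidate struct, scoring formula, and
function name are illustrative, not the actual code in
[projectdownloadinit.go](./projectdownloadinit.go)):

```go
package renter

import (
	"sort"
	"time"
)

// downloadCandidate pairs a worker with its expected performance; in the real
// code these estimates come from the workers' historical job timings.
type downloadCandidate struct {
	worker            string        // host public key in the real code
	estimatedDuration time.Duration // expected time to return the piece
	price             float64       // expected cost of the job, in hastings
}

// rankCandidates sorts workers by estimated completion time plus a cost
// penalty: pricePerMS expresses how much the caller is willing to pay to save
// one millisecond, so price/pricePerMS converts cost into equivalent time.
func rankCandidates(candidates []downloadCandidate, pricePerMS float64) {
	score := func(c downloadCandidate) float64 {
		return float64(c.estimatedDuration.Milliseconds()) + c.price/pricePerMS
	}
	sort.Slice(candidates, func(i, j int) bool {
		return score(candidates[i]) < score(candidates[j])
	})
}
```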
   548  
   549  ### Skyfile Subsystem
   550  **Key Files**
   551   - [skyfile.go](./skyfile.go)
   552   - [skyfilefanout.go](./skyfilefanout.go)
   553  
The skyfile subsystem contains methods for encoding, decoding, uploading, and
   555  downloading skyfiles using Skylinks, and is one of the foundations underpinning
   556  Skynet.
   557  
   558  The skyfile format is a custom format which prepends metadata to a file such
   559  that the entire file and all associated metadata can be recovered knowing
   560  nothing more than a single sector root. That single sector root can be encoded
   561  alongside some compressed fetch offset and length information to create a
   562  skylink.
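
To make the encoding concrete, here is a hedged sketch of that layout, assuming
a 2-byte little-endian bitfield for the version and compressed offset/length
information followed by the 32-byte sector root; the authoritative packing
lives in the skymodules `Skylink` type.

```go
package renter

import (
	"encoding/base64"
	"encoding/binary"
)

// encodeSkylink packs an assumed 2-byte bitfield (version plus compressed
// offset/length) ahead of the 32-byte sector root and base64url-encodes the
// result, yielding a 46-character skylink string.
func encodeSkylink(bitfield uint16, root [32]byte) string {
	var raw [34]byte
	binary.LittleEndian.PutUint16(raw[:2], bitfield)
	copy(raw[2:], root[:])
	return base64.RawURLEncoding.EncodeToString(raw[:])
}
```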
   563  
   564  **Outbound Complexities**
 - `callUploadStreamFromReader` is used to upload new data to the Sia network when
   creating skyfiles. This call appears three times in
   [skyfile.go](./skyfile.go).
   568  
   569  ### Skylink Manager Subsystem
   570  **Key Files**
   571   - [skylink.go](./skylink.go)
   572  
The skylink manager subsystem is responsible for managing actions that are related
   574  to skylinks.
   575  
   576  The skylink manager manages the unpinning of skylinks by maintaining a list of
skylinks to be unpinned and the time by which they should be unpinned. The skylinks are
   578  unpinned by the bubble code while it iterates over the filesystem.
   579  
   580  ### Exports
   581   - `UnpinSkylink`
   582  
   583  **Inbound Complexities**
   584   - `callIsUnpinned` is used in `managedCachedFileMetadata` to decide if the file
   585       needs to be deleted. 
   586   - `callPruneUnpinRequests` is used by the bubble subsystem in
   587      `callThreadedProcessBubbleUpdates` to clear outdated unpin requests.
   588   - `callUpdatePruneTimeThreshold` is used by the bubble subsystem in
    `managedPerformBubbleMetadata` to update the skylink manager's
   590      `pruneTimeThreshold`.
   591  
   592  ### Stream Buffer Subsystem
   593  **Key Files**
   594   - [streambuffer.go](./streambuffer.go)
   595   - [streambufferlru.go](./streambufferlru.go)
   596   - [skylinkdatasource.go](./skylinkdatasource.go)
   597  
   598  The stream buffer subsystem coordinates buffering for a set of streams. Each
   599  stream has an LRU which includes both the recently visited data as well as data
   600  that is being buffered in front of the current read position. The LRU is
   601  implemented in [streambufferlru.go](./streambufferlru.go).
   602  
   603  If there are multiple streams open from the same data source at once, they will
   604  share their cache. Each stream will maintain its own LRU, but the data is stored
   605  in a common stream buffer. The stream buffers draw their data from a data source
   606  interface, which allows multiple different types of data sources to use the
   607  stream buffer.
   608  
   609  A SkylinkDataSource acts as a data source to the stream buffer. As the name
   610  suggests, such a data source can be initialized using a Skylink, and thus be
   611  used to download the data from the Skyfile. Internally this data source uses the
   612  download projects, as described in the [download project subsystem](#download-project-subsystem).
   613  
   614  ### Upload Subsystem
   615  **Key Files**
   616   - [directoryheap.go](./directoryheap.go)
   617   - [upload.go](./upload.go)
   618   - [uploadheap.go](./uploadheap.go)
   619   - [uploadchunk.go](./uploadchunk.go)
   620   - [workerupload.go](./workerupload.go)
   621  
   622  *TODO* 
   623    - expand subsystem description
   624  
The Renter uploads `siafiles` in 40MB chunks. Redundancy is kept at the chunk
level, which means each chunk is split into `datapieces` number of pieces. For
example, a 10/20 scheme would mean that each 40MB chunk is split into 10 4MB
data pieces, which in turn are erasure coded into 30 total pieces (10 data
pieces and 20 parity pieces) and uploaded to 30 different hosts.
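
The arithmetic behind that example, assuming a 40 MiB chunk with 10 data pieces
and 20 parity pieces:

```go
package main

import "fmt"

func main() {
	const chunkSize = 40 << 20 // 40 MiB per chunk
	dataPieces, parityPieces := 10, 20

	pieceSize := chunkSize / dataPieces      // 4 MiB per piece
	totalPieces := dataPieces + parityPieces // pieces (and hosts) per chunk
	redundancy := float64(totalPieces) / float64(dataPieces)

	fmt.Println(pieceSize, totalPieces, redundancy) // 4194304 30 3
}
```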
   630  
   631  Chunks are uploaded by first distributing the chunk to the worker pool. The
   632  chunk is distributed to the worker pool by adding it to the upload queue and
   633  then signalling the worker upload channel. Workers that are waiting for work
will receive on this channel and begin the upload. First the worker creates a
connection with the host by creating an `editor`. Next the `editor` is used to
update the file contract with the new data being uploaded. This will update the
   637  merkle root and the contract revision.
   638  
   639  **Outbound Complexities**  
 - The upload subsystem calls `callThreadedBubbleMetadata` from the Health Loop
   to update the filesystem after the new upload
   642   - `Upload` calls `callBuildAndPushChunks` to add upload chunks to the
   643     `uploadHeap` and then signals the heap's `newUploads` channel so that the
   644     Repair Loop will work through the heap and upload the chunks

### Upload Streaming Subsystem
   646  **Key Files**
   647   - [uploadstreamer.go](./uploadstreamer.go)
   648  
   649  *TODO* 
   650    - fill out subsystem explanation
   651  
   652  **Inbound Complexities**
   653   - The skyfile subsystem makes three calls to `callUploadStreamFromReader()` in
   654     [skyfile.go](./skyfile.go)
   655   - The snapshot subsystem makes a call to `callUploadStreamFromReader()`
   656  
   657  ### Health and Repair Subsystem
   658  **Key Files**
   659   - [metadata.go](./metadata.go)
   660   - [repair.go](./repair.go)
   661   - [stuckstack.go](./stuckstack.go)
   662   - [uploadheap.go](./uploadheap.go)
   663  
   664  *TODO*
   665    - Move HealthLoop and related methods out of repair.go to health.go
  - Pull out repair code from uploadheap.go so that uploadheap.go is only heap
   667      related code. Put in repair.go
   668    - Pull out stuck loop code from uploadheap.go and put in repair.go
   669    - Review naming of files associated with this subsystem
   670    - Create benchmark for health loop and add print outs to Health Loop section
   671    - Break out Health, Repair, and Stuck code into 3 distinct subsystems
   672    
   673  There are 3 main functions that work together to make up Sia's file repair
   674  mechanism, `threadedUpdateRenterHealth`, `threadedUploadAndRepairLoop`, and
   675  `threadedStuckFileLoop`. These 3 functions will be referred to as the health
   676  loop, the repair loop, and the stuck loop respectively.
   677  
   678  The Health and Repair subsystem operates by scanning aggregate information kept
   679  in each directory's metadata. An example of this metadata would be the aggregate
   680  filesystem health. Each directory has a field `AggregateHealth` which represents
   681  the worst aggregate health of any file or subdirectory in the directory. Because
   682  the field is recursive, the `AggregateHealth` of the root directory represents
   683  the worst health of any file in the entire filesystem. Health is defined as the
percent of redundancy missing; this means that a health of 0 is a file at full
health.
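
Roughly, that definition can be captured as follows; this sketch assumes
standard erasure-code parameters and ignores the edge cases that the canonical
calculation in the siafile package handles:

```go
// chunkHealth sketches the idea: 0 when every piece is available, 1 when only
// the minimum needed for recovery remains, and greater than 1 once the chunk
// can no longer be repaired from the network alone.
func chunkHealth(goodPieces, minPieces, numPieces int) float64 {
	return 1 - float64(goodPieces-minPieces)/float64(numPieces-minPieces)
}
```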
   686  
   687  `threadedUpdateRenterHealth` is responsible for keeping the aggregate
   688  information up to date, while the other two loops use that information to decide
   689  what upload and repair actions need to be performed.
   690  
#### Health Loop
   692  The health loop is responsible for ensuring that the health of the renter's file
   693  directory is updated periodically. Along with the health, the metadata for the
   694  files and directories is also updated. 
   695  
Two of the key directory metadata fields that the health loop uses are
`LastHealthCheckTime` and `AggregateLastHealthCheckTime`. `LastHealthCheckTime`
is the timestamp of when a directory or file last had its health re-calculated
during a bubble call. When determining which directory to start with when
updating the renter's file system, the health loop follows the path of oldest
`AggregateLastHealthCheckTime` to find the directory or sub tree that is the
most out of date. To do this, the health loop uses
`managedOldestHealthCheckTime`. This method starts at the root level of the
renter's file system and begins checking the `AggregateLastHealthCheckTime` of
the subdirectories. It then finds which one is the oldest, moves into that
subdirectory, and continues the search. Once it reaches a directory that either
has no subdirectories, has an older `AggregateLastHealthCheckTime` than any of
its subdirectories, or is a reasonably sized sub tree as defined by the health
loop constants, it returns that timestamp and the SiaPath of the directory.
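
A sketch of that descent, using an assumed in-memory directory type (the real
`managedOldestHealthCheckTime` walks the on-disk directory tree and also
applies the subtree-size cutoff mentioned above):

```go
package renter

import "time"

// dirNode is an illustrative stand-in for a directory's metadata.
type dirNode struct {
	siaPath                      string
	lastHealthCheckTime          time.Time
	aggregateLastHealthCheckTime time.Time
	subDirs                      []*dirNode
}

// oldestHealthCheckDir follows the oldest AggregateLastHealthCheckTime from
// the root downward and stops once no subdirectory is older than the current
// directory itself.
func oldestHealthCheckDir(root *dirNode) (string, time.Time) {
	current := root
	for {
		oldestTime := current.lastHealthCheckTime
		var oldestSub *dirNode
		for _, sub := range current.subDirs {
			if sub.aggregateLastHealthCheckTime.Before(oldestTime) {
				oldestTime = sub.aggregateLastHealthCheckTime
				oldestSub = sub
			}
		}
		if oldestSub == nil {
			return current.siaPath, oldestTime
		}
		current = oldestSub
	}
}
```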
   711  
   712  Once the health loop has found the most out of date directory or sub tree, it
   713  uses the Refresh Paths subsystem to trigger bubble updates that the Bubble
subsystem manages. Once the renter's entire directory tree has been updated
within the `healthCheckInterval`, the health loop sleeps until the time interval has
   716  passed.
   717  
   718  
   719  **Inbound Complexities**  
 - The Repair loop relies on the Health Loop and the Bubble Subsystem to
   721     keep the filesystem accurately updated in order to work through the file
   722     system in the correct order.
   723  
   724  #### Repair Loop
   725  The repair loop is responsible for uploading new files to the renter and
   726  repairing existing files. The heart of the repair loop is
   727  `threadedUploadAndRepair`, a thread that continually checks for work, schedules
   728  work, and then updates the filesystem when work is completed.
   729  
   730  The renter tracks backups and siafiles separately, which essentially means the
   731  renter has a backup filesystem and a siafile filesystem. As such, we need to
   732  check both these filesystems separately with the repair loop. Since the backups
   733  are in a different filesystem, the health loop does not check on the backups
which means that there are no outside triggers telling the repair loop that a backup
   735  wasn't uploaded successfully and needs to be repaired. Because of this we always
   736  check for backup chunks first to ensure backups are succeeding. There is a size
limit on the heap to help keep memory usage in check, so by adding backup
   738  chunks to the heap first we ensure that we are never skipping over backup chunks
   739  due to a full heap.
   740  
   741  For the siafile filesystem the repair loop uses a directory heap to prioritize
   742  which chunks to add. The directoryHeap is a max heap of directory elements
   743  sorted by health. The directory heap is initialized by pushing an unexplored
root directory element. As directory elements are popped off the heap, they are
explored, which means the directory that was popped off the heap as unexplored
gets marked as explored and added back to the heap, while all the subdirectories
are added as unexplored. Each directory element contains the health information
of the directory it represents, both directory health and aggregate health. If a
directory is unexplored the aggregate health is considered; if the directory is
explored the directory health is considered in the sorting of the heap. This
allows us to navigate through the filesystem and follow the path of worst health
to find the most in-need directories first. When the renter needs chunks to add
to the upload heap, directory elements are popped off the heap and chunks are
   754  pulled from that directory to be added to the upload heap. If all the chunks
   755  that need repairing are added to the upload heap then the directory element is
   756  dropped. If not all the chunks that need repair are added, then the directory
   757  element is added back to the directory heap with a health equal to the next
   758  chunk that would have been added, thus re-prioritizing that directory in the
   759  heap.
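
A sketch of the heap ordering this describes, with an illustrative element
type; the real fields and management code are in
[directoryheap.go](./directoryheap.go):

```go
package renter

import "container/heap"

// directory is an illustrative heap element.
type directory struct {
	siaPath         string
	health          float64 // health of the directory's own files
	aggregateHealth float64 // worst health anywhere in the subtree
	explored        bool
}

// heapHealth is the value used for sorting: aggregate health while the
// subtree is unexplored, the directory's own health once explored.
func (d *directory) heapHealth() float64 {
	if d.explored {
		return d.health
	}
	return d.aggregateHealth
}

// directoryHeap is a max-heap on heapHealth, so the least healthy part of the
// filesystem is popped first.
type directoryHeap []*directory

var _ heap.Interface = (*directoryHeap)(nil)

func (dh directoryHeap) Len() int            { return len(dh) }
func (dh directoryHeap) Less(i, j int) bool  { return dh[i].heapHealth() > dh[j].heapHealth() }
func (dh directoryHeap) Swap(i, j int)       { dh[i], dh[j] = dh[j], dh[i] }
func (dh *directoryHeap) Push(x interface{}) { *dh = append(*dh, x.(*directory)) }
func (dh *directoryHeap) Pop() interface{} {
	old := *dh
	d := old[len(old)-1]
	*dh = old[:len(old)-1]
	return d
}
```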
   760  
   761  To build the upload heap for the siafile filesystem, the repair loop checks if
   762  the file system is healthy by checking the top directory element in the
   763  directory heap. If healthy and there are no chunks currently in the upload heap,
   764  then the repair loop sleeps until it is triggered by a new upload or a repair is
   765  needed. If the filesystem is in need of repair, chunks are added to the upload
   766  heap by popping the directory off the directory heap and adding any chunks that
have worse health than the next directory in the directory heap. This continues
   768  until the `MaxUploadHeapChunks` is met. The repair loop will then repair those
   769  chunks and call bubble on the directories that chunks were added from to keep
   770  the file system updated. This will continue until the file system is healthy,
   771  which means all files have a health less than the `RepairThreshold`.
   772  
   773  When repairing chunks, the Renter will first try and repair the chunk from the
   774  local file on disk. If the local file is not present, the Renter will download
   775  the needed data from its contracts in order to perform the repair. In order for
a remote repair, i.e. repairing from data downloaded from the Renter's
contracts, to be successful the chunk must be at 1x redundancy or better. If a
chunk is below 1x redundancy and the local file is not present, then the chunk,
and therefore the file, is considered lost as there is no way to repair it.
   780  
   781  **NOTE:** if the repair loop does not find a local file on disk, it will reset
   782  the localpath of the siafile to an empty string. This is done to avoid the
   783  siafile being corrupted in the future by a different file being placed on disk
   784  at the original localpath location.
   785  
   786  **Inbound Complexities**  
   787   - `Upload` adds chunks directly to the upload heap by calling
   788     `callBuildAndPushChunks`
 - Repair loop will sleep until work is needed, meaning other threads will wake
   up the repair loop by signaling the `repairNeeded` channel
   791   - There is always enough space in the heap, or the number of backup chunks is
   792     few enough that all the backup chunks are always added to the upload heap.
   793   - Stuck chunks get added directly to the upload heap and have priority over
   794     normal uploads and repairs
 - Streaming upload chunks are added directly to the upload heap and have the
   796     highest priority
   797  
   798  **Outbound Complexities**  
   799   - The Repair loop relies on the Health Loop and the Bubble subsystem to
   800     keep the filesystem accurately updated in order to work through the file
   801     system in the correct order.
   802   - The repair loop passes chunks on to the upload subsystem and expects that
   803     subsystem to handle the request 
   804   - `Upload` calls `callBuildAndPushChunks` to add upload chunks to the
   805     `uploadHeap` and then signals the heap's `newUploads` channel so that the
   806     Repair Loop will work through the heap and upload the chunks
   807  
   808  #### Stuck Loop
Files are marked as `stuck` if the Renter is unable to fully repair a file that
has previously finished uploading, i.e. attained a health of < 1. The goal is to
mark a chunk as stuck if it is independently unable to be repaired, meaning
this chunk is unable to be repaired but other chunks are able to be repaired. We
   813  mark a chunk as stuck so that the repair loop will ignore it in the future and
   814  instead focus on chunks that are able to be repaired.
   815  
   816  The stuck loop is responsible for targeting chunks that didn't get repaired
   817  properly, or chunks that are marked as unfinished. There are two methods for
adding stuck chunks to the upload heap: the first method is random selection and
the second is using the `stuckStack`. On start up the `stuckStack` is empty, so
   820  the stuck loop begins using the random selection method. Once the `stuckStack`
   821  begins to fill, the stuck loop will use the `stuckStack` first before using the
   822  random method.
   823  
For the random selection, one chunk is selected uniformly at random out of all of
   825  the stuck chunks in the filesystem. The stuck loop does this by first selecting
   826  a directory containing stuck chunks by calling `managedStuckDirectory`. Then
   827  `managedBuildAndPushRandomChunk` is called to select a file with stuck chunks to
   828  then add one stuck chunk from that file to the heap. The stuck loop repeats this
   829  process of finding a stuck chunk until there are `maxRandomStuckChunksInHeap`
   830  stuck chunks in the upload heap or it has added `maxRandomStuckChunksAddToHeap`
stuck chunks to the upload heap. Stuck chunks have priority in the heap, so
limiting them to `maxStuckChunksInHeap` at a time prevents the heap from being
saturated with stuck chunks that potentially cannot be repaired, which would
cause no other files to be repaired.
   835  
   836  For the stuck loop to begin using the `stuckStack` there needs to have been
   837  successful stuck chunk repairs. If the repair of a stuck chunk is successful,
   838  the SiaPath of the SiaFile it came from is added to the Renter's `stuckStack`
and a signal is sent to the stuck loop so that another stuck chunk can be added
to the heap. The repair loop will continue to add stuck chunks from the
`stuckStack` until there are `maxStuckChunksInHeap` stuck chunks in the upload
heap. Stuck chunks added from the `stuckStack` will have priority over random
stuck chunks; this is determined by setting the `fileRecentlySuccessful` field
to true for the chunk. The `stuckStack` tracks `maxSuccessfulStuckRepairFiles`
number of SiaFiles that have had stuck chunks successfully repaired in a LIFO
stack. If the LIFO stack already has `maxSuccessfulStuckRepairFiles` in it, when
a new SiaFile is pushed onto the stack the oldest SiaFile is dropped from the
stack so the new SiaFile can be added. Additionally, if a SiaFile is being added
that is already being tracked, then the original reference is removed and the
SiaFile is added to the top of the stack. If there have been successful stuck
chunk repairs, the stuck loop will try to add additional stuck chunks from
these files first before trying to add a random stuck chunk. The idea is that
since all the chunks in a SiaFile have the same redundancy settings and were
presumably uploaded around the same time, if one chunk was able to be repaired,
the other chunks should be able to be repaired as well. Additionally, a LIFO
stack is used because the more recent a success was, the higher confidence we
have in additional successes.
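
A sketch of that LIFO behavior; aside from `stuckStack` and
`maxSuccessfulStuckRepairFiles`, the names, the capacity value, and the locking
details are illustrative, and the real implementation is in
[stuckstack.go](./stuckstack.go):

```go
package renter

import "sync"

// maxSuccessfulStuckRepairFiles caps the stack size; the real constant may
// differ from this illustrative value.
const maxSuccessfulStuckRepairFiles = 20

type stuckStack struct {
	mu    sync.Mutex
	stack []string // SiaPaths, most recent success last
}

// managedPush adds a SiaPath to the top of the stack, moving it to the top if
// it is already tracked and dropping the oldest entry once the stack is full.
func (s *stuckStack) managedPush(siaPath string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for i, sp := range s.stack {
		if sp == siaPath {
			s.stack = append(s.stack[:i], s.stack[i+1:]...)
			break
		}
	}
	if len(s.stack) >= maxSuccessfulStuckRepairFiles {
		s.stack = s.stack[1:] // drop the oldest entry
	}
	s.stack = append(s.stack, siaPath)
}

// managedPop returns the most recently successful SiaPath, if any.
func (s *stuckStack) managedPop() (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.stack) == 0 {
		return "", false
	}
	siaPath := s.stack[len(s.stack)-1]
	s.stack = s.stack[:len(s.stack)-1]
	return siaPath, true
}
```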
   858  
   859  If the repair wasn't successful, the stuck loop will wait for the
   860  `repairStuckChunkInterval` to pass and then try another random stuck chunk. If
   861  the stuck loop doesn't find any stuck chunks, it will sleep until a bubble wakes
   862  it up by finding a stuck chunk.
   863  
   864  **Inbound Complexities**  
   865   - Chunk repair code signals the stuck loop when a stuck chunk is successfully
   866     repaired
   867   - The Bubble subsystem signals the stuck loop when `AggregateNumStuckChunks` for the root
   868     directory is > 0
   869  
   870  **State Complexities**  
   871   - The stuck loop and the repair loop use a number of the same methods when
   872     building `unfinishedUploadChunks` to add to the `uploadHeap`. These methods
   873     rely on the `repairTarget` to know if they should target stuck chunks or
   874     unstuck chunks 
   875  
   876  ### Backup Subsystem
   877  **Key Files**
   878   - [backup.go](./backup.go)
   879   - [backupsnapshot.go](./backupsnapshot.go)
   880  
   881  *TODO* 
   882    - expand subsystem description
   883  
   884  The backup subsystem of the renter is responsible for creating local and remote
   885  backups of the user's data, such that all data is able to be recovered onto a
   886  new machine should the current machine + metadata be lost.
   887  
   888  ### Refresh Paths Subsystem
   889  **Key Files**
   890   - [refreshpaths.go](./refreshpaths.go)
   891  
   892  The refresh paths subsystem of the renter is a helper subsystem that tracks the
   893  minimum unique paths that need to be refreshed in order to refresh the entire
   894  affected portion of the file system.
   895  
   896  **Inbound Complexities** 
   897   - `callAdd` is used to try and add a new path. 
   898   - `callRefreshAll` is used to refresh all the directories corresponding to the
   899     unique paths in order to update the filesystem