[Table of contents](../README.md#table-of-contents)

# Replication

Replication is the ability of a cozy-stack to copy or move all or a subset of
its data to another medium. It should cover two use cases:

-   Devices: continuously syncing changes to a subset of the cozy documents
    and files between a user's cozy and that same user's devices, through
    cozy-desktop and cozy-mobile.
-   Sharing: continuously syncing changes to a subset of the cozy documents
    and files to and from another user's cozy.

Replication will not be used for Moving, nor Backup. See associated docs in this
folder.

CouchDB replication is a well-understood, safe and stable algorithm ensuring
replication between two couchdb databases. It is delta-based and stateful:
we sync changes since a given checkpoint.

## Files synchronization

Replicating databases that are too heavy (with a lot of attachments) has been,
in cozy's experience, a source of bugs. To avoid that (and some other issues),
cozy-stack stores attachments outside of couchdb.

This means we need a way to synchronize files in parallel with couchdb
replication.

Rsync is a well-understood, safe and stable algorithm to replicate a file
hierarchy from one host to another by transferring only the changes. It is
one-shot and stateless. It could be an inspiration if we have to focus on
syncing small changes to big files. The option to use rsync/zsync has been
discussed internally (https://github.com/cozy/cozy-stack/pull/57 & Sprint Start
2016-10-24), but for now we should focus on using the files/folders couchdb
documents for synchronization, and upload/download the actual binaries
with the existing files routes or consider webdav.
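
As a rough illustration of that approach, the sketch below plans binary
transfers from a batch of replicated file documents. The document shape
(`type`, `md5sum`, `_deleted`) and the local checksum index are assumptions made
for the example, not something this document specifies.

```javascript
// Hypothetical sketch: decide which binaries to fetch or drop after a batch of
// file/folder documents has been replicated.
// `changedDocs`: documents received through replication (assumed shape).
// `localIndex`: Map of document id -> checksum of the binary already on disk.
function planFileTransfers(changedDocs, localIndex) {
  const toDownload = [];
  const toDelete = [];
  for (const doc of changedDocs) {
    if (doc._deleted) {
      // Removed on the source: drop the local binary, if we have one.
      if (localIndex.has(doc._id)) toDelete.push(doc._id);
      continue;
    }
    if (doc.type !== "file") continue; // folders carry no binary content
    // Only fetch the binary when its checksum differs from the local copy.
    if (localIndex.get(doc._id) !== doc.md5sum) toDownload.push(doc._id);
  }
  return { toDownload, toDelete };
}

const plan = planFileTransfers(
  [
    { _id: "a", type: "file", md5sum: "111" }, // same content as local copy
    { _id: "b", type: "file", md5sum: "333" }, // content changed
    { _id: "c", type: "directory" }, // no binary
    { _id: "d", _deleted: true }, // removed on the source
  ],
  new Map([["a", "111"], ["b", "222"], ["d", "444"]])
);
console.log(plan.toDownload, plan.toDelete); // ids "b" and "d" respectively
```

The binaries listed in `toDownload` would then be fetched through the files
routes, independently of the couchdb replication itself.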

**Useful golang package:** http://rclone.org/ syncs folders in FS / Swift
(maybe we can add a driver for cozy).

## Couchdb replication & limitation

The Couchdb 2 replication protocol is described
[in detail here](http://docs.couchdb.org/en/stable/replication/protocol.html).

### Quick summary

1. The replicator figures out the last sequence number _lastseqno_
   which was correctly synced from the source database, using a
   `_local/:replicationid` document.
2. The replicator `GET :source/_changes?since=lastseqno&limit=batchsize` and
   identifies which docs have changed (and their _open_revs_).
3. The replicator `POST :target/_revs_diff` to get a list of all revisions of
   the changed documents the target does not know.
4. The replicator `GET :source/:docid?open_revs=['rev1', ...]` several times to
   retrieve the missing document revisions.
5. The replicator `POST :target/_bulk_docs` with these document revisions.
6. The replicator stores the new last sequence number.

Repeat 2-5 until there are no more changes.
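
The steps above can be sketched with two in-memory stand-ins for databases.
This is a simplified model, not the real protocol: sequence numbers are plain
integers (CouchDB 2 uses opaque strings), revision trees are flattened to a map
of revisions, and `makeDb`, `put` and `replicate` are names invented for the
example.

```javascript
// Hypothetical in-memory database: `docs` maps id -> Map(rev -> body),
// `changes` is the ordered changes feed, `local` holds replication checkpoints.
function makeDb() {
  return { docs: new Map(), changes: [], local: new Map() };
}

function put(db, id, rev, body) {
  if (!db.docs.has(id)) db.docs.set(id, new Map());
  db.docs.get(id).set(rev, body);
  db.changes.push({ seq: db.changes.length + 1, id, rev });
}

function replicate(source, target, replicationId, batchSize = 2) {
  // Step 1: read the checkpoint (0 when this replication never ran).
  let lastSeq = source.local.get(replicationId) || 0;
  for (;;) {
    // Step 2: changes since the checkpoint, one batch at a time.
    const batch = source.changes
      .filter((change) => change.seq > lastSeq)
      .slice(0, batchSize);
    if (batch.length === 0) break;
    // Step 3: keep only the revisions the target does not know (_revs_diff).
    const missing = batch.filter(
      (c) => !(target.docs.has(c.id) && target.docs.get(c.id).has(c.rev))
    );
    // Steps 4 and 5: read the missing revisions and write them to the target.
    for (const c of missing) {
      put(target, c.id, c.rev, source.docs.get(c.id).get(c.rev));
    }
    // Step 6: store the new checkpoint on both sides.
    lastSeq = batch[batch.length - 1].seq;
    source.local.set(replicationId, lastSeq);
    target.local.set(replicationId, lastSeq);
  }
  return lastSeq;
}

const alice = makeDb();
const bob = makeDb();
put(alice, "doc1", "1-abc", { name: "x" });
put(alice, "doc2", "1-def", { name: "y" });
put(alice, "doc1", "2-ghi", { name: "z" });
console.log(replicate(alice, bob, "alice-to-bob")); // 3
console.log(replicate(alice, bob, "alice-to-bob")); // 3 again, nothing to copy
```

Because the checkpoint is stored, the second call finds no changes after
sequence 3 and transfers nothing, which is what makes the algorithm delta-based.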

### Details

-   In step 5, the replicator can also attempt to `PUT :target/:docid` if docs
    are too heavy, but this should not happen in cozy-stack considering there
    won't be attachments in couchdb.
-   In step 4, the replicator can optimize by calling `GET`
-   The main difference from couchdb 1.X is in the replication history and the
    way the last sequence number is determined and stored. **TODO:** actually
    understand this and how it might increase disk usage if we have a lot of
    replications.
-   Couchdb `_changes`, and by extension replication, can be either polling or
    continuous (SSE / COMET).
-   In couchdb benchmarking, we understood that each additional couch database
    only uses a bit of disk space but no RAM or CPU **as long as the database is
    not used**. Having a continuous replication active forces us to keep the
    database file open and will exhaust RAM and file descriptors. Moreover,
    continuous replication costs a lot (cf. ScienceTeam benchmark). To permit an
    unlimited number of inactive users on a single cozy-stack process, **the
    stack should avoid continuous replication from couchdb**.
-   Two-way replication is simply two one-way replications.

### Routes used by replication

To be a source of replication, the stack only needs to support the following
routes (and query parameters):

-   `GET :source/:docid` gets revisions of a document. The query parameters
    `open_revs, revs, latest` are necessary for replication.
-   `POST :source/_all_docs` is used by the current version of cozy-mobile and
    pouchdb as an optimization to fetch revisions of several documents at once.
-   `GET :source/_changes` gets a list of ID -> New Revision since a given
    sequence number. The query parameters `since, limit` are necessary for
    replication.

To be a target of replication, the stack needs to support the following routes:

-   `POST :target/_revs_diff` takes a list of ID -> Rev and returns which ones
    are missing.
-   `POST :target/_bulk_docs` creates several documents at once.
-   `POST :target/_ensure_full_commit` ensures the documents are written to
    disk. This is useless if couchdb is configured without delayed write
    (default), but a remote couchdb will call it anyway, so the stack should
    return the expected 201.

In both cases, we need to support:

-   `PUT :both/_local/:revdocid` to store the current sequence number.
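
One way to picture this route list is as a whitelist that a proxy could consult
before forwarding a request. The table below is only a sketch: the path layout
(`/:db/...`), the `role` labels and the `replicationRole` helper are
assumptions made for the example.

```javascript
// Hypothetical whitelist of the replication routes listed above.
// Paths are assumed to look like "/<database>/<rest>".
const REPLICATION_ROUTES = [
  { method: "GET", pattern: /^\/[^/]+\/_changes$/, role: "source" },
  { method: "POST", pattern: /^\/[^/]+\/_all_docs$/, role: "source" },
  { method: "POST", pattern: /^\/[^/]+\/_revs_diff$/, role: "target" },
  { method: "POST", pattern: /^\/[^/]+\/_bulk_docs$/, role: "target" },
  { method: "POST", pattern: /^\/[^/]+\/_ensure_full_commit$/, role: "target" },
  { method: "PUT", pattern: /^\/[^/]+\/_local\/[^/]+$/, role: "both" },
  // Plain documents: ids starting with "_" are treated as reserved here.
  { method: "GET", pattern: /^\/[^/]+\/[^/_][^/]*$/, role: "source" },
];

// Returns "source", "target" or "both" for a replication route, null otherwise.
function replicationRole(method, path) {
  const route = REPLICATION_ROUTES.find(
    (r) => r.method === method && r.pattern.test(path)
  );
  return route ? route.role : null;
}

console.log(replicationRole("GET", "/contacts/_changes")); // "source"
console.log(replicationRole("POST", "/contacts/_bulk_docs")); // "target"
console.log(replicationRole("DELETE", "/contacts/doc1")); // null
```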

## Stack Sync API exploration

### Easy part: 1 db/doctype on stack AND remote, no binaries

We _just_ need to implement the routes described above by proxying to the
underlying couchdb database.

```javascript
// In Node; in a browser, PouchDB is usually available as a global instead.
var PouchDB = require("pouchdb");
var db = new PouchDB("my-local-contacts");
// Two one-way replications: pull from the stack and push to it.
db.replicate.from("https://bob.cozycloud.cc/data/contacts");
db.replicate.to("https://bob.cozycloud.cc/data/contacts");
```

To support this we need to:

-   Proxy the `/data/:doctype/_changes` route with since, limit, feed=normal.
    Refuse all filter parameters with a clear error message.
    [(Doc)](http://docs.couchdb.org/en/stable/api/database/changes.html)
-   Add support for the `open_revs`, `revs`, `latest` query parameters to
    `GET /data/:doctype/:docid`
    [(Doc)](http://docs.couchdb.org/en/stable/api/document/common.html?highlight=open_revs#get--db-docid)
-   Proxy the `/data/:doctype/_revs_diff`
    [(Doc)](http://docs.couchdb.org/en/stable/api/database/misc.html#db-revs-diff)
    and `/data/:doctype/_bulk_docs`
    [(Doc)](http://docs.couchdb.org/en/stable/api/database/bulk-api.html) routes
-   Have `/data/:doctype/_ensure_full_commit`
    [(Doc)](http://docs.couchdb.org/en/stable/api/database/compact.html#db-ensure-full-commit)
    return 201

This will cover the documents part of the Devices use case.
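
The first item, accepting only `since`, `limit` and `feed=normal` on `_changes`
and refusing anything else with a clear message, could look like the sketch
below. The function name, the returned shape and the error wording are
assumptions made for the example.

```javascript
// Hypothetical check of the query string received on /data/:doctype/_changes.
// Anything outside `since`, `limit` and `feed=normal` is refused.
const ALLOWED_CHANGES_PARAMS = new Set(["since", "limit", "feed"]);

function checkChangesQuery(queryString) {
  const params = new URLSearchParams(queryString);
  for (const [name, value] of params) {
    if (!ALLOWED_CHANGES_PARAMS.has(name)) {
      // Covers filter, doc_ids, view, selector and similar parameters.
      return { ok: false, error: `Unsupported query parameter: ${name}` };
    }
    if (name === "feed" && value !== "normal") {
      return { ok: false, error: `Unsupported feed mode: ${value}` };
    }
    if (name === "limit" && !/^\d+$/.test(value)) {
      return { ok: false, error: `Invalid limit: ${value}` };
    }
  }
  return { ok: true };
}

console.log(checkChangesQuery("since=12&limit=100&feed=normal").ok); // true
console.log(checkChangesQuery("filter=app/by_type").error);
// "Unsupported query parameter: filter"
```

Refusing `feed=continuous` here is also what keeps the proxy from holding a
couchdb connection open per client.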

### Continuous replication

It is impossible to implement it by simply proxying to couchdb (see unlimited
inactive users).

The current version of cozy-desktop uses it. **TODISCUSS** It could be replaced
by 3-minute polling without big losses in functionality, possibly with some
more triggers based on user activity.

The big use case for which we might want it is Sharing. But, because Sharing
will (first) be stack to stack, we might imagine having the source of the event
"ping" the remote through a separate route.

**Conclusion:** Start with polling replication. Consider alternative
notification mechanisms when the use case appears, and possibly use time-limited
continuous replication for a few special cases (collaborative editing).
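
A polling policy along these lines can be expressed as a small pure function.
The sketch below assumes the 3-minute interval mentioned above; the minimum gap
of 10 seconds between two runs and the `shouldReplicate` name are assumptions
added for the example.

```javascript
// Hypothetical polling policy. All times are timestamps in milliseconds.
const POLL_INTERVAL_MS = 3 * 60 * 1000; // regular polling period
const MIN_GAP_MS = 10 * 1000; // assumed lower bound between two replications

// `lastTrigger` is the time of the last user activity or remote "ping",
// or null when there was none.
function shouldReplicate(now, lastReplication, lastTrigger) {
  const elapsed = now - lastReplication;
  if (elapsed >= POLL_INTERVAL_MS) return true; // regular polling
  const triggered = lastTrigger !== null && lastTrigger > lastReplication;
  return triggered && elapsed >= MIN_GAP_MS; // early run after a trigger
}

console.log(shouldReplicate(200000, 0, null)); // true: 3 minutes have passed
console.log(shouldReplicate(60000, 0, null)); // false: too early, no trigger
console.log(shouldReplicate(60000, 0, 30000)); // true: activity since last run
```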

### Realtime

One of the cool features of cozy apps was how changes are sent to the client in
realtime through websocket. However, this has a cost: every cozy and every app
open in a browser, even while not used, keeps one socket open on the server.
This does not scale to thousands of users.

Several technologies can be used for realtime: Websocket, SSE, or COMET-like
long-polling. Goroutines make all solutions similar performance-wise. COMET is
a hack; websocket seems more popular and tested (x/net/websocket vs
html5-sse-example). SSE is not widely available and has some limitations
(headers...).

Depending on benchmarking, we can do some optimizations on the feed:

-   close feeds when the user is not on screen
-   multiplex the feeds of different applications, so each open cozy will only
    use one socket to the server. This is hard, as all apps live on separate
    domains; a (hackish) option might be an iframe/SharedWorker bridge.

To have some form of couchdb-to-stack continuous changes monitoring, we can
monitor `_db_updates`
[(Doc)](http://docs.couchdb.org/en/stable/api/server/common.html#db-updates)
which gives us an update when a couchdb database has changed. We can then
perform a `_changes` query on this database to get the changed docs and proxy
that to the stack-to-client change feed.
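
As a sketch of that last idea, the helper below turns a `_db_updates` response
into the list of databases that need a `_changes` query. The event fields
(`db_name`, `type`) follow the CouchDB documentation linked above; the function
names and the `lastSeen` map of known sequence numbers are assumptions made for
the example.

```javascript
// Hypothetical handling of a `_db_updates` response: collect the databases
// that changed and still exist, so each one gets a single `_changes` query.
function databasesToQuery(dbUpdates) {
  const changed = new Set();
  for (const event of dbUpdates.results) {
    if (event.type === "deleted") {
      changed.delete(event.db_name); // nothing left to query
    } else {
      changed.add(event.db_name); // "created" or "updated"
    }
  }
  return [...changed];
}

// Builds the `_changes` path for one database, resuming from the last
// sequence number seen for it ("0" when the database was never queried).
function changesPath(dbName, lastSeen) {
  const since = lastSeen.get(dbName) || "0";
  return `/${encodeURIComponent(dbName)}/_changes?since=${encodeURIComponent(since)}`;
}

const updates = {
  results: [
    { db_name: "bob/contacts", type: "updated" },
    { db_name: "bob/contacts", type: "updated" },
    { db_name: "bob/tmp", type: "created" },
    { db_name: "bob/tmp", type: "deleted" },
  ],
};
console.log(databasesToQuery(updates)); // only "bob/contacts" remains
```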

**Conclusion:** We will use Websocket from the client to the stack. We will try
to avoid using a continuous changes feed from couchdb to the stack. We will
optimize if benchmarks prove it necessary, starting with "useless" changes and
possibly some multiplexing.

### Sharing

To be completed by discussing with Science team.

Current ideas (as understood by Romain):

-   any filtered replication is unscalable
-   1 db for all shared docs `sharing_db`.
-   Cozy-stack is responsible for saving documents that should be shared in both
    their doctype database and `sharing_db`.
-   Sharing is implemented by 2-way replication between the 2 users'
    `sharing_db`; filtering is done by computing a list of IDs and then using
    `doc_ids` (sharing with filters/views is not efficient).
-   Sharing is performed regularly, in batches.
-   Continuous replication can be considered for collaborative editing.
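
The `doc_ids` approach can be sketched as building the body of a CouchDB
`POST /_replicate` request from a precomputed list of shared documents. The
URLs and helper names are assumptions made for the example; `source`, `target`
and `doc_ids` are the fields CouchDB expects in that body.

```javascript
// Hypothetical helper: replication request restricted to an explicit list of
// document ids, instead of a filter function or a view.
function sharingReplicationBody(sourceUrl, targetUrl, sharedDocs) {
  // Deduplicate and sort ids so the same sharing always yields the same body.
  const docIds = [...new Set(sharedDocs.map((doc) => doc._id))].sort();
  return { source: sourceUrl, target: targetUrl, doc_ids: docIds };
}

// Two-way sharing is two one-way replications over the same id list.
function twoWaySharing(urlA, urlB, sharedDocs) {
  return [
    sharingReplicationBody(urlA, urlB, sharedDocs),
    sharingReplicationBody(urlB, urlA, sharedDocs),
  ];
}

const bodies = twoWaySharing(
  "https://alice.example/sharing_db",
  "https://bob.example/sharing_db",
  [{ _id: "photo-2" }, { _id: "photo-1" }, { _id: "photo-2" }]
);
console.log(bodies[0].doc_ids); // "photo-1" then "photo-2"
```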

Proposal by Romain, if we find `_selector` filter replication performance to be
acceptable on very large / very old databases:

-   No sharing database
-   A permission, for anything, is a mango-style selector.
-   On every query, the Mango selector is checked at the stack or couchdb level
    (`$and`-ing for queries, testing output document, input document)
-   Sharing is a filtered replication between user 1's doctype db and user 2's
    db for the same doctype
-   No continuous replication
-   Upon update, the stack triggers a PUSH replication to its remote or "pings"
    the remote, and the remote performs a normal PULL replication.
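
To make the `$and`-ing idea concrete, here is a minimal sketch. The `matches`
function only understands plain equality and `$and`, which is a deliberate
simplification: real Mango selectors support many more operators, and both
function names are invented for the example.

```javascript
// Hypothetical permission check based on Mango-style selectors.
function restrictQuery(permissionSelector, querySelector) {
  // "$and"-ing: the query can only return documents the permission allows.
  return { $and: [permissionSelector, querySelector] };
}

// Simplified matcher used to test input and output documents.
// Supports only `field: value` equality and `$and`.
function matches(selector, doc) {
  return Object.entries(selector).every(([field, expected]) => {
    if (field === "$and") return expected.every((sub) => matches(sub, doc));
    return doc[field] === expected;
  });
}

const permission = { owner: "bob" };
const restricted = restrictQuery(permission, { type: "contact" });
console.log(matches(restricted, { owner: "bob", type: "contact" })); // true
console.log(matches(restricted, { owner: "alice", type: "contact" })); // false
```

The same `permission` selector could be reused as the selector of the filtered
replication, so that queries and sharing rely on a single rule.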

**TODO** experiment with performance of `_selector` filtered replication in
couchdb2.