# Vitess Queues

This document describes a proposed design and implementation guidelines for a
Queue feature inside Vitess. Events can be added to a Queue, and processed
asynchronously by Receivers.

## Problem and Challenges

Queues are hard to implement correctly on top of MySQL. They need to scale, and
application-level implementations usually run into issues at high QPS.

This document proposes a design to address the following requirements:

* Need to support sharding transparently (and scale linearly with number of
  shards).
* Need to not lose events.
* Extracting the events from the queue for processing needs to be scalable.
* Adding to the queue needs to be transaction-safe (so changing some content and
  adding to the queue can be committed as a transaction).
* Events have an optional timestamp of when they should fire. If none is
  specified, it will be as soon as possible.
* May fire events more than once, if the event receiver is unresponsive, or in
  case of topology events (resharding, reparent, ...).
* Events that are successfully acked should never fire again.
* Strict event processing ordering is not a requirement.

While it is possible to implement high-QPS queues using a sharded table, it
would introduce the following limitations:

* The Queue consumers need to poll the database to find new work.
* The Queue consumers have to be somewhat shard-aware, to only consume events
  from one shard.
* With a higher number of consumers on one shard, they step on each other's
  toes and introduce database latency.

## Proposed Design

At a high level, we propose to still use a table as the backend for a Queue. It
follows the same sharding as the Keyspace it is in, meaning it is present on all
shards. However, instead of having consumers query the table directly, we
introduce the Queue Manager:

* When a tablet becomes the primary of a shard (at startup, or at reparent time),
  it creates a Queue Manager routine for each Queue Table in the schema.
* The Queue Manager is responsible for managing the data in the Queue Table.
* The Queue Manager dispatches events that are firing to a pool of
  Listeners. Each Listener maintains a streaming connection to the Queue it is
  listening on.
* The Queue Manager can then use an in-memory cache covering the next time
  interval, which keeps it efficient.
* All requests to the Queue table are then done by Primary Key, and are very
  fast.

Each Queue Table has the following mandatory fields (a Go sketch of the
resulting row shape follows the list). At first, we can create them using a
regular schema change. Eventually, we'll support a `CREATE QUEUE name(...)`
command. Each entry in the queue has a sharding key (which could be a hash of
the trigger timestamp). The Primary Key for the table is a timestamp with
nanosecond granularity:

* Timestamp in nanoseconds since Epoch (PK): uint64
* State (enum): to_be_fired, fired, acked.
* Some sharding key: user-defined, specified at creation time, with vindex.
* Other columns: user-defined, specified at creation time.
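
For concreteness, here is a minimal Go sketch of the row shape these fields
imply. The type and field names (`QueueEvent`, `TimeScheduled`, ...) are
illustrative only, not part of the proposal:

``` go
package queue

import "time"

// EventState mirrors the proposed `state` column.
type EventState int

const (
    ToBeFired EventState = iota
    Fired
    Acked
)

// QueueEvent is an in-memory view of one row of a Queue Table.
// Field names and Go types are illustrative only.
type QueueEvent struct {
    TimeScheduled uint64     // nanoseconds since Epoch, used as the Primary Key
    State         EventState // to_be_fired, fired or acked
    ShardingKey   []byte     // user-defined sharding column, resolved through a vindex
    Payload       []byte     // stands in for the other user-defined columns
}

// FireTime converts the primary key back into a time.Time.
func (e *QueueEvent) FireTime() time.Time {
    return time.Unix(0, int64(e.TimeScheduled))
}
```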

A Queue is linked to a Queue Manager on the primary vttablet, on each shard. The
Queue Manager is responsible for dispatching events to the Receivers. The Queue
Manager can impose the following limits (a throttling sketch follows the list):

* Maximum QPS of events being processed.
* Maximum concurrency of event processing.
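
As an illustration, the sketch below shows one way such limits could be
enforced, using `golang.org/x/time/rate` for the QPS cap and a buffered channel
as a concurrency semaphore. The `throttler` type and its wiring are assumptions
for this sketch, not a committed design:

``` go
package queue

import (
    "context"

    "golang.org/x/time/rate"
)

// throttler caps both the rate and the concurrency of event dispatch.
// The limits would come from the Queue configuration; values here are
// placeholders.
type throttler struct {
    limiter *rate.Limiter // maximum events per second
    slots   chan struct{} // maximum in-flight events
}

func newThrottler(maxQPS float64, maxConcurrent int) *throttler {
    return &throttler{
        limiter: rate.NewLimiter(rate.Limit(maxQPS), 1),
        slots:   make(chan struct{}, maxConcurrent),
    }
}

// acquire blocks until the event is allowed to be dispatched. The caller
// must call release() once the event is acked or times out.
func (t *throttler) acquire(ctx context.Context) error {
    if err := t.limiter.Wait(ctx); err != nil {
        return err
    }
    select {
    case t.slots <- struct{}{}:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

func (t *throttler) release() { <-t.slots }
```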

*Note*: we still use the database to store the items, so any state change is
persisted. We'll explore scalability below, but the QPS for Queues will still be
limited by the insert / update QPS of each shard primary. To increase Queue
capacity, the usual horizontal sharding process can be used.

## Implementation Details

Queue Tables are marked in the schema by a comment, in a similar way to how we
detect Sequence Tables
[today](https://github.com/vitessio/vitess/blob/0b3de7c4a2de8daec545f040639b55a835361685/go/vt/vttablet/tabletserver/tabletserver.go#L138).

When a tablet becomes a primary, and there are Queue Tables, it creates a
Queue Manager for each of them.

The Queue Manager has a window of ‘current or soon’ events (let’s say the next
30 seconds or 1 minute). Upon startup, it reads these events into memory. It
then keeps that in-memory copy up to date with every event modification that
falls inside the window.
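
A rough sketch of that in-memory window, reusing the hypothetical `QueueEvent`
type from the earlier sketch; `loadWindow` stands in for the range query
against the Queue Table:

``` go
package queue

import (
    "sync"
    "time"
)

// queueManager caches the events scheduled within the next `window`
// (for example 30s or 1m). This is a sketch of the data structure only.
type queueManager struct {
    mu       sync.Mutex
    window   time.Duration
    upcoming map[uint64]*QueueEvent // keyed by the timestamp primary key
}

func newQueueManager(window time.Duration) *queueManager {
    return &queueManager{
        window:   window,
        upcoming: make(map[uint64]*QueueEvent),
    }
}

// refresh reloads the window from the Queue Table. loadWindow is a
// placeholder for the actual range query by primary key.
func (qm *queueManager) refresh(loadWindow func(from, to time.Time) []*QueueEvent) {
    now := time.Now()
    events := loadWindow(now, now.Add(qm.window))

    qm.mu.Lock()
    defer qm.mu.Unlock()
    qm.upcoming = make(map[uint64]*QueueEvent, len(events))
    for _, e := range events {
        qm.upcoming[e.TimeScheduled] = e
    }
}

// observe keeps the cache in sync with inserts/updates that fall inside
// the window.
func (qm *queueManager) observe(e *QueueEvent) {
    if e.FireTime().Before(time.Now().Add(qm.window)) {
        qm.mu.Lock()
        qm.upcoming[e.TimeScheduled] = e
        qm.mu.Unlock()
    }
}
```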

Receivers use streaming RPCs to ask for events, providing a KeyRange if
necessary to only target a subset of the Queues. That RPC is connected to the
corresponding Queue Managers on the targeted shards. It is recommended that
multiple Receivers target the same shard, for redundancy and scalability.

Whenever an event becomes current and needs to be fired, the Queue Manager
changes the state of the event on disk to ‘fired’, and sends it to a random
Receiver (or the least busy or most responsive Receiver, whichever heuristic we
choose). The Receiver will then either (see the dispatch sketch after this
list):

* Ack the event after processing, and then the Manager can mark it ‘acked’ in
  the table, or even delete it.
* Not respond in time, or not ack it: the Manager can mark it as ‘to be fired’
  again. In that case, the Manager can back off and try again later. The logic
  here can be as simple or complicated as necessary.
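
Sketched in Go, under the same assumptions as the earlier snippets; `markState`
and `sendAndWaitForAck` are placeholders for the point update by Primary Key
and the streaming-RPC round trip:

``` go
package queue

import (
    "context"
    "time"
)

// dispatch fires one event: mark it 'fired' on disk, hand it to a
// Receiver, then mark it 'acked' (or back to 'to_be_fired' on timeout).
// This is a sketch; error handling and retry policy are elided.
func dispatch(ctx context.Context, e *QueueEvent,
    markState func(pk uint64, s EventState) error,
    sendAndWaitForAck func(ctx context.Context, e *QueueEvent) error) error {

    // Persist the state change before the event leaves the primary.
    if err := markState(e.TimeScheduled, Fired); err != nil {
        return err
    }

    // Give the Receiver a bounded amount of time to process and ack.
    ackCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    if err := sendAndWaitForAck(ackCtx, e); err != nil {
        // No ack: make the event eligible to fire again later.
        return markState(e.TimeScheduled, ToBeFired)
    }
    return markState(e.TimeScheduled, Acked)
}
```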

The easiest implementation is to not guarantee an execution order for the
events. Providing ordering would not be too difficult, but it has an impact on
performance (when a Receiver doesn't respond in time, it delays everybody else,
so we'd need to time out quickly and keep going).

For ‘old’ events (that have been fired but not acked), we propose to use an old
event collector that retries with some kind of longer threshold / retry
period. That way we don't use the Queue Manager for old events, and keep it
focused on firing recent events.

Event history can be managed in multiple ways. The Queue Manager should probably
have some kind of retention policy, and delete events that have been fired and
are older than N days. Deleting the table would delete the entire queue.
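
For illustration, the old event collector and the retention policy above could
boil down to periodic queries along these lines (table and column names, bind
variables and thresholds are all placeholders):

``` go
package queue

// Queries the old event collector and the retention policy might run
// periodically. All identifiers and thresholds are illustrative.
const (
    // Re-fire events that were marked 'fired' long ago but never acked.
    requeueStaleEventsQuery = `
UPDATE my_queue
   SET state = 'to_be_fired'
 WHERE state = 'fired'
   AND time_scheduled < :stale_threshold`

    // Drop acked events older than the retention period (N days).
    purgeOldEventsQuery = `
DELETE FROM my_queue
 WHERE state = 'acked'
   AND time_scheduled < :retention_threshold`
)
```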

## Example / Code

We create the Queue:

``` sql
CREATE QUEUE my_queue(
  user_id BIGINT(10),
  payload VARCHAR(256)
);
```

Then we change the VSchema to add a hash vindex for user_id.

Let's insert something into the queue:

``` sql
BEGIN;
INSERT INTO my_queue(user_id, payload) VALUES (10, 'I want to send this');
COMMIT;
```

We add another streaming endpoint to vtgate for queue event consumption, and one
for acking:

``` protocol-buffer
message QueueReceiveRequest {
  // caller_id identifies the caller. This is the effective caller ID,
  // set by the application to further identify the caller.
  vtrpc.CallerID caller_id = 1;

  // keyspace to target the query to.
  string keyspace = 2;

  // shard to target the query to, for unsharded keyspaces.
  string shard = 3;

  // KeyRange to target the query to, for sharded keyspaces.
  topodata.KeyRange key_range = 4;
}

message QueueReceiveResponse {
  // First response has fields, next ones have data.
  // Note we might get multiple notifications in a single packet.
  query.QueryResult result = 1;
}

message QueueAckRequest {
  // TODO: find a good way to reference the event, maybe
  // have an 'id' in the query result.
  repeated string id = 1;
}

message QueueAckResponse {
}

service Vitess {
  ...
  rpc QueueReceive(vtgate.QueueReceiveRequest) returns (stream vtgate.QueueReceiveResponse) {};

  rpc QueueAck(vtgate.QueueAckRequest) returns (vtgate.QueueAckResponse) {};
  ...
}
```

Then the receiver code is just an endless loop (sketched in Go below):

* Call QueueReceive(). For each QueueReceiveResponse:
  * For each row:
    * Do what needs to be done.
    * Call QueueAck() for the processed events.
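
A Go sketch of that loop. Since the RPCs above are only proposed, `queueClient`
is a hand-written stand-in for the client that would eventually be generated
from the proto definitions; none of this is an existing Vitess API:

``` go
package queue

import "context"

// queueEventRow is one row streamed back by QueueReceive; the id field is
// the reference used for acking (see the TODO in QueueAckRequest).
type queueEventRow struct {
    ID      string
    Payload []byte
}

// queueClient is a stand-in for the client generated from the proposed
// QueueReceive / QueueAck RPCs.
type queueClient interface {
    // QueueReceive streams batches of events for the given keyspace
    // (a KeyRange could also be passed; omitted here for brevity).
    QueueReceive(ctx context.Context, keyspace string) (<-chan []queueEventRow, error)
    // QueueAck acknowledges processed events by id.
    QueueAck(ctx context.Context, ids []string) error
}

// receiveLoop is the endless loop described above: receive, process, ack.
func receiveLoop(ctx context.Context, client queueClient, process func(queueEventRow) error) error {
    batches, err := client.QueueReceive(ctx, "my_keyspace")
    if err != nil {
        return err
    }
    for batch := range batches {
        var done []string
        for _, row := range batch {
            if err := process(row); err != nil {
                continue // not acked: the event will fire again later
            }
            done = append(done, row.ID)
        }
        if len(done) > 0 {
            if err := client.QueueAck(ctx, done); err != nil {
                return err
            }
        }
    }
    return ctx.Err()
}
```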

*Note*: acks may need to be inside a transaction too. In that case, we may want
to remove the QueueAck RPC and replace it with an `UPDATE my_queue SET
state='acked' WHERE <PK>=...;` statement, which might be rewritten by vttablet.

## Scalability

How does this scale? The answer should be: very well. There is no expensive
query done to the database, only a range query (which might need locking) every
N seconds. Everything else is point updates.

This approach scales with the number of shards too. Resharding will multiply the
queue processing power.

Since we have a process on the primary, we piggy-back on our regular tablet
primary election.

*Event creation*: each creation is one INSERT into the database. Creation of
events that are supposed to execute as soon as possible may create a hotspot in
the table. If so, we might be better off using the keyspace id as the primary
key, and having an index on timestamps. If the event creation falls in the
execution window, we also keep it in memory. Multiple events can even be created
in the same transaction. *Optimization*: if the event is supposed to be fired
right away, we might even create it as ‘fired’ directly.

*Event dispatching*: mostly done by the Queue Manager. Each dispatch is one
update (by PK) for firing, and one update (by PK again) for acking. The RPC
overhead for sending these events and receiving the acks from the receivers is
negligible (as it has no DB impact).

*Queue Manager*: it does one scan of the upcoming events for the next N seconds,
every N seconds. Hopefully that query is fast. Note we could add a ‘LIMIT’ to
that query and only consider the next N seconds, or the period that has 10k
events (value TBD), whichever is smaller.
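
For reference, a sketch of that periodic scan, written out as the query the
Queue Manager could issue (column names, the `:horizon` bind variable and the
LIMIT value are placeholders):

``` go
package queue

// upcomingEventsQuery is the only range query the Queue Manager runs,
// once every N seconds. Everything else is a point update by primary key.
const upcomingEventsQuery = `
SELECT time_scheduled, state, sharding_key, payload
  FROM my_queue
 WHERE state = 'to_be_fired'
   AND time_scheduled < :horizon
 ORDER BY time_scheduled
 LIMIT 10000`
```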

*Event Processing*: a pool of Receivers is consuming the events. Scale that up
to as many as needed, with any desired redundancy. The shard primary will send
the event to one receiver only, until it is acked, or times out.

*Hot Spots in the Queue table*: with very high QPS, we will end up having a hot
spot on the window of events that are coming up. A proposed solution is to use
an 'event_name' that has a unique value, and use it as the primary key. When the
Queue Manager gets new events however, it will still need to sort them by
timestamp, and therefore will require an index. It is unclear which solution
will work better; we'd need to experiment. For instance, if most events are only
present in the table while they are being processed, the entire table will have
high QPS anyway.

## Caveats

This is not meant to replace UpdateStream. UpdateStream is meant to send one
notification for each database event, without extra database cost, and without
any filtering. Queues have a database cost.
   247  
   248  Events can be fired multiple times, in corner cases. However, we always do a
   249  database access before firing the event. So in case of reparent / resharding,
   250  that database access should fail. So we should stop dispatching events as soon
   251  as the primary is read-only. Also, if we receive an ack when the database is
   252  read-only, we can’t store it. With the Query Buffering feature, vtgate may be
   253  able to retry later on the new primary.

## Extensions

The following extensions can be implemented, but won't be at first:

* *Queue Ordering*: only fire one event at a time, in order.

* *Event Groups*: group events together using a common key, and when they're
  all done, fire another event.