# Vitess Queues

This document describes a proposed design and implementation guidelines for a
Queue feature inside Vitess. Events can be added to a Queue, and processed
asynchronously by Receivers.

## Problem and Challenges

Queues are hard to implement right on MySQL. They need to scale, and issues
usually start creeping up at high QPS when they are implemented at the
application level.

This document proposes a design to address the following requirements:

* Need to support sharding transparently (and scale linearly with the number of
  shards).
* Need to not lose events.
* Extracting the events from the queue for processing needs to be scalable.
* Adding to the queue needs to be transaction-safe (so changing some content and
  adding to the queue can be committed as a single transaction).
* Events have an optional timestamp of when they should fire. If none is
  specified, they fire as soon as possible.
* May fire events more than once, if the event receiver is unresponsive, or in
  case of topology events (resharding, reparent, ...).
* Events that are successfully acked should never fire again.
* Strict event processing ordering is not a requirement.

While it is possible to implement high QPS queues using a sharded table, it
would introduce the following limitations:

* The Queue consumers need to poll the database to find new work.
* The Queue consumers have to be somewhat shard-aware, to only consume events
  from one shard.
* With a higher number of consumers on one shard, they step on each other's
  toes, and introduce database latency.

## Proposed Design

At a high level, we propose to still use a table as the backend for a Queue. It
follows the same sharding as the Keyspace it is in, meaning it is present on all
shards. However, instead of having consumers query the table directly, we
introduce the Queue Manager:

* When a tablet becomes the primary of a shard (at startup, or at reparent time),
  it creates a Queue Manager routine for each Queue Table in the schema.
* The Queue Manager is responsible for managing the data in the Queue Table.
* The Queue Manager dispatches events that are firing to a pool of
  Listeners. Each Listener maintains a streaming connection to the Queue it is
  listening on.
* The Queue Manager can then use an in-memory cache covering the next time
  interval, and be efficient.
* All requests to the Queue table are then done by Primary Key, and are very
  fast.

Each Queue Table has the following mandatory fields. At first, we can create
them using a regular schema change. Eventually, we'll support a `CREATE QUEUE
name(...)` command. Each entry in the queue has a sharding key (it could be a
hash of the trigger timestamp). The Primary Key for the table is a timestamp,
with nanosecond granularity:

* Timestamp in nanoseconds since Epoch (PK): uint64
* State (enum): to_be_fired, fired, acked.
* Some sharding key: user-defined, specified at creation time, with a vindex.
* Other columns: user-defined, specified at creation time.

A Queue is linked to a Queue Manager on the primary vttablet, on each shard. The
Queue Manager is responsible for dispatching events to the Receivers. The Queue
Manager can impose the following limits:

* Maximum QPS of events being processed.
* Maximum concurrency of event processing.
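To make the mandatory fields above concrete, here is a minimal sketch of what
the initial regular schema change could look like, assuming a hypothetical
`my_queue` table with a `user_id` sharding key and a `payload` column. The
column names, types, and the table comment used to mark the table as a Queue
Table are illustrative assumptions, not part of the design:

``` sql
-- Illustrative only: column names, types, and the comment marker are assumptions.
CREATE TABLE my_queue (
  -- Mandatory fields.
  time_scheduled BIGINT UNSIGNED NOT NULL,  -- trigger timestamp, nanoseconds since Epoch
  state ENUM('to_be_fired', 'fired', 'acked') NOT NULL DEFAULT 'to_be_fired',
  user_id BIGINT NOT NULL,                  -- sharding key, resolved through a vindex
  -- User-defined columns.
  payload VARCHAR(256),
  PRIMARY KEY (time_scheduled)
) COMMENT 'vitess_queue';
```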
*Note*: we still use the database to store the items, so any state change is
persisted. We'll explore scalability below, but the QPS for Queues will still be
limited by the insert / update QPS of each shard primary. To increase Queue
capacity, the usual horizontal sharding process can be used.

## Implementation Details

Queue Tables are marked in the schema by a comment, in a similar way to how we
detect Sequence Tables
[now](https://github.com/vitessio/vitess/blob/0b3de7c4a2de8daec545f040639b55a835361685/go/vt/vttablet/tabletserver/tabletserver.go#L138).

When a tablet becomes a primary and there are Queue Tables, it creates a
Queue Manager for each of them.

The Queue Manager has a window of 'current or soon' events (let's say the next
30 seconds or 1 minute). Upon startup, it reads these events into memory. Then,
for every modification to an event in that window, it keeps a copy of the event
in memory.

Receivers use streaming RPCs to ask for events, providing a KeyRange if
necessary to only target a subset of the Queues. That RPC is connected to the
corresponding Queue Managers on the targeted shards. It is recommended for
multiple Receivers to target the same shard, for redundancy and scalability.

Whenever an event becomes current and needs to be fired, the Queue Manager
changes the state of the event on disk to 'fired', and sends it to a random
Receiver (or the Receiver that is the least busy, or the most responsive,
whichever it is). The Receiver will then either:

* Ack the event after processing, and then the Manager can mark it 'acked' in
  the table, or even delete it.
* Not respond in time, or not ack it: the Manager can mark it as 'to be fired'
  again. In that case, the Manager can back off and try again later. The logic
  here can be as simple or as complicated as necessary.

The easiest implementation is to not guarantee an execution order for the
events. Providing ordering would not be too difficult, but it has an impact on
performance (when a Receiver doesn't respond in time, it delays everybody else,
so we'd need to time out quickly and keep going).

For 'old' events (that have been fired but not acked), we propose to use an old
event collector that retries with some kind of longer threshold / retry
period. That way we don't use the Queue Manager for old events, and keep it
firing recent events.

Event history can be managed in multiple ways. The Queue Manager should probably
have some kind of retention policy, and delete events that have been fired and
are older than N days. Deleting the table would delete the entire queue.
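To summarize the database access pattern described in this section, here is a
sketch of the kind of statements the Queue Manager could issue against the
illustrative `my_queue` table above. The exact SQL is not specified by this
design; the point is a single range scan per window refresh, and point updates
by Primary Key for everything else:

``` sql
-- Refresh the in-memory window: one range scan every N seconds.
-- :now_ns, :window_ns and the LIMIT value are illustrative placeholders.
SELECT time_scheduled, state, user_id, payload
  FROM my_queue
  WHERE state = 'to_be_fired'
    AND time_scheduled < :now_ns + :window_ns
  ORDER BY time_scheduled
  LIMIT 10000;

-- Mark an event as fired just before sending it to a Receiver (point update by PK).
UPDATE my_queue SET state = 'fired' WHERE time_scheduled = :event_pk;

-- Mark an event as acked once the Receiver confirms processing (point update by PK).
UPDATE my_queue SET state = 'acked' WHERE time_scheduled = :event_pk;
```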
## Example / Code

We create the Queue:

``` sql
CREATE QUEUE my_queue(
  user_id BIGINT(10),
  payload VARCHAR(256)
)
```

Then we change the VSchema to have a hash vindex for user_id.

Let's insert something into the queue:

``` sql
BEGIN;
INSERT INTO my_queue(user_id, payload) VALUES (10, 'I want to send this');
COMMIT;
```

We add another streaming endpoint to vtgate for queue event consumption, and one
for acking:

``` protocol-buffer
message QueueReceiveRequest {
  // caller_id identifies the caller. This is the effective caller ID,
  // set by the application to further identify the caller.
  vtrpc.CallerID caller_id = 1;

  // keyspace to target the query to.
  string keyspace = 2;

  // shard to target the query to, for unsharded keyspaces.
  string shard = 3;

  // KeyRange to target the query to, for sharded keyspaces.
  topodata.KeyRange key_range = 4;
}

message QueueReceiveResponse {
  // First response has fields, next ones have data.
  // Note we might get multiple notifications in a single packet.
  QueryResult result = 1;
}

message QueueAckRequest {
  // TODO: find a good way to reference the event, maybe
  // have an 'id' in the query result.
  repeated string id = 1;
}

message QueueAckResponse {
}

service Vitess {
  ...
  rpc QueueReceive(vtgate.QueueReceiveRequest) returns (stream vtgate.QueueReceiveResponse) {};

  rpc QueueAck(vtgate.QueueAckRequest) returns (vtgate.QueueAckResponse) {};
  ...
}
```

Then the receiver code is just an endless loop:

* Call QueueReceive(). For each QueueReceiveResponse:
  * For each row:
    * Do what needs to be done.
    * Call QueueAck() for the processed events.

*Note*: acks may need to be inside a transaction too. In that case, we may want
to remove QueueAck, and replace it with an `UPDATE my_queue SET state='acked'
WHERE <PK>=...;` statement. It might be re-written by vttablet.

## Scalability

How does this scale? The answer should be: very well. There is no expensive
query done to the database. Only a range query (which might need locking) every
N seconds. Everything else is point updates.

This approach scales with the number of shards too. Resharding will multiply the
queue processing power.

Since we have a process on the primary, we piggy-back on our regular tablet
primary election.

*Event creation*: each creation is one INSERT into the database. Creation of
events that are supposed to execute as soon as possible may create a hotspot in
the table. If so, we might be better off using the keyspace id for the primary
key, and having an index on timestamps. If the event creation is in the
execution window, we also keep it in memory. Multiple events can even be created
in the same transaction. *Optimization*: if the event is supposed to be fired
right away, we might even create it as 'fired' directly.

*Event dispatching*: mostly done by the Queue Manager. Each dispatch is one
update (by PK) for firing, and one update (by PK again) for acking. The RPC
overhead for sending these events and receiving the acks from the receivers is
negligible (as it has no DB impact).

*Queue Manager*: it does one scan of the upcoming events for the next N seconds,
every N seconds. Hopefully that query is fast. Note we could add a 'LIMIT' to
that query and only consider the next N seconds, or the period that has 10k
events (value TBD), whichever is smaller.

*Event Processing*: a pool of Receivers is consuming the events. Scale that up
to as many as is needed, with any desired redundancy. The shard primary will
send the event to one receiver only, until it is acked, or times out.

*Hot Spots in the Queue table*: with very high QPS, we will end up having a hot
spot on the window of events that are coming up. A proposed solution is to use
an 'event_name' that has a unique value, and use it as the primary key. When the
Queue Manager gets new events, however, it will still need to sort them by
timestamp, and will therefore require an index. It is unclear which solution
would work better; we'd need to experiment. For instance, if most events are
only present in the table while they are being processed, the entire table will
have high QPS anyway.
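As a sketch of the alternative layout mentioned above (illustrative only: the
column names follow the earlier example, and whether this actually beats the
timestamp Primary Key is exactly what the experiment would need to show):

``` sql
-- Illustrative alternative: a unique event name as the primary key,
-- with a secondary index so the Queue Manager can still scan by timestamp.
CREATE TABLE my_queue (
  event_name VARBINARY(256) NOT NULL,       -- unique value, avoids the timestamp hot spot
  time_scheduled BIGINT UNSIGNED NOT NULL,  -- trigger timestamp, nanoseconds since Epoch
  state ENUM('to_be_fired', 'fired', 'acked') NOT NULL DEFAULT 'to_be_fired',
  user_id BIGINT NOT NULL,                  -- sharding key, resolved through a vindex
  payload VARCHAR(256),
  PRIMARY KEY (event_name),
  KEY time_scheduled_idx (time_scheduled)
) COMMENT 'vitess_queue';
```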
## Caveats

This is not meant to replace UpdateStream. UpdateStream is meant to send one
notification for each database event, without extra database cost, and without
any filtering. Queues have a database cost.

Events can be fired multiple times, in corner cases. However, we always do a
database access before firing the event, so in case of a reparent or
resharding, that database access should fail. We should therefore stop
dispatching events as soon as the primary is read-only. Also, if we receive an
ack when the database is read-only, we can't store it. With the Query Buffering
feature, vtgate may be able to retry later on the new primary.

## Extensions

The following extensions can be implemented, but won't be at first:

* *Queue Ordering*: only fire one event at a time, in order.

* *Event Groups*: group events together using a common key, and when they're
  all done, fire another event.