# Vitess Queues

This document describes a proposed design and implementation guidelines for a
Queue feature inside Vitess. Events can be added to a Queue, and processed
asynchronously by Receivers.

## Problem and Challenges

Queues are hard to implement correctly on top of MySQL. They need to scale, and
application-level implementations usually run into issues at high QPS.

This document proposes a design to address the following requirements:

* Need to support sharding transparently (and scale linearly with number of
  shards).
* Need to not lose events.
* Extracting the events from the queue for processing needs to be scalable.
* Adding to the queue needs to be transaction-safe (so changing some content and
  adding to the queue can be committed as a transaction).
* Events have an optional timestamp of when they should fire. If none is
  specified, it will be as soon as possible.
* May fire events more than once, if the event receiver is unresponsive, or in
  case of topology events (resharding, reparent, ...).
* Events that are successfully acked should never fire again.
* Strict event processing ordering is not a requirement.

While it is possible to implement high-QPS queues using a sharded table, it
would introduce the following limitations:

* The Queue consumers need to poll the database to find new work.
* The Queue consumers have to be somewhat shard-aware, to only consume events
  from one shard.
* With a higher number of consumers on one shard, they step on each other's
  toes and introduce database latency.

## Proposed Design

At a high level, we propose to still use a table as the backend for a Queue. It
follows the same sharding as the Keyspace it is in, meaning it is present on all
shards. However, instead of having consumers query the table directly, we
introduce the Queue Manager:

* When a tablet becomes the primary of a shard (at startup, or at reparent time),
  it creates a Queue Manager routine for each Queue Table in the schema.
* The Queue Manager is responsible for managing the data in the Queue Table.
* The Queue Manager dispatches events that are firing to a pool of
  Listeners. Each Listener maintains a streaming connection to the Queue it is
  listening on.
* The Queue Manager can then use an in-memory cache covering the next time
  interval, which keeps it efficient.
* All requests to the Queue table are then done by Primary Key, and are very
  fast.

Each Queue Table has the following mandatory fields (a Go sketch of the
resulting row shape follows the list). At first, we can create them using a
regular schema change. Eventually, we'll support a `CREATE QUEUE name(...)`
command. Each entry in the queue has a sharding key (which could be a hash of
the trigger timestamp). The Primary Key for the table is a timestamp with
nanosecond granularity:

* Timestamp in nanoseconds since Epoch (PK): uint64
* State (enum): to_be_fired, fired, acked.
* Some sharding key: user-defined, specified at creation time, with vindex.
* Other columns: user-defined, specified at creation time.
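
For concreteness, here is a minimal Go sketch of the row shape these fields
imply. The type and field names (`QueueEvent`, `TimeScheduled`, ...) are
illustrative only, not part of the proposal:

``` go
package queue

import "time"

// EventState mirrors the proposed `state` column.
type EventState int

const (
    ToBeFired EventState = iota
    Fired
    Acked
)

// QueueEvent is an in-memory view of one row of a Queue Table.
// Field names and Go types are illustrative only.
type QueueEvent struct {
    TimeScheduled uint64     // nanoseconds since Epoch, used as the Primary Key
    State         EventState // to_be_fired, fired or acked
    ShardingKey   []byte     // user-defined sharding column, resolved through a vindex
    Payload       []byte     // stands in for the other user-defined columns
}

// FireTime converts the primary key back into a time.Time.
func (e *QueueEvent) FireTime() time.Time {
    return time.Unix(0, int64(e.TimeScheduled))
}
```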

A Queue is linked to a Queue Manager on the primary vttablet, on each shard. The
Queue Manager is responsible for dispatching events to the Receivers. The Queue
Manager can impose the following limits (a throttling sketch follows the list):

* Maximum QPS of events being processed.
* Maximum concurrency of event processing.
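
As an illustration, the sketch below shows one way such limits could be
enforced, using `golang.org/x/time/rate` for the QPS cap and a buffered channel
as a concurrency semaphore. The `throttler` type and its wiring are assumptions
for this sketch, not a committed design:

``` go
package queue

import (
    "context"

    "golang.org/x/time/rate"
)

// throttler caps both the rate and the concurrency of event dispatch.
// The limits would come from the Queue configuration; values here are
// placeholders.
type throttler struct {
    limiter *rate.Limiter // maximum events per second
    slots   chan struct{} // maximum in-flight events
}

func newThrottler(maxQPS float64, maxConcurrent int) *throttler {
    return &throttler{
        limiter: rate.NewLimiter(rate.Limit(maxQPS), 1),
        slots:   make(chan struct{}, maxConcurrent),
    }
}

// acquire blocks until the event is allowed to be dispatched. The caller
// must call release() once the event is acked or times out.
func (t *throttler) acquire(ctx context.Context) error {
    if err := t.limiter.Wait(ctx); err != nil {
        return err
    }
    select {
    case t.slots <- struct{}{}:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

func (t *throttler) release() { <-t.slots }
```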

*Note*: we still use the database to store the items, so any state change is
persisted. We'll explore scalability below, but the QPS for Queues will still be
limited by the insert / update QPS of each shard primary. To increase Queue
capacity, the usual horizontal sharding process can be used.

## Implementation Details

Queue Tables are marked in the schema by a comment, in a similar way to how we
detect Sequence Tables
[today](https://github.com/vitessio/vitess/blob/0b3de7c4a2de8daec545f040639b55a835361685/go/vt/vttablet/tabletserver/tabletserver.go#L138).

When a tablet becomes a primary, and there are Queue Tables, it creates a
Queue Manager for each of them.

The Queue Manager has a window of ‘current or soon’ events (let’s say the next
30 seconds or 1 minute). Upon startup, it reads these events into memory. It
then keeps that in-memory copy up to date with every event modification that
falls inside the window.
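
A rough sketch of that in-memory window, reusing the hypothetical `QueueEvent`
type from the earlier sketch; `loadWindow` stands in for the range query
against the Queue Table:

``` go
package queue

import (
    "sync"
    "time"
)

// queueManager caches the events scheduled within the next `window`
// (for example 30s or 1m). This is a sketch of the data structure only.
type queueManager struct {
    mu       sync.Mutex
    window   time.Duration
    upcoming map[uint64]*QueueEvent // keyed by the timestamp primary key
}

func newQueueManager(window time.Duration) *queueManager {
    return &queueManager{
        window:   window,
        upcoming: make(map[uint64]*QueueEvent),
    }
}

// refresh reloads the window from the Queue Table. loadWindow is a
// placeholder for the actual range query by primary key.
func (qm *queueManager) refresh(loadWindow func(from, to time.Time) []*QueueEvent) {
    now := time.Now()
    events := loadWindow(now, now.Add(qm.window))

    qm.mu.Lock()
    defer qm.mu.Unlock()
    qm.upcoming = make(map[uint64]*QueueEvent, len(events))
    for _, e := range events {
        qm.upcoming[e.TimeScheduled] = e
    }
}

// observe keeps the cache in sync with inserts/updates that fall inside
// the window.
func (qm *queueManager) observe(e *QueueEvent) {
    if e.FireTime().Before(time.Now().Add(qm.window)) {
        qm.mu.Lock()
        qm.upcoming[e.TimeScheduled] = e
        qm.mu.Unlock()
    }
}
```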

Receivers use streaming RPCs to ask for events, providing a KeyRange if
necessary to only target a subset of the Queues. That RPC is connected to the
corresponding Queue Managers on the targeted shards. It is recommended that
multiple Receivers target the same shard, for redundancy and scalability.

Whenever an event becomes current and needs to be fired, the Queue Manager
changes the state of the event on disk to ‘fired’, and sends it to a random
Receiver (or the least busy or most responsive Receiver, whichever heuristic we
choose). The Receiver will then either (see the dispatch sketch after this
list):

* Ack the event after processing, and then the Manager can mark it ‘acked’ in
  the table, or even delete it.
* Not respond in time, or not ack it: the Manager can mark it as ‘to be fired’
  again. In that case, the Manager can back off and try again later. The logic
  here can be as simple or complicated as necessary.
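
Sketched in Go, under the same assumptions as the earlier snippets; `markState`
and `sendAndWaitForAck` are placeholders for the point update by Primary Key
and the streaming-RPC round trip:

``` go
package queue

import (
    "context"
    "time"
)

// dispatch fires one event: mark it 'fired' on disk, hand it to a
// Receiver, then mark it 'acked' (or back to 'to_be_fired' on timeout).
// This is a sketch; error handling and retry policy are elided.
func dispatch(ctx context.Context, e *QueueEvent,
    markState func(pk uint64, s EventState) error,
    sendAndWaitForAck func(ctx context.Context, e *QueueEvent) error) error {

    // Persist the state change before the event leaves the primary.
    if err := markState(e.TimeScheduled, Fired); err != nil {
        return err
    }

    // Give the Receiver a bounded amount of time to process and ack.
    ackCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    if err := sendAndWaitForAck(ackCtx, e); err != nil {
        // No ack: make the event eligible to fire again later.
        return markState(e.TimeScheduled, ToBeFired)
    }
    return markState(e.TimeScheduled, Acked)
}
```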

The easiest implementation is to not guarantee an execution order for the
events. Providing ordering would not be too difficult, but it has an impact on
performance (when a Receiver doesn't respond in time, it delays everybody else,
so we'd need to time out quickly and keep going).

For ‘old’ events (that have been fired but not acked), we propose to use an old
event collector that retries with some kind of longer threshold / retry
period. That way we don't use the Queue Manager for old events, and keep it
focused on firing recent events.

Event history can be managed in multiple ways. The Queue Manager should probably
have some kind of retention policy, and delete events that have been fired and
are older than N days. Deleting the table would delete the entire queue.
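
For illustration, the old event collector and the retention policy above could
boil down to periodic queries along these lines (table and column names, bind
variables and thresholds are all placeholders):

``` go
package queue

// Queries the old event collector and the retention policy might run
// periodically. All identifiers and thresholds are illustrative.
const (
    // Re-fire events that were marked 'fired' long ago but never acked.
    requeueStaleEventsQuery = `
UPDATE my_queue
   SET state = 'to_be_fired'
 WHERE state = 'fired'
   AND time_scheduled < :stale_threshold`

    // Drop acked events older than the retention period (N days).
    purgeOldEventsQuery = `
DELETE FROM my_queue
 WHERE state = 'acked'
   AND time_scheduled < :retention_threshold`
)
```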

## Example / Code

We create the Queue:

``` sql
CREATE QUEUE my_queue(
  user_id BIGINT(10),
  payload VARCHAR(256)
);
```

Then we change the VSchema to add a hash vindex for user_id.

Let's insert something into the queue:

``` sql
BEGIN;
INSERT INTO my_queue(user_id, payload) VALUES (10, 'I want to send this');
COMMIT;
```

We add another streaming endpoint to vtgate for queue event consumption, and one
for acking:

``` protocol-buffer
message QueueReceiveRequest {
  // caller_id identifies the caller. This is the effective caller ID,
  // set by the application to further identify the caller.
  vtrpc.CallerID caller_id = 1;

  // keyspace to target the query to.
  string keyspace = 2;

  // shard to target the query to, for unsharded keyspaces.
  string shard = 3;

  // KeyRange to target the query to, for sharded keyspaces.
  topodata.KeyRange key_range = 4;
}

message QueueReceiveResponse {
  // First response has fields, next ones have data.
  // Note we might get multiple notifications in a single packet.
  query.QueryResult result = 1;
}

message QueueAckRequest {
  // TODO: find a good way to reference the event, maybe
  // have an 'id' in the query result.
  repeated string id = 1;
}

message QueueAckResponse {
}

service Vitess {
  ...
  rpc QueueReceive(vtgate.QueueReceiveRequest) returns (stream vtgate.QueueReceiveResponse) {};

  rpc QueueAck(vtgate.QueueAckRequest) returns (vtgate.QueueAckResponse) {};
  ...
}
```

Then the receiver code is just an endless loop (sketched in Go below):

* Call QueueReceive(). For each QueueReceiveResponse:
  * For each row:
    * Do what needs to be done.
    * Call QueueAck() for the processed events.
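
A Go sketch of that loop. Since the RPCs above are only proposed, `queueClient`
is a hand-written stand-in for the client that would eventually be generated
from the proto definitions; none of this is an existing Vitess API:

``` go
package queue

import "context"

// queueEventRow is one row streamed back by QueueReceive; the id field is
// the reference used for acking (see the TODO in QueueAckRequest).
type queueEventRow struct {
    ID      string
    Payload []byte
}

// queueClient is a stand-in for the client generated from the proposed
// QueueReceive / QueueAck RPCs.
type queueClient interface {
    // QueueReceive streams batches of events for the given keyspace
    // (a KeyRange could also be passed; omitted here for brevity).
    QueueReceive(ctx context.Context, keyspace string) (<-chan []queueEventRow, error)
    // QueueAck acknowledges processed events by id.
    QueueAck(ctx context.Context, ids []string) error
}

// receiveLoop is the endless loop described above: receive, process, ack.
func receiveLoop(ctx context.Context, client queueClient, process func(queueEventRow) error) error {
    batches, err := client.QueueReceive(ctx, "my_keyspace")
    if err != nil {
        return err
    }
    for batch := range batches {
        var done []string
        for _, row := range batch {
            if err := process(row); err != nil {
                continue // not acked: the event will fire again later
            }
            done = append(done, row.ID)
        }
        if len(done) > 0 {
            if err := client.QueueAck(ctx, done); err != nil {
                return err
            }
        }
    }
    return ctx.Err()
}
```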

*Note*: acks may need to be inside a transaction too. In that case, we may want
to remove the QueueAck RPC and replace it with an `UPDATE my_queue SET
state='acked' WHERE <PK>=...;` statement, which might be rewritten by vttablet.

## Scalability

How does this scale? The answer should be: very well. There is no expensive
query done to the database, only a range query (which might need locking) every
N seconds. Everything else is point updates.

This approach scales with the number of shards too. Resharding will multiply the
queue processing power.

Since we have a process on the primary, we piggy-back on our regular tablet
primary election.

*Event creation*: each creation is one INSERT into the database. Creation of
events that are supposed to execute as soon as possible may create a hotspot in
the table. If so, we might be better off using the keyspace id as the primary
key, and having an index on timestamps. If the event creation falls in the
execution window, we also keep it in memory. Multiple events can even be created
in the same transaction. *Optimization*: if the event is supposed to be fired
right away, we might even create it as ‘fired’ directly.

*Event dispatching*: mostly done by the Queue Manager. Each dispatch is one
update (by PK) for firing, and one update (by PK again) for acking. The RPC
overhead for sending these events and receiving the acks from the receivers is
negligible (as it has no DB impact).

*Queue Manager*: it does one scan of the upcoming events for the next N seconds,
every N seconds. Hopefully that query is fast. Note we could add a ‘LIMIT’ to
that query and only consider the next N seconds, or the period that has 10k
events (value TBD), whichever is smaller.
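
For reference, a sketch of that periodic scan, written out as the query the
Queue Manager could issue (column names, the `:horizon` bind variable and the
LIMIT value are placeholders):

``` go
package queue

// upcomingEventsQuery is the only range query the Queue Manager runs,
// once every N seconds. Everything else is a point update by primary key.
const upcomingEventsQuery = `
SELECT time_scheduled, state, sharding_key, payload
  FROM my_queue
 WHERE state = 'to_be_fired'
   AND time_scheduled < :horizon
 ORDER BY time_scheduled
 LIMIT 10000`
```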

*Event Processing*: a pool of Receivers is consuming the events. Scale that up
to as many as needed, with any desired redundancy. The shard primary will send
the event to one receiver only, until it is acked, or times out.

*Hot Spots in the Queue table*: with very high QPS, we will end up having a hot
spot on the window of events that are coming up. A proposed solution is to use
an 'event_name' that has a unique value, and use it as the primary key. When the
Queue Manager gets new events however, it will still need to sort them by
timestamp, and therefore will require an index. It is unclear which solution
will work better; we'd need to experiment. For instance, if most events are only
present in the table while they are being processed, the entire table will have
high QPS anyway.

## Caveats

This is not meant to replace UpdateStream. UpdateStream is meant to send one
notification for each database event, without extra database cost, and without
any filtering. Queues have a database cost.
   247  
   248  Events can be fired multiple times, in corner cases. However, we always do a
   249  database access before firing the event. So in case of reparent / resharding,
   250  that database access should fail. So we should stop dispatching events as soon
   251  as the primary is read-only. Also, if we receive an ack when the database is
   252  read-only, we can’t store it. With the Query Buffering feature, vtgate may be
   253  able to retry later on the new primary.

## Extensions

The following extensions can be implemented, but won't be at first:

* *Queue Ordering*: only fire one event at a time, in order.

* *Event Groups*: group events together using a common key, and when they're
  all done, fire another event.