github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20180501_change_data_capture.md

github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20180501_change_data_capture.md (about)

1 - Feature Name: Change Data Capture (CDC)
2 - Status: draft
3 - Start Date: 2018-05-01
4 - Authors: Arjun Narayan, Daniel Harrison, Nathan VanBenschoten, Tobias Schottdorf
5 - RFC PR: (PR # after acceptance of initial draft)
6 - Cockroach Issue: [#2656], [#6130]
7
8
9 # Summary
10
11 Change Data Capture (CDC) provides efficient, distributed, row-level change
12 subscriptions.
13
14
15 # Motivation
16
17 CockroachDB is an excellent system of record, but no technology exists in a
18 vacuum. Users would like to keep their data mirrored in full-text indexes,
19 analytics engines, and big data pipelines, among others. A pull model can
20 currently be used to do this, but it’s inefficient, overly manual, and doesn’t
21 scale; the push model described in this RFC addresses these limitations.
22
23 Anecdotally, CDC is one of the more frequently requested features for
24 CockroachDB.
25
26
27 # Guide-level explanation
28
29 The core primitive of CDC is the `CHANGEFEED`. Changefeeds target a whitelist of
30 databases, tables, partitions, rows, or a combination of these, called the
31 "watched rows". Every change to a watched row is emitted as a record in a
32 configurable format (initially JSON or Avro) to a configurable sink (initially
33 Kafka). Changefeeds can be paused, unpaused, and cancelled.
34
35 Changefeeds scale to any size CockroachDB cluster and are designed to impact
36 production traffic as little as possible.
37
38 By default, when a new changefeed is created, an initial timestamp is chosen and
39 the current value of each watched row as of that timestamp is emitted (similar
40 to `SELECT ... AS OF SYSTEM TIME`). After this "initial scan", updates to
41 watched rows are emitted. The `WITH cursor=<timestamp>` syntax can be used to
42 start a new changefeed that skips the initial scan and only emits all changes at
43 or after the user-given timestamp.
44
45 Kafka was chosen as our initial sink because of customer demand and its status
46 as the clear frontrunner in the category. Most of these customers also use the
47 associated Confluent Platform ecosystem and Avro format; our changefeeds are
48 designed to integrate well with them. Other sinks will be added as demand
49 dictates, but the details of this are out of scope. The rest of this document is
50 written with Kafka specifics, like topics and partitions, to make it easier to
51 understand, but these same processes will work for other sinks.
52
53
54 ## Row Ordering
55
56 Each emitted record contains the data of a changed row along with the timestamp
57 associated with the transaction that updated the row. Rows are sharded between
58 Kafka partitions by the row’s primary key. Tables with more than one [column
59 family] will require some buffering and are more expensive in a changefeed. If a
60 row is modified more than once in the same transaction, only the last will be
61 emitted.
62
63 Once a row has been emitted with some timestamp, no _previously unseen_ versions
64 of that row will be emitted with a lower timestamp. In the common case, each
65 version of a row will be emitted once, but some (infrequent) conditions will
66 cause them to be repeated, giving our changefeeds an at-least-once delivery
67 guarantee. See [Full-text index example: Elasticsearch]’s "external versions"
68 for one way to be resilient to these repetitions.
69
70 Many of the sinks we could support have limited exactly-once guarantees on the
71 consumer side, so there's not much to be gained by offering exactly-once
72 delivery on the producer side, when that's even possible. Kafka, however, does
73 have the necessary primitives for us to build an option to provide [exactly-once
74 delivery] at the cost of performance and scalability.
75
76 Cross-row and cross-table order guarantees are not given.
77
78 These particular ordering and delivery guarantees were selected to allow simple
79 cases to be easy and low-latency (after the initial catch-up and in the absence
80 of rebalancing, perhaps single or double digit milliseconds).
81
82
83 ## Cross-Row and Cross-Table Ordering
84
85 Some users require stronger order guarantees, so we provide the pieces necessary
86 to reconstruct them. (See [Transaction grouped changes] for why CockroachDB
87 doesn’t do this ordering itself.)
88
89 As mentioned above, each emitted record contains the timestamp of the
90 transaction that updated the row. This can be used with the in-stream _timestamp
91 resolved_ notifications on every kafka partition. The timestamp included in this
92 record is a guarantee that no previously unseen version of _any_ row will be
93 emitted to that partition afterward will have a higher timestamp.
94
95 Together, these can be used to give strong ordering and global consistency
96 guarantees by buffering records in between timestamp closures. See [Data
97 warehouse example: Amazon Redshift] below for a concrete example of how this
98 works.
99
100
101 ## Syntax Examples
102
103 Creating a changefeed is accomplished via `CREATE CHANGEFEED`:
104
105 ```sql
106 CREATE CHANGEFEED <name> FOR TABLE tweets INTO 'kafka://host:port' WITH <...>
107 CREATE CHANGEFEED <name> FOR TABLE tweets VALUES FROM (1) TO (2) INTO <...>
108 CREATE CHANGEFEED <name> FOR TABLE tweets PARTITION north_america INTO <...>
109 CREATE CHANGEFEED <name> FOR DATABASE db INTO <...>
110 ```
111
112 - INTO `<sink>` is a URI that contains all the configuration information needed
113 to connect to a given sink. Any necessary configuration is done through query
114 parameters. This is similar to how the URIs work in BACKUP/RESTORE/IMPORT.
115 - `kafka://host:port` Kafka is used as the sink. The bootstrap server is
116 specified as the host and port.
117 - `?kafka_topic_prefix=<...>` A string to prepend to the topic names used by
118 this changefeed.
119 - `?schema_topic=<...>` A Kafka topic to emit all schema changes to.
120 - `?confluent_schema_registry=<address>` The address of a schema registry
121 instance. Only allowed with the avro format. When unspecified, no schema
122 registry is used.
123 - `experimental-sql:///` A [SQL table sink] in the same CockroachDB cluster.
124 - `WITH <...>`
125 - `WITH envelope=<...>` Records have a key and a value. The key is always set
126 to the primary key of the changed row and the value’s contents are
127 controlled with this option.
128 - `envelope='none'` [DEFAULT] The new values in the row (or empty for a
129 deletion). This works with Kafka log compaction. The system may internally
130 need to read additional column families.
131 - `envelope='key_only'` The value is always empty. No additional reads are
132 ever needed.
133 - `envelope='diff'` The value is a composite with the old row value and the
134 new row value. The old row value is empty for INSERTs and the new row
135 value is empty for DELETEs. The system will internally need additional
136 reads.
137 - `WITH format=<...>`
138 - `format='json'` [DEFAULT] The record key is a serialized JSON array. The
139 record value is a serialized JSON object mapping column names to column
140 values.
141 - `format='avro'` The record key is an [Avro] array, serialized in the
142 binary format. The record value is an Avro record, mapping column names to
143 column values, serialized in the binary format.
144 - `WITH cursor=<timestamp>` can be used to set the initial high-water mark. If
145 this option is used, the changefeed will emit any changes after the given
146 timestamp and no initial scan.
147
148 See [other CDC syntaxes] for alternatives that were considered.
149
150
151 ## Altering the changefeed
152
153 `PAUSE CHANGEFEED <name>` and `RESUME CHANGEFEED <name>` can be used to pause
154 and unpause changefeeds.
155
156 `DROP CHANGEFEED <name>` can be used to remove a changefeed.
157
158 `SHOW CREATE CHANGEFEED <name>` works similarly to the other `SHOW CREATE`
159 command and can be used view the changefeed definition.
160
161 The following `ALTER CHANGEFEED` commands can be used to change the
162 configuration. Only paused changefeed jobs can be altered, so the user may have
163 to `PAUSE CHANGEFEED` and `RESUME CHANGEFEED` as necessary.
164
165 `ALTER CHANGEFEED <name> INTO <...>` can be used to migrate Kafka clusters.
166
167 `ALTER CHANGEFEED <name> SET key=<...>` can be used to change the `WITH`
168 options.
169
170 `ALTER CHANGEFEED <name> SET cursor=<timestamp>` can be used to manually adjust
171 the high-water mark forward or back. When resumed, the changefeed will emit any
172 changes after the given timestamp, but will not emit the initial catch-up. To do
173 this, the changefeed must be cancelled and replaced with a new one.
174
175
176 ## Full-text index example: Elasticsearch
177
178 MovR is a fictional ride sharing company. Their users are quite particular about
179 which vehicle picks them up, so MovR has added a search box for unstructured
180 text when booking a ride. A full-text index is needed to make this work, but
181 CockroachDB doesn't (yet) offer one, so MovR uses CDC to keep vehicle
182 descriptions and their current availability in [ElasticSearch].
183
184 The authoritative store for vehicles is a table in CockroachDB:
185
186 ```sql
187 CREATE TABLE vehicles (
188 id UUID NOT NULL,
189 type STRING NULL,
190 status STRING NULL,
191 <...>
192 )
193 ```
194
195 The `type` field is unstructured and can be anything from "BMW Z4" to "Vespa
196 Scooter" to "1980s military tank". The `status` field is updated when the
197 vehicle becomes available or unavilable.
198
199 This is watched by a changefeed and emitted to Kafka:
200
201 ```sql
202 CREATE CHANGEFEED feed_vehicles FOR TABLE movr.vehicles INTO 'kafka:/...'
203 ```
204
205 When initially run, the `CREATE CHANGEFEED` takes a snapshot of all the initial
206 states, consistent as of a timestamp and emits them to Kafka. All subsequest
207 changes are emitted as they happen. The default options are suitable for use
208 with the Confluent [ElasticsearchSinkConnector], so Movr uses this to tail the
209 Kafka `vehicles` topic and load it into ElasticSearch.
210
211 This is everything MovR needs for an ElasticSearch query over the `type` and
212 `status` fields to power the available vehicle search in the MovR app.
213
214
215 ## Data warehouse example: Amazon Redshift
216
217 MovR uses real-time analytics to determine whether there is more demand or
218 supply and adjust prices accordingly. The algorithm used is much easier to write
219 if it's always operating on a consistent view of the `rides` tables, but it
220 needs to run once per minute to keep prices up to date. CockroachDB is not (yet)
221 tuned for analytics, so MovR has decided to use Amazon Redshift in the meantime.
222 This Redshift instance is also used by MovR's Tableau dashboards.
223
224 A changefeed watches the `rides` table and continually emits changes to Kafka:
225
226 ```sql
227 CREATE CHANGEFEED feed_analytics FOR DATABASE movr.rides INTO 'kafka:/' WITH
228 format='avro',
229 confluent_schema_registry='<address>'
230 ```
231
232 The entire history of rides must be available in Redshift for the pricing
233 algorithm to work, but occasionally new columns are added, so the Avro format
234 and the Confluent Schema Registry are used to deal with schema changes.
235
236 MovR is popular, so the rides don't fit in one Kafka partition and are sharded
237 between `N` of them. A set of `N` distributed processes are run, each tailing
238 one partition and writing every ride to s3 as it comes in. Each ride has a
239 timestamp and is written to s3 bucketed first by minute then by Kafka partition.
240
241 ```
242 s3://movr/rides/2018-05-06-07-05/partition-0
243 s3://movr/rides/2018-05-06-07-05/partition-1
244 ...
245 s3://movr/rides/2018-05-06-07-05/partition-N
246 s3://movr/rides/2018-05-06-07-06/partition-0
247 ...
248 s3://movr/rides/2018-05-06-07-06/partition-N
249 ```
250
251 When the `partition-0` process receives a resolved timestamp after 2018-05-06
252 07:06, then it's guaranteed that no previously unseen values for the
253 `2018-05-06-07-05` bucket will be emitted on that partition. (These
254 notifications typically trail realtime by about 10 seconds.) Once this has
255 happened for all of the partitions, then this bucket contains every value that
256 it ever will and MovR uses the Amazon Redshift `COPY` command to load it in from
257 s3. Redshift now contains a consistent snapshot of the table.
258
259
260 ## Failures
261
262 If Kafka is unavailable or underprovisioned the changefeed will buffer records
263 as necessary. Buffering is limited by the `--max-changefeed-memory` and
264 `--max-disk-changefeed-storage` flags. As with any changefeed, degraded sinks
265 minimally affect foreground traffic. Further, one table failing will not affect
266 other tables, even ones watched by the same changefeed.
267
268 It is crucial that Kafka recovers before the watched rows (and the system
269 tables) exit their respective [garbage collection] TTL (default 25 hours) with
270 enough time for the changefeed to catch up. This is similar to the restriction
271 on [incremental backups]. A metric is exported to monitor how close to the GC
272 TTL each changefeed’s most behind timestamp is. Production users should monitor
273 and alert on this metric.
274
275 If the sink is down for too long, some data may be lost and the changefeed will
276 be marked as failed. User intervention is required. Either a new changefeed can
277 be created with the existing Kafka topics and an appropriate `WITH cursor=<...>`
278 (which will leave a hole of missing data) or a new changefeed can be restarted
279 with an empty sink (a clean slate).
280
281 If Kafka recovery won't happen in time, the relevant garbage collection TTLs can
282 be manually increased by the user. This affects disk usage and query
283 performance, so the system will not do it automatically. Similarly, additional
284 partitions are added by the user to underprovisioned sinks because the
285 key-to-partition sharding would change. This would break any consumers that
286 assume all keys are in the same partition, a common assumption.
287
288 Records are first buffered in memory, but this is limited to at most
289 `--max-changefeed-memory` bytes across all changefeeds on a node. If this limit
290 is reached, records are then spilled to disk, up to the
291 `--max-disk-changefeed-storage` flag. If this limit is also reached, the
292 changefeed will enter a "stalled" state and will tear down its internal
293 connections to reduce load on the cluster. When the backlog clears out, the
294 changefeed will restart from where it left off.
295
296 There are also scenarios where this is concern even when the sink is healthy.
297 Particularly, if a changefeed is started on a large set of watched rows, the
298 initial scan may take longer than the garbage collection TTL. In this case, the
299 user will have to temporarily raise the GC TTL while the initial scan runs.
300
301
302 ## AdminUI
303
304 A new page is used in the AdminUI to show the status of changefeeds, distinct
305 from the "Jobs" page because changefeeds don't have a finite lifespan.
306 Incrementally updated materialized views, for example, would also appear on this
307 page. Similarly, a counterpart to `SHOW JOBS` is introduced.
308
309
310 # Reference-level explanation
311
312 Each changefeed is structured as a single coordinator which sets up a
313 long-running [DistSQL] flow with a new processor described below.
314
315 Each changefeed is given an entry in the [system jobs], which provides this
316 fault-resistent coordinator. Briefly, this works by a time based lease, which
317 must be continually extended; if it is not, a new coordinator is started. Some
318 failure scenarios result in a split-brain coordinator, but our ordering and
319 duplicate delivery guarantees were chosen to account for this.
320
321 In addition to the 0-100% progress tracking, the `system.jobs` tracking is
322 extended with an option to show the minimum of all resolved timestamps.
323
324
325 ## DistSQL flow
326
327 ```proto
328 // ChangeAggregatorSpec is the specification for a processor that watches for
329 // changes in a set of spans. Each span may cross multiple ranges.
330 message ChangeAggregatorSpec {
331 message Watch {
332 util.hlc.Timestamp timestamp = 1;
333 roachpb.Span span = 2;
334 }
335 repeated Watch watches = 1;
336 bool initial_scan = 2;
337 }
338 ```
339
340 Each flow is made of a set of `ChangeAggregator` processors. Progress
341 information in the form of `(span, timestamp)` pairs is passed back to the job
342 coordinator via DistSQL metadata.
343
344 DistSQL is currently optimized for short-running queries and doesn’t yet have a
345 facility for resilience to processor errors. Instead, it shuts down the whole
346 flow. This is unfortunate for CDC, but failures should be sufficiently
347 infrequent and the amount of work duplicated when the flow is restarted
348 sufficiently small that it should be workable in practice. The largest impact is
349 on tail latencies. If/when DistSQL's capabilities change, this will be
350 revisited.
351
352 `ChangeAggregator`s are scheduled such that each is responsible for data local
353 to the node it’s running on. If the data it’s watching moves, the local data
354 will become remote, which will continue to work, but will use bandwidth and add
355 a network hop to latency. DistSQL doesn’t yet have a mechanism for moving or
356 reconfiguring processors in a running flow, but leaving this state indefinitely
357 is unfortunate. Instead, the job coordinator will periodically run some
358 heuristics and when it’s advantageous, stop and restart the flow such that all
359 processors will be doing local watches again. If the heuristics are not
360 aggressive enough, the user always has an escape hatch of manually pausing and
361 resuming the job. This restarting of the flow is costly in proportion to the
362 size of the changefeed and will affect tail latencies, so something should be
363 done here in future work.
364
365 GDPR will also require that processors be scheduled in a certain region, once
366 that's supported by DistSQL.
367
368
369 ## ChangeAggregator processor
370
371 A `ChangeAggregator` is responsible for watching for changes in a set of
372 disjoint spans, buffering during the initial catch up, translating kvs into sql
373 rows, and emitting to a sink. Additionally, it bookkeeps and emits the timestamp
374 resolved notifications described in [Cross-Row and Cross-Table Ordering].
375
376 If the `initial_scan` option is set, then the current value of every watched row
377 at `timestamp` is emitted. `ChangeAggregator` then proceeds to set up
378 RangeFeeds.
379
380
381 ### RangeFeed
382
383 The [RangeFeed] (previously called ChangeFeed) command sets up a stream of all
384 changes to keys in a given span. The changes are returned exactly as they are
385 [proposed and sent through raft]: in RocksDB WriteBatch format, possibly with
386 unrelated data (parts of the range’s keyspace not being watched by the
387 `RangeFeed` and range-local keys). Notably, the range feed will simply stream
388 intents upstream (which means these intents may or may not commit or move
389 timestamps), and the distributed architecture has to deduplicate/track/resolve
390 them.
391
392 Passing the raft proposal through unchanged minimizes cpu load on the replica
393 serving the `RangeFeed`. This is not initially critical, since we’ll just be
394 scheduling the processors on the same node, but gives us flexibility later to
395 make changefeeds extremely low-impact on production traffic. (Note that a
396 `RangeFeed` can be thought of as a slightly lighter weight additional Raft
397 follower.) Using the same format as proposer-evaluated kv also lesses the chance
398 that we’ll have tricky rpc migration issues in the future.
399
400 The amount of unrelated data returned will be low if most or all of the range’s
401 keyspace is being watched, but in the worst case (changes touching many keys in
402 the range), may be high if only a small subset is watched. This means table and
403 partition watches will be efficient, but the efficiency of single-row watches
404 will depend on the workload. We may support single-row watches, but are not
405 optimizing for them yet, so this is okay.
406
407 `RangeFeed`s are run on the leaseholder for now. It should be possible to run
408 `RangeFeed`s on followers in the future, which becomes especially important when
409 ranges are geographically distributed. However, there are complications to work
410 through. Importantly, this follower could be the slowest one, increasing
411 commit-to-kafka changefeed latency. We also don’t currently have common code for
412 geography-aware scheduling; this would have to be built.
413
414 A new `RangeFeed` first registers itself to emit any relevant raft commands as
415 they commit. These are sent to the (ordered) rpc stream in the same order they
416 are in the raft log. These are buffered by `ChangeAggregator`.
417
418 After the `RangeFeed` is set up, the `ChangeAggregator` uses `ExportRequest` to
419 catch up on any changes between `watch.timestamp` and some timestamp after the
420 raft hook was registered. These partially overlap the `RangeFeed` output and
421 must be deduplicated with it. `ExportRequest` already uses the time-bounded
422 iterator to minimize the data read, but support for `READ_UNCOMMITTED` is added
423 so we don't need to block on intent resolution. The potential move to larger
424 range sizes may dictate that this becomes a streaming process so we don't have
425 to hold everything in memory, but details of that are out of scope.
426
427 There is an ongoing discussion about how `RangeFeed` will handle splits, merges,
428 leaseholder moves, node failures, etc. The tradeoffs will be worked out in an
429 update to the `RangeFeed` RFC and are out of scope here. However, it's important
430 to note that the `RangeFeed` will occasionally need to disconnect as part of
431 normal operation (e.g. if it receives a snapshot) and so the reconnection should
432 be handled as cheaply as possible, likely by using resolved timestamps as lower
433 bounds on the catch up scans.
434
435
436 ### Intent tracking
437
438 Once the catch up scans finish, the buffered data is processed. Single range
439 fastpath transations and committed intents are passed on to the next stage.
440 Aborted intents and unrelated data are filtered. This is guaranteed to emit
441 individual keys in mvcc timestamp order. Opened intents are tracked as follows.
442
443 The `RangeFeed` also returns in-stream _close notifications_ (the same ones as
444 [follower reads]) which are a guarantee that no new intents or transactions will
445 be written below a timestamp. Notably, this does not imply that all the intents
446 below it have been resolved. An in-memory structure is bootstrapped with all
447 open intents returned by the `ExportRequest` catch up scans. As intents are
448 resolved and follower read notifications arrive, `min(earliest unresolved
449 intent, latest follower read notification)` will advance. These advances
450 correspond to our resolved timestamp guarantees and are passed to later stages.
451
452
453 ### KV to row conversion
454
455 `TableDescriptor`s are required to interpret kvs as sql rows. The correctness of
456 our online schema changes depends on each query holding a "lease" for the
457 TableDescriptor that is valid for the timestamp it’s executing at. In contrast
458 to queries, which execute with a distinct timestamp, changefeeds operate
459 continuously.
460
461 Rows are stored using one kv per [column family]. For tables with one column
462 family, which is the common case and the default if the user doesn’t specify,
463 each entry in the `RangeFeed`'s WriteBatch is the entire data for the row. For
464 tables with more than one column family, a followup kv scan will be needed to
465 reconstruct the row. A similar fetch will be required if the user requests a
466 changefeed with both the old and new values. `RangeFeed` could instead send this
467 along when necessary, eliminating a network hop and the cost of serving a
468 request, but it’s preferable to avoid pushing knowledge of column families and
469 sql rows down to kv.
470
471
472 ### Sink emitter
473
474 The sink emitter receives an ordered stream of sql rows, schema changes, and
475 resolved timestamps. The sql rows are sent to the relevant kafka topic, sharded
476 into partitions by primary key. When a schema registry is provided, the schema
477 changes are forwarded.
478
479 Resolved timestamps are forwarded when requested by the user. We guarantee that
480 no records with a timestamp less than the resolved notification will be after
481 it, so progress must be synchronously written to the job entry before emitting a
482 resolved timestamp notification.
483
484 For easy development and testing, a 1 partition topic will be auto-created on
485 demand when missing. For production deployments, the user should set the Kafka
486 option to disallow auto-creation of topics and manually create the topic with
487 the correct number of partitions. A user may manually add partitions at any time
488 and this transparently works, though the row to partition mapping will change
489 and consumers are responsible for handling this. Kafka does not support removing
490 partitions.
491
492
493 ### Buffering
494
495 For simplicity and to even out cpu utilization, there is backpressure from Kafka
496 all the way back to the buffer between `RangeFeed` and the intent tracker. The
497 buffer starts in-memory but spills to disk if it exceeds its budget. If the disk
498 buffer also exceeds its budget, the `RangeFeed`s are temporarily shut down until
499 the emitting process has caught up.
500
501
502 ## SQL Table Sink
503
504 ```sql
505 CREATE TABLE <name> (
506 table STRING,
507 message_id INT,
508 key BYTES,
509 value BYTES,
510 PRIMARY KEY (table, key, message_id)
511 )
512 ```
513
514 The SQL Table Sink is initially only for internal testing. A single table is
515 created and contains all updates for a changefeed, even if that changefeed is
516 watching multiple tables. The `key` and `value` are encoded using the `format`
517 specified by the changefeed. `message_id`s are only comparable between rows with
518 equal `key`s.
519
520
521 ## Truncate
522
523 For efficiency, in CockroachDB and in other databases, truncate is implemented
524 as a schema change instead of a large number of row deletions. This has
525 consequences for changefeeds. Consider the [Elasticsearch example] above. It’s
526 not using schema changes for anything, so rows in a truncated table would
527 continue to exist in Elasticsearch after truncate.
528
529 Postgres’s logical decoding is not aware of truncate or schema changes. We adopt
530 the SQL Server behavior, which prevents a table with an active changefeed from
531 being truncated.
532
533
534 ## Permissions
535
536 Changefeeds will initially be admin-only. Eventually we should allow non-admin
537 users to create changefeeds, and admins should be able to revoke those users'
538 permissions (breaking/cancelling the feeds).
539
540
541
542 # Drawbacks
543
544 ## Lots of configuration options
545
546 There are a number of options described above for format, envelope, sink
547 configuration, schema topics, etc. This is a lot, but they all seem necessary.
548 Simple cases like mirroring into Elasticsearch should work without the overhead
549 of avro, schema registries, and message envelopes. At the same time, we don’t
550 want to limit power by not having the ability for users to subscribe to the diff
551 of a changed row. Connectors in the Confluent platform and other CDC
552 implementations also seem to have a large number of configuration options, so
553 maybe this is just unavoidable.
554
555
556 ## No transaction grouped changes
557
558 One could easily imagine that a user would want to subscribe to a totally
559 ordered feed of all transactions that happen in the database. However, unlike
560 sharded PostgreSQL (for example), CockroachDB supports transactions touching any
561 part of the cluster. Which means that in the general case, there is no way to
562 horizontally divide up the transaction log of the cluster such that each piece
563 is independent. Our only options are (a) a single transaction log, which limits
564 scalability, (b) encoding the transaction dependency graph and respecting it
565 when replaying transactions, which is complex, or (c) taking advantage of
566 special cases (such and such tables are never used in the same transactions as
567 these other ones), which will not always work.
568
569 As a result, we’ve decided to give CockroachDB users the information necessary
570 to reconstruct transactions, but will not immediately be building anything to do
571 it in the general case.
572
573
574 # Rationale and alternatives
575
576 ## Alternative sinks
577
578 Our customers have overwhelmingly asked for Kafka as the sink they want for CDC
579 changes, so it will be the first we support but Kafka won’t work for everyone.
580
581 Some changefeeds will be low-throughput and for this a streaming SQL connection
582 (with the same wire format as PostgreSQL’s `LISTEN`) would be sufficient and
583 much simpler operationally. This lends itself well to single-row watches (in the
584 style of RethinkDB).
585
586 Other users will want their changefeeds to emit to a cloud-hosted pubsub.
587
588 Kafka is not the only sink we’ll need, but to limit scope these other sinks not
589 be in the initial version.
590
591
592 ## Write a Kafka connector
593
594 The Confluent platform ships with a [JDBC source connector], so we’re done,
595 right? It works by periodically scanning an incrementing id column or a last
596 updated timestamp column or both. The id column only works with append-only
597 tables and the last updated timestamp column must be maintained correctly by the
598 user. It works well for prototyping, but ultimately we feel that latency,
599 cluster load, and developer burden means that this doesn’t fit with our mission
600 to make data easy.
601
602 We could also write our own Confluent Platform connector. While Kafka will be
603 our initial sink, we don’t want to tie ourselves to it too closely. It’s also
604 not clear how we could make this work with end-to-end push to keep latencies and
605 performance impact low.
606
607
608 ## Exactly-once delivery
609
610 TODO: Description of kafka transactions and how they could be used for this.
611
612
613 # Unresolved Questions
614
615 - Flow control/throttling of the RangeFeed streams
616 - Lots of small changefeeds will require some coalescing
617 - A column whitelist, which could be useful for GDPR compliance or performance
618 - Backup/restore of changefeeds
619 - Details of recovering a failed changefeed using `revision_history` backups
620
621
622 # Appendix
623
624 ## Appendix: Kafka background
625
626 [Kafka] is used for all sorts of things, including as a pub/sub. The fundamental
627 abstraction is a named _topic_, which is a stream of _records_.
628
629 A topic is subdivided into a number of (essentially totally independant)
630 _partitions_. Each partition is a distributed, replicated, ordered, immutable
631 sequence of records that is continually appended to. Records are a timestamped
632 key/value byte pair. Keys are not required. If the timestamp is not specified,
633 it will be assigned by Kafka. Once appended to a partition, an _offset_ is given
634 to the record, so topic+partition+offset will uniquely identify a record.
635
636 The number of partitions in a topic can be increased, but never decreased. There
637 are no ordering guarantees across partitions.
638
639 Any producers for a topic may append new records to any partition, by default
640 hash of key or round-robin if no key is specified. Two records sent from a
641 producer to the same partition are guaranteed to retain their order in the log,
642 but there are no ordering guarantees across-producers.
643
644 Any number of _consumer groups_ may tail a topic. A consumer group is made up of
645 independent consumer processes. Each partition in a topic is assigned to exactly
646 one consumer in the group. By default, the consumer works beginning to end and
647 Kafka keeps track of the current position in each partition, handling consumer
648 failures as needed, but consumers may seek to positions at will.
649
650 Old log entries are cleaned up according to time/space/etc policies.
651
652
653 ## Appendix: Avro background
654
655 Avro is a serialization format, similar to protocol buffers. Notably, schemas
656 are designed to be serialized and stored alongside the serialized records. The
657 spec defines clear "[schema resolution]" rules for asserting whether a
658 progression of schemas is forward or backward compatible and for reading data
659 written with an older or newer version of the schema. The [Confluent Schema
660 Registry] integrates Kafka and Avro and allows consumers to declare which schema
661 evolutions are supported.
662
663
664 ## Appendix: Other CDC syntaxes
665
666 Whenever possible, we reuse PostgreSQL syntax in an attempt to minimize user
667 friction. For CDC, there are two potentially relevant feature: logical decoding
668 and logical replication.
669
670 Logical decoding allows a dynamically loaded plugin to access the WAL. This is
671 how most Postgres CDC works. For example, [Debezium] uses logical decoding to
672 hook PostgreSQL up to Kafka. Unfortunately, the syntax is tightly tied to
673 PostgreSQL's notion of replication slots, which don't map cleanly onto
674 CockroachDB's distributed ranges and replicas.
675
676 Logical replication is used to stream changes from one PostgreSQL instance to
677 another to keep them in sync. Its syntax is in terms of producers and
678 subscribers. The biggest reason to have drop-in compatibility is so tooling,
679 such as ORMs, will work out of the box, but it's unlikely that ORMs would be
680 creating changefeeds. Additionally, our model of push to a sink is different
681 enough from a subscriber pulling from a producer that trying to reuse syntax is
682 likely to be more confusing than helpful. This syntax could always be added as
683 an alias if we build support for CockroachDB to CockroachDB replication.
684
685
686
687 [#2656]: https://github.com/cockroachdb/cockroach/issues/2656
688 [#6130]: https://github.com/cockroachdb/cockroach/issues/6130
689 [amazon redshift]: https://aws.amazon.com/redshift/
690 [avro]: #appendix-avro-background
691 [rangefeed]: https://github.com/cockroachdb/cockroach/blob/381e4dafa596c5f3621a48fcb5fce1f62b18c186/docs/RFCS/20170613_change_feeds_storage_primitive.md
692 [column family]: https://www.cockroachlabs.com/docs/stable/column-families.html
693 [confluent schema registry]: https://docs.confluent.io/current/schema-registry/docs/index.html
694 [cross-row and cross-table ordering]: #cross-row-and-cross-table-ordering
695 [data warehouse example: amazon redshift]: #data-warehouse-example-amazon-redshift
696 [debezium]: http://debezium.io
697 [distsql]: https://github.com/cockroachdb/cockroach/blob/381e4dafa596c5f3621a48fcb5fce1f62b18c186/docs/RFCS/20160421_distributed_sql.md
698 [elasticsearch]: https://www.elastic.co/products/elasticsearch
699 [elasticsearch-example]: #fulltext-index-example-elasticsearch
700 [elasticsearchsinkconnector]: https://docs.confluent.io/current/connect/connect-elasticsearch/docs/elasticsearch_connector.html
701 [exactly-once delivery]: #exactly-once-delivery
702 [follower reads]: https://github.com/cockroachdb/cockroach/pull/26362
703 [full-text index example: elasticsearch]: #full-text-index-example-elasticsearch
704 [garbage collection]: https://www.cockroachlabs.com/docs/stable/architecture/storage-layer.html#garbage-collection
705 [incremental backups]: https://www.cockroachlabs.com/docs/stable/backup.html
706 [jdbc source connector]: https://docs.confluent.io/current/connect/connect-jdbc/docs/source_connector.html
707 [kafka]: https://kafka.apache.org/intro
708 [other cdc syntaxes]: #appendix-other-cdc-syntaxes
709 [proposed and sent through raft]: https://github.com/cockroachdb/cockroach/blob/381e4dafa596c5f3621a48fcb5fce1f62b18c186/docs/RFCS/20160420_proposer_evaluated_kv.md
710 [schema resolution]: http://avro.apache.org/docs/current/spec.html#Schema+Resolution
711 [sql table sink]: #sql-table-sink
712 [system jobs]: https://github.com/cockroachdb/cockroach/blob/381e4dafa596c5f3621a48fcb5fce1f62b18c186/docs/RFCS/20170215_system_jobs.md
713 [transaction grouped changes]: #no-transaction-grouped-changes
714 [follower read]: https://github.com/cockroachdb/cockroach/pull/19222