                             CQL BINARY PROTOCOL v4


Table of Contents

  1. Overview
  2. Frame header
    2.1. version
    2.2. flags
    2.3. stream
    2.4. opcode
    2.5. length
  3. Notations
  4. Messages
    4.1. Requests
      4.1.1. STARTUP
      4.1.2. AUTH_RESPONSE
      4.1.3. OPTIONS
      4.1.4. QUERY
      4.1.5. PREPARE
      4.1.6. EXECUTE
      4.1.7. BATCH
      4.1.8. REGISTER
    4.2. Responses
      4.2.1. ERROR
      4.2.2. READY
      4.2.3. AUTHENTICATE
      4.2.4. SUPPORTED
      4.2.5. RESULT
        4.2.5.1. Void
        4.2.5.2. Rows
        4.2.5.3. Set_keyspace
        4.2.5.4. Prepared
        4.2.5.5. Schema_change
      4.2.6. EVENT
      4.2.7. AUTH_CHALLENGE
      4.2.8. AUTH_SUCCESS
  5. Compression
  6. Data Type Serialization Formats
  7. User Defined Type Serialization
  8. Result paging
  9. Error codes
  10. Changes from v3


1. Overview

  The CQL binary protocol is a frame based protocol. Frames are defined as:

      0         8        16        24        32         40
      +---------+---------+---------+---------+---------+
      | version |  flags  |      stream       | opcode  |
      +---------+---------+---------+---------+---------+
      |                length                 |
      +---------+---------+---------+---------+
      |                                       |
      .            ...  body ...              .
      .                                       .
      .                                       .
      +----------------------------------------

  The protocol is big-endian (network byte order).

  Each frame contains a fixed size header (9 bytes) followed by a variable size
  body. The header is described in Section 2. The content of the body depends
  on the header opcode value (the body can in particular be empty for some
  opcode values). The list of allowed opcodes is defined in Section 2.4 and the
  details of each corresponding message are described in Section 4.

  The protocol distinguishes two types of frames: requests and responses. Requests
  are those frames sent by the client to the server. Responses are those frames
  sent by the server to the client.
  Note, however, that the protocol supports server pushes (events), so a response
  does not necessarily come right after a client request.

  Note to client implementors: client libraries should always assume that the
  body of a given frame may contain more data than what is described in this
  document. It will however always be safe to ignore the remainder of the frame
  body in such cases. The reason is that this may enable extending the protocol
  with optional features without needing to change the protocol version.



2. Frame header

2.1. version

  The version is a single byte that indicates both the direction of the message
  (request or response) and the version of the protocol in use. The most
  significant bit of version is used to define the direction of the message:
  0 indicates a request, 1 indicates a response. This can be useful for protocol
  analyzers to distinguish the nature of the packet from the direction in which
  it is moving. The rest of that byte is the protocol version (4 for the protocol
  defined in this document). In other words, for this version of the protocol,
  version will be one of:
    0x04    Request frame for this protocol version
    0x84    Response frame for this protocol version

  Please note that while every message ships with the version, only one version
  of messages is accepted on a given connection. In other words, the first message
  exchanged (STARTUP) sets the version for the connection for the lifetime of this
  connection.

  This document describes version 4 of the protocol. For the changes made since
  version 3, see Section 10.


2.2. flags

  Flags applying to this frame. The flags have the following meaning (described
  by the mask that allows selecting them):
    0x01: Compression flag. If set, the frame body is compressed.
          The actual compression to use should have been set up beforehand
          through the STARTUP message (which thus cannot be compressed;
          Section 4.1.1).
    0x02: Tracing flag. For a request frame, this indicates the client requires
          tracing of the request. Note that only QUERY, PREPARE and EXECUTE queries
          support tracing. Other requests will simply ignore the tracing flag if
          set. If a request supports tracing and the tracing flag is set, the response
          to this request will have the tracing flag set and contain tracing
          information.
          If a response frame has the tracing flag set, its body contains
          a tracing ID. The tracing ID is a [uuid] and is the first thing in
          the frame body. The rest of the body will then be the usual body
          corresponding to the response opcode.
    0x04: Custom payload flag. For a request or response frame, this indicates
          that a generic key-value custom payload for a custom QueryHandler
          implementation is present in the frame. Such a custom payload is simply
          ignored by the default QueryHandler implementation.
          Currently, only QUERY, PREPARE, EXECUTE and BATCH requests support
          payload.
          The custom payload is a [bytes map] (see Section 3).
    0x08: Warning flag. The response contains warnings which were generated by the
          server to go along with this response.
          If a response frame has the warning flag set, its body will contain the
          text of the warnings. The warnings are a [string list] and will be the
          first value in the frame body if the tracing flag is not set, or directly
          after the tracing ID if it is.

  The rest of the flags are currently unused and ignored.

2.3. stream

  A frame has a stream id (a [short] value). When sending request messages, this
  stream id must be set by the client to a non-negative value (negative stream ids
  are reserved for streams initiated by the server; currently all EVENT messages
  (Section 4.2.6) have a streamId of -1).
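As a rough illustration of the 9-byte header from Section 1 and the version and
stream rules above, the Go sketch below encodes and decodes a v4 frame header.
It is not the API of any particular driver; the names FrameHeader, EncodeHeader
and DecodeHeader are hypothetical. The stream id is modelled as a signed 16-bit
integer, an assumption made here so that server-initiated ids such as the EVENT
stream id -1 decode as negative values.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// headerLen is the fixed size of a v4 frame header (Section 1).
const headerLen = 9

const (
	responseBit = 0x80 // most significant bit of the version byte (Section 2.1)
	protoV4     = 0x04 // protocol version described in this document
)

// FrameHeader holds the five header fields of a v4 frame (hypothetical type).
type FrameHeader struct {
	Response bool   // direction: false = request, true = response
	Version  uint8  // low 7 bits of the version byte
	Flags    uint8  // Section 2.2
	Stream   int16  // Section 2.3; negative ids are server-initiated
	Opcode   uint8  // Section 2.4
	Length   uint32 // Section 2.5: body length in bytes
}

// EncodeHeader serializes h in big-endian (network) byte order.
func EncodeHeader(h FrameHeader) ([]byte, error) {
	if !h.Response && h.Stream < 0 {
		// Clients must use non-negative stream ids.
		return nil, errors.New("request frames need a non-negative stream id")
	}
	buf := make([]byte, headerLen)
	buf[0] = h.Version & 0x7F
	if h.Response {
		buf[0] |= responseBit
	}
	buf[1] = h.Flags
	binary.BigEndian.PutUint16(buf[2:4], uint16(h.Stream))
	buf[4] = h.Opcode
	binary.BigEndian.PutUint32(buf[5:9], h.Length)
	return buf, nil
}

// DecodeHeader parses the first 9 bytes of buf.
func DecodeHeader(buf []byte) (FrameHeader, error) {
	if len(buf) < headerLen {
		return FrameHeader{}, fmt.Errorf("need %d header bytes, got %d", headerLen, len(buf))
	}
	return FrameHeader{
		Response: buf[0]&responseBit != 0,
		Version:  buf[0] & 0x7F,
		Flags:    buf[1],
		Stream:   int16(binary.BigEndian.Uint16(buf[2:4])),
		Opcode:   buf[4],
		Length:   binary.BigEndian.Uint32(buf[5:9]),
	}, nil
}

func main() {
	// OPTIONS request (opcode 0x05) on stream 1 with an empty body.
	raw, err := EncodeHeader(FrameHeader{Version: protoV4, Stream: 1, Opcode: 0x05})
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", raw) // 04 00 00 01 05 00 00 00 00

	// Header of a server-pushed EVENT (opcode 0x0C) on stream -1.
	h, _ := DecodeHeader([]byte{0x84, 0x00, 0xFF, 0xFF, 0x0C, 0x00, 0x00, 0x00, 0x20})
	fmt.Println(h.Response, h.Version, h.Stream, h.Length) // true 4 -1 32
}
```

Reading the stream id back as a signed value keeps the request/server-push split
a simple sign check.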
  If a client sends a request message with the stream id X, it is guaranteed
  that the stream id of the response to that message will be X.

  This helps to enable the asynchronous nature of the protocol. If a client
  sends multiple messages simultaneously (without waiting for responses), there
  is no guarantee on the order of the responses. For instance, if the client
  writes REQ_1, REQ_2, REQ_3 on the wire (in that order), the server might
  respond to REQ_3 (or REQ_2) first. Assigning different stream ids to these 3
  requests allows the client to determine which request a received answer
  responds to. As there can only be 32768 different simultaneous streams, it is
  up to the client to reuse stream ids.

  Note that clients are free to use the protocol synchronously (i.e. wait for
  the response to REQ_N before sending REQ_N+1). In that case, the stream id
  can be safely set to 0. Clients should also feel free to use only a subset of
  the 32768 maximum possible stream ids if that simplifies their implementation.

2.4. opcode

  An integer byte that distinguishes the actual message:
    0x00    ERROR
    0x01    STARTUP
    0x02    READY
    0x03    AUTHENTICATE
    0x05    OPTIONS
    0x06    SUPPORTED
    0x07    QUERY
    0x08    RESULT
    0x09    PREPARE
    0x0A    EXECUTE
    0x0B    REGISTER
    0x0C    EVENT
    0x0D    BATCH
    0x0E    AUTH_CHALLENGE
    0x0F    AUTH_RESPONSE
    0x10    AUTH_SUCCESS

  Messages are described in Section 4.

  (Note that there is no 0x04 message in this version of the protocol.)


2.5. length

  A 4-byte integer representing the length of the body of the frame (note:
  currently a frame is limited to 256MB in length).


3. Notations

  To describe the layout of the frame body for the messages in Section 4, we
  define the following:

    [int]          A 4-byte integer
    [long]         An 8-byte integer
    [short]        A 2-byte unsigned integer
    [string]       A [short] n, followed by n bytes representing a UTF-8
                   string.
    [long string]  An [int] n, followed by n bytes representing a UTF-8 string.
    [uuid]         A 16-byte uuid.
    [string list]  A [short] n, followed by n [string].
    [bytes]        An [int] n, followed by n bytes if n >= 0. If n < 0,
                   no byte should follow and the value represented is `null`.
    [value]        An [int] n, followed by n bytes if n >= 0.
                   If n == -1 no byte should follow and the value represented
                   is `null`.
                   If n == -2 no byte should follow and the value represented
                   is `not set`, which does not result in any change to the
                   existing value.
                   n < -2 is an invalid value and results in an error.
    [short bytes]  A [short] n, followed by n bytes if n >= 0.

    [option]       A pair of <id><value> where <id> is a [short] representing
                   the option id and <value> depends on that option (and can be
                   of size 0). The supported id (and the corresponding <value>)
                   will be described when this is used.
    [option list]  A [short] n, followed by n [option].
    [inet]         An address (ip and port) to a node. It consists of one
                   [byte] n, that represents the address size, followed by n
                   [byte] representing the IP address (in practice n can only be
                   either 4 (IPv4) or 16 (IPv6)), followed by one [int]
                   representing the port.
    [consistency]  A consistency level specification. This is a [short]
                   representing a consistency level with the following
                   correspondence:
                     0x0000    ANY
                     0x0001    ONE
                     0x0002    TWO
                     0x0003    THREE
                     0x0004    QUORUM
                     0x0005    ALL
                     0x0006    LOCAL_QUORUM
                     0x0007    EACH_QUORUM
                     0x0008    SERIAL
                     0x0009    LOCAL_SERIAL
                     0x000A    LOCAL_ONE

    [string map]   A [short] n, followed by n pairs <k><v> where <k> and <v>
                   are [string].
    [string multimap]  A [short] n, followed by n pairs <k><v> where <k> is a
                   [string] and <v> is a [string list].
    [bytes map]    A [short] n, followed by n pairs <k><v> where <k> is a
                   [string] and <v> is a [bytes].


4. Messages

4.1. Requests

  Note that outside of their normal responses (described below), all requests
  can get an ERROR message (Section 4.2.1) as response.

4.1.1. STARTUP

  Initialize the connection. The server will respond with either a READY message
  (in which case the connection is ready for queries) or an AUTHENTICATE message
  (in which case credentials will need to be provided using AUTH_RESPONSE).

  This must be the first message of the connection, except for OPTIONS, which can
  be sent before it to find out the options supported by the server. Once the
  connection has been initialized, a client should not send any more STARTUP
  messages.

  The body is a [string map] of options. Possible options are:
    - "CQL_VERSION": the version of CQL to use. This option is mandatory and
      currently the only version supported is "3.0.0". Note that this is
      different from the protocol version.
    - "COMPRESSION": the compression algorithm to use for frames (see Section 5).
      This is optional; if not specified no compression will be used.


4.1.2. AUTH_RESPONSE

  Answers a server authentication challenge.

  Authentication in the protocol is SASL based. The server sends authentication
  challenges (a bytes token) to which the client answers with this message. Those
  exchanges continue until the server accepts the authentication by sending an
  AUTH_SUCCESS message after a client AUTH_RESPONSE. Note that the exchange
  begins with the client sending an initial AUTH_RESPONSE in response to a
  server AUTHENTICATE request.

  The body of this message is a single [bytes] token.
  The details of what this token contains (and when it can be null/empty, if
  ever) depend on the actual authenticator used.

  The response to an AUTH_RESPONSE is either a follow-up AUTH_CHALLENGE message,
  an AUTH_SUCCESS message or an ERROR message.


4.1.3. OPTIONS

  Asks the server to return which STARTUP options are supported. The body of an
  OPTIONS message should be empty and the server will respond with a SUPPORTED
  message.


4.1.4. QUERY

  Performs a CQL query. The body of the message must be:
    <query><query_parameters>
  where <query> is a [long string] representing the query and
  <query_parameters> must be
    <consistency><flags>[<n>[name_1]<value_1>...[name_n]<value_n>][<result_page_size>][<paging_state>][<serial_consistency>][<timestamp>]
  where:
    - <consistency> is the [consistency] level for the operation.
    - <flags> is a [byte] whose bits define the options for this query and
      in particular influence what the remainder of the message contains.
      A flag is set if the bit corresponding to its `mask` is set. Supported
      flags are, given their mask:
        0x01: Values. If set, a [short] <n> followed by <n> [value]
              values are provided. Those values are used for bound variables in
              the query. Optionally, if the 0x40 flag is present, each value
              will be preceded by a [string] name, representing the name of
              the marker the value must be bound to.
        0x02: Skip_metadata. If set, the Result Set returned as a response
              to the query (if any) will have the NO_METADATA flag (see
              Section 4.2.5.2).
        0x04: Page_size. If set, <result_page_size> is an [int]
              controlling the desired page size of the result (in CQL3 rows).
              See the section on paging (Section 8) for more details.
        0x08: With_paging_state. If set, <paging_state> should be present.
              <paging_state> is a [bytes] value that should have been returned
              in a result set (Section 4.2.5.2).
              The query will be executed but starting from a given paging
              state. This can also be used to continue paging on a different
              node than the one where it started (see Section 8 for more
              details).
        0x10: With serial consistency. If set, <serial_consistency> should be
              present. <serial_consistency> is the [consistency] level for the
              serial phase of conditional updates. That consistency can only be
              either SERIAL or LOCAL_SERIAL and if not present, it defaults to
              SERIAL. This option will be ignored for anything other than a
              conditional update/insert.
        0x20: With default timestamp. If set, <timestamp> should be present.
              <timestamp> is a [long] representing the default timestamp for the
              query in microseconds (negative values are forbidden). This will
              replace the server side assigned timestamp as default timestamp.
              Note that a timestamp in the query itself will still override
              this timestamp. This is entirely optional.
        0x40: With names for values. This only makes sense if the 0x01 flag is set
              and is ignored otherwise. If present, the values from the 0x01 flag
              will be preceded by a name (see above). Note that this is only useful
              for QUERY requests where named bind markers are used; for EXECUTE
              statements, since the names for the expected values were returned
              during preparation, a client can always provide values in the right
              order without any names and using this flag, while supported, is
              almost surely inefficient.

  Note that the consistency is ignored by some queries (USE, CREATE, ALTER,
  TRUNCATE, ...).

  The server will respond to a QUERY message with a RESULT message, the content
  of which depends on the query.


4.1.5. PREPARE

  Prepare a query for later execution (through EXECUTE). The body consists of
  the CQL query to prepare as a [long string].

  The server will respond with a RESULT message with a `prepared` kind (0x0004,
  see Section 4.2.5).
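To make the QUERY layout from Section 4.1.4 concrete, here is a minimal Go
sketch that serializes <query><query_parameters> for the common case of
positional values and a page size (flags 0x01 and 0x04). The helper names are
hypothetical; paging state, serial consistency, default timestamp and named
values (flags 0x08, 0x10, 0x20 and 0x40) are deliberately left out, and the
`not set` (-2) form of [value] is not modelled. A PREPARE body is just the
<query> part, i.e. what appendLongString produces on its own.

```go
package main

import "fmt"

// Consistency levels from Section 3 (subset).
const (
	consistencyOne    uint16 = 0x0001
	consistencyQuorum uint16 = 0x0004
)

// QUERY flag masks from Section 4.1.4 handled by this sketch.
const (
	flagValues   byte = 0x01
	flagPageSize byte = 0x04
)

// appendShort writes a [short] (2 bytes, big-endian).
func appendShort(b []byte, v uint16) []byte {
	return append(b, byte(v>>8), byte(v))
}

// appendInt writes an [int] (4 bytes, big-endian, two's complement).
func appendInt(b []byte, v int32) []byte {
	u := uint32(v)
	return append(b, byte(u>>24), byte(u>>16), byte(u>>8), byte(u))
}

// appendLongString writes a [long string]: an [int] length, then UTF-8 bytes.
func appendLongString(b []byte, s string) []byte {
	b = appendInt(b, int32(len(s)))
	return append(b, s...)
}

// appendValue writes a [value]; a nil slice is encoded as null (n == -1).
func appendValue(b []byte, v []byte) []byte {
	if v == nil {
		return appendInt(b, -1)
	}
	b = appendInt(b, int32(len(v)))
	return append(b, v...)
}

// EncodeQueryBody builds <query><consistency><flags>[<n><value_1>...<value_n>][<result_page_size>].
// A pageSize <= 0 means "no page size" and leaves flag 0x04 unset.
func EncodeQueryBody(query string, consistency uint16, values [][]byte, pageSize int32) []byte {
	var flags byte
	if len(values) > 0 {
		flags |= flagValues
	}
	if pageSize > 0 {
		flags |= flagPageSize
	}
	b := appendLongString(nil, query)
	b = appendShort(b, consistency)
	b = append(b, flags)
	if flags&flagValues != 0 {
		b = appendShort(b, uint16(len(values)))
		for _, v := range values {
			b = appendValue(b, v)
		}
	}
	if flags&flagPageSize != 0 {
		b = appendInt(b, pageSize)
	}
	return b
}

func main() {
	// One bound value, assuming the column is a CQL int (4 big-endian bytes),
	// and a page size of 100 rows.
	body := EncodeQueryBody("SELECT * FROM ks.t WHERE k = ?", consistencyQuorum,
		[][]byte{{0, 0, 0, 42}}, 100)
	fmt.Println(len(body)) // 51
}
```

The optional parts are appended in the same order as in the <query_parameters>
grammar, so supporting another flag means adding its mask and its field at the
matching position.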

4.1.6. EXECUTE

  Executes a prepared query. The body of the message must be:
    <id><query_parameters>
  where <id> is the prepared query ID. It's the [short bytes] returned as a
  response to a PREPARE message. As for <query_parameters>, it has the exact
  same definition as in QUERY (see Section 4.1.4).

  The response from the server will be a RESULT message.


4.1.7. BATCH

  Allows executing a list of queries (prepared or not) as a batch (note that
  only DML statements are accepted in a batch). The body of the message must
  be:
    <type><n><query_1>...<query_n><consistency><flags>[<serial_consistency>][<timestamp>]
  where:
    - <type> is a [byte] indicating the type of batch to use:
        - If <type> == 0, the batch will be "logged". This is equivalent to a
          normal CQL3 batch statement.
        - If <type> == 1, the batch will be "unlogged".
        - If <type> == 2, the batch will be a "counter" batch (and non-counter
          statements will be rejected).
    - <flags> is a [byte] whose bits define the options for this query and
      in particular influence what the remainder of the message contains. It is
      similar to the <flags> from QUERY and EXECUTE messages, except that the 4
      rightmost bits must always be 0 as their corresponding options do not make
      sense for Batch. A flag is set if the bit corresponding to its `mask` is
      set. Supported flags are, given their mask:
        0x10: With serial consistency. If set, <serial_consistency> should be
              present. <serial_consistency> is the [consistency] level for the
              serial phase of conditional updates. That consistency can only be
              either SERIAL or LOCAL_SERIAL and if not present, it defaults to
              SERIAL. This option will be ignored for anything other than a
              conditional update/insert.
        0x20: With default timestamp. If set, <timestamp> should be present.
              <timestamp> is a [long] representing the default timestamp for the
              query in microseconds. This will replace the server side assigned
              timestamp as default timestamp. Note that a timestamp in the query
              itself will still override this timestamp. This is entirely optional.
        0x40: With names for values. If set, then all values for all <query_i> must
              be preceded by a [string] <name_i> that has the same meaning as in
              QUERY requests [IMPORTANT NOTE: this feature does not work and should
              not be used. It is specified in a way that makes it impossible for the
              server to implement. This will be fixed in a future version of the
              native protocol. See https://issues.apache.org/jira/browse/CASSANDRA-10246
              for more details].
    - <n> is a [short] indicating the number of following queries.
    - <query_1>...<query_n> are the queries to execute. A <query_i> must be of the
      form:
        <kind><string_or_id><n>[<name_1>]<value_1>...[<name_n>]<value_n>
      where:
       - <kind> is a [byte] indicating whether the following query is a prepared
         one or not. <kind> value must be either 0 or 1.
       - <string_or_id> depends on the value of <kind>. If <kind> == 0, it should be
         a [long string] query string (as in QUERY, the query string might contain
         bind markers). Otherwise (that is, if <kind> == 1), it should be a
         [short bytes] representing a prepared query ID.
       - <n> is a [short] indicating the number (possibly 0) of following values.
       - <name_i> is the optional name of the following <value_i>. It must be present
         if and only if the 0x40 flag is provided for the batch.
       - <value_i> is the [value] to use for bound variable i (or bound variable
         <name_i> if the 0x40 flag is used).
    - <consistency> is the [consistency] level for the operation.
    - <serial_consistency> is only present if the 0x10 flag is set.
      In that case, <serial_consistency> is the [consistency] level for the
      serial phase of conditional updates. That consistency can only be either
      SERIAL or LOCAL_SERIAL and if not present defaults to SERIAL. This option
      will be ignored for anything other than a conditional update/insert.

  The server will respond with a RESULT message.


4.1.8. REGISTER

  Register this connection to receive some types of events. The body of the
  message is a [string list] representing the event types to register for. See
  Section 4.2.6 for the list of valid event types.

  The response to a REGISTER message will be a READY message.

  Please note that if a client driver maintains multiple connections to a
  Cassandra node and/or connections to multiple nodes, it is advised to
  dedicate a handful of connections to receive events, but to *not* register
  for events on all connections, as this would only result in receiving
  the same event messages multiple times, wasting bandwidth.


4.2. Responses

  This section describes the content of the frame body for the different
  responses. Please note that to make room for future evolution, clients should
  accept, and simply discard, any extra information at the end of the frame body
  beyond what is described in this document.

4.2.1. ERROR

  Indicates an error processing a request. The body of the message will be an
  error code ([int]) followed by a [string] error message. Then, depending on
  the exception, more content may follow. The error codes are defined in
  Section 9, along with their additional content if any.


4.2.2. READY

  Indicates that the server is ready to process queries. This message will be
  sent by the server after a STARTUP message if no authentication is required
  (if authentication is required, the server indicates readiness by sending an
  AUTH_SUCCESS message).
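The connection set-up exchange described in Sections 4.1.1, 4.1.2, 4.2.3 and
4.2.8 can be summarised as a small state machine. The Go sketch below is a
hypothetical helper, not part of any driver API: it maps the opcode of each
response received during set-up to the client's next action, accepting READY
only as the reply to STARTUP and AUTH_CHALLENGE/AUTH_SUCCESS only once
authentication has started.

```go
package main

import "fmt"

// Response opcodes from Section 2.4 that can occur during connection set-up.
const (
	opError         byte = 0x00
	opReady         byte = 0x02
	opAuthenticate  byte = 0x03
	opAuthChallenge byte = 0x0E
	opAuthSuccess   byte = 0x10
)

// Step is the client's next action during connection set-up.
type Step int

const (
	StepReady            Step = iota // connection can now carry queries
	StepSendAuthResponse             // reply with an AUTH_RESPONSE (Section 4.1.2)
	StepFail                         // server answered with ERROR (Section 4.2.1)
)

// NextStep maps a response opcode to the next action. authenticating is false
// while waiting for the reply to STARTUP and true once an AUTHENTICATE has been
// received (i.e. while waiting for replies to AUTH_RESPONSE).
func NextStep(authenticating bool, opcode byte) (Step, error) {
	switch {
	case opcode == opError:
		return StepFail, nil
	case !authenticating && opcode == opReady:
		return StepReady, nil
	case !authenticating && opcode == opAuthenticate:
		return StepSendAuthResponse, nil
	case authenticating && opcode == opAuthChallenge:
		return StepSendAuthResponse, nil
	case authenticating && opcode == opAuthSuccess:
		return StepReady, nil
	}
	return StepFail, fmt.Errorf("unexpected opcode 0x%02X (authenticating=%v)", opcode, authenticating)
}

func main() {
	// Hypothetical server replies: AUTHENTICATE, one AUTH_CHALLENGE, then AUTH_SUCCESS.
	authenticating := false
	for _, op := range []byte{opAuthenticate, opAuthChallenge, opAuthSuccess} {
		step, err := NextStep(authenticating, op)
		if err != nil {
			fmt.Println("protocol error:", err)
			return
		}
		switch step {
		case StepSendAuthResponse:
			authenticating = true
			fmt.Println("send AUTH_RESPONSE")
		case StepReady:
			fmt.Println("ready")
		case StepFail:
			fmt.Println("server returned ERROR")
			return
		}
	}
}
```

The SASL token contents are left to the authenticator in use, so the sketch
only tracks opcodes, not token bytes.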

  The body of a READY message is empty.


4.2.3. AUTHENTICATE

  Indicates that the server requires authentication, and which authentication
  mechanism to use.

  The authentication is SASL based and thus consists of a number of server
  challenges (AUTH_CHALLENGE, Section 4.2.7) followed by client responses
  (AUTH_RESPONSE, Section 4.1.2). The initial exchange is however bootstrapped
  by an initial client response. The details of that exchange (including how
  many challenge-response pairs are required) are specific to the authenticator
  in use. The exchange ends when the server sends an AUTH_SUCCESS message or
  an ERROR message.

  This message will be sent following a STARTUP message if authentication is
  required and must be answered by an AUTH_RESPONSE message from the client.

  The body consists of a single [string] indicating the full class name of the
  IAuthenticator in use.


4.2.4. SUPPORTED

  Indicates which startup options are supported by the server. This message
  comes as a response to an OPTIONS message.

  The body of a SUPPORTED message is a [string multimap]. This multimap gives
  for each of the supported STARTUP options, the list of supported values.


4.2.5. RESULT

  The result to a query (QUERY, PREPARE, EXECUTE or BATCH messages).

  The first element of the body of a RESULT message is an [int] representing the
  `kind` of result. The rest of the body depends on the kind. The kind can be
  one of:
    0x0001    Void: for results carrying no information.
    0x0002    Rows: for results to select queries, returning a set of rows.
    0x0003    Set_keyspace: the result to a `use` query.
    0x0004    Prepared: result to a PREPARE message.
    0x0005    Schema_change: the result to a schema altering query.

  The body for each kind (after the [int] kind) is defined below.


4.2.5.1. Void

  The rest of the body for a Void result is empty. It indicates that a query was
  successful without providing more information.


4.2.5.2. Rows

  Indicates a set of rows. The rest of the body of a Rows result is:
    <metadata><rows_count><rows_content>
  where:
    - <metadata> is composed of:
        <flags><columns_count>[<paging_state>][<global_table_spec>?<col_spec_1>...<col_spec_n>]
      where:
        - <flags> is an [int]. The bits of <flags> provide information on the
          formatting of the remaining information. A flag is set if the bit
          corresponding to its `mask` is set. Supported flags are, given their
          mask:
            0x0001    Global_tables_spec: if set, only one table spec (keyspace
                      and table name) is provided as <global_table_spec>. If not
                      set, <global_table_spec> is not present.
            0x0002    Has_more_pages: indicates whether this is not the last
                      page of results and more should be retrieved. If set, the
                      <paging_state> will be present. The <paging_state> is a
                      [bytes] value that should be used in QUERY/EXECUTE to
                      continue paging and retrieve the remainder of the result
                      for this query (see Section 8 for more details).
            0x0004    No_metadata: if set, the <metadata> is only composed of
                      these <flags>, the <columns_count> and optionally the
                      <paging_state> (depending on the Has_more_pages flag) but
                      no other information (so no <global_table_spec> nor
                      <col_spec_i>). This will only ever be the case if this was
                      requested during the query (see the Skip_metadata flag of
                      QUERY and EXECUTE messages).
        - <columns_count> is an [int] representing the number of columns selected
          by the query that produced this result. It defines the number of
          <col_spec_i> elements in <metadata> and the number of elements for each
          row in <rows_content>.
        - <global_table_spec> is present if the Global_tables_spec is set in
          <flags>. It is composed of two [string] representing the
          (unique) keyspace name and table name the columns belong to.
        - <col_spec_i> specifies the columns returned in the query. There are
          <columns_count> such column specifications that are composed of:
            (<ksname><tablename>)?<name><type>
          The initial <ksname> and <tablename> are two [string] and are only
          present if the Global_tables_spec flag is not set. The <name> is a
          [string] and <type> is an [option]; they correspond respectively to
          the description (what this description is depends a bit on the
          context: in results to selects, this will be either the user chosen
          alias or the selection used (often a column name, but it can be a
          function call too). In results to a PREPARE, this will be either the
          name of the corresponding bind variable or the column name for the
          variable if it is "anonymous") and to the type of the corresponding
          result. The option for <type> is either a native type (see below), in
          which case the option has no value, or a 'custom' type, in which case
          the value is a [string] representing the fully qualified class name of
          the type represented. Valid option ids are:
            0x0000    Custom: the value is a [string], see above.
            0x0001    Ascii
            0x0002    Bigint
            0x0003    Blob
            0x0004    Boolean
            0x0005    Counter
            0x0006    Decimal
            0x0007    Double
            0x0008    Float
            0x0009    Int
            0x000B    Timestamp
            0x000C    Uuid
            0x000D    Varchar
            0x000E    Varint
            0x000F    Timeuuid
            0x0010    Inet
            0x0011    Date
            0x0012    Time
            0x0013    Smallint
            0x0014    Tinyint
            0x0020    List: the value is an [option], representing the type
                      of the elements of the list.
            0x0021    Map: the value is two [option], representing the types of
                      the keys and values of the map.
            0x0022    Set: the value is an [option], representing the type
                      of the elements of the set.
            0x0030    UDT: the value is <ks><udt_name><n><name_1><type_1>...<name_n><type_n>
                      where:
                        - <ks> is a [string] representing the keyspace name this
                          UDT is part of.
                        - <udt_name> is a [string] representing the UDT name.
                        - <n> is a [short] representing the number of fields of
                          the UDT, and thus the number of <name_i><type_i> pairs
                          following.
                        - <name_i> is a [string] representing the name of the
                          i_th field of the UDT.
                        - <type_i> is an [option] representing the type of the
                          i_th field of the UDT.
            0x0031    Tuple: the value is <n><type_1>...<type_n> where <n> is a
                      [short] representing the number of values in the type, and
                      <type_i> are [option] representing the type of the i_th
                      component of the tuple.

    - <rows_count> is an [int] representing the number of rows present in this
      result. Those rows are serialized in the <rows_content> part.
    - <rows_content> is composed of <row_1>...<row_m> where m is <rows_count>.
      Each <row_i> is composed of <value_1>...<value_n> where n is
      <columns_count> and where <value_j> is a [bytes] representing the value
      returned for the jth column of the ith row. In other words, <rows_content>
      is composed of (<rows_count> * <columns_count>) [bytes].


4.2.5.3. Set_keyspace

  The result to a `use` query. The body (after the kind [int]) is a single
  [string] indicating the name of the keyspace that has been set.


4.2.5.4. Prepared

  The result to a PREPARE message. The body of a Prepared result is:
    <id><metadata><result_metadata>
  where:
    - <id> is [short bytes] representing the prepared query ID.
    - <metadata> is composed of:
        <flags><columns_count><pk_count>[<pk_index_1>...<pk_index_n>][<global_table_spec>?<col_spec_1>...<col_spec_n>]
      where:
        - <flags> is an [int]. The bits of <flags> provide information on the
          formatting of the remaining information. A flag is set if the bit
          corresponding to its `mask` is set. Supported masks and their flags
          are:
            0x0001    Global_tables_spec: if set, only one table spec (keyspace
                      and table name) is provided as <global_table_spec>. If not
                      set, <global_table_spec> is not present.
        - <columns_count> is an [int] representing the number of bind markers
          in the prepared statement. It defines the number of <col_spec_i>
          elements.
        - <pk_count> is an [int] representing the number of <pk_index_i>
          elements to follow. If this value is zero, at least one of the
          partition key columns in the table that the statement acts on
          did not have a corresponding bind marker (or the bind marker
          was wrapped in a function call).
        - <pk_index_i> is a [short] that represents the index of the bind marker
          that corresponds to the partition key column in position i.
          For example, a <pk_index> sequence of [2, 0, 1] indicates that the
          table has three partition key columns; the full partition key
          can be constructed by creating a composite of the values for
          the bind markers at index 2, at index 0, and at index 1.
          This allows implementations with token-aware routing to correctly
          construct the partition key without needing to inspect table
          metadata.
        - <global_table_spec> is present if the Global_tables_spec is set in
          <flags>. If present, it is composed of two [string]s. The first
          [string] is the name of the keyspace that the statement acts on.
          The second [string] is the name of the table that the columns
          represented by the bind markers belong to.
        - <col_spec_i> specifies the bind markers in the prepared statement.
          There are <columns_count> such column specifications, each with the
          following format:
            (<ksname><tablename>)?<name><type>
          The initial <ksname> and <tablename> are two [string] that are only
          present if the Global_tables_spec flag is not set. The <name> field
          is a [string] that holds the name of the bind marker (if named),
          or the name of the column, field, or expression that the bind marker
          corresponds to (if the bind marker is "anonymous"). The <type>
          field is an [option] that represents the expected type of values for
          the bind marker.
          See the Rows documentation (Section 4.2.5.2) for full details on
          the <type> field.

    - <result_metadata> is defined exactly the same as <metadata> in the Rows
      documentation (Section 4.2.5.2). This describes the metadata for the
      result set that will be returned when this prepared statement is executed.
      Note that <result_metadata> may be empty (have the No_metadata flag and
      0 columns, see Section 4.2.5.2) and will be for any query that is not a
      Select. In fact, there is never a guarantee that this will be non-empty, so
      implementations should protect themselves accordingly. This result metadata
      is an optimization that allows implementations to later execute the
      prepared statement without requesting the metadata (see the Skip_metadata
      flag in EXECUTE). Clients can safely discard this metadata if they do not
      want to take advantage of that optimization.

  Note that the prepared query ID returned is global to the node on which the
  query has been prepared. It can be used on any connection to that node
  until the node is restarted (after which the query must be reprepared).

4.2.5.5. Schema_change

  The result to a schema altering query (creation/update/drop of a
  keyspace/table/index). The body (after the kind [int]) is the same
  as the body for a "SCHEMA_CHANGE" event, namely:
    <change_type><target><options>
  Please refer to Section 4.2.6 below for the meaning of those fields.

  Note that a query to create or drop an index is considered to be a change
  to the table the index is on.


4.2.6. EVENT

  An event pushed by the server. A client will only receive events for the
  types it has REGISTERed to. The body of an EVENT message will start with a
  [string] representing the event type. The rest of the message depends on the
  event type. The valid event types are:
    - "TOPOLOGY_CHANGE": events related to change in the cluster topology.
      Currently, events are sent when new nodes are added to the cluster, and
      when nodes are removed. The body of the message (after the event type)
      consists of a [string] and an [inet], corresponding respectively to the
      type of change ("NEW_NODE" or "REMOVED_NODE") followed by the address of
      the new/removed node.
    - "STATUS_CHANGE": events related to change of node status. Currently,
      up/down events are sent. The body of the message (after the event type)
      consists of a [string] and an [inet], corresponding respectively to the
      type of status change ("UP" or "DOWN") followed by the address of the
      concerned node.
    - "SCHEMA_CHANGE": events related to schema change. After the event type,
      the rest of the message will be <change_type><target><options> where:
        - <change_type> is a [string] representing the type of change involved.
          It will be one of "CREATED", "UPDATED" or "DROPPED".
        - <target> is a [string] that can be one of "KEYSPACE", "TABLE", "TYPE",
          "FUNCTION" or "AGGREGATE" and describes what has been modified
          ("TYPE" stands for modifications related to user types, "FUNCTION"
          for modifications related to user defined functions, "AGGREGATE"
          for modifications related to user defined aggregates).
        - <options> depends on the preceding <target>:
          - If <target> is "KEYSPACE", then <options> will be a single [string]
            representing the keyspace changed.
          - If <target> is "TABLE" or "TYPE", then <options> will be 2
            [string]s: the first one will be the keyspace containing the
            affected object, and the second one will be the name of said
            affected object (either the table or the user type name).
          - If <target> is "FUNCTION" or "AGGREGATE", multiple arguments follow:
            - [string] keyspace containing the user defined function / aggregate
            - [string] the function/aggregate name
            - [string list] one string for each argument type (as CQL type)

  All EVENT messages have a streamId of -1 (Section 2.3).

  Please note that "NEW_NODE" and "UP" events are sent based on internal Gossip
  communication and as such may be sent shortly before the binary protocol
  server on the newly up node is fully started. Clients are thus advised to
  wait a short time before trying to connect to the node (1 second should be
  enough), otherwise they may experience a connection refusal at first.

4.2.7. AUTH_CHALLENGE

  A server authentication challenge (see AUTH_RESPONSE (Section 4.1.2) for more
  details).

  The body of this message is a single [bytes] token. The details of what this
  token contains (and when it can be null/empty, if ever) depend on the actual
  authenticator used.

  Clients are expected to answer the server challenge with an AUTH_RESPONSE
  message.

4.2.8. AUTH_SUCCESS

  Indicates the success of the authentication phase. See Section 4.2.3 for more
  details.

  The body of this message is a single [bytes] token holding final information
  from the server that the client may require to finish the authentication
  process. What that token contains and whether it can be null depends on the
  actual authenticator used.


5. Compression

  Frame compression is supported by the protocol, but only the frame body is
  compressed (the frame header should never be compressed).

  Before being used, client and server must agree on a compression algorithm to
  use, which is done in the STARTUP message. As a consequence, a STARTUP message
  must never be compressed.
  However, once the STARTUP frame has been received by the server, messages
  can be compressed (including the response to the STARTUP request). Frames
  do not have to be compressed, however, even if compression has been agreed
  upon (a server may only compress frames above a certain size at its
  discretion). A frame body is compressed if and only if the compression flag
  (see Section 2.2) is set.

  As of version 2 of the protocol, the following compressions are available:
    - lz4 (https://code.google.com/p/lz4/). Note that with lz4, the first four
      bytes of the body will be the uncompressed length (followed by the
      compressed bytes).
    - snappy (https://code.google.com/p/snappy/). This compression might not be
      available as it depends on a native lib (server-side) that might not be
      available on some installations.


6. Data Type Serialization Formats

  This section describes the serialization formats for all CQL data types
  supported by Cassandra through the native protocol. These serialization
  formats should be used by client drivers to encode values for EXECUTE
  messages. Cassandra will use these formats when returning values in
  RESULT messages.

  All values are represented as [bytes] in EXECUTE and RESULT messages.
  The [bytes] format includes an [int] prefix denoting the length of the value.
  For that reason, the serialization formats described here will not include
  a length component.

  For legacy compatibility reasons, note that most non-string types support
  "empty" values (i.e. a value with zero length). An empty value is distinct
  from NULL, which is encoded with a negative length.

  As with the rest of the native protocol, all encodings are big-endian.

6.1. ascii

  A sequence of bytes in the ASCII range [0, 127]. Bytes with values outside of
  this range will result in a validation error.
6.2. bigint

  An 8 byte two's complement integer.

6.3. blob

  Any sequence of bytes.

6.4. boolean

  A single byte. A value of 0 denotes "false"; any other value denotes "true".
  (However, it is recommended that a value of 1 be used to represent "true".)

6.5. date

  A 4 byte unsigned integer representing days, with the epoch (the unix epoch,
  January 1st, 1970) centered at 2^31.
  A few examples:
    0:        -5877641-06-23
    2^31:     1970-01-01
    2^32 - 1: 5881580-07-11

6.6. decimal

  The decimal format represents an arbitrary-precision number. It contains an
  [int] "scale" component followed by a varint encoding (see section 6.23)
  of the unscaled value. The encoded value represents "<unscaled>E<-scale>".
  In other words, "<unscaled> * 10 ^ (-1 * <scale>)".

6.7. double

  An 8 byte floating point number in the IEEE 754 binary64 format.

6.8. float

  A 4 byte floating point number in the IEEE 754 binary32 format.

6.9. inet

  A 4 byte or 16 byte sequence denoting an IPv4 or IPv6 address, respectively.

6.10. int

  A 4 byte two's complement integer.

6.11. list

  An [int] n indicating the number of elements in the list, followed by n
  elements. Each element is [bytes] representing the serialized value.

6.12. map

  An [int] n indicating the number of key/value pairs in the map, followed by
  n entries. Each entry is composed of two [bytes] representing the key
  and value.

6.13. set

  An [int] n indicating the number of elements in the set, followed by n
  elements. Each element is [bytes] representing the serialized value.

6.14. smallint

  A 2 byte two's complement integer.

6.15. text

  A sequence of bytes conforming to the UTF-8 specification.

6.16. time

  An 8 byte two's complement long representing nanoseconds since midnight.
  Valid values are in the range 0 to 86399999999999.

6.17. timestamp

  An 8 byte two's complement integer representing a millisecond-precision
  offset from the unix epoch (00:00:00, January 1st, 1970). Negative values
  represent a negative offset from the epoch.

6.18. timeuuid

  A 16 byte sequence representing a version 1 UUID as defined by RFC 4122.

6.19. tinyint

  A 1 byte two's complement integer.

6.20. tuple

  A sequence of [bytes] values representing the items in a tuple. The encoding
  of each element depends on the data type for that position in the tuple.
  Null values may be represented by using length -1 for the [bytes]
  representation of an element.

6.21. uuid

  A 16 byte sequence representing any valid UUID as defined by RFC 4122.

6.22. varchar

  An alias of the "text" type.

6.23. varint

  A variable-length two's complement encoding of a signed integer.

  The following examples may help implementors of this spec:

    Value | Encoding
    ------|---------
        0 |     0x00
        1 |     0x01
      127 |     0x7F
      128 |   0x0080
      129 |   0x0081
       -1 |     0xFF
     -128 |     0x80
     -129 |   0xFF7F

  Note that positive numbers must use a most-significant byte with a value
  less than 0x80, because a most-significant bit of 1 indicates a negative
  value. Implementors should pad positive values whose most-significant byte
  is >= 0x80 with a leading 0x00 byte.


7. User Defined Types

  This section describes the serialization format for User defined types (UDT),
  as described in section 4.2.5.2.

  A UDT value is composed of successive [bytes] values, one for each field of
  the UDT value (in the order defined by the type). A UDT value will generally
  have one value for each field of the type it represents, but it is allowed to
  have fewer values than the type has fields.


8. Result paging

  The protocol allows for paging the result of queries. For that, the QUERY and
  EXECUTE messages have a <result_page_size> value that indicates the desired
  page size in CQL3 rows.

  If a positive value is provided for <result_page_size>, the result set of the
  RESULT message returned for the query will contain at most the first
  <result_page_size> rows of the query result. If that first page of results
  contains the full result set for the query, the RESULT message (of kind `Rows`)
  will have the Has_more_pages flag *not* set. However, if some results are not
  part of the first response, the Has_more_pages flag will be set and the result
  will contain a <paging_state> value. In that case, the <paging_state> value
  should be used in a QUERY or EXECUTE message (that has the *same* query as
  the original one or the behavior is undefined) to retrieve the next page of
  results.

  Only CQL3 queries that return a result set (RESULT message with a Rows `kind`)
  support paging. For other types of queries, the <result_page_size> value is
  ignored.

  Note to client implementors:
    - While <result_page_size> can be as low as 1, it will likely be detrimental
      to performance to pick a value too low. A value below 100 is probably too
      low for most use cases.
    - Clients should not rely on the actual size of the result set returned to
      decide if there are more results to fetch or not. Instead, they should
      always check the Has_more_pages flag (unless they did not enable paging
      for the query, obviously). Clients should also not assert that no result
      will have more than <result_page_size> results. While the current
      implementation always respects the exact value of <result_page_size>, we
      reserve the right to return slightly smaller or bigger pages in the future
      for performance reasons.
    - The <paging_state> is specific to a protocol version and drivers should
      not send a <paging_state> returned by a node using the protocol v3 to
      query a node using the protocol v4, for instance.


9. Error codes

  Let us recall that an ERROR message is composed of <code><message>[...]
  (see 4.2.1 for details). The supported error codes, as well as any additional
  information the message may contain after the <message>, are described below:
    0x0000    Server error: something unexpected happened. This indicates a
              server-side bug.
    0x000A    Protocol error: some client message triggered a protocol
              violation (for instance a QUERY message is sent before a STARTUP
              one has been sent).
    0x0100    Authentication error: authentication was required and failed. The
              possible reason for failing depends on the authenticator in use,
              which may or may not include more detail in the accompanying
              error message.
    0x1000    Unavailable exception. The rest of the ERROR message body will be
                <cl><required><alive>
              where:
                <cl> is the [consistency] level of the query that triggered
                     the exception.
                <required> is an [int] representing the number of nodes that
                           should be alive to respect <cl>.
                <alive> is an [int] representing the number of replicas that
                        were known to be alive when the request had been
                        processed (since an unavailable exception has been
                        triggered, there will be <alive> < <required>).
    0x1001    Overloaded: the request cannot be processed because the
              coordinator node is overloaded.
    0x1002    Is_bootstrapping: the request was a read request but the
              coordinator node is bootstrapping.
    0x1003    Truncate_error: error during a truncation operation.
    0x1100    Write_timeout: Timeout exception during a write request.
              The rest of the ERROR message body will be
                <cl><received><blockfor><writeType>
              where:
                <cl> is the [consistency] level of the query having triggered
                     the exception.
                <received> is an [int] representing the number of nodes having
                           acknowledged the request.
                <blockfor> is an [int] representing the number of replicas whose
                           acknowledgement is required to achieve <cl>.
                <writeType> is a [string] that describes the type of the write
                            that timed out. The value of that string can be one
                            of:
                             - "SIMPLE": the write was a non-batched
                               non-counter write.
                             - "BATCH": the write was a (logged) batch write.
                               If this type is received, it means the batch log
                               has been successfully written (otherwise a
                               "BATCH_LOG" type would have been sent instead).
                             - "UNLOGGED_BATCH": the write was an unlogged
                               batch. No batch log write has been attempted.
                             - "COUNTER": the write was a counter write
                               (batched or not).
                             - "BATCH_LOG": the timeout occurred during the
                               write to the batch log when a (logged) batch
                               write was requested.
    0x1200    Read_timeout: Timeout exception during a read request. The rest
              of the ERROR message body will be
                <cl><received><blockfor><data_present>
              where:
                <cl> is the [consistency] level of the query having triggered
                     the exception.
                <received> is an [int] representing the number of nodes having
                           answered the request.
                <blockfor> is an [int] representing the number of replicas whose
                           response is required to achieve <cl>. Please note that
                           it is possible to have <received> >= <blockfor> if
                           <data_present> is false. This can also happen in the
                           (unlikely) case where <cl> is achieved but the
                           coordinator node times out while waiting for
                           read-repair acknowledgement.
                <data_present> is a single byte. If its value is 0, it means
                               the replica that was asked for data has not
                               responded. Otherwise, the value is != 0.
    0x1300    Read_failure: A non-timeout exception during a read request. The
              rest of the ERROR message body will be
                <cl><received><blockfor><numfailures><data_present>
              where:
                <cl> is the [consistency] level of the query having triggered
                     the exception.
                <received> is an [int] representing the number of nodes having
                           answered the request.
                <blockfor> is an [int] representing the number of replicas whose
                           acknowledgement is required to achieve <cl>.
                <numfailures> is an [int] representing the number of nodes that
                              experienced a failure while executing the request.
                <data_present> is a single byte. If its value is 0, it means
                               the replica that was asked for data had not
                               responded. Otherwise, the value is != 0.
    0x1400    Function_failure: A (user defined) function failed during execution.
              The rest of the ERROR message body will be
                <keyspace><function><arg_types>
              where:
                <keyspace> is the keyspace [string] of the failed function.
                <function> is the name [string] of the failed function.
                <arg_types> is a [string list] with one string for each
                            argument type (as CQL type) of the failed function.
    0x1500    Write_failure: A non-timeout exception during a write request. The
              rest of the ERROR message body will be
                <cl><received><blockfor><numfailures><writeType>
              where:
                <cl> is the [consistency] level of the query having triggered
                     the exception.
                <received> is an [int] representing the number of nodes having
                           answered the request.
                <blockfor> is an [int] representing the number of replicas whose
                           acknowledgement is required to achieve <cl>.
                <numfailures> is an [int] representing the number of nodes that
                              experienced a failure while executing the request.
                <writeType> is a [string] that describes the type of the write
                            that failed. The value of that string can be one
                            of:
                             - "SIMPLE": the write was a non-batched
                               non-counter write.
                             - "BATCH": the write was a (logged) batch write.
                               If this type is received, it means the batch log
                               has been successfully written (otherwise a
                               "BATCH_LOG" type would have been sent instead).
                             - "UNLOGGED_BATCH": the write was an unlogged
                               batch. No batch log write has been attempted.
                             - "COUNTER": the write was a counter write
                               (batched or not).
                             - "BATCH_LOG": the failure occurred during the
                               write to the batch log when a (logged) batch
                               write was requested.

    0x2000    Syntax_error: The submitted query has a syntax error.
    0x2100    Unauthorized: The logged-in user doesn't have the right to perform
              the query.
    0x2200    Invalid: The query is syntactically correct but invalid.
    0x2300    Config_error: The query is invalid because of some configuration
              issue.
    0x2400    Already_exists: The query attempted to create a keyspace or a
              table that already exists. The rest of the ERROR message
              body will be <ks><table> where:
                <ks> is a [string] representing either the keyspace that
                     already exists, or the keyspace containing the table
                     that already exists.
                <table> is a [string] representing the name of the table that
                        already exists. If the query was attempting to create a
                        keyspace, <table> will be present but will be the empty
                        string.
    0x2500    Unprepared: Returned when executing a prepared statement whose
              prepared statement ID is not known by this host. The rest of the
              ERROR message body will be [short bytes] representing the
              unknown ID.

10. Changes from v3

  * The format of "SCHEMA_CHANGE" events (Section 4.2.6) (and implicitly
    "Schema_change" results (Section 4.2.5.5)) has been modified, and now
    includes changes related to user defined functions and user defined
    aggregates.
  * Read_failure error code was added.
  * Function_failure error code was added.
  * Add custom payload to frames for custom QueryHandler implementations
    (ignored by Cassandra's standard QueryHandler).
  * Add warnings to frames for responses for which the server generated a
    warning during processing, which the client needs to address.
  * Add the date and time data types.
  * Add the tinyint and smallint data types.
  * The <paging_state> returned in the v4 protocol is not compatible with the v3
    protocol. In other words, a <paging_state> returned by a node using protocol
    v4 should not be used to query a node using protocol v3 (and vice-versa).