github.com/pingcap/tiflow@v0.0.0-20240520035814-5bf52d54e205/docs/design/2024-01-05-ticdc-simplify-sink-mysql-timezone.md (about)

     1  # Simplify Sink to MySQL Timezone Handling
     2  
     3  - Author(s): [zhangjinpeng1987](https://github.com/zhangjinpeng1987)
     4  - Tracking Issue(s):
     5  
     6  ## Goals
     7  
     8  - Simplify CDC's sink to MySQL timezone handling logic, eliminate timezone related data correctness risk
     9  - Make CDC a timezone insensitive component, reduce users' timezone maintenance burden for CDC
    10  
    11  ## Problem: How does CDC handle timezone now?
    12  
    13  <img src="../media/cdc-timezone.png?sanitize=true" alt="architecture" width="600"/>
    14  
    15  For a sink-to-mysql changefeed, CDC mounters use CDC server configured timezone to decode timezone sensitive columns like `timestamp`. CDC sink part (write changes to downstream MySQL) will try to use the specified timezone included in the sink-uri. If the timezone is not provided in the sink-uri, it will use the cdc server configured timezone. But essentially, if we want CDC replicate data correctly for timezone sensitive columns like `timestamp`, CDC mounters and CDC sink should use the same timezone to decode and write data changes received from TiKV. Following pseudo code show current CDC's timezone handling logic:
    16  
    17  ```
    18  mouter.time_zone = cdc-server.timezone
    19  
    20  if sink-uri.timezone is not empty (case 1):
    21      // For example: sink-uri="mysql://user:pwd@tcp(127.0.0.1:3306)/db?time-zone=utc".
    22  
    23      sink.time_zone = sink-uri.timezone
    24      if sink-uri.timezone != cdc-server.timezone:
    25          return error. // because mounter and sink may use different timezone which can cause data correctness issue
    26      if downstream MySQL doesn't installed related timezone:
    27          users should follow https://dev.mysql.com/doc/refman/8.0/en/mysql-tzinfo-to-sql.html to install them.
    28  
    29  else if sink-uri.timezone == "" (case 2):
    30      // For example: sink-uri="mysql://root:pwd@tcp(127.0.0.1:3306)/db?time-zone=''".
    31      // In this case, sink.timezone depends on downstream MySQL's time_zone setting whose
    32      // default value is `SYSTEM`. We require users set same timezone for ticdc servers and downstream
    33      // MySQL servers to make sure timezone awared column type like `timestamp` is
    34      // replicated correctly. **If the user dont recognize this case, there is a risk that
    35      // the replicated `timestamp` value is incorrect, at the same time set timezone for
    36      // machine might affect other application in the same machine**.
    37  
    38      sink.timezone = downstream-MySQL.time_zone
    39  
    40  else if sink-uri.time_zone is nil (case 3):
    41      // In this case, sink will use cdc server configured timezone. If the downstream MySQL
    42      // has not installed related timezone, users should follow
    43      // https://dev.mysql.com/doc/refman/8.0/en/mysql-tzinfo-to-sql.html to install them.
    44  
    45      sink.timezone = cdc-server.timezone
    46  ```
    47  
    48  From above description we can tell some potential problems with existing timezone handling in CDC:
    49  
    50  - Case 2 has a correctness risk for `TIMESTAMP` column types when the users don't set downstram MySQL as the same timezone with CDC server. But in the case of cross region replication, databases and machines in different regions may have different default timezone setting.
    51  - Users should carefully treat timezone setting for cdc servers, sink-uri and downstream MySQL/TiDB, this is a maintenance burden for users.
    52  
    53  ## Proposal Changes
    54  
    55  ### CDC internal changes
    56  
    57  - Mounters always use UTC timezone to decode data changes received from TiKV.
    58  - MySQL sink always set session variable `set time_zone = '+00:00'` when replicate data to downstream MySQL.
    59  - Mounters and MySQL sink ignore timezone settings in cdc-server and sink-uri.
    60  
    61  ### Changes from users' perspective
    62  
    63  - Users don't need to adjust cdc-server timezone, sink-uri timezone and donwstream MySQL timezone for CDC.
    64  
    65  ### Upgrade & Compatibility
    66  
    67  - Because MySQL 5.6/5.7/8.0 and TiDB support `set time_zone = '+00:00'` by default, there is no compatibility issue or any required changes for exiting changefeed when upgrade CDC to new version.
    68  
    69  ## Appendix
    70  
    71  ### How MySQL and TiDB handle timezone sensitive column types
    72  
    73  Currently, there is just one timezone sensitive column type `TIMESTAMP`, both TiDB and MySQL will convert it as UTC values and then store them, and convert back from UTC to the current time zone for retrieval. For other time column types like `DATE` and `DATETIME`, both MySQL and TiDB treat them as other types like int, they are not timezone sensitive types, application layer should handle them correctly.
    74  
    75  _MySQL converts TIMESTAMP values from the current time zone to UTC for storage, and back from UTC to the current time zone for retrieval. (This does not occur for other types such as DATETIME.) By default, the current time zone for each connection is the server's time. The time zone can be set on a per-connection basis. As long as the time zone setting remains constant, you get back the same value you store. If you store a TIMESTAMP value, and then change the time zone and retrieve the value, the retrieved value is different from the value you stored. This occurs because the same time zone was not used for conversion in both directions. The current time zone is available as the value of the time_zone system variable._
    76  
    77  ### MySQL binlog cross region replication case
    78  
    79  According to https://dev.mysql.com/doc/refman/8.0/en/replication-features-timezone.html, when using MySQL binlog to replicate data across regions, MySQL suggests users explicitly set related time_zone for both source and replica. The reason is because when the binlog format is `statement` and some time zone sensitive functions like `NOW()` and `FROM_UNIXTIME()` may result in inconsistent value. But TiCDC is more like a row format binlog, it receiving row level changes from TiKV and decode them and repilcate downstream, and CDC has no such issue.
    80  
    81  [NEED TO VERIFY] For row format binlog, `TIMESTAMP` column UTC values are recored and replicated, and the downstream MySQL decode it and write it with the same timezone to make sure it is correct (same as our proposal way above).