# Datacenter HA

```dot
digraph name {
    node [ fontname="Cambria" shape=rect fontsize=12]

    subgraph cluster_dc1 {
        label = "dc1";
        cas1_1 [shape = "cylinder"]
        cas1_2 [shape = "cylinder"]
        app1_1
    }
    subgraph cluster_dc2 {
        label = "dc2";
        cas2_1 [shape = "cylinder"]
        cas2_2 [shape = "cylinder"]
        app1_2
    }
    subgraph cluster_dc3 {
        label = "dc3";
        cas3_1 [shape = "cylinder"]
        cas3_2 [shape = "cylinder"]
        app1_3
    }

    edge [dir=both style=dotted]
    app1_1 -> cas1_1
    app1_1 -> cas2_1
    app1_2 -> cas1_1
    app1_2 -> cas3_1
    app1_3 -> cas2_2
    app1_3 -> cas1_2
}
```

# App Update

- Zero Downtime
  - Clients do not get server errors (e.g. 503)
  - Latency growth MUST be minimized
- Persistent Cache

```dot
digraph cluster {
    node [ fontname = "Cambria" fontsize = 12 shape = "rect"]

    subgraph cluster_ac {
        label = "App Container";
        cache [label="cache.prc"]
        old [label="oldApp.prc"]
        new [label="newApp.prc" style=dashed]
        cder [label="cder.prc"]
    }
    hbuilder -> hcc [style=dotted]
    hcc -> cder
    cder -> new
    cder -> old
    cache -> new [dir=none style=dotted]
    cache -> old [dir=none style=dotted]
}
```

- `cache.prc` is a separate process which shares cache memory with apps
  - Own memory manager
    - https://github.com/couchbase/go-slab

## App Update: Java

```dot
digraph name {
    node [ fontname = "Cambria" fontsize = 12 shape = "rect"]

    subgraph cluster_node {
        label = "node";
        cache [label="cache.prc"]
        old [label="oldApp.fatjar"]
        new [label="newApp.fatjar" style=dashed]
        cder [label="cder.prc"]
        core [label="core.jar"]
    }
    hbuilder -> hcc [style=dotted]
    hcc -> cder
    cder -> core
    core -> old
    core -> new
    cache -> core [dir=none style=dotted]
}
```

- Cache can be inside `core.jar`, but will be lost during `core.jar` update

# Node/Container Failure

```dot
digraph graphname {

    graph[rankdir=BT splines=ortho]
    node [ fontname = "Cambria" shape = "rect" fontsize = 12]
    edge [dir=both arrowhead=none arrowtail=none]

    Database[shape = "cylinder"]
    PLog[shape = "cylinder"]
    WLog[shape = "cylinder"]
    WLogP[label="WLog.Partition"]
    State[shape = "cylinder"]
    StateP[label="State.Partition"]
    Partition[label="PLog.Partition"]
    Workspace
    Container [label="Main App Container" shape=box3d]

    Container -> Database [arrowtail=crow]
    Partition -> PLog [arrowtail=crow]
    Partition -> Container [arrowtail=crow]
    Workspace -> Partition [arrowtail=crow]
    PLog -> Database
    WLog -> Database
    State -> Database
    WLogP -> Workspace
    WLogP -> WLog [arrowtail=crow]
    StateP -> Workspace
    StateP -> State [arrowtail=crow]
}
```

## Distributed Request Handling

```dot
digraph name {
    node [ fontname = "Cambria" fontsize = 12 shape = "rect"]
    fd[label="Detect container failure"]
    cu[label="Mark container as `Unavailable`"]
    fd -> cu
}
```

## Partitioned Request Handling

```dot
digraph name {
    node [ fontname = "Cambria" fontsize = 12 shape = "rect"]
    fd[label="Detect container failure"]
    cu[label="Mark container as `Unavailable`"]
    en[label="Elect container for PLog.Partition"]
    iph[label="Initialize PartitionHandler"]
    fd -> cu
    cu -> en
    en -> iph
}
```

# Links

- [Cheaper, More Reliable, Simpler / Alexander Khristoforov (Odnoklassniki)](https://youtu.be/Hs2txKgnpAk?t=130)
- [Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassandra Summit 2016](https://www.slideshare.net/DataStax/maintaining-consistency-across-data-centers-randy-fradin-blackrock-cassandra-summit-2016)
  - Maintaining Consistency Across Data Centers or: How I Learned to Stop Worrying About WAN Latency. Randy Fradin, BlackRock
  - Challenge 1: Latency. With all that latency on each operation, isn’t performance terrible?
  - Actually, this wasn’t such a problem:
    - 10ms+ latency per operation is acceptable for many apps
    - Minimize use of sequential operations
    - High throughput still achievable
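
The Partitioned Request Handling flow (detect container failure, mark the container as `Unavailable`, elect a container for the PLog partition, initialize the partition handler) could be sketched in Go roughly as follows. This is only an illustration under stated assumptions: the `Cluster` type, its fields, `HandleFailure`, and the "least-loaded available container" election rule are all hypothetical and are not taken from the voedger code base; failure detection itself is assumed to happen elsewhere.

```go
package main

import (
	"fmt"
	"sort"
)

// Status is a hypothetical container health state.
type Status int

const (
	Available Status = iota
	Unavailable
)

// Cluster tracks container status and which container handles each
// PLog partition. All names are illustrative assumptions.
type Cluster struct {
	status map[string]Status // container ID -> status
	owner  map[int]string    // PLog partition -> container ID
}

// NewCluster spreads partitions round-robin over the containers.
// It assumes at least one container is given.
func NewCluster(containers []string, partitions int) *Cluster {
	c := &Cluster{status: map[string]Status{}, owner: map[int]string{}}
	for _, id := range containers {
		c.status[id] = Available
	}
	for p := 0; p < partitions; p++ {
		c.owner[p] = containers[p%len(containers)]
	}
	return c
}

// elect picks the available container that owns the fewest partitions;
// ties are broken by container ID so the result is deterministic.
func (c *Cluster) elect() (string, bool) {
	load := map[string]int{}
	for _, id := range c.owner {
		load[id]++
	}
	var ids []string
	for id, st := range c.status {
		if st == Available {
			ids = append(ids, id)
		}
	}
	if len(ids) == 0 {
		return "", false
	}
	sort.Slice(ids, func(i, j int) bool {
		if load[ids[i]] != load[ids[j]] {
			return load[ids[i]] < load[ids[j]]
		}
		return ids[i] < ids[j]
	})
	return ids[0], true
}

// HandleFailure runs the steps that follow a detected failure: mark the
// container Unavailable, elect a new container for each orphaned
// partition, and call initHandler to initialize a handler there.
func (c *Cluster) HandleFailure(failed string, initHandler func(partition int, container string)) error {
	c.status[failed] = Unavailable
	var orphaned []int
	for p, id := range c.owner {
		if id == failed {
			orphaned = append(orphaned, p)
		}
	}
	sort.Ints(orphaned)
	for _, p := range orphaned {
		id, ok := c.elect()
		if !ok {
			return fmt.Errorf("no available container for partition %d", p)
		}
		c.owner[p] = id
		initHandler(p, id)
	}
	return nil
}

func main() {
	c := NewCluster([]string{"c1", "c2", "c3"}, 6)
	err := c.HandleFailure("c1", func(p int, id string) {
		fmt.Printf("partition %d -> %s\n", p, id)
	})
	if err != nil {
		fmt.Println(err)
	}
}
```

In the Distributed Request Handling case only the first step of `HandleFailure` (marking the container `Unavailable`) would apply, since there is no partition ownership to re-elect.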