# ELK Basics

Motivation
- dev.heeus.io: [ELK Basics](https://dev.heeus.io/launchpad/#!26205)

# Content

- [ELK Concepts](#main-elk-concepts)
- [Github](#github-as-an-example)
- [ELK Index](#elk-index)
- [Shards](#shards)
- [Nodes](#nodes)
- [Documents](#documents)
- [Shard allocation across Nodes](#shard-allocation-across-nodes)
- [Cluster health](#cluster-health)
- [Find problematic indices](#find-problematic-indices)
- [Waiting for green status](#waiting-for-green-status)
- [ELK on one Node](#elk-on-one-node)
- [Recovery](#recovery)
- [Multi-Tenancy with Elasticsearch and OpenSearch](#multi-tenancy-with-elasticsearch-and-opensearch)
- [Dashboards](#dashboards)
- [Licensing Restrictions](#licensing-restrictions)
---
# Main ELK Concepts

```mermaid
erDiagram
    Cluster ||--o{ Index: "unlimited???"
    Index ||..o{ Document : "up to 2,147,483,519 * NumOfPrimaryShards"
    Index {
        string IndexID
        int NumOfPrimaryShards
    }

    Document }|..|| PrimaryShard : "belongs to"
    PrimaryShard ||..|| Shard: is
    Replica ||..|| Shard: is
    Shard ||..|| LuceneIndex: is
    Index ||--|{ PrimaryShard : "divided into fixed number of"
    Node ||..|{ Shard : "should have < 600???"
    PrimaryShard ||--o{ Replica : "has flexible number of"
```

**Github as an example:**

```mermaid
erDiagram
    GithubIndex {
        string RepoID
    }
    GithubIndex ||--o{ PrimaryShard : "has 128"
    PrimaryShard ||..|| Size : "is around 120Gb in"
```

- https://www.elastic.co/customers/github
- In GitHub's main Elasticsearch cluster, they have about 128 shards, each storing about 120 gigabytes.
- To optimize search within a single repository, GitHub uses the Elasticsearch [routing parameter](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html) based on the repository ID. "That allows us to put all the source code for a single repository on one shard," says Pease. "If you're on just a single repository page, and you do a search there, that search actually hits just one shard. Those queries are about twice as fast as searches from the main GitHub search page."
- Search in a particular repo:
  - Use repoID as the routing field

## ELK Index

- Each Index can have up to 2,147,483,519 Documents (ref. LUCENE-5843)
- Each index can be split into multiple shards
- Each Elasticsearch shard is a separate Lucene index
- In a single cluster, you can define as many indexes as you want

## Shards

- Each index can be split into multiple shards
- An index can also be replicated zero (meaning no replicas) or more times
- Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards)
- The number of primary shards in an index is fixed at the time the index is created
  - In Elasticsearch > 5.0 you can change the number of shards for an existing index using the _shrink and _split APIs; however, this is not a trivial task, and pre-planning for the correct number of shards is the optimal approach
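As a sketch of the _shrink/_split point above (the index names here are hypothetical): before a split, the source index must be made read-only, and the target primary-shard count must be a multiple of the source's.

```http request
# Block writes on the source index (required before a split)
PUT /my-index-000001/_settings
{
  "settings": { "index.blocks.write": true }
}

# Split into a new index whose shard count is a multiple of the original
POST /my-index-000001/_split/my-index-000001-split
{
  "settings": { "index.number_of_shards": 4 }
}
```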
- The number of replica shards may be up to (total number of nodes - 1)
- A document [is routed](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html) to a particular shard in an index using the following formulas:
  - routing_factor = num_routing_shards / num_primary_shards
    - num_routing_shards - virtual shards, like virtual nodes in Cassandra
    - num_primary_shards - real shards in the index
  - shard_num = (hash(_routing) % num_routing_shards) / routing_factor
- Recommended primary shard size < 40Gb

## Nodes

- Each Node should have no more than 600 Shards (https://discuss.elastic.co/t/how-many-indices-can-be-created/140226)
- A replica shard is never allocated on the same Node as the primary shard it was copied from

## Documents

A document [is routed](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html) to a particular shard in an index using the following formulas:

```javascript
routing_factor = num_routing_shards / num_primary_shards
shard_num = (hash(_routing) % num_routing_shards) / routing_factor
```

- The default _routing value is the document's _id. Custom routing patterns can be implemented by specifying a custom routing value per document:
```javascript
PUT my-index-000001/_doc/1?routing=user1&refresh=true
{
  "title": "This is a document"
}

GET my-index-000001/_doc/1?routing=user1
```

- With a good hash function, the data will distribute itself roughly equally (short of a pathological distribution of document IDs).
- The hash function used by Elasticsearch (murmur3) has good distribution qualities.
- The hash function is used for routing so that the server always knows which shard a document is in, given the document ID and the number of shards in the index.
This is why you cannot change the number of shards, except for shrinking to a divisor of the original number of shards.

___
# How to index a very big log
- Split the log into parts (interval between two dates - month, week)
- Create a separate index for every interval
- If you need to query all parts, execute the queries in parallel and aggregate the results

---
# Shard allocation across Nodes

- ShardsAllocator figures out where to place shards
  - The ShardsAllocator is an interface in Elasticsearch whose implementations are responsible for shard placement. When shards are unassigned for any reason, ShardsAllocator decides on which nodes in the cluster to place them.
- ShardsAllocator engages to determine shard locations in the following conditions:
  - Index creation – when you add an index to your cluster (or restore an index from a snapshot), ShardsAllocator decides where to place its shards. When you increase the replica count for an index, it decides locations for the new replica copies.
  - Node failure – if a node drops out of the cluster, ShardsAllocator figures out where to place the shards that were on that node.
  - Cluster resize – if nodes are added to or removed from the cluster, ShardsAllocator decides how to rebalance the cluster.
  - Disk high water mark – when disk usage on a node hits the high water mark (90% full, by default), Elasticsearch engages ShardsAllocator to move shards off that node.
  - Manual shard routing – when you manually route shards, ShardsAllocator also moves other shards to ensure that the cluster stays balanced.
  - Routing-related setting updates – when you change cluster or index settings that affect shard routing, such as allocation awareness, excluding or including a node (by IP or node attribute), or filtering indexes to include/exclude specific nodes.
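The manual shard routing case listed above can be sketched with a reroute request (the index and node names here are hypothetical); after such a move, the allocator may relocate other shards to keep the cluster balanced:

```http request
POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "my-index-000001",
        "shard": 0,
        "from_node": "node-1",
        "to_node": "node-2"
      }
    }
  ]
}
```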
- Shard placement strategy can be broken into two stages:
  - which shard to act on
  - which target node to place it on
- The default Elasticsearch implementation, BalancedShardsAllocator, divides its responsibilities into three major parts:
  - allocate unassigned shards
  - move shards
  - rebalance shards

---
# Cluster health

An Elasticsearch cluster may consist of a single node with a single index. Or it may have a hundred data nodes, three dedicated masters, a few dozen client nodes - all operating on a thousand indices (and tens of thousands of shards).

No matter the scale of the cluster, you'll want a quick way to assess the status of your cluster. The Cluster Health API fills that role. You can think of it as a 10,000-foot view of your cluster. It can reassure you that everything is all right, or alert you to a problem somewhere in your cluster.

Let's execute the cluster-health API and see what the response looks like:

```http request
GET _cluster/health
```

```json lines
{
  "cluster_name": "elasticsearch_heeus",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 10,
  "active_shards": 10,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0
}
```
The most important piece of information in the response is the status field. The status may be one of three values:
- green
  - All primary and replica shards are allocated. Your cluster is 100% operational.
- yellow
  - All primary shards are allocated, but at least one replica is missing. No data is missing, so search results will still be complete. However, your high availability is compromised to some degree. If more shards disappear, you might lose data. Think of yellow as a warning that should prompt investigation.
- red
  - At least one primary shard (and all of its replicas) is missing. This means that you are missing data: searches will return partial results, and indexing into that shard will return an exception.

## Find problematic indices

One day everything goes wrong:
```json lines
{
  "cluster_name": "elasticsearch_heeus",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 8,
  "number_of_data_nodes": 8,
  "active_primary_shards": 90,
  "active_shards": 180,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 20
}
```
We see that not all nodes are in operation (suppose there were 10 of them) and that a total of 20 shards are unassigned. This information is not enough to decide how to restore functionality, so we ask cluster-health for a little more information by using the level parameter:
```http request
GET _cluster/health?level=indices
```
Request result:
```json lines
{
  "cluster_name": "elasticsearch_heeus",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 8,
  "number_of_data_nodes": 8,
  "active_primary_shards": 90,
  "active_shards": 180,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 20,
  "indices": {
    "v1": {
      "status": "green",
      "number_of_shards": 10,
      "number_of_replicas": 1,
      "active_primary_shards": 10,
      "active_shards": 20,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 0
    },
    "v2": {
      "status": "red",
      "number_of_shards": 10,
      "number_of_replicas": 1,
      "active_primary_shards": 0,
      "active_shards": 0,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 20
    },
    "v3": {
      "status": "green",
      "number_of_shards": 10,
      "number_of_replicas": 1,
      "active_primary_shards": 10,
      "active_shards": 20,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 0
    },
    ... and more indices ...
  }
}
```
We can now see that the v2 index is the one that has made the cluster red, and all 20 unassigned shards are from this index.

## Waiting for green status

You can specify a wait_for_status parameter, which will cause the call to return only after the status is satisfied. For example:
```http request
GET _cluster/health?wait_for_status=green
```
- This call will block (not return control to your program) until the cluster health has turned green, meaning all primary and replica shards have been allocated.
- Useful for automating index creation in a cluster (when adding documents to an index immediately after creating it, the index may not have been fully initialized yet)

## ELK on one Node

Elasticsearch will never assign a replica to the same node as the primary shard, so if you only have one node it is perfectly normal and expected for your cluster to indicate yellow. If you feel better about it being green, then change the number of replicas on each index to 0:

```json lines
PUT /index-name/_settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}
```

## Recovery

If a primary shard fails, the master promotes one of the active in-sync replicas to become the new primary.
- If there are currently no active in-sync replicas, it waits until one appears, and your cluster health reports as RED.
- However, if all the in-sync copies of your data are permanently lost, this cannot be done: you have, by definition, lost data.
  - And now we have a shard in a corrupted state.
- Ideally we would want to bring it back to life and run a reindex on the missing documents.
- To bring it back to life, run the following command:
```json lines
POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "index_name_masked",
        "shard": 0,
        "node": "*.*.*.*",
        "accept_data_loss": true
      }
    }
  ]
}
```

---
# Multi-Tenancy with Elasticsearch and OpenSearch

[Multi-tenancy](https://blog.bigdataboutique.com/2022/11/multi-tenancy-with-elasticsearch-and-opensearch-c1047b) refers to having multiple users or clients (i.e. tenants) with disparate sets of data stored within the same Elasticsearch cluster. The main reason for wanting to keep multiple tenants on one single cluster is to reduce infrastructure complexity and keep costs lower by sharing resources. Of course, that is not always the best, or even a possible, solution - for example, when data is sensitive or isolation is required for other reasons such as compliance. Below are different methods of implementing proper multi-tenancy in ELK.
- Silos Method - exclusive indices per tenant
  - pros
    - You don't have to worry about things like field mapping collisions
    - Data purges can be done by simply deleting all indices matching a name pattern
    - Simple to set different security and management policies per tenant
  - cons
    - potentially too many indices for the cluster resources
    - many small indices
- Pool Method - shared indices between tenants (Elasticsearch has a feature called filtered aliases that allows you to use an alias on a subset of the data within indices. You can use a filtered alias to, for example, give you all the data within a group of indices where "tenant ID" equals "CompanyA".)
  - pros
    - obvious resource economy
  - cons
    - mapping problems (the same field names used differently by different tenants)
    - large numbers of fields within an index
    - the same sharding and replication settings for all tenants
    - high cardinality and, as a result, low cache efficiency
- Hybrid Method - one Pool and then some Silos
  - Tenants use shared indices with filtered aliases to isolate the respective data
  - Field names are managed internally by combining similar fields between tenants
  - cons
    - very complex to implement and support

___
# Dashboards

- Kibana
  - pros
    - large community
    - advanced visualisation and analytics options
    - provides dashboards and reporting (has an Elastic query builder)
  - cons
    - very cumbersome and difficult
    - high resource requirements
    - some features may require a paid subscription to access
- ElasticVue
  - pros
    - lightweight and simple/easy to use
    - intuitive data exploration and visualisation
    - open source with no licensing restrictions
  - cons
    - limited functionality compared to Kibana
    - relatively new platform with a smaller community and support

___
# Is Logstash needed?

- Logstash
  - cannot be used in cluster mode
  - if you need deduplication, use a unique ID for each Elastic document

___
# [Licensing Restrictions](https://dattell.com/data-architecture-blog/opensearch-vs-elasticsearch/)
- OpenSearch is free under the Apache License, Version 2.0. This is an extremely permissive license, under which users can modify, distribute, and sublicense the original code. No restrictions are set on the original code except that the source code contributors cannot be held liable by end users for any reason.
- Elasticsearch, with ELv2 and SSPL: you'll need to be careful what your product does.
For instance, the SSPL states: "If you make the functionality of the Program or a modified version available to third parties as a service, you must make the Service Source Code available via network download to everyone at no charge…" That's a real concern for companies using Elasticsearch as part of their product(s).

Elasticsearch [licensing FAQ](https://www.elastic.co/pricing/faq/licensing)