github.com/voedger/voedger@v0.0.0-20240520144910-273e84102129/design/0-inv/20230207-elk-basics/README.md

     1  # ELK Basics
     2  
     3  Motivation
     4  - dev.heeus.io: [ELK Basics](https://dev.heeus.io/launchpad/#!26205)
     5  
     6  # Content
     7  
     8  - [ELK Concepts](#main-elk-concepts)
     9    - [Github](#github-as-example)
    10    - [ELK Index](#elk-index)
    11    - [Shards](#shards)
    12    - [Nodes](#nodes)
    13    - [Documents](#documents)
    14  - [Shard allocation across Nodes](#shard-allocation-across-nodes)  
    15  - [Cluster health](#cluster-health)
    16    - [Find problematic indices](#find-problematic-indices)
    17    - [Waiting for green status](#waiting-for-green-status)
    18    - [ELK on one Node](#elk-on-one-node)
    19    - [Recovery](#recovery)
    20  
    21  - [Multi-Tenancy with Elasticsearch and OpenSearch](#multi-tenancy-with-elasticsearch-and-opensearch)
    22  - [Dashboards](#dashboards)
    23  - [Licensing Restrictions](#licensing-restrictions)
    24  ---
    25  # Main ELK Concepts
    26  
```mermaid
erDiagram
   Cluster ||--o{ Index: "unlimited???"
   Index ||..o{ Document : "up to 2,147,483,519 * NumOfPrimaryShards"
   Index {
      string IndexID
      int NumOfPrimaryShards
   }

   Document }|..|| PrimaryShard : "belongs to"
   PrimaryShard ||..|| Shard: is
   Replica ||..|| Shard: is
   Shard ||..|| LuceneIndex: is
   Index ||--|{ PrimaryShard : "divided into fixed number of"
   Node ||..|{ Shard : "should have < 600???"
   PrimaryShard ||--o{ Replica : "has flexible number of"
```
    44  
    45  **Github as an example:**
    46  
```mermaid
erDiagram
   GithubIndex {
      string RepoID
   }
   GithubIndex ||--o{ PrimaryShard : "has 128"
   PrimaryShard ||..|| Size : "is around 120Gb in"
```
    55  
    56  - https://www.elastic.co/customers/github
- In GitHub’s main Elasticsearch cluster, they have about 128 shards, with each shard storing about 120 gigabytes.
    58  - To optimize search within a single repository, GitHub uses the Elasticsearch [routing parameter](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html) based on the repository ID. "That allows us to put all the source code for a single repository on one shard," says Pease. "If you're on just a single repository page, and you do a search there, that search actually hits just one shard. Those queries are about twice as fast as searches from the main GitHub search page."
- Search in a particular repo:
   - Use the repoID as the routing field (see the sketch below)
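
A minimal sketch of such a routed search (the index name `github-code`, the `content` field, and the routing value `repo-12345` are assumptions for illustration, not from the case study): the `routing` parameter makes the query hit only the shard that holds that repository's documents.

```http request
# routed search: only the shard owning routing value "repo-12345" is queried
GET github-code/_search?routing=repo-12345
{
  "query": {
    "match": { "content": "parseConfig" }
  }
}
```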
    61  
    62  ## ELK Index
    63  
    64  - Each Index can have up to 2,147,483,519 Documents (ref. LUCENE-5843)
    65  - Each index can be split into multiple shards
    66  - Each Elasticsearch shard is a separate Lucene index.
    67  - In a single cluster, you can define as many indexes as you want.
    68  
    69  ## Shards
    70  
    71  - Each index can be split into multiple shards
- An index can also be replicated zero or more times (zero meaning no replicas).
    73  - Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards).
- The number of primary shards in an index is fixed at the time the index is created (see the sketch after this list)
   - In Elasticsearch > 5.0 you can change the number of shards for an existing index using the _shrink and _split APIs; however, this is not a trivial task, and pre-planning the correct number of shards is the optimal approach.
- The number of replica shards per primary may be at most (total number of nodes - 1)
- A document [is routed](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html) to a particular shard in an index using the following formulas:
   - routing_factor = num_routing_shards / num_primary_shards
      - num_routing_shards - virtual shards, similar to virtual nodes in Cassandra
      - num_primary_shards - the real (primary) shards in the index
   - shard_num = (hash(_routing) % num_routing_shards) / routing_factor
- Recommended primary shard size: < 40 GB
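
Because the number of primary shards is fixed at creation time, it is normally set explicitly when the index is created. A minimal sketch (the index name is made up for illustration):

```http request
# create an index with 3 primary shards and 1 replica per primary
PUT my-logs-index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```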
    83  
    84  ## Nodes
    85  
- Each node should have no more than 600 shards (https://discuss.elastic.co/t/how-many-indices-can-be-created/140226)
- A replica shard is never allocated on the same node as the primary shard it was copied from
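
The `_cat/allocation` API is a quick way to check how many shards each node currently holds; it reports the shard count and disk usage per node:

```http request
# one row per node: shard count, disk usage, host, node name
GET _cat/allocation?v
```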
    88  
    89  ## Documents
    90  
    91  A document [is routed](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html) to a particular shard in an index using the following formulas:
    92  
    93  ```javascript
    94  routing_factor = num_routing_shards / num_primary_shards
    95  shard_num = (hash(_routing) % num_routing_shards) / routing_factor
    96  ```
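
As a worked example of the formula (numbers are made up): with num_primary_shards = 5 and num_routing_shards = 30, routing_factor = 6, so a routing value whose hash modulo 30 is 23 lands in shard 23 / 6 = 3 (integer division). You can also ask Elasticsearch which shards a given routing value resolves to with the `_search_shards` API, here reusing the index and routing value from the custom-routing example below:

```http request
# shows the shard that routing value "user1" maps to
GET my-index-000001/_search_shards?routing=user1
```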
    97  
- The default _routing value is the document’s _id. Custom routing patterns can be implemented by specifying a custom routing value per document:
```http request
PUT my-index-000001/_doc/1?routing=user1&refresh=true
{
  "title": "This is a document"
}

GET my-index-000001/_doc/1?routing=user1
```
   107  
   108  - With a good hash function, the data will distribute itself roughly equally (short of a pathological distribution of document IDs). 
   - The hash function used by Elasticsearch (murmur3) has good distribution qualities.
   - The hash function is used for routing so that the server always knows which shard a document is in, given the document ID and the number of shards in the index. This is why you cannot change the number of shards, except for shrinking the number of shards to a divisor of the original number of shards.
   111    
   112  ___
# How to index a very big log
- Split the log into parts by time interval (e.g. month or week)
- Create a separate index for every interval
- If you need to query across all parts, execute the queries in parallel and aggregate the results (see the sketch below)
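
A minimal sketch of this pattern, assuming hypothetical monthly indices named `logs-2023-01`, `logs-2023-02`, ... with a `@timestamp` field: a search against the index pattern `logs-2023-*` fans out to the shards of all matching indices in parallel, and Elasticsearch merges the results.

```http request
# one index per month (names and field are assumptions for illustration)
PUT logs-2023-01
PUT logs-2023-02

# query all monthly indices at once via an index pattern
GET logs-2023-*/_search
{
  "query": {
    "range": {
      "@timestamp": { "gte": "2023-01-01", "lt": "2023-03-01" }
    }
  }
}
```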
   117  
   118  ---
   119  # Shard allocation across Nodes
   120  
   121  - ShardsAllocator figures out where to place shards
   122     - The ShardsAllocator is an interface in Elasticsearch whose implementations are responsible for shard placement. When shards are unassigned for any reason, ShardsAllocator decides on which nodes in the cluster to place them.
   - ShardsAllocator engages to determine shard locations in the following conditions:
      - Index creation – when you add an index to your cluster (or restore an index from a snapshot), ShardsAllocator decides where to place its shards. When you increase the replica count for an index, it decides locations for the new replica copies.
      - Node failure – if a node drops out of the cluster, ShardsAllocator figures out where to place the shards that were on that node.
      - Cluster resize – if nodes are added to or removed from the cluster, ShardsAllocator decides how to rebalance the cluster.
      - Disk high water mark – when disk usage on a node hits the high water mark (90% full, by default), Elasticsearch engages ShardsAllocator to move shards off that node.
      - Manual shard routing – when you manually route shards, ShardsAllocator also moves other shards to ensure that the cluster stays balanced (see the reroute sketch below).
      - Routing-related setting updates – when you change cluster or index settings that affect shard routing, such as allocation awareness, excluding or including a node (by IP or node attribute), or filtering indices to include/exclude specific nodes.
- Shard placement strategy can be broken into two stages:
   - which shard to act on
   - which target node to place it on
- The default Elasticsearch implementation, BalancedShardsAllocator, divides its responsibilities into three major parts:
   - allocate unassigned shards
   - move shards
   - rebalance shards
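
For the manual shard routing case mentioned above, a minimal sketch of moving a single shard with the cluster reroute API (index and node names are made up for illustration); after the explicit command is applied, the allocator may still relocate other shards to keep the cluster balanced:

```http request
# explicitly move shard 0 of a hypothetical index between two nodes
POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "my-logs-index",
        "shard": 0,
        "from_node": "node-1",
        "to_node": "node-2"
      }
    }
  ]
}
```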
   137  
   138  
   139  
   140  
   141  ---
   142  # Cluster health
   143  
An Elasticsearch cluster may consist of a single node with a single index. Or it may have a hundred data nodes, three dedicated masters, and a few dozen client nodes, all operating on a thousand indices (and tens of thousands of shards).
   145  
   146  No matter the scale of the cluster, you’ll want a quick way to assess the status of your cluster. The Cluster Health API fills that role. You can think of it as a 10,000-foot view of your cluster. It can reassure you that everything is all right, or alert you to a problem somewhere in your cluster.
   147  
   148  Let’s execute a cluster-health API and see what the response looks like:
   149  
   150  ```http request
   151  GET _cluster/health
   152  ```
   153  
   154  ```json lines
   155  {
   156     "cluster_name": "elasticsearch_heeus",
   157     "status": "green",
   158     "timed_out": false,
   159     "number_of_nodes": 1,
   160     "number_of_data_nodes": 1,
   161     "active_primary_shards": 10,
   162     "active_shards": 10,
   163     "relocating_shards": 0,
   164     "initializing_shards": 0,
   165     "unassigned_shards": 0
   166  }
   167  ```
   168  The most important piece of information in the response is the status field. The status may be one of three values:
   169  - green
   170     - All primary and replica shards are allocated. Your cluster is 100% operational.
   171  - yellow
   172     - All primary shards are allocated, but at least one replica is missing. No data is missing, so search results will still be complete. However, your high availability is compromised to some degree. If more shards disappear, you might lose data. Think of yellow as a warning that should prompt investigation.
   173  - red
   174     - At least one primary shard (and all of its replicas) is missing. This means that you are missing data: searches will return partial results, and indexing into that shard will return an exception.
   175                                              
   176  ## Find problematic indices
   177  
   178  One day everything goes wrong:
   179  ```json lines
   180  {
   181     "cluster_name": "elasticsearch_heeus",
   182     "status": "red",
   183     "timed_out": false,
   184     "number_of_nodes": 8,
   185     "number_of_data_nodes": 8,
   186     "active_primary_shards": 90,
   187     "active_shards": 180,
   188     "relocating_shards": 0,
   189     "initializing_shards": 0,
   190     "unassigned_shards": 20
   191  }
   192  ```
We see that not all the nodes are in operation (suppose there were 10 of them) and we have a total of 20 unassigned shards. This information is not enough to make a decision on how to restore functionality, so we need to ask cluster-health for a little more information by using the level parameter:
   194  ```http request 
   195  GET _cluster/health?level=indices 
   196  ```
Request result:
   198  ```json lines
   199  
   200  {
   201     "cluster_name": "elasticsearch_heeus",
   202     "status": "red",
   203     "timed_out": false,
   204     "number_of_nodes": 8,
   205     "number_of_data_nodes": 8,
   206     "active_primary_shards": 90,
   207     "active_shards": 180,
   208     "relocating_shards": 0,
   209     "initializing_shards": 0,
   "unassigned_shards": 20,
   "indices": {
   212        "v1": {
   213           "status": "green",
   214           "number_of_shards": 10,
   215           "number_of_replicas": 1,
   216           "active_primary_shards": 10,
   217           "active_shards": 20,
   218           "relocating_shards": 0,
   219           "initializing_shards": 0,
   220           "unassigned_shards": 0
   221        },
   222        "v2": {
   223           "status": "red", 
   224           "number_of_shards": 10,
   225           "number_of_replicas": 1,
   226           "active_primary_shards": 0,
   227           "active_shards": 0,
   228           "relocating_shards": 0,
   229           "initializing_shards": 0,
   230           "unassigned_shards": 20 
   231        },
   232        "v3": {
   233           "status": "green",
   234           "number_of_shards": 10,
   235           "number_of_replicas": 1,
   236           "active_primary_shards": 10,
   237           "active_shards": 20,
   238           "relocating_shards": 0,
   239           "initializing_shards": 0,
   240           "unassigned_shards": 0
   241        },
   242        And more and more...
   243     }
   244  }
   245  ```
We can now see that the v2 index is the one that made the cluster red, and all 20 unassigned shards are from this index.
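
A reasonable next step is to ask why those shards are unassigned. A minimal sketch using the cluster allocation explain API for one of the v2 primaries (shard 0 chosen arbitrarily): the response describes the unassigned reason and, for each node, why the shard can or cannot be allocated there.

```http request
GET _cluster/allocation/explain
{
  "index": "v2",
  "shard": 0,
  "primary": true
}
```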
   247                                
   248  ## Waiting for green status
   249  
   250  
You can specify a wait_for_status parameter; the call will only return after the requested status is reached. For example:
   252  ```http request
   253  GET _cluster/health?wait_for_status=green
   254  ```
   255  - This call will block (not return control to your program) until the cluster-health has turned green, meaning all primary and replica shards have been allocated.
- Useful for automating index creation in a cluster: if you add documents immediately after creating an index, the index may not yet be fully initialized (see the sketch below)
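
A minimal sketch of that pattern, with a made-up index name: create the index, then block on the per-index health endpoint until it is green. If the status is not reached within the timeout, the call returns with "timed_out": true.

```http request
# create the index (name is an assumption for illustration)
PUT my-new-index

# block until this index is green, or give up after 30 seconds
GET _cluster/health/my-new-index?wait_for_status=green&timeout=30s
```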
   257  
   258  ## ELK on one Node
   259  
Elasticsearch will never assign a replica to the same node as the primary shard, so if you only have one node it is perfectly normal and expected for your cluster to report yellow. If you would feel better about it being green, change the number of replicas on each index to 0.
   261                                                                                               
   262  ```json lines
   263  PUT /index-name/_settings
   264  {
   265      "index" : {
   266          "number_of_replicas" : 0
   267      }
   268  }
   269  ```
   270  
   271  ## Recovery
   272  
If a primary shard fails, the master promotes one of the active in-sync replicas to become the new primary.
- If there are currently no active in-sync replicas, it waits until one appears, and your cluster health reports as RED.
   - However, if all the in-sync copies of your data are permanently lost, this cannot be done: you have, by definition, lost data.
   - The shard is now left in a corrupted state.
      - Ideally we would want to bring it back to life and run a reindex on the missing documents.
      - To bring it back to life, run the following command:
```json lines
POST _cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "index_name_masked",
        "shard": 0,
        "node": "*.*.*.*",
        "accept_data_loss": true
      }
    }
  ]
}
```
   294   
   295  ---   
   296  # Multi-Tenancy with Elasticsearch and OpenSearch
   297  
[Multi-tenancy](https://blog.bigdataboutique.com/2022/11/multi-tenancy-with-elasticsearch-and-opensearch-c1047b) refers to having multiple users or clients (e.g. tenants) with disparate sets of data stored within the same Elasticsearch cluster. The main reason for wanting to keep multiple tenants on one single cluster is to reduce infrastructure complexity and keep costs lower by sharing resources. Of course, that is not always the best, or even a possible, solution - for example when data is sensitive, or isolation is required for other reasons such as compliance. Below are the different methods for implementing proper multi-tenancy in Elasticsearch.
- Silos Method - exclusive indices per tenant
   - pros
      - You don’t have to worry about things like field mapping collisions
      - Data purges can be done by simply deleting all indices matching a condition in the name
      - Simple to set different security and management policies per tenant
   - cons
      - Potentially too many indices for the cluster resources
      - Many small indices
- Pool Method - shared indices between tenants. Elasticsearch has a feature called filtered aliases that allows you to use an alias on a subset of the data within indices. You can use a filtered alias to, for example, give you all the data within a group of indices where “tenant ID” equals “CompanyA” (see the sketch after this list).
   - pros
      - Obvious resource economy
   - cons
      - Mapping problems (the same field names used by different tenants)
      - A large number of fields within the index
      - The same sharding and replication settings for all tenants
      - High cardinality and, as a result, low cache efficiency
- Hybrid Method - one Pool and then some Silos
   - Tenants use shared indices with filtered aliases to isolate the respective data
   - Field names are managed internally by combining similar fields between tenants
   - cons
      - Very complex to implement and support
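
A minimal sketch of the filtered alias mentioned in the Pool Method, with made-up index, alias, and field names: the alias exposes only the documents whose tenant_id matches, and queries against the alias look like queries against a normal index. The optional routing value additionally pins each tenant's documents and searches to a single shard, similar to the per-repository routing in the GitHub example above.

```http request
# create a filtered alias for one tenant (names are assumptions for illustration)
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "shared-logs",
        "alias": "tenant-companya",
        "filter": { "term": { "tenant_id": "CompanyA" } },
        "routing": "CompanyA"
      }
    }
  ]
}

# searches through the alias see only CompanyA's documents
GET tenant-companya/_search
```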
   319                                       
   320  ___
   321  # Dashboards
   322  
- Kibana
   - pros
      - Large community
      - Advanced visualisation and analytics options
      - Provides dashboards and reporting (has an Elasticsearch query builder)
   328     - cons
   329        - Very cumbersome and difficult
   330        - High resource requirements
   331        - Some features may require a paid subscription to access
   332  - ElasticVue
   333     - pros
   334        - lightweight and simple/easy to use
   335        - Intuitive data exploration and visualisation
   336        - Open source with no licensing restrictions
   337     - cons
   338        - Limited functionality compared to Kibana
      - Relatively new platform with a smaller community and support
   340          
   341  ___
# Is Logstash needed?
   343  
   344  - Logstash
  - Cannot be used in cluster mode
  - If you need deduplication, use a unique ID for the Elasticsearch document (see the sketch below)
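
A minimal sketch of the deduplication idea, with made-up index, ID, and field values: index each event under a deterministic _id (for example, a hash of its content), so re-processing the same event overwrites the existing document instead of creating a duplicate.

```http request
# the document ID is a hash of the event content (hypothetical value)
PUT logs/_doc/9f8c2b7e1a
{
  "message": "connection refused",
  "source": "app-1"
}
```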
   347  
   348  ___            
   349  # [Licensing Restrictions](https://dattell.com/data-architecture-blog/opensearch-vs-elasticsearch/)
- OpenSearch is free under the Apache License, Version 2.0. This license is extremely permissive: users can modify, distribute, and sublicense the original code. No restrictions are set on the original code except that the source code contributors cannot be held liable by end users for any reason.
- Elasticsearch is licensed under ELv2 and SSPL, so you’ll need to be careful about what your product does. For instance, the SSPL states “If you make the functionality of the Program or a modified version available to third parties as a service, you must make the Service Source Code available via network download to everyone at no charge…” That’s a real concern for companies using Elasticsearch as part of their product(s).
- Elasticsearch [licensing FAQ](https://www.elastic.co/pricing/faq/licensing)
   353