---
layout: docs
page_title: server Stanza - Agent Configuration
description: |-
  The "server" stanza configures the Nomad agent to operate in server mode to
  participate in scheduling decisions, register with service discovery, handle
  join failures, and more.
---

# `server` Stanza

<Placement groups={['server']} />

The `server` stanza configures the Nomad agent to operate in server mode to
participate in scheduling decisions, register with service discovery, handle
join failures, and more.

```hcl
server {
  enabled          = true
  bootstrap_expect = 3

  server_join {
    retry_join     = [ "1.1.1.1", "2.2.2.2" ]
    retry_max      = 3
    retry_interval = "15s"
  }
}
```

## `server` Parameters

- `authoritative_region` `(string: "")` - Specifies the authoritative region, which
  provides a single source of truth for global configurations such as ACL Policies and
  global ACL tokens. Non-authoritative regions will replicate from the authoritative
  region to act as a mirror. By default, the local region is assumed to be
  authoritative.

- `bootstrap_expect` `(int: required)` - Specifies the number of server nodes to
  wait for before bootstrapping. It is most common to use the odd-numbered
  integers `3` or `5` for this value, depending on the cluster size. A value of
  `1` does not provide any fault tolerance and is not recommended for production
  use cases.

- `data_dir` `(string: "[data_dir]/server")` - Specifies the directory to use
  for server-specific data, including the replicated log. By default, this is
  the top-level [data_dir](/docs/configuration#data_dir) suffixed with "server",
  like `"/opt/nomad/server"`. The top-level option must be set, even when
  setting this value. This must be an absolute path.

- `enabled` `(bool: false)` - Specifies if this agent should run in server mode.
  All other server options depend on this value being set.

- `enabled_schedulers` `(array<string>: [all])` - Specifies which sub-schedulers
  this server will handle. This can be used to restrict the evaluations that
  worker threads will dequeue for processing.

- `enable_event_broker` `(bool: true)` - Specifies if this server will generate
  events for its event stream.

- `encrypt` `(string: "")` - Specifies the secret key to use for encryption of
  Nomad server's gossip network traffic. This key must be 32 bytes that are
  [RFC4648] "URL and filename safe" base64-encoded. You can generate an
  appropriately-formatted key with the [`nomad operator keygen`] command. The
  provided key is automatically persisted to the data directory and loaded
  automatically whenever the agent is restarted. This means that to encrypt
  Nomad server's gossip protocol, this option only needs to be provided once
  on each agent's initial startup sequence. If it is provided after Nomad has
  been initialized with an encryption key, then the provided key is ignored
  and a warning will be displayed. See the [encryption
  documentation][encryption] for more details on this option and its impact on
  the cluster. A minimal configuration sketch appears under [Gossip
  Encryption](#gossip-encryption) below.

- `event_buffer_size` `(int: 100)` - Specifies the number of events generated
  by the server to be held in memory. Increasing this value enables new
  subscribers to have a larger look-back window when initially subscribing.
  Decreasing it lowers the amount of memory used for the event buffer.

- `node_gc_threshold` `(string: "24h")` - Specifies how long a node must be in a
  terminal state before it is garbage collected and purged from the system. This
  is specified using a label suffix like "30s" or "1h".

- `job_gc_interval` `(string: "5m")` - Specifies the interval between the job
  garbage collections. Only jobs that have been terminal for at least
  `job_gc_threshold` will be collected. Lowering the interval will perform more
  frequent but smaller collections. Raising the interval will perform collections
  less frequently but collect more jobs at a time. Reducing this interval is
  useful if there is a large throughput of tasks, leading to a large set of
  dead jobs. This is specified using a label suffix like "30s" or "3m".
  `job_gc_interval` was introduced in Nomad 0.10.0.

- `job_gc_threshold` `(string: "4h")` - Specifies the minimum time a job must be
  in the terminal state before it is eligible for garbage collection. This is
  specified using a label suffix like "30s" or "1h".

- `eval_gc_threshold` `(string: "1h")` - Specifies the minimum time an
  evaluation must be in the terminal state before it is eligible for garbage
  collection. This is specified using a label suffix like "30s" or "1h".

- `deployment_gc_threshold` `(string: "1h")` - Specifies the minimum time a
  deployment must be in the terminal state before it is eligible for garbage
  collection. This is specified using a label suffix like "30s" or "1h".

- `csi_volume_claim_gc_threshold` `(string: "1h")` - Specifies the minimum age of
  a CSI volume before it is eligible to have its claims garbage collected.
  This is specified using a label suffix like "30s" or "1h".

- `csi_plugin_gc_threshold` `(string: "1h")` - Specifies the minimum age of a
  CSI plugin before it is eligible for garbage collection if not in use.
  This is specified using a label suffix like "30s" or "1h".

- `acl_token_gc_threshold` `(string: "1h")` - Specifies the minimum age of an
  expired ACL token before it is eligible for garbage collection. This is
  specified using a label suffix like "30s" or "1h".

- `default_scheduler_config` <code>([scheduler_configuration][update-scheduler-config]:
  nil)</code> - Specifies the initial default scheduler config when
  bootstrapping the cluster. The parameter is ignored once the cluster is
  bootstrapped or the value is updated through the [API
  endpoint][update-scheduler-config]. See [the example
  section](#configuring-scheduler-config) for more details.
  `default_scheduler_config` was introduced in Nomad 0.10.4.

- `heartbeat_grace` `(string: "10s")` - Specifies the additional time given
  beyond the heartbeat TTL of Clients to account for network and processing
  delays and clock skew. This is specified using a label suffix like "30s" or
  "1h". See [Client Heartbeats](#client-heartbeats) below for details.

- `min_heartbeat_ttl` `(string: "10s")` - Specifies the minimum time between
  Client heartbeats. This is used as a floor to prevent excessive updates. This
  is specified using a label suffix like "30s" or "1h". See [Client
  Heartbeats](#client-heartbeats) below for details.

- `failover_heartbeat_ttl` `(string: "5m")` - The time by which all Clients
  must heartbeat after a Server leader election. This is specified using a label
  suffix like "30s" or "1h". See [Client Heartbeats](#client-heartbeats) below
  for details.

- `max_heartbeats_per_second` `(float: 50.0)` - Specifies the maximum target
  rate of heartbeats being processed per second. This allows the TTL to be
  increased to meet the target rate. See [Client
  Heartbeats](#client-heartbeats) below for details.

- `non_voting_server` `(bool: false)` - (Enterprise-only) Specifies whether
  this server will act as a non-voting member of the cluster to help provide
  read scalability.

- `num_schedulers` `(int: [num-cores])` - Specifies the number of parallel
  scheduler threads to run. This can be as many as one per core, or `0` to
  disallow this server from making any scheduling decisions. This defaults to
  the number of CPU cores.

- `license_path` `(string: "")` - Specifies the path to load a Nomad Enterprise
  license from. This must be an absolute path (`/opt/nomad/license.hclic`). The
  license can also be provided by setting the `NOMAD_LICENSE_PATH` environment
  variable to a license file path, or by setting `NOMAD_LICENSE` to the entire
  license value. `license_path` has the highest precedence, followed by
  `NOMAD_LICENSE` and then `NOMAD_LICENSE_PATH`.

- `plan_rejection_tracker` <code>([PlanRejectionTracker](#plan_rejection_tracker-parameters))</code> -
  Configuration for the plan rejection tracker that the Nomad leader uses to
  track the history of plan rejections.

- `raft_boltdb` - This is a nested object that allows configuring options for
  Raft's BoltDB-based log store. A minimal sketch of this block appears after
  this parameter list.
    - `no_freelist_sync` - Setting this to `true` will disable syncing the BoltDB
      freelist to disk within the `raft.db` file. Not syncing the freelist to disk
      will reduce disk IO required for write operations at the expense of longer
      server startup times.

- `raft_protocol` `(int: 3)` - Specifies the Raft protocol version to use when
  communicating with other Nomad servers. This affects available Autopilot
  features and is typically not required as the agent internally knows the
  latest version, but may be useful in some upgrade scenarios. Must be `3` in
  Nomad v1.4 or later.

- `raft_multiplier` `(int: 1)` - An integer multiplier used by Nomad servers to
  scale key Raft timing parameters. Omitting this value or setting it to 0 uses
  default timing described below. Lower values are used to tighten timing and
  increase sensitivity while higher values relax timings and reduce sensitivity.
  Tuning this affects the time it takes Nomad to detect leader failures and to
  perform leader elections, at the expense of requiring more network and CPU
  resources for better performance. The maximum allowed value is 10.

  By default, Nomad will use the highest-performance timing, currently equivalent
  to setting this to a value of 1. Increasing the timings makes leader election
  less likely during periods of networking issues or resource starvation. Since
  leader elections pause Nomad's normal work, it may be beneficial for slow or
  unreliable networks to wait longer before electing a new leader. The tradeoff
  when raising this value is that during network partitions or other events
  (server crash) where a leader is lost, Nomad will not elect a new leader for
  a longer period of time than the default. The [`nomad.nomad.leader.barrier` and
  `nomad.raft.leader.lastContact` metrics](/docs/operations/metrics-reference) are
  good indicators of how often leader elections occur and of Raft latency.

- `redundancy_zone` `(string: "")` - (Enterprise-only) Specifies the redundancy
  zone that this server will be a part of for Autopilot management. For more
  information, see the [Autopilot Guide](https://learn.hashicorp.com/tutorials/nomad/autopilot).

- `rejoin_after_leave` `(bool: false)` - Specifies if Nomad will ignore a
  previous leave and attempt to rejoin the cluster when starting. By default,
  Nomad treats leave as a permanent intent and does not attempt to join the
  cluster again when starting. This flag allows the previous state to be used to
  rejoin the cluster.

- `root_key_gc_interval` `(string: "10m")` - Specifies the interval between
  [encryption key][] metadata garbage collections.

- `root_key_gc_threshold` `(string: "1h")` - Specifies the minimum time that an
  [encryption key][] must exist before it can be eligible for garbage
  collection.

- `root_key_rotation_threshold` `(string: "720h")` - Specifies the minimum time
  that an [encryption key][] must exist before it is automatically rotated on
  the next garbage collection interval.

- `server_join` <code>([server_join][server-join]: nil)</code> - Specifies
  how the Nomad server will connect to other Nomad servers. The `retry_join`
  fields may directly specify the server address or use go-discover syntax for
  auto-discovery. See the [server_join documentation][server-join] for more detail.

- `upgrade_version` `(string: "")` - A custom version of the format X.Y.Z to use
  in place of the Nomad version when custom upgrades are enabled in Autopilot.
  For more information, see the [Autopilot Guide](https://learn.hashicorp.com/tutorials/nomad/autopilot).

- `search` <code>([search][search]: nil)</code> - Specifies configuration parameters
  for the Nomad search API.

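Because `raft_boltdb` is a nested block rather than a flat key, the following
minimal sketch shows how its only documented option is written inside the
`server` stanza. The value shown is illustrative, not a recommendation.

```hcl
server {
  enabled = true

  raft_boltdb {
    # Trades longer server startup times for lower disk IO on writes.
    no_freelist_sync = true
  }
}
```
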
### Deprecated Parameters

- `retry_join` `(array<string>: [])` - Specifies a list of server addresses to
  retry joining if the first attempt fails. This is similar to
  [`start_join`](#start_join), but is only invoked if the initial join attempt
  fails. The list of addresses will be tried in the order specified, until one
  succeeds. After one succeeds, no further addresses will be contacted. This is
  useful for cases where we know the address will become available eventually.
  Use `retry_join` with an array as a replacement for `start_join`, **do not use
  both options**. See the [server_join][server-join]
  section for more information on the format of the string. This field is
  deprecated in favor of the [server_join stanza][server-join]. A short
  migration sketch follows this list.

- `retry_interval` `(string: "30s")` - Specifies the time to wait between retry
  join attempts. This field is deprecated in favor of the [server_join
  stanza][server-join].

- `retry_max` `(int: 0)` - Specifies the maximum number of join attempts to be
  made before exiting with a return code of 1. By default, this is set to 0,
  which is interpreted as infinite retries. This field is deprecated in favor of
  the [server_join stanza][server-join].

- `start_join` `(array<string>: [])` - Specifies a list of server addresses to
  join on startup. If Nomad is unable to join with any of the specified
  addresses, agent startup will fail. See the [server address
  format](/docs/configuration/server_join#server-address-format)
  section for more information on the format of the string. This field is
  deprecated in favor of the [server_join stanza][server-join].

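The deprecated flat fields map directly onto the [`server_join`
stanza][server-join]. As a brief, illustrative sketch (not meant to be applied
verbatim, since a single agent would use only one of the two forms), an older
configuration and its modern equivalent look like this:

```hcl
# Deprecated flat fields
server {
  enabled    = true
  retry_join = ["1.1.1.1", "2.2.2.2"]
  retry_max  = 3
}

# Equivalent server_join stanza
server {
  enabled = true

  server_join {
    retry_join = ["1.1.1.1", "2.2.2.2"]
    retry_max  = 3
  }
}
```
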
### `plan_rejection_tracker` Parameters

The leader plan rejection tracker can be adjusted to prevent evaluations from
getting stuck because they are repeatedly scheduled onto a client that may have
an unexpected issue. Refer to [Monitoring Nomad][monitoring_nomad_progress] for
more details.

- `enabled` `(bool: false)` - Specifies if plan rejections should be tracked.

- `node_threshold` `(int: 100)` - The number of plan rejections for a node
  within the `node_window` required for the client to be set as ineligible.

- `node_window` `(string: "5m")` - The time window during which plan rejections
  for a node are considered.

If you observe too many false positives (clients being marked as ineligible
even though they don't present any problem), you may want to increase
`node_threshold`.

Conversely, if you notice jobs not being scheduled due to plan rejections for
the same `node_id` while the client is not being set as ineligible, you can try
increasing `node_window` so that more historical rejections are taken into
account.

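Putting these parameters together, a minimal sketch of a `server` stanza that
enables the tracker with the default thresholds documented above might look
like this:

```hcl
server {
  enabled = true

  plan_rejection_tracker {
    enabled        = true
    node_threshold = 100
    node_window    = "5m"
  }
}
```
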
## `server` Examples

### Common Setup

This example shows a common Nomad agent `server` configuration stanza. The two
IP addresses could also be DNS names, and should point to the other Nomad
servers in the cluster.

```hcl
server {
  enabled          = true
  bootstrap_expect = 3

  server_join {
    retry_join     = [ "1.1.1.1", "2.2.2.2" ]
    retry_max      = 3
    retry_interval = "15s"
  }
}
```

### Configuring Data Directory

This example shows configuring a custom data directory for the server data.

```hcl
server {
  data_dir = "/opt/nomad/server"
}
```

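### Gossip Encryption ((#gossip-encryption))

This sketch shows enabling gossip encryption with the [`encrypt`](#encrypt)
parameter. The value below is a placeholder; generate a real 32-byte,
base64-encoded key with [`nomad operator keygen`] and provide it on each
agent's first start.

```hcl
server {
  enabled = true

  # Placeholder only; substitute the output of `nomad operator keygen`.
  encrypt = "<32-byte base64-encoded key>"
}
```
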
### Automatic Bootstrapping

The Nomad servers can automatically bootstrap if Consul is configured. For a
more detailed explanation, please see the
[automatic Nomad bootstrapping documentation](https://learn.hashicorp.com/tutorials/nomad/clustering).

### Restricting Schedulers

This example shows restricting the schedulers that are enabled as well as the
maximum number of cores to utilize when participating in scheduling decisions:

```hcl
server {
  enabled            = true
  enabled_schedulers = ["batch", "service"]
  num_schedulers     = 7
}
```

### Bootstrapping with a Custom Scheduler Config ((#configuring-scheduler-config))

While [bootstrapping a cluster], you can use the `default_scheduler_config` stanza
to prime the cluster with a [`SchedulerConfig`][update-scheduler-config]. The
scheduler configuration determines which scheduling algorithm is configured
(spread scheduling or binpacking) and which job types are eligible for
preemption.

~> **Warning:** Once the cluster is bootstrapped, you must configure this using
the [update scheduler configuration][update-scheduler-config] API. This
option is only consulted during bootstrap.

The structure matches the [Update Scheduler Config][update-scheduler-config] API
endpoint, which you should consult for canonical documentation. However, the
attribute names must be adapted to HCL syntax by using snake case
representations rather than camel case.

This example shows configuring spread scheduling and enabling preemption for all
job-type schedulers.

```hcl
server {
  default_scheduler_config {
    scheduler_algorithm             = "spread"
    memory_oversubscription_enabled = true
    reject_job_registration         = false
    pause_eval_broker               = false # New in Nomad 1.3.2

    preemption_config {
      batch_scheduler_enabled    = true
      system_scheduler_enabled   = true
      service_scheduler_enabled  = true
      sysbatch_scheduler_enabled = true # New in Nomad 1.2
    }
  }
}
```

## Client Heartbeats ((#client-heartbeats))

~> This is an advanced topic. It is most beneficial to clusters over 1,000
   nodes or with unreliable networks or nodes (e.g., some edge deployments).

Nomad Clients periodically heartbeat to Nomad Servers to confirm they are
operating as expected. Nomad Clients that do not heartbeat in the specified
amount of time are considered `down` and their allocations are marked as `lost`
or `disconnected` (if [`max_client_disconnect`][max_client_disconnect] is set)
and rescheduled.

The various heartbeat-related parameters allow you to tune the following
tradeoffs:

- The longer the heartbeat period, the longer a `down` Client's workload will
  take to be rescheduled.
- The shorter the heartbeat period, the more likely transient network issues,
  leader elections, and other temporary issues could cause a perfectly
  functional Client and its workloads to be marked as `down` and the work
  rescheduled.

While Nomad Clients can connect to any Server, all heartbeats are forwarded to
the leader for processing. Since this heartbeat processing consumes resources,
Nomad adjusts the rate at which Clients heartbeat based on cluster size. The
goal is to try to keep the resource cost of processing heartbeats constant
regardless of cluster size.

The base formula for determining how often a Client must heartbeat is:

```
<number of Clients> / <max_heartbeats_per_second>
```

Other factors modify this base TTL:

- A random factor up to `2x` is added to the base TTL to prevent the
  [thundering herd][herd] problem where a large number of clients attempt to
  heartbeat at exactly the same time.
- [`min_heartbeat_ttl`](#min_heartbeat_ttl) is used as the lower bound to
  prevent small clusters from sending excessive heartbeats.
- [`heartbeat_grace`](#heartbeat_grace) is the amount of _extra_ time the
  leader will wait for a heartbeat beyond the base heartbeat.
- After a leader election, all Clients are given up to `failover_heartbeat_ttl`
  to successfully heartbeat. This gives Clients time to discover a functioning
  Server in case they were directly connected to a leader that crashed.

For example, given the default values for heartbeat parameters, different-sized
clusters will use the following TTLs for the heartbeats. Note that the `Server TTL`
simply adds the `heartbeat_grace` parameter to the TTL Clients are given.

| Clients | Client TTL  | Server TTL  | Safe after elections |
| ------- | ----------- | ----------- | -------------------- |
| 10      | 10s - 20s   | 20s - 30s   | yes                  |
| 100     | 10s - 20s   | 20s - 30s   | yes                  |
| 1000    | 20s - 40s   | 30s - 50s   | yes                  |
| 5000    | 100s - 200s | 110s - 210s | yes                  |
| 10000   | 200s - 400s | 210s - 410s | NO (see below)       |

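As a worked example of how one row of this table is derived: with the default
`max_heartbeats_per_second` of 50, `min_heartbeat_ttl` of "10s", and
`heartbeat_grace` of "10s", a cluster of 1000 Clients yields:

```
base TTL   = 1000 / 50 = 20s
Client TTL = base TTL randomized up to 2x      -> 20s - 40s
Server TTL = Client TTL + 10s heartbeat_grace  -> 30s - 50s
```
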
Regardless of size, all clients will have a Server TTL of
`failover_heartbeat_ttl` after a leader election. It should always be larger
than the maximum Client TTL for your cluster size in order to prevent marking
live Clients as `down`.

For clusters over 5000 Clients, you should increase `failover_heartbeat_ttl`
using the following formula:

```
(2 * (<number of Clients> / <max_heartbeats_per_second>)) + (10 * <min_heartbeat_ttl>)

# For example, with 6000 Clients:
(2 * (6000 / 50)) + (10 * 10) = 340s (5m40s)
```

This ensures Clients have some additional time to fail over even if they were
told to heartbeat after the maximum interval.

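For instance, a minimal sketch applying that formula to the 6000-Client example
above might look like the following. The values are illustrative only, not a
sizing recommendation:

```hcl
server {
  enabled = true

  # 2 * (6000 / 50) + 10 * 10s = 340s
  failover_heartbeat_ttl = "340s"

  # Defaults shown here for context.
  min_heartbeat_ttl         = "10s"
  heartbeat_grace           = "10s"
  max_heartbeats_per_second = 50.0
}
```
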
The actual value used should take into consideration how much tolerance your
system has for a delay in noticing crashed Clients. For example, a
`failover_heartbeat_ttl` of 30 minutes may give even the slowest clients in the
largest clusters ample time to heartbeat after an election. However, if the
election was due to a datacenter-wide failure affecting Clients, it will be 30
minutes before Nomad recognizes that they are `down` and reschedules their
work.

[encryption]: https://learn.hashicorp.com/tutorials/nomad/security-gossip-encryption 'Nomad Encryption Overview'
[server-join]: /docs/configuration/server_join 'Server Join'
[update-scheduler-config]: /api-docs/operator/scheduler#update-scheduler-configuration 'Scheduler Config'
[bootstrapping a cluster]: /docs/faq#bootstrapping
[rfc4648]: https://tools.ietf.org/html/rfc4648#section-5
[monitoring_nomad_progress]: /docs/operations/monitoring-nomad#progress
[`nomad operator keygen`]: /docs/commands/operator/keygen
[search]: /docs/configuration/search
[encryption key]: /docs/operations/key-management
[max_client_disconnect]: /docs/job-specification/group#max-client-disconnect
[herd]: https://en.wikipedia.org/wiki/Thundering_herd_problem