github.com/kardianos/nomad@v0.1.3-0.20151022182107-b13df73ee850/website/source/docs/agent/index.html.md (about)

---
layout: "docs"
page_title: "Nomad Agent"
sidebar_current: "docs-agent-basics"
description: |-
  The Nomad agent is a long-running process which can be used in either client or server mode.
---

# Nomad Agent

The Nomad agent is a long-running process which runs on every machine that
is part of the Nomad cluster. The behavior of the agent depends on whether it is
running in client or server mode. Clients are responsible for running tasks,
while servers are responsible for managing the cluster.

Client mode agents are relatively simple. They make use of fingerprinting
to determine the capabilities and resources of the host machine, as well as
determining which [drivers](/docs/drivers/index.html) are available. Clients
register with servers to provide node information, heartbeat to provide
liveness, and run any tasks assigned to them.

Servers take on the responsibility of participating in the
[consensus protocol](/docs/internals/consensus.html) and [gossip protocol](/docs/internals/gossip.html).
The consensus protocol, powered by Raft, allows the servers to perform
leader election and state replication. The gossip protocol allows for simple
clustering of servers and multi-region federation. Because server nodes carry
this higher burden and are more resource intensive than client nodes, they
should usually be run on dedicated instances.

Client nodes make up the majority of the cluster. They are very lightweight,
as they interface with the server nodes and maintain very little state of their own.
Each cluster usually has 3 or 5 server mode agents and potentially thousands of clients.

## Running an Agent

The agent is started with the [`nomad agent` command](/docs/commands/agent.html). This
command blocks, running forever or until told to quit. The agent command takes a variety
of configuration options, but most have sane defaults.

When running `nomad agent`, you should see output similar to this:

```text
$ nomad agent -dev
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: (Infrastructure: 'armon/test' Join: false)
                Client: true
             Log Level: DEBUG
                Region: global (DC: dc1)
                Server: true

==> Nomad agent started! Log data will stream in below:

    [INFO] serf: EventMemberJoin: Armons-MacBook-Air.local.global 127.0.0.1
    [INFO] nomad: starting 4 scheduling worker(s) for [service batch _core]
...
```

There are several important messages that `nomad agent` outputs:

* **Atlas**: This shows the [Atlas infrastructure](https://atlas.hashicorp.com)
  with which the node is registered, if any. It also indicates if auto-join is enabled.
  The Atlas infrastructure is set using [`-atlas`](/docs/agent/config.html#_atlas)
  and auto-join is enabled by setting [`-atlas-join`](/docs/agent/config.html#_atlas_join).

* **Client**: This indicates whether the agent has enabled client mode.
  Client nodes fingerprint their host environment, register with servers,
  and run tasks.

* **Log Level**: This indicates the configured log level. Only messages with
  an equal or higher severity will be logged. This can be tuned to increase
  verbosity for debugging, or reduced to avoid noisy logging.

* **Region**: This is the region and datacenter in which the agent is configured to run.
  Nomad has first-class support for multi-datacenter and multi-region configurations.
  The [`-region` and `-dc`](/docs/agent/config.html#_region) flags can be used to set
  the region and datacenter. The default is the `global` region in `dc1`.

* **Server**: This indicates whether the agent has enabled server mode.
  Server nodes have the extra burden of participating in the consensus protocol,
  storing cluster state, and making scheduling decisions.

## Stopping an Agent

An agent can be stopped in two ways: gracefully or forcefully. By default,
any signal to an agent (interrupt, terminate, kill) will cause the agent
to stop forcefully. Graceful termination can be enabled by setting
`leave_on_interrupt` or `leave_on_terminate`, which make the agent
respond to the respective signal by leaving cleanly.

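As a sketch, an agent configuration file enabling graceful termination for both signals might look like this (assuming the HCL configuration format; see the [configuration docs](/docs/agent/config.html) for the authoritative syntax):

```hcl
# Hypothetical agent configuration fragment: treat SIGINT and SIGTERM
# as requests to gracefully leave the cluster rather than hard-stop.
leave_on_interrupt = true
leave_on_terminate = true
```
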
When gracefully exiting, clients update their status to terminal on
the servers so that tasks can be migrated to healthy agents. Servers
notify the cluster of their intention to leave, which allows them to
leave the [consensus quorum](/docs/internals/consensus.html).

It is especially important that a server node be allowed to leave gracefully
so that there is minimal impact on availability as the server leaves
the consensus quorum. If a server does not leave gracefully and will not
return to service, the [`server-force-leave` command](/docs/commands/server-force-leave.html)
should be used to eject it from the consensus quorum.

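For example, if a server that will never return to service needs to be ejected, an operator might run something like the following (the node name here is hypothetical; the command takes the server's serf name):

```text
$ nomad server-force-leave nomad-server-03.global
```
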
## Lifecycle

Every agent in the Nomad cluster goes through a lifecycle. Understanding
this lifecycle is useful for building a mental model of an agent's interactions
with a cluster and how the cluster treats a node.

When a client agent is first started, it fingerprints the host machine to
identify its attributes, capabilities, and [task drivers](/docs/drivers/index.html).
These are reported to the servers during an initial registration. The addresses
of known servers are provided to the agent via configuration, potentially using
DNS for resolution. Using [Consul](https://consul.io) provides a way to avoid
hard-coding addresses and instead resolve them on demand.

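A minimal sketch of a client agent configuration listing known servers might look like this (the addresses are hypothetical, and in practice could be DNS names resolved on demand rather than hard-coded IPs):

```hcl
# Hypothetical client configuration fragment.
client {
  enabled = true

  # Addresses of known servers (RPC port). These could also be
  # DNS names, or be discovered via Consul instead.
  servers = ["10.0.0.10:4647", "10.0.0.11:4647"]
}
```
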
While a client is running, it heartbeats to the servers to
maintain liveness. If the heartbeats fail, the servers assume the client node
has failed, stop assigning it new tasks, and migrate its existing tasks.
It is impossible to distinguish between a network failure and an agent crash,
so both cases are handled the same. Once the network recovers or a crashed agent
restarts, the node status is updated and normal operation resumes.

To prevent an accumulation of nodes in a terminal state, Nomad performs periodic
garbage collection of nodes. By default, if a node is in a failed or 'down'
state for over 24 hours, it is garbage collected from the system.

Servers are slightly more complex, as they perform additional functions. They
participate in a [gossip protocol](/docs/internals/gossip.html) both to cluster
within a region and to support multi-region configurations. When a server is
first started, it does not know the addresses of other servers in the cluster.
To discover its peers, it must _join_ the cluster. This is done with the
[`server-join` command](/docs/commands/server-join.html) or by providing the
proper configuration on start. Once a node joins, this information is gossiped
to the entire cluster, meaning all nodes will eventually be aware of each other.

When a server _leaves_, it specifies its intent to do so, and the cluster marks that
node as having _left_. If the server has _left_, replication to it stops and it
is removed as a member of the consensus quorum. If the server has _failed_, replication
attempts to make progress to recover from what may be a software or network failure.
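
For example, a newly started server might join an existing peer by address (the address is hypothetical, and the port assumes the default serf port):

```text
$ nomad server-join 10.0.0.10:4648
```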