---
layout: docs
page_title: Nomad Reference Architecture
sidebar_title: Reference Architecture
description: |-
  This document provides recommended practices and a reference
  architecture for HashiCorp Nomad production deployments.
ea_version: 0.9
---

# Nomad Reference Architecture

This document provides recommended practices and a reference architecture for HashiCorp Nomad production deployments. This reference architecture conveys a general architecture that should be adapted to accommodate the specific needs of each implementation.

The following topics are addressed:

- [Reference Architecture](#ra)
- [Deployment Topology within a Single Region](#one-region)
- [Deployment Topology across Multiple Regions](#multi-region)
- [Network Connectivity Details](#net)
- [Deployment System Requirements](#system-reqs)
- [High Availability](#high-availability)
- [Failure Scenarios](#failure-scenarios)

This document describes deploying a Nomad cluster in combination with, or with access to, a [Consul cluster](/docs/integrations/consul-integration). We recommend the use of Consul with Nomad to provide automatic clustering, service discovery, health checking and dynamic configuration.

## Reference Architecture ((#ra))

A Nomad cluster typically comprises three or five servers (but no more than seven) and a number of client agents. Nomad differs slightly from Consul in that it divides infrastructure into regions: each region is served by one Nomad server cluster but can manage multiple datacenters or availability zones. For example, a _US Region_ can include datacenters _us-east-1_ and _us-west-2_.
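
As a rough illustration, the sketch below shows how this split is expressed in an agent's configuration file; the region, datacenter, and path values are hypothetical placeholders.

```hcl
# Hypothetical agent configuration illustrating the region/datacenter split:
# every server and client in the "us" region shares the same region value,
# while each agent sets datacenter to its own datacenter or availability zone.
region     = "us"
datacenter = "us-east-1"

data_dir  = "/opt/nomad/data"
bind_addr = "0.0.0.0"
```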

In a Nomad multi-region architecture, communication happens via [WAN gossip](/docs/internals/gossip). Additionally, Nomad can integrate easily with Consul to provide features such as automatic clustering, service discovery, and dynamic configuration. Thus, we recommend you use Consul alongside Nomad to simplify the deployment.

In cloud environments, a single cluster may be deployed across multiple availability zones. For example, in AWS each Nomad server can be deployed to an associated EC2 instance, and those EC2 instances distributed across multiple AZs. Similarly, Nomad server clusters can be deployed to multiple cloud regions to allow for region-level HA scenarios.

For more information on Nomad server cluster design, see the [cluster requirements documentation](/docs/install/production/requirements).

The design shared in this document is the recommended architecture for production environments, as it provides flexibility and resilience. Nomad utilizes an existing Consul server cluster; however, the deployment design of the Consul server cluster is outside the scope of this document.

Nomad to Consul connectivity is over HTTP and should be secured with TLS as well as a Consul token, so that all traffic is encrypted and authorized. This is done using Nomad's [Automatic Clustering with Consul](https://learn.hashicorp.com/nomad/operating-nomad/clustering).
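
As a reference point, the following is a minimal sketch of the `consul` stanza in a Nomad agent configuration, assuming a local Consul agent listening for HTTPS on port 8501; the file paths and token value are placeholders.

```hcl
consul {
  # Local Consul agent reached over HTTPS.
  address = "127.0.0.1:8501"
  ssl     = true

  # TLS material and ACL token (placeholder values).
  ca_file   = "/etc/certs/consul-ca.pem"
  cert_file = "/etc/certs/nomad.pem"
  key_file  = "/etc/certs/nomad-key.pem"
  token     = "REPLACE_WITH_CONSUL_ACL_TOKEN"

  # Let Consul drive clustering and service registration.
  auto_advertise   = true
  server_auto_join = true
  client_auto_join = true
}
```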

### Deployment Topology within a Single Region ((#one-region))

A single Nomad cluster is recommended for applications deployed in the same region.

Each cluster is expected to have either three or five servers. This strikes a balance between availability in the case of failure and performance, as [Raft](https://raft.github.io/) consensus gets progressively slower as more servers are added.

The time taken by a new server to join an existing large cluster may increase as the size of the cluster increases.
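
For example, a three-server cluster can be bootstrapped with a server stanza along the lines of the sketch below (values are illustrative).

```hcl
server {
  enabled = true

  # Wait for three servers before electing a leader;
  # use 5 for a five-server cluster.
  bootstrap_expect = 3
}
```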

#### Reference Diagram

![Reference diagram](/img/nomad_reference_diagram.png)

### Deployment Topology across Multiple Regions ((#multi-region))

By deploying Nomad server clusters in multiple regions, the user is able to target any region from any Nomad server, even if that server resides in a separate region. However, most data is not replicated between regions as they are fully independent clusters. The exceptions are [ACL tokens and policies][acl], as well as [Sentinel policies in Nomad Enterprise][sentinel], which _are_ replicated between regions.

Nomad server clusters in different regions can be federated using WAN links. The server clusters can be joined to communicate over the WAN on port `4648`. This same port is used for single-region deployments over LAN as well.
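
As an illustration, a second region's servers might be configured as in the sketch below (region names and addresses are hypothetical); the two clusters are then federated by joining a server in one region to a server in the other over port `4648`.

```hcl
# Server agent configuration for a second, independent region.
region     = "eu"
datacenter = "eu-west-1"

server {
  enabled          = true
  bootstrap_expect = 3
}

# Once both clusters are running, federate them from any server, e.g.:
#   nomad server join <server-address-in-us-region>:4648
```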

Additional documentation is available to learn more about [Nomad server federation](https://learn.hashicorp.com/nomad/operating-nomad/federation).

## Network Connectivity Details ((#net))

![Nomad network diagram](/img/nomad_network_arch.png)

Nomad servers are expected to communicate in high bandwidth, low latency network environments, with sub-10-millisecond latency between cluster members. Nomad servers can be spread across cloud regions or datacenters as long as they satisfy these latency requirements.

Nomad client clusters require the ability to receive traffic as noted above in the Network Connectivity Details; however, clients can run on any type of infrastructure (multi-cloud, on-prem, virtual, bare metal, etc.) as long as they are reachable and can receive job requests from the Nomad servers.
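
For reference, the default port assignments can be stated explicitly in the agent configuration; the sketch below simply restates Nomad's defaults.

```hcl
ports {
  http = 4646 # HTTP API and UI
  rpc  = 4647 # client-to-server and server-to-server RPC
  serf = 4648 # gossip between servers (LAN and WAN)
}
```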

Additional documentation is available to learn more about [Nomad networking](/docs/install/production/requirements#network-topology).

## Deployment System Requirements ((#system-reqs))

Nomad server agents are responsible for maintaining the cluster state, responding to RPC queries (read operations), and processing all write operations. Given that Nomad server agents do most of the heavy lifting, server sizing is critical for the overall performance efficiency and health of the Nomad cluster.

### Nomad Servers

| Size  | CPU      | Memory       | Disk   | Typical Cloud Instance Types              |
| ----- | -------- | ------------ | ------ | ----------------------------------------- |
| Small | 2 core   | 8-16 GB RAM  | 50 GB  | **AWS:** m5.large, m5.xlarge              |
|       |          |              |        | **Azure:** Standard_D2_v3, Standard_D4_v3 |
|       |          |              |        | **GCE:** n1-standard-8, n1-standard-16    |
| Large | 4-8 core | 32-64 GB RAM | 100 GB | **AWS:** m5.2xlarge, m5.4xlarge           |
|       |          |              |        | **Azure:** Standard_D4_v3, Standard_D8_v3 |
|       |          |              |        | **GCE:** n1-standard-16, n1-standard-32   |

#### Hardware Sizing Considerations

- The small size would be appropriate for most initial production
  deployments, or for development/testing environments.

- The large size is for production environments where there is a
  consistently high workload.

~> **NOTE** For large workloads, ensure that the disks support a high number of IOPS to keep up with the rapid Raft log update rate.

Nomad clients can be set up for specialized workloads as well. For example, if workloads require GPU processing, a Nomad datacenter can be created to serve those GPU-specific jobs and joined to a Nomad server cluster. For more information on specialized workloads, see the documentation on [job constraints](/docs/job-specification/constraint) to target specific client nodes.
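
As a sketch, the hypothetical job below targets a GPU datacenter and constrains placement to client nodes with a matching node class (all names are placeholders).

```hcl
job "train-model" {
  region      = "us"
  datacenters = ["us-gpu-1"] # hypothetical GPU-only Nomad datacenter

  # Only place this job on clients whose node_class is set to "gpu".
  constraint {
    attribute = "${node.class}"
    value     = "gpu"
  }

  group "trainer" {
    task "train" {
      driver = "docker"

      config {
        image = "example/trainer:latest"
      }
    }
  }
}
```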

## High Availability

A Nomad server cluster is the highly-available unit of deployment within a single datacenter. A recommended approach is to deploy a three or five node Nomad server cluster. With this configuration, during a Nomad server outage, failover is handled immediately without human intervention.

When setting up high availability across regions, multiple Nomad server clusters are deployed and connected via WAN gossip. Nomad clusters in regions are fully independent from each other and do not share jobs, clients, or state. Data residing in a single region-specific cluster is not replicated to other clusters in other regions.

## Failure Scenarios

Typical distribution in a cloud environment is to spread Nomad server nodes into separate Availability Zones (AZs) within a high bandwidth, low latency network, such as an AWS Region. The diagram below shows Nomad servers deployed in multiple AZs promoting a single voting member per AZ and providing both AZ-level and node-level failure protection.

![Nomad fault tolerance](/img/nomad_fault_tolerance.png)

Additional documentation is available to learn more about [cluster sizing and failure tolerances](/docs/internals/consensus#deployment-table) as well as [outage recovery](https://learn.hashicorp.com/nomad/operating-nomad/outage).

### Availability Zone Failure

In the event of a single AZ failure, only a single Nomad server will be affected, which would not impact job scheduling as long as there is still a Raft quorum (i.e., 2 available servers in a 3 server cluster, 3 available servers in a 5 server cluster, etc.). There are two scenarios that could occur should an AZ fail in a multiple AZ setup: leader loss or follower loss.

#### Leader Server Loss

If the AZ containing the Nomad leader server fails, the remaining quorum members would elect a new leader. The new leader then begins to accept new log entries and replicates these entries to the remaining followers.

#### Follower Server Loss

If the AZ containing a Nomad follower server fails, there is no immediate impact to the Nomad leader server or cluster operations. However, there still must be a Raft quorum in order to properly manage a future failure of the Nomad leader server.

### Region Failure

In the event of a region-level failure (which would contain an entire Nomad server cluster), clients will still be able to submit jobs to another region that is properly federated. However, there will likely be data loss as Nomad server clusters do not replicate their data to other region clusters. See [Multi-region Federation](https://learn.hashicorp.com/nomad/operating-nomad/federation) for more setup information.

## Next Steps

- Read [Deployment Guide](/docs/install/production/deployment-guide) to learn
  the steps required to install and configure a single HashiCorp Nomad cluster.

[acl]: https://learn.hashicorp.com/nomad?track=acls#operations-and-development
[sentinel]: https://learn.hashicorp.com/nomad/governance-and-policy/sentinel