github.com/ghodss/etcd@v0.3.1-0.20140417172404-cc329bfa55cb/Documentation/design/standbys.md

## Standbys

Adding peers to an etcd cluster adds network, CPU, and disk overhead to the leader, since each peer requires replication.
Peers primarily provide resiliency in the event of a leader failure, but the benefit of additional failover nodes decreases as the cluster size increases.
A lightweight alternative is the standby.

A standby is an etcd node that forwards requests to the cluster but is not itself part of the Raft cluster.
This provides an easier API for local applications while avoiding the overhead that a regular peer node requires.
Standbys also serve as spares that can be promoted when a peer in the cluster has not recovered after a long duration.

## Configuration Parameters

Standbys require two additional configuration parameters: active size and promotion delay.
The active size specifies the target number of peers in the cluster.
If there are fewer peers than the active size, standbys are promoted to peers until the peer count equals the active size.
If there are more peers than the active size, peers are demoted to standbys.

The promotion delay specifies how long the cluster waits before removing a dead peer and promoting a standby.
By default this is 30 minutes.
With the default, a peer that is inactive for 30 minutes is removed and a live standby is found to take its place.

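As a rough illustration, the two parameters could be modeled as below. This is only a sketch: `StandbyConfig`, its field names, and `Adjustment` are hypothetical and are not etcd's actual configuration keys, and the active size of 5 is an arbitrary example value. Only the 30-minute default comes from the text above.

```go
package main

import (
	"fmt"
	"time"
)

// DefaultPromotionDelay is the 30-minute default stated in this document.
const DefaultPromotionDelay = 30 * time.Minute

// StandbyConfig is a hypothetical container for the two parameters.
// The names are illustrative and are not etcd's real configuration keys.
type StandbyConfig struct {
	ActiveSize     int           // target number of peers in the cluster
	PromotionDelay time.Duration // inactivity allowed before a peer is replaced
}

// Adjustment reports how many standbys should be promoted (positive result)
// or how many peers should be demoted (negative result) to reach ActiveSize.
func (c StandbyConfig) Adjustment(peerCount int) int {
	return c.ActiveSize - peerCount
}

func main() {
	// ActiveSize 5 is an example value chosen for this sketch.
	cfg := StandbyConfig{ActiveSize: 5, PromotionDelay: DefaultPromotionDelay}
	fmt.Println(cfg.Adjustment(3)) // 2: promote two standbys
	fmt.Println(cfg.Adjustment(7)) // -2: demote two peers
	fmt.Println(cfg.PromotionDelay)
}
```
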
## Logical Workflow

Start an etcd machine and join the cluster:

```
If peer count less than active size:
  If machine already exists as a standby:
    Remove machine from standby list
  Join as peer

If peer count greater than or equal to active size:
  Join as standby
```

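The join workflow above can be sketched in Go against a simple in-memory model. The `Cluster` type and `Join` method are assumptions made for illustration; they are not etcd's real data structures or API.

```go
package main

import "fmt"

// Cluster is a hypothetical in-memory model of the membership lists.
type Cluster struct {
	ActiveSize int
	Peers      map[string]bool
	Standbys   map[string]bool
}

// Join applies the join workflow and returns the role the machine receives.
func (c *Cluster) Join(name string) string {
	if len(c.Peers) < c.ActiveSize {
		// A machine that was a standby leaves the standby list first.
		// delete is a no-op when the machine is not in the list.
		delete(c.Standbys, name)
		c.Peers[name] = true
		return "peer"
	}
	// Peer count is greater than or equal to the active size.
	c.Standbys[name] = true
	return "standby"
}

func main() {
	c := &Cluster{ActiveSize: 2, Peers: map[string]bool{}, Standbys: map[string]bool{}}
	for _, name := range []string{"a", "b", "c"} {
		fmt.Println(name, c.Join(name))
	}
}
```

With an active size of 2, the first two machines become peers and the third becomes a standby.
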
Remove an existing etcd machine from the cluster:

```
If machine exists in peer list:
  Remove from peer list

If machine exists in standby list:
  Remove from standby list
```

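Removal is simpler. A minimal sketch, again using a hypothetical `Cluster` model rather than etcd's real types:

```go
package main

import "fmt"

// Cluster is a hypothetical in-memory model of the two membership lists.
type Cluster struct {
	Peers    map[string]bool
	Standbys map[string]bool
}

// Remove deletes the machine from whichever list contains it and reports
// whether it was a member at all.
func (c *Cluster) Remove(name string) bool {
	found := c.Peers[name] || c.Standbys[name]
	delete(c.Peers, name)    // no-op if the machine is not a peer
	delete(c.Standbys, name) // no-op if the machine is not a standby
	return found
}

func main() {
	c := &Cluster{
		Peers:    map[string]bool{"a": true},
		Standbys: map[string]bool{"b": true},
	}
	fmt.Println(c.Remove("a"), c.Remove("b"), c.Remove("x"))
	fmt.Println(len(c.Peers), len(c.Standbys))
}
```
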
Leader's active size monitor:

```
Loop:
  Sleep 5 seconds

  If peer count less than active size:
    If standby count greater than zero:
      Request a random standby to rejoin
    Goto Loop

  If peer count greater than active size:
    Demote randomly selected peer
    Goto Loop
```

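One way to sketch the active size monitor is to separate a single iteration (`Step`) from the sleep loop (`Monitor`). All names here are hypothetical. The sketch also simplifies "request a random standby to rejoin" into moving the standby directly into the peer list, which is an assumption about the end result rather than a description of the actual protocol.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Cluster is a hypothetical in-memory model of the membership lists.
type Cluster struct {
	ActiveSize int
	Peers      []string
	Standbys   []string
}

// takeRandom removes and returns a randomly chosen element of *list.
// The list must not be empty.
func takeRandom(list *[]string) string {
	i := rand.Intn(len(*list))
	name := (*list)[i]
	*list = append((*list)[:i], (*list)[i+1:]...)
	return name
}

// Step runs one iteration of the monitor and reports whether membership changed.
func (c *Cluster) Step() bool {
	switch {
	case len(c.Peers) < c.ActiveSize:
		if len(c.Standbys) == 0 {
			return false // nothing available to promote
		}
		c.Peers = append(c.Peers, takeRandom(&c.Standbys))
		return true
	case len(c.Peers) > c.ActiveSize:
		c.Standbys = append(c.Standbys, takeRandom(&c.Peers))
		return true
	}
	return false
}

// Monitor calls Step once per interval until stop is closed.
// The workflow above uses an interval of 5 seconds.
func (c *Cluster) Monitor(interval time.Duration, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		case <-time.After(interval):
			c.Step()
		}
	}
}

func main() {
	c := &Cluster{ActiveSize: 3, Peers: []string{"a"}, Standbys: []string{"b", "c", "d"}}
	for c.Step() {
	}
	fmt.Println(len(c.Peers), len(c.Standbys)) // 3 1
}
```

Keeping `Step` free of sleeping makes the promotion and demotion rules easy to exercise without waiting on a timer. `Monitor` is not safe for concurrent use with other writers; the sketch assumes only the leader mutates the lists.
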
Leader's peer activity monitor:

```
Loop:
  Sleep 5 seconds

  For each peer:
    If time since peer's last activity is greater than promotion delay:
      Demote peer
      Goto Loop
```
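
The peer activity monitor can be sketched the same way. The types, the `Step` method, and the `exampleCluster` helper are hypothetical. As in the workflow above, at most one peer is demoted per iteration, and which inactive peer is chosen is arbitrary because Go map iteration order is unspecified.

```go
package main

import (
	"fmt"
	"time"
)

// Cluster is a hypothetical in-memory model used only for this sketch.
type Cluster struct {
	PromotionDelay time.Duration
	LastActive     map[string]time.Time // last activity time of each peer
	Standbys       []string
}

// Step demotes at most one peer whose inactivity exceeds PromotionDelay,
// mirroring the "Goto Loop" that follows the first demotion.
func (c *Cluster) Step(now time.Time) (string, bool) {
	for name, last := range c.LastActive {
		if now.Sub(last) > c.PromotionDelay {
			delete(c.LastActive, name) // no longer a peer
			c.Standbys = append(c.Standbys, name)
			return name, true
		}
	}
	return "", false
}

// exampleCluster builds a two-peer cluster in which only "b" is inactive
// for longer than the 30-minute delay.
func exampleCluster() (*Cluster, time.Time) {
	now := time.Now()
	return &Cluster{
		PromotionDelay: 30 * time.Minute,
		LastActive: map[string]time.Time{
			"a": now.Add(-1 * time.Minute),
			"b": now.Add(-45 * time.Minute),
		},
	}, now
}

func main() {
	c, now := exampleCluster()
	name, demoted := c.Step(now)
	fmt.Println(name, demoted) // b true
	_, demoted = c.Step(now)
	fmt.Println(demoted) // false
}
```

In a running leader, `Step` would be called once every 5 seconds with the current time.
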