github.com/bytedance/gopkg@v0.0.0-20240514070511-01b2cbcf35e1/cloud/circuitbreaker/README.MD (about)

# circuitbreaker

## A brief introduction to circuit breaker
### What circuit breaker does
When making RPC calls, downstream services will inevitably fail at some point.

If the upstream keeps calling a downstream service that has failed, it hinders the downstream's recovery and wastes the upstream's resources.

One way to solve this is to set up dynamic switches and manually shut off calls to the downstream when it fails.

A better solution, however, is to use a circuit breaker to automate this.

Here is a more detailed [introduction to circuit breaker](https://msdn.microsoft.com/zh-cn/library/dn589784.aspx).

One of the better-known circuit breakers is Hystrix, and here is its [design document](https://github.com/Netflix/Hystrix/wiki).

### Circuit breaker ideas
The idea of a circuit breaker is simple: restrict access to the downstream based on the success or failure of RPCs.

A circuit breaker usually has three states: CLOSED, OPEN, and HALFOPEN.

When RPCs are behaving normally, the circuit breaker is CLOSED.

When the number of RPC failures increases, the circuit breaker is tripped and goes to OPEN.

After a certain cooling time in OPEN, the circuit breaker becomes HALFOPEN.

In HALFOPEN, the circuit breaker lets a limited number of probe requests through to the downstream, and then decides whether to become CLOSED or go back to OPEN according to the results.

In summary, the three state transitions are roughly as follows:

<pre>
 [CLOSED] -->- tripped ----> [OPEN]&lt;-------+
    ^                          |           ^
    |                          v           |
    +                          |      detect fail
    |                          |           |
    |                    cooling timeout   |
    ^                          |           ^
    |                          v           |
    +--- detect succeed --&lt;-[HALFOPEN]-->--+
</pre>
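
The transitions in the diagram can also be written down as a small pure function. The following is a minimal sketch, not the package's implementation: the `State` and `Event` types and the `Next` function are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// State is a hypothetical enum mirroring the three states in the diagram.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// Event is a hypothetical name for each labelled edge in the diagram.
type Event int

const (
	Tripped        Event = iota // the trip strategy fired while CLOSED
	CoolingTimeout              // the cooling period elapsed while OPEN
	DetectSucceed               // enough probes succeeded while HALFOPEN
	DetectFail                  // a probe failed while HALFOPEN
)

func (s State) String() string {
	return [...]string{"CLOSED", "OPEN", "HALFOPEN"}[s]
}

// Next returns the state reached from s on event e. Pairs that have no edge
// in the diagram leave the state unchanged.
func Next(s State, e Event) State {
	switch {
	case s == Closed && e == Tripped:
		return Open
	case s == Open && e == CoolingTimeout:
		return HalfOpen
	case s == HalfOpen && e == DetectSucceed:
		return Closed
	case s == HalfOpen && e == DetectFail:
		return Open
	}
	return s
}

func main() {
	s := Closed
	events := []Event{Tripped, CoolingTimeout, DetectFail, CoolingTimeout, DetectSucceed}
	for _, e := range events {
		s = Next(s, e)
		fmt.Println(s)
	}
}
```

Keeping the transition table free of timers and counters makes it easy to check each edge of the diagram in isolation.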

## Use of this package

### Basic usage
This package divides the results of RPC calls into three categories: Succeed, Fail, and Timeout, and maintains a count of all three within a certain time window.

Before each RPC, you should call IsAllowed() to decide whether to initiate the RPC. After the call completes, call Succeed(), Fail(), or Timeout() to report the result.

The package also limits the number of concurrent calls, so you must also call Done() after each RPC.

Here is an example:
<pre>
import (
    "errors"
    "time"
)

// p is declared as Panel, assuming NewPanel returns the Panel interface.
var p Panel

func init() {
    var err error
    // No state-change handler is registered (nil).
    p, err = NewPanel(nil, Options{
        CoolingTimeout: time.Minute,
        DetectTimeout:  time.Minute,
        ShouldTrip:     ThresholdTripFunc(100),
    })
    if err != nil {
        panic(err)
    }
}

func DoRPC() error {
    key := "remote::rpc::method"
    if !p.IsAllowed(key) {
        return errors.New("not allowed by circuitbreaker")
    }

    // doRPC, IsFailErr and IsTimeout are placeholders for your own code.
    err := doRPC()
    if err == nil {
        p.Succeed(key)
    } else if IsFailErr(err) {
        p.Fail(key)
    } else if IsTimeout(err) {
        p.Timeout(key)
    }
    return err
}

func main() {
    ...
    for ... {
        DoRPC()
    }
    p.Close()
}
</pre>

### Circuit breaker trigger strategies
This package provides three basic circuit breaker triggering strategies:
+ Number of consecutive failures reaches threshold (ConsecutiveTripFunc)
+ Failure count reaches threshold (ThresholdTripFunc)
+ Failure rate reaches threshold (RateTripFunc)

You can also write your own triggering strategy by implementing a TripFunc.

The circuit breaker calls the TripFunc on every Fail or Timeout to decide whether to trip.
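
To make the three strategies concrete, here is a minimal sketch of what such predicates might look like. It does not use the package's own types: `Counts`, `tripFunc`, `consecutiveTrip`, `countTrip`, and `rateTrip` are hypothetical names, and the minimum-sample guard in `rateTrip` is an assumption added so that a single early failure does not count as a 100% failure rate.

```go
package main

import "fmt"

// Counts is a hypothetical snapshot of the window statistics. The package's
// real TripFunc receives its own metrics type instead.
type Counts struct {
	Successes   int64
	Failures    int64 // failures plus timeouts in the window
	Consecutive int64 // failures in a row since the last success
}

// tripFunc is a hypothetical stand-in for the package's TripFunc.
type tripFunc func(Counts) bool

// consecutiveTrip trips once the run of consecutive failures reaches n.
func consecutiveTrip(n int64) tripFunc {
	return func(c Counts) bool { return c.Consecutive >= n }
}

// countTrip trips once the failure count in the window reaches n.
func countTrip(n int64) tripFunc {
	return func(c Counts) bool { return c.Failures >= n }
}

// rateTrip trips once the failure rate reaches rate, but only after at
// least minSamples calls have been observed (an assumption of this sketch).
func rateTrip(rate float64, minSamples int64) tripFunc {
	return func(c Counts) bool {
		total := c.Successes + c.Failures
		if total == 0 || total < minSamples {
			return false
		}
		return float64(c.Failures)/float64(total) >= rate
	}
}

func main() {
	c := Counts{Successes: 90, Failures: 10, Consecutive: 3}
	fmt.Println(consecutiveTrip(5)(c)) // 3 consecutive failures, threshold 5
	fmt.Println(countTrip(10)(c))      // 10 failures, threshold 10
	fmt.Println(rateTrip(0.5, 20)(c))  // 10% failure rate, threshold 50%
}
```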

### Circuit breaker cooling strategy
After entering the OPEN state, the circuit breaker cools down for a period of time. The default is 10 seconds, and it is configurable (CoolingTimeout).

During this period, every IsAllowed() call returns false.

After the cooling period, the circuit breaker enters HALFOPEN.

### Half-open strategy
During HALFOPEN, the circuit breaker lets one request through every "interval". After a "number" of consecutive successful requests, the circuit breaker becomes CLOSED; if any of them fails, it goes back to OPEN.

This is a gradual process of probing the downstream and opening up access.

Both the "interval" (DetectTimeout) and the "number" (DEFAULT_HALFOPEN_SUCCESSES) are configurable.

### Concurrency control
The circuit breaker also performs concurrency control, configured with the MaxConcurrency parameter.

IsAllowed returns false when the maximum concurrency is reached.

### Statistics
##### Default parameters
The circuit breaker counts successes, failures, and timeouts within a time window. The default window size is 10 seconds.

The time window is set by two parameters, but usually you don't need to change them.

##### Statistics method
The time window is divided into buckets, and each bucket records data for a fixed period of time.

For example, to count 10 seconds of data, you can divide the 10-second period into 100 buckets, each of which counts 100ms of data.

BucketTime and BucketNums in Options correspond to the time period covered by each bucket and the number of buckets, respectively.

If BucketTime is set to 100ms and BucketNums is set to 100, the time window is 10 seconds.

##### Jitter
As time moves forward, the oldest bucket in the window expires, and each time that happens, jitter occurs.

As an example:
+ You divide 10 seconds into 10 buckets: bucket 0 corresponds to [0s, 1s), bucket 1 corresponds to [1s, 2s), ..., and bucket 9 corresponds to [9s, 10s).
+ At 10.1s, a Succ is executed, and the following happens inside the circuit breaker:
+ (1) Bucket 0 is detected as expired and is discarded; (2) a new bucket 10 is created, corresponding to [10s, 11s); (3) the Succ is placed in bucket 10.
+ At 10.2s, you call Successes() to query the number of successes in the window. You get the count for [1s, 10.2s) instead of [0.2s, 10.2s).

With bucket-based counting, such jitter is unavoidable. A compromise is to increase the number of buckets to reduce its impact.

If the window is divided into 2000 buckets, the impact of jitter on the overall data is at most 1/2000.

In this package, the default number of buckets is 100 and the bucket time is 100ms, giving an overall window of 10 seconds.

Several technical solutions were tried to avoid this problem, but they all introduced more problems. If you have a good idea, please open an issue or PR.