
---
layout: post
title:  Building Resilient and Fault Tolerant Applications with Micro
date:   2016-05-15 00:00:00
---
<br>
It's been a little while since the last blog post but we've been hard at work on Micro and it's definitely starting
to pay off. Let's dive into it all now!

If you want to read up on the [**Micro**](https://github.com/tickoalcantara12/micro) toolkit first, check out the previous blog post
[here]({{ site.baseurl }}/2016/03/20/micro.html) or if you would like to learn more about the concept of microservices look [here]({{ site.baseurl }}/2016/03/17/introduction.html).

It's no secret that building distributed systems can be challenging. While we've solved a lot of problems as an industry along
the way, we still go through cycles of rebuilding many of the building blocks, whether it's the move to the next level of
abstraction, from virtual machines to containers, the adoption of new languages, the leveraging of cloud based services or
the coming shift to microservices. There's always something that seems to require us to relearn
how to build performant and fault tolerant systems for the next wave of technology.

It's a never ending battle between iteration and innovation but we need to do something to help alleviate a lot
of the pains as the shift to Cloud, Containers and Microservices continues.

### The Motivations

So why are we doing this? Why do we keep rebuilding the building blocks and why do we keep attempting to solve the same
scale, fault tolerance and distributed systems problems?

The terms that come to mind are <i>"bigger, stronger, faster"</i>, or perhaps even <i>"speed, scale, agility"</i>. You'll
hear these a lot from C-level executives but the key takeaway is really that there's always a need for us to build more performant
and resilient systems.

In the early days of the internet, there were only thousands or maybe even hundreds of thousands of people coming online. Over time
we saw that accelerate and we're now in the order of billions. Billions of people and billions of devices.
We've had to learn how to build systems for this.

For the older generation, you may remember the [C10K problem](http://www.kegel.com/c10k.html). I'm not sure where we are with this now
but I think we're talking about solving the issue of millions of concurrent connections if not more. The biggest technology players in the world
really solved this a decade ago and have patterns for building systems at scale but the rest of us are still learning.

The likes of Amazon, Google and Microsoft now provide us with Cloud Computing platforms to leverage significant scale but we're
still trying to figure out how to write applications that can effectively leverage it. You're hearing the terms container
orchestration, microservices and cloud native a lot these days. The work is underway on a multitude of levels and it's going
to be a while before we as an industry have really nailed down the patterns and solutions needed moving forward.

A lot of companies are now helping with the question of, "how do I run my applications in a scalable and fault tolerant manner?", but
there are still very few helping with the more important question...

How do I actually <i>write</i> applications in a scalable and fault tolerant manner?

Micro looks to address these problems by focusing on the key software development requirements for microservices. We'll run through
some of what can help you build resilient and fault tolerant applications now, starting with the client side.

### The Client

The client is a building block for making requests in go-micro. If you've built microservices or SOA architectures before then you'll know that
a significant portion of time and execution is spent on calling other services for relevant information.

Whereas in a monolithic application the focus is mainly on serving content, in a microservices world it's more about retrieving or publishing content.

Here's a cut down version of the go-micro client interface with the three most important methods: Call, Publish and Stream.

```
type Client interface {
	Call(ctx context.Context, req Request, rsp interface{}, opts ...CallOption) error
	Publish(ctx context.Context, p Publication, opts ...PublishOption) error
	Stream(ctx context.Context, req Request, opts ...CallOption) (Streamer, error)
}

type Request interface {
	Service() string
	Method() string
	ContentType() string
	Request() interface{}
	Stream() bool
}
```

Call and Stream are used to make synchronous requests. Call returns a single result whereas Stream is a bidirectional streaming connection maintained
with another service, over which messages can be sent back and forth. Publish is used to publish asynchronous messages via the broker but we're
not going to discuss that today.
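
To make this concrete, here's a minimal sketch of a synchronous request using the client package. The service name, endpoint and request/response types are hypothetical placeholders, so substitute your own; the package level helpers assume the default client.

```
import (
	"log"

	"github.com/micro/go-micro/client"
	"golang.org/x/net/context"
)

// Hypothetical request/response types; substitute your own generated types.
type ExampleRequest struct{ Name string }
type ExampleResponse struct{ Msg string }

func call() {
	// NewRequest takes the service name, the method and the request body.
	req := client.NewRequest("go.micro.srv.example", "Example.Call", &ExampleRequest{Name: "john"})
	rsp := &ExampleResponse{}

	// Call resolves the name via the registry, selects a node and makes the request.
	if err := client.Call(context.Background(), req, rsp); err != nil {
		log.Println("call failed:", err)
	}
}
```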

How the client works behind the scenes was addressed in a couple of previous blog posts which you can find [here]({{ site.baseurl }}/2016/03/20/micro.html) and
[here]({{ site.baseurl }}/2016/04/18/micro-architecture.html). Check those out if you want to learn about the details.

We'll just briefly mention some important internal details.

The client deals with the RPC layer while leveraging the broker, codec, registry, selector and transport packages
for various pieces of functionality. The layered architecture is important as we separate the concerns of each component, reducing the
complexity and providing pluggability.

###### Why Does The Client Matter?

The client essentially abstracts away the details of providing resilient and fault tolerant communication between services. Making a
call to another service seems fairly straightforward but there are all sorts of ways in which it could potentially fail.

Let's start to walk through some of the functionality and how it helps.

#### Service Discovery

In a distributed system, instances of a service could be coming and going for any number of reasons: network partitions, machine failure,
rescheduling, etc. We don't really want to have to care about this.

When making a call to another service, we do it by name and allow the client to use service discovery to resolve the name to a list of
instances with their address and port. Services register with discovery on startup and deregister on shutdown.

<p align="center">
  <img src="{{ site.baseurl }}/blog/images/discovery.png" />
</p>
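
Under the hood that resolution is just a registry lookup. Here's a hedged sketch using the go-micro registry package directly; the service name is a placeholder and the node fields follow the registry API of the time.

```
import (
	"fmt"

	"github.com/micro/go-micro/registry"
)

func lookup() {
	// GetService returns the named service with its nodes grouped by version.
	services, err := registry.GetService("go.micro.srv.example")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}

	// Each node carries the address and port needed to make a request.
	for _, service := range services {
		for _, node := range service.Nodes {
			fmt.Println(service.Version, node.Address, node.Port)
		}
	}
}
```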

As we mentioned though, any number of issues can occur in a distributed system and service discovery is no exception. So we rely on battle
tested distributed service discovery systems such as consul, etcd and zookeeper to store the information about services.

Each of these uses either the Raft or Paxos consensus algorithm, which gives us consistency and partition tolerance from the CAP theorem.
By running a cluster of 3 or 5 nodes, we can tolerate most system failures and get reliable service discovery for the client.

#### Node Selection

So now we can reliably resolve service names to a list of addresses. How do we actually select which one to call? This is where the go-micro Selector
comes into play. It builds on the registry and provides load balancing strategies such as round robin or random hashing while also providing
methods of filtering, caching and blacklisting failed nodes.

Here's a cut down interface.

```
type Selector interface {
	Select(service string, opts ...SelectOption) (Next, error)
	Mark(service string, node *registry.Node, err error)
	Reset(service string)
}

type Next func() (*registry.Node, error)
type Filter func([]*registry.Service) []*registry.Service
type Strategy func([]*registry.Service) Next
```

###### Balancing Strategies

The current strategies are fairly straightforward. When Select is called the Selector will retrieve the service from the Registry
and create a Next function that encapsulates the pool of nodes, using the default strategy or the one passed in as an option if overridden.

The client will call the Next function to retrieve the next node in the list based on the load balancing strategy and make the request.
If the request fails and retries are set above 1, it will go through the same process, retrieving the next node to call.
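
Here's a minimal sketch of that flow using the Selector directly; the service name is a hypothetical placeholder and the constructor assumes the selector package defaults.

```
import (
	"log"

	"github.com/micro/go-micro/selector"
)

func pick() {
	s := selector.NewSelector()

	// Select returns a Next function encapsulating the service's nodes.
	next, err := s.Select("go.micro.srv.example")
	if err != nil {
		log.Fatal(err)
	}

	// Each call to Next yields a node according to the strategy in use.
	node, err := next()
	if err != nil {
		log.Fatal(err)
	}
	log.Println("selected", node.Id, node.Address)
}
```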

There's a variety of strategies that can be used here, such as round robin, random hashing, leastconn, weighted, etc. Load balancing strategies
are essential for distributing requests evenly across services.

###### Selection Caching

While it's great to have a robust service discovery system, it can be inefficient and costly to do a lookup on every request.
If you imagine a large scale system in which every service is doing this, it can be quite easy to overload the discovery system. There may
be cases in which it becomes completely unavailable.

To avoid this we can use caching. Most discovery systems provide a way to listen for updates, normally known as a Watcher. Rather
than polling discovery we wait for events to be sent to us. The go-micro Registry provides a Watch abstraction for this.

We've written a caching selector which maintains an in memory cache of services. On a cache miss it looks up discovery for the info, caches
it and then uses this for subsequent requests. If watch events are received for services we know about then the cache will be updated accordingly.
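
As a rough sketch, consuming the Watch abstraction looks something like this; treat it as illustrative since the exact result fields follow the registry package of the time.

```
import (
	"log"

	"github.com/micro/go-micro/registry"
)

func watch() {
	// Watch returns a stream of registry events rather than requiring polling.
	w, err := registry.Watch()
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Next blocks until an event occurs, e.g. a service registering.
		res, err := w.Next()
		if err != nil {
			log.Fatal(err)
		}
		// A caching selector would update its entry for this service here.
		log.Println(res.Action, res.Service.Name)
	}
}
```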

Firstly, this drastically improves performance by removing the service lookup. It also provides some fault tolerance in the case of
service discovery being down. We are a little paranoid though, and since the cache could go stale because of some failure scenario, nodes are TTLed appropriately.

###### Blacklisting Nodes

Next on the list, blacklisting. Notice the Selector interface has Mark and Reset methods. We can never really guarantee that healthy
nodes are registered with discovery so something needs to be done about it.

Whenever a request is made we'll keep track of the result. If a service instance fails multiple
times we can essentially blacklist the node and filter it out the next time a Select request is made.

A node is blacklisted for a set period of time before being put back in the pool. It's really critical that if a particular node
of a service is failing we remove it from the list so that we can continue to serve successful requests without delay.
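
To illustrate the feedback loop, here's a hedged sketch: after every attempt the result is reported via Mark, which is what drives the blacklisting. The doRequest helper and service name are hypothetical.

```
// s is a selector.Selector, next is the function returned by s.Select.
node, err := next()
if err != nil {
	log.Fatal(err)
}

// doRequest is a hypothetical helper performing the actual RPC.
callErr := doRequest(node)

// Mark feeds the outcome back; repeated failures blacklist the node
// so subsequent Select calls filter it out for a period of time.
s.Mark("go.micro.srv.example", node, callErr)
```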

#### Timeouts & Retries

Adrian Cockcroft has recently started to talk about the missing components of microservice architectures. One of the very
interesting things that came up is how classic timeout and retry strategies can lead to cascading failures. I implore you
to go look at his slides [here](http://www.slideshare.net/adriancockcroft/microservices-whats-missing-oreilly-software-architecture-new-york#24).
I've linked directly to where it starts to cover timeouts and retries. Thanks to Adrian for letting me use the slides.

This slide really summarises the problem quite well.

<p align="center">
  <img src="{{ site.baseurl }}/blog/images/timeouts.png" />
</p>

What Adrian describes above is the common case in which a slow response can lead to a timeout, causing the client to retry.
Since a request is actually a chain of requests downstream, this creates a whole new set of requests through the system
while old work may still be going on. The misconfiguration can result in overloading services in the call chain and creating a failure
scenario that's difficult to recover from.

In a microservices world, we need to rethink the strategy around handling timeouts and retries. Adrian goes on to discuss potential solutions
to this problem, one of which is timeout budgets and retrying against new nodes.

<p align="center">
  <img src="{{ site.baseurl }}/blog/images/good-timeouts.png" />
</p>

On the retries side, we've been doing this in Micro for a while. The number of retries can be configured as an option to the Client.
If a call fails the Client will retrieve a new node and attempt to make the request again.

Timeouts were something we'd been considering more thoughtfully, having started with the classic static timeout setting. It wasn't until
Adrian presented his thoughts that it became clear what the strategy should be.

Budgeted timeouts are now built into Micro. Let's run through how that works.

The first caller sets the timeout, which usually happens at the edge. On every request in the chain the timeout is decreased to account for
the amount of time that has passed. When zero time is left we stop processing any further requests or retries and return up the call stack.
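
Both knobs are exposed as client options. Here's a hedged sketch of setting them per call; the option names assume the client API of the time and may differ across versions.

```
import (
	"time"

	"github.com/micro/go-micro/client"
	"golang.org/x/net/context"
)

func callWithBudget(c client.Client, req client.Request, rsp interface{}) error {
	// Allow up to 3 retries, each against a newly selected node, and set
	// the overall request timeout which is budgeted down the call chain.
	return c.Call(
		context.Background(),
		req,
		rsp,
		client.WithRetries(3),
		client.WithRequestTimeout(time.Second*5),
	)
}
```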

As Adrian mentions, this is a great way to provide dynamic timeout budgeting and remove any unnecessary work occurring downstream.

Further to this, the next step should really be to remove any kind of static timeout. How services respond will differ based on environment,
request load, etc. The timeout should really be a dynamic SLA that changes based on current state, but that's something to be left for another day.

#### What About Connection Pooling?

Connection pooling is an important part of building scalable systems. We've very quickly seen the limitations posed
without it, usually hitting file descriptor limits and port exhaustion.

There's currently a [PR](https://github.com/micro/go-micro/pull/86) in the works to add connection pooling to go-micro. Given the pluggable
nature of Micro, it was important to address this a layer above the [Transport](https://godoc.org/github.com/micro/go-micro/transport#Transport)
so that any implementation, whether it be HTTP, NATS, RabbitMQ, etc, would benefit.

You might be thinking, well, this is implementation specific and some transports may already support it. While this is true,
it's not always guaranteed to work the same way across each transport. By addressing this specific problem a layer up,
we reduce the complexity and needs of the transport itself.

### What Else?

Those are some pretty useful things built into go-micro, but what else?

I'm glad you asked... or well, I assume you're asking... anyway.

#### Service Version Canarying?

We have it! It was actually discussed in a previous blog post on architecture and design patterns for microservices which
you can check out [here]({{ site.baseurl }}/2016/04/18/micro-architecture.html).

Services contain Name and Version as a pair in service discovery. When a service is retrieved from the registry, its nodes
are grouped by version. The selector can then be leveraged to distribute traffic across the nodes of each version using
various load balancing strategies.

<p align="center">
  <img src="{{ site.baseurl }}/blog/images/selector.png" />
</p>
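
Setting the version is a one-liner when declaring the service. As a minimal sketch, the bulk of the fleet runs the stable version while a small canary pool registers with a newer one; the name and versions here are hypothetical.

```
import (
	"log"

	"github.com/micro/go-micro"
)

func main() {
	// Stable nodes register as 1.0.0; a small canary pool would register
	// with e.g. micro.Version("1.0.1") and receive a share of the traffic.
	service := micro.NewService(
		micro.Name("go.micro.srv.example"),
		micro.Version("1.0.0"),
	)
	service.Init()

	if err := service.Run(); err != nil {
		log.Fatal(err)
	}
}
```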

###### Why Is Canarying Important?

This is really quite useful when releasing new versions of a service and ensuring everything is functioning correctly before
rolling out to the entire fleet. The new version can be deployed to a small pool of nodes with the client automatically
distributing a percentage of traffic to the new service. In combination with an orchestration system such as Kubernetes,
you can canary the deployment with confidence and roll back if there are any issues.

#### What About Filtering?

We have it! The selector is very powerful and includes the ability to pass in filters at the time of selection to filter nodes. These can be
passed in as Call Options to the client when making a request. Some existing filters can be found
[here](https://github.com/micro/go-micro/blob/master/selector/filter.go) for metadata, endpoint or version filtering.

###### Why Is Filtering Important?

You might have some functionality that only exists across a set of versions of services. Pinning the request flow between
the services to those particular versions ensures you always hit the right services. This is great where multiple
versions are running in the system at the same time.

The other useful case is where you want to route to services based on locality. By setting a datacenter label on each
service you can apply a filter that will only return local nodes. Filtering based on metadata is pretty powerful and has
much broader applications, which we hope to hear more about from usage in the wild.
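
As a hedged sketch of both cases, filters can be attached per call via a select option; the option and filter names assume the client and selector APIs of the time, and the "dc" label is hypothetical node metadata.

```
import (
	"github.com/micro/go-micro/client"
	"github.com/micro/go-micro/selector"
	"golang.org/x/net/context"
)

func filteredCall(c client.Client, req client.Request, rsp interface{}) error {
	// Pin the call to a specific version and to nodes in our datacenter.
	return c.Call(
		context.Background(),
		req,
		rsp,
		client.WithSelectOption(
			selector.WithFilter(selector.FilterVersion("1.0.1")),
			selector.WithFilter(selector.FilterLabel("dc", "us-east-1")),
		),
	)
}
```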

### The Pluggable Architecture

One of the things that you'll keep hearing over and over is the pluggable nature of Micro. This was something
addressed in the design from day one. It was very important that Micro provide building blocks as opposed to
a complete system. Something that works out of the box but can be enhanced.

###### Why Does Being Pluggable Matter?

Everyone will have different ideas about what it means to build distributed systems and
we really want to provide a way for people to design the solutions they want to use. Not only that but
there are robust battle tested tools out there which we can leverage rather than writing everything from
scratch.

Technology is always evolving, new and better tools appear every day. How do we avoid lock in? A pluggable
architecture means we can use components today and switch them out tomorrow with minimal effort.

#### Plugins

Each of the features of go-micro is created as a Go interface. By doing so and only referencing the interface,
we can actually swap out the underlying implementations with minimal to zero code changes. In most cases it takes
a simple import statement and a flag specified on the command line.

There are a number of plugins in the [go-plugins](https://github.com/micro/go-plugins) repo on GitHub.

While go-micro provides some defaults such as consul for discovery and http for transport, you may want to use
something different within your architecture or even implement your own plugins. We've already had community
contributions with a [Kubernetes](https://github.com/micro/go-plugins/tree/master/registry/kubernetes) registry
plugin and a [Zookeeper](https://github.com/micro/go-plugins/pull/24) registry currently in a PR.

###### How do I use plugins?

Most of the time it's as simple as this.

```
// Import the plugin
import _ "github.com/micro/go-plugins/registry/etcd"
```

```
go run main.go --registry=etcd --registry_address=10.0.0.1:2379
```

If you want to see more of it in action, check out the post on [Micro on NATS]({{ site.baseurl }}/2016/04/11/micro-on-nats.html).

#### Wrappers

What's more, the Client and Server support the notion of middleware with something called Wrappers. By supporting
middleware we can add pre and post hooks with additional functionality around request-response handling.

Middleware is a well understood concept and something used across thousands of libraries to date. You can
immediately see the benefits in use cases such as circuit breaking, rate limiting, authentication, logging, tracing, etc.

```
// Client Wrappers
type Wrapper func(Client) Client
type StreamWrapper func(Streamer) Streamer

// Server Wrappers
type HandlerWrapper func(HandlerFunc) HandlerFunc
type SubscriberWrapper func(SubscriberFunc) SubscriberFunc
type StreamerWrapper func(Streamer) Streamer
```
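
For a sense of what implementing one looks like, here's a hedged sketch of a logging client Wrapper matching the `func(Client) Client` signature above; the names are our own.

```
import (
	"log"

	"github.com/micro/go-micro/client"
	"golang.org/x/net/context"
)

// logWrapper embeds a client and intercepts Call to log each request.
type logWrapper struct {
	client.Client
}

func (l *logWrapper) Call(ctx context.Context, req client.Request, rsp interface{}, opts ...client.CallOption) error {
	log.Printf("calling %s.%s", req.Service(), req.Method())
	return l.Client.Call(ctx, req, rsp, opts...)
}

// NewLogWrapper satisfies the Wrapper type: func(Client) Client.
func NewLogWrapper(c client.Client) client.Client {
	return &logWrapper{c}
}
```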

###### How do I use Wrappers?

This is just as straightforward as plugins.

```
import (
	"github.com/micro/go-micro"
	"github.com/micro/go-plugins/wrapper/breaker/hystrix"
)

func main() {
	service := micro.NewService(
		micro.Name("myservice"),
		micro.WrapClient(hystrix.NewClientWrapper()),
	)
}
```

Easy right? We find many companies create their own layer on top of Micro to initialise most of the default wrappers
they're looking for, so if any new wrappers need to be added it can all be done in one place.

Let's look at a couple of wrappers now for resiliency and fault tolerance.

#### Circuit Breaking

In an SOA or microservices world, a single request can actually result in a call to multiple services, and in many cases
to dozens or more, to gather the necessary information to return to the caller. In the successful case this works quite
well, but if an issue occurs it can quickly descend into cascading failures which are difficult to recover from without
resetting the entire system.

We partially solve some of these problems in the client with request retries and blacklisting of nodes that
have failed multiple times, but at some point there may be a need to stop the client from even attempting to make the
request.

This is where circuit breakers come into play.

<p align="center">
  <img src="{{ site.baseurl }}/blog/images/circuit.png" />
</p>

The concept of circuit breakers is straightforward. The execution of a function is wrapped or associated with a monitor of
some kind which tracks failures. When the number of failures exceeds a certain threshold, the breaker is tripped and
any further call attempts return an error without executing the wrapped function. After a timeout period the circuit
is put into a half open state. If a single call fails in this state the breaker is once again tripped, however if it succeeds
we reset back to the normal state of a closed circuit.

While the internals of the Micro client have some fault tolerant features built in, we shouldn't expect to be able to solve
every problem. Using Wrappers in conjunction with existing circuit breaker implementations, we can benefit greatly.
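
The hystrix wrapper shown earlier is one option. As a hedged illustration of the same idea with a different library (our own sketch, not the plugin's actual code), here's a breaker wrapper built on sony/gobreaker:

```
import (
	"github.com/micro/go-micro/client"
	"github.com/sony/gobreaker"
	"golang.org/x/net/context"
)

// breakerWrapper guards every Call with a circuit breaker.
type breakerWrapper struct {
	client.Client
	cb *gobreaker.CircuitBreaker
}

func (b *breakerWrapper) Call(ctx context.Context, req client.Request, rsp interface{}, opts ...client.CallOption) error {
	// Execute runs the call while the circuit is closed or half open;
	// once tripped it returns an error without invoking the function.
	_, err := b.cb.Execute(func() (interface{}, error) {
		return nil, b.Client.Call(ctx, req, rsp, opts...)
	})
	return err
}

// NewBreakerWrapper satisfies the Wrapper type: func(Client) Client.
func NewBreakerWrapper(c client.Client) client.Client {
	return &breakerWrapper{c, gobreaker.NewCircuitBreaker(gobreaker.Settings{})}
}
```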

#### Rate Limiting

Wouldn't it be nice if we could just serve all the requests in the world without breaking a sweat? Ah, the dream. Well, the real
world doesn't really work like that. Processing a query takes a certain period of time and given the limitations of resources
there's only so many requests we can actually serve.

At some point we need to think about limiting the number of requests we can either make or serve in parallel. This is where
rate limiting comes into play. Without rate limiting it can be very easy to run into resource exhaustion or completely cripple
the system and stop it from being able to serve any further requests. This is usually the basis for a great DDoS attack.

Everyone has heard of, used or maybe even implemented some form of rate limiting. There are quite a few different rate limiting
algorithms out there, one of which is the [Leaky Bucket](https://en.wikipedia.org/wiki/Leaky_bucket) algorithm. We're not
going to go into the specifics of the algorithm here but it's worth reading about.

Once again we can make use of Micro Wrappers and existing libraries to perform this function. An existing implementation
can be found [here](https://github.com/micro/go-plugins/blob/master/wrapper/ratelimiter/ratelimit/ratelimit.go).
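
As a hedged sketch of the pattern, here's our own illustration using golang.org/x/time/rate rather than the plugin linked above:

```
import (
	"github.com/micro/go-micro/client"
	"golang.org/x/net/context"
	"golang.org/x/time/rate"
)

// limitWrapper blocks each Call until the token bucket permits it.
type limitWrapper struct {
	client.Client
	limiter *rate.Limiter
}

func (l *limitWrapper) Call(ctx context.Context, req client.Request, rsp interface{}, opts ...client.CallOption) error {
	// Wait blocks until a token is available or the context is cancelled.
	if err := l.limiter.Wait(ctx); err != nil {
		return err
	}
	return l.Client.Call(ctx, req, rsp, opts...)
}

// NewLimitWrapper allows qps requests per second with bursts of qps.
func NewLimitWrapper(qps int) client.Wrapper {
	return func(c client.Client) client.Client {
		return &limitWrapper{c, rate.NewLimiter(rate.Limit(qps), qps)}
	}
}
```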

A system we're actually interested in seeing an implementation for is YouTube's [Doorman](https://github.com/youtube/doorman),
a global distributed client side rate limiter. We're looking for a community contribution for this, so please get in touch!

### The Server Side

All of this has covered quite a lot about the client side features and use cases. What about the server side? The first thing to note
is that Micro leverages the go-micro client for the API, CLI, Sidecar and so on. These benefits translate across the entire
architecture from the edge down to the very last backend service. We still need to address some basics for the server though.

While on the client side the registry is used to find services, the server side is where the registration actually occurs. When
an instance of a service comes up, it registers itself with the service discovery mechanism and deregisters when it exits gracefully.
The keyword being "gracefully".

<p align="center">
  <img src="{{ site.baseurl }}/blog/images/register.png" />
</p>

###### Dealing With Failure

In a distributed system we have to deal with failures; we need to be fault tolerant. The registry supports TTLs to expire or mark
nodes as unhealthy based on whatever the underlying service discovery mechanism is, e.g. consul or etcd, while the service itself
supports re-registration. The combination of the two means the service node will re-register on a set interval while it's healthy,
and the registry will expire the node if it's not refreshed. If the node fails for any reason and does not re-register, it will be
removed from the registry.
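
Both the TTL and the re-registration interval are exposed as service options. A minimal sketch, with arbitrary example durations; the option names assume the micro API of the time.

```
import (
	"log"
	"time"

	"github.com/micro/go-micro"
)

func main() {
	service := micro.NewService(
		micro.Name("myservice"),
		// Discovery expires the node if it isn't refreshed within the TTL...
		micro.RegisterTTL(time.Second*30),
		// ...so the service re-registers on a shorter interval while healthy.
		micro.RegisterInterval(time.Second*15),
	)
	service.Init()

	if err := service.Run(); err != nil {
		log.Fatal(err)
	}
}
```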

This fault tolerant behaviour was not initially included as part of go-micro, but we quickly saw from real world use that
it was very easy to fill the registry with stale nodes because of panics and other failures which cause services to exit ungracefully.

The knock on effect was that the client would be left to deal with dozens if not hundreds of stale entries. While the client
needs to be fault tolerant as well, we think this functionality eliminates a lot of issues upfront.

###### Adding Further Functionality

Another thing to note, as mentioned above: the server also provides the ability to use Wrappers, or Middleware as it's more commonly known, which means
we can use circuit breaking, rate limiting, and other features at this layer to control request flow, concurrency, etc.

The functionality of the server is purposely kept simple but pluggable so that features can be layered on top as required.
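
As a hedged sketch, a server side handler wrapper mirrors the client one; the Method accessor on the request assumes the server API of the time.

```
import (
	"log"

	"github.com/micro/go-micro"
	"github.com/micro/go-micro/server"
	"golang.org/x/net/context"
)

// logHandlerWrapper logs every request before invoking the handler.
func logHandlerWrapper(fn server.HandlerFunc) server.HandlerFunc {
	return func(ctx context.Context, req server.Request, rsp interface{}) error {
		log.Printf("serving %s", req.Method())
		return fn(ctx, req, rsp)
	}
}

func main() {
	service := micro.NewService(
		micro.Name("myservice"),
		micro.WrapHandler(logHandlerWrapper),
	)
	service.Init()

	if err := service.Run(); err != nil {
		log.Fatal(err)
	}
}
```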

### Clients vs Sidecars

Most of what's being discussed here exists in the core [go-micro](https://github.com/micro/go-micro) library. While this is great
for all the Go programmers, everyone else may be wondering, how do I get all these benefits?

From the very beginning, Micro has included the concept of a [Sidecar](https://github.com/tickoalcantara12/micro/tree/master/car), an HTTP proxy with all
the features of go-micro built in. So regardless of which language you're building your applications with, you can benefit from all
we've discussed above by using the Micro Sidecar.

<p align="center">
  <img src="{{ site.baseurl }}/blog/images/sidecar-rpc.png" style="width: 100%; height: auto;" />
</p>

The sidecar pattern is nothing new. NetflixOSS has one called [Prana](https://github.com/Netflix/Prana) which leverages the JVM based
NetflixOSS stack. Buoyant have recently entered the game with an incredibly feature rich system called [Linkerd](https://linkerd.io/),
an RPC proxy that layers on top of Twitter's [Finagle](https://finagle.github.io/blog/) library.

The Micro Sidecar uses the default go-micro Client. So if you want to add other functionality you can augment it very easily and rebuild.
We'll look to simplify this process much more in the future and provide a version prebuilt with all the nifty fault tolerant features.

### Wait, There's More

The blog post covers a lot about the core [go-micro](https://github.com/micro/go-micro) library and surrounding toolkit. These tools
are a great start but they're not enough. When you want to run at scale, when you want hundreds of microservices that serve millions of
requests, there's still a lot more to be addressed.

###### The Platform

This is where the [go-platform](https://github.com/micro/go-platform) and [platform](https://github.com/micro/platform) come into play.
Where micro addresses the fundamental building blocks, the platform goes a step further by addressing the requirements for running
at scale: authentication, distributed tracing, synchronization, healthcheck monitoring, etc.

Distributed systems require a different set of tools for observability, consensus and coordinating fault tolerance, and the micro platform
looks to help with those needs. By providing a layered architecture we can build on the primitives defined by the core tools and
enhance their functionality where needed.

It's still early days but the hope is that the micro platform will solve a lot of the problems organisations have with building
distributed systems platforms.

### How Do I Use All These Tools?

As you can gather from the blog post, most of these features are built into the Micro toolkit. You can go check out the project on
[GitHub](https://github.com/tickoalcantara12/micro) and get started writing fault tolerant Micro services almost instantly.

If you need help or have questions, come join the community on [Slack](https://slack.m3o.com). It's very active and
growing fast, with a broad range of users, from people hacking on side projects to companies already using Micro in production today.

### Summary

Technology is rapidly evolving and cloud computing now gives us access to almost unlimited scale. Trying to keep up with the pace of
change can be difficult and building scalable fault tolerant systems for the new world is still challenging.

But it doesn't have to be this way. As a community we can help each other to adapt to this new environment and build products
that will scale with our growing demands.

Micro looks to help in this journey by providing the tools to simplify building and managing distributed systems. Hopefully
this blog post has helped demonstrate some of the ways we're looking to do just that.

If you want to learn more about the services we offer or microservices, check out the [blog](/), the website
[micro.mu](https://m3o.com) or the github [repo](https://github.com/tickoalcantara12/micro).

Follow us on Twitter at [@MicroHQ](https://twitter.com/m3ocloud) or join the [Slack](https://slack.m3o.com)
community [here](http://slack.m3o.com).