---
author: Milos, Jake and Asim
layout: post
title: Building a global services network using Go, QUIC and Micro
date: 2019-12-05 09:00:00
---

Over the past 6 months we at [Micro](https://m3o.com/) have been hard at work developing a global service network to build, share and collaborate on microservices.

In this post we're going to share some of the technical details, the design decisions we made, the challenges we faced and ultimately how we succeeded in building the microservices network.

## Motivations

The power of collaborative development has largely been restricted to trusted environments within organisations. When done right, these private in-house platforms unlock incredible productivity and compounding value with every new service added. They provide an always-on runtime and a known developer workflow for engineers to collaborate on and deliver new features to their customers.

Historically, this has been quite difficult to achieve outside of organisations. When developers decide to work on new services they often have to deal with a lot of unnecessary work when it comes to making the services available to others to consume and collaborate on. Public cloud providers are too complex, and the elaborate setups required when hosting things yourself don’t make things easier either. At [Micro](https://m3o.com/) we felt this pain and decided to do something about it. We built a microservices network!

The micro network looks to solve these problems using a shared global network for microservices. Let’s see how we’ve made this dream a reality!

## Design

The micro network is a globally distributed network based on [go-micro](https://go-micro.dev), a Go microservices framework which enables developers to build services quickly without dealing with the complexity of distributed systems. Go Micro provides strongly opinionated interfaces that are pluggable but also come with sane defaults. This allows Go Micro services to be built once and deployed anywhere, with zero code changes.

The micro network leverages five of the core primitives: registry, transport, broker, client and server. Our default implementations can be found in each package in the [go-micro](https://github.com/micro/go-micro) framework. Community maintained plugins live in the [go-plugins](https://github.com/micro/go-plugins) repo.

The micro "network" is an overloaded term, referring both to the global network over which services discover and communicate with each other and to the underpinning system of peer nodes which connect to each other and establish the routes over which services communicate.

The network abstracts away the low level details of distributed system communication at large scale, across any cloud or machine, and allows anyone to build services together without thinking about where they are running. This essentially enables large scale sharing of resources and, more importantly, microservices.

There are four fundamental concepts that make the micro network possible. These are entirely new and built into [Go Micro](https://go-micro.dev/) as of the last 6 months:

- **Tunnel** - point to point tunnelling
- **Proxy** - transparent rpc proxying
- **Router** - route aggregation and advertising
- **Network** - multi-cloud networking built on the above three

Each of these components is just like any other [Go Micro](https://go-micro.dev/) component - pluggable, with an out of the box default implementation to get started.
In the case of the micro network, it was important that the defaults worked at scale across the world.

Let’s dig into the details.

### Tunnel

From a high level view the micro network is an overlay network that spans the internet. All micro network nodes maintain secure tunnel connections between each other to enable secure communication between the services running in the network. Go Micro provides a default tunnel implementation using the QUIC protocol along with custom session management.

We chose QUIC because it provides some excellent properties, especially when it comes to dealing with high latency networks, an important property when [running services in large distributed networks](https://eng.uber.com/employing-quic-protocol/). QUIC runs over UDP, but by adding some connection based semantics it supports reliable packet delivery. QUIC also supports multiple streams without [head of line blocking](https://en.wikipedia.org/wiki/Head-of-line_blocking) and it’s designed to work with encryption natively. Finally, QUIC runs in userspace rather than in kernel space on conventional systems, so it can offer both performance and security benefits.

Micro tunnel uses [quic-go](https://github.com/lucas-clemente/quic-go), which was the most complete Go implementation of QUIC that we could find at the inception of the micro network. We are aware quic-go is a work in progress and that it can occasionally break, but we are happy to pay the early adopter cost as we believe QUIC will become the de facto standard internet communication protocol in the future, enabling large scale networks such as the micro network.

Let’s look at the Go Micro tunnel interface:

```go
// Tunnel creates a gre tunnel on top of the go-micro/transport.
// It establishes multiple streams using the Micro-Tunnel-Channel header
// and Micro-Tunnel-Session header. The tunnel id is a hash of
// the address being requested.
type Tunnel interface {
	// Address the tunnel is listening on
	Address() string
	// Connect connects the tunnel
	Connect() error
	// Close closes the tunnel
	Close() error
	// Links returns all the links the tunnel is connected to
	Links() []Link
	// Dial to a tunnel channel
	Dial(channel string, opts ...DialOption) (Session, error)
	// Listen accepts connections on a channel
	Listen(channel string, opts ...ListenOption) (Listener, error)
}
```

It may look fairly familiar to Go developers. With Go Micro we’ve tried to maintain common interfaces in line with distributed systems development while stepping in at a lower layer to solve some of the nitty gritty details.

Most of the interface methods should hopefully be self-explanatory, but you might be wondering about channels and sessions. Channels are much like addresses, providing a way to segment different message streams over the tunnel. Listeners listen on a given channel and return a unique session when a client dials into the channel. The session is used to communicate between peers on the same tunnel channel. The Go Micro tunnel provides different communication semantics too: you can choose to use either unicast or multicast.

<img src="https://m3o.com/docs/images/session.svg" alt="" />

In addition, tunnels enable bidirectional connections; sessions can be dialled or listened on from either side. This enables the reversal of connections, so anything behind a [NAT](https://en.wikipedia.org/wiki/Network_address_translation) or without a public IP can become a server.

### Router

Micro router is a critical component of the micro network. It provides the network’s routing plane. Without the router, we wouldn’t know where to send messages. It constructs a routing table based on the local service registry (a component of Go Micro). The routing table maintains the routes to the services available on the local network.
With the tunnel it’s then also able to process messages from any other datacenter or network, enabling global routing by default.

Our default routing table implementation uses a simple in-memory Go map, but as with all things in Go Micro, the router and routing table are both pluggable. As we scale we’re thinking about alternative implementations and even the possibility of switching dynamically based on the size of networks.

The Go Micro router interface is as follows:

```go
// Router is an interface for a routing control plane
type Router interface {
	// The routing table
	Table() Table
	// Advertise advertises routes to the network
	Advertise() (<-chan *Advert, error)
	// Process processes incoming adverts
	Process(*Advert) error
	// Solicit advertises the whole routing table to the network
	Solicit() error
	// Lookup queries routes in the routing table
	Lookup(...QueryOption) ([]Route, error)
	// Watch returns a watcher which tracks updates to the routing table
	Watch(opts ...WatchOption) (Watcher, error)
}
```

When the router starts it automatically creates a watcher for its local registry. The micro registry emits events any time services are created, updated or deleted. The router processes these events and then applies actions to its routing table accordingly. The router itself advertises the routing table events, which you can think of as a cut down version of the registry solely concerned with the routing of requests, whereas the registry provides more feature rich information like API endpoints.

These routes are propagated as events to other routers on both the local and global network and applied by every router to its own routing table, thus maintaining the global network routing plane.
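
To make the event handling above concrete, here is a minimal sketch of a router applying registry events to an in-memory routing table. The `Event`, `EventType` and `Table` types and the `Apply` method are hypothetical simplifications invented for this illustration; they are not the go-micro registry or router API. The sketch only shows the idea that an event either changes the table (and would then be worth advertising to peers) or is a no-op.

```go
package main

import "fmt"

// EventType is a hypothetical registry event kind (not the go-micro type).
type EventType int

const (
	Create EventType = iota
	Update
	Delete
)

// Event is a simplified stand-in for a registry event.
type Event struct {
	Type    EventType
	Service string
	Address string
}

// Table is a toy routing table: service name -> set of node addresses.
type Table struct {
	routes map[string]map[string]bool
}

// NewTable returns an empty routing table.
func NewTable() *Table {
	return &Table{routes: make(map[string]map[string]bool)}
}

// Apply updates the table for one event and reports whether the table
// changed, i.e. whether there is something new to advertise.
func (t *Table) Apply(e Event) bool {
	switch e.Type {
	case Create, Update:
		if t.routes[e.Service][e.Address] {
			return false // route already known
		}
		if t.routes[e.Service] == nil {
			t.routes[e.Service] = make(map[string]bool)
		}
		t.routes[e.Service][e.Address] = true
		return true
	case Delete:
		if !t.routes[e.Service][e.Address] {
			return false // nothing to remove
		}
		delete(t.routes[e.Service], e.Address)
		if len(t.routes[e.Service]) == 0 {
			delete(t.routes, e.Service)
		}
		return true
	}
	return false
}

// Count returns the number of known addresses for a service.
func (t *Table) Count(service string) int {
	return len(t.routes[service])
}

func main() {
	table := NewTable()
	events := []Event{
		{Create, "greeter", "10.0.0.1:8080"},
		{Create, "greeter", "10.0.0.2:8080"},
		{Update, "greeter", "10.0.0.1:8080"},
		{Delete, "greeter", "10.0.0.2:8080"},
	}
	for _, e := range events {
		fmt.Println(e.Service, e.Address, "changed:", table.Apply(e))
	}
	fmt.Println("greeter routes:", table.Count("greeter"))
}
```

Returning a "changed" flag is one simple way to avoid re-advertising events that do not alter the table, which matters once events start propagating between many routers.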

Here’s a look at a typical route:

```go
// Route is a network route
type Route struct {
	// Service is destination service name
	Service string
	// Address is service node address
	Address string
	// Gateway is route gateway
	Gateway string
	// Network is the network name
	Network string
	// Router is router id
	Router string
	// Link is network link
	Link string
	// Metric is the route cost
	Metric int64
}
```

What we’re primarily concerned with here is routing by service name first, finding its address if it’s local or a gateway if we have to go through some remote endpoint or a different network. We also want to know what type of Link to use, e.g. whether we are routing through our tunnel, a Cloudflare Argo tunnel or some other network implementation. And then most importantly the metric, a.k.a. the cost of routing to that node. We may have many routes and we want to take the routes with optimal cost to ensure the lowest latency. This doesn’t always mean your request is sent to the local network though! Imagine a situation where the service running on your local network is overloaded. We will always pick the route with the lowest cost no matter where the service is running.

### Proxy

We’ve already discussed the tunnel (how messages get from point to point) and routing (how to find where the services are), but then the question really is: how do services actually make use of this? For this we really need a proxy.

It was important to us when building the micro network that we built something that was native to micro and capable of understanding our routing protocol. Building another VPN or IP based networking solution was not our goal. Instead we wanted to facilitate communication between services.

When a service needs to communicate with other services in the network it uses micro proxy.
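
The proxy leans on the route selection described in the Router section. As a rough sketch of that lookup, assuming a trimmed-down `Route` type and made-up metric values, the following picks the lowest-cost route for a service. `Best`, `NextHop` and `ErrNoRoute` are hypothetical names used only for this illustration, not part of Go Micro, and how real metrics are computed is not covered here.

```go
package main

import (
	"errors"
	"fmt"
)

// Route keeps only the fields of the Route struct needed for this sketch.
type Route struct {
	Service string
	Address string
	Gateway string
	Link    string
	Metric  int64
}

// ErrNoRoute is returned when no route exists for a service (hypothetical).
var ErrNoRoute = errors.New("no route found")

// Best returns the route with the lowest metric for the given service,
// regardless of whether it is local or remote.
func Best(routes []Route, service string) (Route, error) {
	var best Route
	found := false
	for _, r := range routes {
		if r.Service != service {
			continue
		}
		if !found || r.Metric < best.Metric {
			best, found = r, true
		}
	}
	if !found {
		return Route{}, ErrNoRoute
	}
	return best, nil
}

// NextHop returns where to send the request: the gateway when the route
// goes through one, otherwise the service node address itself.
func NextHop(r Route) string {
	if r.Gateway != "" {
		return r.Gateway
	}
	return r.Address
}

func main() {
	// Metric values are invented: the local instance is "overloaded".
	routes := []Route{
		{Service: "greeter", Address: "10.0.0.1:8080", Link: "local", Metric: 50},
		{Service: "greeter", Address: "172.16.0.9:8080", Gateway: "203.0.113.7:8085", Link: "network", Metric: 10},
		{Service: "users", Address: "10.0.0.3:8080", Link: "local", Metric: 1},
	}

	r, err := Best(routes, "greeter")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("link:", r.Link, "next hop:", NextHop(r))
}
```

In this example the remote route wins because its metric is lower, matching the point that the cheapest route is not always the local one.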

The proxy is a native RPC proxy implementation built on the Go Micro `Client` and `Server` interfaces. It encapsulates the core means of communication for our services and provides a forwarding mechanism for requests based on service name and endpoints. Additionally, it can act as a messaging exchange for asynchronous communication, since Go Micro supports both request/response and pub/sub communication. This is native to Go Micro and a powerful building block for request routing.

The interface itself is straightforward and encapsulates the complexity of proxying.

```go
// Proxy can be used as a proxy server for go-micro services
type Proxy interface {
	// ProcessMessage handles inbound messages
	ProcessMessage(context.Context, server.Message) error
	// ServeRequest handles inbound requests
	ServeRequest(context.Context, server.Request, server.Response) error
}
```

The proxy receives RPC requests and routes them to an endpoint. It asks the router for the location of the service (caching as needed) and decides, based on the `Link` field in the routing table, whether to send the request locally or over the tunnel across the global network. The value of the `Link` field is either `"local"` (for local services) or `"network"` if the service is accessible only via the network.

Like everything else, the proxy is something we built standalone that can work between services in one datacenter but also across many when used in conjunction with the tunnel and router.

And finally we arrive at the pièce de résistance: the network interface.

### Network

Network nodes are the magic that ties all the core components together, making it possible to build a truly global service network. It was really important when creating the network interface that it fit in line with our existing assumptions and understanding about Go Micro and distributed systems development.
We really wanted to embrace the existing interfaces of the framework and design something with symmetry in regards to a Service.

What we arrived at was something very similar to the [micro.Service](https://github.com/micro/go-micro/blob/master/micro.go#L16) interface itself:

```go
// Network is a micro network
type Network interface {
	// Node is network node
	Node
	// Name of the network
	Name() string
	// Connect starts the resolver and tunnel server
	Connect() error
	// Close stops the tunnel and resolving
	Close() error
	// Client is micro client
	Client() client.Client
	// Server is micro server
	Server() server.Server
}

// Node is a network node
type Node interface {
	// Id is node id
	Id() string
	// Address is node bind address
	Address() string
	// Peers returns node peers
	Peers() []Node
	// Network is the network node is in
	Network() Network
}
```

As you can see, a `Network` has a Name, Client and Server, much like a `Service`, so it provides a similar method of communication. This means we can reuse a lot of the existing code base, but it also goes much further. A `Network` includes the concept of a `Node` directly in the interface, one which has peers and which may belong to the same network or to others. This means networks are peer-to-peer while Services are largely focused on Client/Server. On a day to day basis developers stay focused on building services, but these, when built to communicate globally, need to operate across networks made up of identical peers.

Our networks have the ability to behave as peers which route for others but may also provide some sort of service themselves. In this case it's mostly routing related information.

So how does it all work together?

Networks have a list of peer nodes to talk to.
In the case of the default implementation the peer list comes from the registry, which holds the other network nodes registered under the same name (the name of the network itself). When a node starts it "connects" to the network by establishing its tunnel, resolving the nodes and then connecting to them. Once they’ve connected, the nodes peer over two multicast sessions, one for peer announcements and the other for route advertisements. As these propagate, the network begins to converge on identical routing information, building a full mesh that allows for routing of services from any node to any other.

The nodes maintain keepalives, periodically advertise the full routing table and flush any events as they occur. Our core network nodes make use of multiple resolvers to find each other, including DNS and the local registry. In the case of peers that join our network, we’ve configured them to use an HTTP resolver which gets geo-steered via Cloudflare anycast DNS and globally load balanced to the most local region. From there they pull a list of nodes and connect to the ones with the lowest metric. They then repeat the same song and dance as above to continue the growth of the network and participate in service routing.

Each node maintains its own network graph based on the peer messages it receives. Peer messages contain the graph of each peer up to 3 hops, which enables every node to build a local view of the network. Peers ignore anything beyond a 3 hop radius. This is to avoid potential performance problems.

We mentioned a little something about peer and route advertisements. So what messages do the network nodes actually exchange? First, the network embeds the router interface through which it advertises its local routes to other network nodes. These routes are then propagated across the whole network, much like the internet.
The node itself receives route advertisements from its peers and applies the advertised changes to its own routing table. The message types are "solicit" to ask for routes and "advert" for broadcasting updates.

Network nodes send "connect" messages on start and "close" on exit. For their lifetime they periodically broadcast "peer" messages so that others can discover them and they can all build the network topology.

When the network is created and converges, services are then capable of sending messages across it. When a service on the network needs to communicate with some other service on the network it sends a request to the network node. The micro network node embeds micro proxy and thus has the ability to forward the request through the network, or locally if it deems that a better fit, based on the metrics it retrieves after looking up the routes in the routing table.

This as a whole forms our micro services network.

## Challenges

Building a global services network is not without its challenges. We encountered many, from the initial design phase right through to the present day of dealing with broken nodes, bad actors, event storms and more.

### Initial Implementation

The actual task we’d set out to accomplish was pretty monumental and we’d underestimated how much effort it would take, even in the MVP phase of the first implementation.

Every time we attempted to go from design diagram to implementing code we found ourselves stuck. In theory everything made sense, but no matter how many times we attempted to write code things just didn’t click.

We wrote 3-4 implementations that were essentially thrown away before figuring out that the best approach was to make local networking work first and then slowly carve out specific problems to solve. So proxying, followed by routing and then a network interface.
Eventually, when these pieces were in place, we could get to multi-cloud or global networking by implementing a tunnel to handle the heavy lifting.

<center>
<img src="https://m3o.com/images/it-works.jpg" style="width: 80%; height: auto;" />
</center>

Once again, the lesson is to keep it simple, but where the task itself is complex, break it down into steps you can actually keep simple independently and then piece them back together in practice.

### Multipoint Tunneling

One of the most complex pieces of code we had to write was the tunnel. It's still not where we’d like it to be, but it was pretty important to us to write this from the ground up so we’d have a good understanding of how we wanted to operate globally and also have full control over the foundations of the system.

The complexity in writing network protocols really came to light in this effort, from trying NOT to reimplement TCP, crypto or anything else while still finding a path to a working solution. In the end we were able to create a subset of commands which formed enough of a bidirectional and multi-session based tunnel over QUIC. We left most of the heavy lifting to QUIC but we also needed the ability to do multicast.

For us it didn’t make sense to just rely on unicast. Considering the async and pubsub based semantics built into Go Micro, we felt pretty adamant that multicast needed to be part of core network routing. So with that, sessions needed to be reimplemented on top of QUIC.

We’ll spare you the gory details, but what’s really clear to us is that writing state machines and reliable connection protocol code is not where we want to spend the majority of our time. We have a fairly resilient working solution but our hope is to replace this with something far better in the future.

### Event Storms

When things work they work and when they break they break badly.
For us everything came crashing down when we started to encounter broadcast storms caused by services being recycled. When a service is created the service registry fires a create event, and when it’s shutting down it automatically deregisters from the service registry, which fires a delete event. Maybe you can see where this is going. As services cycled in our network they’d generate these events, which led to the routers generating new route events which were then propagated every 5 seconds to every other node on the network.

This sounds ok if the network converges and the nodes stop propagating events, but in our case the sequence of events is observed and applied at random time intervals on every node. This in essence can lead to a broadcast storm which never stops. Understanding and resolving this is an incredibly difficult task.

For us this really led to research into BGP internet routing, in which flap detection algorithms have been defined to handle this. We read a few whitepapers to get familiar with the concepts and hacked up a simple flap detection algorithm in the router.

At its core, the flap detection assigns a numerical cost to every route event. When a route event occurs its cost gets incremented. If the same event happens multiple times within a certain period of time and the accumulated cost reaches a predefined threshold, the event is immediately suppressed. Suppressed events are not processed by the router, but are kept in a memory cache for a predefined period of time. Meanwhile the cost of the event decays with time, whilst at the same time it can keep on growing if the event keeps on flapping. If the cost drops below another threshold the event is unsuppressed and can be processed by the routers. If the event remains suppressed for longer than a predefined time period it’s discarded.

The picture below depicts how the decaying actually works.

<center>
<img src="https://m3o.com/assets/images/flap-detection.png" style="width: 80%; height: auto;" />
</center>

<small>source: http://linuxczar.net/blog/2016/01/31/flap-detection/</small>

This had a huge effect on the issues we had been experiencing in the network. The nodes were no longer hammered with crazy event storms and the network stabilised and continued to work without any interruptions. Happy days!

## Architecture

Our overall goal is to build a micro services network that manages not only communication but all aspects of running services, governance, and more. To accomplish this we started by addressing networking from the ground up for Go Micro services, not just to communicate locally within one private network but to have the ability to do so across many networks.

For this purpose we’ve created a global multi-cloud network that enables communication from anywhere, with anyone. This is fundamental to the architecture of the micro services network.

Our next goal will be to tackle the runtime aspects so that we offer the ability to host services without the need to manage them. This could be imagined as the basis of a serverless microservices platform which we’re looking to launch soon.

The platform is designed to be open. Anyone should be able to run services on the platform or join the global network using their own node. What’s more, you can even pick up the open source code and build your own private networks or join yours to our public one.

<center>
<img src="https://github.com/micro/development/raw/f4c77580acac228c522623c217575fb266d2d4ab/images/arch.jpg" style="width: 80%; height: auto;" />
</center>
<br>

What we think is pretty cool and rather unique about the micro network is that the network nodes themselves are just regular micro services. Because we built everything using Go Micro they behave just like any other service.
In fact what’s even more exciting is that literally *everything is a service* in the micro network.

This holds true for all the individual components that make up the network. If you don’t want to run full network nodes, you can also run individual components of the network, such as the tunnel, router and proxy, separately as standalone micro services. All the components register themselves with the local registry, via which they can be discovered.

## Eventual success

On 29th August 2019 around 4PM we sent the first successful RPC request between our laptops across the internet using the micro network.

<center>
<img src="https://m3o.com/assets/images/success.jpg" style="width: 80%; height: auto;" />
</center>
<br>

Since then we have squashed a lot of bugs and deployed the network nodes across the globe.
At the moment we are running the micro network in 4 cloud providers across 4 geographical regions with 3 nodes in each region.

<center>
<img src="https://m3o.com/assets/images/radar.png" style="width: 80%; height: auto;" />
</center>

## Usage

If you're interested in testing out micro and the network just do the following.

```shell
# enable go modules
export GO111MODULE=on

# download micro
go get github.com/tickoalcantara12/micro@master

# connect to the network
micro --peer
```

Now you're connected to the network. Start to explore what's there.

```shell
# List the services in the network
micro network services

# See which nodes you're connected to
micro network connections

# List all the nodes in your network graph
micro network nodes

# See what the metrics look like to different service routes
micro network routes
```

So what does a micro network developer workflow look like?
Developers write their Go code using the [Go Micro](https://github.com/micro/go-micro) framework and once they’re ready they can make their services available on the network either directly from their laptop or from anywhere a micro network node runs (more on what a micro network node is later).

Here is an example of a simple service written using `go-micro`:

```go
package main

import (
	"context"
	"log"

	hello "github.com/micro/examples/greeter/srv/proto/hello"
	"github.com/micro/go-micro"
)

type Say struct{}

func (s *Say) Hello(ctx context.Context, req *hello.Request, rsp *hello.Response) error {
	log.Print("Received Say.Hello request")
	rsp.Msg = "Hello " + req.Name
	return nil
}

func main() {
	service := micro.NewService(
		micro.Name("helloworld"),
	)

	// optionally setup command line usage
	service.Init()

	// Register Handlers
	hello.RegisterSayHandler(service.Server(), new(Say))

	// Run server
	if err := service.Run(); err != nil {
		log.Fatal(err)
	}
}
```

Once you launch the service it automatically registers with the service registry and becomes instantly accessible to everyone on the network to consume and collaborate on. All of this is completely transparent to developers. No need to deal with low level distributed systems cruft!

We’re already running a greeter service in the network so why not try giving it a call.

```shell
# enable proxying through the network
export MICRO_PROXY=go.micro.network

# call a service
micro call go.micro.srv.greeter Say.Hello '{"name": "John"}'
```

It works!

## Conclusion

Building distributed systems is difficult, but it turns out building the networks they communicate over is an equally, if not more difficult, problem.
The classic fallacy, [the network is reliable](https://queue.acm.org/detail.cfm?id=2655736), continues to hold, as we found while building the micro network. However, what’s also clear is that our world and most technology thrive through the use of networks. They underpin the very fabric of all that we’ve come to know. Our goal with the micro network is to create a new type of foundation for the open services of the future. Hopefully this post has shed some light on the technical accomplishments and challenges of building such a thing.

<br />
To learn more check out the [website](https://m3o.com), follow us on [twitter](https://twitter.com/m3ocloud) or
join the [slack](https://m3o.com/slack) community. We are hiring!