---
author: Milos, Jake and Asim
layout: post
title: Building a global services network using Go, QUIC and Micro
date: 2019-12-05 09:00:00
---

Over the past 6 months we at [Micro](https://m3o.com/) have been hard at work developing a global service network to build, share and collaborate on microservices.

In this post we're going to share some of the technical details, the design decisions we made, challenges we faced and ultimately how we have succeeded in building the microservices network.

## Motivations

The power of collaborative development has largely been restricted to trusted environments within organisations. When done right, these private in-house platforms unlock incredible productivity and compounding value with every new service added. They provide an always-on runtime and known developer workflow for engineers to collaborate on and deliver new features to their customers.

Historically, this has been quite difficult to achieve outside of organisations. When developers decide to work on new services they often have to deal with a lot of unnecessary work when it comes to making the services available to others to consume and collaborate on. Public cloud providers are too complex and the elaborate setups when hosting things yourself don’t make things easier either. At [Micro](https://m3o.com/) we felt this pain and decided to do something about it. We built a microservices network!

The micro network looks to solve these problems using a shared global network for micro services. Let’s see how we’ve made this dream a reality!

## Design

The micro network is a globally distributed network based on [go-micro](https://go-micro.dev), a Go microservices framework which enables developers to build services quickly without dealing with the complexity of distributed systems. Go Micro provides strongly opinionated interfaces that are pluggable but also come with sane defaults. This allows Go Micro services to be built once and deployed anywhere, with zero code changes.

The micro network leverages five of the core primitives: registry, transport, broker, client and server. Our default implementations can be found in each package in the [go-micro](https://github.com/micro/go-micro) framework. Community maintained plugins live in the [go-plugins](https://github.com/micro/go-plugins) repo.

The micro "network" is an overloaded term, referring both to the global network over which services discover and communicate with each other and the underpinning system consisting of peer nodes which connect to each other and establish the routes over which services communicate.

The network abstracts away the low level details of distributed system communication at large scale, across any cloud or machine, and allows anyone to build services together without thinking about where they are running. This essentially enables large scale sharing of resources and more importantly microservices.

There are four fundamental concepts that make the micro network possible. These are entirely new and built into [Go Micro](https://go-micro.dev/) as of the last 6 months:

- **Tunnel** - point to point tunnelling
- **Proxy** - transparent rpc proxying
- **Router** - route aggregation and advertising
- **Network** - multi-cloud networking built on the above three

Each of these components is just like any other [Go Micro](https://go-micro.dev/) component - pluggable, with an out of the box default implementation to get started. For the micro network it was important that the defaults worked at scale across the world.

Let’s dig into the details.

### Tunnel

From a high level view the micro network is an overlay network that spans the internet. All micro network nodes maintain secure tunnel connections between each other to enable the secure communication between the services running in the network. Go Micro provides a default tunnel implementation using the QUIC protocol along with custom session management.

We chose QUIC because it provides some excellent properties especially when it comes to dealing with high latency networks, an important property when dealing with [running services in large distributed networks](https://eng.uber.com/employing-quic-protocol/). QUIC runs over UDP, but by adding some connection based semantics it supports reliable packet delivery. QUIC also supports multiple streams without [head of line blocking](https://en.wikipedia.org/wiki/Head-of-line_blocking) and it’s designed to work with encryption natively. Finally, QUIC runs in userspace, not in the kernel space on conventional systems, so it can provide both a performance and a security benefit.

Micro tunnel uses [quic-go](https://github.com/lucas-clemente/quic-go) which is the most complete Go implementation of QUIC that we could find at the inception of the micro network. We are aware quic-go is a work in progress and that it can occasionally break, but we are happy to pay the early adopter cost as we believe QUIC will become the de facto standard internet communication protocol in the future, enabling large scale networks such as the micro network.

Let’s look at the Go Micro tunnel interface:

```go
// Tunnel creates a gre tunnel on top of the go-micro/transport.
// It establishes multiple streams using the Micro-Tunnel-Channel header
// and Micro-Tunnel-Session header. The tunnel id is a hash of
// the address being requested.
type Tunnel interface {
	// Address the tunnel is listening on
	Address() string
	// Connect connects the tunnel
	Connect() error
	// Close closes the tunnel
	Close() error
	// Links returns all the links the tunnel is connected to
	Links() []Link
	// Dial to a tunnel channel
	Dial(channel string, opts ...DialOption) (Session, error)
	// Accept connections on a channel
	Listen(channel string, opts ...ListenOption) (Listener, error)
}
```

It may look fairly familiar to Go developers. With Go Micro we’ve tried to maintain common interfaces in line with distributed systems development while stepping in at a lower layer to solve some of the nitty gritty details.

Most of the interface methods should hopefully be self-explanatory, but you might be wondering about channels and sessions. Channels are much like addresses, providing a way to segment different message streams over the tunnel. Listeners listen on a given channel and return a unique session when a client dials into the channel. The session is used to communicate between peers on the same tunnel channel. The Go Micro tunnel provides different communication semantics too. You can choose to use either unicast or multicast.

<img src="https://m3o.com/docs/images/session.svg" alt="" />

In addition tunnels enable bidirectional connections; sessions can be dialled or listened from either side. This enables the reversal of connections so anything behind a [NAT](https://en.wikipedia.org/wiki/Network_address_translation) or without a public IP can become a server.

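To make channels and the unicast/multicast distinction a little more concrete, here is a minimal in-memory sketch. It is not the Go Micro tunnel: the `mux` type and its `listen` and `send` methods are hypothetical names, there is no networking or session state, and unicast is simplified to "the first listener on the channel".

```go
package main

import "fmt"

// mux is a toy, in-memory stand-in for a tunnel. It only models how
// channel names segment message streams; it is not the go-micro API.
type mux struct {
	listeners map[string][]chan string
}

func newMux() *mux {
	return &mux{listeners: make(map[string][]chan string)}
}

// listen registers a new listener on the named channel and returns its inbox.
func (m *mux) listen(channel string) <-chan string {
	inbox := make(chan string, 8)
	m.listeners[channel] = append(m.listeners[channel], inbox)
	return inbox
}

// send delivers msg on the named channel. Unicast reaches the first
// listener only; multicast reaches every listener on that channel.
// It returns the number of listeners the message was delivered to.
func (m *mux) send(channel, msg string, multicast bool) int {
	delivered := 0
	for _, inbox := range m.listeners[channel] {
		inbox <- msg
		delivered++
		if !multicast {
			break
		}
	}
	return delivered
}

func main() {
	tun := newMux()
	a := tun.listen("routes")
	b := tun.listen("routes")
	c := tun.listen("peers")

	// Multicast reaches both "routes" listeners, unicast only one.
	fmt.Println(tun.send("routes", "advert", true))
	fmt.Println(tun.send("routes", "hello", false))

	// Nothing sent on "routes" ever shows up on the "peers" channel.
	fmt.Println(<-a, <-b, len(c))
}
```

The point of the sketch is only the addressing model: a message is scoped to a channel name, and the sender chooses whether one listener or all listeners on that channel receive it.
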
### Router

Micro router is a critical component of the micro network. It provides the network’s routing plane. Without the router, we wouldn’t know where to send messages. It constructs a routing table based on the local service registry (a component of Go Micro). The routing table maintains the routes to the services available on the local network. With the tunnel it’s then also able to process messages from any other datacenter or network, enabling global routing by default.

Our default routing table implementation uses a simple Go in memory map, but as with all things in Go Micro, the router and routing table are both pluggable. As we scale we’re thinking about alternative implementations and even the possibility of switching dynamically based on the size of networks.

The Go Micro router interface is as follows:

```go
// Router is an interface for a routing control plane
type Router interface {
	// The routing table
	Table() Table
	// Advertise advertises routes to the network
	Advertise() (<-chan *Advert, error)
	// Process processes incoming adverts
	Process(*Advert) error
	// Solicit advertises the whole routing table to the network
	Solicit() error
	// Lookup queries routes in the routing table
	Lookup(...QueryOption) ([]Route, error)
	// Watch returns a watcher which tracks updates to the routing table
	Watch(opts ...WatchOption) (Watcher, error)
}
```

When the router starts it automatically creates a watcher for its local registry. The micro registry emits events any time services are created, updated or deleted. The router processes these events and then applies actions to its routing table accordingly. The router itself advertises the routing table events, which you can think of as a cut down version of the registry solely concerned with routing of requests, whereas the registry provides more feature rich information like api endpoints.

These routes are propagated as events to other routers on both the local and global network and applied by every router to their own routing table, thus maintaining the global network routing plane.

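As a rough illustration of that event handling, here is a small sketch of a routing table kept up to date from registry events. The `event` and `table` types are hypothetical simplifications for this post, not the Go Micro registry or router types, and the table is a plain in-memory map with no locking.

```go
package main

import "fmt"

// event is a simplified, hypothetical registry event.
type event struct {
	action  string // "create", "update" or "delete"
	service string
	address string
}

// table maps a service name to the set of node addresses serving it.
type table map[string]map[string]bool

// apply updates the table in response to a single registry event.
func (t table) apply(e event) {
	switch e.action {
	case "create", "update":
		if t[e.service] == nil {
			t[e.service] = make(map[string]bool)
		}
		t[e.service][e.address] = true
	case "delete":
		delete(t[e.service], e.address)
		// Drop the service entirely once its last node is gone.
		if len(t[e.service]) == 0 {
			delete(t, e.service)
		}
	}
}

func main() {
	routes := table{}
	routes.apply(event{"create", "greeter", "10.0.0.1:8080"})
	routes.apply(event{"create", "greeter", "10.0.0.2:8080"})
	routes.apply(event{"delete", "greeter", "10.0.0.1:8080"})
	fmt.Println(len(routes["greeter"]), routes["greeter"]["10.0.0.2:8080"])
}
```
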
Here’s a look at a typical route:

```go
// Route is a network route
type Route struct {
	// Service is destination service name
	Service string
	// Address is service node address
	Address string
	// Gateway is route gateway
	Gateway string
	// Network is the network name
	Network string
	// Router is router id
	Router string
	// Link is networks link
	Link string
	// Metric is the route cost
	Metric int64
}
```

What we’re primarily concerned with here is routing by service name first, finding its address if it’s local or a gateway if we have to go through some remote endpoint or different network. We also want to know what type of Link to use, e.g. whether routing through our tunnel, Cloudflare Argo tunnel or some other network implementation. And then most importantly the metric, a.k.a. the cost of routing to that node. We may have many routes and we want to take routes with optimal cost to ensure lowest latency. This doesn’t always mean your request is sent to the local network though! Imagine a situation when the service running on your local network is overloaded. We will always pick the route with the lowest cost no matter where the service is running.

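The selection rule itself is simple. A minimal sketch, assuming a trimmed-down `route` type and a hypothetical `cheapest` helper (neither is part of Go Micro), and with made-up addresses and metric values:

```go
package main

import "fmt"

// route mirrors a few fields of the Route struct above.
type route struct {
	service string
	address string
	link    string
	metric  int64
}

// cheapest returns the lowest-metric route for the named service.
// The boolean is false when no route to the service is known.
func cheapest(routes []route, service string) (route, bool) {
	var best route
	found := false
	for _, r := range routes {
		if r.service != service {
			continue
		}
		if !found || r.metric < best.metric {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	routes := []route{
		// An overloaded local node ends up with a high cost.
		{"greeter", "10.0.0.1:8080", "local", 50},
		{"greeter", "198.51.100.7:8085", "network", 20},
		{"users", "10.0.0.3:8080", "local", 1},
	}
	r, ok := cheapest(routes, "greeter")
	fmt.Println(r.address, r.link, ok)
}
```

Here the remote `greeter` route wins because its metric is lower, even though a local instance exists.
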
### Proxy

We’ve already discussed the tunnel - how messages get from point to point, and routing - detailing how to find where the services are, but then the question really is how do services actually make use of this? For this we really need a proxy.

It was important to us when building the micro network that we build something that was native to micro and capable of understanding our routing protocol. Building another VPN or IP based networking solution was not our goal. Instead we wanted to facilitate communication between services.

When a service needs to communicate with other services in the network it uses micro proxy.

The proxy is a native RPC proxy implementation built on the Go Micro `Client` and `Server` interfaces. It encapsulates the core means of communication for our services and provides a forwarding mechanism for requests based on service name and endpoints. Additionally it has the ability to also act as a messaging exchange for asynchronous communication since Go Micro supports both request/response and pub/sub communication. This is native to Go Micro and a powerful building block for request routing.

The interface itself is straightforward and encapsulates the complexity of proxying.

```go
// Proxy can be used as a proxy server for go-micro services
type Proxy interface {
	// ProcessMessage handles inbound messages
	ProcessMessage(context.Context, server.Message) error
	// ServeRequest handles inbound requests
	ServeRequest(context.Context, server.Request, server.Response) error
}
```

The proxy receives RPC requests and routes them to an endpoint. It asks the router for the location of the service (caching as needed) and decides based on the `Link` field in the routing table whether to send the request locally or over the tunnel across the global network. The value of the `Link` field is either `"local"` (for local services) or `"network"` if the service is accessible only via the network.

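That decision can be pictured as a small dispatch on the `Link` value. The sketch below is an assumption about the shape of the logic rather than the proxy's real code: the `route` type and `target` function are hypothetical, and treating the route's gateway as the destination for `"network"` routes is our reading of the `Route` fields described earlier.

```go
package main

import "fmt"

// route carries only the fields needed for the forwarding decision.
type route struct {
	address string // service node address
	gateway string // gateway used to reach a remote network
	link    string // "local" or "network"
}

// target picks where a request for the given route should be sent:
// straight to the service node for local routes, otherwise to the
// gateway, which is reached over the tunnel.
func target(r route) string {
	if r.link == "local" {
		return r.address
	}
	return r.gateway
}

func main() {
	fmt.Println(target(route{address: "10.0.0.1:8080", link: "local"}))
	fmt.Println(target(route{address: "10.1.0.9:8080", gateway: "198.51.100.7:8085", link: "network"}))
}
```
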
Like everything else, the proxy is something we built standalone that can work between services in one datacenter but also across many when used in conjunction with the tunnel and router.

And finally we arrive at the pièce de résistance: the network interface.

### Network

Network nodes are the magic that ties all the core components together, enabling us to build a truly global service network. It was really important when creating the network interface that it fit in line with our existing assumptions and understanding about Go Micro and distributed systems development. We really wanted to embrace the existing interfaces of the framework and design something with symmetry in regards to a Service.

What we arrived at was something very similar to the [micro.Service](https://github.com/micro/go-micro/blob/master/micro.go#L16) interface itself:

```go
// Network is a micro network
type Network interface {
	// Node is network node
	Node
	// Name of the network
	Name() string
	// Connect starts the resolver and tunnel server
	Connect() error
	// Close stops the tunnel and resolving
	Close() error
	// Client is micro client
	Client() client.Client
	// Server is micro server
	Server() server.Server
}

// Node is a network node
type Node interface {
	// Id is node id
	Id() string
	// Address is node bind address
	Address() string
	// Peers returns node peers
	Peers() []Node
	// Network is the network node is in
	Network() Network
}
```

As you can see, a `Network` has a Name, Client and Server, much like a `Service`, so it provides a similar method of communication. This means we can reuse a lot of the existing code base, but it also goes much further. A `Network` includes the concept of a `Node` directly in the interface, one which has peers which may belong to the same network or others. This means networks are peer-to-peer while Services are largely focused on Client/Server. On a day to day basis developers stay focused on building services, but these, when built to communicate globally, need to operate across networks made up of identical peers.

Our networks have the ability to behave as peers which route for others but also may provide some sort of service themselves. In this case it's mostly routing related information.

So how does it all work together?

Networks have a list of peer nodes to talk to. In the case of the default implementation the peer list comes from the registry with other network nodes with the same name (the name of the network itself). When a node starts it "connects" to the network by establishing its tunnel, resolving the nodes and then connecting to them. Once they’ve connected the nodes peer over two multicast sessions, one for peer announcements and the other for route advertisements. As these propagate the network begins to converge on identical routing information, building a full mesh that allows for routing of services from any node to the other.

The nodes maintain keepalives, periodically advertise the full routing table and flush any events as they occur. Our core network nodes make use of multiple resolvers to find each other, including DNS and the local registry. In the case of peers that join our network, we’ve configured them to use an HTTP resolver which gets geo-steered via Cloudflare anycast DNS and global load balanced to the most local region. From there they pull a list of nodes and connect to the ones with the lowest metric. They then repeat the same song and dance as above to continue the growth of the network and participate in service routing.

Each node maintains its own network graph based on the peer messages it receives. Peer messages contain the graph of each peer up to 3 hops, which enables every node to build a local view of the network. Peers ignore anything with more than a 3 hop radius. This is to avoid potential performance problems.

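One way to picture the 3 hop limit is a breadth-first walk over a peer graph that stops expanding at depth 3. This is only a sketch under assumed names (`within`, `maxHops` and the adjacency map are hypothetical), not how the network package stores its graph:

```go
package main

import (
	"fmt"
	"sort"
)

// maxHops is the radius beyond which peers are ignored.
const maxHops = 3

// within returns the ids of all nodes reachable from start in at most
// maxHops hops, using breadth-first search over an adjacency map.
func within(graph map[string][]string, start string) []string {
	dist := map[string]int{start: 0}
	queue := []string{start}
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		if dist[id] == maxHops {
			continue // do not look past the hop limit
		}
		for _, peer := range graph[id] {
			if _, seen := dist[peer]; !seen {
				dist[peer] = dist[id] + 1
				queue = append(queue, peer)
			}
		}
	}
	var ids []string
	for id := range dist {
		if id != start {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	// A simple chain: a - b - c - d - e.
	graph := map[string][]string{
		"a": {"b"},
		"b": {"a", "c"},
		"c": {"b", "d"},
		"d": {"c", "e"},
		"e": {"d"},
	}
	fmt.Println(within(graph, "a"))
}
```

In the chain above, node `a` keeps `b`, `c` and `d` in its local view and ignores `e`, which is four hops away.
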
We mentioned a little something about peer and route advertisements. So what messages do the network nodes actually exchange? First, the network embeds the router interface through which it advertises its local routes to other network nodes. These routes are then propagated across the whole network, much like the internet. The node itself receives route advertisements from its peers and applies the advertised changes to its own routing table. The message types are "solicit" to ask for routes and "advert" for updates broadcast.

Network nodes send "connect" messages on start and "close" on exit. For their lifetime they are periodically broadcasting "peer" messages so that others can discover them and they all can build the network topology.

When the network is created and converges, services are then capable of sending messages across it. When a service on the network needs to communicate with some other service on the network it sends a request to the network node. The micro network node embeds micro proxy and thus has the ability to forward the request through the network or locally, whichever it deems more fit based on the metrics it retrieves after looking up the routes in the routing table.

This as a whole forms our micro services network.

## Challenges

Building a global services network is not without its challenges. We encountered many from the initial design phase right through to the present day of dealing with broken nodes, bad actors, event storms and more.

### Initial Implementation

The actual task we’d set out to accomplish was pretty monumental and we’d underestimated how much effort it would take even in an MVP phase of the first implementation.

Every time we attempted to go from design diagram to implementing code we found ourselves stuck. In theory everything made sense but no matter how many times we attempted to write code things just didn’t click.

We wrote 3-4 implementations that were essentially thrown away before figuring out the best approach was to make local networking work first and then slowly carve out specific problems to solve. So proxying, followed by routing and then a network interface. Eventually when these pieces were in place we could get to multi-cloud or global networking by implementing a tunnel to handle the heavy lifting.

<center>
<img src="https://m3o.com/images/it-works.jpg" style="width: 80%; height: auto;" />
</center>

Once again, the lesson is to keep it simple, but where the task itself is complex, break it down into steps you can actually keep simple independently and then piece back together in practice.

### Multipoint Tunneling

One of the most complex pieces of code we had to write was the tunnel. It's still not where we’d like it to be but it was pretty important to us to write this from the ground up so we’d have a good understanding of how we wanted to operate globally but also have full control over the foundations of the system.

The complexity in writing network protocols really came to light in this effort, from trying to NOT reimplement TCP, crypto or anything else but also find a path to a working solution. In the end we were able to create a subset of commands which formed enough of a bidirectional and multi-session based tunnel over QUIC. We left most of the heavy lifting to QUIC but we also needed the ability to do multicast.

For us it didn’t make sense to just rely on unicast. Considering the async and pubsub based semantics built into Go Micro, we felt pretty adamant that multicast needed to be part of core network routing. So with that, sessions needed to be reimplemented on top of QUIC.

We’ll spare you the gory details but what’s really clear to us is that writing state machines and reliable connection protocol code is not where we want to spend the majority of our time. We have a fairly resilient working solution but our hope is to replace this with something far better in the future.

### Event Storms

When things work they work and when they break they break badly. For us everything came crashing down when we started to encounter broadcast storms caused by services being recycled. When a service is created the service registry fires a create event and when it’s shutting down it automatically deregisters from the service registry, which fires a delete event. Maybe you can see where this is going. As services cycled in our network they’d generate these events, which led to the routers generating new route events, which were then propagated every 5 seconds to every other node on the network.

This sounds ok if the network converges and the nodes stop propagating events, but in our case the sequence of events is observed and applied at random time intervals on every node. This in essence can lead to a broadcast storm which never stops. Understanding and resolving this is an incredibly difficult task.

For us this really led to research in BGP internet routing in which they’ve defined flap detection algorithms to handle this. We’ve read a few whitepapers to get familiar with the concepts and hacked up a simple flap detection algorithm in the router.

At its core, the flap detection assigns a numerical cost to every route event. When a route event occurs its cost gets incremented. If the same event happens multiple times within a certain period of time and the accumulated cost reaches a predefined threshold the event is immediately suppressed. Suppressed events are not processed by the router, but are kept in a memory cache for a predefined period of time. Meanwhile the cost of the event decays with time whilst at the same time it can keep on growing if the event keeps on flapping. If the cost drops below another threshold the event is unsuppressed and can be processed by the routers. If the event remains suppressed for longer than a predefined time period it’s discarded.

The picture below depicts how the decaying actually works.

<center>
<img src="https://m3o.com/assets/images/flap-detection.png" style="width: 80%; height: auto;" />
</center>

source: http://linuxczar.net/blog/2016/01/31/flap-detection/

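The suppress-and-decay mechanics can be sketched in a few lines. The code below is an illustration of the idea, not the router's implementation: the `flap` type, the exponential half-life decay and every constant are assumptions chosen to keep the arithmetic easy to follow, and discarding long-suppressed events is left out.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// All values below are illustrative assumptions.
const (
	penaltyPerEvent = 1000.0           // cost added each time the event occurs
	suppressAbove   = 2000.0           // suppress once the cost reaches this
	reuseBelow      = 500.0            // unsuppress once the cost decays below this
	halfLife        = 30 * time.Second // the cost halves every half-life
)

// flap tracks the accumulated cost of one route event.
type flap struct {
	penalty    float64
	updated    time.Time
	suppressed bool
}

// decay shrinks the cost exponentially for the time elapsed since the
// last update and lifts suppression once it falls below reuseBelow.
func (f *flap) decay(now time.Time) {
	if !f.updated.IsZero() {
		elapsed := now.Sub(f.updated).Seconds()
		f.penalty *= math.Pow(0.5, elapsed/halfLife.Seconds())
	}
	f.updated = now
	if f.suppressed && f.penalty < reuseBelow {
		f.suppressed = false
	}
}

// observe records one occurrence of the event at time now and reports
// whether the event should be processed (true) or suppressed (false).
func (f *flap) observe(now time.Time) bool {
	f.decay(now)
	f.penalty += penaltyPerEvent
	if f.penalty >= suppressAbove {
		f.suppressed = true
	}
	return !f.suppressed
}

// at returns a fixed reference time plus the given number of seconds,
// so the example is deterministic.
func at(seconds int) time.Time {
	return time.Date(2019, 12, 5, 9, 0, 0, 0, time.UTC).Add(time.Duration(seconds) * time.Second)
}

func main() {
	var f flap
	fmt.Println(f.observe(at(0)))  // first occurrence
	fmt.Println(f.observe(at(0)))  // immediate repeat reaches the threshold
	fmt.Println(f.observe(at(90))) // three half-lives later
}
```

With these numbers the immediate repeat is suppressed, and after three half-lives the cost has decayed from 2000 to 250, below the reuse threshold, so the event is processed again.
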
This had a huge effect on the issues we had been experiencing in the network. The nodes were no longer hammered with crazy event storms and the network stabilised and continued to work without any interruptions. Happy days!

## Architecture

Our overall goal is to build a micro services network that manages not only communication but all aspects of running services, governance, and more. To accomplish this we started by addressing networking from the ground up for Go Micro services. Not just to communicate locally within one private network but to have the ability to do so across many networks.

For this purpose we’ve created a global multi-cloud network that enables communication from anywhere, with anyone. This is fundamental to the architecture of the micro services network.

Our next goal will be to tackle the runtime aspects so that we offer the ability to host services without the need to manage them. This could be imagined as the basis of a serverless microservices platform which we’re looking to launch soon.

The platform is designed to be open. Anyone should be able to run services on the platform or join the global network using their own node. What’s more, you can even pick up the open source code and build your own private networks or join yours to our public one.

<center>
<img src="https://github.com/micro/development/raw/f4c77580acac228c522623c217575fb266d2d4ab/images/arch.jpg" style="width: 80%; height: auto;" />
</center>

What we think is pretty cool and rather unique about the micro network is the network nodes themselves are just regular micro services like any other. Because we built everything using Go Micro they behave just like any other service. In fact what’s even more exciting is that literally *everything is a service* in the micro network.

This holds true for all the individual components that make up the network. If you don’t want to run full network nodes, you can also run individual components of the network separately as standalone micro services such as the tunnel, router and proxy. All the components register themselves with the local registry via which they can be discovered.

## Eventual success

On 29th August 2019 around 4PM we sent the first successful RPC request between our laptops across the internet using the micro network.

<center>
<img src="https://m3o.com/assets/images/success.jpg" style="width: 80%; height: auto;" />
</center>

Since then we have squashed a lot of bugs and deployed the network nodes across the globe.
At the moment we are running the micro network in 4 cloud providers across 4 geographical regions with 3 nodes in each region.

<center>
<img src="https://m3o.com/assets/images/radar.png" style="width: 80%; height: auto;" />
</center>

## Usage

If you're interested in testing out micro and the network just do the following.

```sh
# enable go modules
export GO111MODULE=on

# download micro
go get github.com/tickoalcantara12/micro@master

# connect to the network
micro --peer
```

Now you're connected to the network. Start to explore what's there.

```sh
# List the services in the network
micro network services

# See which nodes you're connected to
micro network connections

# List all the nodes in your network graph
micro network nodes

# See what the metrics look like to different service routes
micro network routes
```

So what does a micro network developer workflow look like? Developers write their Go code using the [Go Micro](https://github.com/micro/go-micro) framework and once they’re ready they can make their services available on the network either directly from their laptop or from anywhere a micro network node runs (network nodes are described in the Network section above).

Here is an example of a simple service written using `go-micro`:

```go
package main

import (
	"context"
	"log"

	hello "github.com/micro/examples/greeter/srv/proto/hello"
	"github.com/micro/go-micro"
)

type Say struct{}

func (s *Say) Hello(ctx context.Context, req *hello.Request, rsp *hello.Response) error {
	log.Print("Received Say.Hello request")
	rsp.Msg = "Hello " + req.Name
	return nil
}

func main() {
	service := micro.NewService(
		micro.Name("helloworld"),
	)

	// optionally setup command line usage
	service.Init()

	// Register Handlers
	hello.RegisterSayHandler(service.Server(), new(Say))

	// Run server
	if err := service.Run(); err != nil {
		log.Fatal(err)
	}
}
```

Once you launch the service it automatically registers with the service registry and becomes instantly accessible to everyone on the network to consume and collaborate on. All of this is completely transparent to developers. No need to deal with low level distributed systems cruft!

We’re already running a greeter service in the network so why not try giving it a call.

```sh
# enable proxying through the network
export MICRO_PROXY=go.micro.network

# call a service
micro call go.micro.srv.greeter Say.Hello '{"name": "John"}'
```

It works!

## Conclusion

Building distributed systems is difficult, but it turns out building the networks they communicate over is an equally, if not more difficult, problem. The classic fallacy, [the network is reliable](https://queue.acm.org/detail.cfm?id=2655736), continues to hold, as we found while building the micro network. However what’s also clear is that our world and most technology thrives through the use of networks. They underpin the very fabric of all that we’ve come to know. Our goal with the micro network is to create a new type of foundation for the open services of the future. Hopefully this post shed some light on the technical accomplishments and challenges of building such a thing.

To learn more check out the [website](https://m3o.com), follow us on [twitter](https://twitter.com/m3ocloud) or
join the [slack](https://m3o.com/slack) community. We are hiring!