// Copyright 2015 Canonical Ltd.
// Licensed under the AGPLv3, see LICENCE file for details.

/*

The dependency package exists to address a general problem with shared resources
and the management of their lifetimes. Many kinds of software handle these issues
with more or less felicity, but it's particularly important that juju (which is
a distributed system that needs to be very fault-tolerant) handle them clearly
and sanely.

Background
----------

A cursory examination of the various workers run in juju agents (as of 2015-04-20)
reveals a distressing range of approaches to the shared resource problem. A
sampling of techniques (and their various problems) follows:

  * enforce sharing in code structure, either directly via scoping or implicitly
    via nested runners (state/api conns; agent config)
      * code structure is inflexible, and it enforces strictly nested resource
        lifetimes, which are not always adequate.
  * just create N of them and hope it works out OK (environs)
      * creating N prevents us from, e.g., using a single connection to an environ
        and sanely rate-limiting ourselves.
  * use filesystem locking across processes (machine execution lock)
      * implementation sometimes flakes out, or is used improperly; and multiple
        agents *are* a problem anyway, but even if we're all in-process we'll need
        some shared machine lock...
  * wrap workers to start up only when some condition is met (post-upgrade
    stability -- itself also a shared resource)
      * lifetime-nesting comments apply here again; *and* it makes it harder to
        follow the code.
  * implement a singleton (lease manager)
      * singletons make it *even harder* to figure out what's going on -- they're
        basically just fancy globals, and have all the associated problems with,
        e.g. deadlocking due to unexpected shutdown order.

...but, of course, they all have their various advantages:

  * Of the approaches, the first is the most reliable by far. Despite the
    inflexibility, there's a clear and comprehensible model in play that has yet
    to cause serious confusion: each worker is created with its resource(s)
    directly available in code scope, and trusts that it will be restarted by an
    independent watchdog if one of its dependencies fails. This characteristic is
    extremely beneficial and must be preserved; we just need it to be more
    generally applicable.

  * The create-N-Environs approach is valuable because it can be simply (if
    inelegantly) integrated with its dependent worker, and a changed Environ
    does not cause the whole dependent to fall over (unless the change is itself
    bad). The former characteristic is a subtle trap (we shouldn't be baking
    dependency-management complexity into the cores of our workers' select loops,
    even if it is "simple" to do so), but the latter is important: in particular,
    firewaller and provisioner are distressingly heavyweight workers and it would
    be unwise to take an approach that led to them being restarted when not
    necessary.

  * The filesystem locking just should not happen -- and we need to integrate the
    unit and machine agents to eliminate it (and for other reasons too) so we
    should give some thought to the fact that we'll be shuffling these dependencies
    around pretty hard in the future. If the approach can make that task easier,
    then great.

  * The singleton is dangerous specifically because its dependency interactions are
    unclear. Absolute clarity of dependencies, as provided by the nesting approaches,
    is in fact critical; but the sheer convenience of the singleton is alluring, and
    reminds us that the approach we take must remain easy to use.

The various nesting approaches give easy access to directly-available resources,
which is great, but will fail as soon as you have a sufficiently sophisticated
dependent that can operate usefully without all its dependencies being satisfied
(we have a couple of requirements for this in the unit agent right now). Still,
direct resource access *is* tremendously convenient, and we need some way to
access one service from another.

However, all of these resources are very different: for a solution that encompasses
them all, you kinda have to represent them as interface{} at some point, and that's
very risky re: clarity.


Problem
-------

The package is intended to implement the following developer stories:

  * As a developer trying to understand the codebase, I want to know what workers
    are running in an agent at any given time.
  * As a developer, I want to be prevented from introducing dependency cycles
    into my application.
  * As a developer, I want to provide a service provided by some worker to one or
    more client workers.
  * As a developer, I want to write a service that consumes one or more other
    workers' services.
  * As a developer, I want to choose how I respond to missing dependencies.
  * As a developer, I want to be able to inject test doubles for my dependencies.
  * As a developer, I want control over how my service is exposed to others.
  * As a developer, I don't want to have to typecast my dependencies from
    interface{} myself.
  * As a developer, I want my service to be restarted if its dependencies change.

That last one might bear a little bit of explanation, but I contend that it's the
only reliable approach to writing resilient services that compose sanely into a
comprehensible system. Consider:

  * Juju agents' lifetimes must be assumed to exceed the MTBR of the systems
    they're deployed on; you might naively think that hard reboots are "rare"...
    but they're not. They really are just a feature of the terrain we have to
    traverse. Therefore every worker *always* has to be capable of picking itself
    back up from scratch and continuing sanely. That is, we're not imposing a new
    expectation: we're just working within the existing constraints.
  * While some workers are simple, some are decidedly not; when a worker has any
    more complexity than "none" it is a Bad Idea to mix dependency-management
    concerns into its core logic: it creates the sort of morass in which subtle
    bugs thrive.

So, we take advantage of the expected bounce-resilience, and excise all dependency
management concerns from the existing workers... in favour of a system that bounces
workers slightly more often than before, and thus exercises those code paths more;
so, when there are bugs, we're more likely to shake them out in automated testing
before they hit users.

We'd maybe also like to implement this story:

  * As a developer, I want to add and remove groups of workers atomically, e.g.
    when starting the set of controller workers for a hosted environ; or when
    starting the set of workers used by a single unit. [NOT DONE]

...but there's no urgent use case yet, and it's not certain to be superior to an
engine-nesting approach.


Solution
--------

Run a single dependency.Engine at the top level of each agent; express every
shared resource, and every worker that uses one, as a dependency.Manifold; and
install them all into the top-level engine.

When installed under some name, a dependency.Manifold represents the features of
a node in the engine's dependency graph. It lists:

  * The names of its dependencies (Inputs).
  * How to create the worker representing the resource (Start).
  * How (if at all) to expose the resource as a service to other resources that
    know it by name (Output).
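
As a rough illustration of what declaring Inputs by name makes possible, here is
a minimal, self-contained sketch of cycle detection over named nodes. It is not
the engine's code: the `manifold` type and `checkAcyclic` func are hypothetical
stand-ins that model only the Inputs field, and the node names are made up.

```go
package main

import "fmt"

// manifold is a toy stand-in for dependency.Manifold; it records only the
// names of the nodes it depends on. (Hypothetical, for illustration only.)
type manifold struct {
	inputs []string
}

// checkAcyclic returns an error if any manifold names an input that is not
// installed, or if the declared inputs form a dependency cycle.
func checkAcyclic(manifolds map[string]manifold) error {
	const (
		unvisited = iota
		visiting
		done
	)
	state := make(map[string]int, len(manifolds))

	var visit func(name string) error
	visit = func(name string) error {
		m, ok := manifolds[name]
		if !ok {
			return fmt.Errorf("unknown input %q", name)
		}
		switch state[name] {
		case done:
			return nil
		case visiting:
			// We reached a node that is still on the current path.
			return fmt.Errorf("dependency cycle involving %q", name)
		}
		state[name] = visiting
		for _, input := range m.inputs {
			if err := visit(input); err != nil {
				return err
			}
		}
		state[name] = done
		return nil
	}

	for name := range manifolds {
		if err := visit(name); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ok := map[string]manifold{
		"agent":      {},
		"api-caller": {inputs: []string{"agent"}},
		"uniter":     {inputs: []string{"agent", "api-caller"}},
	}
	fmt.Println(checkAcyclic(ok))

	bad := map[string]manifold{
		"a": {inputs: []string{"b"}},
		"b": {inputs: []string{"a"}},
	}
	fmt.Println(checkAcyclic(bad) != nil)
}
```

Because every dependency is named up front, a check like this can reject a bad
graph before any worker starts, and no worker needs to know how its inputs are
wired together.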

Together, these give the developers of each independent service a common
mechanism for declaring and accessing their dependencies, and the ability to
assume that they will be restarted whenever there is a material change to their
accessible dependencies.

When the weight of manifolds in a single engine becomes inconvenient, group them
and run them inside nested dependency.Engines; the Report() method on the
top-level engine will collect information from (directly-) contained engines, so
at least there's still some observability; but there may also be call to pass
actual dependencies down from one engine to another, and that'll demand careful
thought.


Usage
-----

In each worker package, write a `manifold.go` containing the following:

    // ManifoldConfig holds the information necessary to configure the worker
    // controlled by a Manifold.
    type ManifoldConfig struct {

        // The names of the various dependencies, e.g.
        APICallerName   string
        MachineLockName string

        // Any other required top-level configuration, e.g.
        Period time.Duration
    }

    // Manifold returns a manifold that controls the operation of a worker
    // responsible for <things>, configured as supplied.
    func Manifold(config ManifoldConfig) dependency.Manifold {
        // Your code here...
        return dependency.Manifold{

            // * certainly include each of your configured dependency names,
            //   getResource will only expose them if you declare them here.
            Inputs: []string{config.APICallerName, config.MachineLockName},

            // * certainly include a start func, it will panic if you don't.
            Start: func(getResource dependency.GetResourceFunc) (worker.Worker, error) {
                // You presumably want to get your dependencies, and you almost
                // certainly want to be closed over `config`...
                var apicaller base.APICaller
                if err := getResource(config.APICallerName, &apicaller); err != nil {
                    return nil, err
                }
                var machineLock *fslock.Lock
                if err := getResource(config.MachineLockName, &machineLock); err != nil {
                    return nil, err
                }
                return newSomethingWorker(apicaller, machineLock, config.Period)
            },

            // * output func is not obligatory, and should be skipped if you
            //   don't know what you'll be exposing or to whom.
            // * see `worker/machinelock`, `worker/gate`, `worker/util`, and
            //   `worker/dependency/testing` for examples of output funcs.
            // * if you do supply an output func, be sure to document it on the
            //   Manifold func; for example:
            //
            //       // Manifold exposes Foo and Bar resources, which can be
            //       // accessed by passing a *Foo or a *Bar in the output
            //       // parameter of its dependencies' getResource calls.
            Output: nil,
        }
    }

...and take care to construct your manifolds *only* via that function; *all*
your dependencies *must* be declared in your ManifoldConfig, and *must* be
accessed via those names. Don't hardcode anything, please.

If you find yourself using the same manifold configuration in several places,
consider adding helpers to cmd/jujud/agent/util, which includes mechanisms for simple
definition of manifolds that depend on an API caller; on an agent; or on both.


Testing
-------

The `worker/dependency/testing` package, commonly imported as "dt", exposes a
`StubResource` that is helpful for testing `Start` funcs in decent isolation,
with mocked dependencies. Tests for `Inputs` and `Output` are generally pretty
specific to their precise context and don't seem to benefit much from
generalisation.


Special considerations
----------------------

Your *dependency* graph must be acyclic; this does not imply that
the *information flow* must be acyclic.
Indeed, it is common for separate
components to need to synchronise their actions; but the implementation of
Engine makes it inconvenient for either one to depend on the other (and
impossible for both to do so).

When a set of manifolds needs to encode a set of services whose information flow
is not acyclic, apparent A->B->A cycles can be broken by introducing a new
shared dependency C to mediate the information flow. That is, A and B can then
separately depend upon C; and C itself can start a degenerate worker that never
errors of its own accord.

For examples of this technique, search for usage of `cmd/jujud/agent/util.NewValueWorker`
(which is generally used inside other manifolds to pass snippets of agent config
down to workers that don't have a good reason to see, or write, the full agent
config); and `worker/gate.Manifold`, which is for one-way coordination between
workers which should not be started until some other worker has completed some
task.

Please be careful when coordinating workers like this; the gate manifold in
particular is effectively just another lock, and it'd be trivial to construct
a set of gate-users that can deadlock one another. All the usual considerations
when working with locks still apply.


Concerns and mitigations thereof
--------------------------------

The dependency package will *not* provide the following features:

  * Deterministic worker startup. As above, this is a blessing in disguise: if
    your workers have a problem with this, they're using magical undeclared
    dependencies and we get to see the inevitable bugs sooner.
    TODO(fwereade): we should add fuzz to the bounce and restart durations to
    more vigorously shake out the bugs...
  * Hand-holding for developers writing Output funcs; the onus is on you to
    document what you expose; produce useful error messages when supplied
    with unexpected types via the interface{} param; and NOT to panic. The onus
    on your clients is only to read your docs and handle the errors you might
    emit.

*/
package dependency