github.com/makyo/juju@v0.0.0-20160425123129-2608902037e9/worker/dependency/doc.go

     1  // Copyright 2015 Canonical Ltd.
     2  // Licensed under the AGPLv3, see LICENCE file for details.
     3  
     4  /*
     5  
     6  The dependency package exists to address a general problem with shared resources
     7  and the management of their lifetimes. Many kinds of software handle these issues
     8  with more or less felicity, but it's particularly important that juju (which is
     9  a distributed system that needs to be very fault-tolerant) handle them clearly
    10  and sanely.
    11  
    12  Background
    13  ----------
    14  
    15  A cursory examination of the various workers run in juju agents (as of 2015-04-20)
    16  reveals a distressing range of approaches to the shared resource problem. A
    17  sampling of techniques (and their various problems) follows:
    18  
    19    * enforce sharing in code structure, either directly via scoping or implicitly
    20      via nested runners (state/api conns; agent config)
    21        * code structure is inflexible, and it enforces strictly nested resource
    22          lifetimes, which are not always adequate.
    23    * just create N of them and hope it works out OK (environs)
    24        * creating N prevents us from, e.g., using a single connection to an environ
    25          and sanely rate-limiting ourselves.
    26    * use filesystem locking across processes (machine execution lock)
    27        * implementation sometimes flakes out, or is used improperly; and multiple
    28          agents *are* a problem anyway, but even if we're all in-process we'll need
    29          some shared machine lock...
    30    * wrap workers to start up only when some condition is met (post-upgrade
    31      stability -- itself also a shared resource)
    32        * lifetime-nesting comments apply here again; *and* it makes it harder to
    33          follow the code.
    34    * implement a singleton (lease manager)
    35        * singletons make it *even harder* to figure out what's going on -- they're
    36          basically just fancy globals, and have all the associated problems with,
    37          e.g. deadlocking due to unexpected shutdown order.
    38  
    39  ...but, of course, they all have their various advantages:
    40  
    41    * Of the approaches, the first is the most reliable by far. Despite the
    42      inflexibility, there's a clear and comprehensible model in play that has yet
    43      to cause serious confusion: each worker is created with its resource(s)
    44      directly available in code scope, and trusts that it will be restarted by an
    45      independent watchdog if one of its dependencies fails. This characteristic is
    46      extremely beneficial and must be preserved; we just need it to be more
    47      generally applicable.
    48  
    49    * The create-N-Environs approach is valuable because it can be simply (if
    50      inelegantly) integrated with its dependent worker, and a changed Environ
    51      does not cause the whole dependent to fall over (unless the change is itself
    52      bad). The former characteristic is a subtle trap (we shouldn't be baking
    53      dependency-management complexity into the cores of our workers' select loops,
    54      even if it is "simple" to do so), but the latter is important: in particular,
    55      firewaller and provisioner are distressingly heavyweight workers and it would
    56      be unwise to take an approach that led to them being restarted when not
    57      necessary.
    58  
    59    * The filesystem locking just should not happen -- and we need to integrate the
    60      unit and machine agents to eliminate it (and for other reasons too) so we
    61      should give some thought to the fact that we'll be shuffling these dependencies
    62      around pretty hard in the future. If the approach can make that task easier,
    63      then great.
    64  
    65    * The singleton is dangerous specifically because its dependency interactions are
    66      unclear. Absolute clarity of dependencies, as provided by the nesting approaches,
    67      is in fact critical; but the sheer convenience of the singleton is alluring, and
    68      reminds us that the approach we take must remain easy to use.
    69  
    70  The various nesting approaches give easy access to directly-available resources,
    71  which is great, but will fail as soon as you have a sufficiently sophisticated
    72  dependent that can operate usefully without all its dependencies being satisfied
    73  (we have a couple of requirements for this in the unit agent right now). Still,
    74  direct resource access *is* tremendously convenient, and we need some way to
    75  access one service from another.
    76  
    77  However, all of these resources are very different: for a solution that encompasses
    78  them all, you kinda have to represent them as interface{} at some point, and that's
    79  very risky re: clarity.
    80  
    81  
    82  Problem
    83  -------
    84  
    85  The package is intended to implement the following developer stories:
    86  
    87    * As a developer trying to understand the codebase, I want to know what workers
    88      are running in an agent at any given time.
    89    * As a developer, I want to be prevented from introducing dependency cycles
    90      into my application.
    91    * As a developer, I want to provide a service provided by some worker to one or
    92      more client workers.
    93    * As a developer, I want to write a service that consumes one or more other
    94      workers' services.
    95    * As a developer, I want to choose how I respond to missing dependencies.
    96    * As a developer, I want to be able to inject test doubles for my dependencies.
    97    * As a developer, I want control over how my service is exposed to others.
    98    * As a developer, I don't want to have to typecast my dependencies from
    99      interface{} myself.
   100    * As a developer, I want my service to be restarted if its dependencies change.
   101  
   102  That last one might bear a little bit of explanation, but I contend that it's the
   103  only reliable approach to writing resilient services that compose sanely into a
   104  comprehensible system. Consider:
   105  
   106    * Juju agents' lifetimes must be assumed to exceed the MTBR (mean time between reboots) of the systems
   107      they're deployed on; you might naively think that hard reboots are "rare"...
   108      but they're not. They really are just a feature of the terrain we have to
   109      traverse. Therefore every worker *always* has to be capable of picking itself
   110      back up from scratch and continuing sanely. That is, we're not imposing a new
   111      expectation: we're just working within the existing constraints.
   112    * While some workers are simple, some are decidedly not; when a worker has any
   113      more complexity than "none" it is a Bad Idea to mix dependency-management
   114      concerns into their core logic: it creates the sort of morass in which subtle
   115      bugs thrive.
   116  
   117  So, we take advantage of the expected bounce-resilience, and excise all dependency
   118  management concerns from the workers themselves... in favour of a system that bounces
   119  workers slightly more often than before, and thus exercises those code paths more;
   120  so, when there are bugs, we're more likely to shake them out in automated testing
   121  before they hit users.
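
As a rough illustration of what "bouncing" means here, the following sketch
restarts a failing worker from scratch until it succeeds or a restart budget is
used up. RunWithBounces and its parameters are hypothetical names invented for
this example; the real engine's restart behaviour is more involved.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RunWithBounces is a hypothetical helper, not part of the dependency package.
// It calls run until run returns nil, restarting it at most maxBounces times
// and sleeping for delay between attempts. It returns the number of restarts
// that were needed, and the last error if the budget was exhausted.
func RunWithBounces(run func() error, maxBounces int, delay time.Duration) (int, error) {
	for bounces := 0; ; bounces++ {
		err := run()
		if err == nil {
			return bounces, nil
		}
		if bounces == maxBounces {
			return bounces, err
		}
		time.Sleep(delay)
	}
}

// flakyWorker returns a run func that fails the given number of times and
// then succeeds, standing in for a worker whose dependency is not ready yet.
func flakyWorker(failures int) func() error {
	attempts := 0
	return func() error {
		attempts++
		if attempts <= failures {
			return errors.New("dependency not available")
		}
		return nil
	}
}

func main() {
	bounces, err := RunWithBounces(flakyWorker(2), 5, time.Millisecond)
	fmt.Println(bounces, err)
}
```

The worker itself contains no retry logic at all; it simply fails, and relies
on being started again later.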
   122  
   123  We'd maybe also like to implement this story:
   124  
   125    * As a developer, I want to add and remove groups of workers atomically, e.g.
   126      when starting the set of controller workers for a hosted environ; or when
   127      starting the set of workers used by a single unit. [NOT DONE]
   128  
   129  ...but there's no urgent use case yet, and it's not certain to be superior to an
   130  engine-nesting approach.
   131  
   132  
   133  Solution
   134  --------
   135  
   136  Run a single dependency.Engine at the top level of each agent; express every
   137  shared resource, and every worker that uses one, as a dependency.Manifold; and
   138  install them all into the top-level engine.
   139  
   140  When installed under some name, a dependency.Manifold represents the features of
   141  a node in the engine's dependency graph. It lists:
   142  
   143    * The names of its dependencies (Inputs).
   144    * How to create the worker representing the resource (Start).
   145    * How (if at all) to expose the resource as a service to other resources that
   146      know it by name (Output).
   147  
   148  ...and allows the developers of each independent service a common mechanism for
   149  declaring and accessing their dependencies, and the ability to assume that they
   150  will be restarted whenever there is a material change to their accessible
   151  dependencies.
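
To make the shape of this concrete, here is a deliberately simplified,
single-threaded sketch of the idea. Manifold, GetFunc and StartAll below are
hypothetical stand-ins written for this example only: they are not the real
dependency types, they omit Output funcs, and they ignore concurrency and
restarts entirely.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMissing is returned when a requested dependency is not running, or was
// not declared in the requesting manifold's Inputs.
var ErrMissing = errors.New("dependency not available")

// GetFunc looks up a running resource by name (hypothetical, simplified).
type GetFunc func(name string) (interface{}, error)

// Manifold is a toy node in the dependency graph: it names its inputs and
// knows how to start itself given access to them.
type Manifold struct {
	Inputs []string
	Start  func(get GetFunc) (interface{}, error)
}

// StartAll keeps trying to start every manifold that is not yet running,
// until a full pass makes no progress. It returns the started resources.
func StartAll(manifolds map[string]Manifold) map[string]interface{} {
	running := map[string]interface{}{}
	for progress := true; progress; {
		progress = false
		for name, m := range manifolds {
			if _, ok := running[name]; ok {
				continue
			}
			declared := map[string]bool{}
			for _, in := range m.Inputs {
				declared[in] = true
			}
			get := func(dep string) (interface{}, error) {
				res, ok := running[dep]
				if !declared[dep] || !ok {
					return nil, ErrMissing
				}
				return res, nil
			}
			res, err := m.Start(get)
			if err != nil {
				continue // try again on a later pass
			}
			running[name] = res
			progress = true
		}
	}
	return running
}

// exampleManifolds returns three toy manifolds; "sneaky" uses a dependency
// it never declared, so it can never start.
func exampleManifolds() map[string]Manifold {
	return map[string]Manifold{
		"api-caller": {
			Start: func(get GetFunc) (interface{}, error) { return "conn", nil },
		},
		"uniter": {
			Inputs: []string{"api-caller"},
			Start: func(get GetFunc) (interface{}, error) {
				conn, err := get("api-caller")
				if err != nil {
					return nil, err
				}
				return fmt.Sprintf("uniter using %v", conn), nil
			},
		},
		"sneaky": {
			Start: func(get GetFunc) (interface{}, error) {
				if _, err := get("api-caller"); err != nil {
					return nil, err
				}
				return "sneaky", nil
			},
		},
	}
}

func main() {
	running := StartAll(exampleManifolds())
	fmt.Println(len(running), running["uniter"])
}
```

Note that nothing in the sketch depends on the order in which manifolds are
tried: a manifold whose inputs are not yet available just fails to start and
is retried later.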
   152  
   153  When the weight of manifolds in a single engine becomes inconvenient, group them
   154  and run them inside nested dependency.Engines; the Report() method on the top-
   155  level engine will collect information from (directly-) contained engines, so at
   156  least there's still some observability; but there may also be call to pass
   157  actual dependencies down from one engine to another, and that'll demand careful
   158  thought.
   159  
   160  
   161  Usage
   162  -----
   163  
   164  In each worker package, write a `manifold.go` containing the following:
   165  
   166      // ManifoldConfig holds the information necessary to configure the worker
   167      // controlled by a Manifold.
   168      type ManifoldConfig struct {
   169  
   170          // The names of the various dependencies, e.g.
   171          APICallerName   string
   172          MachineLockName string
   173  
   174          // Any other required top-level configuration, e.g.
   175          Period time.Duration
   176      }
   177  
   178      // Manifold returns a manifold that controls the operation of a worker
   179      // responsible for <things>, configured as supplied.
   180      func Manifold(config ManifoldConfig) dependency.Manifold {
   181          // Your code here...
   182          return dependency.Manifold{
   183  
   184              // * certainly include each of your configured dependency names;
   185              //   getResource will only expose them if you declare them here.
   186              Inputs: []string{config.APICallerName, config.MachineLockName},
   187  
   188              // * certainly include a start func, it will panic if you don't.
   189              Start: func(getResource dependency.GetResourceFunc) (worker.Worker, error) {
   190                  // You presumably want to get your dependencies, and you almost
   191                  // certainly want to be closed over `config`...
   192                  var apicaller base.APICaller
   193                  if err := getResource(config.APICallerName, &apicaller); err != nil {
   194                      return nil, err
   195                  }
   196                  var machineLock *fslock.Lock
   197                  if err := getResource(config.MachineLockName, &machineLock); err != nil {
   198                      return nil, err
   199                  }
   200                  return newSomethingWorker(apicaller, machineLock, config.Period)
   201              },
   202  
   203              // * output func is not obligatory, and should be skipped if you
   204              //   don't know what you'll be exposing or to whom.
   205              // * see `worker/machinelock`, `worker/gate`, `worker/util`, and
   206              //   `worker/dependency/testing` for examples of output funcs.
   207              // * if you do supply an output func, be sure to document it on the
   208              //   Manifold func; for example:
   209              //
   210              //       // Manifold exposes Foo and Bar resources, which can be
   211              //       // accessed by passing a *Foo or a *Bar in the output
   212              //       // parameter of its dependencies' getResource calls.
   213              Output: nil,
   214          }
   215      }
   216  
   217  ...and take care to construct your manifolds *only* via that function; *all*
   218  your dependencies *must* be declared in your ManifoldConfig, and *must* be
   219  accessed via those names. Don't hardcode anything, please.
   220  
   221  If you find yourself using the same manifold configuration in several places,
   222  consider adding helpers to cmd/jujud/agent/util, which includes mechanisms for simple
   223  definition of manifolds that depend on an API caller; on an agent; or on both.
   224  
   225  
   226  Testing
   227  -------
   228  
   229  The `worker/dependency/testing` package, commonly imported as "dt", exposes a
   230  `StubResource` that is helpful for testing `Start` funcs in decent isolation,
   231  with mocked dependencies. Tests for `Inputs` and `Output` are generally pretty
   232  specific to their precise context and don't seem to benefit much from
   233  generalisation.
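
The general idea of testing a start func against stubbed dependencies can be
sketched as follows. This is not the `dt` package's actual API: the
GetResourceFunc type, StubGetResource helper and start func below are
simplified, hypothetical versions written so that the example is
self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// ErrMissing signals that a named dependency is not available.
var ErrMissing = errors.New("dependency not available")

// GetResourceFunc mirrors the shape used in the Usage section: the caller
// passes a name and a pointer to be filled in.
type GetResourceFunc func(name string, out interface{}) error

// StubGetResource is a hypothetical test helper. It returns a
// GetResourceFunc that serves the supplied values by name, copying each
// value into the caller's out pointer when the types are compatible.
func StubGetResource(resources map[string]interface{}) GetResourceFunc {
	return func(name string, out interface{}) error {
		res, ok := resources[name]
		if !ok {
			return ErrMissing
		}
		if out == nil {
			return nil // caller only wanted to know the resource exists
		}
		outV := reflect.ValueOf(out)
		if outV.Kind() != reflect.Ptr || outV.IsNil() {
			return fmt.Errorf("out must be a non-nil pointer; got %T", out)
		}
		resV := reflect.ValueOf(res)
		if !resV.IsValid() || !resV.Type().AssignableTo(outV.Elem().Type()) {
			return fmt.Errorf("cannot set %T into %T", res, out)
		}
		outV.Elem().Set(resV)
		return nil
	}
}

// start is a toy start func under test; it needs one string dependency.
func start(getResource GetResourceFunc) (string, error) {
	var endpoint string
	if err := getResource("api-caller", &endpoint); err != nil {
		return "", err
	}
	return "worker talking to " + endpoint, nil
}

func main() {
	stub := StubGetResource(map[string]interface{}{"api-caller": "10.0.0.1:17070"})
	fmt.Println(start(stub))
	fmt.Println(start(StubGetResource(nil)))
}
```

Because the start func only ever sees its dependencies through the supplied
func, both the happy path and the missing-dependency path can be exercised
without running an engine.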
   234  
   235  
   236  Special considerations
   237  ----------------------
   238  
   239  Your *dependency* graph must be acyclic; this does not imply that
   240  the *information flow* must be acyclic. Indeed, it is common for separate
   241  components to need to synchronise their actions; but the implementation of
   242  Engine makes it inconvenient for either one to depend on the other (and
   243  impossible for both to do so).
   244  
   245  When a set of manifolds need to encode a set of services whose information flow
   246  is not acyclic, apparent A->B->A cycles can be broken by introducing a new
   247  shared dependency C to mediate the information flow. That is, A and B can then
   248  separately depend upon C; and C itself can start a degenerate worker that never
   249  errors of its own accord.
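
The effect of introducing C can be checked with a small cycle detector. The
sketch below is a generic depth-first search over a map of names to input
names; HasCycle is a name invented for this example and is not how the engine
itself necessarily validates manifolds.

```go
package main

import "fmt"

// HasCycle reports whether the graph described by inputs (manifold name to
// the names it depends on) contains a dependency cycle. It uses a
// depth-first search that tracks which nodes are currently being visited.
func HasCycle(inputs map[string][]string) bool {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := map[string]int{}
	var visit func(name string) bool
	visit = func(name string) bool {
		switch state[name] {
		case inProgress:
			return true // reached a node that is still on the current path
		case done:
			return false
		}
		state[name] = inProgress
		for _, dep := range inputs[name] {
			if visit(dep) {
				return true
			}
		}
		state[name] = done
		return false
	}
	for name := range inputs {
		if visit(name) {
			return true
		}
	}
	return false
}

func main() {
	// A and B depend on each other directly: rejected.
	direct := map[string][]string{"A": {"B"}, "B": {"A"}}
	// A and B both depend on a mediating C instead: accepted.
	mediated := map[string][]string{"A": {"C"}, "B": {"C"}, "C": nil}
	fmt.Println(HasCycle(direct), HasCycle(mediated)) // prints "true false"
}
```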
   250  
   251  For examples of this technique, search for usage of `cmd/jujud/agent/util.NewValueWorker`
   252  (which is generally used inside other manifolds to pass snippets of agent config
   253  down to workers that don't have a good reason to see, or write, the full agent
   254  config); and `worker/gate.Manifold`, which is for one-way coordination between
   255  workers which should not be started until some other worker has completed some
   256  task.
   257  
   258  Please be careful when coordinating workers like this; the gate manifold in
   259  particular is effectively just another lock, and it'd be trivial to construct
   260  a set of gate-users that can deadlock one another. All the usual considerations
   261  when working with locks still apply.
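
For readers unfamiliar with the pattern, a one-way gate can be sketched in a
few lines. The Gate type below is a hypothetical illustration written for this
document, not the `worker/gate` implementation: one side unlocks it exactly
once, and any number of waiters can block on it or poll it.

```go
package main

import (
	"fmt"
	"sync"
)

// Gate is a hypothetical one-shot latch. Once unlocked it stays unlocked.
type Gate struct {
	once sync.Once
	ch   chan struct{}
}

// NewGate returns a locked Gate.
func NewGate() *Gate {
	return &Gate{ch: make(chan struct{})}
}

// Unlock opens the gate. It is safe to call more than once.
func (g *Gate) Unlock() {
	g.once.Do(func() { close(g.ch) })
}

// Unlocked returns a channel that is closed when the gate is opened.
func (g *Gate) Unlocked() <-chan struct{} {
	return g.ch
}

// IsUnlocked reports whether the gate has been opened, without blocking.
func (g *Gate) IsUnlocked() bool {
	select {
	case <-g.ch:
		return true
	default:
		return false
	}
}

func main() {
	gate := NewGate()
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-gate.Unlocked() // blocks until some other worker finishes its task
		fmt.Println("dependent worker may now proceed")
	}()
	gate.Unlock()
	wg.Wait()
}
```

If two workers each wait on a gate that only the other will unlock, neither
ever proceeds; that is exactly the deadlock the paragraph above warns about.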
   262  
   263  
   264  Concerns and mitigations thereof
   265  --------------------------------
   266  
   267  The dependency package will *not* provide the following features:
   268  
   269    * Deterministic worker startup. As above, this is a blessing in disguise: if
   270      your workers have a problem with this, they're using magical undeclared
   271      dependencies and we get to see the inevitable bugs sooner.
   272      TODO(fwereade): we should add fuzz to the bounce and restart durations to
   273      more vigorously shake out the bugs...
   274    * Hand-holding for developers writing Output funcs; the onus is on you to
   275      document what you expose; produce useful error messages when they are supplied
   276      with unexpected types via the interface{} param; and NOT to panic. The onus
   277      on your clients is only to read your docs and handle the errors you might
   278      emit.
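
As an illustration of that last point, an output func that reports unexpected
types as errors rather than panicking might look like the sketch below. Foo,
fooWorker and fooOutput are hypothetical names, and the input is typed as
interface{} here only to keep the example free of juju imports.

```go
package main

import "fmt"

// Foo is a hypothetical resource exposed to other manifolds.
type Foo struct {
	Name string
}

// fooWorker is a hypothetical worker that owns a Foo.
type fooWorker struct {
	foo Foo
}

// fooOutput exposes the worker's Foo. Clients are expected to pass a *Foo
// as out; anything else produces a descriptive error instead of a panic.
func fooOutput(in interface{}, out interface{}) error {
	w, ok := in.(*fooWorker)
	if !ok {
		return fmt.Errorf("expected *fooWorker input; got %T", in)
	}
	switch target := out.(type) {
	case *Foo:
		if target == nil {
			return fmt.Errorf("expected non-nil *Foo output")
		}
		*target = w.foo
		return nil
	default:
		return fmt.Errorf("expected *Foo output; got %T", out)
	}
}

func main() {
	w := &fooWorker{foo: Foo{Name: "example"}}

	var foo Foo
	fmt.Println(fooOutput(w, &foo), foo.Name)

	var wrong int
	fmt.Println(fooOutput(w, &wrong))
}
```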
   279  
   280  */
   281  package dependency