
// Copyright 2015 Canonical Ltd.
// Licensed under the AGPLv3, see LICENCE file for details.

/*

The dependency package exists to address a general problem with shared resources
and the management of their lifetimes. Many kinds of software handle these issues
with more or less felicity, but it's particularly important that juju (which is
a distributed system that needs to be very fault-tolerant) handle them clearly
and sanely.

Background
----------

A cursory examination of the various workers run in juju agents (as of 2015-04-20)
reveals a distressing range of approaches to the shared resource problem. A
sampling of techniques (and their various problems) follows:

  * enforce sharing in code structure, either directly via scoping or implicitly
    via nested runners (state/api conns; agent config)
      * code structure is inflexible, and it enforces strictly nested resource
        lifetimes, which are not always adequate.
  * just create N of them and hope it works out OK (environs)
      * creating N prevents us from, e.g., using a single connection to an environ
        and sanely rate-limiting ourselves.
  * use filesystem locking across processes (machine execution lock)
      * implementation sometimes flakes out, or is used improperly; and multiple
        agents *are* a problem anyway, but even if we're all in-process we'll need
        some shared machine lock...
  * wrap workers to start up only when some condition is met (post-upgrade
    stability -- itself also a shared resource)
      * lifetime-nesting comments apply here again; *and* it makes it harder to
        follow the code.
  * implement a singleton (lease manager)
      * singletons make it *even harder* to figure out what's going on -- they're
        basically just fancy globals, and have all the associated problems with,
        e.g. deadlocking due to unexpected shutdown order.

...but, of course, they all have their various advantages:

  * Of the approaches, the first is the most reliable by far. Despite the
    inflexibility, there's a clear and comprehensible model in play that has yet
    to cause serious confusion: each worker is created with its resource(s)
    directly available in code scope, and trusts that it will be restarted by an
    independent watchdog if one of its dependencies fails. This characteristic is
    extremely beneficial and must be preserved; we just need it to be more
    generally applicable.

  * The create-N-Environs approach is valuable because it can be simply (if
    inelegantly) integrated with its dependent worker, and a changed Environ
    does not cause the whole dependent to fall over (unless the change is itself
    bad). The former characteristic is a subtle trap (we shouldn't be baking
    dependency-management complexity into the cores of our workers' select loops,
    even if it is "simple" to do so), but the latter is important: in particular,
    firewaller and provisioner are distressingly heavyweight workers and it would
    be unwise to take an approach that led to them being restarted when not
    necessary.

  * The filesystem locking just should not happen -- and we need to integrate the
    unit and machine agents to eliminate it (and for other reasons too) so we
    should give some thought to the fact that we'll be shuffling these dependencies
    around pretty hard in the future. If the approach can make that task easier,
    then great.

  * The singleton is dangerous specifically because its dependency interactions are
    unclear. Absolute clarity of dependencies, as provided by the nesting approaches,
    is in fact critical; but the sheer convenience of the singleton is alluring, and
    reminds us that the approach we take must remain easy to use.

The various nesting approaches give easy access to directly-available resources,
which is great, but will fail as soon as you have a sufficiently sophisticated
dependent that can operate usefully without all its dependencies being satisfied
(we have a couple of requirements for this in the unit agent right now). Still,
direct resource access *is* tremendously convenient, and we need some way to
access one service from another.

However, all of these resources are very different: for a solution that encompasses
them all, you kinda have to represent them as interface{} at some point, and that's
very risky re: clarity.


Problem
-------

The package is intended to implement the following developer stories:

  * As a developer trying to understand the codebase, I want to know what workers
    are running in an agent at any given time.
  * As a developer, I want to be prevented from introducing dependency cycles
    into my application.
  * As a developer, I want to provide a service provided by some worker to one or
    more client workers.
  * As a developer, I want to write a service that consumes one or more other
    workers' services.
  * As a developer, I want to choose how I respond to missing dependencies.
  * As a developer, I want to be able to inject test doubles for my dependencies.
  * As a developer, I want control over how my service is exposed to others.
  * As a developer, I don't want to have to typecast my dependencies from
    interface{} myself.
  * As a developer, I want my service to be restarted if its dependencies change.

That last one might bear a little explanation, but I contend that it's the
only reliable approach to writing resilient services that compose sanely into a
comprehensible system. Consider:

  * Juju agents' lifetimes must be assumed to exceed the MTBR of the systems
    they're deployed on; you might naively think that hard reboots are "rare"...
    but they're not. They really are just a feature of the terrain we have to
    traverse. Therefore every worker *always* has to be capable of picking itself
    back up from scratch and continuing sanely. That is, we're not imposing a new
    expectation: we're just working within the existing constraints.
  * While some workers are simple, some are decidedly not; when a worker has any
    more complexity than "none" it is a Bad Idea to mix dependency-management
    concerns into its core logic: it creates the sort of morass in which subtle
    bugs thrive.

So, we take advantage of the expected bounce-resilience, and excise all dependency
management concerns from the existing workers... in favour of a system that bounces
workers slightly more often than before, and thus exercises those code paths more;
so, when there are bugs, we're more likely to shake them out in automated testing
before they hit users.

We'd maybe also like to implement this story:

  * As a developer, I want to add and remove groups of workers atomically, e.g.
    when starting the set of controller workers for a hosted environ; or when
    starting the set of workers used by a single unit. [NOT DONE]

...but there's no urgent use case yet, and it's not certain to be superior to an
engine-nesting approach.


Solution
--------

Run a single dependency.Engine at the top level of each agent; express every
shared resource, and every worker that uses one, as a dependency.Manifold; and
install them all into the top-level engine.

When installed under some name, a dependency.Manifold represents the features of
a node in the engine's dependency graph. It lists:

  * The names of its dependencies (Inputs).
  * How to create the worker representing the resource (Start).
  * How (if at all) to expose the resource as a service to other resources that
    know it by name (Output).

...and allows the developers of each independent service a common mechanism for
declaring and accessing their dependencies, and the ability to assume that they
will be restarted whenever there is a material change to their accessible
dependencies.
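
As a rough illustration of how declared Inputs can limit what a Start func is
able to see, here is a minimal, self-contained sketch. The Worker,
GetResourceFunc and Manifold types are simplified local stand-ins, and
restrict, ErrMissing, nopWorker and the resource names are all hypothetical;
none of this is the engine's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// Worker, GetResourceFunc and Manifold are simplified local stand-ins;
// they are assumptions made for this sketch, not the package's definitions.
type Worker interface {
	Kill()
	Wait() error
}

type GetResourceFunc func(name string, out interface{}) error

type Manifold struct {
	Inputs []string
	Start  func(getResource GetResourceFunc) (Worker, error)
	Output func(in Worker, out interface{}) error
}

// ErrMissing is a hypothetical error for an unavailable dependency.
var ErrMissing = errors.New("dependency not available")

// restrict returns a GetResourceFunc that delegates to all for names
// declared in inputs, and returns ErrMissing for every other name.
func restrict(inputs []string, all GetResourceFunc) GetResourceFunc {
	declared := make(map[string]bool, len(inputs))
	for _, name := range inputs {
		declared[name] = true
	}
	return func(name string, out interface{}) error {
		if !declared[name] {
			return ErrMissing
		}
		return all(name, out)
	}
}

// nopWorker is a trivial Worker used to keep the sketch runnable.
type nopWorker struct{}

func (nopWorker) Kill()       {}
func (nopWorker) Wait() error { return nil }

func main() {
	// everything pretends that every named resource is available.
	everything := func(name string, out interface{}) error { return nil }

	m := Manifold{
		Inputs: []string{"api-caller"},
		Start: func(getResource GetResourceFunc) (Worker, error) {
			if err := getResource("api-caller", nil); err != nil {
				return nil, err
			}
			return nopWorker{}, nil
		},
	}
	get := restrict(m.Inputs, everything)
	_, err := m.Start(get)
	fmt.Println(err)                                    // declared input: <nil>
	fmt.Println(get("machine-lock", nil) == ErrMissing) // undeclared input: true
}
```

The point of the wrapper is that a worker cannot quietly pick up a dependency
it did not declare, so the declared Inputs stay an honest description of the
graph.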

When the weight of manifolds in a single engine becomes inconvenient, group them
and run them inside nested dependency.Engines; the Report() method on the
top-level engine will collect information from (directly-) contained engines,
so at least there's still some observability; but there may also be call to
pass actual dependencies down from one engine to another, and that'll demand
careful thought.


Usage
-----

In each worker package, write a `manifold.go` containing the following:

    // ManifoldConfig holds the information necessary to configure the worker
    // controlled by a Manifold.
    type ManifoldConfig struct {

        // The names of the various dependencies, e.g.
        APICallerName   string
        MachineLockName string
        // Any other required top-level configuration, e.g.
        Period time.Duration
    }

    // Manifold returns a manifold that controls the operation of a worker
    // responsible for <things>, configured as supplied.
    func Manifold(config ManifoldConfig) dependency.Manifold {
        // Your code here...
        return dependency.Manifold{

            // * certainly include each of your configured dependency names,
            //   getResource will only expose them if you declare them here.
            Inputs: []string{config.APICallerName, config.MachineLockName},

            // * certainly include a start func, it will panic if you don't.
            Start: func(getResource dependency.GetResourceFunc) (worker.Worker, error) {
                // You presumably want to get your dependencies, and you almost
                // certainly want to be closed over `config`...
                var apicaller base.APICaller
                if err := getResource(config.APICallerName, &apicaller); err != nil {
                    return nil, err
                }
                return newSomethingWorker(apicaller, config.Period)
            },

            // * output func is not obligatory, and should be skipped if you
            //   don't know what you'll be exposing or to whom.
            // * see `worker/gate`, `worker/util`, and
            //   `worker/dependency/testing` for examples of output funcs.
            // * if you do supply an output func, be sure to document it on the
            //   Manifold func; for example:
            //
            //       // Manifold exposes Foo and Bar resources, which can be
            //       // accessed by passing a *Foo or a *Bar in the output
            //       // parameter of its dependencies' getResource calls.
            Output: nil,
        }
    }

...and take care to construct your manifolds *only* via that function; *all*
your dependencies *must* be declared in your ManifoldConfig, and *must* be
accessed via those names. Don't hardcode anything, please.

If you find yourself using the same manifold configuration in several places,
consider adding helpers to cmd/jujud/agent/engine, which includes mechanisms
for simple definition of manifolds that depend on an API caller; on an agent;
or on both.


Testing
-------

The `worker/dependency/testing` package, commonly imported as "dt", exposes a
`StubResource` that is helpful for testing `Start` funcs in decent isolation,
with mocked dependencies. Tests for `Inputs` and `Output` are generally pretty
specific to their precise context and don't seem to benefit much from
generalisation.
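
To make the idea of a stubbed resource getter concrete, here is a small,
self-contained sketch of the technique. It deliberately does not use the dt
package: stubResource, stubGetResource, the start func and the "api-caller"
name are all hypothetical local definitions, written on the assumption that a
resource getter has the shape func(name string, out interface{}) error.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// GetResourceFunc is a local stand-in for the getter passed to Start funcs.
type GetResourceFunc func(name string, out interface{}) error

// stubResource is a hypothetical test double: either a value to hand out
// or an error to return.
type stubResource struct {
	Output interface{}
	Error  error
}

// stubGetResource builds a GetResourceFunc backed by a fixed map.
func stubGetResource(resources map[string]stubResource) GetResourceFunc {
	return func(name string, out interface{}) error {
		res, ok := resources[name]
		if !ok {
			return fmt.Errorf("unexpected resource name %q", name)
		}
		if res.Error != nil {
			return res.Error
		}
		if out == nil {
			return nil
		}
		// Copy the stubbed value into the pointer supplied by the caller.
		target := reflect.ValueOf(out)
		if target.Kind() != reflect.Ptr || target.IsNil() {
			return fmt.Errorf("out must be a non-nil pointer; got %T", out)
		}
		value := reflect.ValueOf(res.Output)
		if !value.IsValid() || !value.Type().AssignableTo(target.Elem().Type()) {
			return fmt.Errorf("cannot assign %T to %T", res.Output, out)
		}
		target.Elem().Set(value)
		return nil
	}
}

// start is a hypothetical Start-like func under test.
func start(getResource GetResourceFunc) (string, error) {
	var endpoint string
	if err := getResource("api-caller", &endpoint); err != nil {
		return "", err
	}
	return "worker for " + endpoint, nil
}

func main() {
	ok := stubGetResource(map[string]stubResource{
		"api-caller": {Output: "10.0.0.1:17070"},
	})
	fmt.Println(start(ok)) // worker for 10.0.0.1:17070 <nil>

	errMissing := errors.New("api-caller unavailable")
	broken := stubGetResource(map[string]stubResource{
		"api-caller": {Error: errMissing},
	})
	_, err := start(broken)
	fmt.Println(err == errMissing) // true
}
```

A test written this way can check both the happy path and the behaviour when
a dependency is missing, without starting any real workers.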


Special considerations
----------------------

Your *dependency* graph must be acyclic; this does not imply that
the *information flow* must be acyclic. Indeed, it is common for separate
components to need to synchronise their actions; but the implementation of
Engine makes it inconvenient for either one to depend on the other (and
impossible for both to do so).

When a set of manifolds needs to encode a set of services whose information flow
is not acyclic, apparent A->B->A cycles can be broken by introducing a new
shared dependency C to mediate the information flow. That is, A and B can then
separately depend upon C; and C itself can start a degenerate worker that never
errors of its own accord.
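
The effect of introducing C can be checked with a small sketch. The hasCycle
func below is a hypothetical helper written for this illustration (a plain
depth-first search over a map of name to input names); it makes no claim about
how Engine itself detects cycles.

```go
package main

import "fmt"

// hasCycle reports whether the graph, given as name -> input names,
// contains a dependency cycle. It uses a depth-first search with the
// usual three visit states.
func hasCycle(inputs map[string][]string) bool {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := make(map[string]int, len(inputs))
	var visit func(name string) bool
	visit = func(name string) bool {
		switch state[name] {
		case inProgress:
			return true // reached a node that is still on the stack
		case done:
			return false
		}
		state[name] = inProgress
		for _, dep := range inputs[name] {
			if visit(dep) {
				return true
			}
		}
		state[name] = done
		return false
	}
	for name := range inputs {
		if visit(name) {
			return true
		}
	}
	return false
}

func main() {
	// A and B depend on each other directly: a cycle.
	cyclic := map[string][]string{"A": {"B"}, "B": {"A"}}
	// A and B both depend on a mediating C instead: no cycle.
	mediated := map[string][]string{"A": {"C"}, "B": {"C"}, "C": nil}
	fmt.Println(hasCycle(cyclic))   // true
	fmt.Println(hasCycle(mediated)) // false
}
```

Information can still flow from A to B and back through whatever C exposes;
only the declared dependency edges have to form an acyclic graph.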

For examples of this technique, search for `cmd/jujud/agent/engine.NewValueWorker`
(which is generally used inside other manifolds to pass snippets of agent config
down to workers that don't have a good reason to see, or write, the full agent
config); and `worker/gate.Manifold`, which is for one-way coordination between
workers which should not be started until some other worker has completed some
task.

Please be careful when coordinating workers like this; the gate manifold in
particular is effectively just another lock, and it'd be trivial to construct
a set of gate-users that can deadlock one another. All the usual considerations
when working with locks still apply.
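
To show why a gate behaves like a lock, here is a minimal sketch of a one-way
latch built on a channel. The gate type, newGate and isUnlocked are
hypothetical names invented for this example; it is an assumption about the
general shape of such a mechanism, not a copy of `worker/gate`.

```go
package main

import (
	"fmt"
	"sync"
)

// gate is a hypothetical one-way latch: it starts locked and can be
// unlocked once, after which it stays unlocked.
type gate struct {
	once sync.Once
	ch   chan struct{}
}

func newGate() *gate { return &gate{ch: make(chan struct{})} }

// Unlock opens the gate; calling it more than once is harmless.
func (g *gate) Unlock() { g.once.Do(func() { close(g.ch) }) }

// Unlocked returns a channel that is closed once the gate is open.
func (g *gate) Unlocked() <-chan struct{} { return g.ch }

// isUnlocked reports the gate's state without blocking.
func isUnlocked(g *gate) bool {
	select {
	case <-g.Unlocked():
		return true
	default:
		return false
	}
}

func main() {
	g := newGate()
	fmt.Println(isUnlocked(g)) // false

	done := make(chan struct{})
	go func() {
		<-g.Unlocked() // the waiting worker blocks here
		close(done)
	}()

	g.Unlock()
	g.Unlock() // second call is a no-op
	<-done
	fmt.Println(isUnlocked(g)) // true
}
```

Two workers that each wait on a gate the other is meant to unlock would both
block forever at the `<-g.Unlocked()` line, which is exactly the kind of
deadlock to watch out for.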


Concerns and mitigations thereof
--------------------------------

The dependency package will *not* provide the following features:

  * Deterministic worker startup. As above, this is a blessing in disguise: if
    your workers have a problem with this, they're using magical undeclared
    dependencies and we get to see the inevitable bugs sooner.
    TODO(fwereade): we should add fuzz to the bounce and restart durations to
    more vigorously shake out the bugs...
  * Hand-holding for developers writing Output funcs; the onus is on you to
    document what you expose; to produce useful error messages when supplied
    with unexpected types via the interface{} param; and NOT to panic. The onus
    on your clients is only to read your docs and handle the errors you might
    emit.
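
As an illustration of an Output func that meets those obligations, here is a
self-contained sketch. The Worker interface is a simplified local stand-in,
and Clock, clockWorker and clockOutput are hypothetical names; the func
signature (an input worker plus an interface{} out parameter) is an assumption
based on the description above.

```go
package main

import "fmt"

// Worker is a simplified local stand-in for a worker interface.
type Worker interface {
	Kill()
	Wait() error
}

// Clock is a hypothetical service exposed to dependents.
type Clock interface {
	Now() string
}

// clockWorker is a hypothetical worker that also implements Clock.
type clockWorker struct{}

func (*clockWorker) Kill()       {}
func (*clockWorker) Wait() error { return nil }
func (*clockWorker) Now() string { return "12:00" }

// clockOutput exposes a *clockWorker as a Clock. Clients pass a *Clock as
// the out parameter; anything else yields an error rather than a panic.
func clockOutput(in Worker, out interface{}) error {
	w, ok := in.(*clockWorker)
	if !ok {
		return fmt.Errorf("expected *clockWorker input; got %T", in)
	}
	switch target := out.(type) {
	case *Clock:
		if target == nil {
			return fmt.Errorf("expected non-nil *Clock output")
		}
		*target = w
		return nil
	default:
		return fmt.Errorf("expected *Clock output; got %T", out)
	}
}

func main() {
	var clock Clock
	err := clockOutput(&clockWorker{}, &clock)
	fmt.Println(err, clock.Now()) // <nil> 12:00

	var wrong string
	fmt.Println(clockOutput(&clockWorker{}, &wrong))
	// expected *Clock output; got *string
}
```

The doc comment on clockOutput is the part clients rely on: it says which
pointer types are accepted, and every other case comes back as an error that
names the offending type.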

*/
package dependency