// Copyright 2015 Canonical Ltd.
// Licensed under the AGPLv3, see LICENCE file for details.

/*

The dependency package exists to address a general problem with shared resources
and the management of their lifetimes. Many kinds of software handle these issues
with more or less felicity, but it's particularly important that juju (a distributed
system that needs to be very fault-tolerant) handle them clearly and sanely.

Background
----------

A cursory examination of the various workers run in juju agents (as of 2015-04-20)
reveals a distressing range of approaches to the shared resource problem. A
sampling of techniques (and their various problems) follows:

  * enforce sharing in code structure, either directly via scoping or implicitly
    via nested runners (state/api conns; agent config)
      * code structure is inflexible, and it enforces strictly nested resource
        lifetimes, which are not always adequate.
  * just create N of them and hope it works out OK (environs)
      * creating N prevents us from, e.g., using a single connection to an environ
        and sanely rate-limiting ourselves.
  * use filesystem locking across processes (machine execution lock)
      * the implementation sometimes flakes out, or is used improperly; and multiple
        agents *are* a problem anyway, but even if we're all in-process we'll need
        some shared machine lock...
  * wrap workers to start up only when some condition is met (post-upgrade
    stability -- itself also a shared resource)
      * the lifetime-nesting comments apply here again; *and* it makes it harder to
        follow the code.
  * implement a singleton (lease manager)
      * singletons make it *even harder* to figure out what's going on -- they're
        basically just fancy globals, and have all the associated problems, e.g.
        deadlocking due to unexpected shutdown order.

...but, of course, they all have their various advantages:

  * Of the approaches, the first is the most reliable by far. Despite the
    inflexibility, there's a clear and comprehensible model in play that has yet
    to cause serious confusion: each worker is created with its resource(s)
    directly available in code scope, and trusts that it will be restarted by an
    independent watchdog if one of its dependencies fails. This characteristic is
    extremely beneficial and must be preserved; we just need it to be more
    generally applicable.

  * The create-N-Environs approach is valuable because it can be simply (if
    inelegantly) integrated with its dependent worker, and a changed Environ
    does not cause the whole dependent to fall over (unless the change is itself
    bad). The former characteristic is a subtle trap (we shouldn't be baking
    dependency-management complexity into the cores of our workers' select loops,
    even if it is "simple" to do so), but the latter is important: in particular,
    the firewaller and provisioner are distressingly heavyweight workers and it
    would be unwise to take an approach that led to them being restarted when not
    necessary.

  * The filesystem locking just should not happen -- and we need to integrate the
    unit and machine agents to eliminate it (and for other reasons too) so we
    should give some thought to the fact that we'll be shuffling these dependencies
    around pretty hard in the future. If the approach can make that task easier,
    then great.

  * The singleton is dangerous specifically because its dependency interactions are
    unclear. Absolute clarity of dependencies, as provided by the nesting approaches,
    is in fact critical.

The various nesting approaches give easy access to directly-available resources,
which is great, but will fail as soon as you have a sufficiently sophisticated
dependent that can operate usefully without all its dependencies being satisfied
(we have a couple of requirements for this in the unit agent right now). Still,
direct resource access *is* tremendously convenient, and we need some way to
access one service from another.

However, all of these resources are very different: for a solution that encompasses
them all, you essentially have to represent them as interface{} at some point, and
that carries a real risk to clarity.


Problem
-------

The package is intended to implement the following developer stories:

  * As a developer, I want to expose a service implemented by some worker to one
    or more client workers.
  * As a developer, I want to write a service that consumes one or more other
    workers' services.
  * As a developer, I want to choose how I respond to missing dependencies.
  * As a developer, I want to be able to inject test doubles for my dependencies.
  * As a developer, I want control over how my service is exposed to others.
  * As a developer, I don't want to have to typecast my dependencies from
    interface{} myself.
  * As a developer, I want my service to be restarted if its dependencies change.

That last one might bear a little explanation: I contend that it's the only
reliable approach to writing resilient services that compose sanely into a
comprehensible system. Consider:

  * Juju agents' lifetimes must be assumed to exceed the MTBR of the systems
    they're deployed on; you might naively think that hard reboots are "rare"...
    but they're not. They really are just a feature of the terrain we have to
    traverse. Therefore every worker *always* has to be capable of picking itself
    back up from scratch and continuing sanely. That is, we're not imposing a new
    expectation: we're just working within the existing constraints.
  * While some workers are simple, some are decidedly not; when a worker has any
    more complexity than "none", it is a Bad Idea to mix dependency-management
    concerns into its core logic: that creates the sort of morass in which subtle
    bugs thrive.

So, we take advantage of the expected bounce-resilience, and excise all
dependency-management concerns from the workers themselves... in favour of a
system that bounces workers slightly more often than before, and thus exercises
those code paths more; so, when there are bugs, we're more likely to shake them
out in automated testing before they hit users.

We'd also like to implement these stories, which go together, and should be
added when their absence becomes inconvenient:

  * As a developer, I want to be prevented from introducing dependency cycles
    into my application. [NOT DONE]
  * As a developer trying to understand the codebase, I want to know what workers
    are running in an agent at any given time. [NOT DONE]
  * As a developer, I want to add and remove groups of workers atomically, e.g.
    when starting the set of state-server workers for a hosted environ; or when
    starting the set of workers used by a single unit. [NOT DONE]


Solution
--------

Run a single dependency.Engine at the top level of each agent; express every
shared resource, and every worker that uses one, as a dependency.Manifold; and
install them all into the top-level engine.

When installed under some name, a dependency.Manifold represents the features of
a node in the engine's dependency graph. It lists:

  * The names of its dependencies (Inputs).
  * How to create the worker representing the resource (Start).
  * How (if at all) to expose the resource as a service to other resources that
    know it by name (Output).

...and allows the developers of each independent service a common mechanism for
declaring and accessing their dependencies, and the ability to assume that they
will be restarted whenever there is a material change to their accessible
dependencies.


Usage
-----

In each worker package, write a `manifold.go` containing the following:

    type ManifoldConfig struct {
        // The names of the various dependencies, e.g.
        APICallerName   string
        MachineLockName string
    }

    func Manifold(config ManifoldConfig) dependency.Manifold {
        // Your code here...
    }

...and take care to construct your manifolds *only* via that function; *all*
your dependencies *must* be declared in your ManifoldConfig, and *must* be
accessed via those names. Don't hardcode anything, please.

If you find yourself using the same manifold configuration in several places,
consider adding helpers to worker/util, which includes mechanisms for simple
definition of manifolds that depend on an API caller; on an agent; or on both.


Concerns and mitigations thereof
--------------------------------

The dependency package will *not* provide the following features:

  * Deterministic worker startup. As above, this is a blessing in disguise: if
    your workers have a problem with this, they're using magical undeclared
    dependencies and we get to see the inevitable bugs sooner.
    TODO(fwereade): we should add fuzz to the bounce and restart durations to
    more vigorously shake out the bugs...
  * Hand-holding for developers writing Output funcs; the onus is on you to
    document what you expose; to produce useful error messages when you are
    supplied with unexpected types via the interface{} param; and NOT to panic.
    The onus on your clients is only to read your docs and handle the errors
    you might emit.

*/
package dependency