github.com/juju/juju@v0.0.0-20240430160146-1752b71fcf00/doc/architectural-overview.md

# Juju Architectural Overview


## Audience

This document is targeted at new developers of Juju, and may be useful to experienced
developers who need a refresher on some aspect of Juju's operation. It is deliberately
light on detail, because the precise mechanisms of various components' operation are
expected to change much faster than the general interactions between components.


## The View From Space

A Juju model is a distributed system comprising:

* A data store (MongoDB) which describes the desired state of the world, in terms
  of running workloads or *applications*, and the *relations* between them; and of the
  *units* that comprise those applications, and the *machines* on which those units run.
* A bunch of *agents*, each of which runs the same `jujud` binary, and which are
  variously responsible for causing reality to converge towards the idealised
  world-state encoded in the data store.
* Some number of *clients* which talk over an API, served by the controller agents, to
  update the desired world-state (and thereby cause the agents to update the world
  to match). The `juju` binary is one of many possible clients; the `juju-dashboard` web
  application, and the `juju-deployer` Python tool, are other examples.

The whole system depends upon a substrate, or *provider*, which supplies the compute,
storage, and network resources used by the workloads (and by Juju itself; but never
forget that *everything* described in this document is merely supporting infrastructure
geared towards the successful deployment and configuration of the workloads that solve
actual problems for actual users).

## Juju Components

Here are the various high-level parts of the Juju system and how they interact:

```
                   +--------------------------+            +------------------------+
                   |                          |            |                        |
                   |  Machine agent           |            |  Unit agent            |
                   |         +-------------+  |            |       +-------------+  |
                   |         |             |  |            |       |             |  |
                   |         |   workers   |  |            |       |   workers   |  |
                   |         |             |  |            |       |             |  |
                   |         +-----------+-+  |            |       +-------+-----+  |
                   |                     |    |            |               |        |
                   +--------------------------+            +------------------------+
                                         |                                 |
                                         |   Juju API                      |
                                         |       +-------------------------+
                                         |       |
                                         |       |
                  +-----------------------------------------------------------------+
                  |                      |       |                                  |
                  |  Controller agent    |       |                                  |
+------------+    |                     +v-------v----+            +-------------+  |
|            |    |                     |             |  Juju API  |             |  |
|   Client   +-------------------------->  apiserver  +<-----------+   workers   |  |
|            |    |   Juju API          |             |            |             |  |
+------------+    |                     +------+------+            +------+------+  |
                  |                            |                          |         |
                  |                            |                          |         |
                  |                      +-----v-----+             +------v------+  |
                  |                      |           |             |             |  |
                  |                      |   state   |             |  providers  |  |
                  |                      |           |             |             |  |
                  |                      +-----+-----+             +------+------+  |
                  |                            |                          |         |
                  +-----------------------------------------------------------------+
                                               | MongoDB protocol         | cloud API
                                               |                          |
                                         +-----v-----+          +---------v---------+
                                         |           |          |                   |
                                         |  MongoDB  |          |  cloud/substrate  |
                                         |           |          |                   |
                                         +-----------+          +-------------------+
```

At the centre is a *controller agent*. It is responsible for maintaining the
state for one or more Juju models and runs a server which provides the Juju
API. Juju's state is kept in MongoDB. Juju's MongoDB may only be accessed by
the controller agents.

A controller agent runs a number of *workers*, many of which are specific to
controller tasks. Some workers in the controller agent use the Juju *provider*
implementation to communicate with the underlying cloud substrate using the
substrate's APIs. This is how cloud resources are created, managed and
destroyed.

Almost all workers will interact with Juju's state using Juju's API, even
workers running within a controller agent.

If a Juju deployment has high-availability enabled there will be multiple
controller agents. A consumer of the Juju API may connect to any controller
agent. In HA mode, there will be a MongoDB instance on each controller machine,
with a MongoDB replicaset configured to synchronise data between the nodes.

Each Juju-deployed machine runs a *machine agent*. Each machine agent runs a
number of workers.

A controller agent is a machine agent with extra responsibilities. It runs all
the workers which a normal machine agent runs as well as controller-specific
workers.

A *unit agent* runs for each deployed unit of an application. It is mainly
responsible for installing, running and maintaining charm code. It runs a
different set of workers from a machine agent.

There are a number of *clients* which interact with Juju using the Juju
API. These include the `juju` command line tool and Juju Dashboard.


## The Data Store (aka "state")

There's a lot of *detail* to cover, but there's not much to say from an architectural
standpoint. We use a MongoDB replicaset to support HA; we use the `mgo` package from
`labix.org` to implement multi-document transactions; we make use of the transaction
log to detect changes to particular documents, and convert them into
business-object-level events that get sent over the API to interested parties.
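
The log-to-event conversion described above can be pictured with a small Go sketch. It is only an illustration of the idea, not Juju's watcher code: `logEntry`, `machineChange`, and `watchMachines` are hypothetical names, and the "log" here is a plain channel rather than a MongoDB collection.

```go
package main

import "fmt"

// logEntry is a hypothetical, simplified record of one committed change.
type logEntry struct {
	Collection string // e.g. "machines"
	DocID      string // id of the changed document
	Revno      int64  // document revision after the change
}

// machineChange is a hypothetical business-level event derived from a log entry.
type machineChange struct {
	MachineID string
	Revno     int64
}

// watchMachines reads raw log entries and emits an event for each change to
// one of the watched machine documents; every other entry is dropped.
func watchMachines(log <-chan logEntry, ids ...string) <-chan machineChange {
	watched := make(map[string]bool, len(ids))
	for _, id := range ids {
		watched[id] = true
	}
	out := make(chan machineChange)
	go func() {
		defer close(out)
		for e := range log {
			if e.Collection == "machines" && watched[e.DocID] {
				out <- machineChange{MachineID: e.DocID, Revno: e.Revno}
			}
		}
	}()
	return out
}

func main() {
	log := make(chan logEntry, 3)
	log <- logEntry{"machines", "0", 7}
	log <- logEntry{"units", "mysql/0", 3}
	log <- logEntry{"machines", "1", 2}
	close(log)

	for c := range watchMachines(log, "0") {
		fmt.Printf("machine %s changed (revno %d)\n", c.MachineID, c.Revno)
	}
}
```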

The MongoDB databases run on machines we refer to as *controllers*, and are only
accessed by agents running on those machines; it's important to keep them locked down
(and, honestly, to lock them down further and better than we currently have).

There's some documentation on how to work with [the state package](hacking-state.md);
and plenty more on the [state entities](lifecycles.md) and the details of their
[creation](entity-creation.md) and [destruction](death-and-destruction.md) from various
perspectives; but there's not a lot more to say in this context.

It *is* important to understand that the transaction-log watching is not an ideal
solution, and we'll be retiring it at some point, in favour of an in-memory model
of state and a pub-sub system for watchers; we *know* it's a scalability problem,
but we're not devoting resources to it until it becomes more pressing.

Code for dealing with MongoDB is found primarily in the `state`, `state/watcher`,
`replicaset`, and `worker/peergrouper` packages.


## API

Juju controllers expose an API endpoint over a websocket connection. The methods
available over the API are broken down by client; there's a `Client` facade that
exposes the methods used by clients, an `Agent` facade that exposes the methods
common to all agents, and a wide range of worker-specific *facades* that individually
deal with particular chunks of functionality implemented by one agent or another
(for example, `Provisioner`, `Upgrader`, and `Uniter`, each used by the eponymous
worker types).

The API server is implemented in the `apiserver` top-level package. Each API
facade has its own subpackage (e.g. `apiserver/provisioner`). The code under
`apiserver` is the only code that is allowed to import from the `state`
package.

Various facades share functionality; for example, the `Life` method is used by many
worker facades. In these cases, the method is implemented on a separate type,
which is embedded in the facade implementation.
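
As a rough illustration of that embedding pattern, here is a minimal sketch in plain Go. The names `LifeGetter`, `ProvisionerFacade`, and `UniterFacade`, and the map standing in for a state lookup, are assumptions made for the example and are not the real `apiserver` types.

```go
package main

import (
	"errors"
	"fmt"
)

// Life is a hypothetical lifecycle value for an entity.
type Life string

// LifeGetter is a hypothetical shared helper: it implements Life once so
// that several facades can reuse the same code.
type LifeGetter struct {
	lives map[string]Life // entity tag -> life; stands in for a state lookup
}

// Life reports the lifecycle value of the entity with the given tag.
func (g *LifeGetter) Life(tag string) (Life, error) {
	life, ok := g.lives[tag]
	if !ok {
		return "", errors.New("entity not found: " + tag)
	}
	return life, nil
}

// ProvisionerFacade and UniterFacade are hypothetical facades. Embedding
// *LifeGetter promotes its Life method into each facade's method set.
type ProvisionerFacade struct {
	*LifeGetter
}

type UniterFacade struct {
	*LifeGetter
}

func main() {
	shared := &LifeGetter{lives: map[string]Life{
		"machine-0":    "alive",
		"unit-mysql-0": "dying",
	}}
	p := ProvisionerFacade{shared}
	u := UniterFacade{shared}

	life, _ := p.Life("machine-0")
	fmt.Println(life) // alive
	life, _ = u.Life("unit-mysql-0")
	fmt.Println(life) // dying
}
```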

All APIs *should* be implemented such that they can be called in bulk, but not
all of them are. The agent facades are (almost?) all implemented correctly, but
the `Client` facade is almost exclusively not. As functionality evolves, and new
versions of the client APIs are implemented, we must take care to implement them
consistently -- this means both implementing bulk calls *and* splitting the
monolithic `Client` facade into smaller application-specific facades, such that we
can evolve interaction with (say) users without bumping global API versions
across the board.
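
To make "called in bulk" concrete, the sketch below shows one plausible shape for such a call: many entities in, one result per entity out, with errors reported per entity rather than failing the whole request. `LifeResult` and `lifeBulk` are hypothetical and are not the real API parameter types.

```go
package main

import "fmt"

// LifeResult is a hypothetical per-entity result: either Life or Error is set.
type LifeResult struct {
	Life  string
	Error string // empty on success
}

// lifeBulk answers for many entities in a single call. It returns exactly one
// result per requested tag, in the same order, and a failure for one entity
// does not prevent the others from being answered.
func lifeBulk(known map[string]string, tags []string) []LifeResult {
	results := make([]LifeResult, len(tags))
	for i, tag := range tags {
		life, ok := known[tag]
		if !ok {
			results[i].Error = "entity " + tag + " not found"
			continue
		}
		results[i].Life = life
	}
	return results
}

func main() {
	known := map[string]string{"machine-0": "alive", "machine-1": "dead"}
	tags := []string{"machine-0", "machine-7", "machine-1"}
	for i, r := range lifeBulk(known, tags) {
		fmt.Printf("%s: life=%q error=%q\n", tags[i], r.Life, r.Error)
	}
}
```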

The Juju API client is implemented under the `api` top-level package. Client-side
API facades are implemented as subpackages underneath `api`.

## The Agents

Agents all use the same `jujud` binary, and all follow roughly the same model.
When starting up, they authenticate with an API server; possibly reset their
password, if the one they used has been stored persistently somewhere and is
thus vulnerable; determine their responsibilities; and run a set of tasks in
parallel until one of those tasks returns an error indicating that the agent
should either restart or terminate completely. Tasks that return any other error
will be automatically restarted after a short delay; tasks that return nil are
considered to be complete, and will not be restarted until the whole process is.

When comparing the unit agent with the machine agent, the truth of the above
may not be immediately apparent, because the responsibilities of the unit
agent are so much less varied than those of the machine agent; but we have
scheduled work to integrate the unit agent into the machine agent, rendering
each unit agent a single worker task within its responsible machine agent. It's
still better to consider a unit agent to be a simplistic and/or degenerate
implementation of a machine agent than to attach too much importance to the
differences.

### Jobs, Runners, and Workers

Machine agents all have at least one of two jobs: JobHostUnits and JobManageModel.
Each of these jobs represents a number of tasks the agent needs to execute to
fulfil its responsibilities; in addition, there are a number of tasks that are
executed by every machine agent. The terms *task* and *worker* are generally used
interchangeably in this document and in the source code; it's possible but not
generally helpful to draw the distinction that a worker executes a task. All
tasks are implemented by code in some subpackage of the `worker` package, and the
`worker.Runner` type implements the retry behaviour described above.

It's useful to note that the Runner type is itself a worker, so we can and do
nest Runners inside one another. The details of *exactly* how and where a given
worker comes to be executed are generally elided in this document, but it's worth
being aware that all the workers that use an API connection share a single one,
mediated by a single Runner. When the API connection fails, that single Runner
can stop all its workers; shut itself down; be restarted by its parent worker;
and set up a new API connection, which it then uses to start all its child
workers.

Please note that the lists of workers below should *not* be assumed to be
exhaustive. Juju evolves, and the workers evolve with it.

### Common workers

All agents run workers with the following responsibilities:

* Check for scheduled upgrades for their binaries, and replace themselves
  (implemented in `worker/upgrader`)
* Watch logging config, and reconfigure the local logger (`worker/logger`; yes,
  we know; it is not the stupidest name in the codebase)
* Watch and store the latest known addresses for the controllers
  (`worker/apiaddressupdater`)

### Machine Agent Workers

Machine agents additionally do the following:

* Run upgrade code in the new binaries once they've replaced themselves
  (implemented directly in the machine agent's `upgradeWorker` method)
* Handle SIGABRT and permanently stop the agent (`worker/terminationworker`)
* Handle the machine entity's death and permanently stop the agent (`worker/machiner`)
* Watch proxy config, and reconfigure the local machine (`worker/machineenvironmentworker`)
* Watch for contained LXC or KVM machines and provision/decommission them
  (`worker/provisioner`)

All machine agents have JobHostUnits. They run the `worker/deployer` code, which
watches for units assigned to the machine, and deploys/recalls upstart configs
for their respective unit agents as the units are assigned/removed. We expect
the deployer implementation to change to just directly run the unit agents'
workers in its own Runner.
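
The deployer's job is an instance of the general "make reality match the desired state" pattern. A minimal sketch of that comparison, assuming a hypothetical `reconcile` helper that works on plain unit names rather than on anything from `worker/deployer`:

```go
package main

import (
	"fmt"
	"sort"
)

// reconcile compares the units assigned to a machine (desired) with the unit
// agents currently deployed on it (actual), and reports which units need an
// agent deployed and which need their agent recalled.
func reconcile(desired, actual []string) (deploy, recall []string) {
	want := make(map[string]bool, len(desired))
	for _, u := range desired {
		want[u] = true
	}
	have := make(map[string]bool, len(actual))
	for _, u := range actual {
		have[u] = true
	}
	for u := range want {
		if !have[u] {
			deploy = append(deploy, u)
		}
	}
	for u := range have {
		if !want[u] {
			recall = append(recall, u)
		}
	}
	// Map iteration order is random, so sort for a stable result.
	sort.Strings(deploy)
	sort.Strings(recall)
	return deploy, recall
}

func main() {
	deploy, recall := reconcile(
		[]string{"mysql/0", "wordpress/1"},
		[]string{"wordpress/1", "haproxy/2"},
	)
	fmt.Println(deploy, recall) // [mysql/0] [haproxy/2]
}
```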

### Controller Workers

Machines with JobManageModel also run a number of other workers, which do
the following:

* Run the API server used by all other workers (in this, and other, agents:
  `apiserver`)
* Provision/decommission provider instances in response to the creation/
  destruction of machine entities (`worker/provisioner`, just like the
  container provisioners run in all machine agents anyway)
* Manipulate provider networks in response to units opening/closing ports,
  and users exposing/unexposing applications (`worker/firewaller`)
* Update network addresses and associated information for provider instances
  (`worker/instancepoller`)
* Respond to queued DB cleanup events (`worker/cleaner`)
* Maintain the MongoDB replica set (`worker/peergrouper`)
* Resume incomplete MongoDB transactions (`worker/resumer`)

Many of these workers (more than strictly need to be) are wrapped as "singular"
workers, which only run on the same machine as the current MongoDB replicaset
master. When the master changes, the state connection is dropped, causing all
those workers to also be stopped; when they're restarted, they won't run because
they're no longer running on the master.

### Unit Agents

Unit agents run all the common workers, and the `worker/uniter` task as well;
this task is probably the single most forbiddingly complex part of Juju. (Side
note: It's a unit-er because it deals with units, and we're bad at names; but
it's also a unite-r because it's where all the various components of Juju come
together to run actual workloads.) It's sufficiently large that it deserves its
own top-level heading, below.


## The Uniter

At the highest level, the Uniter is a state machine. After a "little bit" of setup,
it runs a tight loop in which it calls `Mode` functions one after another, with the
next mode run determined by the result of its predecessor. All mode functions are
implemented in `worker/uniter/modes.go`, which is actually pretty small: just a hair
over 400 lines.

It's deliberately implemented as conceptually single-threaded (just like almost
everything else in Juju -- rampaging concurrency is the root of much evil, and so
we save ourselves a huge number of headaches by hiding concurrency behind event
channels and handling a single event at a time), but this property has degraded
over time; in particular, the `RunListener` code can inject events at unhelpful
times, and while the `hookLock` *probably* renders this safe it's still deeply
suboptimal, because the fact of the concurrency requires that we be *extremely*
careful with further modifications, lest they subtly break assumptions. We hope
to address this by retiring the current implementation of `juju run`, but it's
not entirely clear how to do this; in the meantime, Here Be Dragons.

Leaving these woes aside, the mode functions make use of two fundamental components,
which are glommed together until someone refactors it to make more sense. There's
the `Filter`, which is responsible for communicating with the API server (and the
rest of the outside world) such that relevant events can be delivered to the mode
func via channels exposed on the filter; and then there's the `Uniter` itself, which
exposes a number of methods that are expected to be called by the mode funcs.


### Uniter Modes

XXXX


### Hook contexts

XXXX


### The Relation Model

XXXX


## The Providers

Each Juju provider represents one possible kind of substrate on which a
Juju model can run. The providers (as far as possible) abstract away the
differences between substrates, because they all conform to the Environ
interface. The most important thing to understand about the various providers
is that they're all implemented without reference to broader Juju concepts;
they are squeezed into a shape that's convenient with respect to allowing Juju
to make use of them, but if we allow Juju-level concepts to infect the
providers we will suffer greatly, because we will open a path by which changes
to *Juju* end up causing changes to *all the providers at once*.
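
The value of a single provider interface can be shown with a toy example. Everything here is an assumption made for illustration: `miniEnviron` is a heavily reduced, hypothetical stand-in (the real Environ interface has many more methods and different signatures), and the two fake substrates exist only to show that calling code needs no knowledge of which one it is using.

```go
package main

import (
	"fmt"
	"strconv"
)

// miniEnviron is a hypothetical, heavily reduced stand-in for Environ.
type miniEnviron interface {
	StartInstance(name string) (id string, err error)
	StopInstance(id string) error
}

// fakeVMCloud is a hypothetical substrate that names instances after the request.
type fakeVMCloud struct{ running map[string]bool }

func (c *fakeVMCloud) StartInstance(name string) (string, error) {
	id := "vm-" + name
	c.running[id] = true
	return id, nil
}

func (c *fakeVMCloud) StopInstance(id string) error {
	if !c.running[id] {
		return fmt.Errorf("instance %q is not running", id)
	}
	delete(c.running, id)
	return nil
}

// fakeContainerHost is a second hypothetical substrate that numbers its instances.
type fakeContainerHost struct {
	next    int
	running map[string]bool
}

func (h *fakeContainerHost) StartInstance(name string) (string, error) {
	id := "ctr-" + strconv.Itoa(h.next)
	h.next++
	h.running[id] = true
	return id, nil
}

func (h *fakeContainerHost) StopInstance(id string) error {
	if !h.running[id] {
		return fmt.Errorf("instance %q is not running", id)
	}
	delete(h.running, id)
	return nil
}

// provision is substrate-agnostic: it only knows about miniEnviron.
func provision(env miniEnviron, names ...string) ([]string, error) {
	ids := make([]string, 0, len(names))
	for _, name := range names {
		id, err := env.StartInstance(name)
		if err != nil {
			return ids, err
		}
		ids = append(ids, id)
	}
	return ids, nil
}

func main() {
	envs := []miniEnviron{
		&fakeVMCloud{running: map[string]bool{}},
		&fakeContainerHost{running: map[string]bool{}},
	}
	for _, env := range envs {
		ids, err := provision(env, "db", "web")
		fmt.Println(ids, err)
	}
}
```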

However, we lack the ability to enforce this at present, because the package
dependency flows in the wrong direction, thanks primarily (purely?) to the
StateInfo method on Environ; and we jam all sorts of gymnastics into the state
package to allow us to use Environs without doing so explicitly (see the
state.Policy interface, and its many somewhat-inelegant uses). In other places,
we have (quite reasonably) moved code out of the environs package (see both
environs/config.Config, and instances.Instance).

Environ implementations are expected to be goroutine-safe; we don't make much
use of that property at the moment, but we will be coming to depend
upon it as we move to eliminate the wasteful proliferation of Environ instances
in the API server.

It's important to note that an environ Config will generally contain sensitive
information -- a user's authentication keys for a cloud provider -- and so we
must always be careful to avoid spreading those around further than we need to.
Basically, if an environ config gets off a controller, we've screwed up.