# Juju Architectural Overview


## Audience

This document is targeted at new developers of Juju, and may be useful to experienced
developers who need a refresher on some aspect of Juju's operation. It is deliberately
light on detail, because the precise mechanisms of various components' operation are
expected to change much faster than the general interactions between components.


## The View From Space

A Juju model is a distributed system comprising:

* A data store (mongodb) which describes the desired state of the world, in terms
  of running workloads or *applications*, and the *relations* between them; and of the
  *units* that comprise those applications, and the *machines* on which those units run.
* A bunch of *agents*, each of which runs the same `jujud` binary, and which are
  variously responsible for causing reality to converge towards the idealised
  world-state encoded in the data store.
* Some number of *clients* which talk over an API, implemented by the agents, to
  update the desired world-state (and thereby cause the agents to update the world
  to match). The `juju` binary is one of many possible clients; the `juju-dashboard` web
  application, and the `juju-deployer` python tool, are other examples.

The whole system depends upon a substrate, or *provider*, which supplies the compute,
storage, and network resources used by the workloads (and by Juju itself; but never
forget that *everything* described in this document is merely supporting infrastructure
geared towards the successful deployment and configuration of the workloads that solve
actual problems for actual users).

## Juju Components

Here are the various high-level parts of the Juju system and how they interact:

```
                       +--------------------------+           +------------------------+
                       |                          |           |                        |
                       |      Machine agent       |           |       Unit agent       |
                       |     +-------------+      |           |    +-------------+     |
                       |     |             |      |           |    |             |     |
                       |     |   workers   |      |           |    |   workers   |     |
                       |     |             |      |           |    |             |     |
                       |     +-----------+-+      |           |    +-------+-----+     |
                       |                 |        |           |            |           |
                       +--------------------------+           +------------------------+
                                         |                                 |
                                         | Juju API                        |
                                         |       +-------------------------+
                                         |       |
                                         |       |
                  +-----------------------------------------------------------------+
                  |                      |       |                                  |
                  | Controller agent     |       |                                  |
+------------+    |                     +v-------v----+            +-------------+  |
|            |    |                     |             |  Juju API  |             |  |
|   Client   +-------------------------->  apiserver  +<-----------+   workers   |  |
|            |    |      Juju API       |             |            |             |  |
+------------+    |                     +------+------+            +------+------+  |
                  |                            |                          |         |
                  |                            |                          |         |
                  |                      +-----v-----+             +------v------+  |
                  |                      |           |             |             |  |
                  |                      |   state   |             |  providers  |  |
                  |                      |           |             |             |  |
                  |                      +-----+-----+             +------+------+  |
                  |                            |                          |         |
                  +-----------------------------------------------------------------+
                                               | MongoDB protocol         | cloud API
                                               |                          |
                                         +-----v-----+          +---------V---------+
                                         |           |          |                   |
                                         |  MongoDB  |          |  cloud/substrate  |
                                         |           |          |                   |
                                         +-----------+          +-------------------+
```

At the centre is a *controller agent*. It is responsible for maintaining the
state for one or more Juju models and runs a server which provides the Juju
API. Juju's state is kept in MongoDB. Juju's MongoDB may only be accessed by
the controller agents.

A controller agent runs a number of *workers*, many of which are specific to
controller tasks. Some workers in the controller agent use the Juju *provider*
implementation to communicate with the underlying cloud substrate using the
substrate's APIs. This is how cloud resources are created, managed and
destroyed.

Almost all workers will interact with Juju's state using Juju's API, even
workers running within a controller agent.

If a Juju deployment has high-availability enabled there will be multiple
controller agents. A consumer of the Juju API may connect to any controller
agent. In HA mode, there will be a MongoDB instance on each controller machine,
with a MongoDB replicaset configured to synchronise data between the nodes.

Each Juju-deployed machine runs a *machine agent*. Each machine agent runs a
number of workers.

A controller agent is a machine agent with extra responsibilities. It runs all
the workers which a normal machine agent runs as well as controller-specific
workers.

A *unit agent* runs for each deployed unit of an application. It is mainly
responsible for installing, running and maintaining charm code. It runs a
different set of workers to a machine agent.

There are a number of *clients* which interact with Juju using the Juju
API. These include the `juju` command line tool and Juju Dashboard.


## The Data Store (aka "state")

There's a lot of *detail* to cover, but there's not much to say from an architectural
standpoint. We use a mongodb replicaset to support HA; we use the `mgo` package from
`labix.org` to implement multi-document transactions; we make use of the transaction
log to detect changes to particular documents, and convert them into
business-object-level events that get sent over the API to interested parties.

The mongodb databases run on machines we refer to as *controllers*, and are only
accessed by agents running on those machines; it's important to keep it locked down
(and, honestly, to lock it down further and better than we currently have).

There's some documentation on how to work with [the state package](hacking-state.md);
and plenty more on the [state entities](lifecycles.md) and the details of their
[creation](entity-creation.md) and [destruction](death-and-destruction.md) from various
perspectives; but there's not a lot more to say in this context.

It *is* important to understand that the transaction-log watching is not an ideal
solution, and we'll be retiring it at some point, in favour of an in-memory model
of state and a pub-sub system for watchers; we *know* it's a scalability problem,
but we're not devoting resources to it until it becomes more pressing.

Code for dealing with mongodb is found primarily in the `state`, `state/watcher`,
`replicaset`, and `worker/peergrouper` packages.


## API

Juju controllers expose an API endpoint over a websocket connection. The methods
available over the API are broken down by client; there's a `Client` facade that
exposes the methods used by clients, an `Agent` facade that exposes the methods
common to all agents, and a wide range of worker-specific *facades* that individually
deal with particular chunks of functionality implemented by one agent or another
(for example, `Provisioner`, `Upgrader`, and `Uniter`, each used by the eponymous
worker types).

The API server is implemented in the `apiserver` top-level package. Each API
facade has its own subpackage (e.g. `apiserver/provisioner`). The code under
`apiserver` is the only code that is allowed to import from the `state`
package.

Various facades share functionality; for example, the `Life` method is used by many
worker facades. In these cases, the method is implemented on a separate type,
which is embedded in the facade implementation.

All APIs *should* be implemented such that they can be called in bulk, but not
all of them are. The agent facades are (almost?)
all implemented correctly, but
the Client facade is almost exclusively not. As functionality evolves, and new
versions of the client APIs are implemented, we must take care to implement them
consistently -- this means both implementing bulk calls *and* splitting the
monolithic Client facade into smaller application-specific facades, such that we
can evolve interaction with (say) users without bumping global API versions
across the board.

The Juju API client is implemented under the `api` top-level package. Client-side
API facades are implemented as subpackages underneath `api`.

## The Agents

Agents all use the same `jujud` binary, and all follow roughly the same model.
When starting up, they authenticate with an API server; possibly reset their
password, if the one they used has been stored persistently somewhere and is
thus vulnerable; determine their responsibilities; and run a set of tasks in
parallel until one of those tasks returns an error indicating that the agent
should either restart or terminate completely. Tasks that return any other error
will be automatically restarted after a short delay; tasks that return nil are
considered to be complete, and will not be restarted until the whole process is.

When comparing the unit agent with the machine agent, the truth of the above
may not be immediately apparent, because the responsibilities of the unit
agent are so much less varied than those of the machine agent; but we have
scheduled work to integrate the unit agent into the machine agent, rendering
each unit agent a single worker task within its responsible machine agent. It's
still better to consider a unit agent to be a simplistic and/or degenerate
implementation of a machine agent than to attach too much importance to the
differences.

### Jobs, Runners, and Workers

Machine agents all have at least one of two jobs: `JobHostUnits` and `JobManageModel`.
Each of these jobs represents a number of tasks the agent needs to execute to
fulfil its responsibilities; in addition, there are a number of tasks that are
executed by every machine agent. The terms *task* and *worker* are generally used
interchangeably in this document and in the source code; it's possible but not
generally helpful to draw the distinction that a worker executes a task. All
tasks are implemented by code in some subpackage of the `worker` package, and the
`worker.Runner` type implements the retry behaviour described above.

It's useful to note that the Runner type is itself a worker, so we can and do
nest Runners inside one another; the details of *exactly* how and where a given
worker comes to be executed are generally elided in this document; but it's worth
being aware of the fact that all the workers that use an API connection share a
single one, mediated by a single Runner, such that when the API connection fails
that single Runner can stop all its workers; shut itself down; be restarted by
its parent worker; and set up a new API connection, which it then uses to start
all its child workers.

Please note that the lists of workers below should *not* be assumed to be
exhaustive. Juju evolves, and the workers evolve with it.

### Common workers

All agents run workers with the following responsibilities:

* Check for scheduled upgrades for their binaries, and replace themselves
  (implemented in `worker/upgrader`)
* Watch logging config, and reconfigure the local logger (`worker/logger`; yes,
  we know; it is not the stupidest name in the codebase)
* Watch and store the latest known addresses for the controllers
  (`worker/apiaddressupdater`)

### Machine Agent Workers

Machine agents additionally do the following:

* Run upgrade code in the new binaries once they've replaced themselves
  (implemented directly in the machine agent's `upgradeWorker` method)
* Handle SIGABRT and permanently stop the agent (`worker/terminationworker`)
* Handle the machine entity's death and permanently stop the agent (`worker/machiner`)
* Watch proxy config, and reconfigure the local machine (`worker/machineenvironmentworker`)
* Watch for contained LXC or KVM machines and provision/decommission them
  (`worker/provisioner`)

All machine agents have `JobHostUnits`. These run the `worker/deployer` code which
watches for units assigned to the machine, and deploys/recalls upstart configs
for their respective unit agents as the units are assigned/removed. We expect
the deployer implementation to change to just directly run the unit agents'
workers in its own Runner.

### Controller Workers

Machines with `JobManageModel` also run a number of other workers, which do
the following.

* Run the API server used by all other workers (in this, and other, agents:
  `state/apiserver`)
* Provision/decommission provider instances in response to the
  creation/destruction of machine entities (`worker/provisioner`, just like the
  container provisioners run in all machine agents anyway)
* Manipulate provider networks in response to units opening/closing ports,
  and users exposing/unexposing applications (`worker/firewaller`)
* Update network addresses and associated information for provider instances
  (`worker/instancepoller`)
* Respond to queued DB cleanup events (`worker/cleaner`)
* Maintain the MongoDB replica set (`worker/peergrouper`)
* Resume incomplete MongoDB transactions (`worker/resumer`)

Many of these workers (more than strictly need to be) are wrapped as "singular"
workers, which only run on the same machine as the current MongoDB replicaset
master. When the master changes, the state connection is dropped, causing all
those workers to also be stopped; when they're restarted, they won't run because
they're no longer running on the master.

### Unit Agents

Unit agents run all the common workers, and the `worker/uniter` task as well;
this task is probably the single most forbiddingly complex part of Juju. (Side
note: It's a unit-er because it deals with units, and we're bad at names; but
it's also a unite-r because it's where all the various components of Juju come
together to run actual workloads.) It's sufficiently large that it deserves its
own top-level heading, below.


## The Uniter

At the highest level, the Uniter is a state machine. After a "little bit" of setup,
it runs a tight loop in which it calls `Mode` functions one after another, with the
next mode run determined by the result of its predecessor.
All mode functions are
implemented in `worker/uniter/modes.go`, which is actually pretty small: just a hair
over 400 lines.

It's deliberately implemented as conceptually single-threaded (just like almost
everything else in Juju -- rampaging concurrency is the root of much evil, and so
we save ourselves a huge number of headaches by hiding concurrency behind event
channels and handling a single event at a time), but this property has degraded
over time; in particular, the `RunListener` code can inject events at unhelpful
times, and while the `hookLock` *probably* renders this safe it's still deeply
suboptimal, because the fact of the concurrency requires that we be *extremely*
careful with further modifications, lest they subtly break assumptions. We hope
to address this by retiring the current implementation of `juju run`, but it's
not entirely clear how to do this; in the meantime, Here Be Dragons.

Leaving these woes aside, the mode functions make use of two fundamental components,
which are glommed together until someone refactors it to make more sense. There's
the `Filter`, which is responsible for communicating with the API server (and the
rest of the outside world) such that relevant events can be delivered to the mode
func via channels exposed on the filter; and then there's the `Uniter` itself, which
exposes a number of methods that are expected to be called by the mode funcs.


### Uniter Modes

XXXX


### Hook contexts

XXXX


### The Relation Model

XXXX


## The Providers

A Juju provider represents a different possible kind of substrate on which a
Juju model can run, and (as far as possible) abstracts away the differences
between them, by making them all conform to the `Environ` interface.
The most
important thing to understand about the various providers is that they're all
implemented without reference to broader Juju concepts; they are squeezed into
a shape that's convenient with respect to allowing Juju to make use of them, but
if we allow Juju-level concepts to infect the providers we will suffer greatly,
because we will open a path by which changes to *Juju* end up causing changes
to *all the providers at once*.

However, we lack the ability to enforce this at present, because the package
dependency flows in the wrong direction, thanks primarily (purely?) to the
`StateInfo` method on `Environ`; and we jam all sorts of gymnastics into the state
package to allow us to use Environs without doing so explicitly (see the
`state.Policy` interface, and its many somewhat-inelegant uses). In other places,
we have (quite reasonably) moved code out of the environs package (see both
`environs/config.Config`, and `instances.Instance`).

Environ implementations are expected to be goroutine-safe; we don't make much
use of that property at the moment, but we will be coming to depend
upon it as we move to eliminate the wasteful proliferation of Environ instances
in the API server.

It's important to note that an environ Config will generally contain sensitive
information -- a user's authentication keys for a cloud provider -- and so we
must always be careful to avoid spreading those around further than we need to.
Basically, if an environ config gets off a controller, we've screwed up.