
     1  
     2  $Id: turing.xml,v 1.58 2002/11/09 03:53:26 stevegt Exp $
     3                            ****** Why Order Matters:
     4                                Turing Equivalence
     5                                        in
     6                      Automated Systems Administration ******
     7             Steve Traugott, TerraLuna, LLC -- http://www.stevegt.com
     8      Lance Brown, National Institute of Environmental Health Sciences --
     9                           lance@bearcircle.net
    10    Originally accepted for publication in the proceedings of the USENIX Large
    11  Installation System Administration conference, Philadelphia, PA Nov 3-8, 2002.
    12            Copyright 2002 Stephen Gordon Traugott, All Rights Reserved
    13  ***** Abstract *****
    14  Hosts in a well-architected enterprise infrastructure are self-administered;
    15  they perform their own maintenance and upgrades. By definition, self-
    16  administered hosts execute self-modifying code. They do not behave according to
    17  simple state machine rules, but can incorporate complex feedback loops and
    18  evolutionary recursion.
    19  The implications of this behavior are of immediate concern to the reliability,
    20  security, and ownership costs of enterprise computing. In retrospect, it
    21  appears that the same concerns also apply to manually-administered machines, in
    22  which administrators use tools that execute in the context of the target disk
    23  to change the contents of the same disk. The self-modifying behavior of both
    24  manual and automatic administration techniques helps explain the difficulty and
    25  expense of maintaining high availability and security in conventionally-
    26  administered infrastructures.
    27  The practice of infrastructure architecture tool design exists to bring order
    28  to this self-referential chaos. Conventional systems administration can be
    29  greatly improved upon through discipline, culture, and adoption of practices
    30  better fitted to enterprise needs. Creating a low-cost maintenance strategy
    31  largely remains an art. What can we do to put this art into the hands of
    32  relatively junior administrators? We think that part of the answer includes
    33  adopting a well-proven strategy for maintenance tools, based in part upon the
    34  theoretical properties of computing.
    35  In this paper, we equate self-administered hosts to Turing machines in order to
    36  help build a theoretical foundation for understanding this behavior. We discuss
    37  some tools that provide mechanisms for reliably managing self-administered
    38  hosts, using deterministic ordering techniques.
    39  Based on our findings, it appears that no tool, written in any language, can
    40  predictably administer an enterprise infrastructure without maintaining a
    41  deterministic, repeatable order of changes on each host. The runtime
    42  environment for any tool always executes in the context of the target operating
    43  system; changes can affect the behavior of the tool itself, creating circular
    44  dependencies. The behavior of these changes may be difficult to predict in
    45  advance, so testing is necessary to validate changed hosts. Once changes have
    46  been validated in testing they must be replicated in production in the same
    47  order in which they were tested, due to these same circular dependencies.
    48  The least-cost method of managing multiple hosts also appears to be
    49  deterministic ordering. All other known management methods seem to include
    50  either more testing or higher risk for each host managed.
    51  This paper is a living document; revisions and discussion can be found at
    52  Infrastructures.Org, a project of TerraLuna, LLC.
    53  ***** 1 Foreword *****
    54   ...by Steve Traugott
    55  In 1998, Joel Huddleston and I suggested that an entire enterprise
    56  infrastructure could be managed as one large "enterprise virtual machine" (EVM)
    57  [bootstrap]. That paper briefly described parts of a management toolset, later
    58  named ISconf [isconf]. This toolset, based on relatively simple makefiles and
    59  shell scripts, did not seem extraordinary at the time. At one point in the
    60  paper, we said that we would likely use cfengine [cfengine] the next time
    61  around -- I had been following Mark Burgess' progress since 1994.
    62  That 1998 paper spawned a web site and community at Infrastructures.Org. This
    63  community in turn helped launch the Infrastructure Architecture (IA) career
    64  field. In the intervening years, we've seen the Infrastructures.Org community
    65  grow from a few dozen to a few hundred people, and the IA field blossom from
    66  obscurity into a major marketing campaign by a leading systems vendor.
    67  Since 1998, Joel and I have both attempted to use other tools, including
    68  cfengine version 1. I've also tried to write tools from scratch again several
    69  times, with mixed success. We have repeatedly hit indications that our 1998
    70  toolset was more optimized than we had originally thought. It appears that in
    71  some ways Joel and I, and the rest of our group at the Bank, were lucky; our
    72  toolset protected us from many of the pitfalls that are lying in wait for IAs.
    73  One of these pitfalls appears to be deterministic ordering; I never realized
    74  how important it was until I tried to use other tools that don't support it.
    75  When left without the ability to concisely describe the order of changes to be
    76  made on a machine, I've seen a marked decrease in my ability to predict the
    77  behavior of those changes, and a large increase in my own time spent
    78  monitoring, troubleshooting, and coding for exceptions. These experiences have
    79  shown me that loss of order seems to result in lower production reliability and
    80  higher labor cost.
    81  The ordered behavior of ISconf was more by accident than design. I needed a
    82  quick way to get a grip on 300 machines. I cobbled a prototype together on my
    83  HP100LX palmtop one March '94 morning, during the 35-minute train ride into
    84  Manhattan. I used 'make' as the state engine because it's available on most
    85  UNIX machines. The deterministic behavior 'make' uses when iterating over
    86  prerequisite lists is something I didn't think of as important at the time -- I
    87  was more concerned with observing known dependencies than creating repeatable
    88  order.
    89  Using that toolset and the EVM mindset, we were able to repeatedly respond to
    90  the chaotic international banking mergers and acquisitions of the mid-90's.
    91  This response included building and rebuilding some of the largest trading
    92  floors in the world, launching on schedule each time, often with as little as a
    93  few months' notice, each launch cleaner than the last. We knew at the time that
    94  these projects were difficult; after trying other tool combinations for more
    95  recent projects I think I have a better appreciation for just how difficult
    96  they were. The phrase "throwing a truck through the eye of a needle" has
    97  crossed my mind more than once. I don't think we even knew the needle was
    98  there.
    99  At the invitation of Mark Burgess, I joined his LISA 2001 [lisa] cfengine
   100  workshop to discuss what we'd found so far, with possible targets for the
   101  cfengine 2.0 feature set. The ordering requirement seemed to need more work; I
   102  found ordering surprisingly difficult to justify to an audience practiced in
   103  the use of convergent tools, where ordering is often considered a constraint to
   104  be specifically avoided [couch] [eika-sandnes]. Later that week, Lance Brown
   105  and I were discussing this over dinner, and he hit on the idea of comparing a
   106  UNIX machine to a Turing machine. The result is this paper.
   107  Based on the symptoms we have seen when comparing ISconf to other tools, I
   108  suspect that ordering is a keystone principle in automated systems
   109  administration. Lance and I, with a lot of help from others, will attempt to
   110  offer a theoretical basis for this suspicion. We encourage others to attempt to
   111  refute or support this work at will; I think systems administration may be
   112  about to find its computer science roots. We have also already accumulated a
   113  large FAQ for this paper -- we'll put that on the website. Discussion on this
   114  paper as well as related topics is encouraged on the infrastructures mailing
   115  list at http://Infrastructures.Org.
   116  ***** 2 Why Order Matters *****
   117   There seem to be (at least) several major reasons why the order of changes
   118  made to machines is important in the administration of an enterprise
   119  infrastructure:
   120  A "circular dependency" or control-loop problem exists when an administrative
   121  tool executes code that modifies the tool or the tool's own foundations (the
   122  underlying host). Automated administration tool designers cannot assume that
   123  the users of their tool will always understand the complex behavior of these
   124  circular dependencies. In most cases we will never know what dependencies end
   125  users might create. See sections (8.40), (8.46).
   126  A test infrastructure is needed to test the behavior of changes before rolling
   127  them to production. No tool or language can remove this need, because no
   128  testing is capable of validating a change in any conditions other than those
   129  tested. This test infrastructure is useless unless there is a way to ensure
   130  that production machines will be built and modified in the same way as the test
   131  machines. See section (6), 'The_Need_for_Testing'.
   132  It appears that a tool that produces deterministic order of changes is cheaper
   133  to use than one that permits more flexible ordering. The unpredictable behavior
   134  resulting from unordered changes to disk is more costly to validate than the
   135  predictable behavior produced by deterministic ordering. See section (8.58).
   136  Because cost is a significant driver in the decision-making process of most IT
   137  organizations, we will discuss this point more in section (3).
   138  Local staff must be able to use administrative tools after a cost-effective
   139  (i.e. cheap and quick) turnover phase. While senior infrastructure architects
   140  may be well-versed in avoiding the pitfalls of unordered change, we cannot be
   141  on the permanent staff of every IT shop on the globe. In order to ensure
   142  continued health of machines after rollout of our tools, the tools themselves
   143  need to have some reasonable default behavior that is safe if the user lacks
   144  this theoretical knowledge. See section (8.54).
   145  This business requirement must be addressed by tool developers. In our own
   146  practice, we have been able to successfully turn over enterprise infrastructures
   147  to permanent staff many times over the last several years. Turnover training in
   148  our case is relatively simple, because our toolsets have always implemented
   149  ordered change by default. Without this default behavior, we would have also
   150  needed to attempt to teach advanced techniques needed for dealing with
   151  unordered behavior, such as inspection of code in vendor-supplied binary
   152  packages. See section (7.2.2), 'Right_Packages,_Wrong_Order'.
   153  ***** 3 A Prediction *****
   154   "Order Matters" when we care about both quality and cost while maintaining an
   155  enterprise infrastructure. If the ideas described in this paper are correct,
   156  then we can make the following prediction:
   157       The least-cost way to ensure that the behavior of any two hosts will
   158       remain completely identical is to always implement the same changes
   159       in the same order on both hosts.
   160  This sounds very simple, almost intuitive, and for many people it is. But to
   161  our knowledge, isconf [isconf] is the only generally-available tool which
   162  specifically supports administering hosts this way. There seems to be no prior
   163  art describing this principle, and in our own experience we have yet to see it
   164  specified in any operational procedure. It is trivially easy to demonstrate in
   165  practice, but has at times been surprisingly hard to support in conversation,
   166  due to the complexity of theory required for a proof.
   167  Note that this prediction does not apply only to those situations when you want
   168  to maintain two or more identical hosts. It applies to any computer-using
   169  organization that needs cost-effective, reliable operation. This includes those
   170  that have many unique production hosts. See section (6), 'The_Need_for_Testing'.
   171  Section (4.3) discusses this further, including single-host rebuilds
   172  after a security breach.
   173  This prediction also applies to disaster recovery (DR) or business continuity
   174  planning. Any part of a credible DR procedure includes some method of
   175  rebuilding lost hosts, often with new hardware, in a new location. Restoring
   176  from backups is one way to do this, but making complete backups of multiple
   177  hosts is redundant -- the same operating system components must be backed up
   178  for each host, when all we really need are the user data and host build
   179  procedures (how many copies of /bin/ls do we really need on tape?). It is
   180  usually more efficient to have a means to quickly and correctly rebuild each
   181  host from scratch. A tool that maintains an ordered record of changes made
   182  after install is one way to do this.
   183  This prediction is particularly important for those organizations using what we
   184  call self-administered hosts. These are hosts that run an automated
   185  configuration or administration tool in the context of their own operating
   186  environment. Commercial tools in this category include Tivoli, Opsware, and
   187  CenterRun [tivoli] [opsware] [centerrun]. Open-source tools include cfengine,
   188  lcfg, pikt, and our own isconf [cfengine] [lcfg] [pikt] [isconf]. We will
   189  discuss the fitness of some of these tools later -- not all appear fully suited
   190  to the task.
   191  This prediction applies to those organizations which still use an older
   192  practice called "cloning" to create and manage hosts. In cloning, an
   193  administrator or tool copies a disk image from one machine to another, then
   194  makes the changes needed to make the host unique (at minimum, IP address and
   195  hostname). After these initial changes, the administrator will often make
   196  further changes over the life of the machine. These changes may be required for
   197  additional functionality or security, but are too minor to justify re-cloning.
   198  Unless order is observed, identical changes made to multiple hosts are not
   199  guaranteed to behave in a predictable way (8.47). The procedure needed for
   200  properly maintaining cloned machines is not substantially different from that
   201  described in section (7.1).
   202  This prediction, stated more formally in section (8.58), seems to apply to
   203  UNIX, Windows, and any other general-purpose computer with a rewritable disk
   204  and modern operating system. More generally, it seems to apply to any von
   205  Neumann machine with rewritable nonvolatile storage.
   206  ***** 4 Management Methods *****
   207  All computer systems management methods can be classified into one of three
   208  categories: divergent, convergent, and congruent.
   209  **** 4.1 Divergence ****
   210  Divergence (figure_4.1.1) generally implies bad management. Experience shows us
   211  that virtually all enterprise infrastructures are still divergent today.
   212  Divergence is characterized by the configuration of live hosts drifting away
   213  from any desired or assumed baseline disk content.
   214  [images/divergence.png]
   215       Figure 4.1.1: Divergence
   216  One quick way to tell if a shop is divergent is to ask how changes are made on
   217  production hosts, how those same changes are incorporated into the baseline
   218  build for new or replacement hosts, and how they are made on hosts that were
   219  down at the time the change was first deployed. If you get different answers,
   220  then the shop is divergent.
   221  The symptoms of divergence include unpredictable host behavior, unscheduled
   222  downtime, unexpected package and patch installation failure, unclosed security
   223  vulnerabilities, significant time spent "firefighting", and high
   224  troubleshooting and maintenance costs.
   225  The causes of divergence are generally that class of operations that create
   226  non-reproducible change. Divergence can be caused by ad-hoc manual changes,
   227  changes implemented by two independent automatic agents on the same host, and
   228  other unordered changes. Scripts which drive rdist, rsync, ssh, scp, [rdist]
   229  [rsync] [ssh] or other change agents as a push operation [bootstrap] are also a
   230  common source of divergence.
   231  **** 4.2 Convergence ****
   232  Convergence (figure_4.2.1) is the process most senior systems administrators
   233  first begin when presented with a divergent infrastructure. They tend to start
   234  by manually synchronizing some critical files across the diverged machines,
   235  then they figure out a way to do that automatically. Convergence is
   236  characterized by the configuration of live hosts moving towards an ideal
   237  baseline. By definition, all converging infrastructures are still diverged to
   238  some degree. (If an infrastructure maintains full compliance with a fully
   239  descriptive baseline, then it is congruent according to our definition, not
   240  convergent. See section (4.3), 'Congruence'.)
   241  [images/convergence.png]
   242       Figure 4.2.1: Convergence
   243  The baseline description in a converging infrastructure is characteristically
   244  an incomplete description of machine state. You can quickly detect convergence
   245  in a shop by asking how many files are currently under management control. If
   246  an approximate answer is readily available and is on the order of a few hundred
   247  files or less, then the shop is likely converging legacy machines on a file-by-
   248  file basis.
   249  A convergence tool is an excellent means of bringing some semblance of order to
   250  a chaotic infrastructure. Convergent tools typically work by sampling a small
   251  subset of the disk -- via a checksum of one or more files, for example -- and
   252  taking some action in response to what they find. The samples and actions are
   253  often defined in a declarative or descriptive language that is optimized for
   254  this use. This emulates and preempts the firefighting behavior of a reactive
   255  human systems administrator -- "see a problem, fix it". Automating this process
   256  provides great economies of scale and speed over doing the same thing manually.
   257  Convergence is a feature of Mark Burgess' Computer Immunology principles
   258  [immunology]. His cfengine is in our opinion the best tool for this job
   259  [cfengine]. Simple file replication tools [sup] [cvsup] [rsync] provide a
   260  rudimentary convergence function, but without the other action semantics and
   261  fine-grained control that cfengine provides.
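        As an illustration only, the following shell fragment sketches this
        "see a problem, fix it" sampling pattern for a single file. It is not
        cfengine syntax, and the golden-copy path and log message are
        placeholders invented for the example.

        	#!/bin/sh
        	# Minimal convergence sketch: sample a checksum of one managed
        	# file and repair it if it has drifted from the desired copy.
        	MANAGED=/etc/inetd.conf
        	GOLDEN=/var/adm/golden/inetd.conf   # hypothetical desired copy

        	live=`cksum < $MANAGED`             # checksum of the live file
        	want=`cksum < $GOLDEN`              # checksum of the baseline

        	if [ "$live" != "$want" ]; then
        	        cp $GOLDEN $MANAGED         # converge toward baseline
        	        logger "converged $MANAGED" # leave a trail for review
        	fi

        A real convergence tool generalizes this to many files and actions,
        with the declarative controls described above.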
   262  Because convergence typically includes an intentional process of managing a
   263  specific subset of files, there will always be unmanaged files on each host.
   264  Whether current differences between unmanaged files will have an impact on
   265  future changes is undecidable, because at any point in time we do not know the
   266  entire set of future changes, or what files they will depend on.
   267  It appears that a central problem with convergent administration of an
   268  initially divergent infrastructure is that there is no documentation or
   269  knowledge as to when convergence is complete. One must treat the whole
   270  infrastructure as if the convergence is incomplete, whether it is or not. So
   271  without more information, an attempt to converge formerly divergent hosts to an
   272  ideal configuration is a never-ending process. By contrast, an infrastructure
   273  based upon first loading a known baseline configuration on all hosts, and
   274  limited to purely orthogonal and non-interacting sets of changes, implements
   275  congruence (4.3). Unfortunately, this is not the way most shops use convergent
   276  tools such as cfengine.
   277  The symptoms of a convergent infrastructure include a need to test all changes
   278  on all production hosts, in order to detect failures caused by remaining
   279  unforeseen differences between hosts. These failures can impact production
   280  availability. The deployment process includes iterative adjustment of the
   281  configuration tools in response to newly discovered differences, which can
   282  cause unexpected delays when rolling out new packages or changes. There may be
   283  a higher incidence of failures when deploying changes to older hosts. There may
   284  be difficulty eliminating some of the last vestiges of the ad-hoc methods
   285  mentioned in section (4.1). Continued use of ad-hoc and manual methods
   286  virtually ensures that convergence cannot complete.
   287  With all of these faults, convergence still provides much lower overall
   288  maintenance costs and better reliability than what is available in a divergent
   289  infrastructure. Convergence features also provide more adaptive self-healing
   290  ability than pure congruence, due to a convergence tool's ability to detect
   291  when deviations from baseline have occurred. Congruent infrastructures rely on
   292  monitoring to detect deviations, and generally call for a rebuild when they
   293  have occurred. We discuss the security reasons for this in section (4.3).
   294  We have found apparent limits to how far convergence alone can go. We know of
   295  no previously divergent infrastructure that, through convergence alone, has
   296  reached congruence (4.3). This makes sense; convergence is a process of
   297  eliminating differences on an as-needed basis; the managed disk content will
   298  generally be a smaller set than the unmanaged content. In order to prove
   299  congruence, we would need to sample all bits on each disk, ignore those that
   300  are user data, determine which of the remaining bits are relevant to the
   301  operation of the machine, and compare those with the baseline.
   302  In our experience, it is not enough to prove via testing that two hosts
   303  currently exhibit the same behavior while ignoring bit differences on disk; we
   304  care not only about current behavior, but future behavior as well. Bit
   305  differences that are currently deemed not functional, or even those that truly
   306  have not been exercised in the operation of the machine, may still affect the
   307  viability of future change directives. If we cannot predict the viability of
   308  future change actions, we cannot predict the future viability of the machine.
   309  Deciding what bit differences are "functional" is often open to individual
   310  interpretation. For instance, do we care about the order of lines and comments
   311  in /etc/inetd.conf? We might strip out comments and reorder lines without
   312  affecting the current operation of the machine; this might seem like a non-
   313  functional change, until two years from now, when the lack of comments will
   314  affect our ability to correctly understand the
   315  infrastructure when designing a new change. This example would seem to indicate
   316  that even non-machine-readable bit differences can be meaningful when
   317  attempting to prove congruence.
   318  Unless we can prove congruence, we cannot validate the fitness of a machine
   319  without thorough testing, due to the uncertainties described in section (8.25).
   320  In order to be valid, this testing must be performed on each production host,
   321  due to the factors described in section (8.47). This testing itself requires
   322  either removing the host from production use or exposing untested code to
   323  users. Without this validation, we cannot trust the machine in mission-critical
   324  operation.
   325  **** 4.3 Congruence ****
   326  Congruence (figure_4.3.1) is the practice of maintaining production hosts in
   327  complete compliance with a fully descriptive baseline (7.1). Congruence is
   328  defined in terms of disk state rather than behavior, because disk state can be
   329  fully described, while behavior cannot (8.59).
   330  [images/congruence.png]
   331       Figure 4.3.1: Congruence
   332  By definition, divergence from baseline disk state in a congruent environment
   333  is symptomatic of a failure of code, administrative procedures, or security. In
   334  any of these three cases, we may not be able to assume that we know exactly
   335  which disk content was damaged. It is usually safe to handle all three cases as
   336  a security breach: correct the root cause, then rebuild.
   337  You can detect congruence in a shop by asking how the oldest, most complex
   338  machine in the infrastructure would be rebuilt if destroyed. If years of
   339  sysadmin work can be replayed in an hour, unattended, without resorting to
   340  backups, and only user data need be restored from tape, then host management is
   341  likely congruent.
   342  Rebuilds in a congruent infrastructure are completely unattended and generally
   343  faster than in any other; anywhere from 10 minutes for a simple workstation to
   344  2 hours for a node in a complex high-availability server cluster (most of that
   345  two hours is spent in blocking sleeps while meeting barrier conditions with
   346  other nodes).
   347  Symptoms of a congruent infrastructure include rapid, predictable, "fire-and-
   348  forget" deployments and changes. Disaster recovery and production sites can be
   349  easily maintained or rebuilt on demand in a bit-for-bit identical state.
   350  Changes are not tested for the first time in production, and there are no
   351  unforeseen differences between hosts. Unscheduled production downtime is
   352  reduced to that caused by hardware and application problems; firefighting
   353  activities drop considerably. Old and new hosts are equally predictable and
   354  maintainable, and there are fewer host classes to maintain. There are no ad-hoc
   355  or manual changes. We have found that congruence makes cost of ownership much
   356  lower, and reliability much higher, than any other method.
   357  Our own experience and calculations show that the return-on-investment (ROI) of
   358  converting from divergence to congruence is less than 8 months for most
   359  organizations. See (figure_4.3.2). This graph assumes an existing divergent
   360  infrastructure of 300 hosts, 2%/month growth rate, followed by adoption of
   361  congruent automation techniques. Typical observed values were used for other
   362  input parameters. Automation tool rollout began at the 6-month mark in this
   363  graph, causing temporarily higher costs; return on this investment is in 5
   364  months, where the manual and automatic lines cross over at the 11 month mark.
   365  Following crossover, we see a rapidly increasing cost savings, continuing over
   366  the life of the infrastructure. While this graph is calculated, the results
   367  agree with actual enterprise environments that we have converted. There is a
   368  CGI generator for this graph at Infrastructures.Org, where you can experiment
   369  with your own parameters.
   370  [images/t7a_automation_curve.png]
   371       Figure 4.3.2: Cumulative costs for fully automated (congruent) versus
   372       manual administration.
   373  Congruence allows us to validate a change on one host in a class, in an
   374  expendable test environment, then deploy that change to production without risk
   375  of failure. Note that this is useful even when (or especially when) there may
   376  be only one production host in that class.
   377  A congruence tool typically works by maintaining a journal of all changes to be
   378  made to each machine, including the initial image installation. The journal
   379  entries for a class of machine drive all changes on all machines in that class.
   380  The tool keeps a lifetime record, on the machine's local disk, of all changes
   381  that have been made on a given machine. In the case of loss of a machine, all
   382  changes made can be recreated on a new machine by "replaying" the same journal;
   383  likewise for creating multiple, identical hosts. The journal is usually
   384  specified in a declarative language that is optimized for expressing ordered
   385  sets and subsets. This allows subclassing and easy reuse of code to create new
   386  host types. See section (7.1), 'Describing_Disk_State'.
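        To make the journaling idea concrete, here is a deliberately
        oversimplified shell sketch -- not isconf's actual implementation. It
        assumes a hypothetical /var/is/journal holding one change action per
        line, in the order they were tested, and keeps the lifetime record in
        a local 'done' file:

        	#!/bin/sh
        	# Replay journal entries in order, once each, halting on error.
        	JOURNAL=/var/is/journal   # ordered change actions for this class
        	DONE=/var/is/done         # lifetime record kept on local disk

        	set -e                    # halt on error; never skip ahead
        	touch $DONE
        	n=0
        	while read action; do
        	        n=`expr $n + 1`
        	        grep -q "^$n$" $DONE && continue  # already applied here
        	        sh -c "$action"          # apply, in journal order
        	        echo $n >> $DONE         # record it for the host's lifetime
        	done < $JOURNAL

        Replaying the same journal on a new machine of the same class
        reproduces the host; appending to the journal updates every machine
        in the class.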
   387  There are few tools that are capable of the ordered lifetime journaling
   388  required for congruent behavior. Our own isconf (7.3.1) is the only
   389  specifically congruent tool we know of in production use, though cfengine, with
   390  some care and extra coding, appears to be usable for administration of
   391  congruent environments. We discuss this in more detail in section (7.3.2).
   392  We recognize that congruence may be the only acceptable technique for managing
   393  life-critical systems infrastructures, including those that:
   394      * Influence the results of human-subject health and medicine experiments
   395      * Provide command, control, communications, and intelligence (C3I) for
   396        battlefield and weapons systems environments
   397      * Support command and telemetry systems for manned aerospace vehicles,
   398        including spacecraft and national airspace air traffic control
   399  Our personal experience shows that awareness of the risks of conventional host
   400  management techniques has not yet penetrated many of these organizations. This
   401  is cause for concern.
   402  ***** 5 Ordered Thinking *****
   403  We have found that designers of automated systems administration tools can
   404  benefit from a certain mindset:
   405       Think like a kernel developer, not an application programmer.
   406  A good multitasking operating system is designed to isolate applications (and
   407  their bugs) from each other and from the kernel, and produce the illusion of
   408  independent execution. Systems administration is all about making sure that
   409  users continue to see that illusion.
   410  Modern languages, compilers, and operating systems are designed to isolate
   411  applications programmers from "the bare hardware" and the low-level machine
   412  code, and enable object-oriented, declarative, and other high-level
   413  abstractions. But it is important to remember that the central processing
   414  unit(s) on a general-purpose computer only accepts machine-code instructions, and
   415  these instructions are coded in a procedural language. High-level languages are
   416  convenient abstractions, but are dependent on several layers of code to deliver
   417  machine language instructions to the CPU.
   418  In reality, on any computer there is only one program; it starts running when
   419  the machine finishes power-on self test (POST), and stops when you kill the
   420  power. This program is machine language code, dynamically linked at runtime,
   421  calling in fragments of code from all over the disk. These "fragments" of code
   422  are what we conventionally think of as applications, shared libraries, device
   423  drivers, scripts, commands, administrative tools, and the kernel itself -- all
   424  of the components that make up the machine's operating environment.
   425  None of these fragments can run standalone on the bare hardware -- they all
   426  depend on others. We cannot analyze the behavior of any application-layer tool
   427  as if it were a standalone program. Even kernel startup depends on the
   428  bootloader, and in some operating systems the kernel runtime characteristics
   429  can be influenced by one or more configuration files found elsewhere on disk.
   430  This perspective is opposite from that of an application programmer. An
   431  application programmer "sees" the system as an axiomatic underlying support
   432  infrastructure, with the application in control, and the kernel and shared
   433  libraries providing resources. A kernel developer, though, is on the other side
   434  of the syscall interface; from this perspective, an application is something
   435  you load, schedule, confine, and kill if necessary.
   436  On a UNIX machine, systems administration tools are generally ordinary
   437  applications that run as root. This means that they, too, are at the mercy of
   438  the kernel. The kernel controls them, not the other way around. And yet, we
   439  depend on automated systems administration tools to control, modify, and
   440  occasionally replace not only that kernel, but any and all other disk content.
   441  This presents us with the potential for a circular dependency chain.
   442  A common misconception is that "there is some high-level tool language that
   443  will avoid the need to maintain strict ordering of changes on a UNIX machine".
   444  This belief requires that the underlying runtime layers obey axiomatic and
   445  immutable behavioral laws. When using automated administration tools we cannot
   446  consider the underlying layers to be axiomatic; the administration tool itself
   447  perturbs those underlying layers. See section (7.2.3), 'Circular_Dependencies'.
   448  Inspection of high-level code alone is not enough. Without considering the
   449  entire system and its resulting machine language code, we cannot prove
   450  correctness. For example:
   451   print "hello\n";
   452  This looks like a trivial-enough Perl program; it "obviously" should work. But
   453  what if the Perl interpreter is broken? In other words, a conclusion of "simple
   454  enough to easily prove" can only be made by analyzing low-level machine
   455  language code, and the means by which it is produced.
   456  "Order Matters" because we need to ensure that the machine-language
   457  instructions resulting from a set of change actions will execute in the correct
   458  order, with the correct operands. Unless we can prove program correctness at
   459  this low level, we cannot prove the correctness of any program. It does no good
   460  to prove correctness of a higher-level program when we do not know the
   461  correctness of the lower runtime layers. If the high-level program can modify
   462  those underlying layers, then the behavior of the program can change with each
   463  modification. Ordering of those modifications appears to be important to our
   464  ability to predict the behavior of the high-level program. (Put simply, it is
   465  important to ensure that you can step off of the tree limb before you cut
   466  through it.)
   467  ***** 6 The Need for Testing *****
   468  Just as we urge tool designers to think like kernel developers (5), we urge
   469  systems administrators to think like operating systems vendors -- because they
   470  are. Systems administration is actually systems modification; the administrator
   471  replaces binaries and alters configuration files, creating a combination which
   472  the operating system vendor has never tested. Since many of these modifications
   473  are specific to a single site or even a single machine, it is unreasonable to
   474  assume that the vendor has done the requisite testing. The systems
   475  administrator must perform the role of systems vendor, testing each unique
   476  combination -- before the users do.
   477  Due to modern society's reliance on computers, it is unethical (and just plain
   478  bad business practice) for an operating system vendor to release untested
   479  operating systems without at least noting them as such. Better system vendors
   480  undertake a rigorous and exhaustive series of unit, system, regression,
   481  application, stress, and performance testing on each build before release,
   482  knowing full well that no amount of testing is ever enough (8.9). They do this
   483  in their own labs; it would make little sense to plan to do this testing on
   484  customers' production machines.
   485  And yet, IT shops today habitually have no dedicated testing environment for
   486  validating changed operating systems. They deploy changes directly to
   487  production without prior testing. Our own experience and informal surveys show
   488  that greater than 95% of shops still do business this way. It is no wonder that
   489  reliability, security, and high availability are still major issues in IT.
   490  We urge systems administrators to create and use dedicated testing
   491  environments, not inflict changes on users without prior testing, and consider
   492  themselves the operating systems vendors that they really are. We urge IT
   493  management organizations to understand and support administrators in these
   494  efforts; the return on investment is in the form of lower labor costs and much
   495  higher user satisfaction. See section (8.42). Availability of a test
   496  environment enables the deployment of automated systems administration tools,
   497  bringing major cost savings. See (figure_4.3.2).
   498  A test environment is useless until we have a means to replicate the changes we
   499  made in testing onto production machines. "Order matters" when we do this
   500  replication; an earlier change will often affect the outcome of a later change.
   501  This means that changes made to a test machine must later be "replayed" in the
   502  same order on the machine's production counterpart. See section (8.45).
   503  Testing costs can be greatly reduced by limiting the number of unique builds
   504  produced; this holds true for both vendors and administrators. This calls for
   505  careful management of changes and host classes in an IT environment, with an
   506  intent of limiting proliferation of classes. See section (8.41).
   507  Note that use of open-source operating systems does not remove the need for
   508  local testing of local modifications. In any reasonably complex infrastructure,
   509  there will always be local configuration and non-packaged binary modifications
   510  which the community cannot have previously exercised. We prefer open source; we
   511  do not expect it to relieve us of our responsibilities, though.
   512  ***** 7 Ordering HOWTO *****
   513  Automated systems administration is very straightforward. There is only one way
   514  for a user-side administrative tool to change the contents of disk in a running
   515  UNIX machine -- the syscall interface. The task of automated administration is
   516  simply to make sure that each machine's kernel gets the right system calls, in
   517  the right order, to make it be the machine you want it to be.
   518  **** 7.1 Describing Disk State ****
   519  If there are N bits on a disk, then there are 2^N possible disk states. In
   520  order to maintain the baseline host description needed for congruent
   521  management, we need to have a way to describe any arbitrary disk state in a
   522  highly compressed way, preferably in a human-readable configuration file or
   523  script. For the purposes of this description, we neglect user data and log
   524  files -- we want to be able to describe the root-owned and administered
   525  portions of disk.
   526  "Order Matters" whether creating or modifying a disk:
   527       A concise and reliable way to describe any arbitrary state of a disk
   528       is to describe the procedure for creating that state.
   529  This procedure will include the initial state (bare-metal build) of the disk,
   530  followed by the steps used to change it over time, culminating in the desired
   531  state. This procedure must be in writing, preferably in machine-readable form.
   532  This entire set of information, for all hosts, constitutes the baseline
   533  description of a congruent infrastructure. Each change added to the procedure
   534  updates the baseline. See section (4.3), 'Congruence'.
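        Purely as an illustration, such a written procedure for one host class
        might look like the following shell fragment. The package names,
        versions, and file paths are placeholders, not recommendations:

        	#!/bin/sh
        	# Hypothetical baseline procedure for one host class.
        	set -e    # halt on error (7.2.1)

        	# Starting state: the bare-metal install, e.g. a Jumpstart or
        	# Kickstart profile kept under version control with this script.

        	# Ordered changes made since install, oldest first.  New steps
        	# are only ever appended, never inserted or edited (8.49).
        	apt-get -y install ntp=4.1.0-8     # versions pinned, as in (7.3)
        	cp files/ntp.conf /etc/ntp.conf
        	apt-get -y install sudo=1.6.6-1
        	cp files/sudoers /etc/sudoers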
   535  There are tools which can help you maintain and execute this procedure. See
   536  section (7.3), 'Example_Tools_and_Techniques'.
   537  While it is conceivable that this procedure could be a documented manual
   538  process, executing these steps manually is tedious and costly at best. (Though
   539  we know of many large mission-critical shops which try.) It is generally error-
   540  prone. Manual execution of complex procedures is one of the best methods we
   541  know of for generating divergence (4.1).
   542  The starting state (bare-metal install) description of the disk may take the
   543  form of a network install tool's configuration file, such as that used for
   544  Solaris Jumpstart or RedHat Kickstart. The starting state might instead be a
   545  bitstream representing the entire initial content of the disk (usually a
   546  snapshot taken right after install from vendor CD). The choice of which of
   547  these methods to use is usually dependent on the vendor-supplied install tool
   548  -- some will support either method, some require one or the other.
   549  **** 7.2 How to Break an Enterprise ****
   550  A systems administrator, whether a human or a piece of software (8.36), can
   551  easily break an enterprise infrastructure by executing the right actions in the
   552  wrong order. In this section, we will explore some of the ways this can happen.
   553  *** 7.2.1 Right Commands, Wrong Order ***
   554  First we will cover a trivial but devastating example that is easily avoided.
   555  This once happened to a colleague while doing manual operations on a machine.
   556  He wanted to clean out the contents of a directory which ordinarily had the
   557  development group's source code NFS mounted over top of it. Here is what he
   558  wanted to do:
   559  	umount /apps/src
   560  	cd /apps/src
   561  	rm -rf .
   562  	mount /apps/src
   563  				
   564  Here's what he actually did:
   565  	umount /apps/src
   566  		...umount fails, directory in use; while resolving
   567  		this, his pager goes off, he handles the interrupt,
   568  		then...
   569  	cd /apps/src
   570  	rm -rf .
   571  				
   572  Needless to say, there had also been no backup of the development source tree
   573  for quite some time...
   574  In this example, "correct order" includes some concept of sufficient error
   575  handling. We show this example because it highlights the importance of a
   576  default behavior of "halt on error" for automatic systems administration tools.
   577  Not all tools halt on error by default; isconf does (7.3.1).
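        For illustration, the same sequence written as a short script with
        halt-on-error as the default behavior might look like this sketch:

        	#!/bin/sh
        	# Same intended steps as above; if the umount fails, nothing
        	# after it runs and the NFS-mounted source tree is never touched.
        	set -e              # abort on the first command that fails

        	umount /apps/src    # a failure here stops the whole script
        	cd /apps/src
        	rm -rf .            # only reached once the mount is really gone
        	mount /apps/src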
   578  *** 7.2.2 Right Packages, Wrong Order ***
   579  We in the UNIX community have long accused Windows developers of poor library
   580  management, due to the fact that various Windows applications often come
   581  bundled with differing versions of the same DLLs. It turns out that at least
   582  some UNIX and Linux distributions appear to suffer from the same problem.
   583  Jeffrey D'Amelia and John Hart [hart] demonstrated this in the case of RedHat
   584  RPMs, both official and contributed. They showed that the order in which you
   585  install RPMs can matter, even when there are no applicable dependencies
   586  specified in the package. We don't assume that this situation is restricted to
   587  RPMs only -- any package management system should be susceptible to this
   588  problem. An interesting study would be to investigate similar overlaps in
   589  vendor-supplied packages for commercial UNIX distributions.
   590  Detecting this problem for any set of packages involves extensive analysis by
   591  talented persons. In the case of [hart], the authors developed a suite of
   592  global analysis tools, and repeatedly downloaded and unpacked thousands of
   593  RPMs. They still only saw "the tip of the iceberg" (their words). They
   594  intentionally ignored the actions of postinstall scripts, and they had not yet
   595  executed any packaged code to look for behavioral interactions.
   596  Avoiding the problem is easier; install the packages, record the order of
   597  installation, test as usual, and when satisfied with testing, install the same
   598  packages in the same order on production machines.
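        A minimal sketch of this record-and-replay idea follows; the script
        name, log location, and calling convention are hypothetical, and
        tools such as isconf (7.3.1) do this bookkeeping for you:

        	#!/bin/sh
        	# pkg-order: record package installs on the test host, then
        	# replay them in the identical order on production.
        	LOG=/var/adm/pkg-order
        	set -e                     # halt on error in either mode (7.2.1)

        	case "$1" in
        	record)                    # on the test host
        	        shift
        	        apt-get -y install "$@"   # e.g. foo=0.17-9, version pinned
        	        echo "$@" >> $LOG         # append to the ordered record
        	        ;;
        	replay)                    # on the production counterpart
        	        while read pkgs; do
        	                apt-get -y install $pkgs  # same packages, same order
        	        done < $LOG
        	        ;;
        	esac

        On the test host one might run 'pkg-order record bar=1.0.2-1'; once
        testing is complete, the log file travels with the change to
        production, where 'pkg-order replay' applies the identical sequence.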
   599  While we've used packages in this example, we'd like to remind the reader that
   600  these considerations apply not only to package installation but any other
   601  change that affects the root-owned portions of disk.
   602  *** 7.2.3 Circular Dependencies ***
   603  There is a "chicken and egg" or bootstrapping problem when updating either an
   604  automated systems administration tool (ASAT) or its underlying foundations
   605  (8.40). Order is important when changes the tool makes can change the ability
   606  of the tool to make changes.
   607  For example, cfengine version 2 includes new directives available for use in
   608  configuration files. Before using a new configuration file, the new version of
   609  cfengine needs to be installed. The new client is named 'cfagent' rather than
   610  'cfengine', so wrapper scripts and crontab entries will also need to be
   611  updated, and so on.
   612  For fully automated operation on hundreds or thousands of machines, we would
   613  like to be able to upgrade cfengine under the control of cfengine (8.46). We
   614  want to ensure that the following actions will take place on all machines,
   615  including those currently down:
   616     1. fetch new configuration file containing the following instructions
   617     2. install new cfagent binary
   618     3. run cfkey to generate key pair
   619     4. fetch new configuration file containing version 2 directives
   620     5. update calling scripts and crontab entries
   621  There are several ordering considerations here. We won't know that we need the
   622  new cfagent binary until we do step 1. We shouldn't proceed with step 4 until
   623  we know that 2 and 3 were successful. If we do 5 too early, we may break the
   624  ability for cfengine to operate at all. If we do step 4 too early and try to
   625  run the resulting configuration file using the old version of cfengine, it will
   626  fail.
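        For illustration only, the five steps might be expressed as a single
        ordered, halt-on-error chain along the following lines. The URLs,
        paths, and fetch commands are placeholders; this is a sketch of the
        ordering, not cfengine's actual upgrade mechanism:

        	#!/bin/sh
        	# Ordered cfengine 1 -> 2 upgrade sketch; each step runs only if
        	# every prior step succeeded, so an abort leaves version 1 usable.
        	set -e

        	# 1. fetch the new configuration file (calls for the steps below)
        	wget -O /var/cfengine/inputs/update.conf.new http://server.example/update.conf

        	# 2. install the new cfagent binary
        	wget -O /usr/local/sbin/cfagent.new http://server.example/cfagent
        	chmod 755 /usr/local/sbin/cfagent.new
        	mv /usr/local/sbin/cfagent.new /usr/local/sbin/cfagent

        	# 3. run cfkey to generate the key pair
        	cfkey

        	# 4. only now switch to the configuration file with v2 directives
        	mv /var/cfengine/inputs/update.conf.new /var/cfengine/inputs/update.conf

        	# 5. last of all, repoint calling scripts and crontab entries
        	crontab /etc/cfagent.crontab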
   627  While this example may seem straightforward, implementing it in a language
   628  which does not by default support deterministic ordering requires much use of
   629  conditionals, state chaining, or equivalent. If this is the case, then code
   630  flow will not be readily apparent, making inspection and edits error-prone.
   631  Infrastructure automation code runs as root and has the ability to stop work
   632  across the entire enterprise; it needs to be simple, short, and easy for humans
   633  to read, like security-related code paths in tools such as PGP or ssh.
   634  If the tool's language does not support "halt on error" by default, then it is
   635  easy to inadvertently allow later actions to take place when we would have
   636  preferred to abort. Going back to our cfengine example, if we can easily abort
   637  and leave the cfengine version 1 infrastructure in place, then we can still use
   638  version 1 to repair the damage.
   639  *** 7.2.4 Other Sources of Breakage ***
   640  There are many other examples we could show, some including multi-host
   641  "barrier" problems. These include:
   642      * Updating ssh to openssh on hundreds of hosts and getting the
   643        authorized_keys and/or protocol version configuration out of order. This
   644        can greatly hinder further contact with the target hosts. Daniel Hagerty
   645        [hagerty] ran into this one; many of us have been bitten by this at some
   646        point.
   647      * Reconfiguring network routes or interfaces while communicating with the
   648        target device via those same routes or interfaces. Ordering errors can
   649        prevent further contact with the target, and often require a physical
   650        visit to resolve. This is especially true if the target is a workstation
   651        with no remote serial console access. Again, most readers have had this
   652        happen to them.
   653  **** 7.3 Example Tools and Techniques ****
   654  While there are many automatic systems administration tools (ASAT) available,
   655  the two we are most familiar with are cfengine and our own isconf [cfengine]
   656  [isconf]. In this section, we will look at these two tools from the perspective
   657  of Turing equivalence (8), with a focus on how each can be used
   658  deterministically.
   659  In general, some of the techniques that seem to work well for the design and
   660  use of most ASATs include:
   661      * Keep the "Turing tape" a finite size by holding the network content
   662        constant (8.23), or versioning it using CVS or another version control
   663        tool [cvs] [bootstrap]. This helps prevent some of the more insidious
   664        behaviors that are a potential in self-modifying machines (8.40).
   665      * Continuing in that vein, when using distributed package repositories such
   666        as the public Debian [debian] package server infrastructure, always
   667        specify version numbers when automating the installation of packages,
   668        rather than let the package installation tool (in Debian's case apt-get)
   669        select the latest version. If you do not specify the package version,
   670        then you may introduce divergence (4.1). This risk varies, of course,
   671        depending on your choice of 'stable' or 'unstable' distribution, though
   672        we suspect it still applies in 'stable', especially when using the
   673        'security' packages. It certainly applies in all cases when you need to
   674        maintain your own kernel or kernel modules rather than using the
   675        distributed packages.
   676        We have experienced this repeatedly -- machines which built correctly the
   677        first time with a given package list will not rebuild with the same
   678        package list a few weeks later, due to package version changes on the
   679        public servers, and resulting unresolved incompatibilities with local
   680        conditions and configuration file contents. Remember, your hosts are
   681        unique in the world -- there are likely no others like them. Package
   682        maintainers cannot be expected to test every configuration, especially
   683        yours. You must retain this responsibility. See section (6),
   684        'The_Need_for_Testing'.
   685        We use Debian in this example because it is a distribution we like a lot;
   686        note that other package distribution and installation infrastructures,
   687        such as the RedHat up2date system, also have this problem.
   688      * Expect long dependency or sequence chains when building enterprise
   689        infrastructures. If an ASAT can easily support encapsulation and ordering
   690        of 10, 50, or even 100 complex atomic actions in a single chain, then it
   691        is likely capable of fully automated administration of machines,
   692        including package, kernel, build, and even rebuild management. If the
   693        ASAT is cumbersome to use when chains become only two or three actions
   694        deep, then it is likely most suited for configuration file management,
   695        not package, binary, or kernel manipulation.
   696  *** 7.3.1 ISconf ***
   697  As we mentioned in section (1), isconf originally began life as a quick hack.
   698  Its basic utility has proven itself repeatedly over the last 8 years, and as
   699  adoption has grown, it now manages more production infrastructures than we
   700  are personally aware of.
   701  While we show some ISconf makefile examples here, we do not show any example of
   702  the top-level configuration file which drives the environment and targets for
   703  'make'. It is this top-level configuration file, and the scripts which
   704  interpret it, which are the core of ISconf and enable the typing or classing of
   705  hosts. These top-level facilities also govern the actions ISconf is to take
   706  during boot versus cron or other execution contexts. More information
   707  and code is available at ISconf.org and Infrastructures.Org.
   708  We also do not show here the network fetch and update portions of ISconf, and
   709  the way that it updates its own code and configuration files at the beginning
   710  of each run. This default behavior is something that we feel is important in
   711  the design of any automated systems administration tool. If the tool does not
   712  support it, end-users will have to figure out how to do it themselves, reducing
   713  the usability of the tool.
   714  ** 7.3.1.1 ISconf Version 2 **
   715  Version 2 of ISconf was a late-90's rewrite to clean up and make portable the
   716  lessons learned from version 1. As in version 1, the code used was Bourne
   717  shell, and the state engine used was 'make'.
   718  In (listing 1), we show a simplified example of Version 2 usage. While examples
   719  related to this can be found in [hart] and in our own makefiles, real-world
   720  usage is usually much more complex than the example shown here. We've contrived
   721  this one for clarity of explanation.
   722  In this contrived example, we install two packages which we have not proven
   723  orthogonal. We in fact do not wish to take the time to detect whether or not
   724  they are orthogonal, due to the considerations expressed in section (8.58). We
   725  may be tool users, rather than tool designers, and may not have the skillset to
   726  determine orthogonality, as in section (8.54).
   727  These packages might both affect the same shared library, for instance. Again
   728  according to [hart] and our own experience, it is not unusual for two packages
   729  such as these to list neither as prerequisites, so we might gain no ordering
   730  guidance from the package headers either.
   731  In other words, all we know is that we installed package 'foo', tested and
   732  deployed it to production, and then later installed package 'bar', tested it
   733  and deployed. These installs may have been weeks or months apart. All went well
   734  throughout, users were happy, and we have no interest in unpacking and
   735  analyzing the contents of these packages for possible reordering for any
   736  reason; we've gone on to other problems.
   737  Because we know this order works, we wish for these two packages, 'foo' and
   738  'bar', to be installed in the same order on every future machine in this class.
   739  This makefile will ensure that; the touch $@ command at the end of each stanza
   740  will prevent this stanza from being run again. The ISconf code always changes
   741  to the timestamps directory before starting 'make' (and takes other measures to
   742  constrain the normal behavior of 'make', so that we never try to "rebuild" this
   743  target either).
   744  The class name in this case (listing 1) is 'Block12'. You can see that
   745  'Block12' is also made up of many other packages; we don't show the makefile
   746  stanzas for these here. These packages are listed as prerequisites to
   747  'Block12', in chronological order. Note that we only want to add items to the
   748  end of this list, not the middle, due to the considerations expressed in
   749  section (8.49).
   750  In this example, even though we take advantage of the Debian package server
infrastructure, we specify the version of the package that we want, as in the
   752  introduction to section (7.3). We also use a caching proxy when fetching Debian
   753  packages, in order to speed up our own builds and reduce the load on the Debian
   754  servers to a minimum.
   755  Note that we get "halt-on-error" behavior from 'make', as we wished for in
   756  section (7.2.1). If any of the commands in the 'foo' or 'bar' sections exit
   757  with a non-zero return code, then 'make' aborts processing immediately. The
   758  'touch' will not happen, and we normally configure the infrastructure such that
   759  the ISconf failure will be noticed by a monitoring tool and escalated for
   760  resolution. In practice, these failures very rarely occur in production; we see
   761  and fix them in test. Production failures, by the definition of congruence
   762  (4.3), usually indicate a systemic, security, or organizational problem; we
   763  don't want them fixed without human investigation.
   764  Listing 1: ISconf makefile package ordering example.
   765  Block12: cvs ntp foo lynx wget serial_console bar sudo mirror_rootvg
   766  
   767  foo:
   768  	apt-get -y install foo=0.17-9
   769  	touch $@
   770  
   771  bar:
   772  	apt-get -y install bar=1.0.2-1
   773  	echo apple pear > /etc/bar.conf
   774  	touch $@
   775  
   776  ...
   777  				
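As a rough illustration (not the actual ISconf code, which is written in
Bourne shell and available from ISconf.Org), the following Python sketch shows
how a wrapper in the spirit of ISconf might drive a makefile like the one in
listing 1. The paths, makefile name, and class name are hypothetical.

    import os
    import subprocess
    import sys

    # Hypothetical locations; the real tool derives these from its
    # top-level configuration file.
    TIMESTAMP_DIR = "/var/is/timestamps"
    MAKEFILE = "/var/is/conf/hostclass.mk"
    HOST_CLASS = "Block12"

    def run_class(target):
        # Run 'make' from the timestamp directory so that each stanza's
        # 'touch $@' leaves a marker file there; stanzas whose markers
        # already exist are skipped on later runs.
        os.makedirs(TIMESTAMP_DIR, exist_ok=True)
        rc = subprocess.call(["make", "-f", MAKEFILE, target],
                             cwd=TIMESTAMP_DIR)
        if rc != 0:
            # Halt-on-error: stop here and let monitoring escalate.
            sys.exit(rc)

    if __name__ == "__main__":
        run_class(HOST_CLASS)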
   778  ** 7.3.1.2 ISconf Version 3 **
   779  ISconf version 3 was a rewrite in Perl, by Luke Kanies. This version adds more
   780  "lessons learned", including more fine-grained control of actions as applied to
   781  target classes and hosts. There are more layers of abstraction between the
   782  administrator and the target machines; the tool uses various input files to
   783  generate intermediate and final file formats which eventually are fed to
   784  'make'.
   785  One feature in particular is of special interest for this paper. In ISconf
   786  version 2, the administrator still had the potential to inadvertently create
   787  unordered change by an innocent makefile edit. While it is possible to avoid
   788  this with foreknowledge of the problem, version 3 uses timestamps in an
   789  intermediate file to prevent it from being an issue.
   790  The problem which version 3 fixes can be reproduced in version 2 as follows:
Refer to (listing 1). If both 'foo' and 'bar' have been executed (installed) on
production machines, and the administrator then adds 'baz' as a prerequisite to
'bar', this would qualify as "editing prior actions" and create the
divergence described in (8.49).
   795  ISconf version 3, rather than using a human-edited makefile, reads other input
   796  files which the administrator maintains, and generates intermediate and final
   797  files which include timestamps to detect the problem and correct the ordering.
** 7.3.1.3 ISconf Version 4 **
   799  ISconf version 4, currently in prototype, represents a significant
   800  architectural change from versions 1 through 3. If the current feature plan is
   801  fully implemented, version 4 will enable cross-organizational collaboration for
   802  development and use of ordered change actions. A core requirement is
   803  decentralized development, storage, and distribution of changes. It will enable
   804  authentication and signing, encryption, and other security measures. We are
   805  likely to replace 'make' with our own state engine, continuing the migration
   806  begun in version 3. See ISconf.Org for the latest information.
   807  ** 7.3.1.4 Baseline Management **
   808  In section (4.3), we discussed the concept of maintaining a fully descriptive
   809  baseline for congruent management. In (7.1), we discussed in general terms how
   810  this might be done. In this section, we will show how we do it in isconf.
   811  First, we install the base disk image as in section (7.1), usually using
   812  vendor-supplied network installation tools. We discuss this process more in
   813  [bootstrap]. We might name this initial image 'Block00'. Then we use the
   814  process we mentioned in (7.3.1.1) to apply changes to the machine over the
   815  course of its life. Each change we add updates our concept of what is the
   816  'baseline' for that class of host.
   817  As we add changes, any new machine we build will need to run isconf longer on
   818  first boot, to add all of the accumulated changes to the Block00 image. After
   819  about forty minutes' worth of changes have built up on top of the initial
   820  image, it helps to be able to build one more host that way, set the hostname/IP
   821  to 'baseline', cut a disk image of it, and declare that new image to be the new
   822  baseline. This infrequent snapshot or checkpoint not only reduces the build
   823  time of future hosts, but reduces the rebuild time and chance of error in
   824  rebuilding existing hosts -- we always start new builds from the latest
   825  baseline image.
   826  In an isconf makefile, this whole process is reflected as in (listing 2). Note
that whether we cut a new image and start the next install from that, or just
pull an old machine off the shelf with a Block00 image and plug it in,
   829  we'll still end up with a Block20 image with apache and a 2.2.12 kernel, due to
   830  the way the makefile prerequisites are chained.
   831  This example shows a simple, linear build of successive identical hosts with no
   832  "branching" for different host classes. Classes add slightly more complexity to
   833  the makefile. They require a top-level configuration file to define the classes
   834  and target them to the right hosts, and they require wrapper script code to
   835  read the config file.
There is a little more complexity needed to distinguish things that should only
happen at boot from things that can happen when cron runs the code every hour
or so. There
   838  are examples of all of this in the isconf-2i package available from ISconf.Org.
   839  Listing 2: Baseline Management in an ISconf Makefile
   840  
   841    # 01 Feb 97 - Block00 is initial disk install from vendor cd,
   842    # with ntp etc. added later
   843    Block00: ntp cvs lynx ...
   844  
   845    # 15 Jul 98 - got tired of waiting for additions to Block00 to build,
   846  	# cut new baseline image, later add ssh etc.
   847    Block10: Block00 ssh ...
   848  
   849    # 17 Jan 99 - new baseline again, later add apache, rebuild kernel, etc.
   850    Block20: Block10 apache kernel-2.2.12 ...
   851  *** 7.3.2 Cfengine ***
   852  Cfengine is likely the most popular purpose-built tool for automated systems
administration today. The cfengine language was optimized for dynamic
prerequisite analysis rather than for long, deterministically ordered sets of
actions. Although it was not specifically designed for ordered behavior, it is
possible to achieve this with extra work. It should be possible to
   857  greatly reduce the amount of effort involved, by using some tool to generate
   858  cfengine configuration files from makefile-like (or equivalent) input files.
   859  One good starting point might be Tobias Oetiker's TemplateTree II [oetiker].
   860  Automatic generation of cfengine configuration files appears to be a near-
   861  requirement if the tool is to be used to maintain congruent infrastructures;
   862  the class and action-type structures tend to get relatively complex rather fast
   863  if congruent ordering, rather than convergence, is the goal.
Further gains might be made from other cfengine features; we have made
progress experimenting with various helper modules, for instance. Another
   866  technique that we have put to good use is to implement atomic changes using
   867  very small cfengine scripts, each equivalent to an ISconf makefile stanza.
   868  These scripts we then drive within a deterministically ordered framework.
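As a sketch of this technique, the following Python driver runs tiny
per-change cfengine scripts in a fixed order, touching a stamp file after each
success, much like the 'touch $@' stanzas in listing 1. The paths and script
names are hypothetical, and the agent command line is an assumption that will
vary with the cfengine version in use.

    import os
    import subprocess
    import sys

    AGENT = ["cfagent", "-f"]                 # assumed invocation; adjust locally
    SCRIPT_DIR = "/var/lib/ordered/scripts"   # hypothetical paths
    STAMP_DIR = "/var/lib/ordered/stamps"
    ORDERED_SCRIPTS = ["000-ntp.conf", "001-foo.conf", "002-bar.conf"]

    def run_in_order():
        os.makedirs(STAMP_DIR, exist_ok=True)
        for name in ORDERED_SCRIPTS:
            stamp = os.path.join(STAMP_DIR, name)
            if os.path.exists(stamp):
                continue                      # already applied on this host
            script = os.path.join(SCRIPT_DIR, name)
            if subprocess.call(AGENT + [script]) != 0:
                sys.exit(1)                   # halt-on-error, as in (7.2.1)
            open(stamp, "w").close()          # mark the change as done

    if __name__ == "__main__":
        run_in_order()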
   869  In the cfengine version 2 language there are new features, such as the
   870  FileExists() evaluated class function, which may reduce the amount of code. So
   871  far, based on our experience over the last few years in trial attempts, it
   872  appears that a cfengine configuration file that does the same job as an ISconf
makefile would still need two to three times as many lines of code.
   874  We consider this an open and evolving effort though -- check the cfengine.org
   875  and Infrastructures.Org websites for the latest information.
   876  ***** 8 Brown/Traugott Turing Equivalence *****
   877       If it should turn out that the basic logics of a machine designed for
   878       the numerical solution of differential equations coincide with the
   879       logics of a machine intended to make bills for a department store, I
   880       would regard this as the most amazing coincidence that I have ever
   881       encountered. -- Howard Aiken, founder of Harvard's Computer Science
   882       department and architect of the IBM/Harvard Mark I.
   883  Turing equivalence in host management appears to be a new factor relative to
   884  the age of the computing industry. The downsizing of mainframe installations
   885  and distribution of their tasks to midrange and desktop machines by the early
   886  1990's exposed administrative challenges which have taken the better part of a
   887  decade for the systems administration community to understand, let alone deal
   888  with effectively.
   889  Older computing machinery relied more on dedicated hardware rather than
   890  software to perform many administrative tasks. Operating systems were limited
   891  in their ability to accept changes on the fly, often requiring recompilation
   892  for tasks as simple as adding terminals or changing the time zone. Until
   893  recently, the most popular consumer desktop operating system still required a
   894  reboot when changing IP address.
   895  In the interests of higher uptime, modern versions of UNIX and Linux have
   896  eliminated most of these issues; there is very little software or configuration
   897  management that cannot be done with the machine "live". We have evolved to a
   898  model that is nearly equivalent to that of a Universal Turing Machine, with all
   899  of its benefits and pitfalls. To avoid this equivalence, we would need to go
   900  back to shutting operating systems down in order to administer them. Rather
   901  than go back, we should seek ways to go further forward; understanding Turing
   902  equivalence appears to be a good next step.
   903  This situation may soon become more critical, with the emergence of "soft
   904  hardware". These systems use Field-Programmable Gate Arrays to emulate
   905  dedicated processor and peripheral hardware. Newer versions of these devices
   906  can be reprogrammed, while running, under control of the software hosted on the
   907  device itself [xilinx]. This will bring us the ability to modify, for instance,
   908  our own CPU, using high-level automated administration tools. Imagine not only
   909  accidentally unconfiguring your Ethernet interface, but deleting the circuitry
   910  itself...
   911  We have synthesized a thought experiment to demonstrate some of the
   912  implications of Turing equivalence in host management, based on our
   913  observations over the course of several years. The description we provide here
   914  is not as rigorous as the underlying theories, and much of it should be
   915  considered as still subject to proof. We do not consider ourselves theorists;
   916  it was surprising to find ourselves in this territory. The theories cited here
   917  provided inspiration for the thought experiment, but the goal is practical
   918  management of UNIX and other machines. We welcome any and all future
   919  exploration, pro or con. See section (9), 'Conclusion_and_Critique'.
   920  In the following description of this thought experiment, we will develop a
   921  model of system administration starting at the level of the Turing machine. We
   922  will show how a modern self-administered machine is equivalent to a Turing
   923  machine with several tapes, which is in turn equivalent to a single-tape Turing
   924  machine. We will construct a Turing machine which is able to update its own
   925  program by retrieving new instructions from a network-accessible tape. We will
   926  develop the idea of configuration management for this simpler machine model,
   927  and show how problems such as circular dependencies and uncertainty about
   928  behavior arise naturally from the nature of computation.
   929  We will discuss how this Turing machine relates to a modern general-purpose
   930  computer running an automatic administration tool. We will introduce the
   931  implications of the self-modifying code which this arrangement allows, and the
   932  limitations of inspection and testing in understanding the behavior of this
   933  machine. We will discuss how ordering of changes affects this behavior, and how
   934  deterministically ordered changes can make its behavior more deterministic.
   935  We will expand beyond single machines into the realm of distributed computing
   936  and management of multiple machines, and their associated inspection and
   937  testing costs. We will discuss how ordering of changes affects these costs, and
   938  how ordered change apparently provides the lowest cost for managing an
   939  enterprise infrastructure.
   940  Readers who are interested in applied rather than mathematical or theoretical
   941  arguments may want to review (7) or skip to section (9).
   942  8.1 - A Turing machine (figure_8.1.1) reads bits from an infinite tape,
   943  interprets them as data according to a hardwired program and rewrites portions
   944  of the tape based on what it finds. It continues this cycle until it reaches a
   945  completion state, at which time it halts [turing].
   946  [images/turing.png]
   947       Figure 8.1.1: Turing machine block diagram; the machine reads and
   948       writes an infinite tape and updates an internal state variable based
   949       on a hardwired or stored ruleset.
   950  8.2 - Because a Turing machine's program is hardwired, it is common practice to
   951  say that the program describes or is the machine. A Turing machine's program is
   952  stated in a descriptive language which we will call the machine language. Using
   953  this language, we describe the actions the machine should take when certain
   954  conditions are discovered. We will call each atom of description an
   955  instruction. An example instruction might say:
   956       If the current machine state is 's3', and the tape cell at the
   957       machine's current head position contains the letter 'W', then change
   958       to state 's7', overwrite the 'W' with a 'P', and move the tape one
   959       cell to the right.
   960  Each instruction is commonly represented as a quintuple; it contains the letter
   961  and current state to be matched, as well as the letter to be written, the tape
   962  movement command, and the new state. The instruction we described above would
   963  look like:
   964       s3,W ⇒ s7,P,r
   965  Note that a Turing machine's language is in no way algorithmic; the order of
   966  quintuples in a program listing is unimportant; there are no branching,
   967  conditional, or loop statements in a Turing machine program.
   968  8.3 - The content of a Turing tape is expressed in a language that we will call
   969  the input language. A Turing machine's program is said to either accept or
   970  reject a given input language, if it halts at all. If our Turing machine halts
in an accept state (which might actually be a state named 'accept'), then we
know that our program is able to process the data and produce a valid result --
we have validated our input against our machine. If our Turing machine halts
   974  because there is no instruction that matches the current combination of state
   975  and cell content (8.2), then we know that our program is unable to process this
   976  input, so we reject. If we never halt, then we cannot state a result, so we
   977  cannot validate the input or the machine.
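To make the quintuple notation (8.2) and the accept/reject outcomes (8.3)
concrete, here is a minimal Python sketch. The rule table holds the s3,W =>
s7,P,r instruction from above plus one illustrative rule of our own; the step
budget is our addition, since we cannot in general know whether an arbitrary
machine will ever halt.

    # Quintuples: (state, symbol) -> (new state, symbol to write, head move).
    RULES = {
        ("s3", "W"): ("s7", "P", "r"),
        ("s7", "P"): ("accept", "P", "r"),   # illustrative only
    }

    def run(state, tape, head, budget=10000):
        for _ in range(budget):
            if state == "accept":
                return "accept"              # halted; input validated (8.3)
            key = (state, tape[head])
            if key not in RULES:
                return "reject"              # no matching instruction; halt
            state, tape[head], move = RULES[key]
            head += 1 if move == "r" else -1
        return "unknown"                     # no halt within the budget

    print(run("s3", ["W", "P", " ", " "], 0))   # prints 'accept'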
   978  8.4 - A Universal Turing Machine (UTM) is able to emulate any arbitrary Turing
   979  machine. Think of this as running a Turing "virtual machine" (TVM) on top of a
   980  host UTM. A UTM's machine language program (8.2) is made up of instructions
   981  which are able to read and execute the TVM's machine language instructions. The
   982  TVM's machine language instructions are the UTM's input data, written on the
   983  input tape of the UTM alongside the TVM's own input data (figure_8.4.1).
   984  Any multiple-tape Turing machine can be represented by a single-tape Turing
   985  machine, so it is equally valid to think of our Universal Turing Machine as
   986  having two tapes; one for TVM program, and the other for TVM data.
   987  A Universal Turing Machine appears to be a useful model for analyzing the
   988  theoretical behavior of a "real" general-purpose computer; basic computability
   989  theory seems to indicate that a UTM can solve any problem that a general-
   990  purpose computer can solve [church].
   991  [images/utmtape.png]
   992       Figure 8.4.1: The tape of a Universal Turing Machine (UTM) stores the
   993       program and data of a hosted Turing Virtual Machine (TVM).
   994  8.5 - Further work by John von Neumann and others demonstrated one way that
   995  machines could be built which were equivalent in ability to Universal Turing
   996  Machines, with the exception of the infinite tape size [vonneumann]. The von
   997  Neumann architecture is considered to be a foundation of modern general purpose
   998  computers [godfrey].
   999  8.6 - As in von Neumann's "stored program" architecture, the TVM program and
  1000  data are both stored as rewritable bits on the UTM tape (8.4) (figure_8.4.1).
  1001  This arrangement allows the TVM to change the machine language instructions
  1002  which describe the TVM itself. If it does so, our TVM enjoys the advantages
  1003  (and the pitfalls) of self-modifying code [nordin].
  1004  8.7 - There is no algorithm that a Turing machine can use to determine whether
  1005  another specific Turing machine will halt for a given tape; this is known as
  1006  the "halting problem". In other words, Turing machines can contain
  1007  constructions which are difficult to validate. This is not to say that every
machine contains such constructions, but that an arbitrary machine and
  1009  tape chosen at random has some chance of containing one.
  1010  8.8 - Note that, since a Turing machine is an imaginary construct [turing], our
  1011  own brain, a pencil, and a piece of paper are (theoretically) sufficient to
  1012  work through the tape, producing a result if there is one. In other words, we
  1013  can inspect the code and determine what it would do. There may be tools and
  1014  algorithms we can use to assist us in this [laitenberger]. We are not
  1015  guaranteed to reach a result though -- in order for us to know that we have a
  1016  valid machine and valid input, we must halt and reach an accept state.
  1017  Inspection is generally considered to be a form of testing.
  1018  Inspection has a cost (which we will use later):
  1019       Cinspect
  1020  This cost includes the manual labor required to inspect the code, any machine
  1021  time required for execution of inspection tools, and the manual labor to
  1022  examine the tool results.
  1023  8.9 - There is no software testing algorithm that is guaranteed to ensure fully
  1024  reliable program operation across all inputs -- there appears to be no
  1025  theoretical foundation for one [hamlet]. We suspect that some of the reasons
  1026  for this may be related to the halting problem (8.7), Gödel's incompleteness
  1027  theorem [godel], and some classes of computational intractability problems,
such as the Traveling Salesman problem and NP-completeness [greenlaw] [garey]
  1029  [brookshear] [dewdney].
  1030  In practice, we can use multiple test runs to explore the input domain via a
  1031  parameter study, equivalence partitioning [richardson], cyclomatic complexity
  1032  analysis [mccabe], pseudo-random input, or other means. Using any or all of
  1033  these methods, we may be able to build a confidence level for predictability of
  1034  a given program. Note that we can never know when testing is complete, and that
  1035  testing only proves incorrectness of a program, not correctness.
  1036  Testing cost includes the manual labor required to design the test, any machine
  1037  time required for execution, and the manual labor needed to examine the test
  1038  results:
  1039       Ctest
  1040  8.10 - For software testing to be meaningful, we must also ensure code
  1041  coverage. Code coverage requirements are generally determined through some form
  1042  of inspection (8.8), with or without the aid of tools. Coverage information is
  1043  only valid for a fixed program -- even relatively minor code changes can affect
  1044  code coverage information in unpredictable ways [elbaum]. We must repeat
  1045  testing (8.9) for every variation of program code.
  1046  To ensure code coverage, testing includes the manual labor required to inspect
  1047  the code, any machine time required for execution of the coverage tools and
  1048  tests, and the manual labor needed to examine the test results. Because testing
  1049  for coverage includes code inspection, we know that testing is more expensive
  1050  than inspection alone:
  1051       Ctest > Cinspect
  1052  8.11 - Once we have found a UTM tape that produces the result we desire, we can
  1053  make many copies of that tape, and run them through many identical Universal
  1054  Turing Machines simultaneously. This will produce many simultaneous, identical
  1055  results. This is not very interesting -- what we really want to be able to do
  1056  is hold the TVM program portion of the tape constant while changing the TVM
  1057  data portion, then feed those differing tapes through identical machines. The
  1058  latter arrangement can give us a form of distributed or parallel computing.
  1059  8.12 - Altering the tapes (8.11) presents a problem though. We cannot in
  1060  advance know whether these altered tapes will provide valid results, or even
  1061  reach completion. We can exhaustively test the same program with a wide variety
  1062  of sample inputs, validating each of these. This is fundamentally a time-
  1063  consuming, pseudo-statistical process, due to the iterative validations
  1064  normally required. And it is not a complete solution (8.9).
  1065  8.13 - If we for some reason needed to solve slightly different problems with
  1066  the distributed machines in (8.11), we may decide to use slightly different
  1067  programs in each machine, rather than add functionality to our original
  1068  program. But using these unique programs would greatly worsen our testing
  1069  problem. We would not only need to validate across our range of input data
  1070  (8.9), but we would also need to repeat the process for each program variant
  1071  (8.10). We know that testing many unique programs will be more expensive than
  1072  testing one:
  1073       Cmany > Ctest
  1074  8.14 - It is easy to imagine a Turing Machine that is connected to a network,
  1075  and which is able to use the net to fetch data from tapes stored remotely,
under program control. This is simply a case of a multiple-tape Turing
  1077  machine, with one or more of the tapes at the other end of a network
  1078  connection.
  1079  8.15 - Building on (8.14), imagine a Turing Virtual Machine (TVM) running on
  1080  top of a networked Universal Turing Machine (UTM) (8.4). In this case, we might
  1081  have 3 tapes; one for the TVM program, one for the TVM data, and a third for
  1082  the remote network tape. It is easy to imagine a sequence of TVM operations
  1083  which involve fetching a small amount of data from the remote tape, and storing
  1084  it on the local program tape as additional and/or replacement TVM instructions
  1085  (8.6). We will name the old TVM instruction set A. The set of fetched
  1086  instructions we will name B, and the resulting merger of the two we will name
  1087  AB. Note that some of the instructions in B may have replaced some of those in
  1088  A (figure_8.15.1). Before the fetch, our TVM could be described (8.2) as an A
  1089  machine, after the fetch we have an AB machine -- the TVM's basic functionality
  1090  has changed. It is no longer the same machine.
  1091  [images/ab.png]
  1092       Figure 8.15.1: Instruction set B partially overlays instruction set
  1093       A, creating set AB.
  1094  8.16 - Note that, if any of the instructions in set B replace any of those in
  1095  set A, (8.15), then the order of loading these sets is important. A TVM with
  1096  the instruction set AB will be a different machine than one with set BA (figure
  1097  8.16.1).
  1098  [images/ba.png]
  1099       Figure 8.16.1: Instruction set BA is created by loading B before A; A
  1100       partially overlays B this time.
  1101  8.17 - It is easy to imagine that the TVM in (8.15) could later execute an
  1102  instruction from set B, which could in turn cause the machine to fetch another
  1103  set of one or more instructions in a set we will call C, resulting in an ABC
  1104  machine:
  1105  [images/abc.png]
  1106       Figure 8.17.1: If instructions from set AB load C, then ABC results.
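The effect of load order on the resulting instruction set can be shown in a
few lines of Python; the instruction sets below are toy mappings standing in
for the quintuple tables of (8.2), with made-up states and action labels.

    # Loading a set overlays any entries it shares with what is already on
    # the program tape, as in figures 8.15.1 through 8.17.1.
    A = {("s1", "0"): "a1", ("s1", "1"): "a2"}
    B = {("s1", "1"): "b1", ("s2", "0"): "b2"}   # overlaps A on ("s1", "1")
    C = {("s2", "0"): "c1"}

    def overlay(*sets):
        merged = {}
        for s in sets:
            merged.update(s)          # later sets replace earlier entries
        return merged

    AB = overlay(A, B)
    BA = overlay(B, A)
    ABC = overlay(A, B, C)

    print(AB == BA)                   # False: AB and BA are different machines
    print(ABC[("s2", "0")])           # 'c1': C has overlaid part of B in turn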
  1107  8.18 - After each fetch described in section (8.17), the local program and data
  1108  tapes will contain bits from (at least) three sources: the new instruction set
  1109  just copied over the net, any old instructions still on tape, and the data
  1110  still on tape from ongoing execution of all previous instructions.
  1111  8.19 - The choice of next instruction to be fetched from the remote tape in
  1112  section (8.17) can be calculated by the currently available instructions on the
  1113  local program tape, based on current tape content (8.18).
  1114  8.20 - The behavior of one or more new instructions fetched in (8.17) can (and
  1115  usually will) be influenced by other content on the local tapes (8.18). With
  1116  careful inspection and testing we can detect some of the ways content will
  1117  affect instruction fetches, but due to the indeterminate results of software
  1118  testing (8.9), we may never know if we found all of them.
  1119  8.21 - Let us go back to our three TVM instruction sets, A, B, and C (8.17).
  1120  These were loaded over the net and executed using the procedure described in
  1121  (8.19). Assume we start with blank local program and data tapes. Assume our UTM
  1122  is hardwired to fetch set A if the local program tape is found to be blank. If
  1123  we then run the TVM, A can collect data over the net and begin processing it.
  1124  At some point later, A can cause set B to be loaded. Our local tapes will now
  1125  contain the TVM data resulting from execution of A, and the new TVM machine
  1126  instructions AB. If the TVM later loads C, our program tape will contain ABC.
  1127  8.22 - If the networked UTM machine constructed in (8.21) always starts with
  1128  the same (blank) local tape content, and the remote tape content does not
  1129  change, then we can demonstrate that an A TVM will always evolve to an AB, then
  1130  an ABC machine, before halting and producing a result.
  1131  8.23 - Assuming the network-resident data never changes, we can rebuild our
  1132  networked UTM at any time and restore it to any prior state by clearing the
  1133  local tapes, resetting the machine state, and restarting execution with the
  1134  load of A (8.21). The machine will execute and produce the same intermediate
  1135  and final results as it did before, as in section (8.22).
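A toy model of (8.21) through (8.23), with hypothetical set names: as long as
the remote master copies do not change, clearing the local tape and re-running
the fetch sequence reproduces the same machine every time.

    # 'REMOTE' stands in for the network-resident master copies of the
    # instruction sets; each set names the next set it will load, if any.
    REMOTE = {
        "A": {"loads": "B"},
        "B": {"loads": "C"},
        "C": {"loads": None},
    }

    def rebuild():
        program = {}                  # blank local program tape
        next_set = "A"                # hardwired: fetch A when tape is blank
        history = []
        while next_set is not None:
            fetched = REMOTE[next_set]
            program[next_set] = fetched
            history.append(next_set)
            next_set = fetched["loads"]
        return history

    print(rebuild())                  # ['A', 'B', 'C'] every time
    print(rebuild() == rebuild())     # True while the remote tape is unchanged

If someone alters the master copy of B so that it no longer loads C, the same
rebuild halts at AB instead -- the situation described next in (8.24).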
  1136  8.24 - If the network-resident data does change, though, we may not be able to
  1137  rebuild to an identical state. For example, if someone were to alter the
  1138  network-resident master copy of the B instruction set after we last fetched it,
  1139  then it may no longer produce the same intermediate results and may no longer
  1140  fetch C (8.19). We might instead halt at AB.
  1141  8.25 - Without careful (and possibly intractable) inspection (8.8), we cannot
prove in advance whether a BCA or CAB machine can produce the same result as
  1143  an ABC machine. It is possible that these, or other, variations might yield the
  1144  same result. We can validate the result for a given input (8.3). We would also
  1145  need to do iterative testing (8.12) to demonstrate that multiple inputs would
  1146  produce the same result. Our cost of testing multiple or partially ordered
  1147  sequences is greater than that required to test a single sequence:
  1148       Cpartial > Ctest
  1149  8.26 - If the behavior of any instruction from B in (8.22) is in any way
  1150  dependent on other content found on tape (8.18) (8.19) (8.20), then we can
  1151  expect our TVM to behave differently if we load B before loading A (8.16). We
  1152  cannot be certain that a UTM loaded with only a B instruction set will accept
  1153  the input language, or even halt, until after we validate it (8.3).
  1154  8.27 - We might want to rollback from the load or execution of a new
  1155  instruction set. In order to do this, we would need to return the local program
  1156  and data tape to a previous content. For example, if machine A executes and
  1157  loads B, our instruction set will now be AB. We might rollback by replacing our
  1158  tape with the A copy.
  1159  8.28 - Due to (8.26), it is not safe to try to rollback the instruction set of
  1160  machine AB to recreate machine A by simply removing the B instructions. Some of
  1161  B may have replaced A. The AB machine, while executing, may have even loaded C
  1162  already (8.21), in which case you won't end up with A, but with AC. If the AB
  1163  machine executed for any period of time, it is likely that the input data
  1164  language now on the data tape is only acceptable to an AB machine -- an A
  1165  machine might reject it or fail to halt (8.3). The only safe rollback method
  1166  seems to be something similar to (8.27).
  1167  8.29 - It is easy to imagine an automatic process which conducts a rollback.
  1168  For example, in (8.27), machine AB itself might have the ability to clear its
  1169  own tapes, reset the machine state, and restart execution at the beginning of
  1170  A, as in section (8.23).
  1171  8.30 - But the system described in (8.29) will loop infinitely. Each time A
  1172  executes, it will load B, then AB will execute and reset the local tapes again.
  1173  In practice, a human might detect and break this loop; to represent this
  1174  interaction, we would need to add a fourth tape, representing the user
  1175  detection and input data.
  1176  8.31 - It is easy to imagine an automatic process which emulates a rollback
  1177  while avoiding loops, without requiring the user input tape in (8.30). For
  1178  example, instruction set C might contain the instructions from A that B
  1179  overlaid. In other words, installing C will "rollback" B. Note that this is not
  1180  a true rollback; we never return to a tape state that is completely identical
  1181  to any previous state. Although this is an imperfect solution, it is the best
  1182  we seem to be able to do without human intervention.
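Continuing the toy instruction sets from the sketch after (8.17), a "roll
forward" set C can restore the entries of A that B overlaid, but it cannot
undo everything B did, as the following lines (illustrative values only) show.

    A = {("s1", "0"): "a1", ("s1", "1"): "a2"}
    B = {("s1", "1"): "b1", ("s2", "0"): "b2"}   # overlays one A entry, adds one

    # C is published later and simply restores the A entries B overlaid.
    C = {key: A[key] for key in B if key in A}

    AB = {**A, **B}
    ABC = {**AB, **C}

    print(ABC == A)     # False: the entry B added is still present, and any
                        # data written in the meantime is untouched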
  1183  8.32 - The loop in section (8.30) will cause our UTM to never reach completion
  1184  -- we will not halt, and cannot validate a result (8.3). A method such as
  1185  (8.31) can prevent a rollback-induced loop, but is not a true rollback -- we
  1186  never return to an earlier tape content. If these, or similar, methods are the
  1187  only ones available to us, it appears that program-controlled tape changes must
  1188  be monotonic -- we cannot go back to a previous tape content under program
  1189  control, otherwise we loop.
  1190       You are in a maze of twisty little passages, all alike. -- Will
  1191       Crowther's "Adventure"
  1192  8.33 - Let us now look at a conventional application program, running as an
  1193  ordinary user on a correctly configured UNIX host. This program can be loaded
  1194  from disk into memory and executed. At no time is the program able to modify
  1195  the "master" copy of itself on disk. An application program typically executes
  1196  until it has output its results, at which time it either sleeps or halts. This
  1197  application is equivalent to a fixed-program Turing machine (8.1) in the
  1198  following ways: Both can be validated for a given input (8.3) to prove that
  1199  they will produce results in a finite time and that those results are correct.
  1200  Both can be tested over a range of inputs (8.9) to build confidence in their
  1201  reliability. Neither can modify their own executable instructions; in the UNIX
  1202  machine they are protected by filesystem permissions; in the Turing machine
  1203  they are hardwired. (We stipulate that there are some ways in which (8.33) and
  1204  (8.1) are not equivalent -- a Turing machine has a theoretically infinite tape,
  1205  for instance.)
  1206  8.34 - We can say that the application program in (8.33) is running on top of
  1207  an application virtual machine (AVM). If the application is written in Java,
  1208  for example, the AVM consists of the Java Virtual Machine. In Perl, the AVM is
  1209  the Perl bytecode VM. For C programs, the AVM is the kernel system call
  1210  interface. Low-level code in shared libraries used by a C program uses the same
  1211  syscall interface to interact with the hardware -- shared libraries are part of
  1212  the C AVM. A Perl program can load modules -- these become part of the
program's AVM. A C or Perl program that calls system() or exec() relies on any
executables called -- these other executables, then, are
  1215  part of the C or Perl program's AVM. Any executables called via exec() or
  1216  system() in turn may require other executables, shared libraries, or other
  1217  facilities. Many, if not most, of these components are dependent on one or more
  1218  configuration files. These components all form an AVM chain of dependency for
  1219  any given application. Regardless of the size or shape of this chain, all
  1220  application programs on a UNIX machine ultimately interact with the hardware
  1221  and the outside world via the kernel syscall interface.
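One visible slice of an AVM chain can be enumerated with standard tools; the
sketch below shells out to 'ldd' to list the shared libraries an executable is
linked against. Configuration files and the executables reached via system()
or exec() are not visible this way, which is part of the problem.

    import subprocess

    def shared_libs(path):
        # One layer of the AVM chain: the shared libraries reported by 'ldd'.
        out = subprocess.run(["ldd", path], capture_output=True, text=True)
        return [line.split()[0] for line in out.stdout.splitlines()
                if line.strip()]

    print(shared_libs("/bin/ls"))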
  1222  8.35 - When we perform system administration actions as root on a running UNIX
  1223  machine, we can use tools found on the local disk to cause the machine to
  1224  change portions of that same disk. Those changes can include executables,
  1225  configuration files, and the kernel itself. Changes can include the system
  1226  administration tools themselves, and changed components and configuration files
  1227  can influence the fundamental behavior and viability of those same executables
  1228  in unforeseen ways, as in section (8.10), as applied to changes in the AVM
  1229  chain (8.34).
  1230  8.36 - A self-administered UNIX host runs an automatic systems administration
  1231  tool (ASAT) periodically and/or at boot. The ASAT is an application program
  1232  (8.33), but it runs as root rather than an ordinary user. While executing, the
  1233  ASAT is able to modify the "master" copy of itself on disk, as well as the
  1234  kernel, shared libraries, filesystem layout, or any other portion of disk, as
  1235  in section (8.35).
  1236  8.37 - The ASAT described in section (8.36) is equivalent to a Turing Virtual
  1237  Machine (8.4) in the ways described in section (8.33). In addition, a self-
  1238  administered host running an ASAT is similar to a Universal Turing Machine in
  1239  that the ASAT can modify its own program code (8.6).
  1240  8.38 - A self-administered UNIX host connected to a network is equivalent to a
  1241  network-connected Universal Turing Machine (8.14) in the following ways: The
  1242  host's ASAT (8.36) can fetch and execute an arbitrary new program as in section
  1243  (8.15). The fetched program can fetch and execute another as in (8.17).
  1244  Intermediate results can control which program is fetched next, as in (8.19).
  1245  The behavior of each fetched program can be influenced by the results of
  1246  previous programs.
  1247  8.39 - When we do administration via automated means (8.36), we rely on the
  1248  executable portions of disk, controlled by their configuration files, to
  1249  rewrite those same executables and configuration files (8.35). Like the
  1250  Universal Turing Machine in (8.32), changes made under program control must be
  1251  assumed to be monotonic; non-reversible short of "resetting the tape state" by
  1252  reformatting the disk.
  1253  8.40 - An ASAT (8.36) runs in the context of the host kernel and configuration
  1254  files, and depends either directly or indirectly on other executables and
  1255  shared libraries on the host's disk (8.26).
  1256  The circular dependency of the ASAT AVM dependency tree (8.34) forces us to
  1257  assume that, even though we may not ever change the ASAT code itself, we can
  1258  unintentionally change its behavior if we change other components of the
  1259  operating system. This is similar to the indeterminacy described in (8.20).
  1260  It is not enough for an ASAT designer to statically link the ASAT binary and
  1261  carefully design it for minimum dependencies. Other executables, their shared
  1262  libraries, scripts, and configuration files might be required by ASAT
  1263  configuration files written by a system administrator -- the tool's end user.
  1264  When designing tools we cannot know whether the system administrator is aware
  1265  of the AVM dependency tree (we certainly can't expect them to have read this
  1266  paper). We must assume that there will be circular dependencies, and we must
  1267  assume that the tool designer will never know what these dependencies are. The
  1268  tool must support some means of dealing with them by default. We've found over
  1269  the last several years that a default paradigm of deterministic ordering will
  1270  do this.
  1271  8.41 - We cannot always keep all hosts identical; a more practical method, for
  1272  instance, is to set up classes of machines, such as "workstation" and "mail
  1273  server", and keep the code within a class identical. This reduces the amount of
  1274  coverage testing required (8.10). This testing is similar to that described in
  1275  section (8.13).
  1276  8.42 - The question of whether a particular piece of software is of sufficient
  1277  quality for the job remains intractable (8.9).
  1278  But in practice, in a mission-critical environment, we still want to try to
  1279  find most defects before our users do. The only accurate way to do this is to
  1280  duplicate both program and input data, and validate the combination (8.3). In
  1281  order for this validation to be useful, the input data would need to be an
  1282  exact copy of real-world, production data, as would the program code. Since we
  1283  want to be able to not only validate known real-world inputs but also test some
  1284  possible future inputs (8.9), we expect to modify and disrupt the data itself.
  1285  We cannot do this in production. Application developers and QA engineers tend
  1286  to use test environments to do this work. It appears to us that systems
  1287  administrators should have the same sort of test facilities available for
  1288  testing infrastructure changes, and should make good use of them.
  1289  8.43 - Because the ASAT (8.36) is itself a complex, critical application
  1290  program, it needs to be tested using the procedure in (8.42). Because the ASAT
  1291  can affect the operation of the UNIX kernel and all subsidiary processes, this
  1292  testing usually will conflict with ordinary application testing. Because the
  1293  ASAT needs to be tested against every class of host (8.41) to be used in
  1294  production, this usually requires a different mix of hosts than that required
  1295  for testing an ordinary application.
  1296  8.44 - The considerations in section (8.43) dictate a need for an
  1297  infrastructure test environment for testing automated systems administration
  1298  tools and techniques. This environment needs to be separate from production,
and needs to match it as closely as possible in terms of user data and host
class mix.
  1301  8.45 - Changes made to hosts in the test environment (8.44), once tested
  1302  (8.12), need to be transferred to their production counterpart hosts. When
  1303  doing so, the ordering precautions in section (8.26) need to be observed. Over
  1304  the last several years, we have found that if you observe these precautions,
  1305  then you will see the benefits of repeatable results as shown in (8.22). In
  1306  other words, if you always make the same changes first in test, then
  1307  production, and you always make those changes in the same order on each host,
  1308  then changes that worked in test will work in production.
  1309  8.46 - Because an ASAT (8.36) installed on many machines must be able to be
  1310  updated without manual intervention, it is our standard practice to always have
  1311  the tool update itself as well as its own configuration files and scripts. This
  1312  allows the entire system state to progress through deterministic and repeatable
  1313  phases, with the tool, its configuration files, and other possibly dependent
  1314  components kept in sync with each other.
  1315  By having the ASAT update itself, we know that we are purposely adding another
  1316  circular dependency beyond that mentioned in section (8.40). This adds to the
  1317  urgency of the need for ordering constraints such as (8.45).
  1318  We suspect control loop theory applies here; this circular dependency creates a
  1319  potential feedback loop. We need to "break the loop" and prevent runaway
  1320  behavior such as oscillation (replacing the same file over and over) or loop
  1321  lockup (breaking the tool so that it cannot do anything anymore).
  1322  Deterministically ordered changes seem to do the trick, acting as an effective
  1323  damper.
  1324  We stipulate that this is not standard practice for all ASAT users. But all
  1325  tools must be updated at some point; there are always new features or bug fixes
  1326  which need to be addressed. If the tool cannot support a clean and predictable
  1327  update of its own code, then these very critical updates must be done "out of
  1328  band". This defeats the purpose of using an ASAT, and ruins any chance of
  1329  reproducible change in an enterprise infrastructure.
  1330  8.47 - Due to (8.45), if we allow the order of changes to be A, B, C on some
  1331  hosts, and A, C, B on others, then we must test both versions of the resulting
  1332  hosts (8.13). We have inadvertently created two host classes (8.41); due to the
  1333  risk of unforeseen interactions we must also test both versions of hosts for
  1334  all future changes as well, regardless of ordering of those future changes. The
  1335  hosts have diverged (4.1).
  1336  8.48 - It is tempting to ask "Why don't we just test changes in production, and
  1337  rollback if they don't work?" This does not work unless you are able to take
  1338  the time to restore from tape, as in section (8.27). There's also the user data
  1339  to consider -- if a change has been applied to a production machine, and the
  1340  machine has run for any length of time, then the data may no longer be
  1341  compatible with the earlier version of code (8.28). When using an ASAT in
  1342  particular, it appears that changes should be assumed to be monotonic (8.39).
  1343  8.49 - It appears that editing, removing, or otherwise altering the master
  1344  description of prior changes (8.24) is harmful if those changes have already
  1345  been deployed to production machines. Editing previously-deployed changes is
  1346  one cause of divergence (4.1). A better method is to always "roll forward" by
  1347  adding new corrective changes, as in section (8.31).
  1348  8.50 - It is extremely tempting to try to create a declarative or descriptive
  1349  language L that is able to overcome the ordering restrictions in (8.45) and
  1350  (8.49). The appeal of this is obvious: "Here are the results I want, go make it
  1351  so."
  1352  A tool that supports this language would work by sampling subsets of disk
  1353  content, similar to the way our Turing machine samples individual tape cells
  1354  (8.1). The tool would read some instruction set P, which was written in L by
  1355  the sysadmin. While sampling disk content, the tool would keep track of some
  1356  internal state S, similar to our Turing machine's state (8.2). Upon discovering
  1357  a state and disk sample that matched one of the instructions in P, the tool
  1358  could then change state, rewrite some part of the disk, and look at some other
  1359  part of the disk for something else to do. Assuming a constant instruction set
  1360  P, and a fixed virtual machine in which to interpret P, this would provide
  1361  repeatable, validatable results (8.3).
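A cartoon of such a tool, in Python, with an entirely hypothetical instruction
set P: each instruction pairs a state and a predicate over sampled disk
content with an action and a next state. This is a sketch of the idea only,
not a description of any real tool.

    import os

    CONF = "/tmp/demo-bar.conf"            # hypothetical managed file

    def write_conf():
        with open(CONF, "w") as f:
            f.write("apple pear\n")

    # P: (state, predicate over a disk sample, action, next state)
    P = [
        ("start", lambda: not os.path.exists(CONF), write_conf, "configured"),
        ("start", lambda: os.path.exists(CONF), lambda: None, "configured"),
        ("configured", lambda: True, lambda: None, "accept"),
    ]

    def run():
        state = "start"
        while state != "accept":
            for s, test, act, nxt in P:
                if s == state and test():
                    act()
                    state = nxt
                    break
            else:
                return "reject"            # no instruction matched; halt
        return "accept"

    print(run())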
  1362  8.51 - Since the tool in section (8.50) is an ASAT (8.36), influenced by the
  1363  AVM dependency tree (8.34), it is equivalent to a Turing Virtual Machine as in
  1364  (8.37). This means that it is subject to the ordering constraints of (8.45). If
  1365  the host is networked, then the behavior shown in (8.15) through (8.20) will be
  1366  evident.
  1367  8.52 - Due to (8.51), there appears to be no language, declarative or
  1368  imperative, that is able to fully describe the desired content of the root-
  1369  owned, managed portions of a disk while neglecting ordering and history. This
  1370  is not a language problem: The behavior of the language interpreter or AVM
  1371  (8.34) itself is subject to current disk content in unforeseen ways (8.35).
  1372  We stipulate that disk content can be completely described in any language by
  1373  simply stating the complete contents of the disk. This is still a case of
  1374  ordering, a case in which there is only one change to be made. Cloning,
  1375  discussed in section (3), is an applied example of this case. This class of
  1376  change seems to be free of the circular dependencies of an AVM; the new disk
  1377  image is usually applied when running from an NFS or ramdisk root partition,
  1378  not while modifying a live machine.
8.53 - A tool constructed as in section (8.50) is useful for a very well-
defined purpose: when hosts have diverged (8.47) beyond any ability to keep
track of what changes have already been made. At this point, you have two
choices: rebuild the hosts from scratch, using a tool that tracks lifetime
ordering, or use a convergence tool to gain some control over them.
  1384  8.54 - It is tempting to ask "Does every change really need to be strictly
  1385  sequenced? Aren't some changes orthogonal?" By orthogonal we mean that the
  1386  subsystems affected by the changes are fully independent, non-overlapping,
cause no conflict, and have no interaction with each other, and therefore are not
  1388  subject to ordering concerns.
  1389  While it is true that some changes will always be orthogonal, we cannot easily
  1390  prove orthogonality in advance. It might appear that some changes are
  1391  "obviously unrelated" and therefore not subject to sequencing issues. The
  1392  problem is, who decides? We stipulate that talent and experience are useful
  1393  here, for good reason: it turns out that orthogonality decisions are subject to
  1394  the same pitfalls as software testing.
  1395  For example, inspection (8.8) and testing (8.9) can help detect changes which
  1396  are not orthogonal. Code coverage information (8.10) can be used to ensure the
  1397  validity of the testing itself. But in the end, none of these provide assurance
  1398  that any two changes are orthogonal, and like other testing, we cannot know
  1399  when we have tested or inspected for orthogonality enough.
  1400  Due to this lack of assurance, the cost of predicting orthogonality needs to
  1401  accrue the potential cost of any errors that result from a faulty prediction.
  1402  This error cost includes lost revenue, labor required for recovery, and loss of
  1403  goodwill. We may be able to reduce this error cost, but it cannot be zero -- a
  1404  zero cost implies that we never make mistakes when analyzing orthogonality.
  1405  Because the cost of prediction includes this error cost as well as the cost of
  1406  testing, we know that prediction of orthogonality is more expensive than either
  1407  the testing or error cost alone:
  1408       Cpredict > Cerror
  1409       Cpredict > Ctest
  1410  8.55 - As a crude negative proof, let us take a look at what would happen if we
  1411  were to allow the order of changes to be totally unsequenced on a production
  1412  host. First, if we were to do this, it is apparent that some sequences would
  1413  not work at all, and probably damage the host (8.26). We would need to have a
  1414  way of preventing them from executing, probably by using some sort of exclusion
  1415  list. In order to discover the full list of bad sequences, we would need to
  1416  test and/or inspect each possible sequence.
  1417  This is an intractable problem: the number of possible orderings of M changes
  1418  is M!. If each build/test cycle takes an hour, then any number of changes
beyond 7 or 8 becomes impractical -- testing all orderings of 8 changes
  1420  would require 4.6 years. In practice, we see change sets much larger than this;
  1421  the ISconf version 2i makefile for building HACMP clusters, for instance, has
  1422  sequences as long as 121 operations -- that's 121!/24/365, or 9.24*10^196
  1423  years. It is easier to avoid unsequenced changes.
  1424  The cost of testing and inspection required to enable randomized sequencing
  1425  appears to be greater than the cost of testing a subset of all sequences
  1426  (8.25), and greater than the testing, inspection, and accrued error of
  1427  predicting orthogonality (8.54):
  1428       Crandom > Cpredict > Cpartial
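The arithmetic behind these figures is straightforward; assuming one
build/test cycle per hour:

    import math

    def years_to_test_all_orders(m, hours_per_cycle=1.0):
        # Exhaustive testing of every ordering of m changes takes m! cycles.
        return math.factorial(m) * hours_per_cycle / 24 / 365

    print(years_to_test_all_orders(8))     # about 4.6 years
    print(years_to_test_all_orders(121))   # about 9.2e196 years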
  1429  8.56 - As a self-administering machine changes its disk contents, it may change
  1430  its ability to change its disk contents. A change directive that works now may
  1431  not work in the same way on the same machine in the future and vice versa
  1432  (8.26). There appears to be a need to constrain the order of change directives
  1433  in order to obtain predictable behavior.
  1434  8.57 - In contrast to (8.52), a language that supports execution of an ordered
  1435  set of changes appears to satisfy (8.56), and appears to have the ability to
  1436  fully describe any arbitrary disk content, as in (7.1).
  1437  8.58 - In practice, sysadmins tend to make changes to UNIX hosts as they
  1438  discover the need for them; in response to user request, security concern, or
  1439  bug fix. If the goal is minimum work for maximum reliability, then it would
  1440  appear that the "ideal" sequence is the one which is first known to work -- the
  1441  sequence in which the changes were created and tested. This sequence carries
  1442  the least testing cost. It carries a lower risk than a sequence which has been
  1443  partially tested or not tested at all.
  1444  The costs in sections (8.8), (8.9), (8.25), (8.54), and (8.55) are related to
  1445  each other as shown in (figure_8.58.1). This leads us to these conclusions:
  1446      * Validating, inspecting, testing, and deploying a single sequence (Ctest)
  1447        appears to be the least-cost host change management technique.
  1448      * Adequate testing of partially-ordered sequences (Cpartial) is more
  1449        expensive.
  1450      * Predicting orthogonality between partial sequences (Cpredict) is yet more
  1451        expensive.
  1452      * The testing required to enable random change sequences (Crandom) is more
expensive than any other testing, due to the M! combinatorial explosion
  1454        involved.
  1455  [images/costs.png]
  1456       Figure 8.58.1: Relationship between costs of various ordering
  1457       techniques; larger set size means higher cost.
  1458  8.59 - The behavioral attributes of a complex host seem to be effectively
  1459  infinite over all possible inputs, and therefore difficult to fully quantify
  1460  (8.9). The disk size is finite, so we can completely describe hosts in terms of
  1461  disk content (7.1), but we cannot completely describe hosts in terms of
  1462  behavior. We can easily test all disk content, but we do not seem to be able to
  1463  test all possible behavior.
  1464  This point has important implications for the design of management tools -
  1465  - behavior seems to be a peripheral issue, while disk content seems to play a
  1466  more central role. It would seem that tools which test only for behavior will
  1467  always be convergent at best. Tools which test for disk content have the
  1468  potential to be congruent, but only if they are able to describe the entire
  1469  disk state. One way to describe the entire disk is to support an initial disk
  1470  state description followed by ordered changes, as in (7.1).
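For example, a congruence check by content can be as simple as comparing
manifests of path-to-checksum pairs over the managed portions of the disk. The
sketch below is illustrative only; a real tool would also record ownership,
permissions, and other metadata.

    import hashlib
    import os

    def manifest(root):
        # Map each file under 'root' to a checksum of its content.
        result = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        result[path] = hashlib.md5(f.read()).hexdigest()
                except OSError:
                    pass              # skip unreadable files in this sketch
        return result

    # Two hosts are congruent over 'root' if their manifests are equal.
    print(len(manifest("/etc")))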
  1471  8.60 - There appears to be a general statement we can make about software
  1472  systems that run "on top of" others in a "virtual machine" or other software-
  1473  constructed execution environment (8.34):
  1474       If any virtual machine instruction has the ability to alter the
  1475       virtual machine instruction set, then different instruction execution
  1476       orders can produce different instruction sets. Order of execution of
  1477       these instructions is critical in determining the future instruction
  1478       set of the machine. Faulty order has the potential to remove the
  1479       ability for the machine to update the instruction set or to function
  1480       at all.
  1481  This applies to any application, automatic administration tool (8.37), or
  1482  shared library code executed as root on a UNIX machine (it also applies to
  1483  other cases on other operating systems). These all interact with hardware and
the outside world via the operating system kernel, and have the ability to
  1485  change that same kernel as well as higher-level elements of their "virtual
  1486  machine". This statement appears to be independent of the language of the
  1487  virtual machine instruction set (8.52).
  1488  ***** 9 Conclusion and Critique *****
  1489  One interesting result of automated systems administration efforts might be
  1490  that, like the term 'computer', the term 'system administrator' may someday
  1491  evolve to mean a piece of technology rather than a chained human.
  1492  Sometime in the last few years, we began to suspect that deterministic ordering
  1493  of host changes may be the airfoil of automated systems administration. Many
  1494  other tool designers make use of algorithms that specifically avoid any
  1495  ordering constraint; we accepted ordering as an axiom.
  1496  With this constraint in place, we built and maintained thousands of hosts, in
  1497  many mission-critical production infrastructures worldwide, with excellent
  1498  results. These results included high reliability and security, low cost of
  1499  ownership, rapid deployments and changes, easy turnover, and excellent
  1500  longevity -- after several years, some of our first infrastructures are still
  1501  running and are actively maintained by people we've never met, still using the
  1502  same toolset. Our attempts to duplicate these results while neglecting ordering
have not met these same standards to our satisfaction.
  1504  In this paper, our first attempt at explaining a theoretical reason why these
  1505  results might be expected, we have not "proven" the connection between ordering
  1506  and theory in any mathematical sense. We have, however, been able to provide a
  1507  thought experiment which we hope will help guide future research. Based on this
  1508  thought experiment, it seems that more in-depth theoretical models may be able
  1509  to support our practical results.
  1510  This work seems to imply that, if hosts are Turing equivalent (with the
  1511  possible exception of tape size) and if an automated administration tool is
  1512  Turing equivalent in its use of language, then there may be certain self-
  1513  referential behaviors which we might want to either avoid or plan for. This in
  1514  turn would imply that either order of changes is important, or the host or
  1515  method of administration needs to be constrained to less than Turing
  1516  equivalence in order to make order unimportant. The validity of this claim is
  1517  still an open question. In our deployments we have decided to err on the side
  1518  of ordering.
  1519  On tape size: one addition to our "thought experiment" might be a stipulation
  1520  that a network-connected host may in fact be fully equivalent to a Universal
  1521  Turing Machine, including infinite tape size, if the network is the Internet.
  1522  This is possibly true, due to the fact that the host's own network interface
  1523  card will always have a lower bandwidth than the growth rate of the Internet
  1524  itself -- the host cannot ever reach "the end of the tape". We have not
  1525  explored the implications or validity of this claim. If true, this claim may be
  1526  especially interesting in light of the recent trend of package management tools
  1527  which are able to self-select, download, and install packages from arbitrary
  1528  servers elsewhere on the Internet.
  1529  Synthesizing a theoretical basis for why "order matters" has turned out to be
  1530  surprisingly difficult. The concepts involve the circular dependency chain
  1531  mentioned in section (5), the dependency trees which conventional package
  1532  management schemes support, as well as the interactions between these and more
  1533  granular changes, such as patches and configuration file edits. Space and
accessibility concerns precluded us from providing rigorous proofs
  1535  for the points made in section (8). Rather than do so, we have tried to express
  1536  these points as hypotheses, and have provided some pointers to some of the
  1537  foundation theories that we believe to be relevant. We encourage others to
  1538  attempt to refute or support these points.
  1539  One issue we have not adequately covered is the fact that changing the order of
actions can not only break machines but also prevent the actions themselves
from completing. Altering order often calls for altering the content of the actions
  1542  themselves if success is to be assured.
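A small hypothetical example of this point: if action B appends a line to a configuration file that action A creates, then merely swapping their order is not enough; B itself must be rewritten to create the file, or it cannot complete at all. (The file path and actions below are illustrative only.)
     # Illustration only: two hypothetical actions whose content depends on
     # their order. Action A installs a config file; action B appends to it.
     import os

     def action_a():
         with open("/tmp/example.conf", "w") as f:
             f.write("Port 22\n")

     def action_b():
         # Cannot complete if action A has not run: nothing exists to edit.
         if not os.path.exists("/tmp/example.conf"):
             raise RuntimeError("example.conf missing; run action A first")
         with open("/tmp/example.conf", "a") as f:
             f.write("PermitRootLogin no\n")

     # A then B succeeds; B then A does not merely misconfigure the host --
     # action B cannot complete at all unless its content is changed.
     action_a()
     action_b()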
  1543  There may be useful vulnerabilities or benefits hidden in the structure of
  1544  section (8). Even after the many months we have spent poring over it, it is
  1545  still certainly more complex than it needs to be, with many intertwined threads
and long chains of assumptions (figure 9.1). One reason for this complexity was
  1547  our desire to avoid forward references within that section; we didn't want to
  1548  inadvertently base any point on circular logic. A much more readable text could
  1549  likely be produced by reworking these threads into a single linear order,
  1550  though that would likely require adding the forward references back in.
  1551  For further theoretical study, we recommend:
  1552      * Gödel Numbers
  1553      * Gödel's Incompleteness Theorem
  1554      * Chomsky's Hierarchy
  1555      * Diagonalization
  1556      * The halting problem
  1557      * NP completeness and the Traveling Salesman Problem
  1558      * Theory of ordered sets
  1559      * Closed-loop control theory
  1560  Starting points for most of these can be found in [greenlaw] [garey]
  1561  [brookshear] [dewdney].
  1562  [images/sref-small.gif]
  1563       Figure 9.1: Thread structure of section (8)
  1564  ***** 10 Acknowledgments *****
  1565  We'd like to thank all souls who strive to better your organizations' computing
  1566  infrastructures, often against active opposition by your own management. You
  1567  know that your efforts are not likely to be understood by your own CIO. You do
  1568  this for the good of the organization and the global economy; you do this in
  1569  order to improve the quality of life of your constituents, often at the cost of
  1570  your own health; you do this because you know it is the right thing to do. In
  1571  this year of security-related tragedies and corporate accounting scandals, you
  1572  know that if the popular media recognized what's going on in our IT departments
  1573  there'd be hell to pay. But you know they won't, not for many years, if ever.
  1574  Still you try to clean up the mess, alone. You are all heroes.
  1575  The debate that was the genesis of this paper began in Mark Burgess' cfengine
  1576  workshop, LISA 2001.
  1577  Alva Couch provided an invaluable sounding board for the theoretical
  1578  foundations of this paper. Paul Anderson endured the intermediate drafts,
  1579  providing valuable constructive criticism. Paul's wife, Jessie, confirmed
  1580  portability of these principles to other operating systems and provided early
  1581  encouragement. Jon Stearley provided excellent last-minute review guidance.
  1582  Joel Huddleston responded to our recall with his usual deep interest in any
  1583  brain-exploding problem, the messier the better.
  1584  The members of the infrastructures list have earned our respect as a group of
  1585  very smart, very capable individuals. Their reactions to drafts were as good as
  1586  rocket fuel. In addition to those mentioned elsewhere, notable mention goes to
  1587  Ryan Nowakowski and Kevin Counts, for their last-minute readthrough of final
  1588  drafts.
  1589  Steve's wife, Joyce Cao Traugott, made this paper possible. Her sense of
  1590  wonder, analytical interest in solving the problem, and unconditional love let
  1591  Steve stay immersed far longer than any of us suspected would be necessary.
  1592  Thank You, Joyce.
  1593  ***** 11 About the Authors *****
  1594  Steve Traugott is a consulting Infrastructure Architect, and publishes tools
  1595  and techniques for automated systems administration. His firm, TerraLuna LLC,
  1596  is a specialty consulting organization that focuses on enterprise
  1597  infrastructure architecture. His deployments have ranged from New York trading
  1598  floors, IBM mainframe UNIX labs, and NASA supercomputers to web farms and
  1599  growing startups. He can be reached via the Infrastructures.Org, TerraLuna.Com,
  1600  or stevegt.com web sites.
  1601  Lance Brown taught himself Applesoft BASIC in 9th grade by pestering the 11th
  1602  graders taking Computer Science so much their teacher gave him a complete copy
  1603  of all the handouts she used for the entire semester. Three weeks later he
  1604  asked for more. He graduated college with a BA in Computer Science, attended
  1605  graduate school, and began a career as a software developer and then systems
  1606  administrator. He has been the lead Unix sysadmin for central servers at the
  1607  National Institute of Environmental Health Sciences in Research Triangle Park,
  1608  North Carolina for the last six years.
  1609  ***** 12 References *****
  1610  [bootstrap] Bootstrapping an infrastructure, Steve Traugott and Joel
  1611  Huddleston, Proceedings of the 12th Systems Administration Conference (LISA
  1612  XII) (USENIX Association: Berkeley, CA), pp. 181, 1998
  1613  [brookshear] Computer Science, An Overview, (very accessible text), J. Glenn
  1614  Brookshear, Addison Wesley, 2000, ISBN 0-201-35747-X
  1615  [centerrun] CenterRun Application Management System, http://www.centerrun.com
  1616  [cfengine] Cfengine, A configuration engine, http://www.cfengine.org/
[church] Review of Turing 1936, A. Church, Journal of Symbolic Logic, 2, pp.
42-43, 1937.
  1619  [couch] The Maelstrom: Network Service Debugging via "Ineffective Procedures",
  1620  Alva Couch and N. Daniels, Proceedings of the Fifteenth Systems Administration
  1621  Conference (LISA XV) (USENIX Association: Berkeley, CA), pp. 63, 2001
  1622  [cvs] Concurrent Version System, http://www.cvshome.org
[cvsup] CVSup Versioned Software Distribution package,
http://www.openbsd.org/cvsup.html
  1625  [debian] Debian Linux, http://www.debian.org
  1626  [dewdney] The (New) Turing Omnibus -- 66 Excursions in Computer Science, A. K.
  1627  Dewdney, W. H. Freeman and Company, 1993
  1628  [eika-sandnes] Scheduling Partially Ordered Events In A Randomized Framework -
  1629  Empirical Results And Implications For Automatic Configuration Management,
  1630  Frode Eika Sandnes, Proceedings of the Fifteenth Systems Administration
  1631  Conference (LISA XV) (USENIX Association: Berkeley, CA), 2001
[elbaum] The Impact of Software Evolution on Code Coverage Information,
Sebastian G. Elbaum, David Gable, Gregg Rothermel, International Conference on
Software Engineering, pp. 170-179, 2001
  1635  [garey] Computers and Intractability, A guide to the theory of NP-Completeness,
Michael R. Garey, David S. Johnson, W.H. Freeman and Company, 2002, ISBN 0-
  1637  7167-1045-5
[godel] Über formal unentscheidbare Sätze der Principia Mathematica und
verwandter Systeme, Kurt Gödel, Monatshefte für Mathematik und Physik, 38:173--
198, 1931.
[godfrey] The Computer as von Neumann Planned It, M.D. Godfrey, D.F. Hendry,
IEEE Annals of the History of Computing, Vol 15, No 1, 1993
[greenlaw] Fundamentals of the Theory of Computation, (includes examples in C
and UNIX shell, detailed references to seminal works), Raymond Greenlaw,
H. James Hoover, Morgan Kaufmann, 1998, ISBN 1-55860-474-X
  1646  [hagerty] Daniel Hagerty, hag@ai.mit.edu, 2002, personal correspondence
  1647  [hamlet] Foundations of Software Testing: Dependability Theory, Dick Hamlet,
  1648  Software Engineering Notes v 19, No.5, Proceedings of the Second ACM SIGSOFT
  1649  Symposium on Foundations of Software Engineering, pp. 128-139, 1994
  1650  [hart] An Analysis of RPM Validation Drift, John Hart and Jeffrey D'Amelia,
  1651  Proceedings of the 16th Systems Administration Conference (USENIX Association:
  1652  Berkeley, CA), 2002
[immunology] Computer immunology, M. Burgess, Proceedings of the Twelfth Systems
  1654  Administration Conference (LISA XII) (USENIX Association: Berkeley, CA), pp.
  1655  283, 1998
  1656  [isconf] ISconf, Infrastructure configuration manager, http://www.isconf.org
  1657  and http://www.infrastructures.org
  1658  [jiang] Basic Notions in Computational Complexity, Tao Jiang, Ming Li, Bala
  1659  Ravikumar, Algorithms and Theory of Computation Handbook p. 24-1, CRC Press,
  1660  1999, ISBN 0-8493-2649-4
  1661  [laitenberger] An encompassing life cycle centric survey of software
  1662  inspection, Oliver Laitenberger and Jean-Marc DeBaud, The Journal of Systems
  1663  and Software, vol 50, num 1, pp. 5--31, 2000
  1664  [lcfg] LCFG: A large scale UNIX configuration system, http://www.lcfg.org
  1665  [lisa] Large Installation Systems Administration Conference, USENIX
  1666  Association, Berkeley, CA, http://www.usenix.org
[mccabe] Software Complexity, Thomas J. McCabe and Arthur H. Watson, Crosstalk,
  1668  Journal of Defense Software Engineering 7, 12 (December 1994): 5-9.
  1669  [nordin] Evolving Turing-Complete Programs for a Register Machine with Self-
  1670  modifying Code, Peter Nordin and Wolfgang Banzhaf, Genetic Algorithms:
  1671  Proceedings of the Sixth International Conference (ICGA95), Morgan Kaufmann, L.
  1672  Eshelman, pp. 318--325, 1995, 15-19, ISBN 1-55860-370-0
  1673  [oetiker] Template Tree II: The Post-Installation Setup Tool, T. Oetiker,
  1674  Proceedings of the Fifteenth Systems Administration Conference (LISA XV)
  1675  (USENIX Association: Berkeley, CA), pp. 179, 2001
  1676  [opsware] Opsware Management System, http://www.opsware.com
  1677  [pikt] PIKT: "Problem Informant/Killer Tool", http://www.pikt.org
  1678  [rdist] Overhauling Rdist for the '90s, M.A. Cooper, Proceedings of the Sixth
  1679  Systems Administration Conference (LISA VI) (USENIX Association: Berkeley, CA),
  1680  pp. 175, 1992
  1681  [richardson] Partition analysis: a method combining testing and verification,
  1682  D. J. Richardson and L. A. Clarke, IEEE Trans. Soft. Eng., 11(12):1477--1490,
  1683  1985
  1684  [rsync] rsync incremental file transfer utility, http://samba.anu.edu.au/rsync
  1685  [ssh] SSH protocol suite of network connectivity tools, http://www.openssh.org
  1686  [sup] The SUP Software Upgrade Protocol, Steven Shafer and Mary Thompson, 1989
  1687  [tivoli] Tivoli Management Framework, http://www.tivoli.com
  1688  [turing] On Computable Numbers, with an Application to the
  1689  Entscheidungsproblem, Alan M. Turing, Proceedings of the London Mathematical
Society, Series 2, 42 (1936-37), pp. 230-265.
[vonneumann] First Draft of a Report on the EDVAC, John von Neumann, IEEE
  1692  Annals of the History of Computing, Vol 15, No 4, 1993
[xilinx] Xilinx Virtex-II Platform FPGA, http://www.xilinx.com