github.com/craigkelly/dmk@v1.0.1/README.md (about)

     1  ---
     2  title: DMK ReadMe
     3  ---
     4  
     5  This is a simplified, automated build tool for data projects.
     6  
     7  The idea behind dmk is to support build files that are easy to read *and*
     8  write, and to support automating a build system for data artifacts. `make`,
     9  `scons`, and `fac` all provided inspiration for `dmk`.
    10  
    11  # Who should use this?
    12  
    13  This is a tool for data flows and simple projects. Often these projects
    14  involve one-time steps for getting the data used for analysis. Often the user
    15  manually places that data in the directory after downloading it from Amazon S3
    16  or a research server or whatever. Scripts or programs run in a pipeline
    17  fashion: handling cleaning, transformation, analysis, model building,
    18  figure production, presentation building, etc.
    19  
    20  Pipelines like this are often written in Python or R, partially automated with
    21  shell scripts, and then tied together with a Makefile (or a SConstruct file if
    22  you're an `scons` fan).
    23  
    24  *Protip*: if you're looking for a command to handle building reports from `.tex`
    25  files (including handling metapost and biblatex), look into `rubber`.
    26  
    27  # What is this NOT for?
    28  
    29  This is *not* meant to replace a real automated build tool for a software
    30  project. As a general rule:
    31  
    32  * If you're building Go software, use the Go tools (and optionally make)
    33  * If you're building Java/Scala use `sbt`, `gradle`, `mvn`, `ant`, etc
    34  * If you're building .NET, erm, I'm not sure
    35  * There are great tools like `scons` that understand how to build lots of artifacts (including LaTeX docs)
    36  * If you're not sure, at least understand why you wouldn't use `make`
    37  
    38  For instance, this project (written in Go) is actually built with `make` + the
    39  standard Go tools.
    40  
    41  # Using
    42  
    43  When running `dmk`, you may specify `-h` on the command line for information
    44  on command line parameters. Specify `-v` for "verbose" mode to get output you
    45  may want for debugging or understanding what is going on.
    46  
    47  For each command in a pipeline, you need to supply:
    48  
    49  * A name
    50  * The actual command to run (in the shell)
    51  * The inputs required
    52  * The outputs generated
    53  
    54  This list is not exhaustive; see below for everything you can specify for a
    55  build step.
    56  
    57  The file is generally named `Pipeline` or `pipeline.yaml`. If you do not
    58  specify a pipeline file with the `-f` command line parameter, `dmk` looks for
    59  the following names in the current directory (in order):
    60  
    61  * Pipeline
    62  * Pipeline.yaml
    63  * pipeline
    64  * pipeline.yaml
    65  * .Pipeline.yaml
    66  * .pipeline.yaml
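
The lookup order above can be sketched in a few lines of Python. This is only an illustration of the search described here, not `dmk`'s actual (Go) implementation; the function name `find_pipeline` is made up for this example.

```python
from pathlib import Path
from typing import Optional

# Default names, in the order listed above.
DEFAULT_NAMES = [
    "Pipeline",
    "Pipeline.yaml",
    "pipeline",
    "pipeline.yaml",
    ".Pipeline.yaml",
    ".pipeline.yaml",
]


def find_pipeline(directory: str = ".") -> Optional[Path]:
    """Return the first default pipeline file found in `directory`, or None."""
    for name in DEFAULT_NAMES:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None
```

Note that on a case-insensitive file system `Pipeline` and `pipeline` refer to the same file, so the first matching name in the list is the one reported.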
    67  
    68  You may also supply a custom name with the `-f` command line flag. If the
    69  pipeline file is in a different directory, dmk will change to that directory
    70  before parsing the config file. Note that the bash tab completion logic will
    71  show you step names, but it isn't smart enough to know that you've specified
    72  `-f` on the command line.
    73  
    74  All build steps run in parallel, but each step waits until other steps build
    75  its dependencies. A single build step executes the following in order:
    76  
    77  1. The step is "Started"
    78  2. If any of the required inputs are another step's outputs, then wait for a built message.
    79  3. Check to see if *any* outputs are older than *any* of the inputs. If not, then the step is "Completed"!
    80  4. If not done, set status to "Executing" and run the command.
    81  5. If the command returns an error code or if *any* outputs are missing or older than *any* inputs, the step is "Failed".
    82  6. Send notification messages for each output for any waiting steps.
    83  7. The step is now "Completed"
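
The up-to-date check in steps 3 and 5 can be sketched with file modification times. This is a minimal illustration of the rule as worded above, not `dmk`'s own code; `needs_run` is a hypothetical name, and the handling of a step with no outputs is an assumption.

```python
import os
from typing import Iterable


def needs_run(inputs: Iterable[str], outputs: Iterable[str]) -> bool:
    """Return True if the step's command should run under the rule above."""
    inputs, outputs = list(inputs), list(outputs)

    # A missing output always forces a run.
    if any(not os.path.exists(path) for path in outputs):
        return True

    # Assumption: with no outputs, nothing proves the step is current.
    if not outputs:
        return True

    # With no inputs, existing outputs cannot be older than anything.
    if not inputs:
        return False

    # Stale if *any* output is older than *any* input, i.e. the oldest
    # output predates the newest input. A missing input raises
    # FileNotFoundError here.
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return oldest_output < newest_input
```

The same comparison, applied after the command finishes, corresponds to the failure condition in step 5.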
    84  
    85  The outputs for a step must be unique to that step: you can't have two steps
    86  both list `foo.data` as an output. (Note that this applies to *expanded* output
    87  names, and abstract/baseSteps aren't checked.)
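
As a rough sketch of the uniqueness rule, the check below walks a pipeline that has already been loaded into a Python dictionary and reports any output claimed by more than one concrete step. The function name is hypothetical and the sketch assumes output names have already been expanded.

```python
from typing import Dict, List


def duplicate_outputs(pipeline: Dict[str, dict]) -> Dict[str, List[str]]:
    """Map each output claimed by more than one concrete step to those steps."""
    owners: Dict[str, List[str]] = {}
    for name, step in pipeline.items():
        if step.get("abstract", False):
            continue  # abstract/base steps are not checked
        for output in step.get("outputs", []):
            owners.setdefault(output, []).append(name)
    return {out: names for out, names in owners.items() if len(names) > 1}
```

An empty result means every output belongs to exactly one step.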
    88  
    89  `dmk` provides an automatic "clean" mode that deletes all outputs. To use it,
    90  specify `-c` on the command line. `dmk` will delete all the outputs for all
    91  steps. If you have files to clean not specified as outputs, you can specify
    92  them in the _clean_ list for a build step (see the Pipeline file format
    93  below). Good candidates for the _clean_ section are intermediate files (such
    94  as logs) generated as part of a build process that are not dependencies and
    95  should not determine if a build step is up to date.
    96  
    97  You may also run `dmk` with `-listSteps` to see a list of all steps in the current
    98  pipeline file. Currently, this is used for bash completion.
    99  
   100  # Pipeline file format
   101  
   102  The file is in YAML format where each build step is a named hash. Each build
   103  step should specify:
   104  
   105  * _command_ - The command to run as part of the build. `dmk` uses bash to run
   106    the command, so it can rely on bash shell niceties (like using `~` for the
   107    home directory)
   108  * _inputs_ - a list of inputs needed for the build. These are also the
   109    dependencies that must exist before the step can run. An entry can be a
   110    glob pattern (like `*.txt`)
   111  * _outputs_ - a list of outputs generated by the step. Outputs decide if the
   112    step must run, and the clean phase deletes them. Glob patterns are
   113    **ignored** for outputs.
   114  * _clean_ - A list of files to clean. These and outputs are the files deleted
   115    during a clean. You may use glob patterns for these.
   116  * _explicit_ - Optional, defaults to false. If set to true, the step will
   117    run if you specify it on the command line. It will not run by default. Any steps
   118    required by steps specified on the command line will also run, regardless of their
   119    _explicit_ setting.
   120  * _delOnFail_ - Optional, defaults to false. If set to true and the step fails,
   121    then `dmk` will delete all the step's output files.
   122  * _direct_ - Optional, defaults to false. If set to true, both stdout and stderr
   123    from the step are written to the `dmk` process standard streams. If set to false
   124    (the default), stdout and stderr are written in single blocks after the step
   125    completes (stdout is only written if `dmk` is running in *verbose* mode).
   126    Note that in *direct* mode (`direct: true`), step output may be interleaved with
   127    "normal" output when steps are running in parallel!
   128  * _abstract_ - Optional, defaults to false. If set to true, the step is a "base step"
   129    and will never be executed (it only serves as a baseStep for other steps). See below.
   130  * _baseStep_ - Optional, defaults to empty. If specified, it must be the name of a
   131    step with `abstract: true`. In that case the step's properties will be based
   132    on the step given. See below.
   133  * _vars_ - Optional, defaults to empty dictionary. If specified, this must be a
   134    hash/dictionary with strings as both keys and values. The keys are treated as
   135    variable names, which are replaced with their corresponding values. See below
   136    for variable details.
   137  
   138  The `res` subdirectory contains sample Pipeline files (used for testing), but
   139  a quick example would look like:
   140  
   141  ````yaml
   142  # You can have comments in a file
   143  step1:                                # first step
   144      command: "xformxyz i{1,2,3}.txt"  # command with some shell magic
   145      inputs:                           # 3 inputs (read by our imaginary command)
   146          - i1.txt                  
   147          - i2.txt
   148          - i3.txt
   149      outputs:                          # 3 outputs
   150          - o1.txt
   151          - o2.txt
   152          - o3.txt
   153      clean: [a.aux, b.log]             # two extra clean targets, specified in
   154                                        # an alternate syntax for YAML lists
   155  
   156  step2:                                # second step
   157      command: cmd1xyz                  # note the lack of inputs - this means
   158      outputs:                          # the step will run without waiting for
   159          - output.bin                  # other steps to complete.
   160  
   161  depstep:                              # third/final step: it won't run until the
   162      command: cmd2xyz                  # previous steps finish because their
   163      inputs:                           # outputs are in this step's inputs.
   164          - o3.txt                      
   165          - output.bin
   166      outputs:
   167          - combination.output
   168      clean:
   169          - need-cleaning.*             # An example of using a glob pattern
   170      delOnFail: true
   171  
   172  extrastep:
   173      command: special-command
   174      inputs:
   175          - some-script-file.txt
   176      outputs:
   177          - my-special-file.extra
   178      explicit: true                    # Run if specified on command line (and not by default)
   179  ````
   180  
   181  If you were to run `dmk -c` then it would delete the following files:
   182  
   183  * o1.txt, o2.txt, o3.txt because they are outputs of `step1`
   184  * a.aux and b.log because they are in the `clean` list in `step1`
   185  * output.bin because of `step2`
   186  * combination.output and any files matching the pattern `need-cleaning.*` because of `depstep`
   187  
   188  Note that my-special-file.extra from `extrastep` is not deleted unless you specify
   189  `extrastep` on the command line.
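
The clean list above can be reproduced with a short sketch. It assumes the example pipeline has been loaded into a dictionary (abbreviated here to the fields that matter for cleaning) and that patterns are resolved relative to the current directory. `clean_targets` is a hypothetical helper, not part of `dmk`.

```python
import glob
from typing import Dict, List, Sequence

# The example pipeline, reduced to the fields used when cleaning.
PIPELINE = {
    "step1": {"outputs": ["o1.txt", "o2.txt", "o3.txt"], "clean": ["a.aux", "b.log"]},
    "step2": {"outputs": ["output.bin"]},
    "depstep": {"outputs": ["combination.output"], "clean": ["need-cleaning.*"]},
    "extrastep": {"outputs": ["my-special-file.extra"], "explicit": True},
}


def clean_targets(pipeline: Dict[str, dict], requested: Sequence[str] = ()) -> List[str]:
    """List the files a clean run would try to delete."""
    targets: List[str] = []
    for name, step in pipeline.items():
        if step.get("abstract", False):
            continue  # assumption: template steps own no files
        if step.get("explicit", False) and name not in requested:
            continue  # explicit steps are only cleaned when named
        # Outputs are taken literally (glob patterns are ignored for outputs).
        targets.extend(step.get("outputs", []))
        # Clean entries may be glob patterns; only existing matches are listed.
        for pattern in step.get("clean", []):
            targets.extend(sorted(glob.glob(pattern)))
    return targets
```

Calling `clean_targets(PIPELINE, ["extrastep"])` additionally lists `my-special-file.extra`, matching the note above.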
   190  
   191  After cleaning, if you run `dmk` the following steps would occur:
   192  
   193  * The commands from `step1` (`xformxyz i{1,2,3}.txt`) and `step2` (`cmd1xyz`)
   194    would run
   195  * When they were both finished, `depstep` would start and `cmd2xyz` would run.
   196  * As before, `extrastep` would NOT run.
   197  * If the `depstep` command (`cmd2xyz`) fails, then `dmk` will delete
   198    `combination.output` (if it exists).
   199  * If all the steps succeed, running `dmk` again would not cause
   200    any command to run (because all outputs are newer than their steps' inputs).
   201  
   202  If you were to run `dmk extrastep` then the command `special-command` would run.
   203  Nothing else would run.
   204  
   205  If you were to run `dmk extrastep depstep` then all steps would run (because
   206  `step1` and `step2` are `depstep` dependencies).
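
The three scenarios above follow from one selection rule: start from the requested steps (or every non-explicit step if none are requested), then pull in any step that produces one of their inputs. The sketch below is an illustration of that rule under the assumption that the pipeline is already loaded as a dictionary; `steps_to_run` is a made-up name.

```python
from typing import Dict, Sequence, Set

# The example pipeline, reduced to the fields that drive step selection.
PIPELINE = {
    "step1": {"inputs": ["i1.txt", "i2.txt", "i3.txt"],
              "outputs": ["o1.txt", "o2.txt", "o3.txt"]},
    "step2": {"outputs": ["output.bin"]},
    "depstep": {"inputs": ["o3.txt", "output.bin"],
                "outputs": ["combination.output"]},
    "extrastep": {"inputs": ["some-script-file.txt"],
                  "outputs": ["my-special-file.extra"], "explicit": True},
}


def steps_to_run(pipeline: Dict[str, dict], requested: Sequence[str] = ()) -> Set[str]:
    """Return the names of all steps selected for a run."""
    # Which step produces each output file?
    producers = {out: name
                 for name, step in pipeline.items()
                 for out in step.get("outputs", [])}

    if requested:
        pending = list(requested)
    else:
        # Default run: every concrete step that is not marked explicit.
        pending = [name for name, step in pipeline.items()
                   if not step.get("explicit", False)
                   and not step.get("abstract", False)]

    selected: Set[str] = set()
    while pending:
        name = pending.pop()
        if name in selected:
            continue
        selected.add(name)
        # Dependencies run regardless of their own explicit setting.
        for path in pipeline[name].get("inputs", []):
            if path in producers:
                pending.append(producers[path])
    return selected
```

With this sketch, `steps_to_run(PIPELINE, ["extrastep", "depstep"])` selects all four steps, because `depstep` pulls in `step1` and `step2`.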
   207  
   208  # Using Variables
   209  
   210  `dmk` steps support variable expansion.
   211  
   212  Before variable expansion begins, both `inputs` and `clean` are expanded via
   213  globbing (e.g. `*.csv` expands to all files ending in `.csv` in the current
   214  directory.)
   215  
   216  After globbing expansion, `dmk` will expand variables for all the strings in:
   217  
   218  * command
   219  * inputs
   220  * outputs
   221  * clean
   222  
   223  Variables are expanded in the following order:
   224  
   225  1. Any keys from the current step's `vars` section (if specified)
   226  2. Any from the `vars` section of the current step's `baseStep` *if* that
   227     variable wasn't specified by the current step.
   228  3. Any environment variables - note that you can set environment variables
   229     from an env file: see "Build Step Environment" below.
   230  
   231  *IMPORTANT*: `DMK_STEPNAME` is defined at this point, but the other `DMK_`
   232  variables described below in "Build Step Environment" are *not*. However,
   233  the command will be executed in bash and they can be evaluated/used by a
   234  script at run time.
   235  
   236  See "Abstract/Base Steps" below for an example and more explanation of base steps.
   239  
   240  # Abstract/Base Steps
   241  
   242  `dmk` provides a way to create small template steps so that you can simplify
   243  pipeline files.  Often you'll have a few steps that have similar structures. In
   244  that case you can specify a step with `abstract: true`. These steps will never
   245  be executed, but provide a "template" for "concrete" steps.
   246  
   247  If a step specifies another step with `baseStep` then:
   248  
   249  * If the step has no command specified, it takes the command of its base step
   250  * The base step's values for `explicit`, `delOnFail`, and `direct` are all
   251    used, regardless of the child step's settings
   252  * The base step's `inputs`, `outputs`, and `clean` entries are all added to
   253    the child step's lists.
   254  * The base step's `vars` section provides the "defaults" for the child step
   255    (The child's `vars` section always wins)
   256  
   257  Some rules:
   258  
   259  * A step named in `baseStep` must have `abstract: true`
   260  * An abstract step may *not* specify a `baseStep`
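
The inheritance rules above can be summarized as a merge of two dictionaries. The sketch below is one reading of those rules, not `dmk`'s implementation: `apply_base` is a hypothetical name, and placing the base step's list entries before the child's is an assumption (the text only says they are added).

```python
def apply_base(child: dict, base: dict) -> dict:
    """Return a new step dict combining `child` with its abstract `base`."""
    if not base.get("abstract", False):
        raise ValueError("a step named in baseStep must have abstract: true")
    if "baseStep" in base:
        raise ValueError("an abstract step may not specify a baseStep")

    merged = dict(child)
    merged.pop("baseStep", None)

    # The command is inherited only when the child has none of its own.
    if not child.get("command"):
        merged["command"] = base.get("command", "")

    # These flags always come from the base step.
    for flag in ("explicit", "delOnFail", "direct"):
        merged[flag] = base.get(flag, False)

    # List entries from the base are added to the child's lists
    # (ordering here is an assumption).
    for key in ("inputs", "outputs", "clean"):
        merged[key] = list(base.get(key, [])) + list(child.get(key, []))

    # Base vars are defaults; the child's vars always win.
    merged["vars"] = {**base.get("vars", {}), **child.get("vars", {})}
    return merged
```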
   261  
   262  
   263  Example (note that `inputs` and `outputs` are missing):
   264  
   265  ```yaml
   266  base:
   267      command: "echo $A $B $C"
   268      abstract: true
   269      vars:
   270          A: Hello
   271          B: World
   272  stepa:
   273      baseStep: base
   274      vars:
   275          B: There
   276          C: Everyone
   277  stepb:
   278      command: "echo $A $B $C"
   279      vars:
   280          B: Anything
   281          C: Missing
   282  ```
   283  
   284  When `stepa` is executed, it will echo `Hello There Everyone` because it inherits
   285  `A` from its `baseStep`. Note that it does *not* use `B` from its `baseStep`.
   286  
   287  When `stepb` is executed it will echo the string `Anything Missing` because the
   288  command `echo $A Anything Missing` will be executed by bash, which will expand
   289  `$A` to an empty string.
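
The lookup order from "Using Variables" (step `vars`, then base step `vars`, then the environment) can be mimicked with Python's `string.Template`, which leaves unknown `$NAME` references untouched much as `dmk` leaves `$A` for bash in the `stepb` case. This is a sketch of the described behavior only; the exact syntax `dmk` accepts is not covered here, and `expand` is a hypothetical helper.

```python
from string import Template
from typing import Mapping


def expand(text: str,
           step_vars: Mapping[str, str],
           base_vars: Mapping[str, str],
           env: Mapping[str, str]) -> str:
    """Expand $NAME references using step vars, then base vars, then env."""
    values = {}
    # Apply the lowest priority first so later updates win.
    values.update(env)
    values.update(base_vars)
    values.update(step_vars)
    # safe_substitute leaves undefined variables in place for the shell.
    return Template(text).safe_substitute(values)


base_vars = {"A": "Hello", "B": "World"}
print(expand("echo $A $B $C", {"B": "There", "C": "Everyone"}, base_vars, {}))
print(expand("echo $A $B $C", {"B": "Anything", "C": "Missing"}, {}, {}))
```

The first call yields the `stepa` command and the second yields the `stepb` command with `$A` left for bash to expand.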
   290  
   291  # Build Step Environment
   292  
   293  Before reading the pipeline file, `dmk` will load the env file specified by the
   294  (optional) `-e` command line parameter. This functionality comes from the
   295  excellent [GoDotEnv](https://github.com/joho/godotenv) library, which is based
   296  on the Ruby dotenv project.
   297  
   298  When a build step runs, `dmk` sets environment variables in the step command's
   299  process:
   300  
   301  * DMK_VERSION - version string for dmk
   302  * DMK_PIPELINE - absolute path to the pipeline file running
   303  * DMK_STEPNAME - the name of the current step
   304  * DMK_INPUTS - a colon (":") delimited list of inputs for this step
   305  * DMK_OUTPUTS - a colon (":") delimited list of outputs for this step
   306  * DMK_CLEAN - a colon (":") delimited list of extra clean files for this step
   307  
   308  **IMPORTANT!** These `DMK_` variables are set up *after* config file processing
   309  and *will* override any variables set in the environment before startup or via
   310  an env file.
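
A script run as a step command can read these variables from its environment. The sketch below shows one way a Python step script might do that, splitting the colon-delimited lists; the helper names are illustrative only.

```python
import os
from typing import List, Mapping


def split_list(value: str) -> List[str]:
    """Split a colon-delimited list, dropping empty entries."""
    return [item for item in value.split(":") if item]


def step_context(environ: Mapping[str, str] = os.environ) -> dict:
    """Collect the DMK_ variables for the current step into a dictionary."""
    return {
        "version": environ.get("DMK_VERSION", ""),
        "pipeline": environ.get("DMK_PIPELINE", ""),
        "step": environ.get("DMK_STEPNAME", ""),
        "inputs": split_list(environ.get("DMK_INPUTS", "")),
        "outputs": split_list(environ.get("DMK_OUTPUTS", "")),
        "clean": split_list(environ.get("DMK_CLEAN", "")),
    }


if __name__ == "__main__":
    ctx = step_context()
    print(f"{ctx['step']}: {len(ctx['inputs'])} input(s) -> "
          f"{len(ctx['outputs'])} output(s)")
```

Because the lists are colon-delimited, file names that themselves contain a colon cannot be recovered this way.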
   311  
   312  Also note that although `bash` evaluates the command, `dmk` does its own
   313  variable expansion before executing the command. However, only `DMK_STEPNAME`
   314  will be defined for `dmk` variable expansion. See "Using Variables" above for
   315  details.
   316  
   317  When the command for a step is activated, it will inherit the original
   318  environment that `dmk` is running in, modified in this order:
   319  
   320  1. Start with the original environment
   321  2. Set DMK_VERSION
   322  3. Optionally load an .env file, which will update the environment
   323  4. Set DMK_PIPELINE
   324  5. For each step, add the build step environment variables
   325  6. For each step, add the step variables (as defined above)
   326  
   327  
   328  # Some helpful hints to remember
   329  
   330  A pipeline file is a YAML document, and a **JSON** document is valid YAML. For
   331  instance, `res/slowbuild.yaml` and `res/slowbuild.json` are semantically
   332  identical pipeline files. If you need a customized build, you can generate the
   333  pipeline file in the language of your choice in JSON or YAML and then call
   334  `dmk`. As an example, if you have a script named custom.py that outputs a JSON
   335  pipeline on stdout, you can run the JIT pipeline with: `python3 custom.py | dmk -f -`.
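
A `custom.py` along these lines might look like the sketch below. The step names, script names, and data files (`clean.py`, `report.py`, `raw-2020.csv`, and so on) are invented for illustration; only the field names come from the pipeline file format described above.

```python
import json
import sys


def build_pipeline(years):
    """Build one cleaning step per year plus a final report step."""
    pipeline = {}
    for year in years:
        pipeline[f"clean{year}"] = {
            "command": f"python3 clean.py raw-{year}.csv clean-{year}.csv",
            "inputs": [f"raw-{year}.csv", "clean.py"],
            "outputs": [f"clean-{year}.csv"],
        }
    pipeline["report"] = {
        "command": "python3 report.py",
        "inputs": [f"clean-{year}.csv" for year in years] + ["report.py"],
        "outputs": ["report.html"],
        "delOnFail": True,
    }
    return pipeline


if __name__ == "__main__":
    # JSON is valid YAML, so the output can be read as a pipeline file.
    json.dump(build_pipeline([2019, 2020, 2021]), sys.stdout, indent=2)
    sys.stdout.write("\n")
```

Generating the steps in a loop keeps the per-year commands consistent, which is the kind of customization a static pipeline file cannot express.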
   336  
   337  Commands run in a new bash shell (which also means you need bash).
   338  
   339  `dmk` changes to the directory of the Pipeline file, so you can specify file
   340  names relative to the Pipeline file's directory. Of course, the current directory
   341  is not changed if the Pipeline file is *stdin*.
   342  
   343  You may use globbing patterns for the inputs and clean.
   344  
   345  # Building
   346  
   347  `dep` manages dependencies in the vendor directory. Although the project began
   348  with `godep` (which is/was an excellent tool), we're switching to `dep` in
   349  anticipation of it becoming the de facto dependency management tool for Gophers.
   350  In addition, switching to `dep` before it's the standard seems like a good way
   351  to give back to the Go community.
   352  
   353  You shouldn't need to worry about dependencies if you are building with the
   354  `Makefile`. Also note the fact that we use `make` to build `dmk`. We are serious
   355  about using the correct build tool for the job.
   356  
   357  Yes, we currently regenerate `version.go` too frequently. If we ever get a single
   358  contributor or pull request, we'll make it better :)
   359  
   360  You should also have Python 3 installed (for `script/versiongen` and for the test
   361  script `res/slow`).
   362  
   363  `make dist` will build cross-platform binaries in `./dist`. Yes, we commit them
   364  to the repo. Deal with it, they're small.
   365  
   366  `make release` handles tagging and pushing to GitHub.
   367  
   368  `make install` will perform the standard `go install` but will *also* install
   369  the bash completions we make available for `dmk`. Note that this will use
   370  `sudo` and is currently the only way to get the bash completions.
   371  
   372  Before submitting a pull request or merging into a mainline branch, you should
   373  be sure that `make lint` passes with no errors. We use the standard `go vet`
   374  plus a few extras. Even though we don't use the entire `gometalinter` suite, we
   375  do use it to install the linters we use. All this can be handled with
   376  `make lint-install`.
   377