---
title: DMK ReadMe
---

This is a simplified, automated build tool for data projects.

The idea behind dmk is to support build files that are easy to read *and*
write, and to support automating a build system for data artifacts. `make`,
`scons`, and `fac` all provided inspiration for `dmk`.

# Who should use this?

This is a tool for data flows and simple projects. Often these projects
involve one-time steps for getting the data used for analysis. Often the user
manually places that data in the directory after downloading it from Amazon S3
or a research server or whatever. Scripts or programs run in a pipeline
fashion: handling cleaning, transformation, analysis, model building,
figure production, presentation building, etc.

Pipelines like this are often written in Python or R, partially automated with
shell scripts, and then tied together with a Makefile (or a SConstruct file if
you're an `scons` fan).

*Protip*: if you're looking for a command to handle building reports from `.tex`
files (including handling metapost and biblatex), look into `rubber`.

# What is this NOT for?

This is *not* meant to replace a real automated build tool for a software
project. As a general rule:

* If you're building Go software, use the Go tools (and optionally make)
* If you're building Java/Scala, use `sbt`, `gradle`, `mvn`, `ant`, etc.
* If you're building .NET, erm, I'm not sure
* There are great tools like `scons` that understand how to build lots of artifacts (including LaTeX docs)
* If you're not sure, at least understand why you wouldn't use `make`

For instance, this project (written in Go) is actually built with `make` + the
standard Go tools.

# Using

When running `dmk`, you may specify `-h` on the command line for information
on command line parameters.
Specify `-v` for "verbose" mode to get output you
may want for debugging or understanding what is going on.

For each command in a pipeline, you need to supply:

* A name
* The actual command to run (in the shell)
* The inputs required
* The outputs generated

This list is not exhaustive; see below for everything you can specify for a
build step.

The file is generally named `Pipeline` or `pipeline.yaml`. If you do not
specify a pipeline file with the `-f` command line parameter, `dmk` looks for
the following names in the current directory (in order):

* Pipeline
* Pipeline.yaml
* pipeline
* pipeline.yaml
* .Pipeline.yaml
* .pipeline.yaml

You may also supply a custom name with the `-f` command line flag. If the
pipeline file is in a different directory, dmk will change to that directory
before parsing the config file. Note that the bash tab completion logic will
show you step names, but it isn't smart enough to know that you've specified
`-f` on the command line.

All build steps run in parallel, but each step waits until other steps build
its dependencies. A single build step executes the following in order:

1. The step is "Started".
2. If any of the required inputs are another step's outputs, wait for a built message.
3. Check to see if *any* outputs are older than *any* of the inputs. If not, the step is "Completed"!
4. If not done, set the status to "Executing" and run the command.
5. If the command returns an error code, or if *any* outputs are missing or older than *any* inputs, the step is "Failed".
6. Send notification messages for each output to any waiting steps.
7. The step is now "Completed".

The outputs for a step must be unique to that step: you can't have two steps
both list `foo.data` as an output. (Note that this applies to *expanded* output
names, and abstract/base steps aren't checked.)
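The freshness rule in steps 3 and 5 above can be sketched in Python. This is only a rough illustration of the check as described here, not `dmk`'s actual Go implementation:

```python
import os

def needs_build(inputs, outputs):
    """Return True if a step must run: any output is missing,
    or any output is older than any input.
    Sketch of the rule described above, not dmk's real code."""
    out_times = []
    for out in outputs:
        if not os.path.exists(out):
            return True  # a missing output always forces a build
        out_times.append(os.path.getmtime(out))
    in_times = [os.path.getmtime(i) for i in inputs if os.path.exists(i)]
    # Stale if the oldest output predates the newest input.
    return bool(in_times) and min(out_times) < max(in_times)
```
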
`dmk` provides an automatic "clean" mode that deletes all outputs. To use it,
specify `-c` on the command line. `dmk` will delete all the outputs for all
steps. If you have files to clean that are not specified as outputs, you can list
them in the _clean_ list for a build step (see the Pipeline file format
below). Good candidates for the _clean_ section are intermediate files (such
as logs) generated as part of a build process that are not dependencies and
should not determine if a build step is up to date.

You may also run `dmk` with `-listSteps` to see a list of all steps in the current
pipeline file. Currently, this is used for bash completion.

# Pipeline file format

The file is in YAML format where each build step is a named hash. Each build
step should specify:

* _command_ - The command to run as part of the build. `dmk` uses bash to run
  the command, so it can rely on bash shell niceties (like using `~` for the
  home directory)
* _inputs_ - A list of inputs needed for the build. These are also the
  dependencies that must exist before the step can run. An entry can be a
  glob pattern (like `*.txt`)
* _outputs_ - A list of outputs generated by the step. Outputs decide if the
  step must run, and the clean phase deletes them. Glob patterns are
  **ignored** for outputs.
* _clean_ - A list of files to clean. These and the outputs are the files deleted
  during a clean. You may use glob patterns for these.
* _explicit_ - Optional, defaults to false. If set to true, the step will only
  run if you specify it on the command line. It will not run by default. Any steps
  required by steps specified on the command line will also run, regardless of their
  _explicit_ setting.
* _delOnFail_ - Optional, defaults to false. If set to true and the step fails,
  then `dmk` will delete all the step's output files.
* _direct_ - Optional, defaults to false.
  If set to true, both stdout and stderr
  from the step are written to the `dmk` process's standard streams. If set to false
  (the default), stdout and stderr are written in single blocks after the step
  completes (stdout is only written if `dmk` is running in *verbose* mode).
  Note that in *direct* mode (direct=true), step output may be interleaved with
  "normal" output when steps are running in parallel!
* _abstract_ - Optional, defaults to false. If set to true, the step is a "base step"
  and will never be executed (it is only to be used as a baseStep). See below.
* _baseStep_ - Optional, defaults to empty. If specified, it must be the name of a
  step with `abstract: true`. In that case, the step's properties will be based
  on the step given. See below.
* _vars_ - Optional, defaults to an empty dictionary. If specified, this must be a
  hash/dictionary with strings as both keys and values. The keys are treated as
  variable names which are replaced with their corresponding values. See below
  for variable details.

The `res` subdirectory contains sample Pipeline files (used for testing), but
a quick example would look like:

````yaml
# You can have comments in a file
step1:                               # first step
  command: "xformxyz i{1,2,3}.txt"   # command with some shell magic
  inputs:                            # 3 inputs (read by our imaginary command)
    - i1.txt
    - i2.txt
    - i3.txt
  outputs:                           # 3 outputs
    - o1.txt
    - o2.txt
    - o3.txt
  clean: [a.aux, b.log]              # two extra clean targets, specified in
                                     # an alternate syntax for YAML lists

step2:                               # second step
  command: cmd1xyz                   # note the lack of inputs - this means
  outputs:                           # the step will run without waiting for
    - output.bin                     # other steps to complete

depstep:                             # third/final step: it won't run until the
  command: cmd2xyz                   # previous steps finish because their
  inputs:                            # outputs are in this step's inputs
    - o3.txt
    - output.bin
  outputs:
    - combination.output
  clean:
    - need-cleaning.*                # an example of using a glob pattern
  delOnFail: true

extrastep:
  command: special-command
  inputs:
    - some-script-file.txt
  outputs:
    - my-special-file.extra
  explicit: true                     # run if specified on command line (and not by default)
````

If you were to run `dmk -c`, it would delete the following files:

* o1.txt, o2.txt, and o3.txt, because they are outputs of `step1`
* a.aux and b.log, because they are in the `clean` list of `step1`
* output.bin, because of `step2`
* combination.output and any files matching the pattern `need-cleaning.*`, because of `depstep`

Note that my-special-file.extra from `extrastep` is not deleted unless you specify
`extrastep` on the command line.

After cleaning, if you run `dmk`, the following would occur:

* The commands from `step1` (`xformxyz i{1,2,3}.txt`) and `step2` (`cmd1xyz`)
  would run.
* When they were both finished, `depstep` would start and `cmd2xyz` would run.
* As before, `extrastep` would NOT run.
* If the `depstep` command (`cmd2xyz`) fails, then `dmk` will delete
  `combination.output` (if it exists).
* If all the steps succeed, running `dmk` again would not cause
  any command to run (because all outputs are newer than their steps' inputs).

If you were to run `dmk extrastep`, then the command `special-command` would run.
Nothing else would run.

If you were to run `dmk extrastep depstep`, then all steps would run (because
`step1` and `step2` are `depstep` dependencies).

# Using Variables

`dmk` steps support variable expansion.

Before variable expansion begins, both `inputs` and `clean` are expanded via
globbing (e.g. `*.csv` expands to all files ending in `.csv` in the current
directory).
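For instance, a step can glob its inputs; the pattern is resolved to concrete file names before any variable expansion happens. A hypothetical example (the command and file names are made up for illustration):

```yaml
merge:
  command: "cat *.csv > merged.csv"
  inputs:
    - "*.csv"          # globbed first, then variables are expanded
  outputs:
    - merged.csv       # glob patterns would be ignored here
```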
After globbing expansion, `dmk` will expand variables for all the strings in:

* command
* inputs
* outputs
* clean

Variables are expanded in the following order:

1. Any keys from the current step's `vars` section (if specified)
2. Any from the `vars` section of the current step's `baseStep`, *if* that
   variable wasn't specified by the current step
3. Any environment variables - note that you can set environment variables
   from an env file: see "Build Step Environment" below

*IMPORTANT*: `DMK_STEPNAME` is defined at this point, but the other `DMK_`
variables described below in "Build Step Environment" are *not*. However,
the command will be executed in bash, so they can be evaluated/used by a
script at run time.

See "Abstract/Base Steps" below for an example and more explanation of base steps.

# Abstract/Base Steps

`dmk` provides a way to create small template steps so that you can simplify
pipeline files. Often you'll have a few steps that share a similar structure. In
that case you can specify a step with `abstract: true`. These steps will never
be executed, but provide a "template" for "concrete" steps.

If a step names another step with `baseStep`, then:

* If the step has no command specified, it takes the command of its base step
* The base step's values for `explicit`, `delOnFail`, and `direct` are all
  used, regardless of the child step's settings
* The base step's `inputs`, `outputs`, and `clean` entries are all added to
  the child step's lists.
* The base step's `vars` section provides the "defaults" for the child step
  (the child's `vars` section always wins)

Some rules:

* A step named in `baseStep` must have `abstract: true`
* An abstract step may *not* specify a `baseStep`

Example (note that `inputs` and `outputs` are missing):

```yaml
base:
  command: "echo $A $B $C"
  abstract: true
  vars:
    A: Hello
    B: World
stepa:
  baseStep: base
  vars:
    B: There
    C: Everyone
stepb:
  command: "echo $A $B $C"
  vars:
    B: Anything
    C: Missing
```

When `stepa` is executed, it will echo `Hello There Everyone` because it inherits
`A` from its `baseStep`. Note that it does *not* use `B` from its `baseStep`.

When `stepb` is executed, it will echo the string `Anything Missing` because the
command `echo $A Anything Missing` will be executed by bash, which will expand
`$A` to an empty string.

# Build Step Environment

Before reading the pipeline file, `dmk` will load the env file specified by the
(optional) `-e` command line parameter. This functionality comes from the
excellent [GoDotEnv](https://github.com/joho/godotenv) library, which is based
on the Ruby dotenv project.

When a build step runs, `dmk` sets these environment variables in the step command's
process:

* DMK_VERSION - the version string for dmk
* DMK_PIPELINE - the absolute path to the running pipeline file
* DMK_STEPNAME - the name of the current step
* DMK_INPUTS - a colon (":") delimited list of inputs for this step
* DMK_OUTPUTS - a colon (":") delimited list of outputs for this step
* DMK_CLEAN - a colon (":") delimited list of extra clean files for this step

**IMPORTANT!** These `DMK_` variables are set up *after* config file processing
and *will* override any variables set in the environment before startup or via
an env file.
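Since step commands run under bash, a step can read these variables at run time. A hypothetical example (step and file names are made up for illustration):

```yaml
logstep:
  command: "build-report data.bin && echo \"ran step $DMK_STEPNAME\" >> build.log"
  inputs:
    - data.bin
  outputs:
    - report.txt
  clean:
    - build.log          # runtime log; should not affect up-to-date checks
```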
Also note that although `bash` evaluates the command, `dmk` does its own
variable expansion before executing the command. However, only `DMK_STEPNAME`
will be defined for `dmk` variable expansion. See "Using Variables" above for
details.

When the command for a step is activated, it inherits the original
environment that `dmk` is running in, modified in this order:

1. Start with the original environment
2. Set DMK_VERSION
3. Optionally load an .env file, which will update the environment
4. Set DMK_PIPELINE
5. For each step, add the build step environment variables
6. For each step, add the step variables (as defined above)

# Some helpful hints to remember

A pipeline file is a YAML document, and a **JSON** document is valid YAML. For
instance, `res/slowbuild.yaml` and `res/slowbuild.json` are semantically
identical pipeline files. If you need a customized build, you can generate the
pipeline file in the language of your choice in JSON or YAML and then call
`dmk`. As an example, if you have a script named custom.py that outputs a JSON
pipeline on stdout, you can run the JIT pipeline with: `python3 custom.py | dmk -f -`.

Commands run in a new bash shell (which also means you need bash).

`dmk` changes to the directory of the Pipeline file, so you can specify file
names relative to the Pipeline file's directory. Of course, the current directory
is not changed if the Pipeline file is *stdin*.

You may use globbing patterns for the inputs and clean lists.

# Building

`dep` manages dependencies in the vendor directory. Although the project began
with `godep` (which is/was an excellent tool), we're switching to `dep` in
anticipation of it becoming the de facto dependency management tool for Gophers.
In addition, switching to `dep` before it's the standard seems like a good way
to give back to the Go community.
You shouldn't need to worry about dependencies if you are building with the
`Makefile`. Also note that we use `make` to build `dmk`. We are serious
about using the correct build tool for the job.

Yes, we currently regenerate `version.go` too frequently. If we ever get a single
contributor or pull request, we'll make it better :)

You should also have Python 3 installed (for `script/versiongen` and for the test
script `res/slow`).

`make dist` will build cross-platform binaries in `./dist`. Yes, we commit them
to the repo. Deal with it; they're small.

`make release` handles tagging and pushing to GitHub.

`make install` will perform the standard `go install` but will *also* install
the bash completions we make available for `dmk`. Note that this will use
`sudo` and is currently the only way to get the bash completions.

Before submitting a pull request or merging into a mainline branch, you should
be sure that `make lint` passes with no errors. We use the standard `go vet`
plus a few extras. Even though we don't use the entire `gometalinter` suite, we
do use it to install the linters we use. All of this can be handled with
`make lint-install`.