> INFO - Pachyderm 2.0 introduces profound architectural changes to the product. As a result, our examples pre and post 2.0 are kept in two separate branches:
> - Branch Master: Examples using Pachyderm 2.0 and later versions - https://github.com/pachyderm/pachyderm/tree/master/examples
> - Branch 1.13.x: Examples using Pachyderm 1.13 and older versions - https://github.com/pachyderm/pachyderm/tree/1.13.x/examples

# Skip Failed Datums in Your Pipeline

This example describes how you can use the `err_cmd` and `err_stdin` fields
in your pipeline to fail a datum without failing the job and the whole
pipeline. This feature is useful when you have large datasets and multiple
datums, and you do not need all of them to be processed successfully to
move to the next step in your DAG.

For more information about the `err_cmd` field, see the
[`err_cmd` documentation](../../docs/err_cmd.md).

## Prerequisites

Before you begin, verify that you have the following configured in your
environment:

* Pachyderm version 1.9.x or later
* A clone of the Pachyderm repository

To clone the Pachyderm repository, run the following command:

```shell
$ git clone git@github.com:pachyderm/pachyderm.git
```

## Create a Repository

The first step is to create a repository called `input` by running the
following command:

```shell
$ pachctl create repo input
```

## Create a Pipeline

Next, you need to create a pipeline that uses the `input` repository
as input and has the `err_cmd` and `err_stdin` fields specified.

In this example, we use the [error_test.json](error_test.json)
pipeline:

```json
{
  "pipeline": {
    "name": "error_test"
  },
  "description": "A pipeline that checks if the `file1` is present in the datum.",
  "input": {
    "pfs": {
      "glob": "/*",
      "repo": "input"
    }
  },
  "transform": {
    "cmd": [ "bash" ],
    "stdin": [ "if", "[ -a /pfs/input/file1 ]", "then cp /pfs/input/* /pfs/out/", "exit 0", "fi", "exit 1" ],
    "err_cmd": [ "bash" ],
    "err_stdin": [ "if", "[ -a /pfs/input/file2 ]", "then", "exit 0", "fi", "exit 1" ]
  }
}
```

In the pipeline above, the code in `stdin` checks if the datum contains
`file1`. If it does, the code copies everything in `/pfs/input/` to the
`/pfs/out` directory and exits with status 0. If the datum does not include
`file1`, the code exits with a non-zero status, and the datum is checked
against the code in `err_stdin`. That code checks if the datum has `file2`.
If it does, the error code exits with status 0, the datum is marked as
*recovered*, and the job succeeds. If it does not, the error code also exits
with a non-zero status, and the job fails.

Create a pipeline by running the following command from the `examples/err_cmd/`
directory:

```shell
$ pachctl create pipeline -f error_test.json
```

Verify that the pipeline was successfully created:

```shell
$ pachctl list pipeline
NAME       VERSION INPUT    CREATED       STATE / LAST JOB
error_test 1       input:/* 5 seconds ago running / starting
```

## Add Files to the Input Repository

Now, let's add some files to the input repository to watch how your pipeline
code and error code work.

You will add three files, `file1`, `file2`, and `file3`, each containing
one line. Because the glob pattern is `/*`, each file is processed as a
separate datum.

1. Add `file1`:

   ```shell
   $ echo "foo" | pachctl put file input@master:file1
   ```

   When you add `file1`, your pipeline should succeed:

   ```shell
   $ pachctl list job --no-pager
   ID                               PIPELINE   STARTED        DURATION           RESTART PROGRESS  DL UL STATE
   c8860dae5a054ec38a33068f75fe9690 error_test 13 seconds ago Less than a second 0       1 + 0 / 1 4B 4B success
   ```

   The `PROGRESS` column shows `1 + 0 / 1`, which means that you have one
   successfully processed datum.

1. Add `file2`:

   ```shell
   $ echo "bar" | pachctl put file input@master:file2
   ```

   Processing of this datum fails, but because the `err_cmd` code ran successfully,
   the datum is marked as *recovered*, and the job finishes without errors.
   Only `file1` is available in the output commit.

   ```shell
   $ pachctl list job --no-pager
   ID                               PIPELINE   STARTED       DURATION           RESTART PROGRESS      DL UL STATE
   bc3da288ff884d5a9bcb312dd6cf07cb error_test 3 seconds ago Less than a second 0       0 + 1 + 1 / 2 0B 0B success
   c8860dae5a054ec38a33068f75fe9690 error_test 3 minutes ago Less than a second 0       1 + 0 / 1     4B 4B success
   ```

   In the `PROGRESS` column, you can see that the last job did not have
   any successfully processed datums, but it had a skipped datum and a
   recovered datum: `0 + 1 + 1 / 2`.

1. Add `file3`:

   ```shell
   $ echo "baz" | pachctl put file input@master:file3
   ```

   Because the new datum contains neither `file1` nor `file2`, both the
   `cmd` and `err_cmd` code exit with a non-zero status, and the job
   fails:

   ```shell
   $ pachctl list job --no-pager
   ID                               PIPELINE   STARTED        DURATION           RESTART PROGRESS      DL UL STATE
   272370ec03c24cc1be660bf97403712f error_test 26 seconds ago Less than a second 0       0 + 2 / 3     0B 0B failure: failed to process datum:...
   bc3da288ff884d5a9bcb312dd6cf07cb error_test 6 minutes ago  Less than a second 0       0 + 1 + 1 / 2 0B 0B success
   c8860dae5a054ec38a33068f75fe9690 error_test 10 minutes ago Less than a second 0       1 + 0 / 1     4B 4B success
   ```
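
The three outcomes in this walkthrough follow from the exit codes of the `cmd` and `err_cmd` code. The following is a minimal local sketch of that decision, not Pachyderm's implementation: it runs without a cluster, uses temporary directories in place of `/pfs/input` and `/pfs/out`, and the helper names `run_cmd`, `run_err_cmd`, and `datum_state` are hypothetical names chosen for this illustration.

```shell
#!/usr/bin/env bash
# Local sketch: derive a datum's state from the exit codes of the
# `cmd` and `err_cmd` checks. All helper names are hypothetical.

# Mirrors the `stdin` check: succeed and copy the datum if it has file1.
run_cmd() {
  local datum_dir="$1" out_dir="$2"
  if [ -e "$datum_dir/file1" ]; then
    cp "$datum_dir"/* "$out_dir"/
    return 0
  fi
  return 1
}

# Mirrors the `err_stdin` check: succeed only if the datum has file2.
run_err_cmd() {
  local datum_dir="$1"
  [ -e "$datum_dir/file2" ]
}

# Prints the state that this example describes for each combination.
datum_state() {
  local datum_dir="$1" out_dir="$2"
  if run_cmd "$datum_dir" "$out_dir"; then
    echo "processed"
  elif run_err_cmd "$datum_dir"; then
    echo "recovered"
  else
    echo "failed"
  fi
}

# One throwaway directory per datum, matching the "/*" glob.
work="$(mktemp -d)"
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/out"
for name in file1 file2 file3; do
  mkdir -p "$work/$name"
  echo "data" > "$work/$name/$name"
  printf '%s: %s\n' "$name" "$(datum_state "$work/$name" "$work/out")"
done
```

Running the script prints `file1: processed`, `file2: recovered`, and `file3: failed`, and only `file1` ends up in the sketch's output directory, which matches the job results above.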