---
title: awk
type: processor
status: stable
categories: ["Mapping"]
---

<!--
     THIS FILE IS AUTOGENERATED!

     To make changes please edit the contents of:
     lib/processor/awk.go
-->

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';


Executes an AWK program on messages. This processor is very powerful as it
offers a range of [custom functions](#awk-functions) for querying and mutating
message contents and metadata.


<Tabs defaultValue="common" values={[
  { label: 'Common', value: 'common', },
  { label: 'Advanced', value: 'advanced', },
]}>

<TabItem value="common">

```yaml
# Common config fields, showing default values
label: ""
awk:
  codec: text
  program: BEGIN { x = 0 } { print $0, x; x++ }
```

</TabItem>
<TabItem value="advanced">

```yaml
# All config fields, showing default values
label: ""
awk:
  codec: text
  program: BEGIN { x = 0 } { print $0, x; x++ }
  parts: []
```

</TabItem>
</Tabs>

This processor works by feeding message contents into the program as input based
on a chosen [codec](#codecs), and replaces the contents of each message with the
result. If the result is empty (nothing is printed by the program) then the
original message contents remain unchanged.

It comes with a wide range of [custom functions](#awk-functions) for accessing
message metadata, JSON fields, printing logs, etc. These functions can be
overridden by functions within the program.

Check out the [examples section](#examples) to see how this processor can be
used.

This processor uses [GoAWK][goawk]; to understand how its behaviour differs
from the classic awk tool you can [read more about it here][goawk.differences].
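As a quick sketch of the default `text` codec, the following hypothetical config extracts the second whitespace-delimited token of each message and stores it as metadata (the metadata key `level` is made up for this illustration, not something the processor sets for you):

```yaml
pipeline:
  processors:
    - awk:
        codec: text
        program: |
          {
            # With the text codec, $1, $2, ... are the whitespace-delimited
            # tokens of the message contents; here the second token is
            # stored under the illustrative metadata key "level".
            metadata_set("level", $2);
          }
```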
## Fields

### `codec`

A [codec](#codecs) defines how messages should be inserted into the AWK program as variables. The codec does not change which [custom Benthos functions](#awk-functions) are available. The `text` codec is the closest to a typical AWK use case.


Type: `string`
Default: `"text"`
Options: `none`, `text`, `json`.

### `program`

An AWK program to execute.


Type: `string`
Default: `"BEGIN { x = 0 } { print $0, x; x++ }"`

### `parts`

An optional array of message indexes of a batch that the processor should apply to.
If left empty all messages are processed. This field is only applicable when
batching messages [at the input level](/docs/configuration/batching).

Indexes can be negative, and if so the part will be selected from the end
counting backwards starting from -1.


Type: `array`
Default: `[]`

## Examples

<Tabs defaultValue="JSON Mapping and Arithmetic" values={[
  { label: 'JSON Mapping and Arithmetic', value: 'JSON Mapping and Arithmetic', },
  { label: 'Stuff With Arrays', value: 'Stuff With Arrays', },
]}>

<TabItem value="JSON Mapping and Arithmetic">


Because AWK is a full programming language it's much easier to map documents and
perform arithmetic with it than with other Benthos processors.
For example, if we were expecting documents of the form:

```json
{"doc":{"val1":5,"val2":10},"id":"1","type":"add"}
{"doc":{"val1":5,"val2":10},"id":"2","type":"multiply"}
```

And we wished to perform the arithmetic specified in the `type` field on the
values `val1` and `val2` and, finally, map the result into the document, giving
us the following resulting documents:

```json
{"doc":{"result":15,"val1":5,"val2":10},"id":"1","type":"add"}
{"doc":{"result":50,"val1":5,"val2":10},"id":"2","type":"multiply"}
```

We can do that with the following:

```yaml
pipeline:
  processors:
    - awk:
        program: |
          function map_add_vals() {
            json_set_int("doc.result", json_get("doc.val1") + json_get("doc.val2"));
          }
          function map_multiply_vals() {
            json_set_int("doc.result", json_get("doc.val1") * json_get("doc.val2"));
          }
          function map_unknown(type) {
            json_set("error", "unknown document type");
            print_log("Document type not recognised: " type, "ERROR");
          }
          {
            type = json_get("type");
            if (type == "add")
              map_add_vals();
            else if (type == "multiply")
              map_multiply_vals();
            else
              map_unknown(type);
          }
```

</TabItem>
<TabItem value="Stuff With Arrays">


It's possible to iterate JSON arrays by appending an index value to the path,
which can be used to do things like removing duplicates from arrays.
For example, given the following input document:

```json
{"path":{"to":{"foos":["one","two","three","two","four"]}}}
```

We could create a new array `foos_unique` from `foos`, giving us the result:

```json
{"path":{"to":{"foos":["one","two","three","two","four"],"foos_unique":["one","two","three","four"]}}}
```

With the following config:

```yaml
pipeline:
  processors:
    - awk:
        program: |
          {
            array_path = "path.to.foos"
            array_len = json_length(array_path)

            for (i = 0; i < array_len; i++) {
              ele = json_get(array_path "." i)
              if ( ! ( ele in seen ) ) {
                json_append(array_path "_unique", ele)
                seen[ele] = 1
              }
            }
          }
```

</TabItem>
</Tabs>

## Codecs

The chosen codec determines how the contents of the message are fed into the
program. Codecs only impact the input string and the variables initialised for
your program; they do not change the range of custom functions available.

### `none`

An empty string is fed into the program. Functions can still be used in order to
extract and mutate metadata and message contents.

This is useful when your program only uses functions and doesn't need the full
text of the message to be parsed by the program, as it is significantly faster.

### `text`

The full contents of the message are fed into the program as a string, allowing
you to reference tokenised segments of the message with variables ($0, $1, etc).
Custom functions can still be used with this codec.

This is the default codec as it behaves most similarly to typical usage of the
awk command line tool.

### `json`

An empty string is fed into the program, and variables are automatically
initialised before execution of your program by walking the flattened JSON
structure. Each value is converted into a variable by taking its full path, e.g.
the object:

```json
{
  "foo": {
    "bar": {
      "value": 10
    },
    "created_at": "2018-12-18T11:57:32"
  }
}
```

Would result in the following variable declarations:

```
foo_bar_value = 10
foo_created_at = "2018-12-18T11:57:32"
```

Custom functions can also still be used with this codec.

## AWK Functions

### `json_get`

Signature: `json_get(path)`

Attempts to find a JSON value in the input message payload by a
[dot separated path](/docs/configuration/field_paths) and returns it as a string.

### `json_set`

Signature: `json_set(path, value)`

Attempts to set a JSON value in the input message payload identified by a
[dot separated path](/docs/configuration/field_paths). The value argument will
be interpreted as a string.

In order to set non-string values use one of the following typed varieties:

- `json_set_int(path, value)`
- `json_set_float(path, value)`
- `json_set_bool(path, value)`

### `json_append`

Signature: `json_append(path, value)`

Attempts to append a value to an array identified by a
[dot separated path](/docs/configuration/field_paths). If the target does not
exist it will be created. If the target exists but is not already an array then
it will be converted into one, with its original contents set to the first
element of the array.

The value argument will be interpreted as a string. In order to append
non-string values use one of the following typed varieties:

- `json_append_int(path, value)`
- `json_append_float(path, value)`
- `json_append_bool(path, value)`

### `json_delete`

Signature: `json_delete(path)`

Attempts to delete a JSON field from the input message payload identified by a
[dot separated path](/docs/configuration/field_paths).
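As a sketch of how the mutation functions above combine, the following hypothetical config moves a numeric field by copying it with the typed setter and then deleting the original (the field names `old_count` and `count` are made up for this illustration):

```yaml
pipeline:
  processors:
    - awk:
        codec: none
        program: |
          {
            # Copy the value to the new path as an integer, then remove
            # the original field. The paths here are illustrative only.
            json_set_int("count", json_get("old_count"));
            json_delete("old_count");
          }
```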
### `json_length`

Signature: `json_length(path)`

Returns the size of the string or array value of a JSON field from the input
message payload identified by a [dot separated path](/docs/configuration/field_paths).

If the target field does not exist, or is not a string or array type, then zero
is returned. In order to explicitly check the type of a field use `json_type`.

### `json_type`

Signature: `json_type(path)`

Returns the type of a JSON field from the input message payload identified by a
[dot separated path](/docs/configuration/field_paths).

Possible values are: "string", "int", "float", "bool", "undefined", "null",
"array", "object".

### `create_json_object`

Signature: `create_json_object(key1, val1, key2, val2, ...)`

Generates a valid JSON object of key value pair arguments. The arguments are
variadic, meaning any number of pairs can be listed. The value will always
resolve to a string regardless of the value type. E.g. the following call:

`create_json_object("a", "1", "b", 2, "c", "3")`

Would result in this string:

`{"a":"1","b":"2","c":"3"}`

### `create_json_array`

Signature: `create_json_array(val1, val2, ...)`

Generates a valid JSON array of value arguments. The arguments are variadic,
meaning any number of values can be listed. The value will always resolve to a
string regardless of the value type. E.g. the following call:

`create_json_array("1", 2, "3")`

Would result in this string:

`["1","2","3"]`

### `metadata_set`

Signature: `metadata_set(key, value)`

Set a metadata key for the message to a value. The value will always resolve to
a string regardless of the value type.

### `metadata_get`

Signature: `metadata_get(key) string`

Get the value of a metadata key from the message.
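To illustrate the metadata functions together, the following hypothetical config copies a metadata value into the document and stamps a constant back onto the message (the metadata key `kafka_topic` is an assumption for this sketch; whether it exists depends entirely on your input):

```yaml
pipeline:
  processors:
    - awk:
        codec: none
        program: |
          {
            # "kafka_topic" is a hypothetical metadata key for this sketch;
            # its value is written into the document, and a new metadata
            # key is set on the message.
            json_set("meta.topic", metadata_get("kafka_topic"));
            metadata_set("processed_by", "awk");
          }
```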
### `timestamp_unix`

Signature: `timestamp_unix() int`

Returns the current unix timestamp (the number of seconds since 01-01-1970).

### `timestamp_unix`

Signature: `timestamp_unix(date) int`

Attempts to parse a date string by detecting its format and returns the
equivalent unix timestamp (the number of seconds since 01-01-1970).

### `timestamp_unix`

Signature: `timestamp_unix(date, format) int`

Attempts to parse a date string according to a format and returns the equivalent
unix timestamp (the number of seconds since 01-01-1970).

The format is defined by showing how the reference time, defined to be
`Mon Jan 2 15:04:05 -0700 MST 2006`, would be displayed if it were the value.

### `timestamp_unix_nano`

Signature: `timestamp_unix_nano() int`

Returns the current unix timestamp in nanoseconds (the number of nanoseconds
since 01-01-1970).

### `timestamp_unix_nano`

Signature: `timestamp_unix_nano(date) int`

Attempts to parse a date string by detecting its format and returns the
equivalent unix timestamp in nanoseconds (the number of nanoseconds since
01-01-1970).

### `timestamp_unix_nano`

Signature: `timestamp_unix_nano(date, format) int`

Attempts to parse a date string according to a format and returns the equivalent
unix timestamp in nanoseconds (the number of nanoseconds since 01-01-1970).

The format is defined by showing how the reference time, defined to be
`Mon Jan 2 15:04:05 -0700 MST 2006`, would be displayed if it were the value.

### `timestamp_format`

Signature: `timestamp_format(unix, format) string`

Formats a unix timestamp. The format is defined by showing how the reference
time, defined to be `Mon Jan 2 15:04:05 -0700 MST 2006`, would be displayed if
it were the value.
The format is optional, and if omitted RFC3339 (`2006-01-02T15:04:05Z07:00`)
will be used.

### `timestamp_format_nano`

Signature: `timestamp_format_nano(unixNano, format) string`

Formats a unix timestamp in nanoseconds. The format is defined by showing how
the reference time, defined to be `Mon Jan 2 15:04:05 -0700 MST 2006`, would be
displayed if it were the value.

The format is optional, and if omitted RFC3339 (`2006-01-02T15:04:05Z07:00`)
will be used.

### `print_log`

Signature: `print_log(message, level)`

Prints a Benthos log message at a particular log level. The log level is
optional, and if omitted the level `INFO` will be used.

[goawk]: https://github.com/benhoyt/goawk
[goawk.differences]: https://github.com/benhoyt/goawk#differences-from-awk
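To tie the timestamp functions together, the following hypothetical config normalises a date field into RFC3339 (the field name `created_at` is made up for this sketch):

```yaml
pipeline:
  processors:
    - awk:
        codec: none
        program: |
          {
            # "created_at" is an illustrative field name. timestamp_unix
            # auto-detects the input date format, and timestamp_format
            # falls back to RFC3339 when the format argument is omitted.
            ts = timestamp_unix(json_get("created_at"));
            json_set("created_at", timestamp_format(ts));
          }
```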