
---
title: awk
type: processor
status: stable
categories: ["Mapping"]
---

<!--
     THIS FILE IS AUTOGENERATED!

     To make changes please edit the contents of:
     lib/processor/awk.go
-->

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';


Executes an AWK program on messages. This processor is very powerful as it
offers a range of [custom functions](#awk-functions) for querying and mutating
message contents and metadata.


<Tabs defaultValue="common" values={[
  { label: 'Common', value: 'common', },
  { label: 'Advanced', value: 'advanced', },
]}>

<TabItem value="common">

```yaml
# Common config fields, showing default values
label: ""
awk:
  codec: text
  program: BEGIN { x = 0 } { print $0, x; x++ }
```

</TabItem>
<TabItem value="advanced">

```yaml
# All config fields, showing default values
label: ""
awk:
  codec: text
  program: BEGIN { x = 0 } { print $0, x; x++ }
  parts: []
```

</TabItem>
</Tabs>
This processor works by feeding message contents into the program as input based
on a chosen [codec](#codecs) and replacing the contents of each message with the
result. If the result is empty (nothing is printed by the program) then the
original message contents remain unchanged.

It comes with a wide range of [custom functions](#awk-functions) for accessing
message metadata, JSON fields, printing logs, etc. These functions can be
overridden by functions within the program.

Check out the [examples section](#examples) to see how this processor can be
used.

This processor uses [GoAWK][goawk]; to understand how it differs from the
standard awk tool you can [read more about it here][goawk.differences].
## Fields

### `codec`

A [codec](#codecs) defines how messages should be inserted into the AWK program as variables. The codec does not change which [custom Benthos functions](#awk-functions) are available. The `text` codec is the closest to a typical AWK use case.


Type: `string`  
Default: `"text"`  
Options: `none`, `text`, `json`.
### `program`

An AWK program to execute.


Type: `string`  
Default: `"BEGIN { x = 0 } { print $0, x; x++ }"`  

### `parts`

An optional array of message indexes of a batch that the processor should apply to.
If left empty all messages are processed. This field is only applicable when
batching messages [at the input level](/docs/configuration/batching).

Indexes can be negative, and if so the part will be selected from the end
counting backwards starting from -1.


Type: `array`  
Default: `[]`  

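As a sketch, a batch-aware config that runs the program only against the first and last messages of each batch might look like the following (the program itself is illustrative):

```yaml
pipeline:
  processors:
  - awk:
      codec: text
      parts: [0, -1]
      program: |
        { print "edge of batch:", $0 }
```
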
## Examples

<Tabs defaultValue="JSON Mapping and Arithmetic" values={[
{ label: 'JSON Mapping and Arithmetic', value: 'JSON Mapping and Arithmetic', },
{ label: 'Stuff With Arrays', value: 'Stuff With Arrays', },
]}>

<TabItem value="JSON Mapping and Arithmetic">


Because AWK is a full programming language it's much easier to map documents and
perform arithmetic with it than with other Benthos processors. For example, if
we were expecting documents of the form:

```json
{"doc":{"val1":5,"val2":10},"id":"1","type":"add"}
{"doc":{"val1":5,"val2":10},"id":"2","type":"multiply"}
```

And we wished to perform the arithmetic specified in the `type` field,
on the values `val1` and `val2` and, finally, map the result into the
document, giving us the following resulting documents:

```json
{"doc":{"result":15,"val1":5,"val2":10},"id":"1","type":"add"}
{"doc":{"result":50,"val1":5,"val2":10},"id":"2","type":"multiply"}
```

We can do that with the following:

```yaml
pipeline:
  processors:
  - awk:
      program: |
        function map_add_vals() {
          json_set_int("doc.result", json_get("doc.val1") + json_get("doc.val2"));
        }
        function map_multiply_vals() {
          json_set_int("doc.result", json_get("doc.val1") * json_get("doc.val2"));
        }
        function map_unknown(type) {
          json_set("error","unknown document type");
          print_log("Document type not recognised: " type, "ERROR");
        }
        {
          type = json_get("type");
          if (type == "add")
            map_add_vals();
          else if (type == "multiply")
            map_multiply_vals();
          else
            map_unknown(type);
        }
```

</TabItem>
<TabItem value="Stuff With Arrays">


It's possible to iterate JSON arrays by appending an index value to the path;
this can be used to do things like removing duplicates from arrays. For example,
given the following input document:

```json
{"path":{"to":{"foos":["one","two","three","two","four"]}}}
```

We could create a new array `foos_unique` from `foos` giving us the result:

```json
{"path":{"to":{"foos":["one","two","three","two","four"],"foos_unique":["one","two","three","four"]}}}
```

With the following config:

```yaml
pipeline:
  processors:
  - awk:
      program: |
        {
          array_path = "path.to.foos"
          array_len = json_length(array_path)

          for (i = 0; i < array_len; i++) {
            ele = json_get(array_path "." i)
            if ( ! ( ele in seen ) ) {
              json_append(array_path "_unique", ele)
              seen[ele] = 1
            }
          }
        }
```

</TabItem>
</Tabs>

## Codecs

The chosen codec determines how the contents of the message are fed into the
program. Codecs only impact the input string and the variables initialised for
your program; they do not change the range of custom functions available.

### `none`

An empty string is fed into the program. Functions can still be used in order to
extract and mutate metadata and message contents.

This is useful for when your program only uses functions and doesn't need the
full text of the message to be parsed by the program, as it is significantly
faster.

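For instance, a minimal sketch of a config that only manipulates metadata and therefore skips parsing entirely (the metadata key is illustrative):

```yaml
pipeline:
  processors:
  - awk:
      codec: none
      program: |
        { metadata_set("processed_at", timestamp_unix()) }
```
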
### `text`

The full contents of the message are fed into the program as a string, allowing
you to reference tokenised segments of the message with variables (`$0`, `$1`,
etc). Custom functions can still be used with this codec.

This is the default codec as it behaves most similarly to typical usage of the
awk command line tool.

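As a sketch, given space separated log lines such as `service_a 200 1.2ms`, a program could pick out individual fields by position (the field positions and message shape here are illustrative):

```yaml
pipeline:
  processors:
  - awk:
      codec: text
      program: |
        { print $1 " returned status " $2 }
```
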
### `json`

An empty string is fed into the program, and variables are automatically
initialised before execution of your program by walking the flattened JSON
structure. Each value is converted into a variable by taking its full path,
e.g. the object:

```json
{
	"foo": {
		"bar": {
			"value": 10
		},
		"created_at": "2018-12-18T11:57:32"
	}
}
```

Would result in the following variable declarations:

```
foo_bar_value = 10
foo_created_at = "2018-12-18T11:57:32"
```

Custom functions can also still be used with this codec.

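Continuing the example above, a sketch of a program reading one of those auto-initialised variables and writing a result back (the target path `foo.bar.doubled` is illustrative):

```yaml
pipeline:
  processors:
  - awk:
      codec: json
      program: |
        { json_set_int("foo.bar.doubled", foo_bar_value * 2) }
```
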
## AWK Functions

### `json_get`

Signature: `json_get(path)`

Attempts to find a JSON value in the input message payload by a
[dot separated path](/docs/configuration/field_paths) and returns it as a string.

### `json_set`

Signature: `json_set(path, value)`

Attempts to set a JSON value in the input message payload identified by a
[dot separated path](/docs/configuration/field_paths). The value argument will
be interpreted as a string.

In order to set non-string values use one of the following typed varieties:

- `json_set_int(path, value)`
- `json_set_float(path, value)`
- `json_set_bool(path, value)`

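For instance, a minimal sketch setting both a string and an integer field (the field names are illustrative):

```yaml
pipeline:
  processors:
  - awk:
      codec: none
      program: |
        {
          json_set("status", "ok")
          json_set_int("attempts", 3)
        }
```
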
### `json_append`

Signature: `json_append(path, value)`

Attempts to append a value to an array identified by a
[dot separated path](/docs/configuration/field_paths). If the target does not
exist it will be created. If the target exists but is not already an array then
it will be converted into one, with its original contents set to the first
element of the array.

The value argument will be interpreted as a string. In order to append
non-string values use one of the following typed varieties:

- `json_append_int(path, value)`
- `json_append_float(path, value)`
- `json_append_bool(path, value)`

### `json_delete`

Signature: `json_delete(path)`

Attempts to delete a JSON field from the input message payload identified by a
[dot separated path](/docs/configuration/field_paths).

### `json_length`

Signature: `json_length(path)`

Returns the size of the string or array value of a JSON field from the input
message payload identified by a [dot separated path](/docs/configuration/field_paths).

If the target field does not exist, or is not a string or array type, then zero
is returned. In order to explicitly check the type of a field use `json_type`.

### `json_type`

Signature: `json_type(path)`

Returns the type of a JSON field from the input message payload identified by a
[dot separated path](/docs/configuration/field_paths).

Possible values are: "string", "int", "float", "bool", "undefined", "null",
"array", "object".

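A sketch of branching on the detected type before acting on a field (the field name `value` and the log message are illustrative; `toupper` is a standard AWK built-in):

```yaml
pipeline:
  processors:
  - awk:
      codec: none
      program: |
        {
          if (json_type("value") == "string")
            json_set("value_upper", toupper(json_get("value")))
          else
            print_log("field value is not a string", "WARN")
        }
```
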
### `create_json_object`

Signature: `create_json_object(key1, val1, key2, val2, ...)`

Generates a valid JSON object of key-value pair arguments. The arguments are
variadic, meaning any number of pairs can be listed. The values will always
resolve to strings regardless of their type. E.g. the following call:

`create_json_object("a", "1", "b", 2, "c", "3")`

Would result in this string:

`{"a":"1","b":"2","c":"3"}`

### `create_json_array`

Signature: `create_json_array(val1, val2, ...)`

Generates a valid JSON array of value arguments. The arguments are variadic,
meaning any number of values can be listed. The values will always resolve to
strings regardless of their type. E.g. the following call:

`create_json_array("1", 2, "3")`

Would result in this string:

`["1","2","3"]`

### `metadata_set`

Signature: `metadata_set(key, value)`

Set a metadata key for the message to a value. The value will always resolve to
a string regardless of the value type.

### `metadata_get`

Signature: `metadata_get(key) string`

Get the value of a metadata key from the message.

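As a sketch, copying one metadata key into another (the key names are illustrative):

```yaml
pipeline:
  processors:
  - awk:
      codec: none
      program: |
        { metadata_set("source_copy", metadata_get("source")) }
```
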
### `timestamp_unix`

Signature: `timestamp_unix() int`

Returns the current unix timestamp (the number of seconds since 01-01-1970).

### `timestamp_unix`

Signature: `timestamp_unix(date) int`

Attempts to parse a date string by detecting its format and returns the
equivalent unix timestamp (the number of seconds since 01-01-1970).

### `timestamp_unix`

Signature: `timestamp_unix(date, format) int`

Attempts to parse a date string according to a format and returns the equivalent
unix timestamp (the number of seconds since 01-01-1970).

The format is defined by showing how the reference time, defined to be
`Mon Jan 2 15:04:05 -0700 MST 2006`, would be displayed if it were the value.

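For example, a sketch that parses a custom date format from a JSON field into a unix timestamp (the field names and format are illustrative):

```yaml
pipeline:
  processors:
  - awk:
      codec: none
      program: |
        { json_set_int("created_unix", timestamp_unix(json_get("created_at"), "2006-01-02 15:04:05")) }
```
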
### `timestamp_unix_nano`

Signature: `timestamp_unix_nano() int`

Returns the current unix timestamp in nanoseconds (the number of nanoseconds
since 01-01-1970).

### `timestamp_unix_nano`

Signature: `timestamp_unix_nano(date) int`

Attempts to parse a date string by detecting its format and returns the
equivalent unix timestamp in nanoseconds (the number of nanoseconds since
01-01-1970).

### `timestamp_unix_nano`

Signature: `timestamp_unix_nano(date, format) int`

Attempts to parse a date string according to a format and returns the equivalent
unix timestamp in nanoseconds (the number of nanoseconds since 01-01-1970).

The format is defined by showing how the reference time, defined to be
`Mon Jan 2 15:04:05 -0700 MST 2006`, would be displayed if it were the value.

### `timestamp_format`

Signature: `timestamp_format(unix, format) string`

Formats a unix timestamp. The format is defined by showing how the reference
time, defined to be `Mon Jan 2 15:04:05 -0700 MST 2006`, would be displayed if
it were the value.

The format is optional, and if omitted RFC3339 (`2006-01-02T15:04:05Z07:00`)
will be used.

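As a sketch, stamping each message with the current date in a custom format (the field name and format are illustrative):

```yaml
pipeline:
  processors:
  - awk:
      codec: none
      program: |
        { json_set("processed_date", timestamp_format(timestamp_unix(), "2006-01-02")) }
```
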
### `timestamp_format_nano`

Signature: `timestamp_format_nano(unixNano, format) string`

Formats a unix timestamp in nanoseconds. The format is defined by showing how
the reference time, defined to be `Mon Jan 2 15:04:05 -0700 MST 2006`, would be
displayed if it were the value.

The format is optional, and if omitted RFC3339 (`2006-01-02T15:04:05Z07:00`)
will be used.

### `print_log`

Signature: `print_log(message, level)`

Prints a Benthos log message at a particular log level. The log level is
optional, and if omitted the level `INFO` will be used.

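A short sketch of conditional logging (the condition and message are illustrative):

```yaml
pipeline:
  processors:
  - awk:
      codec: json
      program: |
        {
          if (json_type("id") == "undefined")
            print_log("message is missing an id field", "WARN")
        }
```
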
[goawk]: https://github.com/benhoyt/goawk
[goawk.differences]: https://github.com/benhoyt/goawk#differences-from-awk
   437