---
layout: default
title: Expression Documentation
---

<div class="row">
<div class="col-sm-3" >
<div class="sidebar" data-spy="affix" data-offset-top="0" data-offset-bottom="0" markdown="1">

* Some TOC
{:toc}

</div>
</div>

<div class="doc-body col-sm-9" markdown="1">

<p class="title h1">{{page.title}}</p>

This section documents Bosun's expression language, which is used to define the trigger condition for an alert. At the highest level the expression language takes various time *series* and reduces them to a *single number*. True or false indicates whether the alert should trigger or not; 0 represents false (don't trigger an alert) and any other number represents true (trigger an alert). An alert can also produce one or more *groups* which define the alert's scope or dimensionality. For example, you could have one alert per host, service, or cluster, or a single alert for your entire environment.

# Fundamentals

## Data Types
There are four data types in Bosun's expression language:

1. **Scalar**: This is the simplest type: a single numeric value with no group associated with it. Keep in mind that an empty group, `{}`, is still a group.
2. **NumberSet**: A number set is a group of tagged numeric values with one value per unique grouping. As a special case, a **scalar** may be used in place of a **numberSet** with a single member with an empty group.
3. **SeriesSet**: A series is an array of timestamp-value pairs and an associated group.
4. **VariantSet**: This is for generic functions. It can be a NumberSet, a SeriesSet, or a Scalar. In the case of a NumberSet or a SeriesSet, that same type is returned; in the case of a Scalar, a NumberSet is returned. Therefore the VariantSet type itself is never returned.

In the vast majority of your alerts you will be getting ***seriesSets*** back from your time series database and ***reducing*** them into ***numberSets***.

## Group keys
Groups are generally provided by your time series database. We also sometimes refer to groups as "Tags". When you query your time series database and get multiple time series back, each time series needs an identifier. So for example if I make a query with something like `host=*` then I will get one time series per host. Host is the tag key, and the various values returned, e.g. `host1`, `host2`, `host3`..., are the tag values. Therefore the group for a single time series is something like `{host=host1}`. A group can have multiple tag keys, and will have one tag value for each key.

Each group can become its own alert instance. This is what we mean by ***scope*** or dimensionality. Thus, you can do things like `avg(q("sum:sys.cpu{host=ny-*}", "5m", "")) > 0.8` to check the CPU usage for many New York hosts at once. The dimensions can be manipulated with our expression language.

### Group Subsets
Various metrics can be combined by operators as long as one group is a subset of the other. A ***subset*** is when one of the groups contains all of the tag key-value pairs in the other. An empty group `{}` is a subset of all groups. `{host=foo}` is a subset of `{host=foo,interface=eth0}`, and neither `{host=foo,interface=eth0}` nor `{host=foo,partition=/}` is a subset of the other. Equal groups are considered subsets of each other.
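For example (the metric and tag names are illustrative), a result grouped by `{host,iface}` can be divided by a result grouped only by `{host}`, because the `{host}` group is a subset of each `{host,iface}` group. The `q` and `avg` functions used here are covered later on this page:

```
# Each interface's traffic as a fraction of its host's total traffic
$per_iface = avg(q("sum:os.net.bytes{host=*,iface=*}", "5m", ""))
$per_host = avg(q("sum:os.net.bytes{host=*}", "5m", ""))
$per_iface / $per_host
```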
## Operators

The standard arithmetic (`+`, binary and unary `-`, `*`, `/`, `%`), relational (`<`, `>`, `==`, `!=`, `>=`, `<=`), and logical (`&&`, `||`, and unary `!`) operators are supported. Examples:

* `q("q") + 1`, which adds one to every element of the result of the query `"q"`
* `-q("q")`, the negation of the results of the query
* `5 > q("q")`, a series of numbers indicating whether each data point is less than five
* `6 / 8`, the scalar value three-quarters

### Series Operations

If you combine two seriesSets with an operator (e.g. `q(..) + q(..)`), then the operation is applied for each point in the series that has a corresponding datapoint on the other side. A corresponding datapoint is one which has the same timestamp (and normal group subset rules apply). If a datapoint has no corresponding datapoint on the other side, then the datapoint is dropped. This is a new feature as of 0.5.0.

### Precedence

From highest to lowest:

1. `()` and the unary operators `!` and `-`
1. `*`, `/`, `%`
1. `+`, `-`
1. `==`, `!=`, `>`, `>=`, `<`, `<=`
1. `&&`
1. `||`

## Numeric constants

Numbers may be specified in decimal (e.g., `123.45`), octal (with a leading zero like `072`), or hex (with a leading 0x like `0x2A`). Exponentials and signs are supported (e.g., `-0.8e-2`).

# The Anatomy of a Basic Alert
<pre>
alert haproxy_session_limit {
    template = generic
    $notes = This alert monitors the percentage of sessions against the session limit in haproxy (maxconn) and alerts when we are getting close to that limit and will need to raise that limit. This alert was created due to a socket outage we experienced for that reason
    $current_sessions = max(q("sum:haproxy.frontend.scur{host=*,pxname=*,tier=*}", "5m", ""))
    $session_limit = max(q("sum:haproxy.frontend.slim{host=*,pxname=*,tier=*}", "5m", ""))
    $query = ($current_sessions / $session_limit) * 100
    warn = $query > 80
    crit = $query > 95
    warnNotification = default
    critNotification = default
}
</pre>

We don't need to understand everything in this alert, but it is worth highlighting a few things to get oriented:

* `haproxy_session_limit` This is the name of the alert. An alert instance is uniquely identified by its alert name and group, e.g. `haproxy_session_limit{host=lb,pxname=http-in,tier=2}`.
* `$notes` This is a variable. Variables are not smart; they are just text replacement. If you are familiar with macros in C, this is a similar concept. These variables can be referenced in notification templates, which is why we have a generic one for notes.
* `q("sum:haproxy.frontend.scur{host=*,pxname=*,tier=*}", "5m", "")` is an OpenTSDB query function. It returns *N* series, and based on the query we know each series will have the host, pxname, and tier tag keys in its group.
* `max(...)` is a reduction function. It takes each **series** and **reduces** it to a **number** (see the Data Types section above).
* `$current_sessions / $session_limit` these variables represent **numbers** and their groups are subsets of each other, therefore you can use the `/` **operator** between them.
* `warn = $query > 80` if this is true (non-zero) then the `warnNotification` will be triggered.

# Query Functions

## Azure Monitor Query Functions

These functions are considered *preview* as of August 2018. The names, signatures, and behavior of these functions might change as they are tested in real-world usage.
The Azure Monitor datasource queries Azure for metric and resource information. These functions are available when [AzureMonitorConf](#system-configuration#azuremonitorconf) is defined in the system configuration.

These requests are subject to the [Azure Resource Manager Request Limits](https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-request-limits), so when using the `az` and `azmulti` functions you should be mindful of how many API calls your alerts are making given your configured check interval. Also, using the historical testing feature to query multiple intervals of time could quickly eat through your request limit.

Currently there is no special treatment or instrumentation of the rate limit by Bosun, other than that errors are expected once the rate limit is hit, and a warning will be logged when a request response shows fewer than 100 reads remaining.

### PrefixKey

PrefixKey is a quoted string used to query Azure with different clients from a single instance of Bosun. It can be passed as a prefix to Azure query functions as in the example below. If no prefix is used then the query will be made on the default Azure client.

```
$resources = ["foo"]azrt("Microsoft.Compute/virtualMachines")
$filteredRes = azrf($resources, "client:.*")
["foo"]azmulti("Percentage CPU", "", $resources, "max", "5m", "1h", "")
```

### az(namespace string, metric string, tagKeysCSV string, rsg string, resName string, agType string, interval string, startDuration string, endDuration string) seriesSet
{: .exprFunc}

az queries the [Azure Monitor REST API](https://docs.microsoft.com/en-us/rest/api/monitor/) for time series data for a specific metric and resource. Responses will include at least two tags: `name=<resourceName>,rsg=<resourceGroupName>`. If the metric supports multiple dimensions and tagKeysCSV is non-empty, additional tag keys are added to the response.

* `namespace` is the Azure namespace that the metric lives under. [Supported metrics with Azure Monitor](https://docs.microsoft.com/en-us/azure/monitoring-and-diagnostics/monitoring-supported-metrics) contains a list of those namespaces, for example `Microsoft.Cache/redis` and `Microsoft.Compute/virtualMachines`.
* `metric` is the name of the metric under the corresponding `namespace` that you want to query, for example `Percentage CPU`.
* `tagKeysCSV` is a comma-separated list of dimension keys that you want the response to group by. For example, the `Per Disk Read Bytes/sec` metric under `Microsoft.Compute/virtualMachines` has a SlotId dimension, so if you pass `"SlotId"` for this argument `SlotId` will become a tag key in the response with the values corresponding to each slot (e.g. `0`).
* `rsg` is the name of the Azure resource group that the resource is in.
* `resName` is the name of the resource.
* `agType` is the type of aggregation to use and can be `avg`, `min`, `max`, `total`, or `count`. If an empty string is given then the default is `avg`.
* `interval` is the Azure timegrain to use without "PT" and in lower case (ISO 8601 duration format). Common supported timegrains are `1m`, `5m`, `15m`, `30m`, `1h`, `6h`, `12h`, and `1d`.
* `startDuration` and `endDuration` set the time window from now - see the OpenTSDB q() function for more details.

Examples:

`az("Microsoft.Compute/virtualMachines", "Percentage CPU", "", "myResourceGroup", "myFavoriteVM", "avg", "5m", "1h", "")`

`az("Microsoft.Compute/virtualMachines", "Per Disk Read Bytes/sec", "SlotId", "myResourceGroup", "myFavoriteVM", "max", "5m", "1h", "")`

### azrt(type string) azureResources
{: .exprFunc}

azrt (Azure Resources By Type) gets a list of Azure resources that exist for a certain type. For example, `azrt("Microsoft.Compute/virtualMachines")` would return all virtualMachine resources. This list of resources can then be passed to `azrf()` (Azure Resource Filter) for additional filtering, or to a query function that takes AzureResources as an argument like `azmulti()`.

An error will be returned if you attempt to pass resources fetched for one Azure client to a different client; in other words, if the resources call (e.g. `azrt()`) uses a different prefix from the time series query (e.g. `azmulti()`).

The underlying implementation of this fetches *all* resources and caches that information. So additional azrt calls within a scheduled check cycle will not result in additional calls to Azure's API.

### azrf(resources azureResources, filter string) azureResources
{: .exprFunc}

azrf (Azure Resource Filter) takes a resource list and filters it to fewer resources based on the filter. The resources argument would usually be an `azrt()` call or another `azrf` call.

The filter argument supports joining terms in `()` as well as the `AND`, `OR`, and `!` operators. The following query terms are supported and are always in the format `something:something`. The first part of each term (the key) is case insensitive.

* `name:<regex>` where the resource name matches the regular expression.
* `rsg:<regex>` where the resource group of the resource matches the regular expression.
* `otherText:<regex>` will match resources based on Azure tags. `otherText` would be the tag key and the regex will match against the tag's value. If the tag key does not exist on the resource then there will be no match.

Regular expressions use Go's regular expressions, which use the [RE2 syntax](https://github.com/google/re2/wiki/Syntax). If you want an exact match and not a substring match, be sure to anchor the term with something like `rsg:^myRSG$`.

Example:

```
$resources = azrt("Microsoft.Compute/virtualMachines")
# Filter resources to those with a client Azure tag that has any value
$filteredRes = azrf($resources, "client:.*")
azmulti("Percentage CPU", "", $filteredRes, "max", "5m", "1h", "")
```

Note that `azrf()` does not take a prefix key since it is filtering resources that have already been retrieved. The resulting azureResources will still be associated with the correct client/prefix.

### azmulti(metric string, tagKeysCSV string, resources AzureResources, agType string, interval string, startDuration string, endDuration string) seriesSet
{: .exprFunc}

azmulti (Azure Multiple Query) queries a metric for multiple resources and returns them as a single series set. The arguments metric, tagKeysCSV, agType, interval, startDuration, and endDuration all behave the same as in the `az` function. Also like the `az` function, the result will be tagged with `rsg`, `name`, and any dimensions from tagKeysCSV.
The resources argument is a list of resources (an azureResourcesType) as returned by `azrt` and `azrf`.

Each resource queried requires an Azure Monitor API call. So if there are 20 items in the resource set, 20 calls are made that count toward the rate limit. This function exists because most metrics do not have dimensions on primary attributes like the machine name.

Example:

```
$resources = azrt("Microsoft.Compute/virtualMachines")
azmulti("Percentage CPU", "", $resources, "max", "5m", "1h", "")
```

## Azure Application Insights Query Functions

Queries for [Azure Application Insights](https://docs.microsoft.com/en-us/azure/application-insights/app-insights-overview) use the same system configuration as the [Azure Monitor Query Functions](/expressions#azure-monitor-query-functions). Therefore these functions are available when [AzureMonitorConf](#system-configuration#azuremonitorconf) is defined in the system configuration. However, a [different API](https://dev.applicationinsights.io/documentation/overview) is used to query these metrics. In order for these to work you will need to have [AAD Auth setup](https://dev.applicationinsights.io/documentation/Authorization/AAD-Application-Setup) for the client user.

Currently only Application Insights [*metrics*](https://dev.applicationinsights.io/documentation/Using-the-API/Metrics) are supported and [events](https://dev.applicationinsights.io/documentation/Using-the-API/Events) are *not* supported.

These queries share the same [Prefix Key as Azure Monitor queries](/expressions#prefixkey).

### aiapp() azureAIApps
{: .exprFunc}

aiapp (Application Insights Apps) gets a list of Azure [application insights applications/resources](https://docs.microsoft.com/en-us/azure/application-insights/app-insights-create-new-resource) to query. This can be passed to the `ai()` function, or filtered to a subset of applications using the `aiappf()` function, which can then also be passed to the `ai()` function.

The implementation for getting the list of applications uses the [Azure components/list REST API](https://docs.microsoft.com/en-us/rest/api/application-insights/components/components_list).

### aiappf(apps azureAIApps, filter string) azureAIApps
{: .exprFunc}

aiappf (Application Insights Apps Filter) filters a list of applications from `aiapp()` to a subset of applications based on the `filter` string. The result can then be passed to the `ai()` function. The filter behaves in a similar way to how [`azrf()`](expressions#azrfresources-azureresources-filter-string-azureresources) filters resources.

The filter argument supports joining terms in `()` as well as the `AND`, `OR`, and `!` operators. The following query terms are supported and are always in the format `something:something`. The first part of each term (the key) is case insensitive.

* `name:<regex>` where the resource name of the insights application matches the regular expression.
* `otherText:<regex>` will match insights applications based on the Azure tags on the insights application resource. `otherText` would be the tag key and the regex will match against the tag's value. If the tag key does not exist on the resource then there will be no match.

Regular expressions use Go's regular expressions, which use the [RE2 syntax](https://github.com/google/re2/wiki/Syntax).
If you want an exact match and not a substring match, be sure to anchor the term with something like `name:^myApp$`.

### ai(metric, segmentsCSV, filter string, apps azureAIApps, agType, interval, startDuration, endDuration string) seriesSet
{: .exprFunc}

ai (Application Insights) queries application insights metrics from multiple application insights applications, tagging the values with the `app=AppName` key-value pair where AppName is the name of the Application Insights resource. The response will also be tagged by segments if any are requested.

* `metric` is the name of the metric you wish to query. A list of ["Default Metrics" is listed in the API Documentation](https://dev.applicationinsights.io/documentation/Using-the-API/Metrics). You can also use the `aimd()` function to see what metrics are available.
* `segmentsCSV` is a comma-separated list of "segments" that you want the response to group by. For example, with the default metric `requests/count` you might have `client/countryOrRegion,cloud/roleInstance`. You can also use the `aimd()` function to see what segments/dimensions are available.
* `filter` is an OData filter that can be used to refine results. See more information below.
* `apps` is a list of Azure applications to query, as returned by `aiapp()` or `aiappf()`.
* `agType` is the aggregation type to use. Common values are `avg`, `min`, `max`, `sum`, or `count`. If the aggregation type is not available, the error will indicate what types are. You can use the `aimd()` function to see what aggregations are available.
* `interval` is the Azure timegrain to use without "PT" and in lower case (ISO 8601 duration format). Common supported timegrains are `1m`, `5m`, `15m`, `30m`, `1h`, `6h`, `12h`, and `1d`. If empty the value will be `1m`.
* `startDuration` and `endDuration` set the time window from now - see the OpenTSDB q() function for more details.

Regarding the `filter` argument, it seems [Azure's documentation](https://dev.applicationinsights.io/reference) is not clear on supported OData operations. That being said, here are some observations:

* `startswith` and `contains` are valid string operations in the filter.
* You can *not* do negated matches. The API will accept them but they seem to have no impact. See this [Azure Feedback Issue](https://feedback.azure.com/forums/357324-application-insights/suggestions/7924191--not-filters-in-application-insights).
* You can filter on dimensions/segments that are relevant to the metric, but were not requested as part of `segmentsCSV`.

These requests are subject to a different [rate limit](https://dev.applicationinsights.io/documentation/Authorization/Rate-limits):

> Using Azure Active Directory for authentication, throttling rules are applied per AAD client user. Each AAD user is able to make up to 200 requests per 30 seconds, with no cap on the total calls per day.

An HTTP request is made per application. Unlike `azmulti()`, these requests are *serial* and not parallelized since the rate limit is of a relatively short duration (30 seconds). That means you can expect this query to be slow relative to the number of applications you are querying.
Example:

```
$selectedApps = aiappf(aiapp(), "environment:prd")
$filter = "startswith(operation/name, 'POST')"
ai("requests/duration", "cloud/roleInstance", $filter, $selectedApps, "avg", "1h", "3d", "")
```

### aimd(apps azureAIApps) Info
{: .exprFunc}

aimd (Application Insights Metadata) returns metrics and their related aggregations and dimensions/segments per application. The list of applications should be provided with `aiapp()` or `aiappf()`. For most use cases, filtering to a single app is ideal since the metadata object for each application is generally fairly large.

This is not meant to be used in the normal expression workflow (e.g. *not* for alerting or templates), but rather exists so that, in Bosun's expression editor UI, you can get a list of what can be queried with the `ai()` function.

## Graphite Query Functions

### graphite(query string, startDuration string, endDuration string, format string) seriesSet
{: .exprFunc}

Performs a graphite query. The duration format is the internal Bosun format (which happens to be the same as OpenTSDB's format).
Functions pretty much the same as q() (see that for more info) but for graphite.
The format string lets you annotate how to parse series as returned by graphite, so as to yield tags in the format that Bosun expects.
The tags are dot-separated and the number of "nodes" (dot-separated words) should match what graphite returns.
Irrelevant nodes can be left empty.

For example:

`groupByNode(collectd.*.cpu.*.cpu.idle,1,'avg')`

returns series named like `host1`, `host2`, etc., in which case the format string can simply be `host`.

`collectd.web15.cpu.*.cpu.*`

returns series named like `collectd.web15.cpu.3.idle`, requiring a format like `.host..core..cpu_type`.

For advanced cases, you can use graphite's alias(), aliasSub(), etc. to compose the exact parseable output format you need.
This happens when the outer graphite function is something like "avg()" or "sum()", in which case graphite's output series will be identified as "avg(some.string.here)".

### graphiteBand(query string, duration string, period string, format string, num string) seriesSet
{: .exprFunc}

Like band() but for graphite queries.

## InfluxDB Query Functions

### influx(db string, query string, startDuration string, endDuration, groupByInterval string) seriesSet
{: .exprFunc}

Queries InfluxDB.

All tags returned by InfluxDB will be included in the results.

* `db` is the database name in InfluxDB
* `query` is an InfluxDB select statement
  NB: WHERE clauses for `time` are inserted automatically, and it is thus an error to specify `time` conditions in the query.
* `startDuration` and `endDuration` set the time window from now - see the OpenTSDB q() function for more details.
  They will be merged into the existing WHERE clause in the `query`.
* `groupByInterval` is the `time.Duration` window which will be passed as an argument to a GROUP BY time() clause if given. This groups (or, in OpenTSDB lingo, "downsamples") values into buckets of the given duration. [Full documentation on Group by](https://influxdb.com/docs/v0.9/query_language/data_exploration.html#group-by).

### Notes:

* By default, queries will be given a suffix of `fill(none)` to filter out any nil rows.
* Influx queries themselves often use both double and single quotes (quoting issues are often encountered, [as per the documentation](https://docs.influxdata.com/influxdb/v0.13/troubleshooting/frequently_encountered_issues/#single-quoting-and-double-quoting-in-queries)). So you will likely need to use triple single quotes (`'''`) for many queries. When using single quotes inside triple single quotes, you may need a space. So for example `'''select max(value) from "my.measurement" where key = 'val''''` is not valid but `'''select max(value) from "my.measurement" where key = 'val' '''` is.

### Examples:

These Influx and OpenTSDB queries should give roughly the same results:

```
influx("db", '''SELECT non_negative_derivative(mean(value)) FROM "os.cpu" GROUP BY host''', "30m", "", "2m")

q("sum:2m-avg:rate{counter,,1}:os.cpu{host=*}", "30m", "")
```

Querying graphite sent to influx (note the quoting):

```
influx("graphite", '''select sum(value) from "df-root_df_complex-free" where env='prod' and node='web' ''', "2h", "1m", "1m")
```

## Elastic Query Functions

Elastic replaces the deprecated logstash (ls) functions. It only works with Elastic v2+. It is meant to be able to work with any Elastic documents that have a time field, not just logstash. It introduces two new types to allow for greater flexibility in querying. The ESIndexer type generates index names to query (based on the date range). There are now different functions to generate indexers for people with different configurations. The ESQuery type generates elastic queries so you can filter your results. By making these new types, new Indexers and Elastic queries can be added over time.

You can view the generated JSON for queries on the expr page by bringing up miniprofiler with Alt-P.

### PrefixKey
PrefixKey is a quoted string used to query different Elastic clusters and can be passed as a prefix to the Elastic query functions mentioned below. If not used, the query will be made on the [default](system_configuration#elasticconfdefault) cluster.

Querying the [foo](system_configuration#example-2) cluster:

```
$index = esindices("timestamp", "errors")
$filter = esquery("nginx", "POST")
crit = max(["foo"]escount($index, "host", $filter, "1h", "30m", "")) > 2
```

### escount(indexRoot ESIndexer, keyString string, filter ESQuery, bucketDuration string, startDuration string, endDuration string) seriesSet
{: .exprFunc}

escount returns a time-bucketed count of matching documents. It uses the keyString, indexRoot, interval, and durations to create an [elastic Date Histogram Aggregation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html).

* `indexRoot` will always be a function that returns an ESIndexer, such as `esdaily`.
* `keyString` is a comma-separated list of fields. The fields will become tag keys, and the values returned for fields become the corresponding tag values. For example `host,errorCode`. If an empty string is given, then the result set will have a single series and will have an empty tagset `{}`. These keys become terms filters for the date histogram.
* `filter` will be a function that returns an ESQuery. The queries further refine the results. The fields you filter on can match the fields in the keyString, but don't have to. If you don't want to filter your results, use `esall()` here.
* `bucketDuration` is an OpenTSDB duration string. It sets the span of time to bucket the count of documents. For example, "1m" will give you the count of documents per minute.
* `startDuration` and `endDuration` set the time window from now - see the OpenTSDB q() function for more details.

### esstat(indexRoot ESIndexer, keyString string, filter ESQuery, field string, rStat string, bucketDuration string, startDuration string, endDuration string) seriesSet
{: .exprFunc}

esstat returns various summary stats per bucket for the specified `field`. The field must be numeric in Elastic. rStat can be one of `avg`, `min`, `max`, `sum`, `sum_of_squares`, `variance`, `std_deviation`. The rest of the arguments behave the same as in escount.

## Elastic Index Functions

### esdaily(timeField string, indexRoot string, layout string) ESIndexer
{: .exprFunc}

esdaily is for Elastic indices that have a dated name for each day. It uses the timeframe of the enclosing es function (i.e. esstat or escount) to generate which indices should be included in the query. It gets all indices and won't include indices that don't exist. The layout specifier uses [Go's time specification format](https://golang.org/pkg/time/#Parse). The timeField is the name of the field in Elastic that contains timestamps for the documents.

### esmonthly(timeField string, indexRoot string, layout string) ESIndexer
{: .exprFunc}

esmonthly is like esdaily except that it is for monthly indices. It expects index names to be dated the first of every month.

### esindices(timeField string, index string...) ESIndexer
{: .exprFunc}

esindices takes one or more literal indices for the enclosing query to use. It does not check for the existence of the index, and passes back the Elastic error if the index does not exist. The timeField is the name of the field in Elastic that contains timestamps for the documents.

### esls(indexRoot string) ESIndexer
{: .exprFunc}

esls is a shortcut for `esdaily("@timestamp", indexRoot+"-", "2006.01.02")` and is for the default daily format that logstash creates.

## Elastic Query Generating Functions (for filtering)

### esall() ESQuery
{: .exprFunc}

esall returns an elastic match-all query; use this when you don't want to filter any documents.

### esregexp(field string, regexp string) ESQuery
{: .exprFunc}

esregexp creates an [elastic regexp query](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-regexp-query.html) for the specified field.

### esquery(field string, querystring string) ESQuery
{: .exprFunc}

esquery creates a [full-text elastic query string query](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-query-string-query.html).

### esand(queries.. ESQuery) ESQuery
{: .exprFunc}

esand takes one or more ESQueries and combines them into an [elastic bool query](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-bool-query.html) where all the queries "must" be true.

### esor(queries.. ESQuery) ESQuery
{: .exprFunc}

esor takes one or more ESQueries and combines them into an [elastic bool query](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-bool-query.html) so that at least one must be true.
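As an illustrative sketch (the index root and field names here are hypothetical), these query generators can be nested and passed to a function like `escount`:

```
$index = esls("app-logs")
# Documents whose logger field matches a regexp AND whose status field starts with 5
$filter = esand(esregexp("logger", "auth.*"), esquery("status", "5*"))
# Per-host count of matching documents in 10 minute buckets over the last hour
max(escount($index, "host", $filter, "10m", "1h", "")) > 100
```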
### esnot(query ESQuery) ESQuery
{: .exprFunc}

esnot takes a query and inverts the logic using must_not from an [elastic bool query](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-bool-query.html).

### esexists(field string) ESQuery
{: .exprFunc}

esexists is true when the specified field exists.

### esgt(field string, value Scalar) ESQuery
{: .exprFunc}

esgt takes a field (expected to be a numeric field in Elastic) and returns results where the value of that field is greater than the specified value. It creates an [elastic range query](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-range-query.html).

### esgte(field string, value Scalar) ESQuery
{: .exprFunc}

esgte takes a field (expected to be a numeric field in Elastic) and returns results where the value of that field is greater than or equal to the specified value. It creates an [elastic range query](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-range-query.html).

### eslt(field string, value Scalar) ESQuery
{: .exprFunc}

eslt takes a field (expected to be a numeric field in Elastic) and returns results where the value of that field is less than the specified value. It creates an [elastic range query](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-range-query.html).

### eslte(field string, value Scalar) ESQuery
{: .exprFunc}

eslte takes a field (expected to be a numeric field in Elastic) and returns results where the value of that field is less than or equal to the specified value. It creates an [elastic range query](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-range-query.html).


## OpenTSDB Query Functions

Query functions take a query string (like `sum:os.cpu{host=*}`) and return a seriesSet.

### q(query string, startDuration string, endDuration string) seriesSet
{: .exprFunc}

Generic query from endDuration to startDuration ago. If endDuration is the empty string (`""`), now is used. Supported duration units are listed in [the docs](http://opentsdb.net/docs/build/html/user_guide/query/dates.html). Refer to [the docs](http://opentsdb.net/docs/build/html/user_guide/query/index.html) for query syntax. The query argument is the value part of the `m=...` expressions. `*` and `|` are fully supported. In addition, queries like `sys.cpu.user{host=ny-*}` are supported. These are performed by an additional step which determines valid matches, and replaces `ny-*` with `ny-web01|ny-web02|...|ny-web10` to achieve the same result. This lookup is kept in memory by the system and does not incur any additional OpenTSDB API requests, but does require scollector instances pointed to the Bosun server.

### band(query string, duration string, period string, num scalar) seriesSet
{: .exprFunc}

Band performs `num` queries of `duration` each, `period` apart and concatenates them together, starting `period` ago. So `band("avg:os.cpu", "1h", "1d", 7)` will return a series comprising the given metric from 1d to 1d-1h-ago, 2d to 2d-1h-ago, etc., until 8d. This is a good way to get a time block from a certain hour of a day or certain day of a week over a long time period.
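As a sketch (the metric and host are illustrative), the band result can be fed to Bosun's reduction functions to compare the most recent hour against the same hour on previous days:

```
# Same one-hour window sampled over each of the previous 7 days
$history = band("avg:rate:os.cpu{host=ny-web01}", "1h", "1d", 7)
# The most recent hour
$current = avg(q("avg:rate:os.cpu{host=ny-web01}", "1h", ""))
# True when the current hour is well above its recent history
$current > avg($history) + 3 * dev($history)
```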
Note: this function wraps a more general version `bandQuery(query string, duration string, period string, eduration string, num scalar) seriesSet`, where `eduration` specifies the end duration for the query to stop at, as with `q()`.

### over(query string, duration string, period string, num scalar) seriesSet
{: .exprFunc}

Over's arguments behave the same way as band. However, over shifts the time of previous periods to be now, tags them with the duration that each period was shifted, and merges those shifted periods into a single seriesSet, which includes the most recent period. This is useful for displaying time-over-time graphs. For example, the same day week over week would be `over("avg:1h-avg:rate:os.cpu{host=ny-bosun01}", "1d", "1w", 4)`.

Note: this function wraps a more general version `overQuery(query string, duration string, period string, eduration string, num scalar) seriesSet`, where `eduration` specifies the end duration for the query to stop at, as with `q`. Results are still shifted to end at the current time.

### shiftBand(query string, duration string, period string, num scalar) seriesSet
{: .exprFunc}

shiftBand's behaviour is very similar to `over`; however, the most recent period is not included in the seriesSet. This function could be useful for anomaly detection when used with `aggr`, to calculate historical distributions to compare against.

### change(query string, startDuration string, endDuration string) numberSet
{: .exprFunc}

Change is a way to determine the change of a query from startDuration to endDuration. If endDuration is the empty string (`""`), now is used. The query must either be a rate or a counter converted to a rate with the `agg:rate:metric` flag.

For example, assume you have a metric `net.bytes` that records the number of bytes that have been sent on some interface since boot. We could just subtract the end number from the start number, but if a reboot or counter rollover occurred during that time our result will be incorrect. Instead, we ask OpenTSDB to convert our metric to a rate and handle all of that for us. So, to get the number of bytes in the last hour, we could use:

`change("avg:rate:net.bytes", "60m", "")`

Note that this is implemented using Bosun's `avg` function. The following is exactly the same as the above example:

`avg(q("avg:rate:net.bytes", "60m", "")) * 60 * 60`

### count(query string, startDuration string, endDuration string) scalar
{: .exprFunc}

Count returns the number of groups in the query as an ungrouped scalar.

### window(query string, duration string, period string, num scalar, funcName string) seriesSet
{: .exprFunc}

Window performs `num` queries of `duration` each, `period` apart, starting `period` ago. The results of the queries are run through `funcName`, which must be a reduction function taking only one argument (that is, a function that takes a series and returns a number), and then a series is made from those numbers. So `window("avg:os.cpu{host=*}", "1h", "1d", 7, "dev")` will return a series comprising the standard deviation of the given metric from 1d to 1d-1h-ago, 2d to 2d-1h-ago, etc., until 8d. It is similar to the band function, except that instead of concatenating series together, each series is reduced to a number, and those numbers are made into a series.
In addition to supporting Bosun's reduction functions that take one argument, percentile operations may be done by setting `funcName` to `p` followed by a number that is between 0 and 1 (inclusive). For example, `"p.25"` will be the 25th percentile, `"p.999"` will be the 99.9th percentile. `"p0"` and `"p1"` are min and max respectively (however, in these cases it is recommended to use `"min"` and `"max"` for the sake of clarity).

## Prometheus Query Functions

Prometheus query functions query Prometheus TSDB(s) using the [Prometheus HTTP v1 API](https://prometheus.io/docs/prometheus/latest/querying/api/). When [`PromConf` in the system configuration](/system_configuration#promconf) is added, these functions become available.

There are currently two types of functions: functions that return time series sets (seriesSet) and information functions that are meant to be used interactively in the expression editor for information about metrics and tags.

### PrefixKey
The PrefixKey is a quoted string used to query different Prometheus backends in [`PromConf` in the system configuration](/system_configuration#promconf). If the PrefixKey is missing (there are no brackets before the function), then "default" is used. For example, the prefix in the following is `["it"]`:

```
["it"]prom("up", "namespace", "", "sum", "5m", "1h", "")
```

In the case of `promm` and `promratem`, the prefix may have multiple keys separated by commas to allow for querying multiple Prometheus datasources at once, for example:

```
["it,default"]promm("up", "namespace", "", "sum", "5m", "1h", "")
```

### Series Removed from Responses

When a Prometheus query is made, all time series in the response do not have to have the same set of tag keys. For example, when making a PromQL request that groups `by (host,interface)`, results may be included in the response that contain only `host`, only `interface`, or no tag keys at all. Bosun requires that the tag keys be consistent for each series within a seriesSet. Therefore, these results are removed from the responses when using functions like `prom`, `promrate`, `promm`, and `promratem`.

<div class="admonition">
<p class="admonition-title">Note</p>
<p>This behavior may change in the future to an alternative design. Instead of dropping these series, the series could be retained but the missing tag keys would be added to the response with some sort of value to represent that the tag is missing.</p>
</div>

### prom(metric, groupByTags, filter, agType, stepDuration, startDuration, endDuration string) seriesSet
{: .exprFunc}

prom queries a Prometheus TSDB for time series data. It accomplishes this by generating a PromQL query from the given arguments.

* `metric` is the name of the metric to query. To get a list of available metrics use the `prommetrics()` function.
* `groupByTags` is a comma-separated list of tag keys to aggregate the response by.
* `filter` filters the results using [Prometheus Time Series Selectors](https://prometheus.io/docs/prometheus/latest/querying/basics/#time-series-selectors). This functions analogously to a `WHERE` clause in SQL. For example: `job=~".*",method="get"`. Operators are `=`, `!=`, `=~`, and `!~` for equals, not equals, [RE2](https://github.com/google/re2/wiki/Syntax) match, and not RE2 match respectively. This string is inserted into the generated PromQL query directly.
* `agType` is the aggregation function to perform such as `"sum"` or `"avg"`. It can be any [Prometheus Aggregation operator](https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators).
* `stepDuration` is Prometheus's evaluation step duration. This is like downsampling, except that it takes the datapoint that is most recently before (or matching) the step based on the start time. If there are no samples in that duration, the sample will be repeated. See [Prometheus Docs Issue #699](https://github.com/prometheus/docs/issues/699).
* `startDuration` and `endDuration` determine the start and end time based on the current time (or currently selected time in the expression/rule editor). They are then used to send an absolute time range for the Prometheus request.

Example:

```
$metric = "up"
$groupByTags = "namespace"
$filter = ''' service !~ "kubl.*" '''
$agg = "sum"
$step = "1m"

prom($metric, $groupByTags, $filter, $agg, $step, "1h", "")
```

The above example would generate a PromQL query of `sum( up { service !~ "kubl.*" } ) by ( namespace )`. The time range and step are sent via HTTP query parameters.

### promrate(metric, groupByTags, filter, agType, rateStepDuration, stepDuration, startDuration, endDuration string) seriesSet
{: .exprFunc}

promrate is like the `prom` function, except that it is for per-second rate calculations on metrics that are counters. It therefore includes the extra `rateStepDuration` argument, which is for calculating the step of the rate calculation. The `stepDuration` is then for the step of the aggregation operation that is on top of the calculated rate. This is performed using [the `rate()` function in PromQL](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate).

Example:

```
$metric = "container_memory_working_set_bytes"
$groupByTags = "container_name,namespace"
$filter = ''' container_name !~ "pvc-.*$" '''
$agg = "sum"
$rateStep = "1m"
$step = "5m"

promrate($metric, $groupByTags, $filter, $agg, $rateStep, $step, "1h", "")
```

The above example would generate a PromQL query of `sum(rate( container_memory_working_set_bytes { container_name !~ "pvc-.*$" } [1m] )) by ( container_name,namespace )`. The time range and step are sent via HTTP query parameters.

### promm(metric, groupByTags, filter, agType, stepDuration, startDuration, endDuration string) seriesSet
{: .exprFunc}

promm (Prometheus Multiple) is like the `prom` function, except that it queries multiple Prometheus TSDBs and combines the result into a single seriesSet. A tag key of `bosun_prefix` with the tag value set to the prefix is added to the results to ensure that series are unique in the result.

Example:

```
$metric = "container_memory_working_set_bytes"
$groupByTags = "container_name,namespace"
$filter = ''' container_name !~ "pvc-.*$" '''
$agg = "sum"
$step = "5m"

$q = ["it,default"]promm($metric, $groupByTags, $filter, $agg, $step, "1h", "")
max($q)

# You could use the aggr function to aggregate across clusters if you like
# aggr($q, $groupByTags, $agg)
```

In the above example `$q` will be a seriesSet with the tag keys of `container_name`, `namespace`, and `bosun_prefix`. The values for the `bosun_prefix` key will be either `it` or `default` for each series in the set.
### promratem(metric, groupByTags, filter, agType, rateStepDuration, stepDuration, startDuration, endDuration string) seriesSet
{: .exprFunc}

promratem (Prometheus Rate Multiple) is to `promrate` what the `promm` function is to the `prom` function. It allows you to do a per-second rate query against multiple Prometheus TSDBs and combines the result into a single seriesSet, adding the `bosun_prefix` tag key to the result. It behaves the same as the `promm` function, but like `promrate`, it has the extra `rateStepDuration` argument.

### promras(promql, stepDuration, startDuration, endDuration string) seriesSet
{: .exprFunc}

Instead of building a PromQL query like the `prom` and `promrate` functions, promras (Prometheus Raw Aggregate Series) allows you to query Prometheus using PromQL with some restrictions:

1. The query must return a time series (a Prometheus matrix)
2. The top level function in the PromQL must be a [Prometheus Aggregation Operator](https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators) with a `by` clause.

Example:

```
promras(''' sum(rate(container_fs_reads_total[1m]) + rate(container_fs_writes_total[1m])) by (namespace) ''', "2m", "2h", "")
```

### prommras(promql, stepDuration, startDuration, endDuration string) seriesSet
{: .exprFunc}

prommras (Prometheus Multiple Raw Aggregate Series) is like the `promras` function except that it queries multiple Prometheus instances and adds the "bosun_prefix" tag to the results like the `promm` and `promratem` functions.

Example:

```
# You can still use string interpolation of $variables in promras and prommras
$step = 1m
$reads = container_fs_reads_total[$step]
$writes = container_fs_writes_total[$step]
["default,it"]prommras(''' sum(rate($reads) + rate($writes)) by (namespace) ''', "2m", "2h", "")
```

### prommetrics() Info
{: .exprFunc}

prommetrics returns a list of metrics that are available in the Prometheus TSDB. This is not meant to be used in alerting; it is for use in the expression editor for getting information to build queries. For example, you might open up another expression tab in Bosun and use the output as a reference. This function supports a prefix, so examples would be `prommetrics()` and `["it"]prommetrics()`.

It gets the list of metrics by using the [Prometheus Label Values HTTP API](https://prometheus.io/docs/prometheus/latest/querying/api/#querying-label-values) to get the values of the `__name__` label.

### promtags(metric string, endDuration string, startDuration string) Info
{: .exprFunc}

promtags returns various tag information for the metric ("tag" ~= "Label" in Prometheus terminology). It does a raw query (querying the metric only) for the provided duration and returns the tag information for the metric in that given time period. This is not meant to be used in alerting; it is for use in the expression editor for getting information to build queries.

The result has the following properties:

* Metric: The name of the metric
* Keys: A list of the tag keys available for the metric
* KeysToValues: A map/dictionary of tag keys to a list of their unique values
* UniqueSets: A list of unique tag key/value combination pairs that represent complete series

Examples: `promtags("up", "10", "")`, `["it"]promtags("container_memory_working_set_bytes")`.
## CloudWatch Query Functions (Beta)
These functions are available when CloudWatch is enabled via Bosun's configuration.
Query syntax is potentially subject to change in later releases.

### cw(region, namespace, metric, period, statistic, dimensions, startDuration, endDuration string) seriesSet
{: .exprFunc}

The parameters are as follows:

* `region` The Amazon region(s) for the service metrics you are interested in, e.g. `eu-west-1,eu-central-1`
* `namespace` The [CloudWatch namespace](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/aws-namespaces.html) which the metric you want to query exists under, e.g. `AWS/S3`
* `metric` The [CloudWatch metric](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CW_Support_For_AWS.html) you wish to query, e.g. `NumberOfObjects`
* `dimensions` A string containing dimension key-value pairs separated by `:`
* `period` Size of the bucket to use for grouping data points, expressed as a time string, e.g. `1m`
* `statistic` Which [aggregator](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html#Statistic) to use to combine the datapoints in each bucket, e.g. `Sum`
* `startDuration` and `endDuration` set the time window from now - see the OpenTSDB q() function for more details

A complete example returning the count of infrequent access objects in our S3 bucket over the last hour:
```
$region = "eu-west-1"
$namespace = "AWS/S3"
$metric = "NumberOfObjects"
$period = "1m"
$statistics = "Average"
$dimensions = "BucketName:my-s3-bucket,StorageType:STANDARD_IA"
$objectCount = cw($region, $namespace, $metric, $period, $statistics, $dimensions, "1h", "")
```

You can use * as a wildcard character in dimensions to match multiple series:
```
$region = "eu-west-1,eu-central-1"
$namespace = "AWS/ELB"
$metric = "HealthyHostCount"
$period = "5m"
$statistics = "Minimum"
$dimensions = "LoadBalancerName:web-*,AvailabilityZone:*"
$healthyHosts = cw($region, $namespace, $metric, $period, $statistics, $dimensions, "7d", "")
```


### PrefixKey
PrefixKey is a quoted string used to query different AWS accounts by passing the name of the profile from the Amazon credentials file. If omitted, the query will be made using the default credentials chain.

Credentials file example:
```
[prod]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

[test]
aws_access_key_id=BKyfyfIAIDNN7EXAMPLE
aws_secret_access_key=Ays6tnFEMI/ASD7D6/bPxRfiCYEXAMPLEKEY
```

Example of querying using multiple accounts:
```
$region = "eu-west-1"
$namespace = "AWS/EC2"
$metric = "CPUUtilization"
$period = "1m"
$statistics = "Average"

$prodDim = "InstanceId:i-1234567890abcdef0"
$testDim = "InstanceId:i-0598c7d356eba48d7"

$p = ["prod"]cw($region, $namespace, $metric, $period, $statistics, $prodDim, "1h", "")
$t = ["test"]cw($region, $namespace, $metric, $period, $statistics, $testDim, "1h", "")
```



# Annotation Query Functions
These functions are available when annotate is enabled via Bosun's configuration.

## Annotation Filters
For the following annotation functions, `filter` is a string with the following specification.

Items in a filter are in the format `keyword:value`.
The value is either a glob pattern or literal string to match, or the reserved word `empty` which means that the value of the field is an empty string.

Possible keywords are: `owner`, `user`, `host`, `category`, `url`, and `message`.

All items can be combined in boolean logic by using parenthesis grouping, `!` as not, `AND` as logical and, and `OR` as logical or.

For example, `"owner:sre AND ! user:empty"` would show things that belong to sre, and have a username specified. When annotations are created by a process, we don't specify a user.

## antable(filter string, fieldsCSV string, startDuration, endDuration) Table
Antable is meant for showing annotations in a Grafana table, where Grafana's "To Table Transform" under options is set to type "Table".

See Annotation Filters above to understand filters. FieldsCSV is a list of columns to display in the table. They can be in any order. The possible columns you can include are: `start`, `end`, `owner`, `user`, `host`, `category`, `url`, `link`, `message`, `duration`. At least one column must be specified.

`link` is unlike the others in that it actually returns the HTML to construct a link, whereas `url` is the text of the link. This is so that, when using a Grafana table and Grafana v3.1.1 or later, you can have a link in a table as long as you enable sanitize HTML within the Grafana Column Styles.

For example: `antable("owner:sre AND category:outage", "start,end,user,owner,category,message", "8w", "")` will return a table of annotations with the selected columns in fieldsCSV going back 8 weeks from the time of the query.

## ancounts(filter string, startDuration string, endDuration string) seriesSet
{: .exprFunc}
ancounts returns a series representing the number of annotations that matched the filter for the specified period. One might expect a number instead of a series, but having a series has a useful property: we can count outages that spanned across the requested time frame as fractional outages.

If an annotation's timespan is contained entirely within the request timespan, or the timespan of the request is within the timespan of the annotation, a 1 is added to the series.

If an annotation either starts before the requested start time, or ends after the requested end time, then it is counted as a fractional outage (assuming the annotation ended or started, respectively, within the requested time frame).

If there are no annotations within the requested time period, then the value `NaN` will be returned.

For example:

The following request is made at `2016-09-21 14:49:00`.

```
$filter = "owner:sre AND category:outage"
$back = "1n"
$count = ancounts($filter, $back, "")
# TimeFrame of the Fractional annotation: "2016-09-21T14:47:56Z", "2016-09-21T14:50:53Z" (Duration: 2m56 sec)
$count
```

Returns:
```
{
  "0": 1,
  "1": 1,
  "2": 0.3615819209039548
}
```

The float value means that about 36% of the annotation fell within the requested time frame. One can get the sum of these by doing `sum($count)` (result `2.36...`) to get the fractional sum, or `len($count)` (result `3`) to get the count.

Note: The index values above, 0, 1, and 2 are disregarded and are just there so we can use the same underlying type as a time series.
## andurations(filter string, startDuration, endDuration string) seriesSet
{: .exprFunc}

andurations behaves in a similar way to ancounts. The difference is that the values returned will be the duration of each annotation in seconds.

If the annotation spans only part of the requested time frame, only the portion of the annotation's duration that falls within the time range will be returned as the value for that annotation. If the annotation starts before the request and ends after the request, the duration of the request timeframe will be returned.

If there are no annotations within the requested time period, then the value `NaN` will be returned.

For example, an identical query to the example in ancounts but using andurations instead:

```
$filter = "owner:sre AND category:outage"
$back = "1n"
$durations = andurations($filter, $back, "")
# TimeFrame of the Fractional Outage: "2016-09-21T14:47:56Z", "2016-09-21T14:50:53Z",
$durations
```

Returns:

```
{
  "0": 402,
  "1": 758,
  "2": 64
}
```


# Reduction Functions

All reduction functions take a seriesSet and return a numberSet with one element per unique group.

## avg(seriesSet) numberSet
{: .exprFunc}

Average (arithmetic mean).

## cCount(seriesSet) numberSet
{: .exprFunc}

Returns the change count, which is the number of times in the series a value was not equal to the immediately previous value. Useful for checking if things that should be at a steady value are "flapping". For example, a series with values [0, 1, 0, 1] would return 3.

## dev(seriesSet) numberSet
{: .exprFunc}

Standard deviation.

## diff(seriesSet) numberSet
{: .exprFunc}

Diff returns the last point of each series minus the first point.

## first(seriesSet) numberSet
{: .exprFunc}

Returns the first (least recent) data point in each series.

## forecastlr(seriesSet, y_val numberSet|scalar) numberSet
{: .exprFunc}

Returns the number of seconds until a linear regression of each series will reach y_val.

## linelr(seriesSet, d Duration) seriesSet
{: .exprFunc}

Linelr returns the linear regression line from the end of each series to end+duration (an [OpenTSDB duration string](http://opentsdb.net/docs/build/html/user_guide/query/dates.html)). It adds `regression=line` to the group/tagset. It is meant for graphing with expressions, for example:

```
$d = "1w"
$q = q("avg:1h-avg:os.disk.fs.percent_free{}{host=ny-tsdb*,disk=/mnt*}", "2w", "")
$line = linelr($q, "3n")
$m = merge($q, $line)
$m
```

## last(seriesSet) numberSet
{: .exprFunc}

Returns the last (most recent) data point in each series.

## len(seriesSet) numberSet
{: .exprFunc}

Returns the length of each series.

## max(seriesSet) numberSet
{: .exprFunc}

Returns the maximum value of each series, same as calling percentile(series, 1).

## median(seriesSet) numberSet
{: .exprFunc}

Returns the median value of each series, same as calling percentile(series, .5).

## min(seriesSet) numberSet
{: .exprFunc}

Returns the minimum value of each series, same as calling percentile(series, 0).

## percentile(seriesSet, p numberSet|scalar) numberSet
{: .exprFunc}

Returns the value from each series at the percentile p.
## since(seriesSet) numberSet
{: .exprFunc}

Returns the number of seconds since the most recent data point in each series.
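A common use is a staleness check; a small sketch (the metric and threshold are illustrative):

```
$q = q("avg:os.mem.used{host=*-web*}", "1h", "")
# seconds since the last datapoint for each host
$staleness = since($q)
# true when a host has not reported for more than 15 minutes
$staleness > d("15m")
```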
## streak(seriesSet) numberSet
{: .exprFunc}

Returns the length of the longest streak of values that evaluate to true for each series in the set (i.e. the maximum number of contiguous non-zero values found). A single true value in the series returns 1.

This is useful to create an expression that is true if a certain number of consecutive observations exceeded a threshold, as in the following example:

```
$seriesA = series("host=server01", 0,0, 60,35, 120,35, 180,35, 240,5)
$seriesB = series("host=server02", 0,0, 60,35, 120, 5, 180, 5, 240,5)
$sSet = merge($seriesA, $seriesB)
$isAbove = $sSet > 30
$consecutiveCount = streak($isAbove)
# $consecutiveCount: a numberSet where server01 has a value of 3, server02 has a value of 1
# Are there 3 or more adjacent/consecutive/contiguous observations greater than 30?
$consecutiveCount >= 3
```

## sum(seriesSet) numberSet
{: .exprFunc}

Sum returns the sum (a.k.a. "total") for each series in the set.

# Aggregation Functions

Aggregation functions take a seriesSet, and return a new seriesSet.

## aggr(series seriesSet, groups string, aggregator string) seriesSet
{: .exprFunc}

Takes a seriesSet and combines it into a new seriesSet with the groups specified, using an aggregator to merge any series that share the matching group values. If the groups argument is an empty string, all series are combined into a single series, regardless of existing groups.

The available aggregator functions are: `"avg"` (average), `"min"` (minimum), `"max"` (maximum), `"sum"`, and `"pN"` (percentile), where N is a floating point number between 0 and 1 inclusive. For example, `"p.25"` is the 25th percentile and `"p.999"` is the 99.9th percentile. `"p0"` and `"p1"` are min and max respectively (however, in these cases it is recommended to use `"min"` and `"max"` for the sake of clarity).

The aggr function can be particularly useful for removing anomalies when comparing timeseries over periods using the over function.

Example:

```
$weeks = over("sum:1m-avg:os.cpu{region=*,color=*}", "24h", "1w", 3)
$agg = aggr($weeks, "region,color", "p.50")
```

The above example uses `over` to load a 24 hour period over the past 3 weeks. We then use the aggr function to combine the three weeks into one, selecting the median (`p.50`) value of the 3 weeks at each timestamp. This results in a new seriesSet, grouped by region and color, that represents a "normal" 24 hour period with anomalies removed.

An error will be returned if a group that does not exist in the original seriesSet is specified to aggregate on.

The aggr function expects points in the original series to be aligned by timestamp. If points are not aligned, they are aggregated separately. For example, if we had a seriesSet,

Group | Timestamp | Value |
----------- | --------- | ----- |
{host=web01} | 1 | 1 |
{host=web01} | 2 | 7 |
{host=web01} | 1 | 4 |

and applied the following aggregation:

```
aggr($series, "host", "max")
```

we would receive the following aggregated result:

Group | Timestamp | Value | Timestamp | Value |
----------- | --------- | ----- | --------- | ----- |
{host=web01} | 1 | 4 | 2 | 7 |

aggr also does not attempt to deal with NaN values in a consistent manner. If all values for a specific timestamp are NaN, the result for that timestamp will be NaN. If a particular timestamp has a mix of NaN and non-NaN values, the result may or may not be NaN, depending on the aggregation function specified.

# Group Functions

Group functions modify the OpenTSDB groups.

## addtags(set variantSet, group string) (seriesSet|numberSet)
{: .exprFunc}

Accepts a set and tags to add to it in `Key1=NewK1,Key2=NewK2` format. This is useful when you want to add series to a set with merge and have tag collisions.

## rename(variantSet, string) (seriesSet|numberSet)
{: .exprFunc}

Accepts a set and tags to rename in `Key1=NewK1,Key2=NewK2` format. All data points will have the tag keys renamed according to the spec provided, in order. This can be useful for combining results from separate queries that have similar tagsets with different tag keys.
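As an illustration, a minimal sketch that lines up tag keys with rename and then uses addtags to avoid a collision when merging (the second metric and its `server` tag key are hypothetical):

```
$cpu = q("avg:rate:os.cpu{host=*bosun*}", "5m", "")
# hypothetical metric that tags the machine as server=* instead of host=*
$req = q("avg:rate:app.requests{server=*bosun*}", "5m", "")
# rename server to host so both sets use the same tag key,
# then add a distinguishing tag so the merged groups do not collide
merge(addtags($cpu, "metric=cpu"), addtags(rename($req, "server=host"), "metric=requests"))
```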
## remove(variantSet, string) (seriesSet|numberSet)
{: .exprFunc}

Accepts a tag key to remove from the set. The function will error if removing the tag key from the set would cause the resulting set to have a duplicate item in it.

## t(numberSet, group string) seriesSet
{: .exprFunc}

Transposes N series of length 1 to 1 series of length N. If the group parameter is not the empty string, one series is returned for each unique group over the given tag keys. This is useful for performing scalar aggregation across multiple results from a query. For example, to get the total memory used on the web tier: `sum(t(avg(q("avg:os.mem.used{host=*-web*}", "5m", "")), ""))`. See [Understanding the Transpose Function](/t) for more explanation.

How transpose works conceptually:

Transposing grouped results into a single result:

Before Transpose (Value Type is NumberSet):

Group | Value |
----------- | ----- |
{host=web01} | 1 |
{host=web02} | 7 |
{host=web03} | 4 |

After Transpose (Value Type is SeriesSet):

Group | Value |
----------- | ----- |
{} | 1,7,4 |

Transposing grouped results into multiple results:

Before Transpose by host (Value Type is NumberSet):

Group | Value |
----------- | ----- |
{host=web01,disk=c} | 1 |
{host=web01,disk=d} | 3 |
{host=web02,disk=c} | 4 |

After Transpose by "host" (Value Type is SeriesSet):

Group | Value |
------------ | ------ |
{host=web01} | 1,3 |
{host=web02} | 4 |

A useful example of transpose: alert if more than 25% of the servers in a group have ping timeouts.

```
alert or_down {
    $group = host=or-*
    # bosun.ping.timeout is 0 for no timeout, 1 for timeout
    $timeout = q("sum:bosun.ping.timeout{$group}", "5m", "")
    # $timeout will have multiple groups, such as or-web01, or-web02, or-web03.
    # each group has a series type (the observations in the past 5 minutes)
    # so we need to *reduce* each group's series into a single number:
    $max_timeout = max($timeout)
    # $max_timeout is now a group of results where the value of each group is a number. Since each
    # group is an alert instance, we need to regroup this into a single alert. We can do that by
    # transposing with t()
    $max_timeout_series = t($max_timeout, "")
    # $max_timeout_series is now a single group with a value of type series. We need to reduce
    # that series into a single number in order to trigger an alert.
    $number_down_servers = sum($max_timeout_series)
    $total_servers = len($max_timeout_series)
    $percent_down = ($number_down_servers / $total_servers) * 100
    warn = $percent_down > 25
}
```

Since our templates can reference any variable in this alert, we can show which servers are down in the notification, even though the alert only triggers when more than 25% of the or-\* servers are down.

## ungroup(numberSet) scalar
{: .exprFunc}

Returns the input with its group removed. Used to combine queries from two differing groups.

# Other Functions

## alert(name string, key string) numberSet
{: .exprFunc}

Executes and returns the `key` expression from alert `name` (which must be `warn` or `crit`). Any alert of the same name that is unknown or unevaluated is also returned with a value of `1`. Primarily for use with the [`depends` alert keyword](/definitions#depends).

Example: `alert("host.down", "crit")` returns the crit expression from the host.down alert.

## abs(variantSet) (seriesSet|numberSet)
{: .exprFunc}

Returns the absolute value of each value in the set.

## crop(series seriesSet, start numberSet, end numberSet) seriesSet
{: .exprFunc}

Returns a seriesSet where each series has datapoints removed if the datapoint is before start (from now, in seconds) or after end (also from now, in seconds). This is useful if you want to alert on different timespans for different items in a set, for example:

```
lookup test {
    entry host=ny-bosun01 {
        start = 30
    }
    entry host=* {
        start = 60
    }
}

alert test {
    template = test
    $q = q("avg:rate:os.cpu{host=ny-bosun*}", "5m", "")
    $c = crop($q, lookup("test", "start"), 0)
    crit = avg($c)
}
```

## d(string) scalar
{: .exprFunc}

Returns the number of seconds of the [OpenTSDB duration string](http://opentsdb.net/docs/build/html/user_guide/query/dates.html).

## tod(scalar) string
{: .exprFunc}

Returns an [OpenTSDB duration string](http://opentsdb.net/docs/build/html/user_guide/query/dates.html) that represents the given number of seconds. This lets you do math on durations and then pass the result to the duration arguments in functions like `q()`.
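For example, a small sketch that uses d and tod together to double a window before passing it back to q (the metric and base window are illustrative):

```
$window = "30m"
# d() converts the duration to seconds so we can do arithmetic on it,
# and tod() converts the doubled value back into a duration string for q()
$doubled = tod(d($window) * 2)
avg(q("avg:rate:os.cpu{host=*bosun*}", $doubled, ""))
```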
## des(series, alpha scalar, beta scalar) series
{: .exprFunc}

Returns the series smoothed using Holt-Winters double exponential smoothing. Alpha (scalar) is the data smoothing factor. Beta (scalar) is the trend smoothing factor.

## dropg(seriesSet, threshold numberSet|scalar) seriesSet
{: .exprFunc}

Removes any values greater than the threshold from a series. Will error if this operation results in an empty series.

## dropge(seriesSet, threshold numberSet|scalar) seriesSet
{: .exprFunc}

Removes any values greater than or equal to the threshold from a series. Will error if this operation results in an empty series.

## dropl(seriesSet, threshold numberSet|scalar) seriesSet
{: .exprFunc}

Removes any values lower than the threshold from a series. Will error if this operation results in an empty series.

## drople(seriesSet, threshold numberSet|scalar) seriesSet
{: .exprFunc}

Removes any values lower than or equal to the threshold from a series. Will error if this operation results in an empty series.

## dropna(seriesSet) seriesSet
{: .exprFunc}

Removes any NaN or Inf values from a series. Will error if this operation results in an empty series.

## dropbool(seriesSet, seriesSet) seriesSet
{: .exprFunc}

Drops datapoints where the corresponding value in the second seriesSet is zero (see Series Operations above for what corresponding means). The following example drops tr_avg (average response time per bucket) datapoints if the count in that bucket was more than 100 above or below the average count over the time period.

Example:

```
$count = q("sum:traffic.haproxy.route_tr_count{host=literal_or(ny-logsql01),route=Questions/Show}", "30m", "")
$avg = q("sum:traffic.haproxy.route_tr_avg{host=literal_or(ny-logsql01),route=Questions/Show}", "30m", "")
$avgCount = avg($count)
dropbool($avg, !($count < $avgCount-100 || $count > $avgCount+100))
```

## epoch() scalar
{: .exprFunc}

Returns the Unix epoch in seconds of the expression start time (scalar).

## filter(variantSet, numberSet) (seriesSet|numberSet)
{: .exprFunc}

Returns all results in variantSet that are a subset of numberSet and have a non-zero value. Useful with the limit and sort functions to return the top X results of a query.

## limit(set variantSet, count scalar) (seriesSet|numberSet)
{: .exprFunc}

Returns the first count (scalar) items of the set.
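As noted under filter, these functions combine well; a sketch of a "top 5 hosts above a CPU threshold" query (the metric and threshold are illustrative):

```
$cpu = avg(q("avg:rate:os.cpu{host=*}", "5m", ""))
# keep only hosts above 50, sort descending, and take the first 5
limit(sort(filter($cpu, $cpu > 50), "desc"), 5)
```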
## lookup(table string, key string) numberSet
{: .exprFunc}

Returns the first key from the given lookup table with matching tags. This searches the built-in index, so it only makes sense when using OpenTSDB and sending data to /index or relaying through Bosun.

Using the lookup function will set [unJoinedOk](/definitions#unjoinedok) to true for the alert.

## lookupSeries(series seriesSet, table string, key string) numberSet
{: .exprFunc}

Returns the first key from the given lookup table with matching tags. The first argument is a series from which to derive the tag information. This is good for alternative storage backends such as Graphite and InfluxDB.

Using the lookupSeries function will set [unJoinedOk](/definitions#unjoinedok) to true for the alert.

## map(series seriesSet, subExpr numberSetExpr) seriesSet
{: .exprFunc}

map applies the subExpr to each value in each series in the set. A special function `v()`, which is only available inside a numberSetExpr, gives you the value of each item in the series.

For example, you can do something like the following to get the absolute value of each item in the series (since the normal `abs()` function works on numbers, not series):

```
$q = q("avg:rate:os.cpu{host=*bosun*}", "5m", "")
map($q, expr(abs(v())))
```

Or, as another example, this would get you the absolute difference of each datapoint from the series average as a new series:

```
$q = q("avg:rate:os.cpu{host=*bosun*}", "5m", "")
map($q, expr(abs(v()-avg($q))))
```

Since this function is not optimized for a particular operation on a seriesSet, it may not be very efficient. If you find yourself writing more complex expressions within the `expr(...)` inside map (for example, query functions), then you may want to consider requesting a new function to be added to Bosun's DSL.

## expr(expression)
{: .exprFunc}

expr takes an expression and returns either a numberSetExpr or a seriesSetExpr depending on the resulting type of the inner expression. This exists for functions like `map`; it is currently not valid in the expression language outside of function arguments.

## month(offset scalar, startEnd string) scalar
{: .exprFunc}

Returns the epoch of either the start or end of the month. Offset is the timezone offset from UTC that the month starts/ends at (but the returned epoch is representative of UTC). startEnd must be either `"start"` or `"end"`. Useful for things like monthly billing, for example:

```
$hostInt = host=ny-nexus01,iname=Ethernet1/46
$inMetric = "sum:5m-avg:rate{counter,,1}:__ny-nexus01.os.net.bytes{$hostInt,direction=in}"
$outMetric = "sum:5m-avg:rate{counter,,1}:__ny-nexus01.os.net.bytes{$hostInt,direction=out}"
$commit = 100
$monthStart = month(-4, "start")
$monthEnd = month(-4, "end")
$monthLength = $monthEnd - $monthStart
$burstTime = ($monthLength)*.05
$burstableObservations = $burstTime / d("5m")
$in = q($inMetric, tod(epoch()-$monthStart), "") * 8 / 1e6
$out = q($outMetric, tod(epoch()-$monthStart), "") * 8 / 1e6
$inOverCount = sum($in > $commit)
$outOverCount = sum($out > $commit)
$inOverCount > $burstableObservations || $outOverCount > $burstableObservations
```

## series(tagset string, epoch, value, ...) seriesSet
{: .exprFunc}

Returns a seriesSet with one series. The series will have a group (a.k.a. tagset). The tagset can be `""` for the empty group, or in `"key=value,key=value"` format. You can then optionally pass epoch/value pairs (if none are provided, the series will be empty). This can be used for testing or drawing arbitrary lines. For example:

```
$now = epoch()
$hourAgo = $now-d("1h")
merge(series("foo=bar", $hourAgo, 5, $now, 10), series("foo=bar2", $hourAgo, 6, $now, 11))
```

## shift(seriesSet, dur string) seriesSet
{: .exprFunc}

Shift takes a seriesSet and shifts the time forward by the value of dur (an [OpenTSDB duration string](http://opentsdb.net/docs/build/html/user_guide/query/dates.html)) and adds a tag representing the shift duration. This is meant so you can overlay times visually in a graph.
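For instance, a sketch that overlays last week's data on this week's for graphing (the metric is illustrative; the tag that shift adds keeps the merged groups distinct):

```
$today = q("avg:rate:os.cpu{host=*bosun*}", "1d", "")
# the same 1 day window from one week earlier, shifted forward so it lines up with today
$lastWeek = shift(q("avg:rate:os.cpu{host=*bosun*}", "8d", "7d"), "1w")
merge($today, $lastWeek)
```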
## leftjoin(tagsCSV string, dataCSV string, ...numberSet) table
{: .exprFunc}

leftjoin takes multiple numberSets and joins them to the first numberSet to form a table. tagsCSV is a comma-delimited string that should match the tags from the query that you want to display (e.g., "host,disk"). dataCSV is a list of column names, one per numberSet, so it should have the same number of labels as there are numberSets.

The only current intended use case is for constructing "Table" panels in Grafana.

For example, the following in Grafana would create a table that shows the CPU of each host for the current period, the CPU for the adjacent previous period, and the difference between them:

```
$cpuMetric = "avg:$ds-avg:rate{counter,,1}:os.cpu{host=*bosun*}{}"
$currentCPU = avg(q($cpuMetric, "$start", ""))
$span = (epoch() - (epoch() - d("$start")))
$previousCPU = avg(q($cpuMetric, tod($span*2), "$start"))
$delta = $currentCPU - $previousCPU
leftjoin("host", "Current CPU,Previous CPU,Change", $currentCPU, $previousCPU, $delta)
```

Note that the above example is intended to be used in Grafana via the Bosun datasource, so `$start` and `$ds` are replaced by Grafana before the query is sent to Bosun.

## merge(SeriesSet...) seriesSet
{: .exprFunc}

Merge takes multiple seriesSets and merges them into a single seriesSet. The function will error if any of the tag sets (groups) are identical. This is meant so you can display multiple seriesSets in a single expression graph.

## nv(numberSet, scalar) numberSet
{: .exprFunc}

Changes the NaN value that unknown groups get during binary operations (when joining two queries) to the given scalar. This is useful to prevent unknown group and other errors from bubbling up.

## sort(numberSet, (asc|desc) string) numberSet
{: .exprFunc}

Returns the results sorted by value in ascending ("asc") or descending ("desc") order. Results are first sorted by groupname and then stably sorted so that results with identical values are always in the same order.

## timedelta(seriesSet) seriesSet
{: .exprFunc}

Returns the difference between successive timestamps in a series. For example:

```
timedelta(series("foo=bar", 1466133600, 1, 1466133610, 1, 1466133710, 1))
```

Would return a seriesSet equal to:

```
series("foo=bar", 1466133610, 10, 1466133710, 100)
```

## tail(seriesSet, num numberSet) seriesSet
{: .exprFunc}

Returns the most recent num points from a series. If the series is shorter than the number of requested points, the series is unchanged since all of its points are already in the requested window. This function is useful for making calculations on the leading edge of a series. For example:

```
tail(series("foo=bar", 1466133600, 1, 1466133610, 1, 1466133710, 1), 2)
```

Would return a seriesSet equal to:

```
series("foo=bar", 1466133610, 1, 1466133710, 1)
```

</div>