bosun.org@v0.0.0-20210513094433-e25bc3e69a1f/docs/usage.md

     1  ---
     2  layout: default
     3  title: Usage
     4  ---
     5  
     6  <div class="row">
     7  <div class="col-sm-3" >
     8    <div class="sidebar" data-spy="affix" data-offset-top="0" data-offset-bottom="0" markdown="1">
     9   
    10   * Some TOC
    11   {:toc}
    12   
    13    </div>
    14  </div>
    15  
    16  <div class="doc-body col-sm-9" markdown="1">
    17  
    18  <p class="title h1">{{page.title}}</p>
    19  This part of the documentation covers using Bosun's user interface and the incident workflow.
    20  
    21  # Alerts and Incidents
    22  
    23  ## Overview
    24  Each alert definition has the potential to turn into multiple incidents (an incident is an instantiation of the alert). Incidents get a unique global ID and are also associated with an Alert Key. The Alert Key is made up of the alert name and the tagset. Every possible group in your top level expression is evaluated independently. As an example, an expression like `avg(q("avg:rate{counter,,1}:os.cpu{host=*}", "5m", ""))` can potentially create an incident for every tag-value of the "host" tag-key that has sent data for the os.cpu metric.
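To make the relationship between alert name, tagset, and Alert Key concrete, here is a minimal sketch in Python. The `alert_key` helper is hypothetical (it is not part of Bosun), and sorting the tags by key is an assumption made only so the output is deterministic.

```python
def alert_key(name, tags):
    """Build an Alert Key string from an alert name and a tagset (dict)."""
    # Assumption: tags are rendered as key=value pairs, sorted by key.
    pairs = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{name}{{{pairs}}}"


# One potential incident per value of the "host" tag that has sent data.
hosts = ["web01", "web02"]
keys = [alert_key("high.cpu", {"host": h}) for h in hosts]
print(keys)  # ['high.cpu{host=web01}', 'high.cpu{host=web02}']
```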
    25  
    26  ## The lifetime of an incident
    27  
    28  An incident gets created when the warn or crit expression evaluates to non-zero, or the alert goes unknown. Once an incident has been created it will notify users only when the lifetime severity of the incident increases. An exception to this is if you have set up notification chains, in which case the alert will send more notifications until someone acknowledges the alert.
    29  
    30  Example:
    31  
    32   * You have an alert named high.cpu defined, and it has a warn expression like `avg(q(os.cpu{host=*} ...)) > 50`. One of your hosts (web01) triggers the warn condition of the alert.
    33   * We now have an incident. The incident will get a global ID like #23412, an alert key of `high.cpu{host=web01}`, and a current severity state of warn. Assuming a notification has been set up, the notification will be sent (e.g. an email).
    34   * The incident then goes back to normal severity, and then to warn again. When this happens, no notifications are sent. It is important to note that **notifications are only sent when the lifetime severity of an incident increases**. The lifetime of the incident continues until the alert has been closed - which is generally done by a user.
    35   * The incident can be closed when it goes back to normal state. Once the incident is closed, it is possible for a new incident to be created for the same Alert Key (`high.cpu{host=web01}`).
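The notification rule in the example above can be sketched as follows. This is an illustrative model only: the `Severity` and `Incident` names are hypothetical, the ordering follows the list in the Severity States section, and notification chains (which keep notifying until someone acknowledges) are deliberately left out.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Ordered from lowest to highest, as listed under Severity States.
    NORMAL = 0
    WARNING = 1
    CRITICAL = 2
    ERROR = 3
    UNKNOWN = 4


class Incident:
    """Hypothetical model of one open incident for a single Alert Key."""

    def __init__(self, alert_key):
        self.alert_key = alert_key
        self.worst = Severity.NORMAL  # lifetime (worst) severity so far

    def observe(self, current):
        """Record a check result; return True if a notification would be sent."""
        if current > self.worst:
            self.worst = current
            return True
        return False


incident = Incident("high.cpu{host=web01}")
sent = [
    incident.observe(s)
    for s in (
        Severity.WARNING,   # first warn: notify
        Severity.NORMAL,    # back to normal: no notification
        Severity.WARNING,   # warn again: lifetime severity unchanged
        Severity.CRITICAL,  # lifetime severity increases: notify
    )
]
print(sent)  # [True, False, False, True]
```

Closing the incident would correspond to discarding this object, after which a new `Incident` can be created for the same Alert Key.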
    36  
    37  ## Severity States
    38  
    39  Incidents can be in one of the following severity levels (From highest to lowest):
    40  
    41  * **Unknown**: When a warn or crit expression cannot be evaluated because data is missing. When you define an alert, bosun tracks each resulting tagset from the warn/crit expressions. If a tagset is no longer present, that instance goes into an unknown state. Since bosun has data pushed to it, unknown can mean that either data collection has failed, or that the source is down. Unknown triggers when there is no data for the tagset in 2x the check frequency duration. This means that if a query spans an hour, it will be one hour + 2x the check frequency before it triggers.
    42  * **Error**: There is some sort of bosun internal error with the alert, such as divide by zero or "response too large". The error can be viewed by clicking the Errors button on the dashboard.
    43  * **Critical**: The expression that `crit` is equal to in the alert definition is non-zero (true). It is recommended that "Critical" be thought of as "has failed".
    44  * **Warning**: The expression that `warn` is equal to in the alert definition is non-zero (true) *and* critical is not true. It is recommended that "Warning" be thought of as "could lead to failure".
    45  * **Normal**: None of the above states.
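As a rough illustration of how these states relate, the sketch below classifies a single tagset. It is not Bosun's implementation: the `classify` function is hypothetical, the error state is omitted, and `last_seen` is assumed to be the last time the tagset appeared in the expression results.

```python
from datetime import datetime, timedelta


def classify(warn, crit, last_seen, now, check_frequency):
    """Return a severity name for one tagset (error state omitted)."""
    # Unknown: the tagset has been absent for more than 2x the check frequency.
    if now - last_seen > 2 * check_frequency:
        return "unknown"
    if crit != 0:
        return "critical"
    if warn != 0:  # only reached when crit is not true
        return "warning"
    return "normal"


now = datetime(2021, 5, 13, 12, 0)
freq = timedelta(minutes=5)          # hypothetical check frequency
fresh = now - timedelta(minutes=3)   # tagset seen recently
stale = now - timedelta(minutes=11)  # absent for more than 2 x 5m

print(classify(1, 0, fresh, now, freq))  # warning
print(classify(1, 1, fresh, now, freq))  # critical
print(classify(1, 1, stale, now, freq))  # unknown
```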
    46  
    47  ## Additional States
    48  
    49  * **Active**: The alert is currently in a non-normal state. This is indicated by an exclamation on the dashboard: <i class="fa fa-exclamation-circle fa-lg" aria-hidden="true"></i>.  Alerts don't disappear from the dashboard when they are no longer active until they are closed. This is to ensure that all alerts get handled - which reduces alert noise and fatigue.
    50  * **Silenced**: Someone has created a silence rule that stops this alert from triggering any notification. It will also automatically close when the alert is no longer active. This is indicated by a volume off speaker icon: <i class="fa fa-volume-off fa-lg" aria-hidden="true"></i>.
    51  * **Acknowledged**: Someone has acknowledged the alert. The reason and the person should be available via the web interface. Acknowledged alerts stop sending notification chains as long as the severity doesn't increase.
    52  * **Unacknowledged**: Nobody has acknowledged the alert yet at its current severity level.
    53  * **Unevaluated**: An incident is unevaluated if the dependency expression as defined in the alert's depends keyword is non-zero. Unevaluated alerts do not change state or become unknown. If an incident is open then it will still show up on the dashboard, but with a question mark icon: <i class="fa fa-question-circle fa-lg" aria-hidden="true"></i>. New incidents will not be created.
    54  
    55  # Dashboard
    56  
    57  ## Indicators
    58  
    59  ### Colors
    60  
    61  The color of the majority of the bar is the incident's last abnormal status. The color that makes up the sliver on the left side of the bar is the incident's current status.
    62  
    63  * <span class="text-info">**Blue**:</span> Unknown
    64  * <span class="text-danger">**Red**:</span> Critical
    65  * <span class="text-warning">**Yellow**:</span> Warning
    66  * <span class="text-success"> **Green**:</span> Normal
    67  
    68  ### Icons
    69  
    70  * <i class="fa fa-exclamation-circle fa-lg" aria-hidden="true"></i> An exclamation icon means the alert is currently in an [active state](/usage#additional-states).
    71  * <i class="fa fa-volume-off fa-lg" aria-hidden="true"></i> A silence icon means the alert has been [silenced](/usage#additional-states).
    72  * <i class="fa fa-question-circle fa-lg" aria-hidden="true"></i> A question icon means the alert is [unevaluated](/usage#additional-states).
    73  * <i class="fa fa-fire fa-lg" aria-hidden="true"></i> A fire icon means the alert is in an [error state](/usage#severity-states).
    74  
    75  
    76  ## Actions
    77  
    78  * **Acknowledge**: Prevent further notifications unless there is a state increase. This also moves it to the acknowledged section of the dashboard. When you acknowledge something you enter a name and a reason. This means that the person has committed to fixing the problem or the alert.
    79  * **Close**: Make it disappear from the dashboard. This should be used when an alert is handled. Active (non-normal) alerts cannot be closed (since all that will happen is that they will reappear on the dashboard after the next schedule run).
    80  * **Forget**: Make bosun forget about this instance of the alert. This is used on active unknown alerts. It is useful when something is not coming back (e.g. you have decommissioned a host). This action is non-destructive because if that data gets sent to bosun again everything will come back.
    81  * **Force Close**: Like close, but does not require alert to be in a normal state. In a few circumstances an alert can be "open" and "active" at the same time. This can occur when a host is decommissioned and an alert has ignoreUnknown set, for example. This may help to clear some of those "stuck" alerts.
    82  * **Purge**: Will delete an active alert and *all* history for that alert key. Should only be used when you absolutely want to forget all data about a host, like when shutting it down. Like forget, but does not require an alert to be unknown.
    83  * **History**: View a timeline of history for the selected alert instances.
    84  * **Note**: Attach a note to an incident. This has no impact on the behavior of the alert and is purely for communication.
    85  
    86  ## Incident Filters
    87  
    88  The open incident filter supports joining terms in `()` as well as the `AND`, `OR`, and `!` operators. The following query terms are supported and are always in the format of `something:something`:
    89  
    90  <table>
    91      <tr>
    92          <th>Term Spec</th>
    93          <th>Description</th>
    94      </tr>
    95      <tr>
    96          <td><code>ack:(true|false)</code></td>
    97          <td>If <code>ack:true</code>, incidents that have been acknowledged are returned; if <code>ack:false</code>, incidents that have not been acknowledged are returned.</td>
    98      </tr>
    99      <tr>
   100          <td><code>ackTime:[<|>](1d)</code></td>
   101          <td>Returns incidents that were acknowledged before <code><</code> or incidents that were acknowledged after <code>></code> the
   102              relative time to now based on the duration. Duration can be in units of s (seconds), m (minutes),
   103              h (hours), d (days), w (weeks), n (months), y (years). If less than or greater than are not part
   104              of the value, it defaults to greater than (after). Now is clock time and is not related to the time
   105              range specified in Grafana. For example, <code>ackTime:<24h</code> shows incidents that were acknowledged more than 24 hours ago.</td>
   106      </tr>
   107      <tr>
   108          <td><code>hasTag:(tagKey|tagKey=|=tagValue|tagKey=tagValue)</code></td>
   109          <td>Returns incidents that have the given tag key, tag value, or key=value pair. If there is no equals sign, it is treated as a tag
   110              key. Tag values may contain globs, such as <code>hasTag:host=ny-*</code>.</td>
   111      </tr>
   112      <tr>
   113          <td><code>hidden:(true|false)</code></td>
   114          <td>If <code>hidden:false</code>, incidents that are hidden will not be shown. An incident is hidden if it
   115              is in a silenced or unevaluated state.</td>
   116      </tr>
   117      <tr>
   118          <td><code>name:(something*)</code></td>
   119          <td>Returns incidents where the alert name (not including the tagset) matches the value. Globs can be used
   120              in the value.</td>
   121      </tr>
   122      <tr>
   123          <td><code>user:(username*)</code></td>
   124          <td>Returns incidents where a user has taken any action on that incident. Globs can be used in the value.</td>
   125      </tr>
   126      <tr>
   127          <td><code>notify:(notificationName*)</code></td>
   128          <td>Returns incidents where the notificationName is somewhere in either the crit or warn notification chains.
   129              Globs can be used in the value.</td>
   130      </tr>
   131      <tr>
   132          <td><code>silenced:(true|false)</code></td>
   133          <td>If <code>silenced:false</code>, incidents that have not been silenced are returned; if <code>silenced:true</code>, incidents that have been silenced are returned.</td>
   134      </tr>
   135      <tr>
   136          <td><code>start:[<|>](1d)</code> </td>
   137          <td>Returns incidents that started before <code><</code> or incidents that started after <code>></code> the
   138              relative time to now based on the duration. Duration can be in units of s (seconds), m (minutes),
   139              h (hours), d (days), w (weeks), n (months), y (years). If less than or greater than are not part
   140              of the value, it defaults to greater than (after). Now is clock time and is not related to the time
   141              range specified in Grafana.</td>
   142      </tr>
   143      <tr>
   144          <td><code>unevaluated:(true|false)</code></td>
   145          <td>If <code>unevaluated:false</code>, incidents that are not in an unevaluated state are returned; if
   146              <code>unevaluated:true</code>, incidents that are unevaluated are returned.</td>
   147      </tr>
   148      <tr>
   149          <td><code>status:(normal|warning|critical|unknown)</code></td>
   150          <td>Returns incidents that are currently in the requested state</td>
   151      </tr>
   152      <tr>
   153          <td><code>worstStatus:(normal|warning|critical|unknown)</code></td>
   154          <td>Returns incidents that have a worst status equal to the requested state</td>
   155      </tr>
   156      <tr>
   157          <td><code>lastAbnormalStatus:(warning|critical|unknown)</code></td>
   158          <td>Returns incidents that have a last abnormal status equal to the requested state</td>
   159      </tr>
   160      <tr>
   161          <td><code>subject:(something*)</code></td>
   162          <td>Returns incidents where the subject string matches the value. Globs can be used in the value</td>
   163      </tr>
   164      <tr>
   165          <td><code>since:[<|>](1d)</code> </td>
   166          <td>Returns incidents that have been in their current <code>status</code> for more than <code><</code> or for less than <code>></code> the
   167              duration, relative to now. Duration can be in units of s (seconds), m (minutes),
   168              h (hours), d (days), w (weeks), n (months), y (years). If less than or greater than are not part
   169              of the value, it defaults to greater than (after). Now is clock time and is not related to the time
   170              range specified in Grafana.<br>
   171              For example, <code>status:normal AND since:<15d</code> returns incidents that have been in <code>normal</code> for more than 15 days.
   172          </td>
   173      </tr>
   174  </table>
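The time-based terms (`ackTime`, `start`, `since`) share one value format. A minimal sketch of parsing such a term is shown below. The `parse_time_term` function and its regular expression are hypothetical and only mirror the rules in the table: an optional `<` or `>`, a number, and one of the listed unit letters. Whole-number amounts are an assumption.

```python
import re

# Unit letters as listed in the table above.
UNITS = {
    "s": "seconds", "m": "minutes", "h": "hours", "d": "days",
    "w": "weeks", "n": "months", "y": "years",
}

_TERM = re.compile(r"^(ackTime|start|since):([<>]?)(\d+)([smhdwny])$")


def parse_time_term(term):
    """Split a term such as 'ackTime:<24h' into (key, operator, amount, unit)."""
    match = _TERM.match(term)
    if match is None:
        raise ValueError(f"not a time filter term: {term!r}")
    key, op, amount, unit = match.groups()
    # When no operator is given, the filter defaults to greater than (after).
    return key, op or ">", int(amount), UNITS[unit]


print(parse_time_term("ackTime:<24h"))  # ('ackTime', '<', 24, 'hours')
print(parse_time_term("start:1d"))      # ('start', '>', 1, 'days')
```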
   175  
   176  # Rule Editor
   177  The rule editor allows you to edit the definitions in the [RuleConf](/definitions), preview rendered templates, and test alerts against historical data.
   178  
   179  ## Rule Editor Image
   180  ![Rule Editor Image](/public/rule_editor.jpg)
   181  
   182  ## Textarea
   183  The text area will be loaded with the running config when the Rule Editor view is loaded. A hash of the config is saved when you start editing it. If someone else edits and saves the config, Bosun will detect that the config hash has changed and show a warning above the text area.
   184  
   185  When you run a test, your version of the config is saved in Bosun, and you can link to it so others can see it.
   186  
   187  The editor is built using the open source [Ace editor](https://ace.c9.io/).
   188  
   189  ## Jump Buttons
   190  The Jump drop downs <a href="/usage#rule-editor-image" class="image-number">①</a> will take you to defined sections within the config. In particular, the alert drop down selects which alert will be used for testing.
   191  
   192  At the end there is a switcher that can be used when you are working on an alert. It allows you to jump back and forth between the alert and the template referenced in the alert.
   193  
   194  ## Download / Validate
   195  The download button <a href="/usage#rule-editor-image" class="image-number">②</a> will download the config file as a text file. Validate makes sure that Bosun considers the config valid using the same validation that is required for Bosun to start.
   196  
   197  ## Definition [Rule] Saving
   198  The save button <a href="/usage#rule-editor-image" class="image-number">②</a> will bring up a dialogue that lets you save the config. This only appears if you have permission to save the config, and the [system configuration's `EnableSave`](/system_configuration#enablesave) has been set to true.
   199  
   200  The save dialogue will show you a contextual diff of your config and the running config. There are several protections in place to prevent you from overwriting someone else's configuration changes:
   201  
   202    * The Rule Editor will show a warning if the config has been saved since you started editing it
   203    * A contextual diff is shown of your changes versus the running config (and the save will fail if the contextual diff happens to change in the time window before you hit save)
   204    * When the file is being saved, a global lock is taken in Bosun so nobody else can save while the save is happening
   205  
   206  If the config file is successfully saved then Bosun will reload the new definitions. Alerts that are currently being processed will be cancelled and restarted. In other words, a restart of the Bosun process is *not* required for the new changes to take effect.
   207  
   208  An external command to run on saves can also be defined with the [CommandHookPath setting in the system configuration](/system_configuration#commandhookpath). This can be used to do things like create backups of the file or check the changes into version control. If this command returns a non-zero exit code, saving will also fail.
   209  
   210  In all cases where a save fails, a reload will not happen and the save will not be persisted (the definitions file will not be changed).
   211  
   212  ## Alert Testing
   213  Alerts can be tested before they are committed to production. This allows you to refine the trigger conditions to control the signal-to-noise ratio and to preview the rendered templates to make sure alerts are informative. This is done by selecting the alert from the [Jump Alert Drop down](/usage#jump-buttons) at <a href="/usage#rule-editor-image" class="image-number">①</a> and then clicking the test alert button at <a href="/usage#rule-editor-image" class="image-number">④</a>.
   214  
   215  There are two ways you can test alerts: 
   216   
   217    1. A single iteration (a snapshot of time)
   218    2. Multiple iterations over a period of time. 
   219    
   220  Which behavior is used depends on the <span class="docFromLabel">From</span> and <label>To</label> fields at <a href="/usage#rule-editor-image" class="image-number">③</a>. If <span class="docFromLabel">From</span> is left blank, then a single iteration is tested at the current time. If <span class="docFromLabel">From</span> is set to a time and <span class="docFromLabel">To</span> is unset, a single iteration will be done at that time. When doing single iteration testing, the <span class="docFromLabel">Results</span> and <span class="docFromLabel">Template</span> tabs at <a href="/usage#rule-editor-image" class="image-number">⑤</a> will be populated. The <span class="docFromLabel">Results</span> tab shows the warn/crit results for each item in the result set, and a rendered template will be shown in the <span class="docFromLabel">Template</span> tab.
   221  
   222  Which item from the result set will be rendered in the Template tab is controlled by the <span class="docFromLabel">Template Group</span> field at <a href="/usage#rule-editor-image" class="image-number">④</a>. The result to use for the template is picked by specifying a tagset in the format of `key=value,key=value`. The first result that has the specified tags will be used. If no results match, then the first result is chosen.
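The selection rule for the Template Group field can be sketched as follows. This is an illustration, not Bosun's code: the `parse_tagset` and `pick_result` helpers and the shape of each result (a dict with a `tags` entry) are assumptions.

```python
def parse_tagset(text):
    """Turn 'key=value,key=value' into a dict; empty input gives an empty dict."""
    tags = {}
    for pair in text.split(","):
        if pair.strip():
            key, _, value = pair.partition("=")
            tags[key.strip()] = value.strip()
    return tags


def pick_result(results, template_group):
    """Return the first result carrying all requested tags, else the first result."""
    wanted = parse_tagset(template_group)
    for result in results:
        if all(result["tags"].get(k) == v for k, v in wanted.items()):
            return result
    return results[0] if results else None


results = [
    {"tags": {"host": "web01", "iface": "eth0"}},
    {"tags": {"host": "web02", "iface": "eth0"}},
]
print(pick_result(results, "host=web02")["tags"]["host"])  # web02
print(pick_result(results, "host=db01")["tags"]["host"])   # web01 (no match)
```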
   223  
   224  <div class="admonition">
   225  <p class="admonition-title">Tip</p>
   226  <p>When working on a template it is good to set the <span class="docFromLabel">From</span> time to a fixed date. That way when expressions are rerun they will likely hit Bosun's query cache and things will be faster.</p>
   227  </div>
   228  
   229  The <span class="docFromLabel">Email</span> field at <a href="/usage#rule-editor-image" class="image-number">④</a> makes it so that when an alert is tested, the rendered template is emailed to the address specified in the field. This is so you can check for any differences between the email and what you see in the <span class="docFromLabel">Template</span> tab.
   230  
   231  Setting both <span class="docFromLabel">From</span> and <span class="docFromLabel">To</span> enables testing multiple iterations of the selected alert over time. The number of iterations depends on the settings of the two linked fields <span class="docFromLabel">Intervals</span> and <span class="docFromLabel">Step Duration</span> at <a href="/usage#rule-editor-image" class="image-number">③</a>. Changing one changes the other. <span class="docFromLabel">Intervals</span> is the number of runs to do, evenly spaced over the duration from <span class="docFromLabel">From</span> to <span class="docFromLabel">To</span>, and <span class="docFromLabel">Step Duration</span> is how much time in minutes should be between intervals. Doing a test over time will populate the <span class="docFromLabel">Timeline</span> tab <a href="/usage#rule-editor-image" class="image-number">⑤</a>, which draws a clickable graphic of severity states for each item in the set:
   232  
   233  ![Rule Editor Timeline Image](/public/timeline.jpg)
   234  
   235  Each row in the image is one of the items in the result set. The colored squares represent the severity of that instance. The X-axis is time. When you click a square on the image, it will take you to the event you clicked and show you what the template would look like at that time for that particular item.
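The link between Intervals and Step Duration when testing over time is simple arithmetic. The sketch below shows one plausible reading, in which the time range is divided evenly. The function names are hypothetical, and whether Bosun includes the end time as a run is not stated here, so this sketch excludes it.

```python
from datetime import datetime, timedelta


def step_minutes(start, end, intervals):
    """Step duration in minutes when the range is split into `intervals` runs."""
    return (end - start) / timedelta(minutes=1) / intervals


def intervals_for(start, end, step_min):
    """Number of runs that fit in the range for a given step duration."""
    return int((end - start) / timedelta(minutes=step_min))


def run_times(start, end, intervals):
    """Evenly spaced evaluation times, starting at `start` (end excluded)."""
    step = (end - start) / intervals
    return [start + i * step for i in range(intervals)]


start = datetime(2021, 5, 1, 0, 0)
end = datetime(2021, 5, 1, 6, 0)

print(step_minutes(start, end, 12))   # 30.0
print(intervals_for(start, end, 30))  # 12
print(run_times(start, end, 12)[1])   # 2021-05-01 00:30:00
```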
   236  
   237  # Annotations
   238  
   239  Annotations are currently stored in Elasticsearch. When annotations are enabled you can create, edit, and visualize them on the Graph page. There is also a Submit Annotations page that allows for creating and editing annotations. The API described in this [README](https://github.com/bosun-monitor/annotate/blob/master/web/README.md) gets injected into bosun under `/api/`. You can also find a description of the schema there.
   240  
   241  </div>
   242  </div>