bosun.org@v0.0.0-20210513094433-e25bc3e69a1f/docs/usage.md (about)
---
layout: default
title: Usage
---

<div class="row">
<div class="col-sm-3" >
<div class="sidebar" data-spy="affix" data-offset-top="0" data-offset-bottom="0" markdown="1">

* Some TOC
{:toc}

</div>
</div>

<div class="doc-body col-sm-9" markdown="1">

<p class="title h1">{{page.title}}</p>
This part of the documentation covers using Bosun's user interface and the incident workflow.

# Alerts and Incidents

## Overview
Each alert definition can turn into multiple incidents (each one an instantiation of the alert). Incidents get a unique global ID and are also associated with an Alert Key. The Alert Key is made up of the alert name and the tagset. Every possible group in your top level expression is evaluated independently. For example, an expression like `avg(q("avg:rate{counter,,1}:os.cpu{host=*}", "5m", ""))` can create an incident for every value of the `host` tag key that has sent data for the os.cpu metric.

## The lifetime of an incident

An incident gets created when the warn or crit expression evaluates to non-zero, or the alert goes unknown. Once an incident has been created, it will notify users only when the lifetime severity of the incident increases. An exception to this is if you have set up notification chains, in which case the alert will send more notifications until someone acknowledges the alert.

Example:

* You have an alert named high.cpu defined, and it has a warn expression like `avg(q(os.cpu{host=*} ...)) > 50`. One of your hosts (web01) triggers the warn condition of the alert.
* We now have an incident. It gets a global ID like #23412, an alert key of `high.cpu{host=web01}`, and a current severity state of warn. Assuming a notification has been set up, the notification will be sent (e.g.,
an email).
* The incident then goes back to normal severity, and then to warn again. When this happens, no notifications are sent. It is important to note that **notifications are only sent when the lifetime severity of an incident increases**. The lifetime of the incident continues until the alert has been closed, which is generally done by a user.
* The incident can be closed when it goes back to normal state. Once the incident is closed, it is possible for a new incident to be created for the same Alert Key (`high.cpu{host=web01}`).

## Severity States

Incidents can be in one of the following severity levels (from highest to lowest):

* **Unknown**: A warn or crit expression cannot be evaluated because data is missing. When you define an alert, Bosun tracks each resulting tagset from the warn/crit expressions. If a tagset is no longer present, that instance goes into an unknown state. Since Bosun has data pushed to it, unknown can mean that either data collection has failed, or that the source is down. Unknown triggers when there is no data for the tagset in 2x the check frequency duration. This means that if a query spans an hour, it will be one hour + 2x the check frequency before it triggers.
* **Error**: There is some sort of Bosun internal error with the alert, such as divide by zero or "response too large". The error can be viewed by clicking the Errors button on the dashboard.
* **Critical**: The expression that `crit` is equal to in the alert definition is non-zero (true). It is recommended that "Critical" be thought of as "has failed".
* **Warning**: The expression that `warn` is equal to in the alert definition is non-zero (true) *and* critical is not true. It is recommended that warning be thought of as "could lead to failure".
* **Normal**: None of the above states.

## Additional States

* **Active**: The alert is currently in a non-normal state.
This is indicated by an exclamation icon on the dashboard: <i class="fa fa-exclamation-circle fa-lg" aria-hidden="true"></i>. Alerts that are no longer active stay on the dashboard until they are closed. This is to ensure that all alerts get handled, which reduces alert noise and fatigue.
* **Silenced**: Someone has created a silence rule that stops this alert from triggering any notification. It will also automatically close when the alert is no longer active. This is indicated by a volume off speaker icon: <i class="fa fa-volume-off fa-lg" aria-hidden="true"></i>.
* **Acknowledged**: Someone has acknowledged the alert; the reason and person should be available via the web interface. Acknowledged alerts stop sending notification chains as long as the severity doesn't increase.
* **Unacknowledged**: Nobody has acknowledged the alert yet at its current severity level.
* **Unevaluated**: An incident is unevaluated if the dependency expression as defined in the alert's depends keyword is non-zero. Unevaluated alerts do not change state or become unknown. If an incident is open then it will still show up on the dashboard, but with a question mark icon: <i class="fa fa-question-circle fa-lg" aria-hidden="true"></i>. New incidents will not be created.

# Dashboard

## Indicators

### Colors

The color of the majority of the bar is the incident's last abnormal status. The color that makes up the sliver on the left side of the bar is the incident's current status.

* <span class="text-info">**Blue**:</span> Unknown
* <span class="text-danger">**Red**:</span> Critical
* <span class="text-warning">**Yellow**:</span> Warning
* <span class="text-success"> **Green**:</span> Normal

### Icons

* <i class="fa fa-exclamation-circle fa-lg" aria-hidden="true"></i> An exclamation icon means the alert is currently in an [active state](/usage#additional-states).
* <i class="fa fa-volume-off fa-lg" aria-hidden="true"></i> A silence icon means the alert has been [silenced](/usage#additional-states).
* <i class="fa fa-question-circle fa-lg" aria-hidden="true"></i> A question icon means the alert is [unevaluated](/usage#additional-states).
* <i class="fa fa-fire fa-lg" aria-hidden="true"></i> A fire icon means the alert is in an [error state](/usage#severity-states).


## Actions

* **Acknowledge**: Prevent further notifications unless there is a state increase. This also moves it to the acknowledged section of the dashboard. When you acknowledge something you enter a name and a reason. This means that the person has committed to fixing the problem or the alert.
* **Close**: Make it disappear from the dashboard. This should be used when an alert is handled. Active (non-normal) alerts cannot be closed, since they would simply reappear on the dashboard after the next schedule run.
* **Forget**: Make Bosun forget about this instance of the alert. This is used on active unknown alerts. It is useful when something is not coming back (e.g., you have decommissioned a host). This action is non-destructive because if that data gets sent to Bosun again everything will come back.
* **Force Close**: Like close, but does not require the alert to be in a normal state. In a few circumstances an alert can be "open" and "active" at the same time. This can occur when a host is decommissioned and an alert has ignoreUnknown set, for example. This may help to clear some of those "stuck" alerts.
* **Purge**: Will delete an active alert and *all* history for that alert key. Should only be used when you absolutely want to forget all data about a host, like when shutting it down. Like forget, but does not require an alert to be unknown.
* **History**: View a timeline of history for the selected alert instances.
* **Note**: Attach a note to an incident.
This has no impact on the behavior of the alert and is purely for communication.

## Incident Filters

The open incident filter supports grouping terms with `()` as well as the `AND`, `OR`, and `!` operators. The following query terms are supported and are always in the format of `something:something`:

<table>
<tr>
<th>Term Spec</th>
<th>Description</th>
</tr>
<tr>
<td><code>ack:(true|false)</code></td>
<td>If <code>ack:true</code>, incidents that have been acknowledged are returned; if <code>ack:false</code>, incidents that have not been acknowledged are returned.</td>
</tr>
<tr>
<td><code>ackTime:[&lt;|&gt;](1d)</code></td>
<td>Returns incidents that were acknowledged before <code>&lt;</code> or incidents that were acknowledged after <code>&gt;</code> the
relative time to now based on the duration. Duration can be in units of s (seconds), m (minutes),
h (hours), d (days), w (weeks), n (months), y (years). If less than or greater than are not part
of the value, it defaults to greater than (after). Now is clock time and is not related to the time
range specified in Grafana. For example, <code>ackTime:&lt;24h</code> shows incidents that were acknowledged more than 24 hours ago.</td>
</tr>
<tr>
<td><code>hasTag:(tagKey|tagKey=|=tagValue|tagKey=tagValue)</code></td>
<td>Returns incidents that have the given tag key, tag value, or key=value pair. If there is no equals sign, the value is treated as a tag
key. Tag values may contain globs, such as <code>hasTag:host=ny-*</code>.</td>
</tr>
<tr>
<td><code>hidden:(true|false)</code></td>
<td>If <code>hidden:false</code>, incidents that are hidden will not be shown. An incident is hidden if it
is in a silenced or unevaluated state.</td>
</tr>
<tr>
<td><code>name:(something*)</code></td>
<td>Returns incidents where the alert name (not including the tagset) matches the value.
Globs can be used
in the value.</td>
</tr>
<tr>
<td><code>user:(username*)</code></td>
<td>Returns incidents where a user has taken any action on that incident. Globs can be used in the value.</td>
</tr>
<tr>
<td><code>notify:(notificationName*)</code></td>
<td>Returns incidents where the notificationName is somewhere in either the crit or warn notification chains.
Globs can be used in the value.</td>
</tr>
<tr>
<td><code>silenced:(true|false)</code></td>
<td>If <code>silenced:false</code>, incidents that have not been silenced are returned; if <code>silenced:true</code>, incidents that have been silenced are returned.</td>
</tr>
<tr>
<td><code>start:[&lt;|&gt;](1d)</code> </td>
<td>Returns incidents that started before <code>&lt;</code> or incidents that started after <code>&gt;</code> the
relative time to now based on the duration. Duration can be in units of s (seconds), m (minutes),
h (hours), d (days), w (weeks), n (months), y (years). If less than or greater than are not part
of the value, it defaults to greater than (after).
Now is clock time and is not related to the time
range specified in Grafana.</td>
</tr>
<tr>
<td><code>unevaluated:(true|false)</code></td>
<td>If <code>unevaluated:false</code>, incidents that are not in an unevaluated state are returned; if
<code>unevaluated:true</code>, incidents that are unevaluated are returned.</td>
</tr>
<tr>
<td><code>status:(normal|warning|critical|unknown)</code></td>
<td>Returns incidents that are currently in the requested state.</td>
</tr>
<tr>
<td><code>worstStatus:(normal|warning|critical|unknown)</code></td>
<td>Returns incidents that have a worst status equal to the requested state.</td>
</tr>
<tr>
<td><code>lastAbnormalStatus:(warning|critical|unknown)</code></td>
<td>Returns incidents that have a last abnormal status equal to the requested state.</td>
</tr>
<tr>
<td><code>subject:(something*)</code></td>
<td>Returns incidents where the subject string matches the value. Globs can be used in the value.</td>
</tr>
<tr>
<td><code>since:[&lt;|&gt;](1d)</code> </td>
<td>Returns incidents that have been in their current <code>status</code> for more than <code>&lt;</code> or for less than <code>&gt;</code> the
relative time to now based on the duration. Duration can be in units of s (seconds), m (minutes),
h (hours), d (days), w (weeks), n (months), y (years). If less than or greater than are not part
of the value, it defaults to greater than (after). Now is clock time and is not related to the time
range specified in Grafana.<br>
For example, <code>status:normal AND since:&lt;15d</code> returns incidents that have been in <code>normal</code> for more than 15 days.
</td>
</tr>
</table>

# Rule Editor
The rule editor allows you to edit the definitions in the [RuleConf](/definitions), preview rendered templates, and test alerts against historical data.

## Rule Editor Image
![Rule Editor Image](/public/rule_editor.jpg)

## Textarea
The text area will be loaded with the running config when the Rule Editor view is loaded. A hash of the config as it was when you started editing is saved. If someone else edits the config in the UI and saves it, Bosun will detect that the config hash has changed and show a warning above the text area.

When you run a test, your version of the config is saved in Bosun, and you can link to it so others can see it.

The editor is built using the open source [Ace editor](https://ace.c9.io/).

## Jump Buttons
The Jump drop downs <a href="/usage#rule-editor-image" class="image-number">①</a> will take you to defined sections within the config. In particular, the alert drop down selects which alert will be used for testing.

At the end there is a switcher that can be used when you are working on an alert. It allows you to jump back and forth between the alert and the template referenced in the alert.

## Download / Validate
The download button <a href="/usage#rule-editor-image" class="image-number">②</a> will download the config file as a text file. Validate makes sure that Bosun considers the config valid using the same validation that is required for Bosun to start.

## Definition [Rule] Saving
The save button <a href="/usage#rule-editor-image" class="image-number">②</a> will bring up a dialogue that lets you save the config. This only appears if you have permission to save the config, and the [system configuration's `EnableSave`](/system_configuration#enablesave) has been set to true.

The save dialogue will show you a contextual diff of your config and the running config.
There are several protections in place to prevent you from overwriting someone else's configuration changes:

* The Rule Editor will show a warning if the config has been saved since you started editing it.
* A contextual diff is shown of your changes versus the running config (and the save will fail if the contextual diff happens to change in the time window before you hit save).
* When the file is being saved, a global lock is taken in Bosun so nobody else can save while the save is happening.

If the config file is successfully saved then Bosun will reload the new definitions. Alerts that are currently being processed will be cancelled and restarted. In other words, a restart of the Bosun process is *not* required for the new changes to take effect.

An external command to run on saves can also be defined with the [CommandHookPath setting in the system configuration](/system_configuration#commandhookpath). This can be used to do things like create backups of the file or check the changes into version control. If this command returns a non-zero exit code, saving will also fail.

In all cases where a save fails, a reload will not happen and the save will not be persisted (the definitions file will not be changed).

## Alert Testing
Alerts can be tested before they are committed to production. This allows you to refine the trigger conditions to control the signal-to-noise ratio and to preview the rendered templates to make sure alerts are informative. This is done by selecting the alert from the [Jump Alert Drop down](/usage#jump-buttons) at <a href="/usage#rule-editor-image" class="image-number">①</a> and then clicking the test alert button at <a href="/usage#rule-editor-image" class="image-number">④</a>.

There are two ways you can test alerts:

1. A single iteration (a snapshot of time).
2. Multiple iterations over a period of time.

Which behavior is used depends on the <span class="docFromLabel">From</span> and <label>To</label> fields at <a href="/usage#rule-editor-image" class="image-number">③</a>. If <span class="docFromLabel">From</span> is left blank, then a single iteration is tested at the current time. If <span class="docFromLabel">From</span> is set to a time and <span class="docFromLabel">To</span> is unset, a single iteration will be done at that time. When doing single iteration testing, the <span class="docFromLabel">Results</span> and <span class="docFromLabel">Template</span> tabs at <a href="/usage#rule-editor-image" class="image-number">⑤</a> will be populated. The <span class="docFromLabel">Results</span> tab shows the warn/crit results for each set, and a rendered template will be shown in the <span class="docFromLabel">Template</span> tab.

Which item from the result set is rendered in the Template tab is controlled by the <span class="docFromLabel">Template Group</span> field at <a href="/usage#rule-editor-image" class="image-number">④</a>. The result to use for the template is picked by specifying a tagset in the format of `key=value,key=value`. The first result that has the specified tags will be used. If no results match, then the first result is chosen.

<div class="admonition">
<p class="admonition-title">Tip</p>
<p>When working on a template it is good to set the <span class="docFromLabel">From</span> time to a fixed date. That way when expressions are rerun they will likely hit Bosun's query cache and things will be faster.</p>
</div>

The <span class="docFromLabel">Email</span> field at <a href="/usage#rule-editor-image" class="image-number">④</a> makes it so when an alert is tested, the rendered template is emailed to the address specified in the field. This is so you can check for any differences between the email and what you see in the <span class="docFromLabel">Template</span> tab.
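
The single-iteration rules described above can be sketched in a few lines of Python. This is an illustration only, not Bosun's implementation: the function names `choose_test_mode` and `pick_template_result` are hypothetical, and treating "has the specified tags" as "every requested key=value pair is present in the result's tagset" is an assumption.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

TagSet = Dict[str, str]


def choose_test_mode(from_time: Optional[datetime],
                     to_time: Optional[datetime],
                     now: datetime) -> Tuple[str, datetime]:
    """Return (mode, start time) for a given From/To combination (hypothetical helper)."""
    if from_time is None:
        return ("single", now)        # From blank: one iteration at the current time
    if to_time is None:
        return ("single", from_time)  # From only: one iteration at that time
    return ("multiple", from_time)    # From and To: several iterations over the range


def parse_template_group(spec: str) -> TagSet:
    """Parse a 'key=value,key=value' string; a blank string yields no constraints."""
    tags: TagSet = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        tags[key.strip()] = value.strip()
    return tags


def pick_template_result(results: List[TagSet], spec: str) -> Optional[TagSet]:
    """Pick the first result carrying all requested tags, else the first result."""
    wanted = parse_template_group(spec)
    for tags in results:
        # Assumption: a result matches when it contains every requested pair.
        if all(tags.get(key) == value for key, value in wanted.items()):
            return tags
    return results[0] if results else None


results = [{"host": "web01"}, {"host": "web02"}, {"host": "db01"}]
print(pick_template_result(results, "host=web02"))    # {'host': 'web02'}
print(pick_template_result(results, "host=missing"))  # {'host': 'web01'}
```

With a blank Template Group the sketch falls through to the first result, which mirrors the fallback behavior described in the text.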

Setting both <span class="docFromLabel">From</span> and <span class="docFromLabel">To</span> enables testing multiple iterations of the selected alert over time. The number of iterations depends on the settings of the two linked fields <span class="docFromLabel">Intervals</span> and <span class="docFromLabel">Step Duration</span> at <a href="/usage#rule-editor-image" class="image-number">③</a>. Changing one changes the other. <span class="docFromLabel">Intervals</span> is the number of runs to do, evenly spaced over the duration from <span class="docFromLabel">From</span> to <span class="docFromLabel">To</span>, and <span class="docFromLabel">Step Duration</span> is how much time in minutes should be between intervals. Doing a test over time will populate the <span class="docFromLabel">Timeline</span> tab <a href="/usage#rule-editor-image" class="image-number">⑤</a>, which draws a clickable graphic of severity states for each item in the set:

![Rule Editor Timeline Image](/public/timeline.jpg)

Each row in the image is one of the items in the result set. The colored squares represent the severity of that instance. The X-Axis is time. When you click a square on the image, it will take you to the event you clicked and show you what the template would look like at that time for that particular item.

# Annotations

Annotations are currently stored in Elasticsearch. When annotations are enabled you can create, edit and visualize them on the Graph page. There is also a Submit Annotations page that allows for creating and editing annotations. The API described in this [README](https://github.com/bosun-monitor/annotate/blob/master/web/README.md) gets injected into Bosun under `/api/`, and you can also find a description of the schema there.

</div>
</div>