---
layout: page
title: Documentation
order: 4
---

{% raw %}

* auto-gen TOC:
{:toc}

# Architecture

The main components are:

* **scollector**: An agent that provides data collection
    * A binary that gathers Linux and Windows data locally from the system (no external libraries needed)
    * Has built-in collectors
    * Can poll network devices via SNMP and vSphere
    * Can run external scripts
    * Queues data when bosun can't be reached
    * Sends data to bosun via compressed JSON to a REST API
* **bosun**: Data collection and relaying, alerting, and graphing
    * Has an expression language for creating alerts from time-series data queried from OpenTSDB
    * Exposes the Go template language for users to craft alert notifications
    * Has notification escalation
    * Relays data to OpenTSDB
    * Collects metadata: string information about things like hosts (e.g. IP address, serial number) and information about metrics (description, gauge vs. counter, and the metric's measurement unit). Metadata is currently stored locally on the server as a state file
    * Text configuration that can be version controlled: supports macros, lookup tables, alert configuration, notifications, and notification templates
    * Web interface:
        * Has an alert dashboard showing currently triggered alerts, acknowledgments, etc. Alert history can also be viewed
        * Has a graphing interface
        * Has a page for running expressions
        * Has a page for silencing alerts
        * Has a page for testing alerts over history and previewing notifications
        * Host views for basic host information such as CPU, memory, network throughput, and disk space
        * Page to validate configuration

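The scollector behaviour listed above (queue data points, then send them as compressed JSON) can be sketched roughly as follows. This is only an illustration, not scollector's implementation: the data point field names follow the OpenTSDB layout as an assumption, and `build_payload`, `Sender`, and the sample values are hypothetical names invented for this sketch. No network call is made; the transport is passed in as a plain function.

```python
import gzip
import json
from collections import deque


def build_payload(datapoints):
    """Serialize a list of data points to gzip-compressed JSON bytes."""
    body = json.dumps(datapoints).encode("utf-8")
    return gzip.compress(body)


class Sender:
    """Hypothetical queueing sender: keeps data until a send succeeds."""

    def __init__(self, send):
        # send is any callable taking bytes and returning True on success.
        self.send = send
        self.queue = deque()

    def add(self, datapoint):
        self.queue.append(datapoint)

    def flush(self):
        """Try to send everything queued; keep the queue if sending fails."""
        if not self.queue:
            return True
        if self.send(build_payload(list(self.queue))):
            self.queue.clear()
            return True
        return False


# Hypothetical sample data point (field names are an assumption).
sample = [
    {
        "metric": "os.cpu",
        "timestamp": 1620898473,
        "value": 42.5,
        "tags": {"host": "web01"},
    }
]

payload = build_payload(sample)
decoded = json.loads(gzip.decompress(payload).decode("utf-8"))
```

In this sketch a failed `flush` leaves the queue untouched, so the same data points are retried on the next attempt.
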
## Diagram

![Architecture Diagram](public/arch.png)

# Alerts

Each alert definition can turn into multiple alert instances ("alerts"). An alert is uniquely identified by the alert name and the OpenTSDB tagset (which we also call the group). Every possible group in your top-level expression is evaluated independently. For example, with an expression like `avg(q("avg:rate{counter,,1}:os.cpu{host=*}", "5m", ""))` you can get an alert for every tag-value of the "host" tag-key that has sent data for the os.cpu metric. In this way bosun integrates fairly tightly with OpenTSDB. However, there are ways to change alert groups in expressions, in particular by using the t() (transpose) function.

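As a rough illustration of how one alert definition fans out into one instance per group, the sketch below averages each series separately and keys the result by alert name and tagset. It does not use bosun's expression engine; `tagset_key`, `evaluate`, the alert name `cpu.high`, and the sample values are all hypothetical.

```python
def tagset_key(tags):
    """Render a tagset as a stable string such as 'host=web01'."""
    return ",".join(f"{k}={v}" for k, v in sorted(tags.items()))


def evaluate(alert_name, series):
    """Average each series independently.

    series is a list of (tags, values) pairs. The result maps
    (alert name, group) to the average for that group.
    """
    results = {}
    for tags, values in series:
        results[(alert_name, tagset_key(tags))] = sum(values) / len(values)
    return results


# Hypothetical data: two hosts reporting the same metric.
series = [
    ({"host": "web01"}, [10.0, 20.0, 30.0]),
    ({"host": "web02"}, [90.0, 100.0]),
]

instances = evaluate("cpu.high", series)
```

Here `instances` holds two entries, one for `host=web01` and one for `host=web02`, each of which would be tracked as its own alert.
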
## Severity States

Alerts can be in one of the following severity levels (from highest to lowest):

* **Unknown**: A warn or crit expression cannot be evaluated because data is missing. When you define an alert, bosun tracks each instance (aka group) produced by the expressions used in the alert. If one of these is no longer present, that instance goes into an unknown state. Since bosun has data pushed to it, unknown can mean either that data collection has failed or that the source is down. Unknown triggers when there has been no data for the duration of the query plus the check frequency. This means that if a query spans an hour, it will be one hour plus the check frequency before it triggers.
* **Error**: Bosun hit an internal error while evaluating the alert, such as a divide by zero or "response too large".
* **Critical**: The expression that `crit` is equal to in the alert definition is non-zero (true). It is recommended that "Critical" be thought of as "has failed".
* **Warning**: The expression that `warn` is equal to in the alert definition is non-zero (true) *and* critical is not true. It is recommended that "Warning" be thought of as "could lead to failure".
* **Normal**: No problems.

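The precedence between these levels, and the timing of unknown, can be sketched as below. This is a simplified model rather than bosun's scheduler: the `severity` and `unknown_after` functions are hypothetical, and checking missing data before errors is an assumption based only on the highest-to-lowest ordering above.

```python
from datetime import timedelta


def severity(warn, crit, has_data=True, error=False):
    """Pick the severity for one alert instance.

    warn and crit are the numeric results of the warn and crit
    expressions; non-zero means true.
    """
    if not has_data:
        return "unknown"
    if error:
        return "error"
    if crit != 0:
        return "critical"
    if warn != 0:
        return "warning"
    return "normal"


def unknown_after(query_span, check_frequency):
    """Time without data before an instance becomes unknown."""
    return query_span + check_frequency


# A one-hour query with a hypothetical five-minute check frequency.
delay = unknown_after(timedelta(hours=1), timedelta(minutes=5))
```

With these example values, `delay` is 65 minutes. Note that warning is only returned when the crit result is zero, matching the definition of **Warning** above.
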
## Additional States

* **Active**: The alert is currently in the severity state that triggered it. This is indicated by an exclamation mark on the dashboard: ![Exclamation Glyph](public/exclamation.png).
* **Silenced**: Someone has created a silence rule that stops this alert from triggering any notification. A silenced alert also closes automatically once it is no longer active. This is indicated by a speaker with an X icon on the dashboard: ![Silence Glyph](public/silence.png).
* **Acknowledged**: Someone has acknowledged the alert. The reason and the person should be available via the web interface. Acknowledged alerts stop sending notifications as long as the severity doesn't increase.
* **Unacknowledged**: Nobody has acknowledged the alert yet at its current severity level.

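The way silencing and acknowledgment gate notifications for a triggered alert can be sketched as follows. This is a minimal model under stated assumptions, not bosun's notification logic: `should_notify` and the `RANK` table are hypothetical, and the numeric ranks simply mirror the severity ordering given earlier.

```python
# Hypothetical ranking that mirrors the severity ordering, lowest to highest.
RANK = {"normal": 0, "warning": 1, "critical": 2, "error": 3, "unknown": 4}


def should_notify(current, silenced=False, acked_at=None):
    """Decide whether a triggered alert instance should send a notification.

    current is the current severity; acked_at is the severity at which
    the alert was acknowledged, or None if it is unacknowledged.
    """
    if silenced:
        return False
    if acked_at is None:
        return True
    # Acknowledged: notify again only if the severity has increased.
    return RANK[current] > RANK[acked_at]
```

For example, an alert acknowledged at warning stays quiet while it remains at warning, but `should_notify("critical", acked_at="warning")` returns `True` because the severity has increased.
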
# Dashboard

## Indicators

### Colors

* **Blue**: The alert was/is unknown when triggered
* **Red**: The alert was/is critical or error when triggered
* **Yellow**: The alert was/is warning when triggered

### Icons

* ![Exclamation Glyph](public/exclamation.png) An exclamation mark means the alert is currently triggered (active). Alerts that are no longer active stay on the dashboard until they are closed. This ensures that all alerts get handled, which reduces alert noise and fatigue.
* ![Silence Glyph](public/silence.png) A silence icon means the alert has been silenced. Silenced alerts don't send notifications, and automatically close when no longer active.

## Actions

* **Acknowledge**: Prevent further notifications unless there is a state increase. This also moves the alert to the acknowledged section of the dashboard. When you acknowledge something you enter a name and a reason, which means that the person has committed to fixing the problem or the alert.
* **Close**: Make the alert disappear from the dashboard. This should be used when an alert is handled. Active alerts cannot be closed (since they would simply reappear on the dashboard after the next schedule run).
* **Forget**: Make bosun forget about this instance of the alert. This is used on active unknown alerts. It is useful when something is not coming back (e.g. you have decommissioned a host). This action is non-destructive because if that data gets sent to bosun again everything will come back.
* **History**: View a timeline of history for the selected alert instances.

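The restrictions on these actions can be summarized in a small sketch. It only encodes the rules stated above (active alerts cannot be closed, forget applies to active unknown alerts) and is not taken from bosun's code; `allowed_actions` is a hypothetical helper, and treating acknowledge and history as always available is an assumption.

```python
def allowed_actions(active, severity):
    """Return the set of dashboard actions that apply to an alert instance."""
    # Assumption: acknowledge and history are always offered.
    actions = {"acknowledge", "history"}
    if not active:
        # Closing an active alert would only make it reappear.
        actions.add("close")
    if active and severity == "unknown":
        # Forget targets instances that are not coming back.
        actions.add("forget")
    return actions
```

For instance, `allowed_actions(True, "unknown")` includes `"forget"` but not `"close"`, while `allowed_actions(False, "critical")` includes `"close"`.
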
# Test Config

The test configuration page shows the configuration that bosun started with. You can edit or paste a changed config to make sure there are no syntax errors before committing it and restarting bosun. Currently, you have to handle your own configuration management, but in the future we may attempt to integrate this with git.

{% endraw %}