---
layout: page
title: Documentation
order: 4
---

{% raw %}

* auto-gen TOC:
{:toc}

# Architecture

The main components are:

* **scollector**: An agent that provides data collection
    * A binary that gathers Linux and Windows data locally from the system (no external libraries needed)
    * Has built-in collectors
    * Can poll network devices via SNMP and vSphere
    * Can run external scripts
    * Queues data when bosun can't be reached
    * Sends data to bosun as compressed JSON via a REST API
* **bosun**: Data collection and relaying, alerting, and graphing
    * Has an expression language for creating alerts from time-series data queried from OpenTSDB
    * Exposes the Go template language for users to craft alert notifications
    * Has notification escalation
    * Relays data to OpenTSDB
    * Collects metadata: string information about things like hosts (e.g. IP addresses, serial numbers) and information about metrics (description, gauge vs. counter, and the metric's measurement unit). Currently stored locally on the server as a state file
    * Text configuration that can be version controlled: supports macros, lookup tables, alert configuration, notifications, and notification templates
    * Web Interface:
        * Has an alert dashboard: currently triggered alerts, acknowledgments, etc. Can also view alert history
        * Has a graphing interface
        * Has a page for running expressions
        * Has a page for silencing alerts
        * Has a page for testing alerts over history and previewing notifications
        * Host views for basic host information such as CPU, memory, network throughput, and disk space
        * Page to validate configuration

## Diagram

![Architecture Diagram](public/arch.png)

# Alerts

Each alert definition has the potential to turn into multiple alert instances ("alerts").
Alerts are uniquely identified by the alert name and the OpenTSDB tagset (which we also call the group). Every possible group in your top-level expression is evaluated independently. For example, with an expression like `avg(q("avg:rate{counter,,1}:os.cpu{host=*}", "5m", ""))` you can get an alert for every tag-value of the "host" tag-key that has sent data for the os.cpu metric. In this way bosun integrates fairly tightly with OpenTSDB; however, there are ways to change alert groups in expressions (in particular, by using the t() (transpose) function).

## Severity States

Alerts can be in one of the following severity levels (from highest to lowest):

* **Unknown**: A warn or crit expression cannot be evaluated because data is missing. When you define an alert, bosun tracks each instance (aka group) for each expression used in the alert. If one of these is no longer present, that instance goes into an unknown state. Since bosun has data pushed to it, unknown can mean either that data collection has failed or that the source is down. Unknown triggers once there has been no data for the duration of the query plus the check frequency. For example, if a query spans an hour, it will be one hour plus the check frequency before unknown triggers.
* **Error**: There is some sort of bosun internal error with the alert, such as divide by zero or "response too large".
* **Critical**: The expression that `crit` is equal to in the alert definition is non-zero (true). It is recommended that "Critical" be thought of as "has failed".
* **Warning**: The expression that `warn` is equal to in the alert definition is non-zero (true) *and* critical is not true. It is recommended that "Warning" be thought of as "could lead to failure".
* **Normal**: No problems.

## Additional States

* **Active**: The alert is currently in the severity state that triggered it. This is indicated by an exclamation mark on the dashboard: ![Exclamation Glyph](public/exclamation.png).
* **Silenced**: Someone has created a silence rule that stops this alert from triggering any notification. It will also automatically close when the alert is no longer active. This is indicated by a speaker with an X icon on the dashboard: ![Silence Glyph](public/silence.png).
* **Acknowledged**: Someone has acknowledged the alert; the reason and the person should be available via the web interface. Acknowledged alerts stop sending notifications as long as the severity doesn't increase.
* **Unacknowledged**: Nobody has acknowledged the alert yet at its current severity level.

# Dashboard

## Indicators

### Colors

* **Blue**: The alert was/is unknown when triggered
* **Red**: The alert was/is critical or error when triggered
* **Yellow**: The alert was/is warning when triggered

### Icons

* ![Exclamation Glyph](public/exclamation.png) An exclamation mark means the alert is currently triggered (active). Alerts that are no longer active stay on the dashboard until they are closed. This ensures that all alerts get handled, which reduces alert noise and fatigue.
* ![Silence Glyph](public/silence.png) A silence icon means the alert has been silenced. Silenced alerts don't send notifications, and automatically close when no longer active.

## Actions

* **Acknowledge**: Prevent further notifications unless there is a state increase. This also moves the alert to the acknowledged section of the dashboard. When you acknowledge something you enter a name and a reason. Acknowledging means that the person has committed to fixing the problem or the alert.
* **Close**: Make it disappear from the dashboard. This should be used when an alert is handled. Active alerts cannot be closed, since they would simply reappear on the dashboard after the next schedule run.
* **Forget**: Make bosun forget about this instance of the alert. This is used on active unknown alerts.
It is useful when something is not coming back (e.g. you have decommissioned a host). This action is non-destructive: if that data gets sent to bosun again, everything will come back.
* **History**: View a timeline of history for the selected alert instances.

# Test Config

The test configuration page shows the configuration that bosun started with. You can edit or paste a changed config to make sure there are no syntax errors before committing it and restarting bosun. Currently you have to handle your own configuration management, but in the future we may attempt to integrate this with git.

{% endraw %}
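
As a rough illustration of the rules in the Severity States section, the sketch below resolves one alert instance to a severity and computes how long data must be missing before the instance turns unknown. It is not bosun's implementation: the names `severity` and `unknown_after`, the `has_data` and `error` flags, the order in which missing data and errors are checked, and the 5-minute check frequency are all assumptions made for this example.

```python
from datetime import timedelta


def severity(crit, warn, has_data=True, error=False):
    """Resolve one alert instance (group) to a severity name.

    crit and warn are the numeric results of the alert's expressions;
    non-zero counts as true. has_data and error are hypothetical flags
    standing in for "data is missing" and "evaluation failed".
    """
    if not has_data:
        return "unknown"   # highest severity: the expression has no data
    if error:
        return "error"     # e.g. divide by zero while evaluating
    if crit != 0:
        return "critical"  # crit wins over warn
    if warn != 0:
        return "warning"   # only reached when crit is not true
    return "normal"


def unknown_after(query_span, check_frequency):
    """Time without data before an instance is marked unknown."""
    return query_span + check_frequency


if __name__ == "__main__":
    print(severity(crit=0, warn=1))
    # Assumed 5-minute check frequency, one-hour query.
    print(unknown_after(timedelta(hours=1), timedelta(minutes=5)))
```

With both `crit` and `warn` non-zero the instance resolves to critical, which mirrors the rule that warning applies only when critical is not true.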