bosun.org@v0.0.0-20210513094433-e25bc3e69a1f/docs/documentation.md (about)

     1  ---
     2  layout: page
     3  title: Documentation
     4  order: 4
     5  ---
     6  
     7  {% raw %}
     8  
     9  * auto-gen TOC:
    10  {:toc}
    11  
    12  #Architecture
    13  
    14  The main components are:
    15  
    16  * **scollector**: An agent that provides data collection
    17  	* A binary that gathers Linux and Windows data locally from the system (no external libraries needed)
    18  	* Has built-in collectors
    19  	* Can data poll network devices via SNMP and VSphere
    20  	* Can run external scripts
    21  	* Queues data when Bosun can't be reached
    22  	* Sends data to bosun via compressed JSON to a REST API
    23  * **bosun**: Data collection and relaying, alerting, and graphing
    24  	* Has an expression language for creating alerts from time-series data queried from OpenTSDB
    25  	* Exposes the Go template language for users to craft alert notifications
    26  	* Has notification escalation
    27  	* Relays data to OpenTSDB
    28  	* Collects Metadata (String information about things like hosts (i.e. IP Address, Serial Numbers)) and information about metrics: Description, Gauge vs Counter, and the metrics's measurement unit. Currently stored locally on the server as a state file
    29  	* Text Configuration that can be version controlled: support macros, lookup tables, alert configuration, notifications, and notification templates
    30  	* Web Interface:
    31  		* Has an alert dashboard: Currently Triggered Alerts, Acknowledgments etc. Can also view alert history
    32  		* Has a Graphing interface
    33  		* Has a page for running expressions
    34  		* Has a page for silencing alerts
    35  		* Has a page for testing alerts over history and previewing notifications
    36  		* Host views for basic host information such as CPU, Memory, Network throughput, and Disk Space
    37  		* Page to validate configuration
    38  
    39  ## Diagram
    40  
    41  ![Architecture Diagram](public/arch.png)
    42  
    43  # Alerts
    44  
    45  Each alert definition has the potential to turn into multiple alert instances ("alerts"). Alerts are uniquely identified by the alert name and the OpenTSDB tagset (which we also call the group). Every possible group in your top level expression is evaluated independently. As an example, with an expression like `avg(q("avg:rate{counter,,1}:os.cpu{host=*}", "5m", ""))` you can get an alert for every tag-value of the "host" tag-key that has sent data for the os.cpu metric. In this way bosun integrates fairly tightly with OpenTSDB, however there are ways to change alert groups in expressions (in particular, by using the t() (transpose) function).
    46  
    47  ## Severity States
    48  
    49  Alerts can be in one of the following severity levels (From Highest to Lowest):
    50  
    51  * **Unknown**: When a warn or crit expression can not be evaluated because data is missing. When you define an alert bosun tracks each instance (aka group) for each expression used in the expression. If one of these is no longer present, that instance goes into an unknown state. Since bosun has data pushed to it, unknown can mean that either data collection has failed, or that the source is down. Unknown triggers when there is no data in a query + the check frequency. This means that if a query spans an hour, it will be one hour + the check frequency before it triggers.
    52  * **Error**: There is some sort of bosun internal error such as divide by zero or "response too large" with the alert.
    53  * **Critical**: The expression that `crit` is equal to in the alert definition is non-zero (true). It is recommend that "Critical" be thought of as "has failed".
    54  * **Warning**: The expression that `warn` is equal to in the alert definition is non-zero (true) *and* critical is not true. It is recommended that warning be thought of ha "could lead to failure".
    55  * **Normal**: No problems.
    56  
    57  ## Additional States
    58  
    59  * **Active**: The alert is currently in the severity state that triggered it. This is indicated by an exclamation on the dashboard: ![Exclamation Glyph](public/exclamation.png).
    60  * **Silenced**: Someone has created a silence rule that stops this alert from triggering any notification. It will also automatically close when the alert is no longer active. This is indicated by a speaker with an X icon on the dashboard: ![Silence Glyph](public/silence.png).
    61  * **Acknowledged**: Someone has acknowledged the alert, the reason and person should be available via the web interface. Acknowledged alerts stop sending notifications as long as the severity doesn't increase.
    62  * **Unacknowledged**: Nobody has acknowledged the alert yet at its current severity level.
    63  
    64  # Dashboard
    65  
    66  ## Indicators
    67  
    68  ### Colors
    69  
    70  * **Blue**: The alert was/is unknown when triggered
    71  * **Red**: The alert was/is critical or error when triggered
    72  * **Yellow**: The alert was/is warning when triggered
    73  
    74  ### Icons
    75  
    76  * ![Exclamation Glyph](public/exclamation.png) An Exclamation means the alert is currently triggered (active). Alerts don't disappear from the dashboard when they are no longer active until they are closed. This is to ensure that all alerts get handled - which reduces alert noise and fatigue.
    77  * ![Silence Glyph](public/silence.png) A silence icon means the alert has been silenced. Silenced alerts don't send notifications, and automatically close when no longer active.
    78  
    79  ## Actions
    80  
    81  * **Acknowledge**: Prevent further notifications unless there is a state increase. This also moves it to the acknowledged section of the dashboard. When you acknowledge something you enter a name and a reason. So this means that the person has committed to fixing the problem or the alert.
    82  * **Close**: Make it disappear from the dashboard. This should be used when an alert is handled. Active alerts can not be closed (since all that will happen is that will reappear on the the dashboard after the next schedule run).
    83  * **Forget**: Make bosun forget about this instance of the alert. This is used on active unknown alerts. It is useful when something is not coming back (i.e. you have decommissioned a host). This act is non-destructive because if that data gets sent to bosun again everything will come back.
    84  * **History**: View a timeline of history for the selected alert instances.
    85  
    86  # Test Config
    87  
    88  The test configuration page shows the configuration that bosun started with. You can edit or paste a changed config to make sure there are no syntax errors before committing it and restarting bosun. Currently you have to handle your own configuration management, but in the future we may attempt to integrate this with git.
    89  
    90  {% endraw %}