bosun.org@v0.0.0-20210513094433-e25bc3e69a1f/docs/quickstart.md (about)

     1  ---
     2  layout: default
     3  title: Quick Start
     4  redirect_from: /gettingstarted.html
     5  ---
     6  
     7  <div class="row">
     8  <div class="col-sm-3" >
     9    <div data-spy="affix" data-offset-top="0" data-offset-bottom="0" markdown="1">
    10  
    11   * Some TOC
    12   {:toc}
    13  
    14    </div>
    15  </div>
    16  
    17  <div class="doc-body col-sm-9" markdown="1">
    18  
    19  <p class="h1 title">{{page.title}}</p>
    20  
    21  {% raw %}
    22  
    23  This document is a quick start for getting Bosun working in your environment. By following this tutorial, you should end up with a fully operational Bosun system that not only aggregates metrics collected from selected machines but also alerts you on relevant data about those systems. We will be using OpenTSDB. For some Graphite pointers, see the [graphite](#graphite) section below.
    24  
    25  # Bosun
    26  
    27  This guide is based on using our docker image. At Stack Exchange we do not use Docker in production. For those that do not wish to use docker, we provide binaries for bosun at bosun.org, but you will also need to install OpenTSDB and HBase yourself. (If you install OpenTSDB yourself, we recommend using the [next branch](https://github.com/opentsdb/opentsdb/tree/next), which has support for GZIP connections used by scollector.) For HBase we recommend building a cluster using Cloudera manager.
    28  
    29  ## Docker
    30  
    31  ### Install Docker
    32  
    33  If you do not already have docker installed on your system, you can install it following the instructions outlined in [https://docs.docker.com/get-docker/](https://docs.docker.com/get-docker/).
    34  
    35  **Note:** Don’t forget to ensure the docker daemon is running before moving forward!
    36  
    37  ### Running a Bosun container
    38  
    39  There are two ways to run Bosun in Docker. For a very quick start, you can run the version of Bosun that is published to
    40  Docker Hub. The latest version there is 0.6, which is significantly behind the latest code on GitHub.
    41  
    42  Alternatively, you can clone the repository from [GitHub](https://github.com/bosun-monitor/bosun) for the latest version.
    43  
    44  #### From Docker Hub
    45  
    46  To pull the latest version published to Docker Hub, execute the following command:
    47  
    48  	$ docker run -d -p 4242:4242 -p 8070:8070 stackexchange/bosun
    49  
    50  The above command tells the Docker daemon to start a new daemonized instance of Bosun and to forward ports 4242 and 8070 of your server into the Docker container.
    51  After about 30 seconds, you should have a working Bosun instance on port 8070.
    52  You can navigate to the instance by opening a browser and heading to http://docker-server-ip:8070, where docker-server-ip is the IP address of the server running the Docker daemon.
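
Since the container takes a little while to come up, it can be handy to poll the port instead of refreshing the browser. The Python sketch below is one way to do that. The helper names (`http_ok`, `wait_until_up`) and the retry defaults are illustrative assumptions and not part of Bosun; the sketch only checks that something answers with a successful HTTP status.

```python
import time
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET on url succeeds with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError and HTTPError are subclasses of OSError:
        # connection refused, timeouts and error statuses all count as "not up".
        return False


def wait_until_up(url, attempts=30, delay=1.0, probe=http_ok, sleep=time.sleep):
    """Probe url up to `attempts` times, pausing `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False


# Example (replace docker-server-ip with your server's address):
# wait_until_up("http://docker-server-ip:8070")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a network; in normal use the defaults are enough.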
    53  
    54  #### From the GitHub repository
    55  
    56  Clone the [GitHub repository](https://github.com/bosun-monitor/bosun) into a directory of your choice. From that
    57  directory, run the following two commands: 
    58      
    59      $ cd docker
    60      $ docker-compose up
    61      
    62  This will launch three containers. The first runs OpenTSDB and HBase, and the second runs Redis; these are Bosun's
    63  dependencies. The third runs Bosun, [scollector](#scollector), and TSDBrelay, which are the three main components of
    64  the Bosun repository.
    65  
    66  Your Bosun is available at http://localhost:8070. OpenTSDB is also available at http://localhost:4242.
    67  
    68  ## Getting data into Bosun
    69  
    70  The Bosun docker image self-populates a fair amount of data. See the [scollector](#scollector) section below if you'd like to know more, but you can skip it for now.
    71  
    72  ## Checking for data in Bosun
    73  
    74  Once scollector is running, and assuming no firewall blocks communication between the host and the server on port 8070, Bosun should be receiving statistics from the scollector running on the system. We can check this by going to http://docker-server-ip:8070/items. If you see a list of metrics, congratulations! You're now receiving data. At the bottom of the page (or in a second column if the web browser window is wide enough), you will see the hostname(s) sending data. If you click the hostname, and then click “Available Metrics”, you will see all of the different types of data you can monitor! There are a lot of variables here, but we’ll use a few basic stats in this tutorial.
    75  
    76  ## Creating an Alert
    77  
    78  Collecting metrics about our systems is fun but what makes a monitoring system useful is alerting when anomalies arise. This is the real strength of Bosun.
    79  
    80  Bosun encourages a particular workflow that makes it easy to design, test, and deploy an alert. If you look at the top of the Bosun display, the tabs include Items, Graph, Expression, Rule, and Test config in left-to-right order; that reflects the phases you go through as you create an alert. In general, first you'll select an item (metric) that is the basis of the alert. Next you'll graph it to understand its behavior. You'll then turn that graph into an expression, and the expression will be used to build a rule. You can then test the rule before incorporating it into Bosun.
    81  
    82  Let's do an example to see how this works. In our example, we will set up an alert that notifies us about high CPU usage. The metric we'll focus on is "os.cpu". We will create an alert that triggers if a particular host has high CPU for an hour.
    83  
    84  Go to http://docker-server-ip:8070 to get started.
    85  
    86  ### Items
    87  
    88  Click on the "Items" tab. You'll see a list of all the labels (names) used in metrics currently stored. Click on "os.cpu" and you'll be taken to the Graph tab with that metric pre-loaded.
    89  
    90  ### Graph
    91  
    92  You should see the Graph tab with that metric pre-loaded and a graph displayed for all hosts. We want a single host, so enter your hostname in the host field and click the blue “Query” button. A new graph should show up. This graph shows the last hour of CPU usage. Since you’ve only had your scollector running for a few minutes, you may not have a lot of data yet, but that’s not a problem for our tutorial.
    93  
    94  (ProTip: You can get the same results by clicking on the Items tab, clicking on the host you are interested in, then the "Available Metrics" tab. Clicking on one of the metrics you see there will bring you to the Graph tab with both the metric name and the host name pre-filled.)
    95  
    96  Now that you have a graph, if you scroll to the bottom of the page there is a section called “Queries.” This section shows you the syntax of the query used to generate the graph.
    97  
    98  Also on the bottom of this page are links called "Expression" and "Rule". These take your current workspace and populate the Expression or Rule tabs respectively. The Expression tab lets us fine-tune the rule and is generally what you want to use. The Rule button skips the expression editor and takes you directly to the rule editor.
    99  
   100  For the purpose of this demo, click on the Expression button.
   101  
   102  ### Expression
   103  
   104  The expression page allows us to tweak the data set for our query. The expression bar should currently have a line that begins with `q("sum:rate…`. This is the recipe that tells Bosun you’re looking for the os.cpu metric for the past hour. If you click the “show” button under the result column in the Queries section, you will see all of the data points as they were graphed. Each data point is a timestamp and a value.
   105  
   106  In the course of making an alert, however, we are probably not interested in a huge set of numbers. We might instead want something like the average.
   107  
   108  To get the average of the data points, we will surround our query in avg(). So, the query will go from this:
   109  
   110  	q("sum:rate{counter,,1}:os.cpu{host=your-system-here}", "1h", "")
   111  
   112  To this:
   113  
   114  	avg(q("sum:rate{counter,,1}:os.cpu{host=your-system-here}", "1h", ""))
   115  
   116  If we click the blue “Test” button, we’ll see the result column show a single number, which is the arithmetic average of all of the data points. At this point, we’ve now got a number we can use to alert whether our average cpu usage is too high. Let us click the “Rule” button, which is right-justified on the same line as the Test button.
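
To make the reduction concrete, here is a minimal Python sketch of what wrapping a series in `avg()` amounts to: a list of (timestamp, value) data points collapses to one number. The sample points are made up for illustration, and this is not Bosun's implementation.

```python
from statistics import mean


def avg(points):
    """Arithmetic mean of the values in a list of (timestamp, value) pairs."""
    return mean([value for _, value in points])


# Hypothetical os.cpu data points: (Unix timestamp, value)
sample = [(1620000000, 40.0), (1620000060, 55.0), (1620000120, 70.0)]

print(avg(sample))  # 55.0
```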
   117  
   118  ### Rule
   119  
   120  On the Rule page, we have two boxes, the Alert box and the Template box. The Alert box shows us the basic barebones alert that Bosun has generated for us based on what we’ve done on the previous graph and expression pages. The Template box shows the basic template that produces the outbound e-mail alert that Bosun would send out. Currently, the alert is set to go critical all the time. The reason is that the crit and warn variables are boolean. Because we put our average CPU value in the crit field, crit is nonzero and therefore true. We need to add some more logic to this alert to make it meaningful.
   121  
   122  Change the alert in the box to this:
   123  
   124  	alert cpu.is.too.high {
   125  		template = test
   126  		$metric = q("sum:rate{counter,,1}:os.cpu{host=your-system-here}", "1h", "")
   127  		$avgcpu = avg($metric)
   128  		crit = $avgcpu > 80
   129  		warn = $avgcpu > 60
   130  	}
   131  
   132  This alert would produce a critical alarm if the average CPU is over 80%, and a warning alarm if the average CPU is over 60%. There is still one thing that makes this alert somewhat useless: we’re only targeting one host (your-system-here). If you want to use this alert for all of your hosts, you can change host=your-system-here to host=* and the alert will be calculated for every host! If there are certain hosts you do not wish to be part of the query, you can use the squelch directive in the alert body, but that’s beyond the scope of our quickstart.
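
The threshold logic can be pictured with a short Python sketch. It assumes that crit is checked before warn and that, with host=*, each host is evaluated independently; the function names and sample averages are hypothetical, and the code only mirrors the 80/60 thresholds from the alert above.

```python
def alert_status(avg_cpu, crit_above=80, warn_above=60):
    """Map an average CPU value to a status, checking crit before warn."""
    crit = avg_cpu > crit_above
    warn = avg_cpu > warn_above
    if crit:
        return "critical"
    if warn:
        return "warning"
    return "normal"


def evaluate(per_host_avg):
    """Evaluate every host on its own, as a host=* query would."""
    return {host: alert_status(value) for host, value in per_host_avg.items()}


# Hypothetical per-host averages over the last hour
hosts = {"web01": 91.5, "web02": 72.0, "db01": 12.3}
print(evaluate(hosts))
```

Note that a value of exactly 80 yields a warning here, because the crit comparison is strictly greater than 80.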
   133  
   134  Click the Test button towards the right side of the page, below the Template box. In the Results pane below, you should see a summary of all of your hosts and what status they are in, be it Critical, Warning or Normal. If you click the “Template” pane, you’ll see what is e-mailed out in the alert. The default template isn’t very awesome, so let’s replace it with something nice and meaningful:
   135  
   136  	template test {
   137  		subject = {{.Last.Status}}: {{.Alert.Name}} on {{.Group.host}}
   138  		body = `<p>Alert: {{.Alert.Name}} triggered on {{.Group.host}}
   139  		<hr>
   140  		<p><strong>Computation</strong>
   141  		<table>
   142  			{{range .Computations}}
   143  				<tr><td><a href="{{$.Expr .Text}}">{{.Text}}</a></td><td>{{.Value}}</td></tr>
   144  			{{end}}
   145  		</table>
   146  		<hr>
   147  		{{ .Graph .Alert.Vars.metric }}
   148  		<hr>
   149  		<p><strong>Relevant Tags</strong>
   150  		<table>
   151  			{{range $k, $v := .Group}}
   152  				<tr><td>{{$k}}</td><td>{{$v}}</td></tr>
   153  			{{end}}
   154  		</table>`
   155  	}
   156  
   157  When you hit “test” after putting the above template into the template field, the Template pane at the bottom of the page will show the results of our alert. As you can see in the template output, we can show a graph in the alert to give visual learners a bit more context. The graph is an SVG and should display properly in most e-mail clients when e-mailed.
   158  
   159  ## Persisting your alert
   160  
   161  All of the steps thus far have been geared towards getting your feet wet with Bosun. You now have an alert for high CPU that produces a rather nice-looking notification, but Bosun isn’t going to alert on it yet. In order for the alert to be incorporated into Bosun, it must be added to the config file. We can test the syntax of our alert and config file by going to the “Test Config” pane of Bosun, or by navigating directly to http://docker-server-ip:8070/config. Paste your alert and template, as shown above, at the end of the config file and hit the test button. If Bosun says the config is valid, you are free to copy the config from that window and overwrite the existing bosun.conf file with your new alert and template. To accomplish this, you may wish to use `docker exec` to modify `/data/bosun.conf` and then restart Bosun.
   162  
   163  # scollector
   164  
   165  Bosun relies on metrics provided by other programs. For the majority of metrics we will be using a program called **scollector**. scollector is an agent that runs on hosts and will produce valuable output data about the state of that system. scollector also allows you to write custom collectors which permit you to record data that the basic scollector program does not gather. scollector is already installed and running on the docker image.
   166  
   167  Binaries are available for Linux, Windows, and Mac at [http://bosun.org/scollector/](http://bosun.org/scollector/).
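
As a rough idea of what a custom collector might emit, the Python sketch below formats one data point as a text line. It assumes the tcollector-style format (`metric timestamp value tag=value ...`) that the scollector docs describe for external collectors; check those docs before relying on it. The metric name, tag, and helper name are hypothetical.

```python
import time


def format_datapoint(metric, value, tags, timestamp=None):
    """Build one 'metric timestamp value k=v ...' line (assumed format)."""
    if timestamp is None:
        timestamp = int(time.time())
    # Sort tags so the output is deterministic.
    tag_part = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"{metric} {timestamp} {value} {tag_part}".rstrip()


print(format_datapoint("custom.queue.length", 42, {"queue": "mail"}, timestamp=1620000000))
```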
   168  
   169  ## Configuring scollector
   170  
   171  By default, scollector will send data to `http://bosun:80`. scollector can be configured to send to a different server by specifying a host with the **-h** flag:
   172  
   173  	$ scollector -h docker-server-ip:8070
   174  
   175  You may instead create a `scollector.conf` file alongside the scollector binary with the following contents:
   176  
   177  	host=docker-server-ip:8070
   178  
   179  See the [scollector docs](http://godoc.org/bosun.org/cmd/scollector) for more information.
   180  
   181  # graphite
   182  
   183  In addition to OpenTSDB, Bosun also supports querying Graphite and Logstash-Elasticsearch.
   184  You can execute, view and graph expressions, develop and run Graphite/LS alerting rules, get notifications and use the dashboard.
   185  OpenTSDB-specific features, such as data proxying and the built-in general-purpose graphing interface, don't apply here.
   186  The alerting rules look the same; in fact, the only difference is that you will query data using [graphite specific functions](http://bosun.org/expressions#graphite-query-functions) such as graphite and graphiteBand.
   187  
   188  Start Graphite in docker:
   189  
   190      $ docker run -d \
   191        --name graphite \
   192        -p 80:80 \
   193        -p 2003:2003 \
   194        -p 8125:8125/udp \
   195        hopsoft/graphite-statsd
   196  
   197  [Collectd](http://collectd.org/) is commonly used to submit metrics into Graphite (scollector does not support Graphite).
   198  You can easily launch it like so:
   199  
   200      $ docker run -e HOST_NAME=localhost -e GRAPHITE_HOST=<your host eth0 ip> andreasjansson/collectd-write-graphite
   201  
   202  Verify that http://localhost loads the Graphite interface, go into Graphite > localhost > cpu, then go into the hierarchy and toggle on some of the metrics. It might take a minute or two before data starts showing up.
   203  
   204  In your config, set
   205  
   206      graphiteHost = http://localhost
   207  
   208  Now you can run alerting rules like so:
   209  
   210      alert os.high.cpu {
   211          template = generic
   212          $d = graphite("*.cpu.*.cpu.idle", "5m", "", "host..core..type")
   213          $q = avg($d)
   214          # purposely very harsh thresholds so we definitely get some alerts
   215          warn = $q < 100
   216          crit = $q <= 97
   217      }
   218  
   219  The 4th argument of the graphite function is the format used to parse the series names that Graphite returns. In this case the first field is the host, the third the core, and the last the CPU usage type, so these fields will be turned into tags within Bosun.
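
The mapping from a series name to tags can be sketched in a few lines of Python. This illustrates the idea only and is not Bosun's parser: the series name `web01.cpu.0.cpu.idle` is a made-up example, and raising an error when the name has fewer fields than the format is an assumption of this sketch.

```python
def parse_tags(series_name, fmt):
    """Turn dot-separated fields of series_name into tags named by fmt.

    Empty positions in fmt (as in "host..core..type") are skipped.
    """
    nodes = series_name.split(".")
    keys = fmt.split(".")
    if len(nodes) < len(keys):
        raise ValueError(f"{series_name!r} has fewer fields than format {fmt!r}")
    return {key: node for key, node in zip(keys, nodes) if key}


print(parse_tags("web01.cpu.0.cpu.idle", "host..core..type"))
# {'host': 'web01', 'core': '0', 'type': 'idle'}
```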
   220  
   221  
   222  {% endraw %}
   223    </div>
   224  </div>