In this post we will describe how we built a Monitoring System for FreeSWITCH & Newfies-Dialer using Grafana, InfluxDB and Telegraf. We will collect and report standard metrics such as CPU, RAM, Disk space and other data more specific to FreeSWITCH like concurrent channels & CPS (Calls Per Second).
- Newfies-Dialer – A voice broadcasting platform
- FreeSWITCH – An Open Source communications and telephony platform
- Grafana – A graph and dashboard builder for time series metrics
- InfluxDB – A time series database designed to store large amounts of timestamped data
- Telegraf – An agent for collecting metrics
Specification
Our brief was to design a centralised monitoring system for FreeSWITCH and Newfies-Dialer. Our goals were:
- Multi-Tenant: Give each of our customers access to their own dashboard.
- FreeSWITCH Metrics: We want to monitor live channels & CPS from several Newfies-Dialer Systems belonging to different customers.
- System & Network Metrics: Support metrics such as:
  - CPU / Disk Space / RAM
  - Network statistics
  - VoIP Quality statistics
  - PostgreSQL data (slow queries)
  - RabbitMQ monitoring
  - Nginx monitoring
- Anomaly Detection: Push alerts by email or via PagerDuty [https://www.pagerduty.com/]
- FOSS: It needs to be Free and Open Source Software.
Solutions
There are plenty of existing solutions, both open source, such as Munin and Nagios, and proprietary, such as Datadog, and many more.
We wanted a multi-tenant solution, and Grafana's support for organisations and users is quite advanced. It also supports several types of data sources, and we wanted to use InfluxDB so that we could query the data from other applications.
So let’s see how to get started and install Grafana with InfluxDB.
Install Grafana 2.6
There is extended documentation on how to install Grafana at http://docs.grafana.org/installation/
If you want to install Grafana on Debian or Ubuntu do the following:
$ wget https://grafanarel.s3.amazonaws.com/builds/grafana_2.6.0_amd64.deb
$ sudo apt-get install -y adduser libfontconfig
$ sudo dpkg -i grafana_2.6.0_amd64.deb
Start Grafana by running:
$ sudo service grafana-server start
Grafana will automatically start after reboot. The environment variables are located in `/etc/default/grafana-server` and the configuration file in `/etc/grafana/grafana.ini`.
Install InfluxDB 0.10
Installation documentation is available at https://docs.influxdata.com/influxdb/ but let’s summarise the steps for a Debian installation:
$ curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -
$ source /etc/os-release
$ test "$VERSION_ID" = "7" && echo "deb https://repos.influxdata.com/debian wheezy stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
$ test "$VERSION_ID" = "8" && echo "deb https://repos.influxdata.com/debian jessie stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
Install and start InfluxDB:
$ sudo apt-get install -y apt-transport-https
$ sudo apt-get update && sudo apt-get install -y influxdb
$ sudo service influxdb start
After installing InfluxDB we recommend that you get comfortable with the CLI:
$ influx
> CREATE DATABASE test
> SHOW DATABASES
> USE test
> INSERT cpu,host=serverA,region=us_west value=0.64
We also recommend you go through the Getting Started documentation: https://docs.influxdata.com/influxdb/v0.10/introduction/getting_started/
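The `INSERT` statement above uses InfluxDB's line protocol: a measurement name, optional comma-separated tags, a space, then one or more fields. As a rough illustration of how an application might build such a line before writing it, here is a minimal Python sketch. The helper names (`escape`, `format_field`, `to_line`) are our own for this example and are not part of any InfluxDB client library. The sketch also leaves out the optional timestamp, so the server would assign one.

```python
def escape(text, special):
    """Backslash-escape every character of `special` that appears in `text`."""
    for char in special:
        text = text.replace(char, "\\" + char)
    return text


def format_field(value):
    """Render a field value as a boolean, integer, float or quoted string."""
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry a trailing "i"
    if isinstance(value, float):
        return repr(value)
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'


def to_line(measurement, tags, fields):
    """Build one line-protocol point (hypothetical helper, no timestamp)."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = escape(measurement, ", ")
    for key, value in sorted(tags.items()):
        head += "," + escape(key, ",= ") + "=" + escape(value, ",= ")
    body = ",".join(
        escape(key, ",= ") + "=" + format_field(value)
        for key, value in sorted(fields.items())
    )
    return head + " " + body


# Same point as the INSERT statement above.
print(to_line("cpu", {"host": "serverA", "region": "us_west"}, {"value": 0.64}))
# prints: cpu,host=serverA,region=us_west value=0.64
```

Tags are sorted by key here only to make the output predictable; the order does not change the meaning of the point.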
Let’s now set a password for the admin user [https://docs.influxdata.com/influxdb/v0.10/administration/authentication_and_authorization/#set-up-authentication]
Authentication is disabled by default, so we need to enable it in the configuration file:
1. Edit /etc/influxdb/influxdb.conf
2. Set `auth-enabled` to `true` in the `[http]` section:
[http]
enabled = true
bind-address = ":8086"
auth-enabled = true
...
Create a new database:
$ influx
> CREATE DATABASE mydatabase
By default there is no existing admin user, so let’s create a new admin user with password:
> CREATE USER admin WITH PASSWORD 'myinfluxdb' WITH ALL PRIVILEGES
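Restart InfluxDB (`sudo service influxdb restart`) so that the configuration change takes effect, then test the newly created admin user: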
$ influx -username 'admin' -password 'myinfluxdb' -database 'mydatabase'
Having metrics in InfluxDB has some great advantages for developers. For instance, it’s trivial to display metrics in a web UI other than Grafana, or to reuse the statistics inside other applications.
With a single `curl` command you get the CPU load for the last hour, aggregated in blocks of 10 minutes:
$ curl -u admin:myinfluxdb -G 'http://localhost:8086/query?pretty=true' --data-urlencode "db=mydatabase" --data-urlencode "q=SELECT MEAN(load1) FROM system WHERE time > now() - 1h GROUP BY time(10m) FILL(null)"
Results:
{
  "results": [
    {
      "series": [
        {
          "name": "system",
          "columns": ["time", "mean"],
          "values": [
            ["2016-02-12T11:00:00Z", 0.041851851851851855],
            ["2016-02-12T11:10:00Z", 0.15583333333333324],
            ["2016-02-12T11:20:00Z", 0.2555],
            ["2016-02-12T11:30:00Z", 0.07016666666666661],
            ["2016-02-12T11:40:00Z", 0.040333333333333325],
            ["2016-02-12T11:50:00Z", 0.016833333333333336],
            ["2016-02-12T12:00:00Z", 0.08939393939393941]
          ]
        }
      ]
    }
  ]
}
The results come back already aggregated, so they can easily be plotted on a web dashboard.
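To show how another application might reuse these results, here is a small Python sketch that builds the same kind of query URL and turns a response into `(time, value)` pairs ready for plotting. The function names `build_query_url` and `extract_points` are our own, and the sample response is a shortened copy of the output above. No request is actually sent, so you would still need an HTTP client and the credentials for your own server.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, database, query):
    """Return the /query URL for a database and an InfluxQL statement."""
    return base_url + "/query?" + urlencode({"db": database, "q": query})


def extract_points(response_text, value_column="mean"):
    """Flatten every series of a query response into (time, value) pairs."""
    payload = json.loads(response_text)
    points = []
    for result in payload.get("results", []):
        for series in result.get("series", []):
            time_index = series["columns"].index("time")
            value_index = series["columns"].index(value_column)
            for row in series["values"]:
                points.append((row[time_index], row[value_index]))
    return points


# Shortened copy of the response shown above.
SAMPLE_RESPONSE = """
{"results": [{"series": [{"name": "system",
  "columns": ["time", "mean"],
  "values": [["2016-02-12T11:00:00Z", 0.041851851851851855],
             ["2016-02-12T11:10:00Z", 0.15583333333333324],
             ["2016-02-12T11:20:00Z", 0.2555]]}]}]}
"""

url = build_query_url(
    "http://localhost:8086",
    "mydatabase",
    "SELECT MEAN(load1) FROM system WHERE time > now() - 1h GROUP BY time(10m)",
)
print(url)
for timestamp, value in extract_points(SAMPLE_RESPONSE):
    print(timestamp, value)
```

Intervals without data come back as `null` when the query uses `FILL(null)`, which Python reads as `None`, so a plotting layer should be ready to skip those points.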
Collector / Agent
In the previous section we demonstrated how to install InfluxDB. If this went well, you now have a Time Series Database ready to capture your metrics. The next step is to actually collect metrics. For this we will look at different agents that will harvest and report metrics to InfluxDB.
Here are the options we considered:
- Collectd (https://collectd.org/) is a daemon which collects metrics and stores them in a variety of ways.
- StatsD (https://github.com/etsy/statsd) is a simple daemon to aggregate metrics. The StatsD daemon generates aggregate metrics and relays them to your monitoring backend.
- Heka (http://hekad.readthedocs.org/) is a tool for collecting and collating data from a number of different sources. It can also process and collect metrics from logs, plugins can be written in Lua, and the codebase is Go. We really loved what we saw: it seems to be a very rich solution, but we found it a bit over-engineered for our current needs. We would still recommend you have a look at the documentation, as it might be what you are looking for.
- Riemann (http://riemann.io/) is a monitoring tool that aggregates events from your servers and applications with a powerful stream processing language. It comes with a dashboard of its own (http://riemann.io/dashboard.html), and you can create alerts when events happen (http://riemann.io/howto.html#alerting-when-a-certain-percentage-of-events-happen). Riemann is written in Clojure and runs on top of a JVM.
- Telegraf (https://influxdata.com/time-series-platform/telegraf/) has a very small footprint, is written in Go, and has a very simple plugin mechanism that now supports input & output plugins. It was very easy to get started with and is well documented. There are already plugins for PostgreSQL, CPU, and standard server metrics, and it took us very little time to come up with a plugin for FreeSWITCH: https://github.com/areski/freeswitch-telegraf-plugin
Heka and Riemann both sound very interesting and we recommend you watch a presentation comparing the two systems at: http://www.slideshare.net/nickchappell/pdx-devops-stream-processing-heka-and-riemann
As we wanted something light, we discarded Riemann, which needs a JVM, and decided to go with Telegraf for its simplicity and the ease of creating plugins.
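To give a feel for the kind of FreeSWITCH figure an agent can report, here is one possible way to derive CPS (Calls Per Second) from two readings of a cumulative call counter. This is only a sketch under our own assumptions: the function name `calls_per_second` and the idea of a "total sessions since startup" counter are hypothetical for this example, and this is not a description of how the FreeSWITCH plugin linked above works internally.

```python
def calls_per_second(previous_total, current_total, elapsed_seconds):
    """Estimate CPS from two readings of a cumulative call counter.

    `previous_total` and `current_total` are hypothetical "calls since
    startup" readings taken `elapsed_seconds` apart.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = current_total - previous_total
    if delta < 0:
        # The counter went backwards, which we assume means the service
        # restarted in between: count only the calls seen since then.
        delta = current_total
    return delta / elapsed_seconds


# 150 new calls over a 10 second collection interval.
print(calls_per_second(1000, 1150, 10))  # prints: 15.0
```

Working from a cumulative counter rather than an instantaneous rate means that a missed collection interval only lowers the resolution; the calls made during it are still counted in the next reading.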
Install & Configure Telegraf
Telegraf is an application written in Go (https://golang.org/). It’s a collecting agent which is installed on your server and reports metrics to InfluxDB.
To install Telegraf, visit the download page of InfluxData [https://influxdata.com/downloads/], where you will find the instructions for your OS. Here is how to install it on Debian:
$ wget http://get.influxdb.org/telegraf/telegraf_0.10.2-1_amd64.deb
$ sudo dpkg -i telegraf_0.10.2-1_amd64.deb
$ sudo service telegraf start
You should have Telegraf running on your Debian server.
Telegraf comes by default with plugins for cpu, system, memory and much more. You can find out more about those plugins at: https://docs.influxdata.com/telegraf/v0.10/inputs/
Let’s look at one of the other plugins, the one for PostgreSQL for instance. Type the following on the command line:
$ telegraf -usage postgresql
Output:
# Read metrics from one or many postgresql servers
[[inputs.postgresql]]
# specify address via a url matching:
# postgres://[pqgotest[:password]]@localhost[/dbname]?sslmode=[disable|verify-ca|verify-full]
# or a simple string:
# host=localhost user=pqotest password=... sslmode=... dbname=app_production
#
# All connection parameters are optional.
#
# Without the dbname parameter, the driver will default to a database
# with the same name as the user. This dbname is just for instantiating a
# connection with the server and doesn't restrict the databases we are trying
# to grab metrics for.
#
address = "host=localhost user=postgres sslmode=disable"
# A list of databases to pull metrics about. If not specified, metrics for all
# databases are gathered.
# databases = ["app_production", "testing"]
Copy this to the end of your `telegraf.conf` and, after a restart, Telegraf will report the supported metrics for your PostgreSQL databases.
You might also want to add Redis and/or RabbitMQ, as we find them very useful in our own deployments:
$ telegraf -usage redis
$ telegraf -usage rabbitmq
Alert / Notification
One last thing we really wanted was the ability to trigger notifications when certain events happen, for instance if the CPU reaches a certain threshold or if there is no activity on some metrics.
Grafana will probably provide this missing piece very soon. There is a GitHub issue discussing it: https://github.com/grafana/grafana/issues/2209 and you might want to look at this presentation from Dieter at http://www.slideshare.net/Dieterbe/alerting-in-grafana-grafanacon-2015
UPDATE: InfluxData have a new product called Kapacitor https://influxdata.com/time-series-platform/kapacitor/
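Until one of these tools covers your needs, the two conditions described in this section (a metric reaching a threshold, and no activity on a metric) can be checked with very little code. The sketch below is our own illustration, not a Grafana or Kapacitor feature: `find_alerts` is a hypothetical helper that takes `(timestamp, value)` pairs in the shape returned by an InfluxDB query such as the one earlier in this post, where `FILL(null)` leaves empty intervals as `None`.

```python
def find_alerts(points, threshold):
    """Return alert tuples for a list of (timestamp, value) pairs.

    A value of None means the interval held no data. Two conditions are
    checked: a value at or above `threshold`, and no data in any interval.
    """
    alerts = []
    seen_value = False
    for timestamp, value in points:
        if value is None:
            continue
        seen_value = True
        if value >= threshold:
            alerts.append(("threshold", timestamp, value))
    if not seen_value:
        alerts.append(("no_activity", None, None))
    return alerts


load = [
    ("2016-02-12T11:00:00Z", 0.04),
    ("2016-02-12T11:10:00Z", 0.26),
    ("2016-02-12T11:20:00Z", None),
]
print(find_alerts(load, threshold=0.25))
# prints: [('threshold', '2016-02-12T11:10:00Z', 0.26)]
```

A script like this could run from cron and hand its results to whatever sends your email or PagerDuty notifications; that delivery step is deliberately left out here.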
Summary
Using Grafana, InfluxDB & Telegraf we have a full FOSS monitoring solution that meets our requirements. Telegraf is a lightweight application, so we have a small footprint on our servers. InfluxDB is a great solution for collecting metrics, and it allows our developers to reuse those metrics within our own apps.
Our next step will be to extend the FreeSWITCH plugin. We want to capture the events broadcast by FreeSWITCH with more granularity: for instance, we would like to plot the number of hangup events, the audio files played, the calls transferred, and even voice quality metrics that we can capture on the outgoing calls.
We hope you find this post useful. Please share your own experiences with monitoring systems, we would love to hear what you use.