Centralized Log Management and Monitoring for CoreOS Clusters

If you’ve got an interest in things like CoreOS, logs, and monitoring, then you should check out our previous CoreOS-related posts on Monitoring CoreOS Clusters and how to get CoreOS logs into ELK in 5 minutes.  And they are only the start of SPM integrations with CoreOS!  Case in point: we have recently optimized the SPM setup on CoreOS and integrated a logging gateway to Logsene into the SPM Agent for Docker.  And that’s not all…

In this post we want to share the current state of CoreOS Monitoring and Log Management from Sematext so you know what’s coming — and you know about things that might be helpful for your organization, such as:

  1. Feature Overview
  2. Fleet Units for SPM
  3. How to Set Up Monitoring and Logging Services

1. Feature Overview

  • Quick setup
    • add monitoring and logging for the whole cluster in 5 minutes
  • Collection of Performance Metrics for the CoreOS Cluster
    • Metrics for all CoreOS cluster nodes (hosts)
      • CPU, Memory, Disk usage
    • Detailed metrics for all containers on each host
      • CPU, Memory, Limits, Failures, Network and Disk I/O, …
    • Anomaly detection and alerts for all metrics
    • Anomaly detection and alerts for all logs
  • Correlated Container Events, Metrics and Logs
    • Docker Events like start/stop/destroy are related to deployments, maintenance or sometimes to errors and unwanted restarts;  correlation of metrics, events and logs is the natural way to discover problems using SPM.

Docker Events

  • Centralized configuration via etcd
    • There is often a mix of configurations in environment variables, static settings in cloud configuration files, and combinations of confd and etcd. We decided to have all settings stored in etcd, so the settings are done only once and are easy to access.
  • Automatic Log Collection
    • Logging gateway Integrated into SPM Agent
      • SPM Agent for Docker includes a logging gateway service to receive log messages via TCP.  Service discovery is solved via etcd (where the exposed TCP port is stored). All received messages are parsed, and the following formats are supported:
        • journalctl -o short | short-iso | json
        • integrated message parser (e.g. for dockerd time, level and message text)
        • line delimited JSON
        • plain text messages
        • In cases where the parsing fails, the gateway adds a timestamp and keeps the message 1:1.
      • The logging gateway can be configured with the Logsene App Token – this makes it compatible with most Unix tools, e.g. journalctl -o json -n 10 | netcat localhost 9000
      • SPM for Docker collects all logs from containers directly from the Docker API. The logging gateway is typically used for system logs – or anything else configured in journald (see “Log forwarding service” below)
      • The transmission to Logsene receivers is encrypted via HTTPS.
    • Log forwarding service
      • The log forwarding service streams logs to the logging gateway by pulling them from journald. In addition, it saves the ‘last log time’ so it can recover after a service restart. Most people take this for granted, but not all logging services have such a recovery function.  Many tools just capture the current log stream, and people often realize this only when they miss logs one day because of a reboot, network outage, software update, etc.  But these are exactly the situations where you would like to know what is going on!
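The bookkeeping behind that recovery is easy to sketch in plain shell. This is a toy illustration only – the file paths and the line-count “cursor” are made up, and the real service keys off the last log time in journald:

```shell
# Toy sketch of recovery after a restart: remember how far we got,
# then resume from there instead of re-reading the whole stream.
LOG=/tmp/demo.log
CURSOR=/tmp/demo.cursor

printf 'line1\nline2\nline3\n' > "$LOG"
echo 2 > "$CURSOR"                # pretend 2 lines were shipped before a restart

SENT=$(cat "$CURSOR" 2>/dev/null || echo 0)
tail -n +"$((SENT + 1))" "$LOG"   # ships only line3 after the "restart"
wc -l < "$LOG" | tr -d ' ' > "$CURSOR"   # advance the cursor
```

The actual service persists the last log time and asks journald only for newer entries (e.g. journalctl --since …), but the save-and-resume pattern is the same.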
SPM integrations into CoreOS

2. Fleet Units for SPM

SPM agent services are installed via fleet (a distributed init system) in the whole cluster. Let’s see those unit files before we fire them up into the cloud.

The first unit file, spm-agent.service, starts SPM Agent for Docker. It reads the SPM and Logsene app tokens and the logging gateway port from etcd. It starts on every CoreOS host (it is a global unit).


Fleet Unit File – SPM Agent incl. Log Gateway: spm-agent.service
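Roughly, such a global unit might look like the sketch below. This is a simplified illustration – the docker run flags and names are assumptions; the authoritative spm-agent.service file lives in the sematext/spm-agent-docker repository and is downloaded in the setup steps in section 3:

```
[Unit]
Description=SPM Agent for Docker (incl. Logsene gateway)
After=docker.service
Requires=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker rm -f spm-agent
ExecStartPre=/usr/bin/docker pull sematext/spm-agent-docker
# read tokens and the gateway port from etcd – the same keys set in section 3
ExecStart=/bin/sh -c '/usr/bin/docker run --name spm-agent \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e SPM_TOKEN=$(etcdctl get /sematext.com/myapp/spm/token) \
  -e LOGSENE_TOKEN=$(etcdctl get /sematext.com/myapp/logsene/token) \
  -p $(etcdctl get /sematext.com/myapp/logsene/gateway_port):9000 \
  sematext/spm-agent-docker'
ExecStop=/usr/bin/docker stop spm-agent

[X-Fleet]
Global=true
```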

The second unit file, logsene.service, forwards logs from journald to the logging gateway running as part of spm-agent-docker. All fields stored in the journal (down to the source-code level and line numbers provided by Go modules) are then available in Logsene.


Fleet Unit File – Log forwarder: logsene.service

3. Set Up Monitoring and Logging Services


  1. Get a free account at apps.sematext.com
  2. Create an SPM App of type “Docker” and a Logsene App, and copy both application tokens
  3. Store the configuration in etcd
# set your application tokens for SPM and Logsene
export SPM_TOKEN=<your-spm-app-token>
export LOGSENE_TOKEN=<your-logsene-app-token>
# set the port for the Logsene Gateway
export LG_PORT=9000
# Store the tokens in etcd
# please note the same keys are used in the unit files!
etcdctl set /sematext.com/myapp/spm/token $SPM_TOKEN
etcdctl set /sematext.com/myapp/logsene/token $LOGSENE_TOKEN
etcdctl set /sematext.com/myapp/logsene/gateway_port $LG_PORT

Download the fleet unit files and start the services via fleetctl:

# Download the unit file for SPM
wget https://raw.githubusercontent.com/sematext/spm-agent-docker/master/coreos/spm-agent.service
# Start SPM Agent in the whole cluster
fleetctl load spm-agent.service; fleetctl start spm-agent.service
# Download the unit file for Logsene
wget https://raw.githubusercontent.com/sematext/spm-agent-docker/master/coreos/logsene.service
# Start the log forwarding service
fleetctl load logsene.service; fleetctl start logsene.service

Check the installation

systemctl status spm-agent.service
systemctl status logsene.service

Send a few log lines to see them in Logsene.

journalctl -o json -n 10 | ncat localhost 9000

After about a minute you should see Metrics in SPM and Logs in Logsene.


Cluster Health in ‘Birds Eye View’


Host and Container Metrics Overview for the whole cluster


Logs and Metrics

Open-Source Resources

Some of the things described here are open-sourced:

  • SPM Agent for Docker – github.com/sematext/spm-agent-docker – including the logging gateway and the CoreOS fleet unit files used above

Summary – What this gets you

Here’s what this setup provides for you:

  • Operating System metrics of each CoreOS cluster node
  • Container and Host Metrics on each node
  • All Logs from Docker containers and Hosts (via journald)
  • Docker Events from all nodes
  • CoreOS logs from all nodes

Having this setup allows you to take full advantage of SPM and Logsene by defining intelligent alerts for metrics and logs (delivered via channels like e-mail, PagerDuty, Slack, HipChat or any WebHook), as well as by correlating performance metrics, events, logs, and alerts.

Running CoreOS? Need any help getting CoreOS metrics and/or logs into SPM & Logsene?  Let us know!  Oh, and if you’re a small startup — ping @sematext — you can get a good discount on both SPM and Logsene!

Tomcat Monitoring SPM Integration

This old cat, Apache Tomcat, has been around for ages, but it’s still very much alive!  It’s at version 8, with version 7.x still being maintained, while new development is happening on version 9.0.  We’ve just added support for Tomcat monitoring to the growing list of SPM integrations the other day, so if you run Tomcat and want to take a peek at all its juicy metrics, give SPM for Tomcat a go!  Note that SPM supports both Tomcat 7.x and 8.x.

Before you jump to the screenshot below, read this: you may reeeeeally want to enable Transaction Tracing in the SPM agent running on your Tomcat boxes.  Why?  Because that will help you find bottlenecks in your web application by tracing transactions (think HTTP requests + method calls + network calls + DB calls + ….).  It will also build a whole map of all your applications talking to each other, with information about latency, request rate, and error and exception rates between all components!  Check this out (and just click it to enlarge):


Everyone loves maps.  It’s human. How about charts? Some of us have a thing for charts. Here are some charts with various Tomcat metrics, courtesy of SPM:

Overview  (click to enlarge)


Session Counters  (click to enlarge)


Cache Usage  (click to enlarge)


Threads (Threads/Connections)  (click to enlarge)


Requests  (click to enlarge)


Hope you like this new addition to SPM.  Got ideas how we could make it more useful for you?  Let us know via comments, email, or @sematext.

Not using SPM yet? Check out the free 30-day SPM trial by registering here (ping us if you’re a startup, a non-profit, or education institution – we’ve got special pricing for you!).  There’s no commitment and no credit card required.  SPM monitors a ton of applications, like Elasticsearch, Solr, Hadoop, Spark, Node.js & io.js (open-source), Docker (get open-source Docker image), CoreOS, and more.

Monitoring CoreOS Clusters

UPDATE: Related to monitoring CoreOS clusters, we have recently optimized the SPM setup on CoreOS and integrated a logging gateway to Logsene into the SPM Agent for Docker.  You can read about it in Centralized Log Management and Monitoring for CoreOS Clusters.


In this post you’ll learn how to get operational insights (i.e. performance metrics, container events, etc.) from CoreOS and make that super simple with etcd, fleet, and SPM.

We’ll use:

  • SPM for Docker to run the monitoring agent as a Docker container and collect all Docker metrics and events for all other containers on the same host + metrics for hosts
  • fleet to seamlessly distribute this container to all hosts in the CoreOS cluster by simply providing it with a fleet unit file shown below
  • etcd to set a property to hold the SPM App token for the whole cluster

The Big Picture

Before we get started, let’s take a step back and look at our end goal.  What do we want?  We want charts with Performance Metrics, we want Event Collection, we’d love integrated Anomaly Detection and Alerting, and we want that not only for containers, but also for hosts running containers.  CoreOS has no package manager and deploys services in containers, so we want to run the SPM agent in a Docker container, as shown in the following figure:


By the end of this post each of your Docker hosts could look like the above figure, with one or more of your own containers running your own apps, and a single SPM Docker Agent container that monitors all your containers and the underlying hosts.


Docker Events and Docker Metrics Monitoring

Docker deployments can be very dynamic, with containers being started and stopped, moved around YARN- or Mesos-managed clusters, having very short life spans (the so-called cattle) or long uptimes (aka pets).  Getting insight into the current and historical state of such clusters goes beyond collecting container performance metrics and sending alert notifications.  If a container dies or gets paused, for example, you may want to know about it, right?  Or maybe you’d want to be able to see that a container went belly up in retrospect when troubleshooting, wouldn’t you?

Just two weeks ago we added Docker Monitoring (docker image is right here for your pulling pleasure) to SPM.  We didn’t stop there — we’ve now expanded SPM’s Docker support by adding Docker Event collection, charting, and correlation.  Every time a container is created or destroyed, started, stopped, or when it dies, spm-agent-docker captures the appropriate event so you can later see what happened where and when, correlate it with metrics, alerts, anomalies — all of which are captured in SPM — or with any other information you have at your disposal.  The functionality and the value this brings should be pretty obvious from the annotated screenshot below.
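For a quick look at the raw stream the agent consumes, recent Docker versions let you tail these same events from the CLI (just a sanity check – SPM doesn’t require this):

```shell
# tail container lifecycle events from the local Docker daemon;
# --filter narrows the stream to the event types you care about
docker events --filter 'event=start' --filter 'event=die' --filter 'event=stop'
```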

Like this post?  Please tweet about Docker Events and Docker Metrics Monitoring

Know somebody who’d find this post useful?  Please let them know…


Here’s the list of Docker events the SPM Docker monitoring agent currently captures:

  • Version Information on Startup:
    • server-info – created by spm-agent framework with node.js and OS version info on startup
    • docker-info – Docker Version, API Version, Kernel Version on startup
  • Docker Status Events:
    • Container Lifecycle Events like
      • create, exec_create, destroy, export
    • Container Runtime Events like
      • die, exec_start, kill, oom, pause, restart, start, stop, unpause

Every time a Docker container emits one of these events, spm-agent-docker will capture it in real-time, ship it over to SPM, and you’ll be able to see it as shown in the above screenshot.

Oh, and if you’re running CoreOS, you may also want to see how to index CoreOS logs into ELK/Logsene. Why? Because then you can have not only metrics and container events in one place, but also all container and application logs, too!

If you’re using Docker, we hope you find this useful!  Anything else you’d like us to add to SPM (for Docker or any other integration)?  Leave a comment, ping @sematext, or send us email – tell us what you’d like to get for early Christmas!

Real-time Server Insights via Birds Eye View

Everyone’s infrastructure is growing – whether you run bare-metal servers, use IaaS, or use containers. This just-added SPM functionality, a new view in SPM that we call BEV (aka Birds Eye View), helps you get better visibility into all of your servers that require attention – especially the hot ones!

Up until now SPM provided you with very detailed insight into all kinds of metrics for whichever SPM App you were looking at.  SPM, of course, lets you monitor a bunch of things!  Thus you, like lots of other SPM users, might be monitoring several (types of) applications (e.g. real-time data processing pipelines). This means you also need to be able to see how servers running those apps are doing health-wise.  Do any of them have maxed out CPUs?  Any of them close to running out of disk?  Any of them swapping like crazy?  Wouldn’t it be nice to see various metrics for lots or even all your servers at a glance?  BEV to the rescue!

With BEV you can get an instant, real-time, and consolidated look at your key server and application-specific metrics, including: CPU utilization, Disk used %, Memory used %, Load, and Swap.  From these metrics SPM computes the general health of the server, which it uses to surface the most problematic servers and, by using red, orange, and green coloring, bring the most critical servers to your attention.

Cross-app Server Visibility

BEV is especially valuable because it gives you the overall view of all your servers, across all your SPM Apps – yet with the ability to filter by app and hostname patterns.  BEV is like top + df for all your servers and clusters.  In fact, BEV was designed to give users at-a-glance capabilities in a few different ways:

Sparklines:  Whereas the typical application performance monitoring (APM) chart is designed to show as much data as possible, sparklines are intended to be succinct and give users an instant idea of whether a specific application is encountering a problem.

Colored Metric Numbers: Getting an instant sense of server health is as easy as driving up to a traffic light.  Green — sweet, looks good.  Orange — hmmm, should watch that.  Red — whoa, better check that out asap!


While BEV already surfaces the hottest servers, you can also set min/max ranges for any of the metrics and thus easily hide servers that you know are healthy and that you don’t want to even see in BEV.  Just use the sliders marked in the screenshot below.


Hope you like this new addition to SPM.  Got ideas how we could make it more useful for you?  Let us know via comments, email, or @sematext.

Not using SPM yet? Check out the free 30-day SPM trial by registering here (ping us if you’re a startup, a non-profit, or an education institution – we’ve got special pricing for you!).  There’s no commitment and no credit card required.  SPM monitors a ton of applications, like Elasticsearch, Solr, Hadoop, Spark, Node.js & io.js (open-source), Docker (get open-source Docker image), Kafka, Cassandra, and more.


New Elasticsearch Reports: Warmers, Thread Pools and Circuit Breakers

[Note: We’re holding a 2-day, hands-on Elasticsearch / ELK Stack training workshop in New York from October 19-20, 2015. Click here for details!]


Have you read the Top 10 Elasticsearch Metrics to Watch?

How about our free eBook – Elasticsearch Monitoring Essentials?

If you have, we’re impressed. If not, it’s great bedtime reading. ;)

Besides writing bedtime reading material, we also wrote some code last month and added a few new and useful Elasticsearch metrics to SPM.  Specifically, we’ve added:

  • Index Warmer metrics
  • Thread Pool metrics
  • Circuit Breaker metrics

So why are these important?  Read on!

Index Warmers

Warmers do what their name implies.  They warm up. But what? Indices. Why?  Because warming up an index means searches against it will be faster.  Thus, one can warm up indices before exposing searches against them.  If you come to Elasticsearch from Solr, this is equivalent to searcher warmup queries in Solr.
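In Elasticsearch 1.x, warmers are registered via the _warmer API. A minimal example – the index, warmer, and field names here are made up for illustration:

```shell
# register a warmer that sorts on "title", pre-loading that field's
# field data so the first real user query doesn't pay the warm-up cost
curl -XPUT 'localhost:9200/myindex/_warmer/title_sort_warmer' -d '{
  "query": { "match_all": {} },
  "sort": [ "title" ]
}'
```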


Thread Pools

Elasticsearch nodes use a number of dedicated thread pools to handle different types of requests.  For example, indexing requests are handled by a thread pool that is separate from the thread pool that handles search requests.  This helps with better memory management, request prioritization, isolation, etc.  There are over a dozen thread pools, and each of them exposes almost a dozen metrics.

Each pool also has a queue, which makes it possible to hold onto some requests instead of simply dropping them when a node is very busy.  However, if your Elasticsearch cluster handles a lot of concurrent or slow requests, it may sometimes have to start rejecting requests when those thread pool queues are full.  When that starts happening, you will want to know about it ASAP.  Thus, you should pay close attention to thread pool metrics and may want to set Alerts and SPM’s Anomaly Detection Alerts on the metrics that show the number of rejections or the queue size, so you can adjust queue sizes or other parameters to avoid requests being rejected.
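To spot-check the numbers behind these metrics, the _cat/thread_pool API prints per-node queue sizes and cumulative rejection counts:

```shell
# one row per node; non-zero *.rejected columns mean requests were dropped
curl 'localhost:9200/_cat/thread_pool?v&h=host,search.queue,search.rejected,index.queue,index.rejected'
```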

Alternatively, or perhaps additionally, you may want to feed your logs to Logsene.  Elasticsearch can log request rejections (see an example below), so if you ship your Elasticsearch logs to Logsene, you’ll have both Elasticsearch metrics and its logs available for troubleshooting.  Moreover, in Logsene you can create alert queries that notify you about anomalies in your logs, and such alert queries will alert you when Elasticsearch starts logging errors, like the example shown here:

o.es.c.u.c.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23@5a805c60
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at org.elasticsearch.search.action.SearchServiceTransportAction.execute(SearchServiceTransportAction.java:509)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteScan(SearchServiceTransportAction.java:441)
at org.elasticsearch.action.search.type.TransportSearchScanAction$AsyncAction.sendExecuteFirstPhase(TransportSearchScanAction.java:68)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:171)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.start(TransportSearchTypeAction.java:153)
at org.elasticsearch.action.search.type.TransportSearchScanAction.doExecute(TransportSearchScanAction.java:52)
at org.elasticsearch.action.search.type.TransportSearchScanAction.doExecute(TransportSearchScanAction.java:42)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:107)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
at org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:124)
at org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:113)


Circuit Breakers

Circuit Breakers are Elasticsearch’s attempt to control memory usage and prevent the dreaded OutOfMemoryError.  There are currently two Circuit Breakers – one for Field Data, the other for Requests.  In short, you can set limits for each of them and prevent excessive memory usage to avoid your cluster blowing up with OOME.
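For example, the field data breaker limit can be tightened at runtime via the cluster settings API (the 40% value is just an illustration – pick a limit that fits your heap):

```shell
# requests that would push field data past 40% of the heap now trip the
# breaker with a CircuitBreakingException instead of causing an OOME
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": { "indices.breaker.fielddata.limit": "40%" }
}'
```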


Want something like this for your Elasticsearch cluster?

Feel free to register here and enjoy all the SPM for Elasticsearch goodness.  There’s no commitment and no credit card required.  And, if you are a young startup, a small or non-profit organization, or an educational institution, ask us for a discount (see special pricing)!

Feedback & Questions

We are happy to answer questions or receive feedback – please drop us a line or get us @sematext.

eBook: Elasticsearch Monitoring Essentials



Elasticsearch is booming.  Together with Logstash, a tool for collecting and processing logs, and Kibana, a tool for searching and visualizing data in Elasticsearch (aka the “ELK stack”), adoption of Elasticsearch continues to grow by leaps and bounds. In this detailed (and free!) booklet, Sematext DevOps Evangelist Stefan Thies walks readers through Elasticsearch and ELK stack basics and supplies numerous graphs, diagrams and infographics to clearly explain what you should monitor and which Elasticsearch metrics you should watch.  We’ve also included the popular “Top 10 Elasticsearch Metrics” list with corresponding explanations and screenshots.  This booklet will be especially helpful to those who are new to Elasticsearch and the ELK stack, but also to experienced users who want a quick jump start into Elasticsearch monitoring.


Like this booklet?  Please tweet about Performance Monitoring Essentials Booklet – Elasticsearch Edition

Know somebody who’d find this booklet useful?  Please let them know…

When it comes to actually using Elasticsearch, there are tons of metrics generated.  The goal of creating this free booklet is to provide information that we at Sematext have found to be extremely useful in our work as Elasticsearch and ELK stack consultants, production support providers, and monitoring solution builders.


Topics, including our Top 10 Elasticsearch Metrics

Topics addressed in the booklet include: Elasticsearch Vocabulary, Scaling a Cluster, How Indexing Works, Cluster Health – Nodes & Shards, Node Performance, Search Performance, and many others.  And here’s a quick taste of the kind of juicy content you’ll find inside: a dashboard view of our 10 Elasticsearch metrics list.


This dashboard image, and all images in the booklet, are from Sematext’s SPM Performance Monitoring tool.

Got Feedback? Questions?

Please give our booklet a look and let us know what you think — we love feedback!  You can DM us (and RT and/or follow us, if you like what you read) @sematext, or drop us an email.

And…if you’d like try SPM to monitor Elasticsearch yourself, check out a Free 30-day trial by registering here.  There’s no commitment and no credit card required. Small startups, startups with no or very little outside funding, non-profit and educational institutions get special pricing – just get in touch with us.

