Monitoring CoreOS Clusters

In this post you’ll learn how to get operational insights (performance metrics, container events, etc.) from CoreOS, and how etcd, fleet, and SPM make that super simple.

We’ll use:

  • SPM for Docker to run the monitoring agent as a Docker container and collect all Docker metrics and events for all other containers on the same host, plus metrics for the host itself
  • fleet to seamlessly distribute this container to all hosts in the CoreOS cluster, simply by providing it with the fleet unit file shown below
  • etcd to set a property that holds the SPM App token for the whole cluster
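
Here is a minimal sketch of how those pieces fit together. The etcd key (/sematext.com/myapp/spm/token) and the unit name (spm-agent.service) are just example names, not anything SPM requires, so adapt them to your own conventions. First, store the SPM App token in etcd once for the whole cluster:

  etcdctl set /sematext.com/myapp/spm/token YOUR-SPM-APP-TOKEN

Then a global fleet unit along these lines runs the agent container on every machine (Global=true in the [X-Fleet] section is what makes fleet schedule it on all hosts, and %H is the systemd specifier for the host name):

  # spm-agent.service – example fleet unit file (a sketch, adapt to your setup)
  [Unit]
  Description=SPM Docker Agent
  After=docker.service
  Requires=docker.service

  [Service]
  TimeoutStartSec=0
  Restart=always
  ExecStartPre=-/usr/bin/docker kill spm-agent
  ExecStartPre=-/usr/bin/docker rm spm-agent
  ExecStartPre=/usr/bin/docker pull sematext/spm-agent-docker
  ExecStart=/bin/sh -c "/usr/bin/docker run --name spm-agent -v /var/run/docker.sock:/var/run/docker.sock -e SPM_TOKEN=$(etcdctl get /sematext.com/myapp/spm/token) -e HOSTNAME=%H sematext/spm-agent-docker"
  ExecStop=/usr/bin/docker stop spm-agent

  [X-Fleet]
  Global=true

Submit and start it with fleetctl, and fleet takes care of running one agent per host:

  fleetctl submit spm-agent.service
  fleetctl start spm-agent.service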

The Big Picture

Before we get started, let’s take a step back and look at our end goal.  What do we want?  We want charts with Performance Metrics, we want Event Collection, we’d love integrated Anomaly Detection and Alerting, and we want that not only for containers, but also for hosts running containers.  CoreOS has no package manager and deploys services in containers, so we want to run the SPM agent in a Docker container, as shown in the following figure:

[Figure: SPM for Docker – Docker hosts running application containers plus one SPM Docker Agent container each]

By the end of this post each of your Docker hosts could look like the above figure, with one or more of your own containers running your own apps, and a single SPM Docker Agent container that monitors all your containers and the underlying hosts.


Docker Events and Docker Metrics Monitoring

Docker deployments can be very dynamic, with containers being started and stopped, moved around YARN- or Mesos-managed clusters, having very short life spans (the so-called cattle) or long uptimes (aka pets).  Getting insight into the current and historical state of such clusters goes beyond collecting container performance metrics and sending alert notifications.  If a container dies or gets paused, for example, you may want to know about it, right?  Or maybe you’d want to be able to see, in retrospect, that a container went belly up when troubleshooting, wouldn’t you?

Just two weeks ago we added Docker Monitoring (docker image is right here for your pulling pleasure) to SPM.  We didn’t stop there — we’ve now expanded SPM’s Docker support by adding Docker Event collection, charting, and correlation.  Every time a container is created or destroyed, started, stopped, or when it dies, spm-agent-docker captures the appropriate event so you can later see what happened where and when, correlate it with metrics, alerts, anomalies — all of which are captured in SPM — or with any other information you have at your disposal.  The functionality and the value this brings should be pretty obvious from the annotated screenshot below.


[Annotated screenshot: Docker Events charted and correlated with metrics in SPM]

Here’s the list of Docker events the SPM Docker monitoring agent currently captures:

  • Version Information on Startup:
    • server-info – created by the spm-agent framework with Node.js and OS version info on startup
    • docker-info – Docker Version, API Version, Kernel Version on startup
  • Docker Status Events:
    • Container Lifecycle Events like
      • create, exec_create, destroy, export
    • Container Runtime Events like
      • die, exec_start, kill, oom, pause, restart, start, stop, unpause

Every time a Docker container emits one of these events, spm-agent-docker will capture it in real time, ship it over to SPM, and you’ll be able to see it as shown in the screenshot above.
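
If you want to eyeball the same raw event stream locally, the Docker CLI exposes it directly (spm-agent-docker consumes the equivalent /events endpoint of the Docker remote API). Leave this running in one terminal while you start and stop a container in another:

  # watch the live event stream that spm-agent-docker also subscribes to
  docker events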

Oh, and if you’re running CoreOS, you may also want to see how to index CoreOS logs into ELK/Logsene. Why? Because then you can have not only metrics and container events in one place, but also all container and application logs, too!

If you’re using Docker, we hope you find this useful!  Anything else you’d like us to add to SPM (for Docker or any other integration)?  Leave a comment, ping @sematext, or send us email – tell us what you’d like to get for early Christmas!

Real-time Server Insights via Birds Eye View

Everyone’s infrastructure is growing – whether you run bare-metal servers, use IaaS, or run containers. This just-added SPM functionality, a new view in SPM that we call BEV (aka Birds Eye View), helps you get better visibility into all your servers and quickly spot the ones requiring attention — especially the hot ones!

Up until now SPM provided you with very detailed insight into all kinds of metrics for whichever SPM App you were looking at.  SPM, of course, lets you monitor a bunch of things!  Thus you, like lots of other SPM users, might be monitoring several (types of) applications (e.g. real-time data processing pipelines). This means you also need to be able to see how servers running those apps are doing health-wise.  Do any of them have maxed out CPUs?  Any of them close to running out of disk?  Any of them swapping like crazy?  Wouldn’t it be nice to see various metrics for lots or even all your servers at a glance?  BEV to the rescue!

With BEV you can get an instant, real-time, and consolidated look at your key server and application-specific metrics, including: CPU utilization, Disk used %, Memory used %, Load, and Swap.  From these metrics SPM computes the general health of the server, which it uses to surface the most problematic servers and, by using red, orange, and green coloring, bring the most critical servers to your attention.

Cross-app Server Visibility

BEV is especially valuable because it gives you the overall view of all your servers, across all your SPM Apps – yet with the ability to filter by app and hostname patterns.  BEV is like top + df for all your servers and clusters.  In fact, BEV was designed to give users at-a-glance capabilities in a few different ways:

Sparklines: Whereas the typical application performance monitoring (APM) chart is designed to show as much data as possible, sparklines are intended to be succinct and give users an instant idea of whether a specific application is encountering a problem.

Colored Metric Numbers: Getting an instant sense of server health is as easy as driving up to a traffic light.  Green — sweet, looks good.  Orange — hmmm, should watch that.  Red — whoa, better check that out asap!

[Screenshot: BEV – key server metrics with health coloring and sparklines]

While BEV already surfaces the hottest servers, you can also set min/max ranges for any of the metrics and thus easily hide servers that you know are healthy and that you don’t want to even see in BEV.  Just use the sliders marked in the screenshot below.

[Screenshot: BEV – min/max range sliders for filtering servers by metric]

Hope you like this new addition to SPM.  Got ideas how we could make it more useful for you?  Let us know via comments, email, or @sematext.

Not using SPM yet? Check out the free 30-day SPM trial by registering here (ping us if you’re a startup, a non-profit, or an educational institution – we’ve got special pricing for you!).  There’s no commitment and no credit card required.  SPM monitors a ton of applications, like Elasticsearch, Solr, Hadoop, Spark, Node.js & io.js (open-source), Docker (get the open-source Docker image), Kafka, Cassandra, and more.

 

New Elasticsearch Reports: Warmers, Thread Pools and Circuit Breakers

Have you read the Top 10 Elasticsearch Metrics to Watch?

How about our free eBook – Elasticsearch Monitoring Essentials?

If you have, we’re impressed. If not, it’s great bedtime reading. ;)

Besides writing bedtime reading material, we also wrote some code last month and added a few new and useful Elasticsearch metrics to SPM.  Specifically, we’ve added:

  • Index Warmer metrics
  • Thread Pool metrics
  • Circuit Breaker metrics

So why are these important?  Read on!

Index Warmers

Warmers do what their name implies.  They warm up. But what? Indices. Why?  Because warming up an index means searches against it will be faster.  Thus, one can warm up indices before exposing them to user searches.  If you come to Elasticsearch from Solr, this is equivalent to searcher warmup queries in Solr.
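
To make that concrete, here is a sketch of registering a warmer with the Elasticsearch 1.x warmers API; the index name (my_index), the warmer name (warmer_top_tags), and the tags field are made-up examples:

  curl -XPUT 'localhost:9200/my_index/_warmer/warmer_top_tags' -d '
  {
    "query": { "match_all": {} },
    "aggs": {
      "top_tags": { "terms": { "field": "tags" } }
    }
  }'

Each time such a warmer runs, Elasticsearch records warmer stats, and those are the numbers now charted in SPM.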

[Screenshot: Elasticsearch Index Warmer metrics in SPM]

Thread Pools

Elasticsearch nodes use a number of dedicated thread pools to handle different types of requests.  For example, indexing requests are handled by a thread pool that is separate from the thread pool that handles search requests.  This helps with better memory management, request prioritization, isolation, etc.  There are over a dozen thread pools, and each of them exposes almost a dozen metrics.

Each pool also has a queue, which makes it possible to hold onto some requests instead of simply dropping them when a node is very busy.  However, if your Elasticsearch cluster handles a lot of concurrent or slow requests, it may sometimes have to start rejecting requests if those thread pool queues are full.  When that starts happening, you will want to know about it ASAP.  Thus, you should pay close attention to thread pool metrics, and you may want to set threshold Alerts and SPM’s Anomaly Detection Alerts on the metrics that show the number of rejections or the queue size, so you can adjust queue sizes or other parameters before requests start being rejected.
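
If you want to cross-check what SPM is showing, Elasticsearch exposes the same numbers directly; a quick way to eyeball per-node queue and rejection counts is the _cat API (a sketch – the exact column names can vary a bit between Elasticsearch versions):

  curl 'localhost:9200/_cat/thread_pool?v&h=host,search.queue,search.rejected,bulk.queue,bulk.rejected'

Growing rejected counts are exactly what the alerts described above should catch early.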

Alternatively, or perhaps additionally, you may want to feed your logs to Logsene.  Elasticsearch can log request rejections (see the example below), so if you ship your Elasticsearch logs to Logsene, you’ll have both Elasticsearch metrics and its logs available for troubleshooting.  Moreover, in Logsene you can create alert queries that notify you about anomalies in your logs, such as when Elasticsearch starts logging errors like the one shown here:

  o.es.c.u.c.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23@5a805c60
    at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
    at org.elasticsearch.search.action.SearchServiceTransportAction.execute(SearchServiceTransportAction.java:509)
    at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteScan(SearchServiceTransportAction.java:441)
    at org.elasticsearch.action.search.type.TransportSearchScanAction$AsyncAction.sendExecuteFirstPhase(TransportSearchScanAction.java:68)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:171)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.start(TransportSearchTypeAction.java:153)
    at org.elasticsearch.action.search.type.TransportSearchScanAction.doExecute(TransportSearchScanAction.java:52)
    at org.elasticsearch.action.search.type.TransportSearchScanAction.doExecute(TransportSearchScanAction.java:42)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
    at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:107)
    at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
    at org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:124)
    at org.elasticsearch.action.search.TransportSearchAction$TransportHandler.messageReceived(TransportSearchAction.java:113)

[Screenshot: Elasticsearch Thread Pool metrics in SPM]

Circuit Breakers

Circuit Breakers are Elasticsearch’s attempt to control memory usage and prevent the dreaded OutOfMemoryError.  There are currently two Circuit Breakers – one for Field Data, the other for Requests.  In short, you can set limits for each of them and prevent excessive memory usage to avoid your cluster blowing up with OOME.
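
If SPM shows a breaker getting close to its limit, the limits themselves can be adjusted at runtime through the cluster settings API; here is a sketch (the percentages are simply the Elasticsearch 1.x defaults, not recommendations):

  curl -XPUT 'localhost:9200/_cluster/settings' -d '
  {
    "persistent": {
      "indices.breaker.fielddata.limit": "60%",
      "indices.breaker.request.limit": "40%"
    }
  }'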

[Screenshot: Elasticsearch Circuit Breaker metrics in SPM]

Want something like this for your Elasticsearch cluster?

Feel free to register here and enjoy all the SPM for Elasticsearch goodness.  There’s no commitment and no credit card required.  And, if you are a young startup, a small or non-profit organization, or an educational institution, ask us for a discount (see special pricing)!

Feedback & Questions

We are happy to answer questions or receive feedback – please drop us a line or get us @sematext.

eBook: Elasticsearch Monitoring Essentials

Elasticsearch is booming.  Together with Logstash, a tool for collecting and processing logs, and Kibana, a tool for searching and visualizing data in Elasticsearch (aka the “ELK stack”), adoption of Elasticsearch continues to grow by leaps and bounds. In this detailed (and free!) booklet, Sematext DevOps Evangelist Stefan Thies walks readers through Elasticsearch and ELK stack basics and supplies numerous graphs, diagrams, and infographics to clearly explain what you should monitor and which Elasticsearch metrics you should watch.  We’ve also included the popular “Top 10 Elasticsearch Metrics” list with corresponding explanations and screenshots.  This booklet will be especially helpful to those who are new to Elasticsearch and the ELK stack, but also to experienced users who want a quick jump start into Elasticsearch monitoring.



When it comes to actually using Elasticsearch, there are tons of metrics generated.  The goal of creating this free booklet is to provide information that we at Sematext have found to be extremely useful in our work as Elasticsearch and ELK stack consultants, production support providers, and monitoring solution builders.

[Image: Elasticsearch Monitoring Essentials eBook cover]

Topics, including our Top 10 Elasticsearch Metrics

Topics addressed in the booklet include: Elasticsearch Vocabulary, Scaling a Cluster, How Indexing Works, Cluster Health – Nodes & Shards, Node Performance, Search Performance, and many others.  And here’s a quick taste of the kind of juicy content you’ll find inside: a dashboard view of our Top 10 Elasticsearch Metrics list.

[Screenshot: Top 10 Elasticsearch Metrics dashboard in SPM]

This dashboard image, like all images in the booklet, comes from Sematext’s SPM Performance Monitoring tool.

Got Feedback? Questions?

Please give our booklet a look and let us know what you think — we love feedback!  You can DM us (and RT and/or follow us, if you like what you read) @sematext, or drop us an email.

And…if you’d like to try SPM to monitor Elasticsearch yourself, check out the free 30-day trial by registering here.  There’s no commitment and no credit card required. Small startups, startups with no or very little outside funding, non-profit and educational institutions get special pricing – just get in touch with us.

Docker Monitoring Support

Containers and Docker are all the rage these days.  In fact, containers — with Docker as the leading container implementation — have changed how we deploy systems, especially those composed of microservices. Despite all the buzz, however, Docker and other containers are still relatively new and not yet mainstream. That being said, even early Docker adopters need a good monitoring tool, so last month we added Docker monitoring to SPM.  We built it on top of spm-agent, our extensible framework for Node.js-based agents, and ended up with spm-agent-docker.

Monitoring of Docker environments is challenging. Why? Because each container typically runs a single process, has its own environment, utilizes virtual networks, and has its own way of managing storage. Traditional monitoring solutions take metrics from each server and the applications it runs. These servers and applications are typically very static, with very long uptimes. Docker deployments are different: a set of containers may run many applications, all sharing the resources of a single host. It’s not uncommon for Docker servers to run thousands of short-term containers (e.g., for batch jobs) while a set of permanent services runs in parallel.  Traditional monitoring tools, built for far more static environments, are not suited for such deployments. SPM, on the other hand, was built with this in mind.  Moreover, container resource sharing calls for stricter enforcement of resource usage limits, an additional thing you must watch carefully. To make appropriate adjustments to resource quotas you need good visibility into any limits containers have reached and any errors they have caused. We recommend setting alerts tied to the defined limits; this way you can adjust limits or resource usage even before errors start happening.
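
Note that the limit-related metrics only tell you something if your containers actually have limits set. Here is a small example of starting a container with a memory cap (the image and container names are made up):

  docker run -d --name batch-worker -m 256m my-registry/batch-worker:latest

With a cap like this in place, SPM’s memory limit metrics and fail counters, plus an alert on them, will tell you when a container starts bumping against its quota.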

How do we get detailed metrics for each container?

Docker provides a remote interface for container stats (by default exposed via a UNIX domain socket). The SPM agent for Docker uses this interface to collect Docker metrics.
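
If you’re curious what that interface looks like, you can poke at it yourself with curl (this assumes a curl build with --unix-socket support, i.e. 7.40 or newer; the container ID below is a placeholder):

  # list running containers via the Docker remote API
  curl --unix-socket /var/run/docker.sock http://localhost/containers/json

  # stream live stats for one container (Ctrl-C to stop)
  curl --unix-socket /var/run/docker.sock http://localhost/containers/<container-id>/stats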

[Figure: SPM for Docker – SPM Docker Agent monitoring other containers, itself running in a Docker container]

How to deploy monitoring for Docker

There are several ways one can run a Docker monitor, including:

  1. run it directly on the host machine (“Server” in the figure above)
  2. run one agent for multiple servers
  3. run the agent in a container (alongside the containers it monitors) on each server

SPM uses approach 3), aka the “Docker Way”. Thus, SPM for Docker is provided as a Docker image. This makes installation easy, requires no dependencies on the host machine (unlike approach 1), and, unlike approach 2), needs no configuration of a server list to support multiple Docker servers.

How to install SPM for Docker

It’s very simple: Create an SPM App of type “Docker” to get the SPM application token (used as $TOKEN below), and then run:

  1. docker pull sematext/spm-agent-docker and
  2. docker run -d -v /var/run/docker.sock:/var/run/docker.sock -e SPM_TOKEN=$TOKEN -e HOSTNAME=$HOSTNAME sematext/spm-agent-docker

You’ll see your Docker metrics in SPM after about a minute.
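
If the metrics don’t show up, a quick sanity check is to confirm the agent container is actually running and to look at its logs:

  docker ps | grep spm-agent-docker
  docker logs <container-id-of-the-agent>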

SPM for Docker – Features

If you already know SPM then you’re aware that each SPM integration supports all SPM features.  If, however, you are new to SPM, this summary will help:

  1. Out-of-the-box Dashboards and unlimited custom Dashboards
  2. Multi-user support with role-based access control, application and account sharing
  3. Threshold-based Alerts on all metrics mentioned above including Custom Metrics
  4. Machine learning-based Anomaly Detection on all metrics, including Custom Metrics
  5. Alerting via email, PagerDuty, Nagios and Webhooks  (e.g. Slack, HipChat)
  6. Email subscriptions for scheduled Performance Reports
  7. Secure sharing of graphs and reports with your team, or with the public
  8. Correlation with logs shipped to Logsene
  9. Charting and correlation with arbitrary Events

Let’s continue with the Docker-specific part:

  1. Easy-to-install Docker agent
  2. Monitoring of multiple Docker Hosts and an unlimited number of Containers per ‘SPM Docker App’
  3. Predefined Dashboards for all Host and Container metrics:
    • OS Metrics of the Docker Host
    • Detailed Container Metrics
      • CPU
      • Memory
      • Network
      • I/O Metrics
    • Resource Usage Limits
      • CPU throttled times
      • Memory limits
    • Fail counters (e.g., for memory allocation and network packets)
  4. Filtering and aggregation by Hosts, Images, Container IDs, and Tags

[Screenshot: SPM for Docker – Predefined Dashboard ‘Overview’]

Containerized applications typically communicate with other applications via their exposed network ports; that’s why network metrics are definitely on the hot list of metrics to watch for Docker, and why SPM provides such detailed network Reports:

[Screenshot: Docker network metrics report in SPM]

Did you enjoy this little excursion into Docker monitoring? Then it’s time to put it into practice!

We appreciate feedback from early adopters, so please feel free to drop us a line, DM us on Twitter @sematext, or chat with us using the web chat in SPM or on our homepage — we are here to get your monitoring up and running.  If you are a startup, get in touch – we offer discounts for startups!

Monitoring Kibana 4’s Node.js App

The release of Kibana 4.x has had an impact on monitoring and other related activities.  In this post we’re going to get specific and show you how to add Node.js monitoring to the Kibana 4 server app.  Why Node.js?  Because Kibana 4 now comes with a little Node.js server app that sits between the Kibana UI and the Elasticsearch backend.  Conveniently, you can monitor Node.js apps with SPM, which means SPM can monitor Kibana in addition to monitoring Elasticsearch.  Furthermore, Logstash can also be monitored with SPM, which means you can use SPM to monitor your whole ELK Stack!  But, I digress…

A few important things to note first:

  • the Kibana project moved from Ruby, to a pure browser app, and now to Node.js on the server side, as mentioned above
  • it now uses the popular Express Web Framework
  • the server component has a built-in proxy to Elasticsearch, just like it did with the Ruby app
  • when monitoring Kibana 4, the proxy requests to Elasticsearch are monitored at the same time

OK, here’s how to add Node.js monitoring to the Kibana 4 server-side app.

1) Preparation

Get an App Token for SPM by creating a new Node.js SPM App in SPM.

Kibana 4 currently ships with Node.js version 0.10.35 in a subdirectory – so please make sure your Node.js is on 0.10 while installing SPM Agent for Node.js (it compiles native modules, which need to match Kibana’s 0.10 runtime).  One way to switch Node.js versions is the “n” version manager:

  npm install n -g
  n 0.10.35

After finishing the installation described below you can easily switch back to 0.12 or io.js 2.0 by using “n 0.12” or “n 2.0” – because Kibana will use its own Node.js sub-folder.

2) Install SPM Agent for Node.js

Switch over to your Kibana 4 installation directory.  It has a “src” folder where the Node.js modules are installed.

  cd src
  npm install spm-agent-nodejs

Add the following line to ./src/app.js:

  var spmAgent = require('spm-agent-nodejs')

Add the following line at the beginning of the bin/kibana shell script:

  export spmagent_tokens__spm=YOUR-SPM-APP-TOKEN

3) Run Kibana

  bin/kibana

4) Check results in SPM

After a minute you should see performance metrics such as Event Loop latencies, Memory Usage, Garbage Collection details, and HTTP statistics for your Kibana 4 server app in SPM.

[Screenshot: Kibana 4 – monitored with SPM for Node.js]

SPM for Node.js Monitoring – Details, Screenshots and more

For more specific details about SPM’s Node.js monitoring integration, check out this blog post.

That’s all there is to it!  If you’ve got questions or feedback about this post, please let us know!
