Top 10 Elasticsearch Metrics to Watch

Elasticsearch is booming.  Together with Logstash, a tool for collecting and processing logs, and Kibana, a tool for searching and visualizing data in Elasticsearch (aka the “ELK” stack), adoption of Elasticsearch continues to grow by leaps and bounds.  When it comes to actually using Elasticsearch, there are tons of metrics generated.  Instead of taking on the formidable task of tackling all-things-metrics in one blog post, we’re going to serve up something that we at Sematext have found to be extremely useful in our work as Elasticsearch consultants, production support providers, and monitoring solution builders: the top 10 Elasticsearch metrics to watch.  This should be especially helpful to those readers new to Elasticsearch, and also to experienced users who want a quick start into performance monitoring of Elasticsearch.

Side note: we’re @sematext, if you’d like to follow us.

Here are the Top 10 Elasticsearch metrics:

  1. Cluster Health – Nodes and Shards
  2. Node Performance – CPU
  3. Node Performance – Memory Usage
  4. Node Performance – Disk I/O
  5. Java – Heap Usage and Garbage Collection
  6. Java – JVM Pool Size
  7. Search Performance – Request Latency and Request Rate
  8. Search Performance – Filter Cache
  9. Search Performance – Field Data Cache
  10. Indexing Performance – Refresh Times and Merge Times

Most of the charts in this piece group metrics either by displaying multiple metrics in one chart or organizing them into dashboards. This is done to provide context for each of the metrics we’re exploring.

To start, here’s a dashboard view of the 10 Elasticsearch metrics we’re going to discuss.

Top_10_dashboard

This dashboard image, and all images in this post, are from Sematext’s SPM Performance Monitoring tool.

Now, let’s dig each of the top 10 metrics one by one and see how to interpret them.

Read more of this post

Cassandra Use Case – including Performance Monitoring

For those of you using Cassandra or considering it, we’ve got another Sematext user’s Planet Cassandra case study for your reading pleasure.  In this Q&A/use case, Alain Rodriguez, main Data Architect for Teads.tv, talks about the business and technical needs that drove their decision to use Cassandra, how they are using Cassandra, how their Cassandra monitoring is done, and more.

Here’s an excerpt related to Cassandra performance monitoring that’s especially dear to our heart:

“Thanks to SPM we now have a very good view of what is happening in terms of Cassandra performance.  This kind of application is fully realized on outage or slowness issues.  With any outage we’ve had since we started using SPM, I always found the root cause have been able to fix or mitigate things.  Any downtime — even for just a few minutes — can lead to hundreds of thousands of dollars lost, plus a negative opinion of Teads by our customers and a negative impact on our image.  Today it is really worth it to invest in a good monitoring solution. I believe SPM belongs to be one of them.”

PlanetCassandra_Teads

 

Here’s the full post: Video Advertising Platform Teads Chose Cassandra, SPM, and OpsCenter to Monitor a Personalized Ad Experience

And here’s our previous client Cassandra use case for Recruiting.com if you’d like another informed take on selecting Cassandra and using SPM for Cassandra monitoring.

If you’d like try SPM to monitor Cassandra yourself (or any number of applications like Hadoop, HBase, Spark, Kafka, Elasticsearch, Solr, etc.), check out a Free 30-day trial by registering here.  There’s no commitment and no credit card required. Small startups, startups with no or very little outside funding, non-profit and educational institutions get special pricing – just get in touch with us.

Monitoring Stream Processing Tools: Cassandra, Kafka and Spark

One of the trends we see in our Elasticsearch and Solr consulting work is that everyone is processing one kind of data stream or another.  Including us, actually – we process endless streams of metrics, continuous log and even streams, high volume clickstreams, etc.  Kafka is clearly the de facto messaging standard.  In the data processing layer Storm used to be a big hit and, while we still encounter it, we see more and more Spark.  What comes out of Spark typically ends up in Cassandra, or Elasticsearch, or HBase, or some other scalable distributed data store capable of high write rates and analytical read loads.

Kafka-Spark-Cassandra - mish-mash tooling

The left side of this figure shows a typical data processing pipeline, and while this is the core of a lot of modern data processing applications, there are often additional pieces of technology involved – you may have Nginx or Apache exposing REST APIs, perhaps Redis for caching, and maybe MySQL for storing information about users of your system.

Once you put an application like this in production, you better keep an eye on it – there are lots of moving pieces and if any one of them isn’t working well you may start getting angry emails from unhappy users or worse – losing customers.

Imagine you had to use multiple different tools, open-source or commercial, to monitor your full stack, and perhaps an additional tool (ELK or Splunk or …) to handle your logs (you don’t just write them to local FS, compress, and rotate, do you?) Yuck!  But that is what that whole right side of the above figure is about.  Each of the tools on that right side is different – they are different open-source projects with different authors, have different versions, are possibly written in different languages, are released on different cycles, are using different deployment and configuration mechanism, etc.  There are also a number of arrows there.  Some carry metrics to Graphite (or Ganglia or …), others connect to Nagios which provides alerting, while another set of arrows represent log shipping to ELK stack.  One could then further ask – well, what/who monitors your ELK cluster then?!?  (don’t say Marvel, because then the question is who watches Marvel’s own ES cluster and we’ll get stack overflow!)  That’s another set of arrows going from the Elasticsearch object.  This is a common picture.  We’ve all seen similar setups!

Our goal with SPM and Logsene is to simplify this picture and thus the lives of DevOps, Ops, and Engineers who need to manage deployments like this stream processing pipeline.  We do that by providing a monitoring and log management solution that can handle all these technologies really well.   Those multiple tools, multiple types of UIs, multiple logins….. they are a bit of a nightmare or at least a not very efficient setup.  We don’t want that.  We want this:

Kafka-Spark-Cassandra

Once you have something like SPM & Logsene in place you can see your complete application stack, all tiers from frontend to backend, in a single pane of glass. Nirvana… almost, because what comes after monitoring?  Alerting, Anomaly Detection, notifications via email, HipChat, Slack, PageDuty, WebHooks, etc.  The reality of the DevOps life is that we can’t just watch pretty, colourful, real-time charts – we also need to set up and handle various alerts.  But in case of SPM & Logsene, at least this is all in one place – you don’t need to fiddle with Nagios or some other alerting tool that needs to be integrated with the rest of the toolchain.

So there you have it.  If you like what we described here, try SPM and/or Logsene – simply sign up here – there’s no commitment and no credit card required.  Small startups, startups with no or very little outside investment money, non-profit and educational institutions get special pricing – just get in touch with us.  If you’d like to help us make SPM and Logsene even better, we are hiring!

SPM Client Async Metrics Collection

Do your SPM charts ever look choppy like this?

choppy-spm-chart-slow-es

Let’s hope not!  However, if you run large Elasticsearch clusters, especially those with thousands of shards, you may suffer from Elasticsearch sometimes taking a long time to provide information about its metrics that SPM, of course, uses to render various metrics charts.  While this slowness is something Elasticsearch developers are undoubtedly working on, we’ve also proactively improved the SPM Client (aka agent, aka monitor), specifically by making metric collection from various sources asynchronous, to better cope with this slowness.

There are other goodies in more recent versions of the SPM Client, including additional optimizations (some specific to Elasticsearch, but others beneficial to all SPM users), but also some that will enable exciting new functionality we’ll be announcing next month.

To get the latest SPM Client please go to https://apps.sematext.com/spm-reports/client.do.

Custom Metrics from Node.js Apps

We recently added support for Node.js and io.js monitoring to SPM and have received great feedback.  While SPM for Node.js monitors all key Node.js metrics, most applications have additional metrics one often wants to track — things like: the number of concurrent users, the number of items placed in a shopping cart, or any other kind of IT metric, business transaction or KPI.  SPM already provides a Custom Metrics API and libraries that make shipping custom metrics from Java and from Ruby applications a snap.  But why leave Node.js behind?  Meet spm-metrics-js (it’s on Github) – the npm module for sending custom metrics from Node.js apps to SPM.  

This JavaScript module supports measurements using counters, meters, timers, and histograms. These helpers calculate values of metrics objects and ship them to SPM, where they are then turned into charts and inputs to alert rules and anomaly detection algorithms.

Here’s an example for counting users on login and logout:

Sending custom metrics is really that easy!

Now, let’s have a look at the options used when creating a custom metric object:

  • name – the name of the metric you can find in SPM’s user interface
  • aggregation – the aggregation type: ‘avg’, ‘sum’, ‘min’ or ‘max’ used in SPM’s aggregations server
  • filter1 – the SPM user interface provides two filter criteria; the value will be available in the UI as the first filter
  • filter2 – the filter value for the second filter field in SPM’s UI
  • interval – time in ms to call save() periodically. Defaults to no automatic call to save(). The save() function captures the metric and resets meters, histograms, counters or timers.
  • valueFilter – array of property names for calculated values. Only specified fields are sent to SPM (e.g. [‘count’, ‘min’, ‘max’].

Additional measurement functions are available to extend the custom metric object automatically with additional calculated properties:

  • Meter – measure rates and provide the following calculated properties:
    • mean: the average rate since the meter was started
    • count: the total of all values added to the meter
    • currentRate: the rate of the meter since the meter was started
    • 1MinuteRate: the rate of the meter biased toward the last 1 minute
    • 5MinuteRate: the rate of the meter biased toward the last 5 minutes
    • 15MinuteRate: the rate of the meter biased toward the last 15 minutes
  • Histogram – build percentile, min, max, & sum aggregations over time
    • min: the lowest observed value
    • max: the highest observed value
    • sum: the sum of all observed values
    • variance: the variance of all observed values
    • mean: the average of all observed values
    • stddev: the stddev of all observed values
    • count: the number of observed values
    • median: 50% of all values in the reservoir are at or below this value.
    • p75: see median, 75% percentile
    • p95: see median, 95% percentile
    • p99: see median, 99% percentile
    • p999: see median, 99.9% percentile
  • Timer – measures time and captures rates in an internal meter and histogram

If this is more than you actually need, we recommend selecting only the relevant properties (using the ‘valueFilter’ option). Please note that Custom Metrics are aggregated by the specified aggregation type (‘avg’, ‘sum’, ‘min’, ‘max’).  Moreover, the aggregation type for each property can be defined – for further details please check the package documentation.

Adding instrumentation always raises the question of performance; in spm-metrics-js all metrics are buffered and efficiently ship metrics to SPM in bulk using asynchronous functions. We recommend using a transmit time of 60 seconds.

Once you send custom metrics to SPM you can create alerts on them, have SPM detect and alert you about anomalies, put charts with those metrics on dashboards, share charts with those metrics publicly or just with your team or organization, etc.

custom-metric-alert

Actions for Metrics – e.g. define alerts using anomaly detection

custom-metric-dashboard

Dashboard with Custom Metric and other Metrics

Please note the free plan has no limits on the number of monitored Applications, Processes, Dashboards or Users and you can share Accounts with your whole DevOps team and integrate SPM with Slack, HipChat, PagerDuty, Webhooks, etc. If you don’t use SPM yet, grab a free account to start monitoring your Node.js and io.js applications and benefit from all standard SPM features such as alerting, anomaly detection, event and log correlation, unlimited dashboards, secure information sharing, etc. Check out spm-metrics-js (or on Github) and drop us a line (or tweet 140 characters to @sematext) — we’d love to hear from you!

SPM REST API

By popular demand…we’ve just added some new goodness to SPM with the SPM REST API.  This new API lets you:

As you can tell from the above, we started by exposing APIs for SPM Alerts first.  Of course, we’ll be expanding the API to expose more SPM functionality, as well as Logsene. You can see all the details — including Listing Alerts, Creating, Disabling and Deleting Alerts, Metrics API, and more are on our wiki.

Alert Creation Designer

One feature of SPM Alerts HTTP API that we do want to call out here is the Alert Creation Designer.  The easiest way to prepare alert rules to be used in API calls is by using the Create Alert dialog available in SPM.

In the example below we use Solr’s req. count metric, whose chart is under the Req. Rate & Latency report, as shown in the screenshot below.

Req Rate Latency

Above every chart in SPM there is a little pull-down menu.  From there simply click on the Create Alert option to open the Alert dialog seen below.  A new “Show API Call” link is shown at the bottom of the dialog. Once clicked, it displays the API call as a “curl” command line. You can modify attributes shown in the dialog and update the API call details by clicking on the little Refresh icon.  Once you’re happy with alert rule parameters you can copy the curl command and execute it from the terminal.

Finally, for Heartbeat Alerts, you would just click on Heart icon on any report (as displayed below) to open a similar dialog:

Heartbeat

Typically, you would prepare your alert templates this way once, and then just tweak them remotely (for example, just use different token parameter to get it applied to your other apps, adjust metric names, thresholds, etc.).

Would this make your life easier?

If this functionality looks like it will make managing alerts more efficient, then check out a Free 30-day SPM trial by signing up.  There’s no commitment and no credit card required.  SPM monitors a ton of applications, like Elasticsearch, Solr, Hadoop, Spark, HBase, Kafka, Storm, Cassandra, Node.js & io.js, and more.

Node.js and io.js Monitoring Support

Node.js and io.js are increasingly being used to run JavaScript on the server side for many types of applications, such as websites, real-time messaging and controllers for small devices with limited resources. For DevOps it is crucial to monitor the whole application stack and Node.js is rapidly becoming an important part of the stack in many organizations. Sematext has historically had a strong support for monitoring big data applications such as Elastic (aka Elasticsearch), Cassandra, Solr, Spark, Hadoop, and HBase, as well as more traditional databases, web servers like Nginx, Nginx Plus and Apache, Java applications, cache servers like Redis and Memcached, messaging middleware like everyone’s darling Kafka, etc.  With such rapid adoption of Node.js and now io.js, we’d be remiss not to add performance monitoring, alerting, and anomaly detection for them in SPM!

spm-node-io

SPM for Node.js

We’re happy to announce we’ve just added Node.js monitoring to this growing list of SPM integrations.  SPM for Node.js covers key Node.js metrics such as Event Loop, Garbage Collection, CPU, Memory and web services metrics.  All metrics are organized in out-of-the-box charts, which can be put on additional dashboards and placed next to performance charts for other parts of the application stack.

Overview for top node.js and io.js metrics

Overview for top node.js and io.js metrics

 

Of course, you can view your Node.js metrics in a larger context.  For example, here is a dashboard that shows Node.js metrics together with Elasticsearch metrics, making it easier to correlate performance across multiple tiers of the application stack.  You could also get your event and log charts on the same dashboard for an even more thorough correlation.

nodejs-elasticsearch-dashboard

Dashboard with node.js HTTP response time and Elasticsearch query latency

Needless to say, we made sure everything works for the latest versions of Node.js (0.12) and io.js (1.6). Installation is as easy as integration of any other module using npm.  If you are not using SPM yet, you can sign up with no commitment or credit card.  You have 30-days free on any new app you create.  If you are already using SPM, you can simply add a new SPM App for Node.js and see all your Node.js metrics in just a few minutes.  Don’t see something in SPM for Node.js?  Please let us know (@sematext) or comment below, we are looking for feedback!

 

Follow

Get every new post delivered to your Inbox.

Join 166 other followers