Announcement: ZooKeeper Performance Monitoring in SPM

You don’t see him, but he is present.  He is all around us.  He keeps things running.  No, we are not talking about Him, nor about The Force.  We are talking about Apache ZooKeeper, the under-appreciated, often not talked about, yet super-critical component of almost all distributed systems we’ve come to rely on – Hadoop, HBase, Solr, Kafka, Storm, and so on.  Our SPM, Search Analytics, and Logsene all use ZooKeeper, and we are not alone – check our ZooKeeper poll.

We’re happy to announce that SPM can now monitor Apache ZooKeeper!  This means everyone using SPM to monitor Hadoop, HBase, Solr, Kafka, Sensei, and other applications that rely on ZooKeeper can now use the same monitoring and alerting tool – SPM – to monitor their ZooKeeper instances.

Please tweet about Performance Monitoring for ZooKeeper

Here’s a glimpse into what SPM for ZooKeeper provides – click on the image to see the full view or look at the actual SPM live demo:

SPM for ZooKeeper Overview

Please tell us what you think – @sematext is always listening!  Is there something SPM doesn’t monitor that you would really like to monitor?  Please vote for tech to monitor!

Want to build highly distributed big data apps with us?  We’re hiring good engineers (not just for positions listed on our jobs page), and we’re sitting on a heap of some pretty juicy big data!

Presentation: Administering and Monitoring SolrCloud Clusters

Rafal Kuć gave two talks about Solr at Lucene Revolution.  One of them, Administering and Monitoring SolrCloud Clusters, is below.

If you are interested in working with Solr and/or Elasticsearch, we are looking for good people to join our team.

Announcing Hadoop Monitoring in SPM

Take it from one of the most trusted names in the world of Hadoop and HBase, as well as one of the friendliest people you’ll encounter on the Hadoop conference circuit, Lars George from Cloudera:

Hadoop Club Monitoring

We’re happy to announce the immediate availability of SPM for Hadoop (see Sneak Peek: Hadoop Monitoring comes to SPM for some screenshots).  With the latest SPM release, Hadoop joins Apache Solr, Apache HBase, ElasticSearch, Sensei, and the JVM on the list of technologies you can monitor with SPM.  With SPM for Hadoop you go from zero to seeing all key metrics for your Hadoop cluster in just a few minutes.  Included in the reports are metrics for both HDFS and MapReduce – metrics for NameNode, JobTracker, TaskTracker, and DataNode are all included, along with all the default server metrics.  The YARN version of Hadoop is also supported and includes metrics for NodeManager, ResourceManager, etc.

Don’t forget that the SPM monitoring agent can run as an in-process agent, as well as in standalone mode (i.e., as an external process).  Running in standalone mode means you may not have to restart the various daemons of your existing Hadoop cluster that you want to monitor (assuming you have already enabled JMX), so you can quickly get to your Hadoop metrics without interrupting anything!
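To make the standalone mode a bit more concrete: an out-of-process collector simply connects to a daemon’s JMX endpoint and reads MBeans, without touching the daemon itself.  Here is a minimal, hypothetical sketch of that idea – this is not SPM code, the host, port, and service URL are made up, and it assumes remote JMX has already been enabled on the NameNode (Hadoop-specific MBeans are read the same way as the standard JVM one shown here):

import java.lang.management.MemoryMXBean;
import javax.management.JMX;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxPeek {
    public static void main(String[] args) throws Exception {
        // Hypothetical host and port – point this at wherever your NameNode exposes remote JMX.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://namenode.example.com:8004/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            // Read a standard JVM MBean over the remote connection.
            MemoryMXBean memory = JMX.newMXBeanProxy(
                    connection, new ObjectName("java.lang:type=Memory"), MemoryMXBean.class);
            System.out.println("Heap used: " + memory.getHeapMemoryUsage().getUsed() + " bytes");
        } finally {
            connector.close();
        }
    }
}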

What else would you like us to monitor?  Please select your candidates!

What’s New in SPM 1.11.0

We’ve been doing quite a bit of work behind the scenes in SPM.  Here are a few new things in the most recent release – 1.11.0 from April 16, 2013:

  • We’ve added a Standalone Monitor.  Until now, the only way to monitor Solr, ElasticSearch, HBase, Sensei, or the JVM with SPM was by running our SPM Monitor in-process, as a javaagent.  Starting with this version you have the additional option of running the monitor in a separate process.
  • SPM URLs are now sharable. Just copy the URL from your browser while using SPM and give it to anyone who has access to the same SPM App and they’ll see the exact same view as you – this means seeing the same report, same graph, same filter selection(s), and the same time range!  Because we use SPM with a lot of our Solr and ElasticSearch consulting clients, this is huge for us (and them!), as it helps us all see the exact same view.
  • We have simplified the SPM client installation a lot, and the Collectd config a bit, too.
  • Both SPM Sender and SPM Monitor have been reworked.  The Monitor can now register new applications, and the Sender picks that up automatically.  The Sender also runs with ionice if ionice is available, and some unnecessary work was removed from the Monitor, so it should consume even fewer resources than before.
  • We’ve added more info about SPM to the new SPM Wiki Space.

In addition to that, we’re working on:

  • Hadoop monitoring. This includes performance reports for both HDFS and MapReduce – NameNode, JobTracker, TaskTracker, and DataNode.
  • SPM client packaging.  This means you’ll soon be able to install SPM client as a Deb package or RPM, and then automate with Puppet or Chef.

There are a few more interesting things in the works, but we’ve got to leave something for later.  If you have not tried SPM yet, you should!  User feedback has been awesome and there are a number of good things on the 2013 roadmap!

Sneak Peek: Hadoop Monitoring comes to SPM

When it comes to Hadoop, they say you’ve got to monitor it and then monitor it some more.  Since our own Performance Monitoring and Search Analytics services run on top of Hadoop, we figured it was time to add Hadoop performance monitoring to SPM.  So here is a sneak peek at SPM for Hadoop.  If you’d like to try it on your Hadoop cluster, we’ll be sending invitations soon and you can get on the private beta list starting today!

In the meantime, here is a small sample of pretty self-explanatory reports from SPM for Hadoop, so you can get a sense of what’s available.  There are, of course, a number of other Hadoop-specific reports included, as well as server reports, filtering, alerting, multi-user support, report sharing, and so on.

Please don’t forget to tell us what else you would like us to monitor – select your candidates – and if you like what you see and want a good monitoring tool for your Hadoop cluster, please sign up for the private beta now.

Click on any graph to see it in its full size and high quality.

Hadoop NameNode Files

Hadoop DataNode Read-Write

Hadoop JobTracker MapReduce Runtime

Hadoop TaskTracker Tasks

What else would you like us to monitor with SPM?  Please select your candidates!

For announcements, promotions, discounts, service status, milk, cookies, and other goodies follow @sematext.

EC2 Neighbour Caught Stealing CPU

We run all our services (SPM, Search Analytics, and Logsene) on top of AWS.  We like the flexibility and the speed of provisioning and decommissioning instances.  Unfortunately, this “new age” computing comes at a price.  Once in a while we hit an EC2 instance that has a loud, noisy neighbour.  Kind of like this:

Noisy Jack Nicholson

Unlike in real life, you can’t really hear your noisy neighbours in virtualized worlds.  This is kind of good – if you don’t hear them, they won’t bother you, right?  Wrong!  Oh yes, they will bother you; it’s just that without proper tools you won’t really realize when they’ve become loud, how loud they got, and how much their noise is hurting you!  So while it’s true you can’t hear these neighbours, you can see them!  Have a look at this graph from SPM:

Noisy neighbour(s) stealing your CPU. Click for a larger and sharper image.

What we see here is a graph for CPU “steal time” for one of our HBase servers.  Luckily, this happens to be one of our HBase masters, which doesn’t do a ton of CPU intensive work.  What we see is that somebody, some other VM(s) sharing the same underlying host, is stealing about 30% of the CPU that really belongs to us.  Sounds bad, doesn’t it?  What exactly does that mean?  It means that about 30% of the time, applications on this instance (i.e., in our VM) try to use the CPU and the CPU is not available.  Bummer. Of course, this happens at a very, very low level, so from the outside, without this sort of insight, everything looks OK — it’s impossible to tell whether applications are not getting the CPU cycles when they need them by just looking at applications themselves.
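If you want a quick sanity check of this number, steal time ultimately comes from the Linux kernel’s CPU accounting and is exposed in /proc/stat.  Here is a small, self-contained sketch (our own illustration, not SPM code) that samples the aggregate CPU counters twice and prints the steal percentage over that window; it assumes a Linux kernel recent enough to report the steal column:

import java.nio.file.Files;
import java.nio.file.Paths;

public class StealCheck {
    // Reads the aggregate "cpu" line from /proc/stat and returns its counters.
    static long[] readCpuCounters() throws Exception {
        String line = Files.readAllLines(Paths.get("/proc/stat")).get(0);
        String[] parts = line.trim().split("\\s+");
        long[] counters = new long[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            counters[i - 1] = Long.parseLong(parts[i]);
        }
        return counters;
    }

    public static void main(String[] args) throws Exception {
        long[] before = readCpuCounters();
        Thread.sleep(5000);                      // sample over a 5 second window
        long[] after = readCpuCounters();

        long totalDelta = 0;
        for (int i = 0; i < before.length; i++) {
            totalDelta += after[i] - before[i];
        }
        // Field order in /proc/stat: user nice system idle iowait irq softirq steal ...
        long stealDelta = after[7] - before[7];
        System.out.printf("CPU steal over the last 5s: %.1f%%%n",
                100.0 * stealDelta / totalDelta);
    }
}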

So, do you know how noisy your virtual neighbours are?  Do you know how much they steal from you?

If you want to see what your neighbour situation is, whether on AWS or in some other virtualized environment, this is what you can do:

  1. Get SPM (pick “Java” for your SPM Application type once you get in, even if you don’t need to monitor any Java apps)
  2. Run the installer, but don’t bother with the “monitor” (aka SPM Monitor) piece – all you need are your CPU metrics, and for those the monitor piece doesn’t need to be running at all.
  3. Go to http://apps.sematext.com/ and look at the “CPU & Mem” tab
  4. Unselect all metrics other than “steal”, as shown in the image above.  Use the filter to the right of the graph (not shown in the image) to check one server at a time.
  5. Set up SPM alerts so you get notified when the CPU steal percentage goes over a threshold you don’t want to tolerate.  This way you’ll know when it’s time to consider moving to a new VM/instance.

What do you do if you find out you do have noisy neighbours?

There are a couple of options:

  • Be patient and hope they go to sleep or move out
  • Pack your belongings, launch a new EC2 instance, and move there after ensuring it doesn’t suffer from the same problem
  • Create more noise than your neighbour and drive him/her out instead. Yes, I just made this up.

In this particular case, we’ll try the patient option first and move out only when the noise starts noticeably hurting us or we run out of patience.  Happy monitoring!

Slides: Large Scale Performance Monitoring for ElasticSearch, HBase, Solr, SenseiDB…

In this presentation from Berlin Buzzwords 2012 we show how SPM, our Performance Monitoring service, is built – how metrics are collected, how they are processed, and how they are presented.  We share a few findings along the way, too.

Note: we are actively looking for strong Java engineers.  If that’s you, please get in touch.  Separately, if you have interest and/or experience with HBase and/or Analytics, OLAP, and related areas, or if you are looking to work with ElasticSearch, Solr, and search in general, please get in touch, too.

See also:

Poll: What do you use for Solr performance monitoring?

The results of this poll will be included in the “Large Scale ElasticSearch, Solr & HBase Performance Monitoring” presentation at Berlin Buzzwords next week.  Please vote and share this post to help us make this poll statistically significant!

Poll: What do you use for ElasticSearch performance monitoring?

The results of this poll will be included in the “Large Scale ElasticSearch, Solr & HBase Performance Monitoring” presentation at Berlin Buzzwords next week.  Please vote and share this post to help us make this poll statistically significant!

ElasticSearch Cache Usage

We’ve been doing a ton of work with ElasticSearch.  Not long ago, we had a few situations where ElasticSearch would “eat” all the JVM heap memory we gave it.  It was so hungry, we could not feed it enough memory to keep it happy.  It was insatiable.  After some troubleshooting and looking at SPM for ElasticSearch (btw. we released a new version of the SPM agent earlier this week, so if you don’t have it, go grab agent v1.5.0) we figured out the cause – the default ElasticSearch field data cache settings were not quite right for our deployment.  In this post we’ll share our experience, explain why this was happening, and show how to minimize the negative effect of large field data caches.

ElasticSearch Cache Types

There are two types of caches in ElasticSearch whose behavior you can control.  The first is the filter cache.  This cache is responsible for caching the results of filters used in your queries.  This is very handy, because after a filter is run once, ElasticSearch will subsequently use the values stored in the filter cache, saving precious disk I/O operations and thus speeding up query execution.  There are two main implementations of the filter cache in ElasticSearch:

  1. node filter cache (default)
  2. index filter cache

The node filter cache is an LRU cache, which means that the least recently used items will be evicted when the filter cache is full.  Its size can be limited either to a percentage of the total memory allocated to the Java process or to an exact amount of memory.  The second type of filter cache is the index filter cache.  It is not recommended for use because in most cases you can’t predict how much memory it will use, since that depends on which shards are located on which node.  In addition, you can’t control the amount of memory used by the index filter cache; you can only set its expiration time and the maximum number of entries it holds.

The second type of cache in ElasticSearch is the field data cache.  This cache is used for sorting and faceting in most cases.  It loads all values from the field you sort or facet on and then performs calculations on the basis of the loaded values.  You can imagine that the cost of building such a cache for a large amount of data might be very high.  And it is.  Apart from the type (which can be either resident or soft), you can control two additional parameters of the field data cache – the maximum number of entries in it and its expiration time.
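To get a feel for why this cost matters, here is a rough back-of-envelope calculation – our own illustration with made-up numbers, not a description of how ElasticSearch actually sizes the cache: if every document contributes one value for the field you sort or facet on, the memory needed grows linearly with the number of documents.

public class FieldDataEstimate {
    public static void main(String[] args) {
        // Made-up numbers purely for illustration.
        long docs = 200_000_000L;        // documents with a value in the sorted/faceted field
        long bytesPerValue = 16;         // rough average cost per loaded value
        double gigabytes = (docs * bytesPerValue) / (1024.0 * 1024 * 1024);
        System.out.printf("~%.1f GB just to hold that one field in memory%n", gigabytes);
    }
}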

The Defaults

The default filter cache implementation is the node filter cache, with its size capped at 20% of the memory allocated to the Java process.  As you can imagine, there is nothing to worry about here – if the cache fills up, the appropriate cache entries get evicted.  You can then consider adding more RAM to make the filter cache bigger, or simply live with the evictions.  That’s perfectly acceptable.

On the other hand, we have the default settings for the ElasticSearch field data cache – it is a resident cache with unlimited size.  Yes, unlimited.  The cost of rebuilding this cache is very high, so you must know how much memory it can use – you must control your queries and watch which fields you sort on and which fields you facet on.

What Happens When You Don’t Control Your Cache Size?

This is what can happen when you don’t control your field data cache size:

As you can see in the chart above, the field data cache jumped to more than 58 GB, which is enormous.  Yes, we got an OutOfMemory exception during that time.

What Can You Do?

There are actually three things you can do to make your field data cache use less memory:

Control its Size and Expiration Time

When using the default, resident field data cache type, you can set its size and expiration time.  However, please remember that there are situations when you need the field data cache to hold values for the particular field you are sorting or faceting on.  In order to change the field data cache size, you specify the following property:

index.cache.field.max_size

It specifies the maximum number of entries in that cache per Lucene segment.  It doesn’t limit the amount of memory the field data cache can use, so you have to do some testing to ensure the queries you are using won’t result in an OutOfMemory exception.

The other property you can set is the expiration time.  It defaults to -1, which means that cache entries never expire.  In order to change that, you must set the following property:

index.cache.field.expire

So if, for example, you would like to have a maximum of 50k entries in the field data cache per segment, and you would like those entries to expire after 10 minutes, you would set the following property values in the ElasticSearch configuration file:

index.cache.field.max_size: 50000
index.cache.field.expire: 10m

Change its Type

The other thing you can do is change the field data cache type from the default resident to soft.  Why does that matter?  ElasticSearch uses the Google Guava libraries to implement its cache.  The soft type wraps cache values in soft references, which means that whenever memory is needed, the garbage collector is allowed to clear those references, even for entries that are still cached.  This means that when you start hitting the heap memory limit, the JVM won’t throw an OutOfMemory exception; instead, the garbage collector will release those soft references.  More about soft references can be found at:

http://docs.oracle.com/javase/6/docs/api/java/lang/ref/SoftReference.html

So in order to change the default field data cache type to soft, you should add the following property to the ElasticSearch configuration file:

index.cache.field.type: soft
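To make the soft reference behavior more concrete, here is a tiny JDK-only sketch – our own illustration, not ElasticSearch or Guava code – showing how a softly referenced value can silently disappear under memory pressure instead of contributing to an OutOfMemoryError:

import java.lang.ref.SoftReference;

public class SoftRefDemo {
    public static void main(String[] args) {
        // Wrap a large value in a SoftReference – the same mechanism the "soft" cache type relies on.
        SoftReference<byte[]> cached = new SoftReference<>(new byte[64 * 1024 * 1024]);

        // Under heap pressure the GC may clear the referent before throwing
        // OutOfMemoryError, so get() can start returning null.
        byte[] value = cached.get();
        if (value != null) {
            System.out.println("Value still cached: " + value.length + " bytes");
        } else {
            System.out.println("Value was reclaimed under memory pressure");
        }
    }
}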

Change Your Data

The last thing you can do requires much more effort than just changing the ElasticSearch configuration – you may want to change your data.  Look at your index structure, look at your queries, and think.  Maybe you can lowercase some string data and thus reduce the number of unique values in the field?  Maybe you don’t need your dates to be precise down to the second – maybe minute or even hour precision is enough?  Of course, when doing faceting operations you can set the granularity, but the data will still be loaded into memory.  So if there are parts of your data that can be changed in a way that results in lower memory consumption, you should consider it.  As a matter of fact, that is exactly what we did!
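For illustration, here is the kind of client-side normalization we mean, done before documents are sent to ElasticSearch.  The field values and precision choices here are made up for the example, not a description of the exact changes we made to our own data:

public class ValueCoarsening {
    // Round a millisecond timestamp down to the start of its hour,
    // so the field holds at most 24 unique values per day instead of millions.
    static long truncateToHour(long epochMillis) {
        long hourMillis = 60L * 60L * 1000L;
        return epochMillis - (epochMillis % hourMillis);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println("raw timestamp:      " + now);
        System.out.println("truncated to hour:  " + truncateToHour(now));

        // Lowercasing collapses "Error", "ERROR" and "error" into a single term,
        // shrinking the number of unique values the field data cache has to hold.
        String tag = "ERROR";
        System.out.println("normalized tag:     " + tag.toLowerCase());
    }
}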

Caches After Some Changes

After we made some changes in our ElasticSearch configuration/deployment, this is what the field data cache usage looked like:

As you can see, the cache dropped from 58 GB down to 37 GB.  Impressive drop!  After these changes we stopped running into OutOfMemory exception problems.

Summary

You have to remember that the default settings for the field data cache in ElasticSearch may not be appropriate for you.  Of course, they may be perfectly fine in your deployment.  You may not need sorting apart from the default based on Lucene scoring, and you may not need faceting on fields with many unique terms.  If that’s the case for you, don’t worry about the field data cache.  Likewise, if you have enough memory for holding the field data for your facets and sorting, then you don’t need to change anything from the default ElasticSearch cache configuration.  What you do need to remember is to monitor your JVM heap memory usage and cache statistics, so you know what is happening in your cluster and can react before things get worse.

One More Thing

The charts you see in the post are taken from SPM (Scalable Performance Monitoring) for ElasticSearch.  SPM is currently free and, as you can see, we use it extensively in our client engagements.  If you give it a try, please let us know what you think and what else you would like to see in it.

@sematext (Like working with ElasticSearch?  We’re hiring!)
