You don’t see him, but he is present. He is all around us. He keeps things running. No, we are not talking about Him, nor about The Force. We are talking about Apache ZooKeeper, the under-appreciated, often not talked-about, yet super-critical component of almost all distributed systems we’ve come to rely on – Hadoop, HBase, Solr, Kafka, Storm, and so on. Our SPM, Search Analytics, and Logsene all use ZooKeeper, and we are not alone – check our ZooKeeper poll.
We’re happy to announce that SPM can now monitor Apache ZooKeeper! This means everyone using SPM to monitor Hadoop, HBase, Solr, Kafka, Sensei, and other applications that rely on ZooKeeper can now use the same monitoring and alerting tool – SPM – to monitor their ZooKeeper instances.
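Incidentally, ZooKeeper itself exposes basic health and stats through its “four-letter” commands over its client port. A minimal probe, assuming a server reachable on `localhost:2181` (the host and port here are illustrative, not tied to SPM), might look like this:

```python
import socket

def zk_four_letter(host, port, cmd, timeout=5.0):
    """Send a ZooKeeper four-letter command (e.g. 'ruok', 'stat', 'mntr') and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def parse_mntr(reply):
    """Turn 'mntr' output (tab-separated key/value lines) into a dict of metrics."""
    metrics = {}
    for line in reply.splitlines():
        if "\t" in line:
            key, value = line.split("\t", 1)
            metrics[key] = value
    return metrics

# Example (assumes a running ZooKeeper on localhost:2181):
# print(zk_four_letter("localhost", 2181, "ruok"))  # a healthy server replies "imok"
```

A one-off probe like this tells you a server is alive right now; continuous monitoring and alerting on these metrics over time is exactly what SPM adds on top.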
Take it from one of the most trusted names in the world of Hadoop and HBase, as well as one of the friendliest people you’ll encounter on the Hadoop conference circuit, Lars George from Cloudera:
We’re happy to announce the immediate availability of SPM for Hadoop (see Sneak Peek: Hadoop Monitoring comes to SPM for some screenshots). With the latest SPM release, Hadoop joins Apache Solr, Apache HBase, ElasticSearch, Sensei, and the JVM on the list of technologies you can monitor with SPM. With SPM for Hadoop you go from zero to seeing all key metrics for your Hadoop cluster in just a few minutes. Included in the reports are metrics for both HDFS and MapReduce – metrics for NameNode, JobTracker, TaskTracker, and DataNode are all included, along with all the default server metrics. The YARN version of Hadoop is also supported and includes metrics for NodeManager, ResourceManager, etc.
Don’t forget that the SPM monitoring agent can run as an in-process agent, as well as in standalone mode (i.e., as an external process). Running in standalone mode means you may not have to restart the various daemons of the existing Hadoop cluster you want to monitor (assuming you have already enabled JMX), so you can quickly get to your Hadoop metrics without interrupting anything!
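If you haven’t enabled JMX on your Hadoop daemons yet, one common way is via the standard `com.sun.management.jmxremote` JVM properties in `hadoop-env.sh`. A sketch of what that might look like for the NameNode (the port number is illustrative, the exact variable names can differ across Hadoop versions, and you should secure JMX properly rather than disabling authentication as this bare-bones example does):

```sh
# hadoop-env.sh -- illustrative sketch only; pick your own port and lock JMX down
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8004 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
```

A restart of the daemon is needed for the flags to take effect, which is why having JMX enabled ahead of time makes the standalone monitor so convenient.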
We’ve been doing quite a bit of work behind the scenes in SPM. Here are a few new things in the most recent release – 1.11.0 from April 16, 2013:
We’ve added a Standalone Monitor. Until now, the only way to monitor Solr, ElasticSearch, HBase, Sensei, or the JVM with SPM was to run our SPM Monitor in-process, as a Java agent. Starting with this version you have the additional option of running the monitor in a separate process.
SPM URLs are now sharable. Just copy the URL from your browser while using SPM and give it to anyone who has access to the same SPM App and they’ll see the exact same view as you – this means seeing the same report, same graph, same filter selection(s), and the same time range! Because we use SPM with a lot of our Solr and ElasticSearch consulting clients, this is huge for us (and them!), as it helps us all see the exact same view.
We have simplified the SPM client installation a lot and have simplified the Collectd config a bit, too.
Both SPM Sender and SPM Monitor have been reworked. Monitor can now register new applications and Sender picks that up automatically. Sender also runs under ionice when it is available, and some unnecessary work was removed from Monitor, so it should consume even fewer resources than before.
Hadoop monitoring. This includes performance reports for both HDFS and MapReduce – NameNode, JobTracker, TaskTracker, and DataNode.
SPM client packaging. This means you’ll soon be able to install SPM client as a Deb package or RPM, and then automate with Puppet or Chef.
There are a few more interesting things in the works, but we’ve got to leave something for later. If you have not tried SPM yet, you should! User feedback has been awesome and there are a number of good things on the 2013 roadmap!
When it comes to Hadoop, they say you’ve got to monitor it and then monitor it some more. Since our own Performance Monitoring and Search Analytics services run on top of Hadoop, we figured it was time to add Hadoop performance monitoring to SPM. So here is a sneak peek at SPM for Hadoop. If you’d like to try it on your Hadoop cluster, we’ll be sending invitations soon and you can get on the private beta list starting today!
In the meantime, here is a small sample of pretty self-explanatory reports from SPM for Hadoop, so you can get a sense of what’s available. There are, of course, a number of other Hadoop-specific reports included, as well as server reports, filtering, alerting, multi-user support, report sharing, and more.
Please don’t forget to tell us what else you would like us to monitor – select your candidates – and if you like what you see and want a good monitoring tool for your Hadoop cluster, please sign up for the private beta now.
Click on any graph to see it in its full size and high quality.
We run all our services (SPM, Search Analytics, and Logsene) on top of AWS. We like the flexibility and the speed of provisioning and decommissioning instances. Unfortunately, this “new age” computing comes at a price. Once in a while we hit an EC2 instance that has a loud, noisy neighbour. Kind of like this:
Unlike in real life, you can’t really hear your noisy neighbours in virtualized worlds. This is kind of good – if you don’t hear them, they won’t bother you, right? Wrong! Oh yes, they will bother you; it’s just that without proper tools you won’t really realize when they’ve become loud, how loud they got, and how much their noise is hurting you! So while it’s true you can’t hear these neighbours, you can see them! Have a look at this graph from SPM:
What we see here is a graph of CPU “steal time” for one of our HBase servers. Luckily, this happens to be one of our HBase masters, which doesn’t do a ton of CPU-intensive work. It shows that somebody – some other VM(s) sharing the same underlying host – is stealing about 30% of the CPU that really belongs to us. Sounds bad, doesn’t it? What exactly does that mean? It means that about 30% of the time, applications on this instance (i.e., in our VM) try to use the CPU and the CPU is not available. Bummer. Of course, this happens at a very, very low level, so from the outside, without this sort of insight, everything looks OK; it’s impossible to tell whether applications are getting the CPU cycles they need just by looking at the applications themselves.
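On Linux, steal time is one of the per-CPU counters the kernel exposes in `/proc/stat`, so you can compute the same percentage yourself from two samples. A minimal sketch (function names are ours, not part of any SPM API):

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of jiffy counters.

    Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    Guest time is already folded into user time, so we only sum the first 8 fields.
    """
    return [int(v) for v in line.split()[1:]]

def steal_percent(before, after):
    """Percentage of elapsed CPU time stolen by the hypervisor between two samples."""
    total = sum(after[:8]) - sum(before[:8])
    steal = after[7] - before[7]  # 'steal' is the 8th counter
    return 100.0 * steal / total if total else 0.0

# Sampling sketch (Linux only):
# with open("/proc/stat") as f:
#     before = parse_cpu_line(f.readline())
# ... wait a second or so ...
# with open("/proc/stat") as f:
#     after = parse_cpu_line(f.readline())
# print("steal: %.1f%%" % steal_percent(before, after))
```

A steadily high number here is the signature of a noisy neighbour; SPM simply tracks this counter for you over time, across all your servers.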
So, do you know how noisy your virtual neighbours are? Do you know how much they steal from you?
If you want to see what your neighbour situation is, whether on AWS or in some other virtualized environment, this is what you can do:
Get SPM (once you’re in, pick “Java” as your SPM Application type, even if you don’t need to monitor any Java apps)
Run the installer, but don’t bother with the “monitor” (aka SPM Monitor) piece – all you need are the CPU metrics, and for those the monitor piece doesn’t need to run at all.
Unselect all metrics other than “steal”, as shown in the image above. Then select each server you want to check in the filter to the right of that graph (not shown in the image) to check one server at a time.
Make use of SPM alerts and set them up so you get notified when the CPU steal percentage goes over a certain threshold that you don’t want to tolerate. This way you’ll know when it’s time to consider moving to a new VM/instance.
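The last step boils down to a simple rule: alert only when steal stays above your threshold for a while, so one noisy sample doesn’t page you. SPM’s alerting handles this for you, but the idea can be sketched like this (threshold and window values below are illustrative, not SPM defaults):

```python
def should_alert(steal_samples, threshold_pct=20.0, min_consecutive=3):
    """Return True if the last `min_consecutive` steal-% samples all exceed the threshold.

    Requiring several consecutive breaches avoids alerting on a single noisy sample.
    Tune threshold_pct and min_consecutive to match how much steal you can tolerate.
    """
    if len(steal_samples) < min_consecutive:
        return False
    return all(s > threshold_pct for s in steal_samples[-min_consecutive:])
```

With a rule like this in place, a sustained breach is your signal that it may be time to launch a replacement instance.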
What do you do if you find out you do have noisy neighbours?
There are a couple of options:
Be patient and hope they go to sleep or move out
Pack your belongings, launch a new EC2 instance, and move there after ensuring it doesn’t suffer from the same problem
Create more noise than your neighbour and drive him/her out instead. Yes, I just made this up.
In this particular case, we’ll try the patient option first and move out only when the noise starts noticeably hurting us or we run out of patience. Happy monitoring!
In this presentation from Berlin Buzzwords 2012 we show how SPM, our Performance Monitoring service, is built: how metrics are collected, how they are processed, and how they are presented. We share a few findings along the way, too.
Note: we are actively looking for strong Java engineers. If that’s you, please get in touch. Separately, if you have interest and/or experience in HBase, Analytics, OLAP, and related areas, or if you’d like to work with ElasticSearch, Solr, and search in general, please get in touch, too.