EC2 Neighbour Caught Stealing CPU
April 22, 2013
We run all our services (SPM, Search Analytics, and Logsene) on top of AWS. We like the flexibility and the speed of provisioning and decommissioning instances. Unfortunately, this “new age” computing comes at a price. Once in a while we hit an EC2 instance that has a loud, noisy neighbour. Kind of like this:
Unlike in real life, you can’t really hear your noisy neighbours in virtualized worlds. This is kind of good – if you don’t hear them, they won’t bother you, right? Wrong! Oh yes, they will bother you; it’s just that without proper tools you won’t realize when they’ve become loud, how loud they got, and how much their noise is hurting you! So while it’s true you can’t hear these neighbours, you can see them! Have a look at this graph from SPM:
What we see here is a graph of CPU “steal time” for one of our HBase servers. Luckily, this happens to be one of our HBase masters, which doesn’t do a ton of CPU-intensive work. What we see is that somebody, some other VM(s) sharing the same underlying host, is stealing about 30% of the CPU that really belongs to us. Sounds bad, doesn’t it? What exactly does that mean? It means that about 30% of the time, applications on this instance (i.e., in our VM) try to use the CPU and the CPU is not available. Bummer. Of course, this happens at a very, very low level, so from the outside, without this sort of insight, everything looks OK: it’s impossible to tell whether applications are getting the CPU cycles they need just by looking at the applications themselves.
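If you are curious where a number like that comes from, here is a minimal sketch. It is not how SPM collects the metric; it simply assumes a Linux guest whose kernel exposes a steal counter as the eighth value on the aggregate `cpu` line of `/proc/stat`. The function names and the 5-second sampling interval are our own choices for illustration.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of
# /proc/stat on Linux (assumption: the kernel is new enough to report steal).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the first eight counters of the aggregate 'cpu' line as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Percentage of elapsed CPU time counted as steal between two snapshots."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    return 100.0 * deltas["steal"] / total if total > 0 else 0.0


def read_snapshot(path="/proc/stat"):
    """Read the current counters from /proc/stat (Linux only)."""
    with open(path) as f:
        return parse_cpu_line(f.read())


if __name__ == "__main__":
    first = read_snapshot()
    time.sleep(5)  # sampling interval is an arbitrary choice
    second = read_snapshot()
    print("steal: %.1f%%" % steal_percent(first, second))
```

The counters are cumulative since boot, which is why the sketch takes two snapshots and works with the difference between them rather than reading a single value.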
So, do you know how noisy your virtual neighbours are? Do you know how much they steal from you?
If you want to see what your neighbour situation is, whether on AWS or in some other virtualized environment, this is what you can do:
- Get SPM (pick “Java” for your SPM Application type once you get in, even if you don’t need to monitor any Java apps)
- Run the installer, but don’t bother with the “monitor” (aka SPM Monitor) piece – all you need are your CPU metrics, and for that the monitor piece doesn’t need to be running at all.
- Go to http://apps.sematext.com/ and look at the “CPU & Mem” tab
- Unselect all metrics other than “steal”, as shown in the image above. To check one server at a time, select each server you want to check in the filter to the right of that graph (not shown in the image).
- Make use of SPM alerts and set them up so you get notified when the CPU steal percentage goes over a certain threshold that you don’t want to tolerate. This way you’ll know when it’s time to consider moving to a new VM/instance.
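The idea behind such an alert can be sketched in a few lines. This is not SPM’s alerting API, just a hypothetical helper of our own: `should_alert`, its 20% default threshold, and the three-sample window are all assumptions you would tune to your own tolerance. It only fires when steal stays above the threshold for several samples in a row, so a single brief spike doesn’t wake anybody up.

```python
def should_alert(samples, threshold=20.0, consecutive=3):
    """Return True if the most recent `consecutive` steal samples (in percent)
    are all above `threshold`. Both defaults are illustrative assumptions."""
    if len(samples) < consecutive:
        return False  # not enough data yet to call it sustained
    return all(s > threshold for s in samples[-consecutive:])


if __name__ == "__main__":
    recent = [4.0, 31.5, 29.0, 33.2]  # made-up sample values
    if should_alert(recent):
        print("CPU steal has been high for a while; consider moving")
```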
What do you do if you find out you do have noisy neighbours?
There are a few options:
- Be patient and hope they go to sleep or move out
- Pack your belongings, launch a new EC2 instance, and move there after ensuring it doesn’t suffer from the same problem
- Create more noise than your neighbour and drive him/her out instead. Yes, I just made this up.
In this particular case, we’ll try the patient option first and move out only when the noise starts noticeably hurting us or we run out of patience. Happy monitoring!