Announcement: Percentiles added to SPM

In the spirit of continuous improvement, we are happy to announce that percentiles have recently been added to SPM’s arsenal of measurement tools.  Percentiles give a much more accurate picture of how a metric is distributed than averages alone: users can see the 50th, 95th, and 99th percentiles for specific metrics and set both regular threshold-based alerts and anomaly detection alerts on them.  We will go into more detail about how the percentiles are computed in another post, but for now we want to put the word out and show some of the related graphs — click on them to enlarge them.  Enjoy!
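
Before we get to the graphs: to give a rough idea of what these percentile numbers mean, here is a minimal sketch of a nearest-rank percentile computation over a batch of latency samples.  This is only an illustration of the concept, not how SPM computes percentiles internally, and the sample values are made up.

import java.util.Arrays;

public class PercentileExample {
    // Nearest-rank percentile over a sorted copy of the samples.
    static long percentile(long[] latenciesMs, double pct) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] requestLatenciesMs = {12, 15, 11, 210, 14, 13, 950, 16, 12, 18};
        System.out.println("p50: " + percentile(requestLatenciesMs, 50) + " ms");
        System.out.println("p95: " + percentile(requestLatenciesMs, 95) + " ms");
        System.out.println("p99: " + percentile(requestLatenciesMs, 99) + " ms");
    }
}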

Elasticsearch – Request Rate and Latency

Garbage Collectors Time

Kafka – Flush Time

Kafka – Fetch/Produce Latency 1

Kafka – Fetch/Produce Latency 2

Solr – Req. Rate and Latency 1

Solr – Req. Rate and Latency 2

If you enjoy performance monitoring, log analytics, or search analytics, and working with projects like Elasticsearch, Solr, HBase, Hadoop, Kafka, or Storm, we’re hiring planet-wide!

Announcement: Redis Monitoring in SPM

Don’t worry, we didn’t just stop at Storm monitoring and metrics while improving SPM.  We’re also happy to announce support for Redis.

Specifically, here are some of the key Redis metrics SPM monitors:

  • Used Memory
  • Used Memory Peak
  • Used Memory RSS
  • Connected Clients
  • Connected Slaves
  • Master Last IO Seconds Ago
  • Keyspace Hits
  • Keyspace Misses
  • Evicted Keys
  • Expired Keys
  • Commands Processed
  • Keys Count per DB
  • Keys to Expire Count per DB

Also, for all application types users can add alerting rules, heartbeat alerts, and Algolerts, as well as receive emails with performance reports for a given time period.

Enough with the words — here is what the graphs look like; click them to enlarge them:

Redis Overview chart

Used memory / Used memory peak / Used memory RSS chart

Keyspace Hits chart

Expiring Keys chart

Evicted Keys chart

And we’re not done.  Watch this space for more SPM updates coming soon…

Give SPM a spin – it’s free to get going and you’ll have it up and running, graphing all your Redis metrics in 5 minutes!

If you enjoy performance monitoring, log analytics, or search analytics, and working with projects like Elasticsearch, Solr, HBase, Hadoop, Kafka, or Storm, we’re hiring planet-wide!

Announcement: Apache Storm Monitoring in SPM

There has been a “storm” brewing here at Sematext recently.  Fortunately, this has nothing to do with the fierce winter weather many of us are experiencing in different parts of the globe — it’s actually a good kind of storm!  We’ve gotten a lot of requests to add Apache Storm support to SPM and we’re pleased to say that it is now a reality.  SPM can already monitor Kafka, ZooKeeper, Hadoop, Elasticsearch, and more.  As a matter of fact, we’ve just announced Redis monitoring, too!

Here’s why you should care:

  1. SPM users can see different Storm metrics in dynamic, real-time graphs, a big improvement over the standard Storm UI, which only offers time-specific snapshots.  Isn’t it better to see trends as opposed to static snapshots?  We certainly think so.
  2. SPM users can create an external link and share their charts with others (e.g., on a mailing list or in a blog post) to get insight into problems without having to provide login credentials.  Here’s an example (you will see the chart even though you don’t have a username or password):  https://apps.sematext.com/spm-reports/s/aQjuv5GdC1
  3. SPM also provides its users with common System and JVM-related metrics like CPU usage, memory usage, JVM heap size, and pool utilization, among others.  This lets you troubleshoot performance issues better by allowing you to correlate Storm-specific metrics with common System and JVM metrics.

Here are the Storm metrics SPM can now monitor:

  • Supervisors count
  • Topologies count
  • Supervisor total/free/used slots count
  • Topology workers/executors/tasks count
  • Topology spouts/bolts/state spouts count
  • Bolt emitted/transferred events
  • Bolt acked/executed/failed events
  • Bolt executed/processed latencies
  • Spout emitted/transferred events
  • Spout acked/failed events
  • Spout complete latency

Also important to note — users can add alerting rules for all metrics, including Algolerts and heartbeat alerts, as well as receive daily, weekly, and monthly performance reports via email.

Here are some of the graphs — click on them to see larger versions:

Overview

For observing the general state of the system.

Acked-Failed Decrease

Do you see how “acked” (blue line) decreased?  It may be related to some problems with resources (e.g., CPU load).

Timing-Increased

Check out this “Timing” chart: see the spike at ~13:21?  It seems that something is up with the CPU (again); it might be the “pressure” from the Java GC (Garbage Collector).

Start-Topology-Workers

On the first chart you can see how the counts of tasks and workers grew.  This is because a new topology (“job” in Storm terminology) was started at 12:25.

Start-Topology

Same as above: you can see that between 12:00 and 12:30 the Storm Supervisor (a daemon that runs on each machine in the cluster) was restarted and the topology was added after the restart.

Give SPM a spin – it’s free to get going and you’ll have it up and running, graphing all your Storm metrics in 5 minutes!

If you enjoy performance monitoring, log analytics, or search analytics, and working with projects like Elasticsearch, Solr, HBase, Hadoop, Kafka, or Storm, we’re hiring planet-wide!

Video: Administering and Monitoring SolrCloud Clusters

As you know, at Sematext we are not only about consulting services, but also about administration, monitoring, and data analysis. Because of that, during last year’s Lucene Revolution conference in Dublin we gave a talk about administration and monitoring of SolrCloud clusters. In the talk, Rafał Kuć discusses administration procedures for SolrCloud, such as collection management and schema modifications with the Schema API. In addition, he talks about why monitoring is important and what to pay attention to. Finally, he shows three real-life examples of how monitoring proved useful.  Enjoy the video and/or the slides!

Note: we are looking for engineers passionate about search to join our professional services team.  We’re hiring planet-wide!

Announcement: JVM Memory Pool Monitoring

Raise your hand if you’ve never experienced the dreaded OutOfMemoryError (aka OOM or OOME) while working with Java.  If you’ve raised your hand, count yourself lucky.  The vast majority of Java developers are nowhere near that lucky.  Today, SPM is making the lives of all Java developers a little better by adding JVM Memory Pool monitoring reports to SPM, alongside the existing JVM Heap, Threading, and Garbage Collection reports.  Note: you’ll need to get the new SPM client (version 1.16 or newer) to see these reports.

Please tweet about JVM Memory Pool Monitoring in SPM

What are JVM Memory Pools (aka Spaces)?

The super-simplified version of this complex topic is that inside the JVM there are different memory pools (aka spaces) that the JVM uses for different types of objects.  The JVM uses some of these pools to store permanent bits, others for storing young objects, others for storing tenured objects, and so on.  Numerous blog posts and much documentation have been written on the inner workings of the JVM.  It’s a complex topic that is not easy to fully grok in one sitting.  To make the story more complex, different Garbage Collectors, and certainly different JVM implementations, manage objects differently.  Thus, exactly which memory pools you see in SPM will depend on, and change with, the JVM you use and the garbage collection algorithm or garbage collection-related parameters you select.  And this is precisely one case where seeing these pools in SPM comes in reeeeeally handy – seeing how memory pool sizes and usage change as you try different Garbage Collectors or any other JVM parameter can be very informative and educational, and can bring insight into the workings of the JVM that lets you select parameters that are optimal for your particular application.

The new JVM Memory Pools reports are available to all SPM users right away.  Let’s have a look at what these new reports look like and what information they provide.

Memory Pool Sizes

This report should be obvious to all Java developers who know how the JVM manages memory.  Here we see the relative sizes of all memory spaces and their total size.  If you are troubleshooting the performance of the JVM (i.e., of any Java application), this is one of the key places to check first, in addition to the Garbage Collection and Memory Pool Utilization reports (see further below).  In this graph we see a healthy sawtooth pattern clearly showing when major garbage collection kicked in.

JVM Memory Pool Size

Memory Pool Utilization

The Memory Pool Size graph is useful, but knowing the actual utilization of each pool may be even more valuable, especially if you are troubleshooting the OutOfMemoryError we mentioned earlier in the post.  Java OOME stack traces often don’t tell you much about where the OOME happened.  The Memory Pool Utilization graph shows what percentage of each pool is being used over time.  When some of these Memory Pools approach 100% utilization and stay there, it’s time to worry.  When that happens, if you jump over to the Garbage Collection report in SPM you will probably see spikes there as well.  And if you then jump to your CPU report in SPM you will likely see increased CPU usage there, too, as the JVM keeps trying to free up space in any pools that are (nearly) full.

JVM Memory Pool Utilization
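
For the curious, the per-pool numbers behind charts like these are exposed by the standard JVM management API.  Here is a minimal sketch that reads each pool's usage via MemoryPoolMXBean and prints a utilization percentage.  It is purely illustrative and not necessarily how the SPM client itself collects these metrics:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MemoryPoolDump {
    public static void main(String[] args) {
        // Each pool (e.g. Eden Space, Survivor Space, Old Gen, Perm Gen) reports its
        // current usage; which pools exist depends on the JVM and the GC in use.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            // getMax() returns -1 when the pool has no defined maximum; fall back to committed.
            long max = usage.getMax() > 0 ? usage.getMax() : Math.max(usage.getCommitted(), 1);
            double utilizationPct = 100.0 * usage.getUsed() / max;
            System.out.printf("%-25s used=%,d bytes  utilization=%.1f%%%n",
                    pool.getName(), usage.getUsed(), utilizationPct);
        }
    }
}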

Alerting on Memory Pools before OOME

If you see scary spikes or near-100% utilization of certain memory pools, your application may be in trouble.  Not dealing with this problem, whether by improving the application’s use of memory or by giving the JVM more memory via -Xmx and related parameters, will likely result in a big bad OOME.  Nobody wants that to happen.  One way to keep an eye on this is via Alerts in SPM.  As you can see from the graphs, memory utilization naturally and typically varies quite a bit.  Thus, although SPM has nice Algolerts, to monitor utilization of JVM memory pools we recommend using standard threshold-based Alerts and getting alerted when the utilization percentage is > N% for M minutes.  N and M are up to you, of course.
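
The “> N% for M minutes” rule boils down to a very simple check.  Here is a small sketch of that logic with a hypothetical 90%-for-5-minutes rule and made-up sample values; it only illustrates the idea and is not how SPM implements its alert rules:

import java.util.ArrayDeque;
import java.util.Deque;

// Fires when a pool's utilization stays above a threshold for M consecutive samples.
public class SustainedThresholdCheck {
    private final double thresholdPct;
    private final int requiredSamples;   // e.g. one sample per minute -> M minutes
    private final Deque<Double> recent = new ArrayDeque<>();

    public SustainedThresholdCheck(double thresholdPct, int requiredSamples) {
        this.thresholdPct = thresholdPct;
        this.requiredSamples = requiredSamples;
    }

    public boolean shouldAlert(double utilizationPct) {
        recent.addLast(utilizationPct);
        if (recent.size() > requiredSamples) {
            recent.removeFirst();
        }
        return recent.size() == requiredSamples
                && recent.stream().allMatch(u -> u > thresholdPct);
    }

    public static void main(String[] args) {
        SustainedThresholdCheck oldGenCheck = new SustainedThresholdCheck(90.0, 5);
        double[] samples = {72, 88, 93, 95, 96, 97, 98};   // utilization %, one per minute
        for (double s : samples) {
            System.out.println(s + "% -> alert: " + oldGenCheck.shouldAlert(s));
        }
    }
}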

Please tell us what you think – @sematext is always listening!  Is there something SPM doesn’t monitor that you would really like to monitor?  Please vote for tech to monitor!

Announcement: SPM Performance Monitoring for Kafka

We are happy users of Apache Kafka.  We make heavy use of it in SPM, in Search Analytics, and in Logsene.  We knew Kafka was fast when we decided to use it.  But it wasn’t until we added Kafka Performance Monitoring support to SPM that we saw all Kafka performance metrics at a glance and admired its speed.  We are happy to announce that in addition to being able to monitor Solr/SolrCloud, Elasticsearch, Hadoop HDFS and MapReduce, HBase, SenseiDB, JVM, RedHat, Debian, Ubuntu, CentOS, SuSE, etc. you can now also monitor Apache Kafka!

Please tweet about Performance Monitoring for Kafka

Here’s a glimpse into what SPM for Kafka provides – click on the image to see the full view or look at the actual SPM live demo:

SPM for Kafka Overview

Of course, SPM can also alert you on any Kafka metric and can do so directly via email or via PagerDuty, using either traditional threshold-based alerts or our recently announced Algolerts.  SPM supports Kafka 0.7.x, and we’ll be adding support for Kafka 0.8.x shortly.

Please tell us what you think – @sematext is always listening!  Is there something SPM doesn’t monitor that you would really like to monitor?  Please vote for tech to monitor!

Introducing Algolerts – Algorithmic Anomaly Detection Alerts

It is not every day that you come across a new term or concept Google doesn’t yet know about.  So today we’ll teach Google about something new we’ve added to SPM in the latest release: Algolerts.

Please tweet about Algolerts – Algorithmic Anomaly Detection Alerts

The Problem with Threshold-based Alerts

Why do we even have alerts in performance monitoring systems?  We have them because we want to be notified when something bad happens, when some metric spikes or dips too much – when CPU usage hits the roof, when disk IO goes up, when the network traffic suspiciously quiets down, and so on.  We see such spikes or dips in metric values as signs that something might be wrong or is about to go wrong.  When limited to traditional threshold-based alerts one is forced to figure out what range of metric values represents a non-alarming, normal state and, conversely, at which point spikes and dips should be considered out of an acceptable range and taken seriously.  One needs to pick minimum and maximum metric values and then create one alert rule for each such value.  The same process then needs to be repeated for every metric one wants to monitor.  This is painful and time-consuming.  To make things worse, these thresholds have to be regularly updated to match what represents the new normal!  One can try to fight this by setting very “loose alerts” by picking high maxima and low minima, but then one risks not getting alerted when something really does go awry.

To summarize:

  • It is hard to estimate the normal range of each metric and pick min and max thresholds
  • Metric values often fluctuate and create false alerts
  • To avoid false alerts one has to regularly adjust alert rule thresholds

Algolerts to the Rescue!

With the name obviously derived from the terms Algorithm and Alert, Algolerts are SPM’s alternative to, or perhaps even a replacement for, the traditional threshold-based alerts you see in most, if not all, monitoring solutions.  Algolerts don’t require thresholds to figure out when to alert you.  Algolerts can watch any metric you tell them to watch and alert you when an anomalous pattern – a pattern that deviates from the norm – is detected.

Creating Algolerts is even simpler than adding threshold-based alerts and is done through a familiar interface:

SPM Algolert creation

Algolert notifications provide useful, easy-to-read numbers, so one can quickly see just how big the detected anomaly is.  Here is an example notification:

Anomalous value for 'received' metric has been detected for SPM Application SA.Prod.Kafka
Host filter=xxx, Network Interface filter=eth0.

Anomaly detection window size: 1800 seconds.

Statistics for 'received' metric are:
Current: 1,220,121.00
Average:   185,147.97
Median:     89,536.00
StdDev:    222,173.70
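
To make the numbers above more concrete, here is a toy sketch of how a current value could be compared against statistics from a recent window to decide whether it looks anomalous.  This is only an illustration of the general idea, not the actual algorithm behind Algolerts, and the window size, threshold, and sample values are made up:

import java.util.ArrayDeque;
import java.util.Deque;

public class SimpleAnomalyCheck {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;      // e.g. samples covering ~1800 seconds
    private final double sigmas;       // how many standard deviations count as anomalous

    public SimpleAnomalyCheck(int windowSize, double sigmas) {
        this.windowSize = windowSize;
        this.sigmas = sigmas;
    }

    // Returns true if the new value deviates strongly from the recent window.
    public boolean isAnomalous(double value) {
        boolean anomalous = false;
        if (window.size() >= windowSize) {
            double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            double variance = window.stream()
                    .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0);
            double stdDev = Math.sqrt(variance);
            anomalous = Math.abs(value - mean) > sigmas * stdDev;
            window.removeFirst();
        }
        window.addLast(value);
        return anomalous;
    }

    public static void main(String[] args) {
        SimpleAnomalyCheck check = new SimpleAnomalyCheck(30, 4.0);
        // Fill the window with "normal" traffic around 100K bytes received.
        for (int i = 0; i < 30; i++) check.isAnomalous(90_000 + Math.random() * 20_000);
        System.out.println(check.isAnomalous(1_220_121));  // prints true: far outside the norm
    }
}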

Known Kinks

The Algolerts implementation in place in SPM today has a few known kinks.  The kinks we know about and will be ironing out are:

  • no “things are OK again, you can go back to sleep” notifications are sent when the metric value goes back to normal
  • regular anomalies (e.g. a CPU intensive nightly cron job) may trigger false alerts, though this is not necessarily different from threshold-based alerts anyway
  • recently observed anomalies can create “the new norm” and thus hide subsequent anomalies

Despite this, Algolerts have already proven very good and valuable in our own use of SPM – we’re slowly removing all our threshold-based alerts and switching to Algolerts, and we invite you to try them out as well.

Please send us your feedback and follow @sematext for updates.

On Centralizing Logs at Monitorama EU

I’m really excited to be attending Monitorama EU this week! I’ll give a talk about centralizing logs on Friday at 15:15. You can see the full schedule here.

Please tweet about On Centralizing Logs at Monitorama EU.

The talk is mainly about centralizing logs and storing them in Elasticsearch. It will begin with tips about using Elasticsearch for logs in production, so it runs fast and stays stable. There will be an introduction to Kibana 3, and then we’ll move on to indexing.

We’ll start the indexing part by clarifying the term syslog: is it a daemon? A log message format? A protocol for transferring logs? Spoiler alert: it can be any of the three, and there are options at every level. One of those options is to write syslog messages to Elasticsearch.
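
As a tiny illustration of that last option: once a syslog message is rendered as JSON, it can be POSTed straight to Elasticsearch.  The sketch below does that with nothing but the JDK; the index name, type name, and field names are made up, and it assumes Elasticsearch is listening on localhost:9200:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class IndexSyslogEvent {
    public static void main(String[] args) throws Exception {
        // A syslog-like event rendered as JSON; index and type names here are just examples.
        String json = "{\"@timestamp\":\"2013-09-19T15:15:00Z\","
                + "\"host\":\"web01\",\"severity\":\"info\","
                + "\"program\":\"nginx\",\"message\":\"GET /index.html 200\"}";

        // POSTing to /<index>/<type> lets Elasticsearch assign a document id.
        URL url = new URL("http://localhost:9200/logs-2013.09.19/syslog");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Elasticsearch responded with HTTP " + conn.getResponseCode());
    }
}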

Then, we’ll move to Logstash and describe a few typical deployments. We’ll end by introducing alternatives to the setup described above: from our Logsene to the Flume + Solr combination.

If there’s enough time, I have some extra slides with tips about configuring rsyslog for processing lots of messages. Think 100K+ or even 1M+ messages per second, depending on the hardware and configuration.

Below is a sketchnote of the whole talk, which will be printed and given to participants. Click on the image to get the full resolution.

For the occasion, Sematext is giving a 20% discount on all SPM applications. The discount code is MONEU2013.

Also, Manning is giving a 44% discount on Elasticsearch in Action and all the other books on their website. The discount code is mlmoneu13cf.

What’s New in SPM 1.13.0

We pushed a new SPM release to production this morning and it’s loaded with goodies.  Here is a quick run-down of a few interesting ones; the slightly longer version can be found in the SPM Changelog:

PagerDuty integration. If you are a PagerDuty user, your alerts from SPM can now go to your PagerDuty account where you can handle them along with all your other alerts.

Ruby & Java libraries for Custom Metrics.  We open-sourced sematext-metrics, a Ruby gem for sending Custom Metrics to SPM, as well as a sematext-metrics library for doing the same from Java.

Coda Metrics & Ruby Metriks support.  We open-sourced sematext-metrics-reporter, a reporter for Coda’s Metrics library that sends Custom Metrics to SPM from Java, Scala, Clojure, and other JVM-based apps, and we’ve done the same for Metriks – the Ruby equivalent of Coda’s Metrics library.
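
If you have not used Coda’s Metrics library before, the pattern looks roughly like the sketch below: metrics live in a registry and a reporter periodically ships them somewhere.  Here a plain ConsoleReporter stands in for the reporter, and the class and metric names are made up; attaching sematext-metrics-reporter to the registry would follow the same general pattern:

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

import java.util.concurrent.TimeUnit;

public class CustomMetricsExample {
    public static void main(String[] args) throws InterruptedException {
        MetricRegistry registry = new MetricRegistry();

        // A periodic reporter; a reporter that ships metrics to SPM would be
        // attached to the registry in a similar fashion.
        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry).build();
        reporter.start(10, TimeUnit.SECONDS);

        Timer requests = registry.timer("checkout.requests");
        for (int i = 0; i < 100; i++) {
            Timer.Context ctx = requests.time();
            Thread.sleep(5);          // stand-in for real work being measured
            ctx.stop();
        }
    }
}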

Puppet metrics.  We begged James Turnbull to marry Puppet and SPM and write a Puppet report processor that sends each of the metrics generated by a Puppet run to SPM, which he did without us having to buy him drinks… yet.

Performance.  We’ve done a bit of work in the layer right behind the UI to make the UI a little faster.

CentOS 5.x support.  Apparently a good number of people still use CentOS 5.x, so we’ve updated the SPM client to work with it.  You can grab it from the SPM Client page.

@sematext

What’s New in SPM 1.12.0

We’ve been very heads down since our last official release of SPM, as you can see from our SPM Changelog.  Here is some more info about new things you’ll find in SPM:

  • For the impatient – there is now a demo user (click on this and you will be logged in as the demo user).  This lets you into both SPM and Search Analytics even if you don’t have an account, so you can see reports for various types of SPM Apps as well as what Search Analytics reports are like.
  • The SPM Client (aka SPM Agent) can now be installed as an RPM or a Debian package – check the SPM client page.  Until now, the installer was written completely in Bash, but using Jordan Sissel’s fpm we were able to easily put together SPM Client packages.  Moreover, you can now easily install the SPM Client on Redhat, CentOS, Fedora, Debian, Ubuntu, SuSE, Amazon Linux AMI, and maybe some other smaller distros we didn’t get to test.  If you try it on some other distro, please let us know if it worked or if you had issues, so we can help!
  • SPM for Hadoop can now be used to monitor Hadoop HDFS, MapReduce, and YARN clusters.  We have tested SPM for Hadoop with CDH (Cloudera Hadoop Distribution), but it should work just as well with Apache Hadoop, HDP (Hortonworks Data Platform), and perhaps MapR’s Hadoop distro as well.  Of course, you can use the same Sematext Apps account to monitor any number of clusters, of any type, so it’s extremely easy to switch from looking at your Hadoop metrics to HBase metrics to Solr or ElasticSearch metrics.  We are working on expanding monitoring support to other technologies.  Tell us what you would like SPM to monitor for you!
  • SPM users already loved the Overview panel in SPM (see Screenshots), but the new Custom Dashboards are even cooler!  Here are some things to know about the new SPM Dashboards:
    • You can have any number of them and you can name them – there are no limits
    • You can add any SPM graph to any of your existing dashboards and it’s super easy to create a new dashboard when you realize you want the graph you are looking at on a new dashboard
    • You can drag and drop dashboard panels wherever you want, resize them, and they’ll nudge other panels around and snap onto a neat, invisible grid
    • You can add graphs from any number of different SPM applications to the same dashboard.  So you can create a dashboard that has all the graphs that are important to you and your application(s), and combine metrics from different (types of) apps in a single view.  For example, you can have a dashboard that shows the number of HBase compactions as well as ElasticSearch index merging stats on the same dashboard, right next to each other.
    • Not only can you mix and match graphs from different SPM Apps on the same dashboard, but if you are a Search Analytics user you can also have Search Analytics graphs right next to your performance graphs!  That’s powerful!
    • And for those who discovered Logsene, our soon to be announced Data & Log Analytics service, you can imagine how eventually SPM & Logsene will be able to play together and share the same dashboards!
  • SPM Clients do a good job of gathering server and application metrics for applications covered by SPM.  But what if you want to monitor something else in addition to what SPM collects and graphs?  Or what if you want to feed in some non-performance data – say some user engagement metrics or some other Business KPI?  Well, you can do that now!  We’ve added Custom Metrics to all plans, and this addition is free of charge!  And guess what?  You can build graphs with your custom metrics and put them on as many of your Dashboards as you want, so you could potentially have your KPIs shown next to your Performance Metrics, next to your Log Analytics, next to your Search Analytics!
  • To help SPM users feed custom metrics into SPM we’ve released and open-sourced the first Sematext Metrics library for Java, with libraries for other languages to follow. Improvements welcome, pull requests welcome, support for other languages super welcome!
  • You can now share, tweet, and embed your SPM graphs as you please.  Next to each graph you will see a little “share” icon that opens a dialog window where you can grab the displayed short link, which you can tweet or email.  You’ll also see an HTML snippet that you can copy and paste into your blog or your Wiki.  The shared or embedded widget is cool because you can select very specific filters in SPM and the shared/embedded graph will remember them.  Furthermore, you can choose to have your graph show a specific, fixed time range or, alternatively, always show the last N minutes, hours, or days.  This can be quite handy for sharing with team members who may not have access to SPM for one reason or another, but do want access to these graphs – think C*Os or anyone else who needs to be fed graphs and graphs, and some more graphs.
  • Here is a cool and very handy feature that was suggested by George Stathis from Traackr, one of our early SPM users:  If you are a developer in charge of your application, your servers, your cluster, what happens when something breaks or just starts working poorly?  Many people turn to mailing lists to seek help from the open source community.  Experts on these mailing lists often ask for more information so they can provide better help.  Very often this information is in the form of graphs of various performance metrics.  So if you look at SPM you will see a little ambulance icon that, when clicked, puts together an email template that includes a short, public link to whichever graph’s ambulance icon you clicked on.  You can then incrementally keep adding links to other graphs to this email and, when you are ready, email it without leaving SPM.  This is incredibly easy and it makes it a breeze to put together an email that has a bunch of informative graphs in it.  This means you don’t have to go open your email client, you don’t have to type in the mailing list address, and you don’t have to jump between the email you are composing and SPM while you copy and paste links from SPM into your email client.  And while this functionality is really handy when you need help from experts and you want to give them as much information about your system as you can so they can help you better, you can also simply change where this email will get sent and send it to whoever you want.
  • We listen to our users really carefully (see above!).  Some SPM users would like to keep their data longer, so we’ve added Pro Silver – a new plan with longer data retention, more metrics, more everything.
  • We’ve improved SPM for ElasticSearch so it can now collect metrics even from non-HTTP nodes.
  • We have made SPM Sender, the component responsible for sending all the metrics over to us for processing, more resilient and even lighter than before.

We want to hear what you think and what you want!  Please use comments to leave suggestions, requests, and feedback, or simply email us or tweet @sematext.
