Going to Be in Austin on April 2nd? Then Check Out BV:IO

Live or work in Austin?  Like small conferences filled with smart, interesting technical people, a roster of great speakers, and innovation everywhere you look?  Great — you’ll fit right in a Bazaarvoice’s first ever public technical conference and hackathon to drive innovation in the social commerce space.  Get all the BV:IO event details here.

And since you are reading this blog, there’s a good chance you know about our founder and CEO, Otis Gospodnetic, and his expertise with all things Search and Big Data.  Otis has been invited to speak, and he goes on at 1:30 pm on Wednesday, April 2nd.  Otis will speak about “Open Source Search Evolution” and he’ll be available before and after the talk at the Sematext sponsor table to say hello and talk about SPM, Logsene, Site Search Analytics, Solr, Elasticsearch, Hadoop, NYC vs. Austin tech scenes, Brooklyn Lager vs. Lone Star…and whatever else you bring to our table.

If you’re thinking of attending BV:IO drop us a line at mick.emmett@sematext.com.  Hope to see you there!

Announcement: Percentiles added to SPM

In the spirit of continuous improvement, we are happy to announce that percentiles have recently been added to SPM’s arsenal of measurement tools.  Percentiles provide more accurate statistics than averages, and users are able to see 50%, 95% and 99% percentiles for specific metrics and set both regular threshold-based as well as anomaly detection alerts.  We will go more into the details about how the percentiles are computed in another post, but for now we want to put the word out and show some of the related graphs — click on them to enlarge them.  Enjoy!

Elasticsearch – Request Rate and Latency

pecentiles_es

Garbage Collectors Time

percentiles_gc

Kafka – Flush Time

percentiles_kafka_1

Kafka – Fetch/Produce Latency 1

percentiles_kafka_2

Kafka – Fetch/Produce Latency 2

percentiles_kafka_3

Solr Req. Rate and Latency 1

percentile_solr

Solr – Req. Rate and Latency 2

percentiles_solr_2

If you enjoy performance monitoring, log analytics, or search analytics, working with projects like Elasticsearch, Solr, HBase, Hadoop, Kafka, Storm, we’re hiring planet-wide!

Announcement: Redis Monitoring in SPM

Don’t worry, we didn’t just stop at Storm monitoring and metrics while improving SPM.  We’re also happy to announce support for Redis.

Specifically, here are some of the key Redis metrics SPM monitors:

  • Used Memory
  • Used Memory Peak
  • Used Memory RSS
  • Connected Clients
  • Connected Slaves
  • Master Last IO Seconds Ago
  • Keyspace Hits
  • Keyspace Misses
  • Evicted Keys
  • Expired Keys
  • Commands Processed
  • Keys count per db
  • To be expired keys count per db

Also, for all application types users can add alerting rules, heartbeat alerts, and Algolerts, as well as receive emails with performance reports for a given time period.

Enough with the words, these are what the graphs look like — click them to enlarge them:

Redis-Overview

Redis-Overview

Redis-Memory

Redis-Memory

Used memory/Used memory peak/Used memory RSS chart

Redis-Keyspace-Hits

Redis-Keyspace-Hits

Keyspace Hits chart

Redis-Expiring-Keys

Redis-Expiring-Keys

Expiring Keys chart

Redis-Evicted-Keys

Redis-Evicted-Keys

Evicted Keys chart

And we’re not done.  Watch this space for more SPM updates coming soon…

Give SPM a spin – it’s free to get going and you’ll have it up and running, graphing all your Redis metrics in 5 minutes!

If you enjoy performance monitoring, log analytics, or search analytics, working with projects like Elasticsearch, Solr, HBase, Hadoop, Kafka, Storm, we’re hiring planet-wide!

Announcement: Apache Storm Monitoring in SPM

There has been a “storm” brewing here at Sematext recently.  Fortunately this has nothing to do with the fierce winter weather many of us are experiencing in different parts of the globe — it’s actually a good kind of storm!  We’ve gotten a lot of requests to add Apache Storm support to SPM and we’re please to say that is now a reality.  SPM can already monitor Kafka, ZooKeeper, Hadoop, Elasticsearch, and more. As a matter of fact, we’ve just announced Redis monitoring, too!

Here’s why you should care:

  1. SPM users can see different Storm metrics in dynamic , real-time graphs, a big improvement from the standard Storm UI which only allows some time-specific snapshots.  Isn’t it better to see trends as opposed to static snapshots?  We certainly think so.
  2. SPM users can create an external link and share their charts with others (like a Mailing List or in a blog post) to get insight into problems without having to provide login credentials.  Here’s an example (you will see the chart even though you don’t know UN/PW):  https://apps.sematext.com/spm-reports/s/aQjuv5GdC1
  3. SPM also provides its users with common System and JVM-related metrics like CPU usage, memory usage, JVM heap size and pool utilization, among others.  This lets you troubleshoot performance issues better by allowing you to correlate  Storm-specific metrics with common System and JVM metrics.

Here are the Storm metrics SPM can now monitor:

  • Supervisors count
  • Topologies count
  • Supervisor total/free/used slots count
  • Topology workers/executors/tasks count
  • Topology spouts/bolts/state spouts count
  • Bolt emitted/transferred events
  • Bolt acked/executed/failed events
  • Bolt executed/processed latencies
  • Spout emitted/transferred events
  • Spout acked/failed events
  • Spout complete latency

Also important to note — users can add alerting rules for all metrics, including Algolerts and heartbeat alerts, as well as receive daily, weekly, and monthly performance reports via email.

Here are some of the graphs — click on them to see larger versions:

Overview

For observing the general state of the system

For observing the general state of the system

Acked-Failed Decrease

Check out how "acked" (blue line) decreased. It may be related to some problems with resources (e.g., CPU load)

Do you see how “acked” (blue line) decreased? It may be related to some problems with resources (e.g., CPU load)

Timing-Increased

Timing-Tncreased

Check out this “Timing” chart: see the spike at ~13:21? It seems that something is up with the CPU (again); it might be the “pressure” from Java GC (Garbage Collector)

Start-Topology-Workers

Start-Topology-Workers

On the first chart you can see how the counts of tasks and workers grew.  It is because a new topology (“job” in Storm terminology) started at 12:25.

Start-Topology

Start-Topology

The same as above: you can see that between 12:00 and 12:30 Storm Supervisor was restarted (something that works on each machine inside the cluster) and topology was added after restarting.

Give SPM a spin – it’s free to get going and you’ll have it up and running, graphing all your Storm metrics in 5 minutes!

If you enjoy performance monitoring, log analytics, or search analytics, working with projects like Elasticsearch, Solr, HBase, Hadoop, Kafka, Storm, we’re hiring planet-wide!

Announcement: New goodness in SPM

We don’t typically announce new SPM, Logsene, or Search Analytics releases, but yesterday’s release calls for an exception.  Logsene release deserves its own post, so we’ll post that separately.  For a quick rundown you can jump over to SPM Changelog. This blog has a bit more descriptive info.

The most visible change in SPM is the whole new, much more modern UI based on Bootstrap. Yes, we have designer(s) on our team now!  You can now much more seamlessly switch between your SPM, Logsene, and Search Analytics apps and the whole experience should feel a lot smoother.  Dashboards were previously fairly hidden, but should now gain visibility.  The “Common” part of SPM, Logsene, and Search Analytics, what we internally call “SUA”, has been radically changes to make navigation much simpler.  While we’ve made lots of UI/UX changes in this release, you’ll see us improving the UI/UX going forward, too.  Please tell us (e.g. leave a comment here) what you think about the new UI, good and bad stuff, and tell us what sort of user experience you’d like to get from SPM!  While the new UI is impossible to miss, there is more in this release:

  • We’ve expanded SPM integration to Redis and Apache Storm.  SPM can now monitor both Redis and Storm and alert you on any of their metrics. This is in addition to monitoring Solr and SolrCloud, Elasticsearch, Hadoop, HBase, Kafka, ZooKeeper, Sensei, JVM, System, and Custom metrics.  Don’t forget to tell us what you want to monitor!
  • More security-sensitive SPM users asked if they could hide their hostnames, which led to the new hostname aliasing/obfuscation feature.  See Can hostnames in SPM be obfuscated or customized? in SPM FAQ.  This is really handy not only because it avoids sending hostnames over the network, but because it lets you specify nice, user-friendly nicknames/aliases for them, so you know which host is which in SPM.
  • When we announced Algolerts a couple of months ago we pointed out a few known kinks.  We’ve taken care of a couple of them in this release.  This boils down to being smart about recognizing regular metric variations and not confusing them with actual anomalies, as well as not missing anomalies that were until now masked by preceding anomalous patterns.
  • We’ve improved the SPM Client, which now loads in a separate classloader from the application in monitors when launched in embedded mode.  This avoids any potential conflicts between libraries included in the SPM Client and those loaded in the monitored application’s process.

If you enjoy performance monitoring, log analytics, or search analytics, working with projects like Elasticsearch, Solr, HBase, Hadoop, Kafka, Storm, we’re hiring planet-wide!

Logstash Performance Monitoring (1.2.2 vs 1.2.1)

We’re using Logstash to collect some of our own logs here at Sematext, because it can easily forward events to Logsene via the elasticsearch_http output. And we recently upgraded to the latest version 1.2.2.

Since we’re obsessed with performance monitoring, the first question that came up was: did the upgrade make any difference in terms of load? So we did a test to find out.

Context

Logstash runs on a JVM, so we’re already monitoring it with SPM for Java Apps.  So we just put it through a steady, moderate logs of about 70 events per second for a while, running both 1.2.1 and 1.2.2, on the same machine.

The configuration remained the same on both machines:

  • a file input that was tailing a single file using the multiline codec
  • a couple of grok filters and a geoIP filter, that we’ll talk about in later posts
  • resulting events are fed to Logsene through the elasticsearch_http output, because Logsene exposes the Elasticsearch API

Results

The biggest difference we’ve seen was in memory usage. The new version uses about 30% less memory:

logstash_spm_memory

Next, there was a significant difference in the amount of garbage collection going on. Again, some 30% difference in favor of 1.2.2:

logstash_spm_gc

We also had slightly less CPU usage. The difference wasn’t significant in our case because we didn’t have a lot of traffic: Logstash is installed on every host and there’s no “central” Logstash to process lots of data. I’m sure that if we had a more complex configuration and/or more traffic, we would have seen much less CPU.

Conclusions

Logstash is getting lighter and lighter, which seems to address its only criticism. You definitely get less memory usage and less garbage collection, so we definitely recommend upgrading to 1.2.2. And if you want to monitor its usage, you can always use our SPM. Happy stashing!

Announcement: JVM Memory Pool Monitoring

Raise your hand if you’ve never experienced the dreaded OutOfMemoryError (aka OOM or OOME) while working with Java.  If you’ve raised your hand count yourself lucky.  The vast majority of Java developers are nowhere near that lucky.  Today, SPM is making the lives of all Java developers a little better by adding JVM Memory Pool Monitoring reports to SPM to the existing JVM Heap, Threading, and Garbage Collection reports.  Note: you’ll need to get the new SPM client (version 1.16 or newer) to see these reports.

Please tweet about JVM Memory Pool Monitoring in SPM

What are JVM Memory Pools (aka Spaces)

The super simplified version of this complex topic is that inside the JVM there are different memory pools (aka spaces) that the JVM uses for different types of objects.  The JVM uses some of these pools to store permanent bits, others for storing young objects, others for storing tenured objects, and so on.  Numerous blog posts and documentation has been written on the topic of the inner workings of the JVM.  It’s a complex topic that is not easy to fully grook in one sitting.  To make the story more complex, different Garbage Collectors and certainly different JVM implementations manage objects differently.  Thus, which exact memory pools you see in SPM will depend and change as you change the JVM or select a different garbage collection algorithms or garbage collection-related parameters.  And this is precisely one case where seeing these pools in SPM comes reeeeeally handy – seeing how memory pool sizes and usage changes as you try different Garbage Collectors or any other JVM parameter can be very informative, educational, and can bring insight into the workings of the JVM that let you select parameters that are optimal for your particular application.

The new JVM Memory Pools reports are available to all SPM users right away.  Let’s have a look at what these new reports look like and what information they provide.

Memory Pool Sizes

This report should be obvious to all Java developers who know about how JVM manages memory.    Here we see relative sizes of all memory spaces and their total size.  If you are troubleshooting performance of the JVM (i.e., any Java application) this is one of the key places to check first, in addition to looking at Garbage Collection and Memory Pool Utilization report (see further below).  In this graph we see a healthy sawtooth pattern clearly showing when major garbage collection kicked in.

spm-jvm-memory-pool-sizes

JVM Memory Pool Size

Memory Pool Utilization

The Memory Pool Size graph is useful, but knowing the actual utilization of each pool may be even more valuable, especially if you are troubleshooting the OutOfMemoryError we mentioned earlier in the post.  Java OOME stack traces don’t often tell you much info about where the OOME happened.  The Memory Pool Utilization graph shows what percentage of each pool is being used over time.  When some of these Memory Pools approach 100% utilization and stay there, it’s time to worry.  When that happens, if you jump over to the Garbage Collection report in SPM you will probably see spikes there as well.  And if you then jump to your CPU report in SPM you will likely see increased CPU usage there, as the JVM keeps trying to free up some space in any pools that are (nearly) full.

spm-jvm-memory-pool-utilization

JVM Memory Pool Utilization

Alerting for Memory Pool before OOME

If you see scary spikes or near 100% utilization of certain memory pools, your application may be in trouble.  Not dealing with this problem, whether through improving the application’s use of memory or giving the JVM more memory via -Xmx and related parameters, will likely result in a big bad OOME.  Nobody wants that to happen.  One way to keep an eye on this is via Alerts in SPM.  As you can see from the graphs, memory utilization naturally and typically varies quite a bit.   Thus, although SPM has nice Algolerts, to monitor utilization of JVM memory pools we recommend using standard threshold-based Alerts and getting alerted when utilization percentage is > N% for M minutes.  N and M are up to you, of course.

Please tell us what you think – @sematext is always listening!  Is there something SPM doesn’t monitor that you would really like to monitor?  Please vote for tech to monitor!

Announcement: SPM Performance Monitoring for Kafka

We are happy users of Apache Kafka.  We make heavy use of it in SPM, in Search Analytics, and in Logsene.  We knew Kafka was fast when we decided to use it.  But it wasn’t until we added Kafka Performance Monitoring support to SPM that we saw all Kafka performance metrics at a glance and admired its speed.  We are happy to announce that in addition to being able to monitor Solr/SolrCloud, Elasticsearch, Hadoop HDFS and MapReduce, HBase, SenseiDB, JVM, RedHat, Debian, Ubuntu, CentOS, SuSE, etc. you can now also monitor Apache Kafka!

Please tweet about Performance Monitoring for Kafka

Here’s a glimpse into what SPM for Kafka provides – click on the image to see the full view or look at the actual SPM live demo:

SPM for Kafka Overview

SPM for Kafka Overview

Of course, SPM can also alert you on any Kafka metric and can do so directly via email or via PagerDuty, using either traditional threshold-based alerts or our recently announced Algolerts.  SPM supports Kafka 0.7.x, and we’ll be adding support for Kafka 0.8.x shortly.

Please tell us what you think – @sematext is always listening!  Is there something SPM doesn’t monitor that you would really like to monitor?  Please vote for tech to monitor!

Introducing Algolerts – Algorithmic Anomaly Detection Alerts

It is not every day that you come across a new term or concept Google doesn’t yet know about.  So today we’ll teach Google about something new we’ve added to SPM in the latest release: Algolerts.

Please tweet about Algolerts – Algorithmic Anomaly Detection Alerts

The Problem with Threshold-based Alerts

Why do we even have alerts in performance monitoring systems?  We have them because we want to be notified when something bad happens, when some metric spikes or dips too much – when CPU usage hits the roof, when disk IO goes up, when the network traffic suspiciously quiets down, and so on.  We see such spikes or dips in metric values as signs that something might be wrong or is about to go wrong.  When limited to traditional threshold-based alerts one is forced to figure out what range of metric values represents a non-alarming, normal state and, conversely, at which point spikes and dips should be considered out of an acceptable range and taken seriously.  One needs to pick minimum and maximum metric values and then create one alert rule for each such value.  The same process then needs to be repeated for every metric one wants to monitor.  This is painful and time-consuming.  To make things worse, these thresholds have to be regularly updated to match what represents the new normal!  One can try to fight this by setting very “loose alerts” by picking high maxima and low minima, but then one risks not getting alerted when something really does go awry.

To summarize:

  • It is hard to estimate the normal range of each metric and pick min and max thresholds
  • Metric values often fluctuate and create false alerts
  • To avoid false alerts one has to regularly adjust alert rule thresholds

Algolerts to the Rescue!

With the name obviously derived from terms Algorithm and Alert, Algolerts are SPM’s alternative, or perhaps even a replacement for traditional threshold-based alerts you so often see in most, if not all monitoring solutions.  Algolerts don’t require thresholds to figure out when to alert you.  Algolerts can watch any metric you tell them to watch and alert you when an anomalous pattern – a pattern that deviates from the norm – is detected.

Creating Algolerts is even simpler than adding threshold-based alerts and is done through a familiar interface:

SPM Algolert creation

SPM Algolert creation

Algolert notifications provide useful and easy to read numbers so one can quickly see just how big of an anomaly this is about.  Here is an example notification:

Anomalous value for 'received' metric has been detected for SPM Application SA.Prod.Kafka
Host filter=xxx, Network Interface filter=eth0.

Anomaly detection window size: 1800 seconds.

Statistics for 'received' metric are:
Current: 1,220,121.00
Average:   185,147.97
Median:     89,536.00
StdDev:    222,173.70

Known Kinks

Algolerts implementation that’s in place in SPM today has a few known kinks.  The kinks we know about and that we’ll be ironing out are:

  • no “things are OK again, you can go back to sleep” notifications are sent when the metric value goes back to normal
  • regular anomalies (e.g. a CPU intensive nightly cron job) may trigger false alerts, though this is not necessarily different from threshold-based alerts anyway
  • recently observed anomalies can create “the new norm” and thus hide subsequent anomalies

Despite this, Algolerts have already proven very good and valuable in our own use of SPM and Algolerts – we’re slowly removing all our threshold-based alerts and are switching to Algolerts and invite you to try them out as well.

Please send us your feedback and follow @sematext for updates.

On Centralizing Logs at Monitorama EU

I’m really excited to be attending Monitorama EU this week! I’ll give a talk about centralizing logs on Friday at 15:15 . You can see the full schedule here.

Please tweet about On Centralizing Logs at Monitorama EU.

The talk is mainly about centralizing logs and storing them in Elasticsearch. It will begin with tips about using Elasticsearch for logs in production, so it runs fast and stable. There will be an introduction to Kibana 3 and then we’ll move to indexing.

We’ll start the indexing part by clarifying the term syslog: is it about a daemon? is it about a log message format? is it about a protocol for transferring logs? Spoiler alert: it can be any of the three, and there are options at every level. And one of them is to write syslog messages to Elasticsearch.

Then, we’ll move to Logstash and describe a few typical deployments. We’ll end by introducing alternatives to the setup described above: from our Logsene to the Flume + Solr combination.

If there’s enough time, I have some extra slides with tips about configuring rsyslog for processing lots of messages. Think 100K+ or even 1M+ messages per second, depending on the hardware and configuration.

Below is a sketchnote of the whole talk, which will be printed and given to participants. Click on the image get the full resolution.

sketchnoteFor the occasion, Sematext is giving a 20% discount for all SPM applications. The discount code is MONEU2013

Also, Manning is giving a 44% discount for Elasticsearch in Action and all the other books from their website. The discount code is mlmoneu13cf

Follow

Get every new post delivered to your Inbox.

Join 1,564 other followers