Announcement: Percentiles added to SPM

In the spirit of continuous improvement, we are happy to announce that percentiles have recently been added to SPM’s arsenal of measurement tools.  Percentiles provide more accurate statistics than averages, and users can now see the 50th, 95th, and 99th percentiles for specific metrics and set both regular threshold-based alerts and anomaly detection alerts on them.  We will go into more detail about how the percentiles are computed in another post, but for now we want to put the word out and show some of the related graphs – click on them to enlarge them.  Enjoy!
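
To make the idea concrete, here is a minimal Python sketch of nearest-rank percentiles over a small latency sample (an illustration only, not how SPM computes them – that will be covered in the follow-up post). Note how the 95th and 99th percentiles surface the outliers that an average would smooth over:

# Illustration only: nearest-rank percentiles over a latency sample.
def percentile(samples, p):
    """Return the p-th percentile (0 < p <= 100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, int(round(p / 100.0 * len(ordered))))  # nearest-rank method
    return ordered[rank - 1]

latencies_ms = [12, 15, 11, 210, 14, 13, 16, 950, 17, 12]  # made-up values
for p in (50, 95, 99):
    print("p%d = %d ms" % (p, percentile(latencies_ms, p)))
# p50 = 14 ms, p95 = 950 ms, p99 = 950 ms; the average (~127 ms) hides the spikes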

Elasticsearch – Request Rate and Latency

Garbage Collectors Time

Kafka – Flush Time

Kafka – Fetch/Produce Latency 1

Kafka – Fetch/Produce Latency 2

Solr – Req. Rate and Latency 1

Solr – Req. Rate and Latency 2

If you enjoy performance monitoring, log analytics, or search analytics, and working with projects like Elasticsearch, Solr, HBase, Hadoop, Kafka, and Storm, we’re hiring planet-wide!

Rsyslog 8.1 Elasticsearch Output Performance

Recently, the first rsyslog version 8 release was announced. Major changes in its core should give outputs better performance, and the one for Elasticsearch should benefit a lot. Since we’re using rsyslog and Elasticsearch in Sematext‘s own Logsene, we had to take the new version for a spin.

The Weapon and the Target

For testing, we used a good old i3 laptop with 8GB of RAM. We generated 20 million logs, sent them to rsyslog via TCP, and from there to Elasticsearch in the Logstash format, so they could be explored with Kibana. The objective was to stuff as many events per second into Elasticsearch as possible.

Rsyslog Architecture Overview

In order to tweak rsyslog effectively, one needs to understand its architecture, which is not that obvious (although there’s an ongoing effort to improve the documentation). The gist of its architecture is represented in the figure below.

  • you have input modules taking messages (from files, network, journal, etc.) and pushing them to a main queue
  • one or more main queue threads take those events and parse them. By default, they parse syslog formats, but you can configure rsyslog to use message modifier modules to do additional parsing (e.g. CEE-formatted JSON messages). Either way, this parsing generates structured events, made up of properties
  • after parsing, the main queue threads push events to the action queue (or queues, if there are multiple actions and you want to fan out)
  • for each defined action, one or more action queue threads take properties from events according to templates and make messages that will be sent to the destination. In Elasticsearch’s case, the template should produce Elasticsearch JSON documents, and the destination is the REST API endpoint

rsyslog message flow

There are two more things to say about rsyslog’s architecture before we move on to the actual test:

  • you can have multiple independent flows (like the one in the figure above) in the same rsyslog process by using rulesets. Think of rulesets as swim-lanes. They’re useful for example when you want to process local logs and remote logs in a completely separate manner
  • queues can be in-memory, on disk, or a combination called disk-assisted. Here, we’ll use in-memory because they’re the fastest. For more information about how queues work, take a look here

Configuration

To generate messages, we used tcpflood, a small and light tool that’s part of rsyslog’s testbench. It generates messages and sends them over to the local syslog via TCP.
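
If you don’t have rsyslog’s testbench at hand, a few lines of Python can play a similar (if much slower) role for smoke testing; this sketch simply opens a TCP connection to the port imtcp listens on (13514, matching the configuration below) and writes newline-delimited syslog messages:

# Toy load generator: send newline-delimited syslog messages over TCP.
# Much slower than tcpflood; useful only as a quick sanity check.
import socket
import time

HOST, PORT = "127.0.0.1", 13514   # must match the imtcp input port below
COUNT = 10000

with socket.create_connection((HOST, PORT)) as sock:
    for i in range(COUNT):
        line = "<134>%s test-host test-app: message number %d\n" % (
            time.strftime("%b %d %H:%M:%S"), i)
        sock.sendall(line.encode("utf-8"))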

Rsyslog received those messages with the imtcp input module, queued them, and forwarded them to Elasticsearch 0.90.7, which was also installed locally. We also tried with Elasticsearch 1.0 Beta 1 and got the same results (see below).

The flow of messages in this test is represented in the following figure:

Flow of messages in this test

The actual rsyslog config is listed below. It can be tuned further (for example by using the multithreaded imptcp input module), but we didn’t get significant improvements.

module(load="imtcp")   # TCP input module
module(load="omelasticsearch") # Elasticsearch output module

input(type="imtcp" port="13514")  # where to listen for TCP messages

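# main message queue tuning (legacy-style directives)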
$MainMsgQueueSize 1000000
$MainMsgQueueDequeueBatchSize 1000
$MainMsgQueueWorkerThreads 2

# template to generate JSON documents for Elasticsearch
template(name="plain-syslog"
         type="list") {
           constant(value="{")
             constant(value="\"@timestamp\":\"")      property(name="timereported" dateFormat="rfc3339")
             constant(value="\",\"host\":\"")        property(name="hostname")
             constant(value="\",\"severity\":\"")    property(name="syslogseverity-text")
             constant(value="\",\"facility\":\"")    property(name="syslogfacility-text")
             constant(value="\",\"syslogtag\":\"")   property(name="syslogtag" format="json")
             constant(value="\",\"message\":\"")    property(name="msg" format="json")
             constant(value="\"}")
         }

action(type="omelasticsearch"
       template="plain-syslog"  # use the template defined earlier
       searchIndex="test-index"
       bulkmode="on"
       queue.dequeuebatchsize="5000"   # ES bulk size
       queue.size="100000"
       queue.workerthreads="5"
       action.resumeretrycount="-1"  # retry indefinitely if ES is unreachable
)

You can see from the configuration that:

  • both main and action queues have a defined size, in number of messages
  • both have a number of worker threads that deliver messages to the next step. The action queue needs more threads because they have to wait for Elasticsearch to reply
  • messages move out of the queues in batches. For the Elasticsearch output, each batch of messages is sent through the Bulk API, which makes queue.dequeuebatchsize effectively the bulk size (see the sketch below)
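
For reference, here is roughly what such a bulk request looks like on the wire: one action line plus one document line per event, newline-delimited, POSTed to the _bulk endpoint. This is an illustrative Python sketch, not omelasticsearch’s code; the type name "events" and the field values are assumptions:

# Rough illustration of an Elasticsearch Bulk API request for two events.
import json
import urllib.request

events = [
    {"@timestamp": "2013-11-26T10:00:00+02:00", "host": "test-host",
     "severity": "info", "facility": "local0",
     "syslogtag": "test-app:", "message": "message number 1"},
    {"@timestamp": "2013-11-26T10:00:01+02:00", "host": "test-host",
     "severity": "info", "facility": "local0",
     "syslogtag": "test-app:", "message": "message number 2"},
]

lines = []
for event in events:
    # action metadata line, then the document itself
    lines.append(json.dumps({"index": {"_index": "test-index", "_type": "events"}}))
    lines.append(json.dumps(event))
body = ("\n".join(lines) + "\n").encode("utf-8")   # trailing newline is required

urllib.request.urlopen(
    urllib.request.Request("http://localhost:9200/_bulk", data=body))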

Results

We started with default Elasticsearch settings. Then we tuned them to leave rsyslog with a more significant slice of the CPU. We measured the indexing rate with SPM. Here are the average results over 20 million indexed events:

  • with default Elasticsearch settings, we got 8,000 events per second
  • after setting Elasticsearch up in a more production-like way (5-second refresh interval, increased index buffer size, translog thresholds, etc.), throughput went up to an average of 20,000 events per second
  • in the end, we went berserk: we used in-memory indices and updated the mapping to disable any storing or indexing for any field, so Elasticsearch would do as little work as possible and leave room for rsyslog. That got us an average of 30,000 events per second. In this scenario, rsyslog was using between 1 and 1.5 of the 4 virtual CPU cores, with tcpflood using 0.5 and Elasticsearch using 2 to 2.5

Conclusion

20K EPS on a low-end machine with production-like configuration means Elasticsearch is quick at indexing. This is very good for logs, where you typically have lots of messages being generated, compared to how often you search.

If you need some tool to ship your logs to Elasticsearch with minimum overhead, rsyslog with its new version 8 may well be your best bet.


NOTE: We’re hiring, so if you find this interesting, please check us out. Search, logging and performance is our thing!

Video: Using Solr for Logs with Rsyslog, Flume, Fluentd and Logstash

A while ago we published the slides from our talk at Lucene Revolution about using Solr for indexing and searching logs. This topic is of special interest for us, since we’ve released Logsene and we’re also offering consulting services for logging infrastructure. If you’re also into working with search engines or logs, please note that we’re hiring worldwide.

The video for that talk is now available, and you can watch it below. The talk is made up of three parts:

  • one that discusses the general concepts of what a log is, structured logging and indexing logs in general, whether it’s Solr or Elasticsearch
  • one that shows how to use existing tools to send logs to Solr: Rsyslog and Fluentd to send structured events (yes, structured syslog!); Apache Flume and Logstash to take unstructured data, make it structured via Morphlines and Grok, and then send it to Solr
  • one that shows how to optimize Solr’s performance for handling logs, from tuning the commit frequency and merge factor to using time-based collections with aliases

Announcement: ZooKeeper Performance Monitoring in SPM

You don’t see him, but he is present.  He is all around us.  He keeps things running.  No, we are not talking about Him, nor about The Force.  We are talking about Apache ZooKeeper, the under-appreciated, often not talked-about, yet super-critical component of almost all distributed systems we’ve come to rely on – Hadoop, HBase, Solr, Kafka, Storm, and so on.  Our SPM, Search Analytics, and Logsene all use ZooKeeper, and we are not alone – check our ZooKeeper poll.

We’re happy to announce that SPM can now monitor Apache ZooKeeper!  This means everyone using SPM to monitor Hadoop, HBase, Solr, Kafka, Sensei, and other applications that rely on ZooKeeper can now use the same monitoring and alerting tool – SPM – to monitor their ZooKeeper instances.

Please tweet about Performance Monitoring for ZooKeeper

Here’s a glimpse into what SPM for ZooKeeper provides – click on the image to see the full view or look at the actual SPM live demo:

SPM for ZooKeeper Overview

Please tell us what you think – @sematext is always listening!  Is there something SPM doesn’t monitor that you would really like to monitor?  Please vote for tech to monitor!

Want to build highly distributed big data apps with us?  We’re hiring good engineers (not just for positions listed on our jobs page), and we’re sitting on a heap of some pretty juicy big data!

Presentation: Scaling Solr with SolrCloud

Squeezing the maximum possible performance out of Solr / SolrCloud and Elasticsearch, and making them scale well, is what we do on a daily basis for our clients.  We make sure their servers are optimally configured and maximally utilized.  Rafal Kuć gave a long, 75-minute talk on Scaling Solr with SolrCloud at the Lucene Revolution 2013 conference in Dublin. Enjoy!

If you are interested in working with Solr and/or Elasticsearch, we are looking for good people to join our team.

Video Presentation: On Centralizing Logs

You might have seen our PDF presentation from Monitorama that was published last week. Now, the video is available as well. You will be able to see more about tuning Elasticsearch’s configuration for logging. You’ll also learn what the various flavors of syslog are all about – and some tips for making rsyslog process hundreds of thousands of messages per second. And, of course, one can’t talk about centralizing logs without mentioning Kibana and Logstash.

If you like using these tools, you might want to check out our Logsene, which will do the heavy lifting for you. If you like working with them, we’re hiring, too.


For the occasion, Sematext is giving a 20% discount for all SPM applications. The discount code is MONEU2013.


Elasticsearch Refresh Interval vs Indexing Performance

Elasticsearch is near-realtime, in the sense that when you index a document, you need to wait for the next refresh for that document to appear in a search. Refreshing is an expensive operation, and that is why, by default, it’s done at a regular interval instead of after each indexing operation. This interval is defined by the index.refresh_interval setting, which can go either in Elasticsearch’s configuration or in each index’s settings. If you use both, index settings override the configuration. The default is 1s, so newly indexed documents will appear in searches after 1 second at most.

Because refreshing is expensive, one way to improve indexing throughput is by increasing refresh_interval. Less refreshing means less load, and more resources can go to the indexing threads. How does all this translate into performance? Below is what our benchmarks revealed when we looked at it through the SPM lens.
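
For an existing index, refresh_interval can be changed on the fly through the update index settings API. Here is a minimal Python sketch (the index name "logs" is made up for illustration):

# Minimal sketch: raise refresh_interval on an existing index via the
# update index settings API (index name "logs" is an assumption).
import json
import urllib.request

settings = json.dumps({"index": {"refresh_interval": "5s"}}).encode("utf-8")
req = urllib.request.Request("http://localhost:9200/logs/_settings",
                             data=settings, method="PUT")
urllib.request.urlopen(req)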

Please tweet about Elasticsearch refresh interval vs indexing performance

Test conditions

For this benchmark, we indexed Apache logs in bulks of 3,000, using 2 threads. Those logs went into one index, with 3 shards and 1 replica, hosted by 3 m1.small Amazon EC2 instances. The Elasticsearch version used was 0.90.0.

As indexing throughput is a priority for this test, we also made some configuration changes in this direction:

  • index.store.type: mmapfs, because memory-mapped files make better use of OS caches
  • indices.memory.index_buffer_size: 30%, increased from the default 10%. Usually, the larger the index buffer, the better, but we don’t want to overdo it
  • index.translog.flush_threshold_ops: 50000. This makes commits from the translog to the actual Lucene indices happen less often than with the default setting of 5000 operations (see the sketch after this list for how the per-index settings can be applied)
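
As a rough sketch, the per-index knobs above can be applied at index creation time along these lines (the index name "apache-logs" is an assumption; indices.memory.index_buffer_size is a node-level setting and belongs in elasticsearch.yml instead):

# Sketch: create the test index with the per-index settings listed above.
import json
import urllib.request

settings = json.dumps({
    "settings": {
        "index": {
            "number_of_shards": 3,
            "number_of_replicas": 1,
            "refresh_interval": "1s",                  # varied per run: 1s, 5s, 30s
            "store": {"type": "mmapfs"},
            "translog": {"flush_threshold_ops": 50000},
        }
    }
}).encode("utf-8")

req = urllib.request.Request("http://localhost:9200/apache-logs",
                             data=settings, method="PUT")
urllib.request.urlopen(req)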

Results

First, we indexed documents with the default refresh_interval of 1s. Within 30 minutes, 3.6M new documents were indexed, at an average of 2K documents per second. Here’s how indexing throughput looks in SPM for Elasticsearch:

refresh_interval: 1s

Then, refresh_interval was set to 5s. Within the same 30 minutes, 4.5M new documents were indexed, at an average of 2.5K documents per second. Indexing throughput increased by 25% compared to the initial setup:

refresh_interval: 5s

The last test was with refresh_interval set to 30s. This time, 6.1M new documents were indexed, at an average of 3.4K documents per second. Indexing throughput increased by 70% compared to the default setting, and by 36% compared to the previous scenario:

refresh_interval: 30s

Other metrics from SPM, such as load, CPU, disk I/O usage, memory, JVM or Garbage Collection, didn’t change significantly between runs. So configuring the refresh_interval really is a trade-off between indexing performance and how “fresh” your searches are. Which setting works best for you? It depends on the requirements of your use case. Here’s a summary of the indexing throughput values we got:

  • refresh_interval: 1s   – 2.0K docs/s
  • refresh_interval: 5s   – 2.5K docs/s
  • refresh_interval: 30s – 3.4K docs/s

G1GC vs. Concurrent Mark and Sweep Java Garbage Collector

Lately, inquiries about the G1GC Java Garbage Collector have been on the rise (e.g., see this search-lucene.com graph).  So last Friday we took one of our Search Analytics servers running HBase RegionServer and a semi-recent version of Oracle Java (build 1.7.0_07-b10) and switched it to G1 (-XX:+UseG1GC and no other G1-specific parameters).  Below are some before and after graphs taken from SPM, which is what we use to monitor all our servers, of course.  Click on them for full size.

G1GC Switch from CMS

Can you tell what time we made the switch?  If you cannot tell, have a look at the graph below for a more obvious and more dramatic before and after.

G1GC vs. CMS, GC Details

Based on the above graphs, we can see that, at least in this particular workload, G1GC does more collections, but they are smaller and thus take less time.  We’ll keep monitoring JVMs using G1, and if they continue to perform well, we’ll switch to them wherever applicable.  Here is one more look at the situation:

G1 vs. CMS Dashboard

On this SPM Dashboard you should compare the panel/graph on top (using CMS) with the panel/graph directly below it (G1).

In short, we are seeing somewhat more GC work done on the Young generation with G1, but a lot less work being done on the Old generation, and thus we see far fewer long pauses than with the CMS garbage collector!

If performance and/or monitoring excites you, come join us! If you want to learn more about Java GC and live in or near New York, register for Living with Garbage by Gregg Donovan, a Senior Software Engineer at Etsy.

Please tweet this post about G1GC.

Have you tried G1 Garbage Collector lately?  How did it work for you?

- @sematext

What’s New in SPM 1.13.0

We pushed a new SPM release to production this morning and it’s loaded with goodies.  Here is a quick run-down of a few interesting ones. The slightly longer version can be found in SPM Changelog:

PagerDuty integration. If you are a PagerDuty user, your alerts from SPM can now go to your PagerDuty account where you can handle them along with all your other alerts.

Ruby & Java libraries for Custom Metrics.  We open-sourced sematext-metrics, a Ruby gem for sending Custom Metrics to SPM, as well as a Java library of the same name for doing the same from Java.

Coda Metrics & Ruby Metriks support.  We open-sourced sematext-metrics-reporter, a reporter for Coda’s Metrics library that sends Custom Metrics to SPM from Java, Scala, Clojure, and other JVM-based apps, and we’ve done the same for Metriks, the Ruby equivalent of Coda’s Metrics library.

Puppet metrics. We begged James Turnbull to marry Puppet and SPM and write a Puppet report processor that sends each of the metrics generated by a Puppet run to SPM, which he did without us having to buy him drinks… yet.

Performance.  We’ve done a bit of work in the layer right behind the UI to make the UI a little faster.

CentOS 5.x support.  Apparently a good number of people still use CentOS 5.x, so we’ve updated the SPM client to work with it.  You can grab it from the SPM Client page.

@sematext

What’s new in SPM 1.12.0

We’ve been very heads down since our last official release of SPM, as you can see from our SPM Changelog.  Here is some more info about new things you’ll find in SPM:

  • For the impatient – there is now a demo user (click on this and you will be logged in as the demo user).  This lets you into both SPM and Search Analytics even if you don’t have an account, so you can see reports for various types of SPM Apps as well as what Search Analytics reports are like.
  • The SPM Client (aka SPM Agent) can now be installed as an RPM or a Debian package – check the SPM client page.  Until now, the installer was completely written in Bash, but using Jordan Sissel’s fpm we were able to easily put together SPM Client packages.  Moreover, you can now easily install the SPM Client on Redhat, CentOS, Fedora, Debian, Ubuntu, SuSE, Amazon Linux AMI, and maybe some other smaller distros we didn’t get to test.  If you try it on some other distro, please let us know if it worked or if you had issues, so we can help!
  • SPM for Hadoop can now be used to monitor Hadoop HDFS, MapReduce, and YARN clusters.  We have tested SPM for Hadoop with CDH (Cloudera Hadoop Distribution), but it should work just as well with Apache Hadoop, or HDP (Hortonworks Data Platform), and perhaps MapR’s Hadoop distros as well.  Of course, you can use the same Sematext Apps account to monitor any number of your clusters and clusters of any type, so it’s extremely easy to switch from looking at your Hadoop metrics, to HBase metrics, to Solr or ElasticSearch metrics.  We are working on expanding monitoring support to other technologies.  Tell us what you would like SPM to monitor for you!
  • SPM users already loved the Overview panel in SPM (see Screenshots), but the new Custom Dashboards are even cooler!  Here are some things to know about the new SPM Dashboards:
    • You can have any number of them and you can name them – there are no limits
    • You can add any SPM graph to any of your existing dashboards and it’s super easy to create a new dashboard when you realize you want the graph you are looking at on a new dashboard
    • You can drag and drop dashboard panels wherever you want, resize them, and they’ll nudge other panels around and snap onto a neat, invisible grid
    • You can add graphs from any number of different SPM applications to the same dashboard.  So you can create a dashboard that has all the graphs that are important to you and your application(s), and combine metrics from different (types of) apps in a single view.  For example, you can have a dashboard that shows the number of HBase compactions as well as ElasticSearch index merging stats on the same dashboard, right next to each other.
    • Not only can you mix and match graphs from different SPM Apps on the same dashboard, but if you are a Search Analytics user you can also have Search Analytics graphs right next to your performance graphs!  That’s powerful!
    • And for those who discovered Logsene, our soon to be announced Data & Log Analytics service, you can imagine how eventually SPM & Logsene will be able to play together and share the same dashboards!
  • SPM Clients do a good job of gathering server and application metrics for applications covered by SPM.  But what if you want to monitor something else in addition to what SPM collects and graphs?  Or what if you want to feed in some non-performance data – say some user engagement metrics or some other Business KPI?  Well, you can do that now!  We’ve added Custom Metrics to all plans and this addition is free of charge!  And guess what?  You can build graphs with your custom metrics and put them on any and however many of your Dashboards you want, so you could potentially have your KPIs be shown next to your Performance Metrics, next to your Log Analytics, next to your Search Analytics!
  • To help SPM users feed custom metrics into SPM we’ve released and open-sourced the first Sematext Metrics library for Java, with libraries for other languages to follow. Improvements welcome, pull requests welcome, support for other languages super welcome!
  • You can now share, tweet, and embed your SPM graphs as you please.  Next to each graph you will see a little “share” icon that will open up a dialog window where you can copy the displayed short link, which you can tweet or email.  You’ll also see an HTML snippet that you can copy and paste into your blog or your Wiki.  The shared or embedded widget is cool because you can select very specific filters in SPM and the shared/embedded graph will remember them.  Furthermore, you can choose to have your graph show a specific, fixed time range or, alternatively, always show the last N minutes, hours, or days.  This can be quite handy for sharing with team members who may not have access to SPM for one reason or another, but do want access to these graphs – think C*Os or anyone else who needs to be fed graphs and graphs, and some more graphs.
  • Here is a cool and very handy feature that was suggested by George Stathis from Traackr, one of our early SPM users:  If you are a developer in charge of your application, your servers, your cluster, what happens when something breaks or just starts working poorly?  Many people turn to mailing lists to seek help from the open source community.  Experts on these mailing lists often ask for more information so they can provide better help.  Very often this information is in the form of graphs of various performance metrics.  So if you look at SPM you will see a little ambulance icon that, when clicked, puts together an email template that includes a short, public link to whichever graph’s ambulance icon you clicked on.  You can then incrementally keep adding links to other graphs to this email and, when you are ready, email it without leaving SPM.  This is incredibly easy and it makes it a breeze to put together an email that has a bunch of informative graphs in it.  This means you don’t have to go open your email client, you don’t have to type in the mailing list address, and you don’t have to jump between the email you are composing and SPM while you copy and paste links from SPM into your email client.  And while this functionality is really handy when you need help from experts and you want to give them as much information about your system as you can so they can help you better, you can also simply change where this email will get sent and send it to whoever you want.
  • We listen to our users really carefully (see above!).  Some SPM users would like to keep their data longer, so we’ve added Pro Silver – a new plan with longer data retention, more metrics, more everything.
  • We’ve improved SPM for ElasticSearch so it can now collect metrics even from non-HTTP nodes.
  • We have made SPM Sender, the component responsible for sending all the metrics over to us for processing, more resilient and even lighter than before.


We want to hear what you think and what you want!  Please use comments to leave suggestions, requests, and feedback, or simply email us or tweet @sematext.
