Poll: What do you use for ElasticSearch performance monitoring?

The results of this poll will be included in the “Large Scale ElasticSearch, Solr & HBase Performance Monitoring” presentation at Berlin Buzzwords next week.  Please vote and share this post to help us make this poll statistically significant!


15 Responses to Poll: What do you use for ElasticSearch performance monitoring?

  1. Paul Smith says:

    We use a performance monitoring tool, open-sourced by Silicon Graphics (SGI), that captures hardware, OS, and application metrics, and now has an ES integration (which we wrote).

    http://oss.sgi.com/projects/pcp/

    It's an amazing tool, particularly for retrospective analysis, since we can dig deep into performance archive logs, replay theories, and set up complex rules for triggering.

    • sematext says:

      @Paul – interesting, I thought it was a dead project. Looks like it’s not completely dead after all.

  2. OpenTSDB tCollector for ElasticSearch

  3. lkafle says:

    Homegrown, a Java/Perl combination, as the other tools among these choices were not known to us.

  4. Radu Gheorghe says:

    Some custom checks using Shinken (a Nagios-like tool): http://www.shinken-monitoring.org/

    • sematext says:

      @Radu – interesting. I haven’t heard of it before. But note that, like Nagios, it monitors whether servers/services are up or down, but not how they performed, say, over the last 15 minutes. So it’s not really a performance monitoring tool, from what I can tell.

      • Radu Gheorghe says:

        @sematext: Yeah, well that’s the obvious part of it, but you can use NRPE[1] to run Nagios plugins on remote systems. A Nagios plugin[2] is basically any sort of application that returns the output and exit codes Nagios/Shinken can interpret (doesn’t need to be run via NRPE, it can run on the Shinken host as well, if it doesn’t need access to the remote machine). And the output can include one or more performance data values.

        For example, we use Elasticsearch for storing logs, and we have a check that fires off a log, then returns the time it takes that log to be returned in searches. If the time passes certain thresholds, we trigger warning or critical alerts. It’s all work in progress for us, but that’s how we started.
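As a rough illustration of the check described above, the sketch below maps a measured "time until the log is searchable" onto the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and appends performance data after the `|` separator. The service label `ES_LOG_LATENCY`, the function name, and the threshold values are assumptions made for the example; they are not taken from the actual check discussed here.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_latency(seconds, warn, crit):
    """Map a measured latency to (exit_code, plugin_output_line).

    `seconds` is how long it took for a test log to show up in searches.
    `warn` and `crit` are hypothetical thresholds, in seconds.
    """
    if seconds >= crit:
        state = CRITICAL
    elif seconds >= warn:
        state = WARNING
    else:
        state = OK

    # Human-readable part, shown in the monitoring UI.
    text = "ES_LOG_LATENCY %s - log searchable after %.2fs" % (STATE_NAMES[state], seconds)
    # Performance data: label=value[UOM];warn;crit;min
    perfdata = "latency=%.2fs;%.2f;%.2f;0" % (seconds, warn, crit)
    return state, text + " | " + perfdata
```

A real plugin would print the returned line and exit with the returned code, for example `state, line = evaluate_latency(measured, 5.0, 15.0)` followed by `print(line)` and `sys.exit(state)`.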

        We also use that performance data to build graphs using pnp4nagios. With “special templates”[3], you can do all sorts of stuff to make your graphs more significant. Like, for instance, aggregate the performance data from all the ES nodes.

        Also, when a service changes its state, you can use “event handlers”[4] to react to those state changes. In our case it might be dropping some logs or throttling them when ES is too loaded. This becomes even more interesting when you start defining some custom services, called “business rules”[5], where you can combine individual services states. For example, when load on ES becomes CRITICAL on at least 2 out of 8 nodes, increase the auto-refresh interval.
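The "at least 2 out of 8 nodes" rule above boils down to counting states. The following is a minimal sketch of that logic in plain Python, not Shinken's business rule syntax; the node names and the two interval values (`1s` and `30s`) are hypothetical and chosen only for illustration.

```python
def count_critical(node_states):
    """Count how many nodes currently report CRITICAL load."""
    return sum(1 for state in node_states.values() if state == "CRITICAL")


def choose_refresh_interval(node_states, threshold=2, normal="1s", relaxed="30s"):
    """Pick a refresh interval from per-node load states.

    `normal` and `relaxed` are assumed values; an event handler would
    apply the returned interval to the cluster.
    """
    if count_critical(node_states) >= threshold:
        return relaxed
    return normal


# Hypothetical snapshot of 8 nodes, two of them overloaded.
states = {"es%d" % i: "OK" for i in range(1, 9)}
states["es3"] = "CRITICAL"
states["es7"] = "CRITICAL"
interval = choose_refresh_interval(states)
```

With the snapshot above, two nodes are CRITICAL, so the function returns the relaxed interval.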

        The problem with all this is that the tools are quite generic. You need to write your own plugins, event handlers, and PHP templates for graphs, and you need to define services and business rules. It’s a bit of a pain.

        Useful links:
        [1] http://nagios.sourceforge.net/docs/nrpe/NRPE.pdf
        [2] http://nagiosplug.sourceforge.net/developer-guidelines.html
        [3] http://docs.pnp4nagios.org/pnp-0.6/tpl_special
        [4] http://www.shinken-monitoring.org/wiki/official/advancedtopics-eventhandlers
        [5] http://www.shinken-monitoring.org/wiki/official/advancedtopics-businessrules

        P.S. We also use BigDesk, I voted for it. It goes without saying – it’s awesome :D

        • sematext says:

          @Radu – thanks for sharing all that, very informative! Sounds powerful, but I agree it sounds rather involved and has a lot of small moving pieces.

          • Radu Gheorghe says:

            @sematext [sorry for the late reply] Well, we plan to change a bit of that by open-sourcing bits that would be re-usable. For example, a plugin that measures inserts per second, coupled with a php template if it would be appropriate.

          • xkilian says:

            @Radu – Shinken now provides monitoring packs[1] that put together the templates, the commands, and the plugin. This makes it easier to share and improve the monitoring logic. They are also directly linked with the configuration system and the wiki to integrate all the pieces for users. This functionality is being released as part of Shinken 1.2, but it is already available in the Shinken git repository today.
            [1] http://community.shinken-monitoring.org/main

            You can also note that Shinken is rolling full speed ahead in integrating Shinken with Graphite. Shinken can collect performance data included as part of the monitoring check output. A typical plugin will return the state of a service but can also include performance data. This data is exported via Shinken to Graphite. There are also plugins[2] [3] that will run checks against data in Graphite itself to use its statistical functions.

            [2] https://github.com/etsy/nagios_tools/
            [3] https://github.com/jbarber/graphite-nagios-plugins
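To make the perfdata-to-Graphite path concrete, here is a minimal sketch that turns the performance data section of a plugin output line into Graphite plaintext-protocol lines (`metric.path value timestamp`). It is not Shinken's exporter; the function name and metric prefix are assumptions, and the parser only handles simple unquoted labels.

```python
import re

# Matches "label=value" at the start of a perfdata item; the unit of
# measure and the ;warn;crit;min;max fields that may follow are ignored.
PERF_ITEM = re.compile(r"([^\s=]+)=(-?[0-9.]+)")


def perfdata_to_graphite(plugin_output, prefix, timestamp):
    """Convert 'TEXT | label=value[UOM];...' into Graphite plaintext lines."""
    _, _, perf = plugin_output.partition("|")
    lines = []
    for item in perf.split():
        match = PERF_ITEM.match(item)
        if match:
            label, value = match.groups()
            lines.append("%s.%s %s %d" % (prefix, label, value, timestamp))
    return lines
```

Each returned line, terminated by a newline, can be written to a Carbon plaintext listener (port 2003 by default).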

            Typical Nagios/Shinken installations have lots of small moving pieces. Though Shinken does aim at making this easier.

  5. Andrej says:

    Homegrown: a combination of Java API stats and bash scripts for system performance.

    • sematext says:

      @Andrej – Thanks for the info. You may want to have a look at SPM for ElasticSearch, it sounds like it may be simpler to use/(not) maintain that than bash scripts. I see you are in .de – if you are going to Berlin Buzzwords stop by Sematext’s booth if you want to see SPM for ElasticSearch in live action.

  6. karussell says:

    For jetwick.com and jetsli.de I’m still feeding some stats into one index and grabbing them with a bit of JavaScript & Dygraph:

    http://karussell.files.wordpress.com/2011/06/jetslide-stats.png

    Later on I added a lot more timelines for query characteristics of certain queries, memory usage of the feeder, etc.

    It is very generic: a name, a number, the time (if not now) and application (e.g. ui/backend/…)
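A generic stat of that shape could look roughly like the document built below. This is only a sketch: the field names, the millisecond timestamp, and the helper name `make_stat` are assumptions, not the actual schema used for jetwick.com.

```python
import json
import time


def make_stat(name, value, application, timestamp=None):
    """Build one stat document: a name, a number, the time, and the application.

    `timestamp` is seconds since the epoch; it defaults to "now" when omitted.
    """
    if timestamp is None:
        timestamp = time.time()
    return {
        "name": name,
        "value": value,
        "time": int(timestamp * 1000),  # stored as epoch milliseconds
        "application": application,
    }


# Hypothetical example: a query-time sample reported by the UI.
doc = make_stat("query_time_ms", 12.5, "ui", timestamp=1307000000)
body = json.dumps(doc, sort_keys=True)
```

The resulting JSON string is what would be indexed, one document per sample, so a charting script can later filter by name and application and plot value against time.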
