SPM REST API

By popular demand, we’ve just added some new goodness to SPM: the SPM REST API.  This new API lets you:

  • create SPM Apps for monitoring (e.g. generate a new SPM App + its token during deployment)
  • list all available metrics and charts for a specific App
  • list all alerts defined for an app (threshold, anomaly, or heartbeat)
  • create new alerts (of any type: heartbeat, threshold, anomaly)
  • enable/disable or delete individual alerts

As you can tell from the above, we started by exposing APIs for SPM Alerts first.  Of course, we’ll be expanding the API to expose more SPM functionality, as well as Logsene. You can find all the details, including Listing Alerts, Creating, Disabling and Deleting Alerts, the Metrics API, and more, on our wiki.
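
To give you a feel for it, listing the alerts defined for an app could look something like the following curl call.  This is just a sketch: the endpoint path and parameter names below are placeholders rather than the documented API, so grab the real calls from the wiki.

```bash
# Illustrative only -- the endpoint path and parameter names are placeholders;
# the wiki documents the real API calls.
curl -X GET \
  -H "Content-Type: application/json" \
  "https://apps.sematext.com/spm/api/alerts/list?token=YOUR_APP_TOKEN"
```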

Alert Creation Designer

One feature of the SPM Alerts HTTP API that we do want to call out here is the Alert Creation Designer.  The easiest way to prepare alert rules for use in API calls is the Create Alert dialog available in SPM.

In the example below we use Solr’s req. count metric, whose chart is found under the Req. Rate & Latency report, as shown in this screenshot:

Screenshot: Req. Rate & Latency report

Above every chart in SPM there is a little pull-down menu.  From there simply click on the Create Alert option to open the Alert dialog seen below.  A new “Show API Call” link is shown at the bottom of the dialog. Once clicked, it displays the API call as a “curl” command line. You can modify attributes shown in the dialog and update the API call details by clicking on the little Refresh icon.  Once you’re happy with alert rule parameters you can copy the curl command and execute it from the terminal.

Screenshot: Create Alert dialog with the Show API Call link
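
Once you click “Show API Call”, the generated command looks roughly like the sketch below.  The endpoint, field names, and values here are approximations on our part; the authoritative version is whatever the dialog displays for your alert.

```bash
# Approximation of a designer-generated call for a threshold alert on
# Solr's req. count metric -- the endpoint and fields are illustrative.
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "name": "Solr req. count too high",
        "metric": "request.count",
        "threshold": 1000,
        "token": "YOUR_APP_TOKEN"
      }' \
  "https://apps.sematext.com/spm/api/alerts/create"
```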

Finally, for Heartbeat Alerts, you would just click the Heart icon on any report (as displayed below) to open a similar dialog:

Screenshot: Heartbeat Alert dialog

Typically, you would prepare your alert templates this way once, and then tweak them remotely (for example, use a different token parameter to apply the same alert to your other apps, or adjust metric names, thresholds, etc.).

Would this make your life easier?

If this functionality looks like it will make managing alerts more efficient, then check out a Free 30-day SPM trial by signing up.  There’s no commitment and no credit card required.  SPM monitors a ton of applications, like Elasticsearch, Solr, Hadoop, Spark, HBase, Kafka, Storm, Cassandra, Node.js & io.js, and more.

Node.js and io.js Monitoring Support

Node.js and io.js are increasingly being used to run JavaScript on the server side for many types of applications, such as websites, real-time messaging, and controllers for small devices with limited resources. For DevOps it is crucial to monitor the whole application stack, and Node.js is rapidly becoming an important part of that stack in many organizations. Sematext has historically had strong support for monitoring big data applications such as Elastic (aka Elasticsearch), Cassandra, Solr, Spark, Hadoop, and HBase, as well as more traditional databases, web servers like Nginx, Nginx Plus and Apache, Java applications, cache servers like Redis and Memcached, messaging middleware like everyone’s darling Kafka, and so on.  With such rapid adoption of Node.js and now io.js, we’d be remiss not to add performance monitoring, alerting, and anomaly detection for them in SPM!


SPM for Node.js

We’re happy to announce we’ve just added Node.js monitoring to this growing list of SPM integrations.  SPM for Node.js covers key Node.js metrics: Event Loop, Garbage Collection, CPU, Memory, and web service metrics.  All metrics are organized into out-of-the-box charts, which can be put on additional dashboards and placed next to performance charts for other parts of the application stack.

Screenshot: Overview of top Node.js and io.js metrics

Of course, you can view your Node.js metrics in a larger context.  For example, here is a dashboard that shows Node.js metrics together with Elasticsearch metrics, making it easier to correlate performance across multiple tiers of the application stack.  You could also get your event and log charts on the same dashboard for an even more thorough correlation.

Screenshot: Dashboard with Node.js HTTP response time and Elasticsearch query latency

Needless to say, we made sure everything works with the latest versions of Node.js (0.12) and io.js (1.6).  If you are not using SPM yet, you can sign up with no commitment or credit card; you get 30 days free on any new app you create.  If you are already using SPM, you can simply add a new SPM App for Node.js and see all your Node.js metrics in just a few minutes.  Don’t see something in SPM for Node.js?  Please let us know (@sematext) or comment below; we are looking for feedback!

Installation is as easy as integrating any other module using npm.
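
For reference, the flow we have in mind looks roughly like this; the `spm-agent-nodejs` module name and `SPM_TOKEN` variable are assumptions on our part, and the in-app install instructions are authoritative.

```bash
# Assumed module and variable names -- follow the instructions shown
# when you create your SPM App for Node.js.
npm install spm-agent-nodejs --save
export SPM_TOKEN=YOUR_SPM_APP_TOKEN   # token of the SPM App you created
# Load the agent before anything else in your app, e.g. with a top-level:
#   require('spm-agent-nodejs')
node app.js
```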


HBase 0.98 Monitoring Support

HBase is a popular open-source, non-relational (NoSQL), column-oriented, distributed database that runs on top of the Hadoop Distributed File System (HDFS).  HBase is well suited for sparse data sets, which are common in many big data use cases.  Fortunately for all its users, SPM now supports monitoring, alerting and anomaly detection for HBase version 0.98.  Even those of you not running version 0.98 (here are the results of our HBase version distribution poll) are still in luck, because many of the HBase metrics captured by SPM are also present in 0.94.x, 0.96.x, and even the recently released 1.0 version.  That said, HBase is one of those projects whose metrics change from version to version: some are deleted, some are added, others are modified.  If you have your own tools for monitoring HBase and are trying to monitor more than just the most basic HBase metrics, maintaining those tools must not be fun. Related to this common issue, we recently put together a “Build vs. Buy” post that weighs the pros and cons.

Here at Sematext we make heavy use of HBase.  We have recently moved from 0.94.x to 0.98.x and have been enjoying all its benefits.  Furthermore, we’ve recently updated SPM for HBase to monitor a pile of new HBase metrics.  Of course, we eat our own dog food and immediately got new and interesting insights about our own HBase clusters through some of the new metric charts.

For example, from https://apps.sematext.com/spm-reports/s/VhOltU14Cy we are now able to see the dramatic impact of major compactions on data locality (and thus HBase performance!):   (click to enlarge)

Screenshot: Local Files (data locality)

And from https://apps.sematext.com/spm-reports/s/7LU1qvs7ur we can see the number and size of HLog files over time:   (click to enlarge)

Screenshot: HLog file count and size

Alright, on to all the details!

Shiny, New HBase Metrics

In total, we’re talking 290 metrics: 195 for 0.98 and 95 for previous versions.  And lots of them changed in 0.98.  Here’s a summary of top-level SPM reports.  Each report listed below has one or more charts with one or more HBase metrics.  Juicy stuff.

Master:

  • Servers
  • Assign Manager
  • Balancer
  • FS
  • Snapshot

Region Server:

  • Regions & Stores
  • Requests
  • Files
  • Compact & Flush
  • Cache
  • Operations
  • Check & Mutate
  • WAL
  • Hedged Reads
  • MOB
  • Replication
  • Replication Source

Common / pre-0.98:

  • IPC
  • HBase JVM
  • UGI
  • Requests (pre-0.98)
  • Regions (pre-0.98)
  • Split (pre-0.98)
  • Memstore (pre-0.98)
  • Store (pre-0.98)
  • Compactions (pre-0.98)
  • FS (pre-0.98)
  • Block Cache (pre-0.98)

Screenshot: HBase Operation Calls & Time  (click to enlarge)


Screenshot: HBase Slow Operations  (click to enlarge)


Screenshot: HBase Sync & Append Ops & Time  (click to enlarge)


OK OK, how do I get all this stuff?

If you are not using SPM yet, simply sign up, create your first SPM App, and follow the directions in the UI.  You should see all your HBase metrics in a matter of minutes.  SPM is free for 30 days, requires no commitment or credit card, and has no limits.  An On Premises version is available as well.

If you are already using SPM, but not monitoring HBase, just create an SPM App for HBase and follow the directions for installing the SPM agent on your HBase nodes.

If you are already using SPM to monitor HBase, you just need to upgrade the SPM agent and re-configure it.
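
On a Debian-style box that could look like the following; note that the `spm-client` package name is our assumption here, so follow the in-UI directions for your platform.

```bash
# Assumed package name -- adjust for your distribution / package manager.
sudo apt-get update
sudo apt-get install --only-upgrade spm-client
# Then re-apply the agent configuration for your HBase SPM App token,
# per the directions shown in the SPM UI.
```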

Cassandra Case Study – including Performance Monitoring

If you use Cassandra you will find some interesting insights in this Planet Cassandra case study by Sematext client Recruiting.com.  Hitendra Pratap Singh, a Cassandra Software Engineer, talks about why they decided to deploy Cassandra, other NoSQL solutions they looked at, advice for new Cassandra users, and more.

Here’s an excerpt:

Monitoring Apache Cassandra with SPM

“We started using SPM Performance Monitoring and Reporting from Sematext for Apache Solr and were impressed with the amount of real-time stats we could analyze using SPM. We expected the same amount of details for Cassandra as well and decided to go with SPM.  Some of the benefits we’ve seen from SPM include the alert notification system, graphical interface [i.e. easy to analyze], detailed stats related to JVM, and creation of our own custom metrics.

We also utilize SPM for monitoring our deployments of Apache Solr and Memcached servers.”

On the “Overview” screen shown below, you can check out some Cassandra metrics, as well as various OS metrics. You can drill down into specific Cassandra metrics by clicking one of the tabs along the left side; these metrics include Compactions, Bloom Filter (space used, false positives ratio), Write Requests (rate, count, latency), Pending Read Operations (read requests, read repair tasks, compactions), and more.

SPM for Cassandra Overview  (click to enlarge)


You can read the full version of “Recruiting.com Powers Real-Time High Throughput Application with Apache Cassandra” at Planet Cassandra.

And if you’d like to monitor Cassandra yourself (or any number of applications like Hadoop, HBase, Spark, Kafka, Elasticsearch, Solr, etc.), check out a Free 30-day trial by registering here.  There’s no commitment and no credit card required.  You can also see our Cassandra monitoring blog post for more details and screenshots.

Use Case: Spark Performance Monitoring

Guest blog post by Nick Pentreath, Co-founder of Graphflow

Democratizing Recommendation Technology

At Graphflow, our mission is to empower online stores of all sizes to grow their businesses by providing them access to the same machine learning and Big Data tools used by the largest and most sophisticated tech players in the market.

To deliver on this mission, we decided from the very beginning to go ‘all in’ on Spark for our scalable analytics and machine learning applications. When Graphflow started using Spark, it was at version 0.7.0 and still relatively immature. A lot has changed over the past year and a half: Spark has become a top-level Apache project, version 1.2.0 was released, and Spark has matured significantly in terms of functionality, deployment, stability, and operations.

Spark Monitoring

There are, however, still a few “missing pieces.”  Among these are robust and easy-to-use monitoring systems. With the version 1.0.0 release, Spark added a metrics system to allow reporting and monitoring of various internal and custom Spark application metrics. Built on top of Coda Hale’s Metrics, the metrics system supports various methods of reporting to external monitoring systems.

This is all very well, but being a very small team, we tend to rely on managed services wherever it makes sense — we just don’t have the resources to manage a dedicated monitoring infrastructure. We recently started using SPM (for monitoring, alerting, and anomaly detection) and Logsene (for our logs) — both from Sematext — across most of our systems, including EC2 metrics, Elasticsearch, and web application log collection and monitoring.

With the recent release of SPM for Spark monitoring, we definitely wanted to take it for a spin!

Getting up and Running

The installation process is straightforward:

  1. Install the SPM monitor on each node in the Spark cluster using the standard package manager.
  2. Amend `SPARK_MASTER_OPTS`, `SPARK_WORKER_OPTS`, and `SPARK_SUBMIT_OPTS` in `spark-env.sh` and `spark.executor.extraJavaOptions` in `spark-defaults.conf` on each node, with the appropriate config properties, including an SPM access key (don’t forget to propagate these config changes to each worker – we do this using *spark-ec2’s* `copy-dirs` command).
  3. Create or amend the metrics properties file `metrics.properties` to point to the JMX sink (by setting `*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink`). A minimal sketch of steps 2 and 3 follows this list.
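
Here is what that could look like on one node, assuming a standard layout with `$SPARK_HOME` set.  The SPM agent options in step 2 are deliberately left as a placeholder, since the real ones come from the setup instructions shown for your SPM App token.

```bash
# Step 3: point Spark's metrics system at the JMX sink
# (the sink class is the one named above).
cat >> "$SPARK_HOME/conf/metrics.properties" <<'EOF'
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
EOF

# Step 2: append the SPM agent options to the relevant *_OPTS variables.
# <SPM_AGENT_OPTS> is a placeholder -- copy the real options (including your
# SPM access key) from the SPM setup page.
cat >> "$SPARK_HOME/conf/spark-env.sh" <<'EOF'
export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS <SPM_AGENT_OPTS>"
export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS <SPM_AGENT_OPTS>"
export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS <SPM_AGENT_OPTS>"
EOF

# Propagate the changed conf dir to the workers (we use spark-ec2's
# copy-dirs command for this, as mentioned above).
~/spark-ec2/copy-dirs "$SPARK_HOME/conf"
```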

Once all nodes are restarted, you should start seeing metrics appearing in the SPM dashboard within a few minutes.

The main dashboard provides a useful overview of what’s going on in the cluster. The detail tabs on the side let you drill down into more detailed metrics for the Master / Driver and the Workers / Executors, as well as all key JVM and server metrics.  We can also feed any custom metrics we want to chart into SPM, but we are not making use of that yet.

Screenshot: Spark overview dashboard in SPM

Spark Troubleshooting with SPM

Spark, being a complex distributed system, sometimes has issues. While these have become rarer with the past few releases — which have improved efficiency and stability significantly — they still happen. Probably the most common causes of failure (either of a Job, a Worker, or the Master) are related to memory pressure or misconfiguration.

As a case in point: over a number of days we were experiencing periodic job failures due to Workers going down, but we could not find a precise cause in the logs. Since we had installed SPM for Spark, we took a look through a few of the metrics dashboards. At first it was still not clear what might be causing the issue. However, we noticed that at the time of each failure there was a big spike in CPU usage and, directly afterwards, the overall disk usage dropped off noticeably.

Screenshots: CPU usage spike followed by a drop in overall disk usage

Once we drilled down from the aggregated metrics view (above) to the individual disk view, the root cause became clear – running out of disk space on the root device!

Screenshots: per-disk view showing the root device running out of space

Sure enough, once we knew what to look for, we found that the Spark working directory on each Worker node had gotten clogged up with job logs and JARs.  We run a fairly large number of jobs on regular schedules (every 15 minutes, every hour, daily, and so on), and each job added to the build-up of these files in the working directory.

We had correctly set `spark.local.dir` to the large disk volume, but the default working directory is set to `$SPARK_HOME/work`. This setting can be changed with the environment variable `SPARK_WORKER_DIR` in `spark-env.sh`. We also turned on the ‘worker cleanup’ functionality by setting `spark.worker.cleanup.enabled true` in `spark-defaults.conf`. The Spark Standalone guide has more detail on these settings.
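
For reference, the two changes look like this; the volume path below is illustrative.

```bash
# In spark-env.sh on each Worker: move the working directory off the
# root device (the path is illustrative).
export SPARK_WORKER_DIR=/mnt/big-volume/spark-work

# In spark-defaults.conf: have Workers periodically purge old application
# work dirs (job logs and JARs):
#
#   spark.worker.cleanup.enabled  true
```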

Everything in One Place

Using SPM, together with the Spark Web UI and its ability to keep history on previously run Spark applications, we’ve found that troubleshooting Spark performance issues has gotten much easier. On top of that, the ability to manage metrics, monitoring and logging across our entire stack in one place, as well as integrate log search and analytics for Spark, is a huge win for our team.

To learn more about us and our eCommerce and Recommendation Analytics solutions, visit the Graphflow web site.  And to learn more about SPM for Spark monitoring, check out Sematext.

Got some feedback or suggestions?  Drop Sematext a line — they’d love to hear from you!

Integrating SPM Performance Monitoring with Slack

Many distributed DevOps teams rely on Slack, a team communication platform that puts everything in one place, instantly searchable and available wherever you go.  SPM Performance Monitoring‘s new WebHook integration makes it possible to forward alerts to many services, including Slack.

Integrating the two services is a matter of taking the WebHook URL from Slack and then configuring that WebHook in SPM.  The SPM Wiki explains how to get this information from Slack and build the WebHook in SPM: Alerts – Slack integration
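
Before wiring it into SPM, you can sanity-check the WebHook URL Slack gives you with a plain POST (the URL below is a placeholder for the one Slack generates).

```bash
# Placeholder URL -- use the Incoming WebHook URL generated in Slack.
curl -X POST -H 'Content-Type: application/json' \
  -d '{"text": "Testing the WebHook before hooking up SPM alerts"}' \
  "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXX"
```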


This whole process only takes a minute or two.  Slack is a tool that is becoming more popular among the DevOps crowd, and here at Sematext we pride ourselves on staying on top of what our users need and expect.

Need some extra help with this setup or another app you might want to integrate?  Have ideas for other integrations we should explore?  Please drop us a line; we’re here to help and listen.
