Use Case: Spark Performance Monitoring

Guest blog post by Nick Pentreath, Co-founder of Graphflow

Democratizing Recommendation Technology

At Graphflow, our mission is to empower online stores of all sizes to grow their businesses by providing them access to the same machine learning and Big Data tools used by the largest and most sophisticated tech players in the market.

To deliver on this mission, we decided from the very beginning to go ‘all in’ on Spark for our scalable analytics and machine learning applications. When Graphflow started using Spark, the project was at version 0.7.0 and still relatively immature. A lot has changed over the past year and a half: Spark has become a top-level Apache project, version 1.2.0 was released, and Spark has matured significantly in terms of functionality, deployment, stability, and operations.

Spark Monitoring

There are, however, still a few “missing pieces.”  Among these are robust and easy-to-use monitoring systems. With the version 1.0.0 release, Spark added a metrics system to allow reporting and monitoring of various internal and custom Spark application metrics. Built on top of Coda Hale’s Metrics, the metrics system supports various methods of reporting to external monitoring systems.

This is all very well, but being a very small team, we tend to rely on managed services wherever it makes sense — we just don’t have the resources to manage a dedicated monitoring infrastructure. We recently started using SPM (for monitoring, alerting, and anomaly detection) and Logsene (for our logs) — both from Sematext — across most of our systems, including EC2 metrics, Elasticsearch, and web application log collection and monitoring.

With the recent release of SPM for Spark monitoring, we definitely wanted to take it for a spin!

Getting up and Running

The installation process is straightforward:

  1. Install the SPM monitor on each node in the Spark cluster using the standard package manager.
  2. Amend `SPARK_MASTER_OPTS`, `SPARK_WORKER_OPTS`, and `SPARK_SUBMIT_OPTS` in `spark-env.sh` and `spark.executor.extraJavaOptions` in `spark-defaults.conf` on each node, with the appropriate config properties, including an SPM access key (don’t forget to propagate these config changes to each worker – we do this using spark-ec2’s `copy-dirs` command).
  3. Create or amend the metrics properties file `metrics.properties` to point to the JMX sink by setting `*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink` (see the sketch below).
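For reference, here is a minimal sketch of what these changes might look like. The `-javaagent` flag and jar path below are placeholders rather than the literal values; use the exact options that the SPM setup instructions generate for your access key and platform.

```bash
# spark-env.sh -- attach the SPM monitor to the Master, Workers, and Driver.
# The agent jar path and its arguments are placeholders (see your SPM setup page).
SPM_AGENT="-javaagent:/path/to/spm-monitor.jar=<SPM-access-key>"
export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS $SPM_AGENT"
export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS $SPM_AGENT"
export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS $SPM_AGENT"
```

```properties
# spark-defaults.conf -- executors run in their own JVMs, so they need the agent too
spark.executor.extraJavaOptions  -javaagent:/path/to/spm-monitor.jar=<SPM-access-key>

# conf/metrics.properties -- route Spark's internal metrics to the JMX sink (step 3)
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
```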

Once all nodes are restarted, you should start seeing metrics appearing in the SPM dashboard within a few minutes.

The main dashboard provides a useful overview of what’s going on in the cluster. The detail tabs on the side let you drill down into more detailed metrics for the Master / Driver and the Workers / Executors, as well as all key JVM and server metrics.  We can also feed any custom metrics we want to chart into SPM, but we are not making use of that yet.

[Screenshot: Spark_monitoring_1 – the main SPM dashboard overview]

Spark Troubleshooting with SPM

Spark, being a complex distributed system, sometimes has issues. While these have become rarer with the past few releases — which have improved efficiency and stability significantly — they still happen. Probably the most common causes of failure (either of a Job, a Worker, or the Master) are related to memory pressure or misconfiguration.

As a case in point: over a number of days we experienced periodic job failures due to Workers going down, yet we could not find a precise cause in the logs. Since we had installed SPM for Spark, we took a look through a few of the metrics dashboards. At first it was still not clear what might be causing the issue. However, we noticed that at the time of each failure there was a big spike in CPU usage and, directly afterwards, the overall disk usage dropped off noticeably.

[Screenshot: Spark_monitoring_2a – CPU usage spike at the time of the failure]

[Screenshot: Spark_monitoring_2b – overall disk usage dropping off directly afterwards]

Once we drilled down from the aggregated metrics view (above) to the individual disk view, the root cause became clear – running out of disk space on the root device!

[Screenshot: Spark_monitoring_3a – aggregated disk metrics]

[Screenshot: Spark_monitoring_3b – individual disk view showing the root device running out of space]

Sure enough, once we knew what to look for, we found that the Spark working directory on each Worker node had become clogged up with job logs and JARs.  We run a fairly large number of jobs on regular schedules (every 15 minutes, every hour, daily, and so on), and each job added to the build-up of these files in the working directory.

We had correctly set `spark.local.dir` to the large disk volume, but the default working directory is set to `$SPARK_HOME/work`. This setting can be changed with the environment variable `SPARK_WORKER_DIR` in `spark-env.sh`. We also turned on the ‘worker cleanup’ functionality by setting `spark.worker.cleanup.enabled true` in `spark-defaults.conf`. The Spark Standalone guide has more detail on these settings.
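Concretely, the fix amounted to the following two changes; the mount point is illustrative, but `SPARK_WORKER_DIR` and `spark.worker.cleanup.enabled` are standard Spark Standalone settings.

```bash
# spark-env.sh -- move each Worker's working directory off the small root device
export SPARK_WORKER_DIR=/mnt/spark-work   # illustrative path on the large volume
```

```properties
# spark-defaults.conf -- have Workers periodically clean up old application dirs
spark.worker.cleanup.enabled  true
```

The related `spark.worker.cleanup.interval` and `spark.worker.cleanup.appDataTtl` settings control how often the cleanup runs and how long finished applications’ work directories are kept.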

Everything in One Place

Using SPM, together with the Spark Web UI and its ability to keep history on previously run Spark applications, we’ve found that troubleshooting Spark performance issues has gotten much easier. On top of that, the ability to manage metrics, monitoring and logging across our entire stack in one place, as well as integrate log search and analytics for Spark, is a huge win for our team.

To learn more about us and our eCommerce and Recommendation Analytics solutions, visit the Graphflow web site.  And to learn more about SPM for Spark monitoring, check out Sematext.

Got some feedback or suggestions?  Drop Sematext a line — they’d love to hear from you!

Integrating SPM Performance Monitoring with Slack

Many distributed DevOps teams rely on Slack, a platform for team communication providing everything in one place, instantly searchable and available wherever you go.  SPM Performance Monitoring’s new integration via WebHooks provides the capability to forward alerts to many services, including Slack.

The integration of both services can be achieved by using the WebHook URL from Slack and then configuring this WebHook in SPM.  The SPM Wiki explains how to get this information from Slack and build the WebHook in SPM: Alerts – Slack integration
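If you want to sanity-check the Slack side first, an incoming WebHook is just an HTTP endpoint that accepts a JSON payload, so you can test it from the command line; the URL and message text below are placeholders.

```bash
# Send a test message to a Slack incoming WebHook (URL and text are placeholders)
curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "SPM alert: disk usage on a worker node crossed 90%"}' \
  'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
```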

[Image: spm-slack-alert-logo – SPM alert delivered to Slack]

This whole process only takes a minute or two.  Slack is a tool that is becoming more popular among the DevOps crowd, and here at Sematext we pride ourselves on staying on top of what our users need and expect.

Need some extra help with this setup or another app you might want to integrate?  Have ideas for other integrations we should explore? Please drop us a line, we’re here to help and listen.

Integrate PagerDuty with SPM Performance Monitoring

Got Alarm Fatigue?

If so, you are not alone!  We talk to a lot of people who want to reduce the frequent “noise” from monitoring alarms.  To solve this common problem, Sematext added anomaly detection for alerts and PagerDuty integration to its SPM Performance Monitoring solution, dramatically reducing this noise compared with simple threshold-based alerting mechanisms.  The integration with PagerDuty helps DevOps teams with incident management, i.e., managing escalation and routing alerts to the right person via defined schedules and communication channels.

“PagerDuty is an alarm aggregation and dispatching service for system administrators and support teams. It collects alerts from your monitoring tools, gives you an overall view of all of your monitoring alarms, and alerts an on-duty engineer if there’s a problem. PagerDuty allows you to build sophisticated alerting rules to determine who to contact when problems occur. You can build on-call schedules to equitably share on-call responsibilities. You can also set up multiple levels of coverage, so if the “primary” on-call person doesn’t respond to an alert in a timely fashion, it’s automatically escalated to a “secondary” person, and so on.” – Source: PagerDuty FAQ.

SPM Performance Monitoring is an enterprise-class, server and application performance monitoring, alerting, and anomaly detection solution. It is available both in the cloud (SaaS) and On Premises.  SPM also integrates with Logsene Log Management and Analytics to correlate metrics, alerts, anomalies, and events with application and server logs.

Get started

Basic setup steps are required to hook up both services:

  1. In PagerDuty: Get an API Key
  2. In SPM: Enter the API Key in SPM alert settings

1) In PagerDuty:

Create a new service:

  1. In your account, under the Services tab, click “Add New Service”.
  2. Select an Escalation Policy (e.g., the default).
  3. Start typing “Sematext” for the Integration Type, which will narrow your filtering.
     [Screenshot: PagerDuty – Add Service]
  4. Click the “Add Service” button.
  5. Once the service is created, you’ll be taken to the Service page.  On this page you’ll see the “Service API Key,” which you will need when you configure Sematext products to send events to PagerDuty.  Copy the “Service API Key” to the clipboard.
     [Screenshot: PagerDuty – Service API Key]

2) In SPM:

  1. Navigate to the Application Settings of your SPM App by clicking the App Settings button in the top right of the SPM UI.
     [Screenshot: SPM – App Settings]
  2. Navigate to Alerts / PagerDuty.
     [Screenshot: SPM – Service API key field for PagerDuty]
  3. Enter the API key from PagerDuty in the Service API key field.
  4. Press the Save button.

Done. Every alert from your SPM app will be forwarded to PagerDuty, where you can manage escalation policies and configure notifications to other services like HipChat, Slack, Zapier, Flowdock, and more.
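As an aside, PagerDuty’s generic integration works by accepting “trigger” events over a simple HTTP API, so you can sanity-check your Service API Key independently of SPM by sending a test event yourself. The endpoint below is PagerDuty’s generic events API of that era; the description text is illustrative.

```bash
# Trigger a test incident via PagerDuty's generic events API
curl -X POST 'https://events.pagerduty.com/generic/2010-04-15/create_event.json' \
  -H 'Content-type: application/json' \
  --data '{
    "service_key": "<your-Service-API-Key>",
    "event_type": "trigger",
    "description": "Test alert: verifying the SPM / PagerDuty integration"
  }'
```

If the key is valid, a new incident shows up in PagerDuty and is routed according to the Escalation Policy you selected above.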

If you’ve got some feedback on this post or ideas for similar posts please let us know!

Elasticsearch Monitoring: SPM vs. Marvel

While many SPM Performance Monitoring users quickly see the benefits of SPM and adopt it in their organizations for monitoring — not just for Elasticsearch, but for their complete application stack — some Elasticsearch users evaluate SPM and compare it to Marvel from Elasticsearch.  We’ve been asked about SPM vs. Marvel enough times that we decided to put together this focused comparison to show some of the key differences and help individuals and organizations pick the right tool for their needs.

Marvel is a relatively young product that provides a detailed visualization of Elasticsearch metrics in a Kibana-based UI.  It installs as an Elasticsearch plug-in and includes ‘Sense’ (a developer console) plus replay functionality for shard allocation history.

SPM, on the other hand, offers multiple agent deployment modes, has both Cloud and On Premises versions, includes alerts and anomaly detection, is not limited to Elasticsearch monitoring, integrates with third party services, etc. The following Venn diagram shows key areas that SPM and Marvel have in common and also the areas where they differ.

[Venn diagram: SPM-vs-Marvel – areas the two products have in common and where they differ]

Looking into the details surfaces many notable differences.  For example:

  • The SPM agent can run independently of the Elasticsearch process, and an upgrade of the agent does not require a restart of Elasticsearch.
  • Dashboards are defined with different philosophies: Marvel exposes each metric in a separate chart, while SPM groups related metrics together in a single chart or in adjacent charts (making it easy to see more information in one place without jumping between multiple views).
  • Both can show metrics from multiple nodes in a single chart: Marvel draws a separate line for each node, while in SPM you can choose to aggregate values or display them separately.

The following “SPM vs. Marvel Comparison Table” is a starting point for evaluating monitoring products against an organization’s individual needs.

SPM vs. Marvel Comparison Table

| Feature | SPM by Sematext | Marvel by Elasticsearch |
|---------|-----------------|-------------------------|
| Supported Applications | Elasticsearch, Hadoop, Spark, Kafka, Storm, Cassandra, HBase, Redis, Memcached, NGINX(+), Apache, MySQL, Solr, AWS CloudWatch, JVM, … | Elasticsearch |
| Agent deployment mode | In- and out-of-process (out-of-process allows seamless agent updates without Elasticsearch restarts) | In-process (as an Elasticsearch plug-in; updates require Elasticsearch restarts) |
| Predefined dashboard graphs organized in groups | YES | YES |
| Saving individual dashboards | Each user can store multiple dashboards, mixing charts from all applications, including both metrics and logs | The current view can be saved and reset to defaults; these changes are global |
| API for Custom Metrics and Business KPIs | YES | NO |
| Extra Elasticsearch metrics | NO (metrics are added based on user demand, and users can always graph them as Custom Metrics) | YES (Circuit Breakers, ID Cache, Lucene memory, ES thread pools, Percolator) |
| OS and JVM metrics | YES, plus JVM pool sizes and JVM pool utilization | YES |
| Correlation of metrics with logs, events, alerts, and anomalies | YES (SPM and Logsene integration; ability to ingest and chart arbitrary external events) | NO (Cluster Pulse displays only Elasticsearch events) |
| Deployment model | SaaS or On Premises | On Premises |
| Security / user roles & permissions | YES | NO |
| Easy & secure sharing of reports with internal and external organizations | YES (via short links, embeds / iframes, or email) | NO |
| Machine learning-based anomaly detection | YES | NO |
| Threshold-based alerts | YES | NO |
| Heartbeat alerts | YES | NO |
| Forwarding alerts to third parties | YES (email, PagerDuty, Nagios / Shinken, HipChat, Slack, WebHooks) | NO |
| Metrics aggregation | YES – pre-aggregation at multiple granularity levels, down to 1-minute granularity; advantage: more efficient storage, better scaling, and faster graphing over longer time periods, at the expense of sub-minute precision | YES – query-time aggregation only, with no write-time pre-aggregation; advantage: 10-second precision by default, at the expense of storage size, write and read performance, and memory footprint |

As an aside, most of the features in this comparison table would also apply if we compared SPM to BigDesk, ElasticHQ, Statsd, Graphite, Ganglia, Nagios, Riemann, and other application-specific monitoring or alerting tools out there.

If you have any questions about this comparison or have any feedback, please let us know!

Integrating SPM Performance Monitoring with HipChat

Many agile DevOps teams rely on communication via HipChat, which provides an API and mobile apps so you can receive messages while away from your desktop.  SPM Performance Monitoring’s new integration via WebHooks provides the capability to forward alerts to many services, including HipChat.

The integration of both services can be achieved by collecting the room_id and an access token from HipChat and then building a WebHook in SPM.  The SPM Wiki explains how to get this information from HipChat and build the WebHook in SPM: Alerts – HipChat integration
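To verify the HipChat side first, you can post a test message to the room directly via HipChat’s v1 REST API, using the same room_id and access token; all values below are placeholders.

```bash
# Post a test notification to a HipChat room (token and room_id are placeholders)
curl -X POST 'https://api.hipchat.com/v1/rooms/message' \
  -d 'auth_token=<your-access-token>' \
  -d 'room_id=<your-room-id>' \
  -d 'from=SPM' \
  -d 'message=Test alert: verifying the SPM / HipChat integration'
```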

[Image: Performance-Monitoring-Hip-Chat-Integration – SPM alert delivered to HipChat]

This whole process only takes a minute or two.  HipChat is a tool that is becoming more popular among the DevOps crowd, and here at Sematext we pride ourselves on staying on top of what our users need and expect.

Need some extra help with this setup or another app you might want to integrate?  Have ideas for other integrations we should explore? Please drop us a line, we’re here to help and listen.

Performance Monitoring Comparison: Build vs. Buy

Using a performance monitoring system that you built yourself? You are not alone!  Many organizations monitor their applications and IT infrastructure with a bolted-together and often incompatible assortment of tools.  In larger organizations this can run to a dozen or more different tools.  Seriously.  Build vs. Buy, Do-It-Yourself (DIY), homegrown, in-house, Not Invented Here (NIH) — there are almost as many terms to describe this approach as there are products to do the monitoring.

There’s a good chance you’re using tools like Statsd, Graphite, Nagios and others to stay on top of things.  But that’s a LOT of work.  And why spend all that time doing the work yourself?  Life, as we all know, is too short.  Is gluing together N tools or building yet another custom monitoring tool really a good use of (y)our (life)time?  This also leads to the next obvious question:

Why Not Use One Monitoring Solution to Do It All?

SPM Performance Monitoring, Alerting and Anomaly Detection is a comprehensive solution that does the work of many individual monitoring tools in one powerful package.  Applications, servers, other key IT devices — even logs! — are all covered.  A partial list of monitored apps includes Elasticsearch, Solr, Hadoop, HBase, Spark, Cassandra, Kafka, Storm, Redis, NGINX Plus and NGINX.  You can even see what I’m talking about right now by checking out our SPM live demo.

In fact, as one SPM user recently told us:

“I don’t want to be a data ape and consume your data to build other reports.  I think that is one of the attractions with SPM — I can push the data to Graphite or another monitoring tool, but you already have the reports done. So my time to insight is much faster.”

There Are Some Huge Differences Between Building and Buying

If your “Build” approach is draining engineering time that could be better spent elsewhere, then you should consider some of the key differences between building your own monitoring “system” and using SPM, including:

  • Log & Event Correlation: SPM can aggregate, graph and correlate logs with performance metrics and alerts (via integration with Logsene Log Management and Analytics).  If you are managing your logs then you are using a separate solution that does not integrate with your “Build” monitoring system.  Being able to see logs along with performance metrics is essential for effective troubleshooting.
  • On Premises or in Cloud: SPM offers an On Premises version in addition to SaaS.  Most app-specific monitoring tools are SaaS-only, but some organizations like their metrics and logs close to home base.
  • Native App Monitoring vs. 3rd Party Plugins: SPM monitors all apps natively.  If you are monitoring a number of individual apps via a range of 3rd party plugins then you have to deal with multiple installation and data collection mechanisms, various levels of maturity, and widely varying qualities of implementation and of reporting.
  • Anomaly Detection: SPM has support not only for heartbeat alerts and threshold-based alerts, but also for automatic machine learning based anomaly detection.  A Build system most likely does not have comprehensive anomaly detection capabilities.

And Then There is the Cost of Using All Those Different Monitoring Tools…

Cost comparisons between Building your own monitoring system and Buying a solution like SPM are not linear.  While some monitoring tools are open-source and free (though the time to configure them can be costly in its own right), commercial tools run the gamut of costs, infrastructure limits, data limits, time limits, pricing schemes, etc.  Just keeping track of the costs is often a job in itself.  In general, the more tools you have, the more value SPM delivers.

Here’s one scenario that will give you an idea of potential Build costs:

Build Your Own Monitoring System — Cost Scenario

  • Hourly rate:        $100 (ballpark figure; could be much higher)
  • Installation:        2 hours (very optimistic)
  • Configuration:   8 hours (very optimistic)
  • Maintenance:    2 hours/month (optimistic)
  • Upgrading:        2 days (i.e., ~20 hours)/year (IF all goes well!)
  • # of servers to run this configuration:  3 (monitoring 10 total servers*)
  • Cost per server (hardware): $1,000 each (i.e., $3,000 total)

___________________________________________________________

  • Total Cost in Year 1:        $6,200
  • Total Cost in Year 2:        $3,200 (not including any additional server purchases)
  • Total Cost in Year 3:        $3,200 (at least, though most likely higher)

And we didn’t even count the time cost to actually learn how to use all these tools!

Moreover, we used very optimistic numbers and assumed nothing will go wrong – no backward incompatibilities, no dependency problems, and so on – all things that are actually very common, can consume days, and can make the above costs much higher.  We do a ton of DevOps work at Sematext and, like everyone in this field, know how common this is.

* this number can vary widely, but for example purposes: if you want a complete monitoring solution that can do everything SPM can do — monitoring, alerting, anomaly detection, graph emailing, embedding, etc. — then the total number of servers that can be monitored with 3 monitoring servers will be lower than it would be with a bare-bones, incomplete monitoring tool.

SPM — Cost Scenario

  • # of servers: 10 servers (for example purposes)
  • Standard plan (our lowest cost plan beyond Free): $25/server/month
  • Time to Register and Install N agents: 1 hour (or $100 at hourly rate)

________________________________________________________________

  • Total Cost in Year 1:        $3,100 (10 servers × $25/server/month × 12 months = $3,000, plus the one hour of setup time)
  • Total Cost in Year 2:        $3,000
  • Total Cost in Year 3:        $3,000

And these costs don’t even include any Volume Discounts that we would offer!

You can find clear, simple pricing plans for SPM right here.

Conclusion

While it’s great that there are many tools available for monitoring — some of them free — and communities built around those tools, in the DevOps world it still comes back to time.  Time to learn these tools.  Time to stay up-to-date on them.  Time to deploy.  Time to configure.  Time to maintain.  Time to assemble a bunch of disparate tools so you can monitor more than just one app.  You get the picture.  And time, as we all know, carries a cost.  With DevOps this is typically a significant cost.  So before undertaking a long and never-ending “Build” journey, it makes sense to look at all the costs — in money and time — of buying a complete monitoring solution like SPM vs. building your own system. The closer you look, the more value a tool like SPM offers you and your organization.

Try SPM for Free for 30 Days

Tired of building, and building, and building…  Try SPM Performance Monitoring for Free for 30 days by registering here.  There’s no commitment and no credit card required.

Apache Spark Monitoring in SPM

Apache Spark is an open-source engine for large-scale data processing that integrates with the Hadoop ecosystem (including HDFS) and can run applications up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.  So it’s not surprising that Spark usage is booming, as a quick look at Google Trends shows.

And while Spark usage has been going through the roof, the engineers and DevOps teams running Spark have not had a good monitoring tool at their disposal.  Well, that is, until now.  By adding Spark monitoring to SPM Performance Monitoring, Alerting and Anomaly Detection, Sematext has brought the first Spark monitoring product to market and filled a big hole in the Spark ecosystem.

Having just been added — along with other goodies — to the latest SPM release, SPM for Spark monitors all Spark metrics.  It includes alerting, anomaly detection, log correlation, custom dashboards, event graphing, custom metrics, and a ton more.  SPM can be installed On Premises, or you can use the Cloud version run by Sematext, in which case setup takes less than 5 minutes before graphs with performance metrics start appearing in real time.

Enough with the words – Show me what Spark Monitoring looks like!

Have a look at a few screenshots to see how we graph Spark metrics in SPM.  While we don’t use Spark at Sematext at this time and thus don’t have a live demo to show you, you can check out SPM’s live demo and see some other types of apps we monitor, such as Hadoop, HBase, Cassandra, Kafka, Storm, ZooKeeper, Elasticsearch, Solr, NGINX and NGINX Plus, Apache, MySQL, Redis, Java webapps and generic Java applications, as well as custom metrics.

Screenshot – Spark Executor metrics

[Screenshot: Spark_screenshot_Executor_3]

Screenshot – Spark Worker metrics

[Screenshot: Spark_screenshot_Worker_2]

And One More Thing…

SPM now works hand-in-hand with Logsene Log Management and Analytics.  This makes the integration of performance metrics, logs, events and anomalies more robust for those of you looking to combine performance monitoring and centralized log management in one place — not only knowing that SOMETHING affected performance of your Spark cluster when you look at your performance metrics graphs or get an alert, but also exactly WHAT happened with the cluster by having immediate access to all relevant Spark event logs right there!

Take a Test Drive — It’s Easy and Free to Get Started

Like what you see here?  Sound like something that could benefit your organization?  Then try SPM and/or Logsene for Free for 30 days by registering here.  There’s no commitment and no credit card required.
