Announcing Hadoop Monitoring in SPM

Take it from one of the most trusted names in the world of Hadoop and HBase, as well as one of the friendliest people you’ll encounter on the Hadoop conference circuit, Lars George from Cloudera:

Hadoop Club Monitoring

We’re happy to announce the immediate availability of SPM for Hadoop (see Sneak Peek: Hadoop Monitoring comes to SPM for some screenshots). With the latest SPM release, Hadoop joins Apache Solr, Apache HBase, ElasticSearch, Sensei, and the JVM on the list of technologies you can monitor with SPM. With SPM for Hadoop you go from zero to seeing all key metrics for your Hadoop cluster in just a few minutes. The reports cover both HDFS and MapReduce: metrics for NameNode, JobTracker, TaskTracker, and DataNode are all included, along with all the default server metrics. The YARN version of Hadoop is also supported and includes metrics for NodeManager, ResourceManager, etc.

Don’t forget that the SPM monitoring agent can run as an in-process agent, as well as in standalone mode (i.e., as an external process). Running in standalone mode means you may not have to restart the daemons of the existing Hadoop cluster you want to monitor (assuming you have already enabled JMX), so you can quickly get to your Hadoop metrics without interrupting anything!

What else would you like us to monitor?  Please select your candidates!

Poll Results: Hadoop YARN vs. pre-YARN

Back in April 2013 there was a poll in the Hadoop Users LinkedIn group:

YARN or pre-YARN – which version of Hadoop are you using?

Because we were working on adding Hadoop monitoring to SPM, this was an important question for us – which version of Hadoop should SPM be able to monitor?

Here are the results of that poll:

Hadoop MRv1 vs. Hadoop YARN


As we can see, most Hadoop users are still using the old version of Hadoop and are not using YARN. The percentage in the “YARN” bar at the top is partially hidden, but it’s 13%: only 13% of the Hadoop users who responded are using Hadoop YARN. Combined with the 17% who said they are moving to YARN, that makes 30% altogether. That is still only about half the number of Hadoop MRv1 users, but if we asked the same question in early 2014 we would likely see a close tie.

So which version of Hadoop are we supporting in SPM?  Both!  With SPM you can monitor both Hadoop MRv1 and Hadoop YARN.  And if you are using pre-YARN Hadoop today and want to switch to Hadoop YARN later, that’s not a problem for SPM.

Sneak Peek: Hadoop Monitoring comes to SPM

When it comes to Hadoop, they say you’ve got to monitor it and then monitor it some more.  Since our own Performance Monitoring and Search Analytics services run on top of Hadoop, we figured it was time to add Hadoop performance monitoring to SPM.  So here is a sneak peek at SPM for Hadoop.  If you’d like to try it on your Hadoop cluster, we’ll be sending invitations soon and you can get on the private beta list starting today!

In the meantime, here is a small sample of pretty self-explanatory reports from SPM for Hadoop, so you can get a sense of what’s available. There are, of course, a number of other Hadoop-specific reports included, as well as server reports, filtering, alerting, multi-user support, report sharing, etc.

Please don’t forget to tell us what else you would like us to monitor (select your candidates), and if you like what you see and want a good monitoring tool for your Hadoop cluster, please sign up for the private beta now.

Click on any graph to see it in its full size and high quality.

Hadoop NameNode Files


Hadoop DataNode Read-Write


Hadoop JobTracker MapReduce Runtime


Hadoop TaskTracker Tasks


What else would you like us to monitor with SPM?  Please select your candidates!

For announcements, promotions, discounts, service status, milk, cookies, and other goodies follow @sematext.

Announcing HBase Refcard

We’re happy to announce the very first HBase Refcard proudly authored by two guys from Sematext.  We hope people will find the HBase Refcard useful in their work with HBase, along with the wonderful Apache HBase Reference Guide.  If you think the refcard is missing some important piece of information that deserves to be included or that it contains superfluous content, please do let us know! (e.g., via comments here)

Data Engineer Position at Sematext International

If you’ve always wanted to work with Hadoop, HBase, Flume, and friends and build massively scalable, high-throughput distributed systems (like our Search Analytics and SPM), we have a Data Engineer position that is all about that!  If you are interested, please send your resume to jobs@sematext.com.

Responsibilities:

  • Versatile architect and developer – design and build large, high-performance, scalable data processing systems using Hadoop, HBase, and other big data technologies
  • DevOps fan – run and tune large data processing production clusters
  • Tool maker – develop ops and management tools
  • Open source participant – keep up with development in areas of cloud and distributed computing, NoSQL, Big Data, Analytics, etc.

Pluses:

  • a solid background in Math, Statistics, Machine Learning, or Data Mining is not required but is a big plus
  • experience with Analytics, OLAP, Data Warehouse or related technologies is a big plus
  • ability and desire to expand and lead a data engineering team
  • ability to think in both business and engineering terms
  • ability to build products and services based on observed client needs
  • ability to present in public, at meetups, conferences, etc.
  • ability to contribute to blog.sematext.com
  • active participation in open-source communities
  • desire to share knowledge and teach
  • positive attitude, humor, agility, high integrity, low ego, and attention to detail

Location:

  • New York

We’re small and growing.  Our HQ is in Brooklyn, but our team is spread over 4 continents.  If you follow this blog you know we have deep expertise in search and big data analytics and that our team members are conference speakers, book authors, Apache members, open-source contributors, etc.

Relevant pointers:

Hadoop 1.0.0 – Extra Notes

The big Hadoop 1.0.0 release has arrived. The dev team’s general notes about the release include:

  • security
  • Better support for HBase (append/hsync/hflush, and security)
  • webhdfs (with full support for security)
  • performance enhanced access to local files for HBase
  • other performance enhancements, bug fixes, and features

You can also find the complete release notes here and see all fixes, improvements and new features included in the release. To save you time, please find below additional information about some of the items that attracted our attention from the Hadoop 1.0.0 release.

Cluster Management Optimizations

HADOOP-7728 - hadoop-setup-conf.sh should be modified to enable task memory manager
Adds options for managing the memory usage of MR tasks. In particular, it allows setting the maximum memory usage for map tasks and for reduce tasks separately.
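For a rough idea of what separate map and reduce limits look like in configuration, here is a minimal Python sketch that renders a mapred-site.xml style fragment. The two property names are the per-task limits as we understand them from the Hadoop 1.x MapReduce documentation; whether hadoop-setup-conf.sh writes exactly these properties is an assumption, and the values are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Property names as documented for Hadoop 1.x (assumed to be the ones that
# matter here); the values are illustrative, not recommendations.
TASK_MEMORY_LIMITS_MB = {
    "mapred.cluster.max.map.memory.mb": 2048,
    "mapred.cluster.max.reduce.memory.mb": 4096,
}


def render_properties(props):
    """Return a <configuration> XML string with one <property> per entry."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_properties(TASK_MEMORY_LIMITS_MB))
```

The point is only that the map and reduce limits are independent settings, so a cluster can give reducers more headroom than mappers.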

Performance Improvements

HDFS-2246 - Shortcut a local client reads to a Datanodes files directly
This is a short-term solution for HDFS-347, “DFS read performance suboptimal when client co-located on nodes with data”, which is quite a hot issue in the Hadoop dev community nowadays. NOTE: by default this optimization is switched off (or is it? Update: it is not, see the comments), so some config adjustments are required to benefit from it. And you will definitely want to benefit from it: some have reported twofold I/O performance improvements. Also highly recommended for HBase users.
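To check whether a given hdfs-site.xml turns this optimization on explicitly, something like the sketch below could be used. The property names dfs.client.read.shortcircuit and dfs.block.local-path-access.user are the ones we believe HDFS-2246 introduced; the sample XML, the hbase user value, and the helper names are hypothetical, and the sketch deliberately says nothing about what the default is when the property is absent.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real hdfs-site.xml would be read from disk instead.
SAMPLE_HDFS_SITE = """
<configuration>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.block.local-path-access.user</name>
    <value>hbase</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Parse Hadoop-style configuration XML into a name -> value dict."""
    root = ET.fromstring(xml_text.strip())
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}


def short_circuit_explicitly_enabled(props):
    """True only if the property is present and set to 'true'."""
    value = props.get("dfs.client.read.shortcircuit") or ""
    return value.strip().lower() == "true"


if __name__ == "__main__":
    props = read_properties(SAMPLE_HDFS_SITE)
    print("short-circuit reads enabled:", short_circuit_explicitly_enabled(props))
    print("allowed local user:", props.get("dfs.block.local-path-access.user"))
```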

HDFS-895 - Allow hflush/sync to occur in parallel with new writes to the file
Previously, if an hflush/sync was in progress, an application could not write data to the HDFS client buffer. Again, we stress this improvement for HBase users, as it increases the write throughput of the transaction log in HBase.

MAPREDUCE-2494 - Make the distributed cache delete entries using LRU priority
Previously, when a certain threshold was reached and the distributed cache was being purged, the implementation deleted all entries that were not currently in use. With the new code, more hot data can be left in the cache (the percentage is configurable), which decreases cache misses.
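The difference between the two purge strategies can be illustrated with a toy model. This is not Hadoop's code: the function name, the keep_percent parameter, and the file names and sizes are all made up for illustration, and the cache is modeled as an OrderedDict ordered from least to most recently used.

```python
from collections import OrderedDict


def purge_lru(entries, in_use, max_bytes, keep_percent):
    """Evict least recently used, unused entries once max_bytes is exceeded.

    entries: OrderedDict of name -> size in bytes, oldest first (mutated).
    Eviction stops as soon as the cache is down to keep_percent of max_bytes,
    so the most recently used entries survive. Returns the evicted names.
    """
    total = sum(entries.values())
    if total <= max_bytes:
        return []  # threshold not reached, nothing to purge
    target = max_bytes * keep_percent // 100
    evicted = []
    for name in list(entries):  # iterate oldest first
        if total <= target:
            break
        if name in in_use:
            continue  # never delete entries that are currently in use
        total -= entries.pop(name)
        evicted.append(name)
    return evicted


if __name__ == "__main__":
    cache = OrderedDict([("a.jar", 40), ("b.jar", 30), ("c.jar", 20), ("d.jar", 20)])
    gone = purge_lru(cache, in_use={"a.jar"}, max_bytes=100, keep_percent=60)
    print("evicted:", gone)
    print("kept:", list(cache))
```

In this example the old behavior would have removed b.jar, c.jar, and d.jar (everything not in use), while the LRU purge stops once the cache is back at 60% of the limit and keeps the recently used d.jar.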

New Features

HDFS-2316 - [umbrella] WebHDFS: a complete FileSystem implementation for accessing HDFS over HTTP
Allows accessing HDFS over HTTP (read & write)
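As a small illustration, WebHDFS requests are plain HTTP URLs under the /webhdfs/v1 prefix with the operation passed as the op query parameter. The sketch below only builds such URLs; the host name and file path are hypothetical, and port 50070 is assumed here as the NameNode HTTP port of a Hadoop 1.x cluster.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # Hypothetical host and path. Reading uses HTTP GET with op=OPEN;
    # writing uses HTTP PUT with op=CREATE.
    print(webhdfs_url("namenode.example.com", 50070, "/user/alice/data.txt", "OPEN"))
    print(webhdfs_url("namenode.example.com", 50070, "/user/alice/new.txt",
                      "CREATE", overwrite="true"))
```

Any HTTP client can then issue the request, which is what makes WebHDFS handy for tools that cannot embed the Hadoop client libraries.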

MAPREDUCE-3169 - Create a new MiniMRCluster equivalent which only provides client APIs cross MR1 and MR2
Cleaner MR1 & MR2 compatible API for mini MR cluster to be used in unit-tests.

HADOOP-7710 - Create a script to setup application in order to create root directories for application such hbase, hcat, hive etc
Similar to the hadoop-setup-user script, a hadoop-setup-applications script was added to set up root directories for apps to write to (/hbase, /hive, etc.)

Enjoy Hadoop 1.0.0 and we hope you found this quick summary useful!

@sematext

Search Analytics: Business Value & NoSQL Backend Presentation

Last week involved a few late nights for some of us at Sematext – we were busy readying our Search Analytics and Scalable Performance Monitoring services, as well as putting the final touches on our Search Analytics: Business Value & NoSQL Backend presentation for Lucene Eurocon in Barcelona.

In the past we’ve given a few other public talks about Search Analytics and you can check them all out via http://blog.sematext.com/tag/analytics/.
