Hiring: Data Mining, Analytics, Machine Learning Hackers

If you want to work with search, big data mining, analytics, and machine learning, and you are a positive, proactive, independent creature, please keep reading.We are looking for devops to hack on Sematext’s new products and services, as well as provide services to our growing list of clients.  Working knowledge of Mahout or statistics/machine learning/data mining background would be a major plus.
 

Skills & experience (the more of these you have under your belt the better):

  • Data mining and/or machine learning (Mahout or …)
  • Big data (HBase or Cassandra or Hive or …)
  • Search (Solr or Lucene or Elastic Search or …)

More about an ideal you:

  • You are well organized, disciplined, and efficient
  • You don’t wait to be told what to do and don’t need hand-holding
  • You are reliable, friendly, have a positive attitude, and don’t act like a prima donna
  • You have an eye for detail, don’t like sloppy code, poor spelelling and typous
  • You are able to communicate complex ideas in a clear fashion in English, clean and well designed code, or pretty diagrams

Optional bonus points:

  • You like to write or speak publicly about technologies relevant to what we do
  • You are an open-source software contributor

A few words about us:

We work with search and big data (Lucene, Solr, Nutch, Hadoop, MapReduce, HBase, etc.) on a daily basis and we present at conferences.  Our projects with external clients range from 1 week to several months.  Some clients are small startups, some are large international organizations.  Some are top secret.  New customers knock on our door regularly and this keeps us busy at pretty much all times.  When we are not busy with clients we work on our products.  We run search-lucene.com and search-hadoop.com.  We participate in open-source projects and publish monthly Digest posts that cover Lucene, Solr, Nutch, Mahout, Hadoop, Hive, and HBase.  We don’t write huge spec docs, we work in sprints, we multitask, and try our best to be agile. We send people to conferences, trainings (Hadoop, HBase, Cassandra), and certifications (2 of our team members are Cloudera Certified Hadoop Developers).

We are a small and mostly office-free, highly distributed team spanning 3 continents and 6 countries.  We communicates via email, Skype voice/IM, BaseCamp.  Some of our developers are in Eastern Europe, so we are especially open to new team members being in that area, but we are also interested in good people world-wide, from South America to Far East.

Interested? Please send your resume to jobs @ sematext.com feel free to check out our other positions.

Google Summer of Code and Intern Sponsoring

Are you a student and looking to do some fun and rewarding coding this summer? Then join us for the 2011 Google Summer of Code!

The application deadline is in less than a month! Lucene has identified initial potential projects, but this doesn’t mean you can also pick your own.  If you need additional ideas, look at our Lucene / Solr for Academia: PhD Thesis Ideas (or just the spreadsheet if you don’t want to read the what and the why),  just be sure to discuss with the community first (send an email to dev@lucene.apache.org).

We should also add that, separately from GSoC, Sematext would be happy to sponsor good students and interns interested in work on projects involving search (Lucene, Solr), machine learning & analytics (Mahout), big data (Hadoop, HBase, Hive, Pig, Cassandra), and related areas. We are a virtual and geographically distributed organization whose members are spread over several countries and continents and we welcome students from all across the globe.  For more information please inquire within.

HBase Backup Options

If you are thinking about using HBase you will likely want to understand HBase backup options.  I know we did, so let us share what we found.  Please let us know what we missed and what you use for HBase backup!

Export

You could export your tables using the Export (org.apache.hadoop.hbase.mapreduce.Export) MapReduce job that will export the table data into a Sequence File on HDFS.  This was implemented in HBASE-1684 if you want to check out the patch or comments there.  This tool works on one table at a time, so if you need to backup multiple tables, run this on each table.  The exported data can then be imported back into HBase by the Import tool.

Copy Table

If you have another HBase cluster that you want to treat as a backup cluster, you can use the handy CopyTable tool to copy a table at a time.

Distcp

You could use Hadoop’s distcp command to copy the whole /hbase directory from one HDFS cluster to the other.  However, this can leave your data in an inconsistent state, so it should be avoided.  See http://search-hadoop.com/m/wkMgSjVLDb

At this point we should point out that all of the above backup methods are per-table.  Moreover, they don’t work or create a snapshot of the table.   Export and CopyTable are atomic only at the row level.  Furthermore, if you have multiple tables whose tables depend on each other, if they are being modified while you are exporting or copying them, you will end up with inconsistent data – the data in those tables will not be in sync.  See http://search-hadoop.com/m/Q4bU81G116p.

Backup from Mozilla

Because of the above mentioned issues with distcp when running it over a cluster whose data is being modified while distcp is running, developers at Mozilla came up with their own Backup tool.  They’ve described the tool and its use in the popular Migrating HBase in the Trenches post.

Cluster Replication

HBase has a relatively new and not yet widely used whole cluster replication mechanism.  The backup cluster does not have to be identical to the master cluster, which means that the backup cluster could be much less powerful and thus cheaper, while still having enough storage to serve as backup.

Table Snapshot

Ah, the infamous HBASE-50!  This issue saw some great work during GSoC 2010, but it looks like it was never integrated into HBase.  It is unclear whether the contributor simply ran out of steam or time or whether it became apparent that table snapshots are too difficult to implement or simply not doable because of highly distributed nature of HBase.  The JIRA issue does contain patches you can look at, and the author has a now inactive hbase-snapshot repository up on Github.

HDFS Replication

You could also simply crank up the replication factor to the level that makes you feel safe and call that a backup.  This may not guard against data corruption, but it does guard against certain partial hardware failures.

Since so many people seem to be asking about HBase backup options, I hope this serves as a good point-in-time snapshot, a summary of all HBase backup options that are currently on the table.  With time, this will be added to the HBase Book.

Are there other HBase backup options we should have included?

What you use for HBase backup?

Deferring Processing Updates to Increase HBase Write Performance

In this post we’ll examine, via an example, how deferring the processing of updates can increase write throughput of a data store. In the end we’ll introduce our recently open-sourced HBaseHUT, a tool for automating of deferred updates in HBase.

Please note, that “deferred updates” here does not mean that updates are not “visible” immediately after a write operation. The system still serves add/update/delete and read operations in real-time.

Plug: if this sort of stuff interests you, we are hiring people who know about Hadoop, HBase, MapReduce…

Constraints & Assumptions

The decisions and ideas in the post are based on facts about HBase, but can be applied to many other data stores with similar characteristics:

  • Fetching a set of records as a range (scan operation) is much cheaper than fetching the same records one-by-one
  • Adding new records (simple write operation) is very fast
  • Storage space is cheap
  • “In-place” updates of records are not generally feasible

Well, the last assumption is a bit tricky: HBase provides incrementColumnValue() operation but it is very limited: one can only increment a specified cell that contains long value by some delta. In future versions HBase will allow incrementing of multiple cells at once (already in HBase trunk). However, not all update operations are that simple, as we’ll see in the example. Moreover, incrementing cells in a record is slower than writing a new record (we’ll use this fact later in the post).

Idea Explained

The idea behind deferred updates processing is to postpone updating of the existing record and store incoming deltas as a new record. Thus, record update operations become a simple write operations with corresponding performance. Deferred updates technique elaborated here fits well when system handles a lot of updates of stored data and write performance is the main concern, while reading speed requirements are not that strict. The following cases (each of them separately or any combination of them) may indicate  that one can benefit from using the technique (the list is not complete):

  • updates are well spread over the whole and large dataset
  • lower rate (among write operations) of “true updates” (i.e. low percentage of writes are for completely new data, not really updates of existing data)
  • good portion of data stored/updated may never be accessed
  • system should be able to handle high write peaks without major performance degradation

Next, let’s look at an example use-case to illustrate the idea better.

Example Use-Case

Imagine a simple presentation slides sharing system that gives one the ability to browse presentations on-line (not as downloadable PDF or whatever format files, since that would not be interesting for us here). Suppose we want to provide the presentation’s author with details on user behaviour when a user navigates through slides so that the author can use this data to infer the quality of the presentation (hey, slide sharing systems maintainers: free hint!). Let’s limit statistics data we want to share to simple metrics for the sake of keeping the example simple. Here’s what we can provide:

  • number of times a particular slide was viewed
  • avg/min/max time spent by user on a particular slide

Using these stats the presentation author can find out e.g.:

  • what are the most difficult to understand slides (they probably need to be divided into multiple ones, or better preparation before showing them should be done by giving more/better info on previous slides)
  • what are the most trivial slides (they can be merged/removed)
  • what are the typical users (e.g., If users spend a lot of time thinking on the slides the author can add more details in the presentation.  On the other hand, if users usually quickly walk through it then some very deep details can be removed)

Such a system could provide more stats to authors so that they can work on other sorts of presentation improvements.  For example, the system could provide information about user navigation actions (e.g., if users often go back and forth between two slides then probably the schemes/diagrams on them are better understandable if they are on the same slide and can be easily compared), but that’s not the focus of this post.

To track previously mentioned stats data we need to capture how much time a user spent on a particular slide and send this data to our data store. Let’s assume our records store the following info: {(key)presentationId_slideId, number of views, total viewing time, max viewing time, min viewing time}. Using this data we can provide useful information to author as needed.

Every time user leaves the slide, tracking logic sends the next data to the data store: {(key)presentationId_slideId, time spent on viewing}.

Direct Solution

The straight forward approach would be to update the presentationId_slideId record (fetch, modify and store updated data) as new data comes in. Here are some drawbacks of this naive approach:

  • for each update operation we have to perform fetch and write operations (Get and Put in case of HBase)
  • Most authors won’t really care about all these stats and majority of them will never look at it, so many processing operations can simply be omitted (until there’s interest in stats)
  • for first ever viewing of a slide a redundant fetch operation is performed (data is not there yet), which is OK if slides are viewed many times but in situation when many presentations are viewed just one time (e.g. by their creators) this may result in a bigger portion of redundant operations (this assumption is sort of made up, yes)

The very first point is a major one and is the obvious write performance killer.  This point is really not made up – we have first hand experience with that particular situation.

Deferring Updates Processing

One alternative solution is to defer updates processing to the time when data is requested by presentation author or to the time when the scheduled/periodic updates processing is performed (ideally when data store write load is minimal). In this deferred updates approach all new deltas (new user action data) are written as new records.

When an author requests the stats, efficient updates processing takes place. This time it is done in much more efficient way (compared to direct approach): by using range scan operations, and not by individual record read/writes. Given that presentation has usually tens of slides and is viewed by hundreds or thousands people such range scans are so fast the user will barely notice them. Also, it should be quite OK to ask a person to wait several seconds during his first view of stats page (processed updates are written back into the data store, so next time user fetches data it is available right away).

To prevent too long updates processing on user request the data can be processed periodically instead of at request time. Given the fact that (updated) statistics data is available to user immediately, the processing can be postponed for a long time without affecting the “real-time” visibility aspect of the system. The processing of updates can be scheduled to take place during off-peak hours, when they won’t compromise write performance. By doing this we add a “cushion” since from this point on the maximum amount of data that needs to be processed “on-the-fly” is one day ‘s worth of updates (assuming daily updates processing) , which is sufficient for the majority of cases (e.g. estimated max presentation views per day is tens of thousands which is a very quick scan operation).

The remainder of this post provides details of our implementation of this approach.

HBaseHUT

HBaseHUT is a tool that automates deferred processing of updates for HBase. It hides many details of updates processing from the client code, thus making it an easy to use solution. There are Put and ResultScanner wrappers that encapsulate all needed logic.

To write new data you need to use HutPut implementation of Put. Since it implements the standard HBase Put only the instance creation code needs to be changed:

public class SlideStatsTracker {
  byte[] SLIDE_CF = Bytes.toBytes("slide");
  byte[] VIEWS_NUM_C = Bytes.toBytes("views");
  byte[] TOTAL_VTIME_C = Bytes.toBytes("total_time");
  byte[] MIN_VTIME_C = Bytes.toBytes("min_time");
  byte[] MAX_VTIME_C = Bytes.toBytes("max_time");
  ...
  void trackSlideView(byte[] presentationId, byte[] slideId, long timeSpent)
                               throws InterruptedException, IOException {
    Put put = new HutPut(Bytes.add(presentationId , slideId));
    put.add(SLIDE_CF, VIEWS_NUM_C, Bytes.toBytes(1));
    put.add(SLIDE_CF, TOTAL_VTIME_C, Bytes.toBytes(timeSpent));
    hTable.put(put);
  }
  …
}

Fetching stats is performed using HutResultScanner which implements ResultScanner and hence can also be very easily integrated into the code that uses normal HBase API:

  Scan scan = new Scan(presentationId);
  ResultScanner resultScanner =
      new HutResultScanner(hTable.getScanner(scan), updateProcessor);
  Result result = resultScanner.next();
  while (result != null) {
    // fetch data from result object
    result = resultScanner.next();
  }

The updateProcessor passed as parameter encapsulates update operation logic:

  class SlideStatsUpdateProcessor implements UpdateProcessor {
    @Override
    public void process(Iterable records, UpdateProcessingResult processingResult) {
      int viewsNumber = 0;
      long totalViewTime = 0;
      long minViewTime = Long.MAX_VALUE;
      long maxViewTime = Long.MIN_VALUE;
      // Processing records
      for (Result record : records) {
        for (int i = 0; i < 5; i++) {
          int views = Bytes.toInt(record.getValue(SLIDE_CF, VIEWS_NUM_C));
          long totalTime = Bytes.toLong(record.getValue(SLIDE_CF, TOTAL_VTIME_C));
          long minTime = Bytes.toLong(record.getValue(SLIDE_CF, MIN_VTIME_C));
          long maxTime = Bytes.toLong(record.getValue(SLIDE_CF, MAX_VTIME_C));

          viewsNumber += views;
          totalViewTime += totalTime;
          minViewTime = minTime  maxViewTime ? maxTime : maxViewTime;
        }
      }

      processingResult.add(SLIDE_CF, VIEWS_NUM_C, Bytes.toBytes(viewsNumber));
      processingResult.add(SLIDE_CF, TOTAL_VTIME_C, Bytes.toBytes(totalViewTime));
      processingResult.add(SLIDE_CF, MIN_VTIME_C, Bytes.toBytes(minViewTime));
      processingResult.add(SLIDE_CF, MAX_VTIME_C, Bytes.toBytes(maxViewTime));
    }
  }

As you one can see, HBaseHUT abstraction doesn’t change HBase API and that makes it very easy to use in new code or integrate it in old code.

You can find more details about HBaseHUT features on the project’s wiki. HBaseHUT is a new, recently emerged project – your feedback/questions/ideas/feature requests/forks/pull requests, etc. are all very welcome. Please start new discussions on HBaseHUT’s mailing list.

If you read this far, we’d like to talk to you – we are hiring smart people who know about Hadoop, HBase, MapReduce, want to do large scale data (stream) processing, and build tools like HBaseHUT.

Hiring Search and Data Analytics Engineers

We are growing and looking for smart people to join us either in an “elastic”, on-demand, per-project, or more permanent role:

Lucene/Solr expert who…

  • Has built non-trivial applications with Lucene or Solr or Elastic Search, knows how to tune them, and can design systems for large volume of data and queries
  • Is familiar with (some of the) internals of Lucene or Solr or Elastic Search, at least on the high level (yeah, a bit of an oxymoron)
  • Has a systems/ops bent or knows how to use performance-related UNIX and JVM tools for analyzing disk IO, CPU, GC, etc.

Data Analytics expert who…

  • Has used or built tools to process and analyze large volumes of data
  • Has experience using HDFS and MapReduce, and have ideally also worked with HBase, or Pig, or Hive, or Cassandra, or Voldemort, or Cascading or…
  • Has experience using Mahout or other similar tools
  • Has interest or background in Statistics, or Machine Learning, or Data Mining, or Text Analytics or…
  • Has interest in growing into a Lead role for the Data Analytics team

We like to dream that we can find a person who gets both Search and Data Analytics, and ideally wants or knows how to marry them.

Ideal candidates also have the ability to:

  • Write articles on interesting technical topics (that may or may not relate to Lucene/Solr) on Sematext Blog or elsewhere
  • Create and give technical talks/presentations (at conferences, local user groups, etc.)

Additional personal and professional traits we really like:

  • Proactive and analytical: takes initiative, doesn’t wait to be asked or told what to do and how to do it
  • Self-improving and motivated: acquires new knowledge and skills, reads books, follows relevant projects, keeps up with changes in the industry…
  • Self-managing and organized: knows how to parcel work into digestible tasks, organizes them into Sprints, updates and closes them, keeps team members in the loop…
  • Realistic: good estimator of time and effort (i.e. knows how to multiply by 2)
  • Active in OSS projects: participates in open source community (e.g. mailing list participation, patch contribution…) or at least keeps up with relevant projects via mailing list or some other means
  • Follows good development practices: from code style to code design to architecture
  • Productive, gets stuff done: minimal philosophizing and over-designing

Here are some of the Search things we do (i.e. that you will do if you join us):

  • Work with external clients on their Lucene/Solr projects.  This may involve anything from performance troubleshooting to development of custom components, to designing highly scalable, high performance, fault-tolerant architectures.  See our services page for common requests.
  • Provide Lucene/Solr technical support to our tech support customers
  • Work on search-related products and services

A few words about us:

We work with search and big data (Lucene, Solr, Nutch, Hadoop, MapReduce, HBase, etc.) on a daily basis.  Our projects with external clients range from 1 week to several months.  Some clients are small startups, some are large international organizations.  Some are top secret.  New customers knock on our door regularly and this keeps us busy at pretty much all times.  When we are not busy with clients we work on our products.  We run search-lucene.com and search-hadoop.com.  We participate in open-source projects and publish monthly Digest posts that cover Lucene, Solr, Nutch, Mahout, Hadoop, and HBase.  We don’t write huge spec docs, we work in sprints, we multitask, and try our best to be agile. We send people to conferences, trainings (Hadoop, HBase, Cassandra), and certifications (2 of our team members are Cloudera Certified Hadoop Developers).

We are a small and mostly office-free, highly distributed team that communicates via email, Skype voice/IM, BaseCamp.  Some of our developers are in Eastern Europe, so we are especially open to new team members being in that area, but we are also interested in good people world-wide, from South America to Far East.

Interested? Please send your resume to jobs @ sematext.com.

Search Analytics: Hadoop World Presentation

After our Lucene Revolution talk in Boston, we got ready for last week’s Hadoop World conference in New York.  Like at the Lucene Revolution, we presented to a packed room of 200+ people. The topic of our talk was the Search Analytics tool we’ve built with the help of Flume, HBase, MapReduce, and other open-source tools, and which are now starting to use for search-hadoop.com and search-lucene.com.  If you couldn’t make it to Hadoop World, have a look at our presentation below.  And if you’d like to work on Search, Analytics, and related areas, we’re looking for good people world-wide – see our jobs page.  Enjoy!

Sematext at Hadoop World: Search Analytics with Flume and HBase

Besides working with search in general and Lucene/Solr/Nutch in particular, we also work with Hadoop, HBase, and other related technologies.  This October we’ll be presenting at Hadoop World (see the schedule).  The title of our talk is Search Analytics with Flume and HBase. Here’s the abstract:

In this talk we will show how we use Flume to transport search and clickstream data to HBase with the ultimate goal of producing Search Analytics reports using that data.  We’ll show how data flow through the system from the moment a query or click event is captured in the search application UI, until it lands in HBase via Flume’s HBase sink.  We’ll also share information about what this system looked like in the pre-Flume days.  Finally we’ll demonstrate various reports the system ultimately produces and insight we derive from them.

So, if you are interested in search and analytics, and especially the mix of the two, come see us this October in New York.  If you can’t wait until then or can’t make it to New York, and need help with search and/or analytics, let us know!

HBase Digest, August 2010

The second “developer release”, hbase-0.89.201007d, is now available for download. To remind everyone, there are currently two active branches of HBase:

  • 0.20 – the current stable release series, being maintained with patches for bug fixes only.
  • 0.89 – a development release series with active feature and stability development, not currently recommended for production use.

First one doesn’t support HDFS durability (edits may be lost in the case of node failure) whereas the second one does. You can find more information at this wiki page.  HBase 0.90 release may happen in October!  See info from developers.

Community trends & news:

  • New HBase AMIs are available for dev release and 0.20.6.
  • Looking for some GUI that could be used for browsing through tables in HBase? Check out Toad for Cloud, watch for HBase-Explorer and HBase-GUI-Admin.
  • How many regions a RegionServer can support and what are the consequences of having lots of regions in a RegionServer? Check info in this thread.
  • Some more complaints to be aware of regarding HBase performing on EC2 in this thread. For those who missed it, more on Hadoop & HBase reliability with regard to EC2 in our March digest post.
  • Need guidance in sizing your first Hadoop/HBase cluster? This article will be helpful.

FAQ:

  • Where can I find information about data model design with regard to HBase?
    Take a look at http://wiki.apache.org/hadoop/HBase/HBasePresentations.
  • How can I perform SQL-like query “SELECT … FROM …” on HBase?
    First, consider that HBase is a key-value store which should be treated accordingly. But if you are still up for writing ad-hoc queries in your particular situation take a look at Hive & HBase integration.
  • How can I access Hadoop & HBase metrics?
    Refer to HBase Metrics documentation.
  • How to connect to HBase from java app running on remote (to cluster) machine?
    Check out client package documentation. Alternatively, one can use the REST interface: Stargate.

HBase Case-Study: Using HBaseTestingUtility for Local Testing & Development

Motivation

As HBase becomes more mature there’s is a growing demand for tools and methods for making development process easier – here at Sematext (@sematext) we’ve gone through our own per aspera ad astra learning process in addition to Cloudera’s Hadoop trainings and certifications. In this post we share what we’ve learned and show how one can HBaseTestingUtility for this.

Suppose there is a system that deals with processing data stored in HBase and displaying stored data via reporting application. Data processing is done using Hadoop MapReduce jobs. During development, it would be desirable to be able to:

  • debug MapReduce jobs in an IDE
  • run reporting application locally (on developer’s machine, without setting up a cluster) with possibility of debugging in IDE
  • easily access data stored in HBase for debugging purposes (easily means “naturally” as if all rows are in a text file)

Disclaimer

Described use-case and solution are just one option, an option that makes use of HbaseTestingUtility and underlying “mini” clusters. Depending on the context, this solution might not be the most optimal, but it is a good fit for presenting the ideas. This solution and this post should encourage developers to look at HBase’s unit-test sources when constructing their own tests and/or when finding ways for easier debugging & development.

Problem Details

In our example there are two tables in HBase: one with raw data and another with processed data.  Let’s call them RawDataTable and ProcessedDataTable. We import data into RawDataTable via simple importing MapReduce job which initally takes data from a log file. Subsequently, another MapReduce job processes data in that table and stores the outcome into ProcessedDataTable. We use HBase Scan and Get operations to access the processed data from the client.

Solution

As stated in javadocs, HBaseTestingUtility is a “facility for testing HBase”. Its description comes with a bit more of explanation: “Create an instance and keep it around doing HBase testing. This class is meant to be your one-stop shop for anything you mind need testing. Manages one cluster at a time only.” In this post we describe one possible way of how to use it to achieve the goals described above.

Processing Data

Step 1: Init cluster.

The following code starts “local” cluster and creates two tables:

private final HBaseTestingUtility testUtil = new HBaseTestingUtility();
private HTable rawDataTable;
private HTable processedDataTable;
…
void initCluster() throws Exception {
  testUtil.getConfiguration().addResource("hbase-site-local.xml");
  testUtil.getConfiguration().reloadConfiguration();
  // start mini hbase cluster
  testUtil.startMiniCluster(1);
  // create tables
  rawDataTable = testUtil.createTable(RAW_TABLE_NAME, RAW_TABLE_COLUMN_FAMILIES);
  processedDataTable = testUtil.createTable(PROCESSED_TABLE_NAME, PROCESSED_TABLE_COLUMN_FAMILIES);
  // start MR cluster
  testUtil.startMiniMapReduceCluster();
}

testUtil.startMiniCluster(1) means start cluster with 1 datanode and 1 regionserver. You can start cluster with greater number of servers for test purposes.

Step 2: Import Data

We use simple map-only job for import data. Please refer to org.apache.hadoop.hbase.mapreduce.ImportTsv class for an example of such a job. The following code runs the job that uses locally stored files (e.g. a part of the log file of reasonable size) on just created cluster:

String[] importJobArgs = new String[] {RAW_TABLE_NAME, "file://" + inputFile};
if (!MyImportJob.createSubmittableJob(testUtil.getConfiguration(), importJobArgs).waitForCompletion(true)) {
  System.exit(1);
}

Step 3: Process Data

To process data in RawDataTable we run an appropriate MapReduce job in the same way as during the import:

if (!ProcessLogsJob.createSubmittableJob(testUtil.getConfiguration(), processLogsJobArgs).waitForCompletion(true)) {
  System.exit(1);
}

Step 4: Persist Processed Data

Since we need processed data during our reporting application development and debugging we persist it in some local file. In order to have “easy” access to this data during debugging it makes sense to store table data in a text file in a readable form (so that we could perform “grep” and other handy commands). So we actually write to two files at once. The Result class implements Writable interface, so there is a natural way to serialize its data.

BufferedWriter bw = ...;
DataOutputStream dos = ...;
ResultScanner rs = processedDataTable.getScanner(new Scan());
Result next = rs.next();
while (next != null) {
  next.write(dos);
  bw.write(getHumanReadableString(next));
  bw.newLine();
  next = rs.next();
}

After this step, the processed data is stored on the local disk and can be used for running the reporting application. Importing and processing of data is performed locally and is thus easier to debug.
In order to add extra processed data incrementally to the already stored data, instead of rewriting it from scratch, we need to load it from the file after cluster initialization as described in the following section.

Fetching Data

In order to make reporting application run on “local” cluster instead of the “true” one, we create an alternative HTable factory. Reporting application code uses a single HTable object instantiated by the factory during its whole lifecycle – this is the best practice for minimizing creation of HTable objects.

Step 1: Init cluster.

This step is exactly the same as described previously.

Step 2: Load processed data.

We use a file created during processing data stage to load the data back into just initialized cluster:

DataInputStream dis = ...;
Result next = new Result();
next.readFields(dis);
while (next.getRow() != null) {
  Put put = new Put(next.getRow());
  for (KeyValue kv : next.raw()) {
    put.add(kv);
  }
  processedDataTable.put(put);
  next = new Result();
  try {
    next.readFields(dos);
  } catch (EOFException e) {
    // file went to an end.
    break;
  }
}

After data is all loaded, the constructed processedDataTable can be used by the reporting application code. The app can now also be started and debugged easily from an IDE.

Next Steps

Internally HBaseTestingUtility makes use of a whole bunch of “mini” clusters: MiniZooKeeperCluster, MiniDFSCluster, MiniHBaseCluster and MiniMRCluster. Refer to the unit-test implementations in the source code of respective projects to get more examples on how to use them.

Thank you for reading, we hope you found this useful.  Follow @sematext on Twitter to be notified of new posts on Hadoop, HBase, Lucene, Solr, Mahout, and other related topics.

HBase Digest, June 2010

HBase 0.20.5 is out! It fixes 24 issues since the 0.20.4 release. HBase developers “recommend that all users, particularly those running 0.20.4, upgrade to this release”.

Community trends:

  • There’s a clear need in “sanity check DNS across my cluster” tool as a lot of questions/help requests related to the name/address resolution in the cluster are submitted over time. Any volunteers?
  • Bulk incremental load into an existing table feature (HBASE-1923) is commited to trunk. No multi-family support still.
  • Good number of advice about increasing the write performance/speed in this thread, including shared numbers/techniques from a large production cluster.
  • A set of ORM tools to consider for HBase are suggested here.

Notable efforts:

FAQ:

  • Common issue: tables/data disappears after system restart. Usually people face it when playing with HBase for the first time and even on the single node set-up. The problem is that by default HDFS is configured to store its data in the /tmp dir which might get cleaned up by OS. Configure “dfs.name.dir” and “dfs.data.dir” properties in hdfs-site.xml to aviod these problems.
Follow

Get every new post delivered to your Inbox.

Join 706 other followers