ProjectHub: Lucene Revolution Presentation

Over the past few weeks we’ve been to two conferences: Lucene Revolution in Boston and Hadoop World in New York.  We presented at both.  At Lucene Revolution we talked about how we built search-hadoop.com and search-lucene.com.  The fact that our presentation room was packed despite a couple of other interesting talks being given at the same time tells us this stuff is interesting to people (or at least the title and the brief description in the conference schedule were attractive). For those of you who were unable to make it to Boston, we are sharing our presentation below.  And for those of you who like to work on Search, Analytics, Machine Learning, and related areas, we’re looking for good people world-wide – see our jobs page.  Enjoy!

Sematext at Hadoop World: Search Analytics with Flume and HBase

Besides working with search in general and Lucene/Solr/Nutch in particular, we also work with Hadoop, HBase, and other related technologies.  This October we’ll be presenting at Hadoop World (see the schedule).  The title of our talk is Search Analytics with Flume and HBase. Here’s the abstract:

In this talk we will show how we use Flume to transport search and clickstream data to HBase with the ultimate goal of producing Search Analytics reports using that data.  We’ll show how data flow through the system from the moment a query or click event is captured in the search application UI, until it lands in HBase via Flume’s HBase sink.  We’ll also share information about what this system looked like in the pre-Flume days.  Finally we’ll demonstrate various reports the system ultimately produces and insight we derive from them.

So, if you are interested in search and analytics, and especially the mix of the two, come see us this October in New York.  If you can’t wait until then or can’t make it to New York, and need help with search and/or analytics, let us know!

Hadoop Digest, August 2010

The biggest announcement of the year: Apache Hadoop 0.21.0 released and is available for download here. Over 1300 issues have been addressed since 0.20.2; you can find details for Common, HDFS and MapReduce. Note from Tom White who did an excellent job as a release manager: “Please note that this release has not undergone testing at scale and should not be considered stable or suitable for production. It is being classified as a minor release, which means that it should be API compatible with 0.20.2.”. Please find a detailed description of what’s new in 0.21.0 release here.

Community trends & news:

  • New branch hadoop-0.20-security is being created. Apart from the security features, which are in high demand, it will include improvements and fixes from over 12 months of work by Yahoo!. The new security features are going to be a very valuable and welcome contribution (also discussed before).
  • A thorough discussion about approaches of backing up HDFS data in this thread.
  • Hive voted to become Top Level Apache Project (TLP) (also here).  Note that we’ll keep Hive under Search-Hadoop.com even after Hive goes TLP.
  • Pig voted to become TLP too (also here).  Note that we’ll keep Pig under Search-Hadoop.com even after Pig goes TLP.
  • Tip: if you define a Hadoop object (e.g. Partitioner, as implementing Configurable, then its setConf() method will be called once, right after it gets instantiated)
  • For those new to ZooKeeper and pressed for time, here you can find the shortest ZooKeeper description — only 4 sentences short!
  • Good read “Avoiding Common Hadoop Administration Issues” article.

Notable efforts:

  • Howl: Common metadata layer for Hadoop’s Map Reduce, Pig, and Hive (yet another contribution from Yahoo!)
  • PHP library for Avro, includes schema parsing, Avro data file and
    string IO.
  • avro-scala-compiler-plugin: aimed to auto-generate Avro serializable classes based on some simple case class definitions

FAQ:

  • How to programatically determine the names of the files in a particular Hadoop/HDFS directory?
    Use FileSystem & FileStatus API. Detailed examples are in this thread.
  • How to restrict HDFS space usage?
    Please, refer to HDFS Quotas Guide.
  • How to pass parameters determined at run-time (i.e. not hard-coded) to Hadoop objects (like Partitioner, Writable, etc.)?
    One option is to define a Hadoop object as implementing Configurable. In this case its setConf() method will be called once, right after it gets instantiated and you can use “native” Hadoop configuration for passing parameters you need.

HBase Case-Study: Using HBaseTestingUtility for Local Testing & Development

Motivation

As HBase becomes more mature there’s is a growing demand for tools and methods for making development process easier – here at Sematext (@sematext) we’ve gone through our own per aspera ad astra learning process in addition to Cloudera’s Hadoop trainings and certifications. In this post we share what we’ve learned and show how one can HBaseTestingUtility for this.

Suppose there is a system that deals with processing data stored in HBase and displaying stored data via reporting application. Data processing is done using Hadoop MapReduce jobs. During development, it would be desirable to be able to:

  • debug MapReduce jobs in an IDE
  • run reporting application locally (on developer’s machine, without setting up a cluster) with possibility of debugging in IDE
  • easily access data stored in HBase for debugging purposes (easily means “naturally” as if all rows are in a text file)

Disclaimer

Described use-case and solution are just one option, an option that makes use of HbaseTestingUtility and underlying “mini” clusters. Depending on the context, this solution might not be the most optimal, but it is a good fit for presenting the ideas. This solution and this post should encourage developers to look at HBase’s unit-test sources when constructing their own tests and/or when finding ways for easier debugging & development.

Problem Details

In our example there are two tables in HBase: one with raw data and another with processed data.  Let’s call them RawDataTable and ProcessedDataTable. We import data into RawDataTable via simple importing MapReduce job which initally takes data from a log file. Subsequently, another MapReduce job processes data in that table and stores the outcome into ProcessedDataTable. We use HBase Scan and Get operations to access the processed data from the client.

Solution

As stated in javadocs, HBaseTestingUtility is a “facility for testing HBase”. Its description comes with a bit more of explanation: “Create an instance and keep it around doing HBase testing. This class is meant to be your one-stop shop for anything you mind need testing. Manages one cluster at a time only.” In this post we describe one possible way of how to use it to achieve the goals described above.

Processing Data

Step 1: Init cluster.

The following code starts “local” cluster and creates two tables:

private final HBaseTestingUtility testUtil = new HBaseTestingUtility();
private HTable rawDataTable;
private HTable processedDataTable;
…
void initCluster() throws Exception {
  testUtil.getConfiguration().addResource("hbase-site-local.xml");
  testUtil.getConfiguration().reloadConfiguration();
  // start mini hbase cluster
  testUtil.startMiniCluster(1);
  // create tables
  rawDataTable = testUtil.createTable(RAW_TABLE_NAME, RAW_TABLE_COLUMN_FAMILIES);
  processedDataTable = testUtil.createTable(PROCESSED_TABLE_NAME, PROCESSED_TABLE_COLUMN_FAMILIES);
  // start MR cluster
  testUtil.startMiniMapReduceCluster();
}

testUtil.startMiniCluster(1) means start cluster with 1 datanode and 1 regionserver. You can start cluster with greater number of servers for test purposes.

Step 2: Import Data

We use simple map-only job for import data. Please refer to org.apache.hadoop.hbase.mapreduce.ImportTsv class for an example of such a job. The following code runs the job that uses locally stored files (e.g. a part of the log file of reasonable size) on just created cluster:

String[] importJobArgs = new String[] {RAW_TABLE_NAME, "file://" + inputFile};
if (!MyImportJob.createSubmittableJob(testUtil.getConfiguration(), importJobArgs).waitForCompletion(true)) {
  System.exit(1);
}

Step 3: Process Data

To process data in RawDataTable we run an appropriate MapReduce job in the same way as during the import:

if (!ProcessLogsJob.createSubmittableJob(testUtil.getConfiguration(), processLogsJobArgs).waitForCompletion(true)) {
  System.exit(1);
}

Step 4: Persist Processed Data

Since we need processed data during our reporting application development and debugging we persist it in some local file. In order to have “easy” access to this data during debugging it makes sense to store table data in a text file in a readable form (so that we could perform “grep” and other handy commands). So we actually write to two files at once. The Result class implements Writable interface, so there is a natural way to serialize its data.

BufferedWriter bw = ...;
DataOutputStream dos = ...;
ResultScanner rs = processedDataTable.getScanner(new Scan());
Result next = rs.next();
while (next != null) {
  next.write(dos);
  bw.write(getHumanReadableString(next));
  bw.newLine();
  next = rs.next();
}

After this step, the processed data is stored on the local disk and can be used for running the reporting application. Importing and processing of data is performed locally and is thus easier to debug.
In order to add extra processed data incrementally to the already stored data, instead of rewriting it from scratch, we need to load it from the file after cluster initialization as described in the following section.

Fetching Data

In order to make reporting application run on “local” cluster instead of the “true” one, we create an alternative HTable factory. Reporting application code uses a single HTable object instantiated by the factory during its whole lifecycle – this is the best practice for minimizing creation of HTable objects.

Step 1: Init cluster.

This step is exactly the same as described previously.

Step 2: Load processed data.

We use a file created during processing data stage to load the data back into just initialized cluster:

DataInputStream dis = ...;
Result next = new Result();
next.readFields(dis);
while (next.getRow() != null) {
  Put put = new Put(next.getRow());
  for (KeyValue kv : next.raw()) {
    put.add(kv);
  }
  processedDataTable.put(put);
  next = new Result();
  try {
    next.readFields(dos);
  } catch (EOFException e) {
    // file went to an end.
    break;
  }
}

After data is all loaded, the constructed processedDataTable can be used by the reporting application code. The app can now also be started and debugged easily from an IDE.

Next Steps

Internally HBaseTestingUtility makes use of a whole bunch of “mini” clusters: MiniZooKeeperCluster, MiniDFSCluster, MiniHBaseCluster and MiniMRCluster. Refer to the unit-test implementations in the source code of respective projects to get more examples on how to use them.

Thank you for reading, we hope you found this useful.  Follow @sematext on Twitter to be notified of new posts on Hadoop, HBase, Lucene, Solr, Mahout, and other related topics.

Hadoop Digest, July 2010

Strong moves towards the 0.21 Hadoop release “detected”: 0.21 Release Candidate 0 was out and tested. A number of issues were identified and with it the roadmap to the next candidate is set. Tom White has been hard at work and is acting as the release engineer for the 0.21 release.

Community trends and discussions:

  • Hadoop Summit 2010 slides and videos are available here.
  • In case you’re at the design stage of your Hadoop cluster aimed at work with text-based and/or structured data, you should read the “Text files vs SequenceFiles” thread.
  • Thinking of decreasing HDFS replication factor to 2? This thread might be useful to you.
  • Managing workflows of Sqoop, Hive, Pig, and MapReduce jobs with Oozie (Hadoop workflow engine from Yahoo!) is explained in this post.
  • The 2nd edition of “Hadoop: The Definitive Guide” is now in Production.  Again, Tom While in action.

Small FAQ:

  • How do you efficiently process (large) XML documents in Hadoop MapReduce?
    Take a look at Mahout’s XmlInputFormat in case StreamXmlRecordReader doesn’t do a good job for you. The former one got a lot of positive feedback from the community.
  • What are the ways of importing data to HDFS from remote locations? I need this process to be well-managed and automated.
    Here are just some of the options. First you should look at available HDFS shell commands. For large inter/intra-cluster copying distcp might work best for you. For moving data from RDBMS system you should check Sqoop. To automate moving (constantly produced) data from many different locations refer to Flume. You might also want to look at Chukwa (data collection system for monitoring large distributed systems) and Scribe (server for aggregating log data streamed in real time from a large number of servers).

Hey, follow @sematext if you are on Twitter and RT!

Hadoop Digest, June 2010

Hadoop 0.21 release is getting close: a few blocking issues remain in Common, HDFS and MapReduce modules.

Big announcement from Cloudera: CDHv3 and Cloudera Enterprise were released. In CDHv3 beta 2 the following was added:

  • HBase: the popular distributed columnar storage system with fast read-write access to data managed by HDFS.
  • Oozie: Yahoo!’s workflow engine. (op.ed. How many MapReduce workflow engines are there out there?  We know of at least 4-5 of them!)
  • Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.
  • Hue: a graphical user interface to work with CDH. Hue lets developers build attractive, easy-to-use Hadoop applications by providing a desktop-based user interface SDK.
  • Zookeeper: a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Cloudera Enterprise combines the open source CDHv3 platform with critical monitoring, management and administrative tools. It also enables control of access to the data and resources by users and groups (can be integrated with Active Directory and other LDAP implementations). The bad news is that it isn’t going to be free.

Community trends & news:

  • Amazon Elastic MapReduce now supports Hadoop 0.20, Hive 0.5, and Pig 0.6. Please, see the announcement.
  • Chukwa is going to move to the Apache’s Incubator to prepare to become a TLP.
  • Using ‘wget’ to download a file from HDFS is explained here.
  • Yahoo’s back port of security into Hadoop 0.20 is available including a sandbox VM.
  • Those of you who missed a great webinar from Cloudera, “Top ten tips tricks for Hadoop success” can get the slides from here.
  • Twitter intends to open-source Crane: MySQL-to-Hadoop tool.
  • Interesting talk from Jeff Hammerbacher about analytical data platforms. Don’t forget to read this nice passage dedicated to it.

Notable efforts:

Follow @sematext on Twitter.

Hadoop Digest, May 2010

Big news: HBase and Avro have become Apache’s Top Level Projects (TLPs)! The initial discussion happened when our previous Hadoop Digest was published, so you can find links to the threads there. The question of whether to become a TLP or not caused some pretty heated debates in Hadoop subprojects’ communities.  You might find it interesting to read the discussions of the vote results for HBase and Zookeeper. Chris Douglas was kind enough to sum up the Hadoop subprojects’ response to becoming a TLP in his post. We are happy to say that all subprojects which became TLP are still fully searchable via our search-hadoop.com service.

More news:

  • Great! Google granted MapReduce patent license to Hadoop.
  • Chukwa team announced the release of Chukwa 0.4.0, their second public release. This release fixes many bugs, improves documentation, and adds several more collection tools, such as the ability to collect UDP packets.
  • HBase 0.20.4 was released. More info in our May HBase Digest!
  • New Chicago area Hadoop User Group was organized.

Good-to-know nuggets shared by the community:

  • Dedicate a separate partition to Hadoop file space – do not use the “/” (root) partition. Setting dfs.datanode.du.reserved property is not enough to limit the space used by Hadoop, since it limits only HDFS usage, but not MapReduce’s.
  • Cloudera’s Support Team shares some basic hardware recommendations in this post. Read more on proper dedicating & counting RAM for specific parts of the system (and thus avoiding swapping) in this thread.
  • Find a couple of pieces of advice about how to save seconds when you need a job to be completed in tens of seconds or less in this thread.
  • Use Combiners to increase performance when the majority of Map output records have the same key.
  • Useful tips on how to implement Writable can be found in this thread.

Notable efforts:

  • Cascalog: Clojure-based query language for Hadoop inspired by Datalog.
  • pomsets: computational workflow management system for your public and/or private cloud.
  • hiho: a framework for connecting disparate data sources with the Apache Hadoop system, making them interoperable

FAQ

  1. How can I attach external libraries (jars) which my jobs depend on?
    You can put them in a “lib” subdirectory of your jar root directory. Alternatively you can use DistributedCache API.
  2. How to Recommission DataNode(s) in Hadoop?
    Remove the hostname from your dfs.hosts.exclude file and run ‘hadoop dfsadmin -refreshNodes‘. Then start the DataNode process in the ‘recommissioned’ DataNode again.
  3. How to configure log placement under specific directory?
    You can specify the log directory in the environment variable HADOOP_LOG_DIR. It is best to set this variable in bin/hadoop-env.sh.

Thank you for reading us, and if you are a Twitter addict, you can now follow @sematext, too!

HBase Digest, March 2010

We were waiting until the end of the month hoping to include coverage of the new HBase 0.20.4 version, but HBase developers are still working on it. This release will contain a lot of critical fixes and enhancements, so stay tuned.

Typically, our HBase Digest posts consist of three main parts: project status and summary of mailing lists’ most interesting discussions, other projects’ efforts & announcements related to HBase technology, and a FAQ section that aims to save time of the very responsive HBase developers answering the same questions again and again. Please feel free to provide feedback on how you like this coverage format in the post comments.

  • A must-read HBase & HDFS presentation from Todd Lipcon of Cloudera that was a part of “HUG9: March HBase User Group at Mozilla”. Links to other presentations are here. The meeting was followed by a nice discussion on Hadoop (and therefore HBase) reliability with regard to EC2. People shared a lot of useful information about hosting opportunities for one’s HBase setup.
  • Very interesting discussion covers various use-cases of what HBase is a good fit for.
  • Some notes on what settings to adjust when running HBase on a machine with low RAM in this thread.
  • Good questions from HBase evaluating person/team and good answers in this thread. Such discussions periodically appear on mailing lists and given the great responsiveness of HBase committers are very good to read by those who thinking about using HBase or are already using, like we are.
  • The warning we already shared with readers through our Hadoop Digest (March): avoid upgrading your clusters to Sun JVM 1.6.0u18, stick to 1.6.0u16 for a while — this update proved to be very stable.
  • One more explanation of the difference of indexed (IHBase) and transactional (THBase) indices.
  • Deleting the row and putting another one with the same key at the same time (i.e. performing “clean update”) can cause unexpected results if not done properly. There are several solutions to make this process safer currently. In case you face this problem, please share your experience with HBase developers on user mailing list, they will be happy to consider your case when developing solution to the issue in next release.
  • Making column names/keys shorter can result in ~20-30% of RAM savings, and visible storage savings too. Even bigger advantage came with defining the right schema and column families. More advices in this thread.
  • What are the options for connecting to HBase running on EC2 from outside the Amazon cloud using Java library? Thread…

Most notable efforts:

  • Lucehbase: Lucene Index on HBase, ported from Lucandra. Please find more info on this topic in the comments to our super popular Lucandra: A Cassandra-based Lucene backend post.
  • Elephant Bird: Twitter’s library of LZO and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats, Writables, Pig LoadFuncs, HBase miscellanea, etc. The majority of these are in production at Twitter running over rather big data every day.

Small FAQ:

  1. How to back up HBase data?
    You can either do exports at the HBase API level (a la Export class), or you can force flush all your tables and do an HDFS level copy of the /hbase directory (using distcp for example).
  2. Is there a way to perform multiple get (put, delete)?
    There is a work being done on that, please refer to HBASE-1845. The patch is available for 0.20.3 version.

Thank you for reading us and follow us on Twitter – @sematext.

Hadoop Digest, March 2010

Main news first: Hadoop 0.20.2 was released! The list of changes may be found in the release notes here. Related news:

To get the most fresh insight on the 0.21 version release plans, check out this thread and the continuation of it.

More news on releases:

High availability is one of the hottest topics nowadays in Hadoop land. Umbrella HDFS-1064 JIRA issue has been created to track discussions/issues related to HDFS NameNode availability. While there are a lot of questions about eliminating single point of failure, Hadoop developers are more concerned about the minimizing the downtime (including downtime for upgrades, restart time) than getting rid of SPOFs, since high downtime is the real pain for those who manage the cluster. There is some work on adding hot standby that might help with planned upgrades. Please find some thoughts and a bit of explanation on this topic in a thread that started with “Why not to consider Zookeeper for the NameNode?” question. Next time we see “How Hadoop developers feel about SPOF?” come up on the mailing list, we’ll put it in a special FAQ section at the bottom of this digest. :)

We already reported in our latest Lucene Digest (March) about various Lucene projects starting discussions on their mailing lists about becoming Top Level Apache projects. This tendency (motivated by the Apache board’s warnings of Hadoop and Lucene becoming umbrella projects) raised discussions at HBase, Avro, Pig and Zookeeper as well.

Several other notable items from MLs:

  • Important note from Todd Lipcon we’d like to pass to our readers: avoid upgrading your clusters to Sun JVM 1.6.0u18, stick to 1.6.0u16 for a while which proved to be very stable. Please read the complete discussion around it here.
  • Storing Custom Java Objects in Hadoop Distibuted Cache is explained here.
  • Here is a bit of explanation of the fsck command output.
  • Several users shared their experience with issues running Hadoop on a Virtualized O/S vs. the Real O/S in this thread.
  • Those who think about using Hadoop as a base for academic research work (both students and professors) might find a lot of useful links (public datasets, sources for problems, existed researches) in this discussion.
  • Hadoop security features are in high demand among the users and community. Developers will be working hard on deploying authentication mechanisms this summer. You can monitor the progress via HADOOP-4487.

This time a very small FAQ section:

  1. How can I request a larger heap for Map tasks?
    By including -Xmx in mapred.child.java.opts
  2. How to configure and use LZO compression?
    Take a look at http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/.

Thank you for reading us! Please feel free to provide feedback on the format of the digests or anything else, really.

Apache ZooKeeper 3.3.0

Nutch Digest, March 2010

This is the first post in the Nutch Digest series and a little introduction to Nutch seems in order. Nutch is a multi-threaded and, more importantly, a distributed Web crawler with distributed content processing (parsing, filtering), full text indexer and a search runtime. Nutch is at version 1.0 and community is now working towards a 1.1. release. Nutch is a large scale, flexible Web search engine, which includes several types of operations. In this post we’ll present new features and mailing list discussion as we describe each of these operations.

Crawling

Nutch starts crawling from a given “seed list” (a list of seed URLs) and iteratively follows useful/interesting outlinks, thus expanding its link database. When talking about Nutch as a crawler it is important to distinguish between two different approaches: focused or vertical crawling and whole Web or wide crawling. Each approach has a different set-up and issues which need to be addressed.  At Sematext we’ve done both vertical and wide crawling.

When using Nutch at large scale (whole Web crawling and dealing with e.g. billions of URLs), generating a fetchlist (a list of URLs to crawl) from crawlDB (the link database) and updating crawlDB with new URLs tends to take a lot of time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB and then update the crawlDb only once on several segments (set of data generated by a single fetch iteration). Implementation of a Generator that generates several fetchlists at was created in NUTCH-762. Whether this feature will be included in 1.1 release and when will version 1.1 be released, check here.

One more issue related to whole Web crawling which often pops up on the Nutch mailing list is an authenticated websites crawling, so here is wiki page on this subject.

When using Nutch for vertical/focused crawls, one often ends up with a very slow fetch performance at the end of each fetch iteration. An iteration typically starts with high fetch speed, but it drops significantly over time and keeps dropping, and dropping, and dropping. This is known problem.  It is caused by the fetch run having a small number of sites, some of which may have a lot more pages than others, and may be much slower than others.  Crawler politeness, which means it will politely wait before hitting the same domain again, combined with the fact that the number of distinct domains from fetchlist often drops rapidly during one fetch causes fetcher to wait a lot. More on this and overall Fetcher2 performance (which is a default fetcher in Nutch 1.0) you can find NUTCH-721.

To be able to support whole Web crawling, Nutch needs also needs to have scalable data processing mechanism. For this purpose Nutch uses Hadoop’s MapReduce processing and HDFS for storage.

Content storing

Nutch uses Hadoop’s HDFS as a fully distributed storage which creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable and fast computations even on large data volumes. Currently, Nutch is using Hadoop 0.20.1, but will be upgrading to Hadoop 0.20.2. in version 1.1.

In this NUTCH-650 you can find out more about the ongoing effort (and progress) to use HBase as Nutch storage backend. This should simplify Nutch storage and make URL/page processing work more efficient due to the features of HBase (data is mutable and indexed by keys/columns/timestamps).

Content processing

As we noted before, Nutch does a lot of content processing, like parsing (parsing downloaded content to extract text) and filtering (extracting only URLs which match filtering requirements). Nutch is moving away from its own processing tools and delegating content parsing and MimeType detection to Tika by use of Tika plugin, you can read more NUTCH-766.

Indexing and Searching

Nutch uses Lucene for indexing, currently version 2.9.1, but pushing to upgrade Lucene to 3.0.1. Nutch 1.0 can also index directly to Solr, becuse of Nutch-Solr integration. More on this and how using Solr with Nutch worked before Solr-Nutch integration you can find here. This integration now upgrades to Solr 1.4, because Solr 1.4 has a StreamingUpdateSolrServer which simplifies the way docs are buffered before sending to the Solr instance. Another improvement in this integration was a change to SolrIndexer to commit only once after all reducers have finished, NUTCH-799.

Some of patches discussed here and a number of other high quality patches were contributed by Julien Nioche who was added as a Nutch committer in December 2009.

One more thing, keep an eye on an interesting thread about Nutch becoming a Top Level Project (TLP) at Apache.

Thank you for reading, and if you have any questions or comments leave them in comments and we’ll respond promptly!

Follow

Get every new post delivered to your Inbox.

Join 706 other followers