Hadoop Digest, May 2010

Big news: HBase and Avro have become Apache’s Top Level Projects (TLPs)! The initial discussion happened when our previous Hadoop Digest was published, so you can find links to the threads there. The question of whether to become a TLP or not caused some pretty heated debates in Hadoop subprojects’ communities.  You might find it interesting to read the discussions of the vote results for HBase and Zookeeper. Chris Douglas was kind enough to sum up the Hadoop subprojects’ response to becoming a TLP in his post. We are happy to say that all subprojects which became TLP are still fully searchable via our search-hadoop.com service.

More news:

  • Great news: Google granted a MapReduce patent license to Hadoop.
  • Chukwa team announced the release of Chukwa 0.4.0, their second public release. This release fixes many bugs, improves documentation, and adds several more collection tools, such as the ability to collect UDP packets.
  • HBase 0.20.4 was released. More info in our May HBase Digest!
  • A new Chicago-area Hadoop User Group was organized.

Good-to-know nuggets shared by the community:

  • Dedicate a separate partition to Hadoop file space – do not use the “/” (root) partition. Setting the dfs.datanode.du.reserved property is not enough to limit the space used by Hadoop, since it limits only HDFS usage, not MapReduce’s.
  • Cloudera’s Support Team shares some basic hardware recommendations in this post. Read more on properly dedicating and accounting for RAM for specific parts of the system (and thus avoiding swapping) in this thread.
  • This thread has a couple of pieces of advice on shaving off seconds when you need a job to complete in tens of seconds or less.
  • Use Combiners to increase performance when the majority of Map output records have the same key.
  • Useful tips on how to implement Writable can be found in this thread.
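To illustrate the Combiner tip above: a combiner runs reducer-style aggregation locally on each mapper’s output before the shuffle, so when many map records share a key, far fewer records cross the network. Here is a minimal simulation of the idea in plain Python (illustrative only, not Hadoop API code; the word-count example and all names are ours):

```python
from collections import Counter

def map_phase(text):
    """Map step of word count: emit (word, 1) for every token."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Combiner: locally sum counts per key before the shuffle.
    Same logic as the reducer, but applied to one mapper's output only."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

def reduce_phase(all_pairs):
    """Reducer: merge (possibly pre-combined) counts from all mappers."""
    counts = Counter()
    for key, value in all_pairs:
        counts[key] += value
    return dict(counts)

# Two "mappers", each processing one input split with repeated keys.
splits = ["to be or not to be", "be quick to be kind"]
mapped = [map_phase(s) for s in splits]

without_combiner = sum(len(m) for m in mapped)   # 11 records would be shuffled
combined = [combine(m) for m in mapped]
with_combiner = sum(len(c) for c in combined)    # 8 records shuffled (4 per mapper)

result = reduce_phase(pair for c in combined for pair in c)
```

The final counts are identical with or without the combiner, which is exactly why a combiner must be an associative, commutative operation; the win is purely in shuffle volume, and it grows with the number of repeated keys per mapper.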

Notable efforts:

  • Cascalog: Clojure-based query language for Hadoop inspired by Datalog.
  • pomsets: computational workflow management system for your public and/or private cloud.
  • hiho: a framework for connecting disparate data sources with the Apache Hadoop system, making them interoperable.

FAQ

  1. How can I attach external libraries (jars) which my jobs depend on?
    You can put them in a “lib” subdirectory of your jar root directory. Alternatively, you can use the DistributedCache API.
  2. How do I recommission DataNode(s) in Hadoop?
    Remove the hostname from your dfs.hosts.exclude file and run ‘hadoop dfsadmin -refreshNodes’. Then start the DataNode process on the recommissioned node again.
  3. How do I place logs under a specific directory?
    You can specify the log directory in the HADOOP_LOG_DIR environment variable. It is best to set this variable in bin/hadoop-env.sh.

Thank you for reading us, and if you are a Twitter addict, you can now follow @sematext, too!

Mahout Digest, May 2010

As we reported in our previous Digest, Mahout was approved by the Board to become a Top Level Project (TLP), and TLP-related preparations kept Mahout developers busy in May. The Mahout mailing lists (user- and dev-) were moved to their new addresses, and all subscribers were automatically moved to the new lists: user@mahout.apache.org and dev@mahout.apache.org. For other TLP-related changes and their progress, check the list of items to complete the move.

We’ll start a quick walk-through of May’s new features and improvements with an effort to reduce the on-disk size of Mahout’s Vector representation, which yielded an 11% size reduction on a test data set.

A discussion of how UncenteredCosineSimilarity, an implementation of cosine similarity that does not “center” its data, should behave in distributed vs. non-distributed versions, along with a patch for the distributed version, can be found in MAHOUT-389. Furthermore, a distributed implementation of any ItemSimilarity currently available in non-distributed form was committed in MAHOUT-393.

Mahout has utilities to generate Vectors from a directory of text documents, and one improvement in consistency and ease of understanding was made to the tf/tfidf-vectors outputs. When using the bin/mahout seq2sparse command to generate vectors from text (actually, from the sequence file generated from the text), output was created in different directories depending on the weighting method (term frequency or term frequency–inverse document frequency). Now, with the fix from MAHOUT-398, the tf-vectors and tfidf-vectors output directories are at the same level.
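To make the two weighting methods mentioned above concrete, here is a minimal plain-Python sketch of how tf and tf-idf weights differ for the same small corpus. This is illustrative only, not Mahout’s actual code; the idf formula below is the common log(N/df) variant, which may differ in detail from the one seq2sparse uses:

```python
import math
from collections import Counter

docs = [
    "hadoop mapreduce hadoop",
    "mahout clustering",
    "hadoop mahout",
]

tokenized = [d.split() for d in docs]
num_docs = len(tokenized)

# Term frequency: raw count of a term within one document.
tf = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each term appears.
df = Counter()
for tokens in tokenized:
    for term in set(tokens):
        df[term] += 1

def tfidf(doc_index, term):
    """tf-idf: down-weight terms that occur in many documents."""
    return tf[doc_index][term] * math.log(num_docs / df[term])
```

Here “hadoop” appears in 2 of the 3 documents, so its idf is log(3/2) ≈ 0.405, while a term unique to one document gets log(3) ≈ 1.099; tf weights alone would rank them purely by raw counts.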

We’ll end with the Hidden Markov Model and its integration into Mahout. In MAHOUT-396 you’ll find a patch and more information about where and how it is used.

That’s all from Mahout’s world for May, please feel free to leave any questions or comments and don’t forget you can follow @sematext on Twitter.

Solr Digest, May 2010

May’s Solr Digest brings another review of interesting Solr developments and a short look at the current state of Solr’s branches and versions. Confused about which versions to use and which to avoid? Don’t worry, many people are.  We’ll try to clear it up in this Digest.

  • In April’s edition of Solr Digest, we mentioned two JIRA issues dealing with document level security (SOLR-1834 and SOLR-1872). Now another issue (SOLR-1895) deals with LCF security and aims to work with SOLR-1834 to provide a complete solution.
  • One ancient JIRA issue, SOLR-397, finally got resolved and its code is now committed. Solr now has the ability to control how date range end points in facets are dealt with.  You can use this functionality by specifying the facet.date.include HTTP request parameter, which can have values “all”, “lower“, “upper“, “edge“, “between“, “before“, or “after“. More details about this can be found in SOLR-397.
  • Another issue related to date ranges was created.  This one aims to add Date Range QParser to Solr, which would simplify definition of date range filters resulting from date faceting.   This issue is still in its infancy and has no patches attached to it as of this writing, but it looks useful and promising.  When we add date faceting to Search-Lucene.com and Search-Hadoop.com we’ll be looking to use this date range query parser.
  • Some errors in Solr will be much easier to trace after JIRA issue SOLR-1898 gets resolved. Everyone using Solr has probably encountered exceptions like java.lang.NumberFormatException: For input string: “0.0”.  The message itself lacks some crucial details, such as information about the document and field that triggered the exception.  SOLR-1898 will solve that problem, and we are looking forward to this patch!
  • Have you recently been in the situation where you were unsure about which branch or version of Solr you should use on your projects? If yes, you’re certainly not alone! After the recent merge of Solr and Lucene (covered in Solr March Digest and Lucene March Digest), things became confusing, especially for casual observers of Lucene and Solr. Here are some facts about the current state of Solr:
  1. The latest stable release version of Solr is still 1.4.
  2. Version 1.4 was released more than 6 months ago, so many new features, patches and bug fixes aren’t included in it.
  3. However, it was a stable release, so if you’re planning to go to production very soon, one low-risk choice would be to use version 1.4 and apply on top of it whatever patches you find necessary for your deployment.
  4. Current development happens on trunk (considered unstable and slated for a future Solr 4.0 release) and on a branch named branch_3x. This branch is the most likely candidate for the next version of Solr (to be named 3.1) and is considered a (stable) development version which could be usable, though you have to be careful with your testing, as always.
  5. Another choice could be an old 1.5 nightly build, but 1.5 is abandoned and, in our opinion, it makes more sense to use nightly builds from branch_3x.

Here are a couple of threads where you can get more information:

  1. Lucene 3.x branch created
  2. Which Solr to use?
  • To show one of the dangers of unstable versions, we’ll immediately point to one recently opened JIRA issue related to a “file descriptor leak” while indexing.
  • Although at Sematext we’ve been using Java 6 for a very long time, both for our products and with our clients, some people might still be stuck with Java 5. It appears they will not be able to use Solr 4.0 once it is released, since the Solr trunk now requires Java 6 to compile.

We’ll finish this month’s Solr Digest with two new Solr features:

  • For anyone wanting to use the JSON format when sending documents to Solr, a JSON update handler is now committed to trunk.
  • On the other hand, if you need CSV as an output format from Solr, you might benefit from the work on the new CSV Response Writer. Currently, there are no patches attached to it, but you can watch the issue and see when one is added.

Thanks for reading another Solr Digest!  Help us spread the word, please Re-Tweet it, and follow @sematext on Twitter.

Nutch Digest, May 2010

With May almost over, it’s time for our regular monthly Nutch Digest. As you know, Nutch became a top-level project, and the first related changes are already visible. The Nutch site was moved to its new home at nutch.apache.org and the mailing lists (user- and dev-) have been moved to new addresses: user@nutch.apache.org and dev@nutch.apache.org.  Old subscriptions were automatically moved to the new lists, so subscribers don’t have to do anything (apart from changing mail filters, perhaps).  Even though Nutch is not a Lucene sub-project any more, we will continue to index its content (mailing lists, JIRA, Wiki, web site) and keep it searchable over on Search-Lucene.com.  We’ve helped enough organizations with their Nutch clusters that it makes sense for at least us at Sematext to have a handy place to find all things Nutch.

There is a vote on an updated Release Candidate (RC3) for the Apache Nutch 1.1 release.  The major differences between this release and RC2 are several bug fixes (NUTCH-732, NUTCH-815, NUTCH-814, NUTCH-812) and one improvement (NUTCH-816).  Nutch 1.1 is expected shortly!

Among the relevant changes to Nutch development during May, it’s important to note that Nutch will be using Ivy in its builds, and that there is one change regarding Nutch’s language identification: the code for Japanese changed from “jp” to “ja”.  The former is Japan’s country code, while the latter is the language code for Japanese.

There’s been a lot of talk on the Nutch mailing list about the first book on Nutch which Dennis Kubes started writing.  We look forward to reading it, Dennis!

Nutch developers were busy with the TLP-related changes and with preparations for the Nutch 1.1 release this month, so this Digest is a bit thinner than usual.  Follow @sematext on Twitter.

Berlin Buzzwords

June is approaching, and that means Berlin Buzzwords is near.  Here at Sematext we are looking forward to Berlin this year, as we’ll be both at the Cloudera Hadoop training (and getting certified), plus attending Berlin Buzzwords.

With this guest post we are doing our part in helping Berlin Buzzwords organizers spread the word about their conference.

With data storage space getting cheaper and cheaper, the question of how to efficiently analyze data and find information becomes important for a growing number of businesses. Apache Hadoop, Hive, NoSQL, MongoDB, Apache Lucene and CouchDB are just a few terms that spring to mind when thinking of large data. But what are these projects really all about? Who are the developers behind them? What do people build with these software packages? On June 7th and 8th, Berlin Buzzwords, a conference on these exact topics, sets out to bring together users and developers of hot data analysis technologies. There are tracks specific to the three tags: search, store and scale. The conference features a fantastic mixture of developers and users of the open source software projects that make scaling data processing possible today:

  • Christophe Bisciglia from Cloudera is giving an overview of Apache Hadoop from an industry perspective. In addition, there are talks by Steve Loughran, Aaron Kimball and Stefan Groschupf.
  • As for Apache Lucene, the schedule has talks by Grant Ingersoll, Robert Muir and the “Generics Policeman” Uwe Schindler. Michael Busch from Twitter will explain how real-time search can be implemented with Lucene.
  • For those interested in NoSQL databases there is Mathias Stearn from MongoDB, Jan Lehnardt from CouchDB and Eric Evans, the guy who coined the term NoSQL one year ago.

The schedule is available online: http://berlinbuzzwords.de/content/schedule-published

Regular tickets are available online at http://berlinbuzzwords.de/content/tickets . In addition, student tickets are available: with a valid student ID, participants are eligible for one of these tickets, each costing 100 Euro. There are also special group tickets: if you buy four or more tickets you are eligible for a 25% discount, and when purchasing 10 tickets or more the discount is 50%.

Looking forward to seeing you in Berlin.

Solr Performance Monitoring Announcement

Update: Sematext now offers SPM – Scalable Performance Monitoring for Solr (as well as for HBase, OS, JVM, etc.). See our Solr Performance Monitoring with SPM blog post.

We’re happy to announce the partnership between Sematext and New Relic.  Over the last few months we have been using New Relic’s RPM Gold Plan for monitoring performance of our own Solr-based services for searching Lucene and Hadoop ecosystems: http://search-lucene.com/ and http://search-hadoop.com/. We found it valuable for understanding (and fixing!) Solr performance bottlenecks and are going to be offering this service to our new and old customers.

We’ll also set up our Lucene/Solr tech support subscribers with New Relic’s RPM.  This will give our tech support team more visibility into customers’ past Solr performance, letting us spot suspicious trends or errors and quickly understand overall performance trends over time, and thus resolve our customers’ issues much more quickly.

You can get more details about the service, and you can get a free, no-strings-attached 30-day Gold plan trial here.

Note: you will NOT be asked for any credit card information and will not have to pay a thing, nor will you have to cancel anything in order to avoid paying for the service.  Signing up via the above link gives you a free 30-day Gold plan trial.  After 30 days the plan simply reverts to the free Lite plan.  If you need more than 30 days to evaluate the service, please let us know.  The free, no-strings-attached 30-day Gold plan trial sign-up is here.  Enjoy!

Mahout Digest, April 2010

April was a busy month here at Sematext, so we are a few days late with our April Mahout Digest, but better late than never!

The Board has approved Mahout to become a Top Level Project (TLP) at the Apache Software Foundation. Check the status thread on the change of Mahout mailing lists, svn layout, etc. Several discussions on the mailing list, all describing what needs to be done for Mahout to become a Top Level Project, resulted in the Mahout TLP to-do list.

As we reported in the previous Digest, there was a lot of talk on the mailing list about Google Summer of Code (GSoC), and here is a follow-up on this subject. The GSoC announcements are up, and Mahout got 5 projects accepted this year. Check the full list of GSoC projects and the Mahout community’s reactions on the mailing list.

In the past we have reported on the idea of Mahout/Solr integration, and it seems this is now being realized. Read more on the features and implementation progress of this integration here.

At the beginning of April, there was a proposal to make collections releases independent of the rest of Mahout. After a transition period of loosening the coupling between mahout-collections and the rest of Mahout, mahout-collections was extracted as an independent component.  The vote on the first independent release of collections is in progress. Mahout-collections-1.0 differs from the version released with Mahout 0.3 only by the removed dependency on slf4j.

The parts of Mahout that use Lucene have been updated to the latest Lucene release, 3.0.1; the code changes for this migration can be found in this patch.

A question about cosine similarity between two items and how it is implemented in Mahout generated a good discussion. More on cosine similarity between two items, which is implemented as PearsonCorrelationSimilarity (source) in the Mahout code, can be found in MAHOUT-387.
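The point under discussion is easy to see in code: Pearson correlation is just cosine similarity computed over mean-centered vectors, which is why an uncentered cosine similarity warranted its own implementation (the UncenteredCosineSimilarity mentioned in last month’s MAHOUT-389 discussion). A small illustrative sketch in plain Python, not Mahout’s actual code; the item rating vectors are made up:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Plain (uncentered) cosine similarity."""
    return dot(a, b) / (norm(a) * norm(b))

def center(a):
    """Subtract the vector's mean from every component."""
    mean = sum(a) / len(a)
    return [x - mean for x in a]

def pearson(a, b):
    """Pearson correlation == cosine similarity of mean-centered vectors."""
    return cosine(center(a), center(b))

# Hypothetical co-rated preference values for two items.
item_a = [5.0, 3.0, 4.0, 4.0]
item_b = [3.0, 1.0, 2.0, 3.0]
```

On all-positive rating data the two measures can diverge noticeably: here the uncentered cosine is higher than the Pearson value, because centering strips out the shared “everything is rated positively” component.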

The goal of clustering is to determine the grouping in a set of data (e.g. a corpus of items, a (search) result set, etc.).  Often, the problem in clustering implementations is that the data to be clustered has a high number of dimensions, which tends to need reducing (to a smaller number of dimensions) in order to make clustering computationally more feasible (read: faster).  The simplest way of reducing those dimensions is to use some sort of a ‘mapping function’ that takes data presented in n-dimensional space and transforms it to data presented in m-dimensional space, where m is lower than n, hence the reduction. Of course, the key here is to preserve the variance of important data features (dimensions) and flatten out unimportant ones. One simple and interesting approach to clustering is the use of several independent hash functions, where the probability of collision of similar items is higher. This approach, called Minhash based clustering, was proposed back in March as part of GSoC (see the proposal).  You’ll find more on the theory behind it and a Mahout-applicable implementation in MAHOUT-344.
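A rough sketch of the minhash idea (illustrative Python only, not the MAHOUT-344 implementation): for a random hash function, the probability that two sets have the same minimum hash value equals their Jaccard similarity, so comparing signatures built from several independent hash functions gives a cheap similarity estimate, and items whose signatures collide can be grouped into the same cluster bucket.

```python
import random

PRIME = 2_147_483_647  # large prime for the universal hash family

def make_hash_functions(n, seed=42):
    """n independent hash functions of the form h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]

def minhash_signature(item_set, hash_funcs):
    """For each hash function, keep the minimum hash over the set's elements."""
    return [min((a * x + b) % PRIME for x in item_set) for a, b in hash_funcs]

def estimated_jaccard(sig1, sig2):
    """Fraction of hash functions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)

def jaccard(s1, s2):
    """Exact Jaccard similarity, for comparison."""
    return len(s1 & s2) / len(s1 | s2)

hash_funcs = make_hash_functions(200)
a = set(range(0, 60))    # elements 0..59
b = set(range(20, 80))   # elements 20..79; true Jaccard = 40/80 = 0.5
sig_a = minhash_signature(a, hash_funcs)
sig_b = minhash_signature(b, hash_funcs)
```

With 200 hash functions the estimate is typically within a few points of the true Jaccard similarity of 0.5, at a cost that depends only on the signature length rather than on the original dimensionality, which is the dimension reduction described above.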

Those interested in Neural Networks should keep an eye on MAHOUT-383, where Neuroph (a lightweight neural net library) and Mahout will get integrated.

This concludes our short April Mahout Digest.  Once Mahout completes its transition to TLP, we expect the project to flourish even more.  We’ll be sure to report on the progress and new developments later this month.  If you are a Twitter user, you can follow @sematext on Twitter.
