Hive Digest, March 2011

Welcome to the first Hive digest!

Hive is a data warehouse built on top of Hadoop. Initially developed at Facebook, it has been under the Apache umbrella for about two years and has seen very active development. Last year there were two major releases that introduced loads of features and bug fixes, and now Hive 0.7.0 has just been released, packed with goodness.

Hive 0.6.0

Hive 0.6.0 was released in October last year. Some of its most interesting features included

Hive 0.7.0

Hive 0.7.0 has just been released! Some of the major features include:

  • Indexing has been implemented; index types are currently limited to compact indexes. This feature opens up lots of potential for future improvements, such as HIVE-1694, which aims to use indexes to accelerate query execution for GROUP BY, ORDER BY, JOINs, and other cases, and HIVE-1803, which will implement bitmap indexing.
  • Security features have been added, including authorization and authentication.
  • There is now an optional concurrency model that makes use of ZooKeeper, so tables can be locked during writes. It is disabled by default, but can be enabled by setting hive.support.concurrency=true in the config.

And many other small improvements including:

  • Databases have been made more useful: you can now select across databases.
  • The Hive command line interface has gotten some love and now supports auto-complete.
  • There’s now support for HAVING clauses, so users no longer have to write nested queries in order to filter on GROUP BY expressions.

and much more.

You can download Hive 0.7.0 from here and you can follow @sematext on Twitter.

Hadoop Digest, August 2010

The biggest announcement of the year: Apache Hadoop 0.21.0 has been released and is available for download here. Over 1300 issues have been addressed since 0.20.2; you can find details for Common, HDFS and MapReduce. A note from Tom White, who did an excellent job as release manager: “Please note that this release has not undergone testing at scale and should not be considered stable or suitable for production. It is being classified as a minor release, which means that it should be API compatible with 0.20.2.” Please find a detailed description of what’s new in the 0.21.0 release here.

Community trends & news:

  • New branch hadoop-0.20-security is being created. Apart from the security features, which are in high demand, it will include improvements and fixes from over 12 months of work by Yahoo!. The new security features are going to be a very valuable and welcome contribution (also discussed before).
  • A thorough discussion about approaches to backing up HDFS data can be found in this thread.
  • Hive voted to become Top Level Apache Project (TLP) (also here).  Note that we’ll keep Hive under Search-Hadoop.com even after Hive goes TLP.
  • Pig voted to become TLP too (also here).  Note that we’ll keep Pig under Search-Hadoop.com even after Pig goes TLP.
  • Tip: if you define a Hadoop object (e.g. a Partitioner) as implementing Configurable, its setConf() method will be called once, right after it gets instantiated (a minimal sketch follows this list).
  • For those new to ZooKeeper and pressed for time, here you can find the shortest ZooKeeper description — only 4 sentences short!
  • A good read: the “Avoiding Common Hadoop Administration Issues” article.
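
To illustrate the Configurable tip above, here is a minimal sketch of a partitioner that picks up a run-time parameter from the job configuration. The class name ModuloPartitioner and the property my.partition.modulus are made up for this example; Hadoop calls setConf() because the object implements Configurable.

  import org.apache.hadoop.conf.Configurable;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  public class ModuloPartitioner extends Partitioner<Text, IntWritable>
      implements Configurable {

    private Configuration conf;
    private int modulus = 1;

    @Override
    public void setConf(Configuration conf) {
      // Called by Hadoop right after the partitioner is instantiated via reflection.
      this.conf = conf;
      this.modulus = conf.getInt("my.partition.modulus", 1);
    }

    @Override
    public Configuration getConf() {
      return conf;
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      // Use the run-time parameter when assigning partitions (capped at numPartitions).
      int buckets = Math.min(modulus, numPartitions);
      return (key.hashCode() & Integer.MAX_VALUE) % buckets;
    }
  }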

Notable efforts:

  • Howl: Common metadata layer for Hadoop’s Map Reduce, Pig, and Hive (yet another contribution from Yahoo!)
  • PHP library for Avro, includes schema parsing, Avro data file and string IO.
  • avro-scala-compiler-plugin: aims to auto-generate Avro-serializable classes based on simple case class definitions

FAQ:

  • How to programmatically determine the names of the files in a particular Hadoop/HDFS directory?
    Use the FileSystem & FileStatus APIs. Detailed examples are in this thread; a minimal sketch is also shown after this FAQ.
  • How to restrict HDFS space usage?
    Please, refer to HDFS Quotas Guide.
  • How to pass parameters determined at run-time (i.e. not hard-coded) to Hadoop objects (like Partitioner, Writable, etc.)?
    One option is to define a Hadoop object as implementing Configurable. In that case its setConf() method will be called once, right after it gets instantiated, and you can use the “native” Hadoop configuration to pass the parameters you need.
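
For the first question above, a minimal sketch of listing a directory with the FileSystem and FileStatus APIs might look like this (the class name ListHdfsDir is made up; the directory path is taken from the command line):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ListHdfsDir {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();     // picks up core-site.xml etc. from the classpath
      FileSystem fs = FileSystem.get(conf);         // the configured (e.g. HDFS) file system
      FileStatus[] statuses = fs.listStatus(new Path(args[0]));  // directory to list
      for (FileStatus status : statuses) {
        System.out.println(status.getPath().getName());
      }
    }
  }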

HBase Digest, August 2010

The second “developer release”, hbase-0.89.201007d, is now available for download. To remind everyone, there are currently two active branches of HBase:

  • 0.20 – the current stable release series, being maintained with patches for bug fixes only.
  • 0.89 – a development release series with active feature and stability development, not currently recommended for production use.

The first one doesn’t support HDFS durability (edits may be lost in the case of node failure), whereas the second one does. You can find more information on this wiki page.  The HBase 0.90 release may happen in October!  See info from the developers.

Community trends & news:

  • New HBase AMIs are available for dev release and 0.20.6.
  • Looking for some GUI that could be used for browsing through tables in HBase? Check out Toad for Cloud, watch for HBase-Explorer and HBase-GUI-Admin.
  • How many regions can a RegionServer support, and what are the consequences of having lots of regions in a RegionServer? Check the info in this thread.
  • Some more complaints to be aware of regarding HBase performance on EC2 in this thread. For those who missed it, more on Hadoop & HBase reliability with regard to EC2 is in our March digest post.
  • Need guidance in sizing your first Hadoop/HBase cluster? This article will be helpful.

FAQ:

  • Where can I find information about data model design with regard to HBase?
    Take a look at http://wiki.apache.org/hadoop/HBase/HBasePresentations.
  • How can I perform SQL-like query “SELECT … FROM …” on HBase?
First, consider that HBase is a key-value store and should be treated accordingly. But if you are still up for writing ad-hoc queries in your particular situation, take a look at the Hive & HBase integration.
  • How can I access Hadoop & HBase metrics?
    Refer to HBase Metrics documentation.
  • How to connect to HBase from a Java app running on a machine remote to the cluster?
    Check out client package documentation. Alternatively, one can use the REST interface: Stargate.

HBase Digest, July 2010

Big news first: HBase 0.20.6 is out and available for download. It fixes only 8 issues (including 2 blockers), but some of them might be significant in particular cases (like scan recovery in case of region server failure). You can find the release notes here. Message from the HBase dev team: “we recommend that all users, particularly those running 0.20.4, upgrade”.

A very sweet piece of functionality is under active development right now (and looks like it’s nearly complete).  This new functionality makes it possible to take HBase table snapshots: HBASE-50. It might be extremely useful in production. The design plan and implementation are looking so good that “committers should read it as they might learn something”.

Community news & trends:

  • Summary notes of the HBase meetup (#11) at Facebook give a comprehensive overview of development activities and what’s in upcoming releases. Slides are available here.
  • Welcome the official HBase Blog.
  • It is strongly recommended to use HDFS patched with HDFS-630 with HBase. To save time, you can use Cloudera’s distribution: HDFS-630 is included in 0.20.1+169.89 (the latest CDH2, i.e. CDH2u1) as well as both betas of CDH3.
  • From the operations standpoint, is setting up an HBase cluster and maintaining it a fairly complex task? Can a single person manage it? People are sharing their experiences in this thread.
  • It makes sense to disable the WAL (e.g. via Put.setWriteToWAL(false)) during a one-time large import into HBase. Of course this makes the import unreliable, but that can be acceptable for a one-off import given the resulting speedup (a minimal sketch follows this list).
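
As a rough illustration of the tip above, a one-time import might look something like the following sketch. The table name “bulk_import”, the column family “d”, and the sample rows are made up for this example.

  import java.util.Arrays;
  import java.util.List;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class OneTimeImport {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(new HBaseConfiguration(), "bulk_import");
      table.setAutoFlush(false);                 // also batch puts on the client side

      List<String> rows = Arrays.asList("row1", "row2", "row3");
      for (String row : rows) {
        Put put = new Put(Bytes.toBytes(row));
        put.setWriteToWAL(false);                // skip the write-ahead log: faster, but edits
                                                 // are lost if a region server dies mid-import
        put.add(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes("v-" + row));
        table.put(put);
      }
      table.flushCommits();                      // push puts still buffered on the client
      table.close();
    }
  }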

Notable efforts:

  • HBase RowLog Library: a component to build WALs and queues backed by HBase.
  • Lily is out: meet the Proof of Architecture release. Lily is a cloud-scalable, NoSQL-based content store and search repository built on top of Apache HBase and SOLR.

FAQ:

  • Why are recently added/modified records not present in the results of scan and get operations? How can one make them available?
    Check the autoFlush option: autoFlush=false causes the client to accumulate puts without sending them to the server. Gets and scans only talk to the server and thus ignore the client write cache. You can either set autoFlush to true or call HTable.flushCommits() before reading data (a minimal sketch follows).
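
Here is a minimal sketch of the write-buffer behaviour described above; the table name “demo” and the column names are made up for this example.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class WriteBufferExample {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(new HBaseConfiguration(), "demo");
      table.setAutoFlush(false);   // puts are buffered client-side, not sent immediately

      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("hello"));
      table.put(put);              // still sitting in the client write buffer

      table.flushCommits();        // push buffered puts to the region servers
      // (with autoFlush=true this call would not be needed)

      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("f"), Bytes.toBytes("q"))));
      table.close();
    }
  }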

Are you on Twitter?  You can follow @sematext on Twitter.

Hadoop Digest, July 2010

Strong moves towards the 0.21 Hadoop release “detected”: 0.21 Release Candidate 0 was out and tested. A number of issues were identified, and with that the roadmap to the next candidate is set. Tom White has been hard at work, acting as the release engineer for the 0.21 release.

Community trends and discussions:

  • Hadoop Summit 2010 slides and videos are available here.
  • In case you’re at the design stage of a Hadoop cluster aimed at working with text-based and/or structured data, you should read the “Text files vs SequenceFiles” thread.
  • Thinking of decreasing HDFS replication factor to 2? This thread might be useful to you.
  • Managing workflows of Sqoop, Hive, Pig, and MapReduce jobs with Oozie (Hadoop workflow engine from Yahoo!) is explained in this post.
  • The 2nd edition of “Hadoop: The Definitive Guide” is now in production.  Again, Tom White in action.

Small FAQ:

  • How do you efficiently process (large) XML documents in Hadoop MapReduce?
Take a look at Mahout’s XmlInputFormat in case StreamXmlRecordReader doesn’t do a good job for you. The former has gotten a lot of positive feedback from the community.
  • What are the ways of importing data to HDFS from remote locations? I need this process to be well-managed and automated.
Here are just some of the options. First, you should look at the available HDFS shell commands. For large inter-/intra-cluster copying, distcp might work best for you. For moving data from an RDBMS, check out Sqoop. To automate moving (constantly produced) data from many different locations, refer to Flume. You might also want to look at Chukwa (a data collection system for monitoring large distributed systems) and Scribe (a server for aggregating log data streamed in real time from a large number of servers).

Hey, follow @sematext if you are on Twitter and RT!

Hadoop Digest, June 2010

Hadoop 0.21 release is getting close: a few blocking issues remain in Common, HDFS and MapReduce modules.

Big announcement from Cloudera: CDHv3 and Cloudera Enterprise were released. In CDHv3 beta 2 the following was added:

  • HBase: the popular distributed columnar storage system with fast read-write access to data managed by HDFS.
  • Oozie: Yahoo!’s workflow engine. (op.ed. How many MapReduce workflow engines are there out there?  We know of at least 4-5 of them!)
  • Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.
  • Hue: a graphical user interface to work with CDH. Hue lets developers build attractive, easy-to-use Hadoop applications by providing a desktop-based user interface SDK.
  • Zookeeper: a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Cloudera Enterprise combines the open source CDHv3 platform with critical monitoring, management and administrative tools. It also enables control of access to the data and resources by users and groups (can be integrated with Active Directory and other LDAP implementations). The bad news is that it isn’t going to be free.

Community trends & news:

  • Amazon Elastic MapReduce now supports Hadoop 0.20, Hive 0.5, and Pig 0.6. Please, see the announcement.
  • Chukwa is going to move to the Apache Incubator to prepare to become a TLP.
  • Using ‘wget’ to download a file from HDFS is explained here.
  • Yahoo!’s backport of security into Hadoop 0.20 is available, including a sandbox VM.
  • Those of you who missed a great webinar from Cloudera, “Top ten tips tricks for Hadoop success” can get the slides from here.
  • Twitter intends to open-source Crane: MySQL-to-Hadoop tool.
  • Interesting talk from Jeff Hammerbacher about analytical data platforms. Don’t forget to read this nice passage dedicated to it.

Notable efforts:

Follow @sematext on Twitter.

HBase Digest, June 2010

HBase 0.20.5 is out! It fixes 24 issues since the 0.20.4 release. HBase developers “recommend that all users, particularly those running 0.20.4, upgrade to this release”.

Community trends:

  • There’s a clear need for a “sanity check DNS across my cluster” tool, as a lot of questions/help requests related to name/address resolution in the cluster are submitted over time. Any volunteers?
  • The bulk incremental load into an existing table feature (HBASE-1923) is committed to trunk. Still no multi-family support, though.
  • A good amount of advice about increasing write performance/speed can be found in this thread, including numbers/techniques shared from a large production cluster.
  • A set of ORM tools to consider for HBase are suggested here.

Notable efforts:

FAQ:

  • Common issue: tables/data disappear after a system restart. Usually people face it when playing with HBase for the first time, even on a single-node set-up. The problem is that by default HDFS is configured to store its data in the /tmp dir, which might get cleaned up by the OS. Configure the “dfs.name.dir” and “dfs.data.dir” properties in hdfs-site.xml to avoid these problems.

HBase Digest, May 2010

Big news first:

  • HBase 0.20.4 is out! This release includes critical fixes as well as general and performance improvements. HBase 0.20.4 EC2 AMIs are now available in all regions; the latest launch scripts can be found here.
  • HBase has become Apache’s Top Level Project. Congratulations!

Good to know things shared by community:

  • HBase got a code review board. Feel free to join!
  • The guarantees for each operation in HBase with regard to ACID properties are stated here.
  • Writing a filter that compares values in different columns is explained in this thread.
  • It is OK to mix transactional IndexTable and regular HTables in the same cluster. One can access tables w/out the transactional semantics/overhead as normal, even when running a TransactionalRegionServer. More in this thread.
  • Gets and scans now never return partially updated rows (as of 0.20.4 release).
  • Try to avoid building code on top of lockRow/unlockRow, because this can lead to serious delays in system operation and even deadlock. Thread…
  • Read about how HBase performs load-balancing in this thread.
  • Thinking about using HBase with alternative (to HDFS) file system? Then this thread is a must-read for you.

Notable efforts:

  • HBase Indexing Library aids in building and querying indexes on top of HBase, in Google App Engine datastore-style. The library is complementary to the tableindexed contrib module of HBase.
  • HBasene is a scalable information retrieval engine, compatible with the Lucene library while using HBase as the store for the underlying TF-IDF representation.  This is much like Lucandra, which uses Lucene on top of Cassandra.  We will be covering HBasene in the near future here on Sematext Blog.

FAQ:

  1. Is there an easy way to remove/unset/clean a few columns in a column family for an HBase table?
    You can either delete an entire family or delete all the versions of a single family/qualifier. There is no ‘wild card’ deletion or other pattern matching; deleting a whole column family is the closest thing to it (see the sketch after this FAQ).
  2. How to unsubscribe from user mailing list?
    Send mail to user-unsubscribe@hbase.apache.org.
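
For the first question, a minimal sketch using the client-side Delete API might look like this; the table, family, and qualifier names are made up for the example.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Delete;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  public class DeleteColumnsExample {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(new HBaseConfiguration(), "demo");

      Delete delete = new Delete(Bytes.toBytes("row1"));
      delete.deleteColumns(Bytes.toBytes("f"), Bytes.toBytes("a")); // all versions of f:a
      delete.deleteColumns(Bytes.toBytes("f"), Bytes.toBytes("b")); // all versions of f:b
      // delete.deleteFamily(Bytes.toBytes("f"));                   // or drop the whole family

      table.delete(delete);
      table.close();
    }
  }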

Hadoop Digest, May 2010

Big news: HBase and Avro have become Apache’s Top Level Projects (TLPs)! The initial discussion happened when our previous Hadoop Digest was published, so you can find links to the threads there. The question of whether to become a TLP or not caused some pretty heated debates in Hadoop subprojects’ communities.  You might find it interesting to read the discussions of the vote results for HBase and Zookeeper. Chris Douglas was kind enough to sum up the Hadoop subprojects’ response to becoming a TLP in his post. We are happy to say that all subprojects which became TLP are still fully searchable via our search-hadoop.com service.

More news:

  • Great! Google granted MapReduce patent license to Hadoop.
  • Chukwa team announced the release of Chukwa 0.4.0, their second public release. This release fixes many bugs, improves documentation, and adds several more collection tools, such as the ability to collect UDP packets.
  • HBase 0.20.4 was released. More info in our May HBase Digest!
  • New Chicago area Hadoop User Group was organized.

Good-to-know nuggets shared by the community:

  • Dedicate a separate partition to Hadoop file space – do not use the “/” (root) partition. Setting dfs.datanode.du.reserved property is not enough to limit the space used by Hadoop, since it limits only HDFS usage, but not MapReduce’s.
  • Cloudera’s Support Team shares some basic hardware recommendations in this post. Read more on proper dedicating & counting RAM for specific parts of the system (and thus avoiding swapping) in this thread.
  • Find a couple of pieces of advice about how to save seconds when you need a job to be completed in tens of seconds or less in this thread.
  • Use Combiners to increase performance when the majority of Map output records have the same key.
  • Useful tips on how to implement Writable can be found in this thread (a minimal sketch follows this list).
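
As a rough illustration of implementing Writable, here is a minimal sketch of a custom value type; the class NameCount and its fields are made up for this example. (Types used as map output keys additionally need to implement WritableComparable.)

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;

  // Hypothetical value type: a (name, count) pair.
  public class NameCount implements Writable {
    private Text name = new Text();
    private long count;

    @Override
    public void write(DataOutput out) throws IOException {
      // Serialize the fields in a fixed order...
      name.write(out);
      out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      // ...and deserialize them in exactly the same order.
      name.readFields(in);
      count = in.readLong();
    }

    public void set(String n, long c) {
      name.set(n);
      count = c;
    }

    @Override
    public String toString() {
      return name + "\t" + count;
    }
  }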

Notable efforts:

  • Cascalog: Clojure-based query language for Hadoop inspired by Datalog.
  • pomsets: computational workflow management system for your public and/or private cloud.
  • hiho: a framework for connecting disparate data sources with the Apache Hadoop system, making them interoperable.

FAQ

  1. How can I attach external libraries (jars) which my jobs depend on?
    You can put them in a “lib” subdirectory at the root of your job jar. Alternatively, you can use the DistributedCache API (a minimal sketch follows this FAQ).
  2. How to recommission DataNode(s) in Hadoop?
    Remove the hostname from your dfs.hosts.exclude file and run ‘hadoop dfsadmin -refreshNodes‘. Then start the DataNode process on the ‘recommissioned’ DataNode again.
  3. How to configure log placement under specific directory?
    You can specify the log directory in the environment variable HADOOP_LOG_DIR. It is best to set this variable in bin/hadoop-env.sh.
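
For the first question, a minimal sketch of the DistributedCache route might look like this; the jar path /libs/my-dependency.jar is made up, and the jar must already be sitting in HDFS.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;

  public class JobWithExtraJars {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Add the jar to the task classpath before the Job copies the configuration.
      DistributedCache.addFileToClassPath(new Path("/libs/my-dependency.jar"), conf);

      Job job = new Job(conf, "job-with-extra-jars");
      // ... set mapper/reducer classes, input/output formats and paths, etc. ...
      job.waitForCompletion(true);
    }
  }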

Thank you for reading us, and if you are a Twitter addict, you can now follow @sematext, too!

Mahout Digest, May 2010

As we reported in our previous Digest, Mahout was approved by the board to become Top Level Project (TLP) and TLP-related preparations kept Mahout developers busy in May. Mahout mailing lists (user- and dev-) were moved to their new addresses and all subscribers were automatically moved to the new lists: user@mahout.apache.org and dev@mahout.apache.org. Regarding other TLP-related changes and their progress, check the list of items to complete the move.

We’ll start a quick walk-through of May’s new features and improvements with an effort to reduce Mahout’s Vector representation size on disk, which resulted in an 11% smaller size on a test data set.

A discussion of how UncenteredCosineSimilarity, an implementation of the cosine similarity which does not “center” its data, should behave in the distributed vs. non-distributed version, along with a patch for the distributed version, can be found in MAHOUT-389. Furthermore, a distributed implementation of any ItemSimilarity currently available in a non-distributed form was committed in MAHOUT-393.

Mahout has utilities to generate Vectors from a directory of text documents, and one improvement in terms of consistency and ease of understanding was made to the tf/tfidf-vectors outputs. When using the bin/mahout seq2sparse command to generate vectors from text (actually, from the sequence file generated from the text), output was created in different directories depending on the weighting method (term frequency or term frequency–inverse document frequency). Now, with the fix from MAHOUT-398, the tf-vectors and tfidf-vectors output directories are at the same level.

We’ll end with Hidden Markov Model and its integration into Mahout. In MAHOUT-396 you’ll find a patch and more information regarding where and how it is used.

That’s all from Mahout’s world for May, please feel free to leave any questions or comments and don’t forget you can follow @sematext on Twitter.
