Publishing a February Lucene Digest in March? Nonsense, ha? Blame it on Sematext keeping us busy and on the short month of February. At least we got the Solr Digest and HBase Digest out in time!
- Well, the first thing to know about Lucene now is that Lucene 2.9.2 and 3.0.1 have been released. They contain fixes you’ll want to have, so go grab a new release. While you are at it, you may also be interested in seeing a discussion about Lucene upgrades that emerged after the release announcement was made a few days ago.
- Got an extra 30 minutes? Ever wondered what other Lucene users would like to see in Lucene? If so, check the “If you could have one feature in Lucene…” thread. Note that we are not linking to http://search-lucene.com/ here: our thread handling needs a bit of work, and the fix will be out after the current Sprint is over.
- Guess what? Lucene in Action 2nd ed is in production. What this means is that the LIA2 authors are done working on the manuscript and that Manning Publications people are preparing it for print and distribution. You got your MEAP already, right? LIA2 covers the Lucene 3.* API.
- If Lucene in Action 2nd ed is not enough for you, note that another Lucene book is in the works: Lucene in Practice. Hey, John Wang, Jake & Co., does LIP have a URL? For now, the only URL I have is to the LIP source code: http://code.google.com/p/lucene-book/.
- If you missed the popular Lucandra post, have a look at it now. Lucandra from @tjake looks interesting. Note that we’ll have a talk about Lucandra soon – keep an eye on the NY Search & Discovery Meetup.
- Lucene developers are a super disciplined bunch. Look how well unit-tested Lucene is in the Lucene Clover Report.
- LinkedIn is one of the bigger Lucene users out there, and they’ve been publishing a lot about that. Zoie and Bobo Browse are two projects you’ll see covered in Lucene in Action 2, but here is a LinkedIn Search presentation.
- Search, search, more and more search frameworks are getting built on top of Lucene. Solr is the biggest and the most well known, of course, but certainly not the only one:
- Sensei from LinkedIn
- Elastic Search (watch out for ES coverage on this blog)
- Katta (an “old timer” – in a good sense)
- And speaking of Katta (Lucene, Hadoop MapFiles, or any other content that can be split into shards, served in the cloud), the new release is out:
The key changes of the 0.6 release among dozens of bug fixes:
- upgrade lucene to 3.0
- upgrade zookeeper to 3.2.2
- upgrade hadoop to 0.20.1
- generalize katta for serving shard-able content (lucene is one implementation, hadoop mapfiles another one)
- basic lucene field sort capability
- more robust zookeeper session expiration handling
- throttling of shard deployment (kb/sec configurable) to have a stable search while deploying
- load test facility
- monitoring facility
- alpha version of web-gui
See full list of changes at
http://oss.101tec.com/jira/secure/ReleaseNote.jspa?projectId=10000&styleName=Html&version=10010
- Full-text Search and Spatial Search go hand in hand. Both Lucene and Solr have seen work in the spatial search area and now a new Apache project called Spatial Information Systems (SIS) has been proposed and approved. SIS will enter Apache Software Foundation via the Incubator.
The first HBase Digest post received very good feedback from the community. We continue using HBase at Sematext and thus continue covering the status of HBase project with this post.
- Added Performance Evaluation for IHBase. The PE does a good job of showing what IHBase is good and bad at.
- Transactional contrib stuff is more geared to short-duration transactions, but it should be possible to share transaction states across machines with certain rules in mind. Thread…
- Choosing between Thrift and REST connectors for communicating with HBase outside of Java is explained in this thread.
- How to properly set up Zookeeper for use by HBase (how many instances/resources should be dedicated to it, etc.) is discussed in this thread. You can find some more info in this one, too.
- Yahoo Research has developed a benchmarking tool for “Cloud Serving Systems”. In their paper describing the tool which they intend to open source soon, they compare four “Cloud Serving Systems” and HBase is one of them. Please, also read the explanation from HBase dev team about the numbers inside this paper.
- HBase trunk has been migrated to a Maven build system.
- A new branch was opened for an updated 0.20 version that runs on Hadoop 0.21. This lets 0.20.3 or 0.20.2 clients operate against HBase running on HDFS 0.21 (with durable WAL, etc.) without any change on the client side. Thread…
- Since Hadoop 0.21 isn’t going to be released soon and the HBase team is waiting for critical changes (HDFS-265, HDFS-200, etc.) to be applied to make HBase users’ lives easier, HBase trunk is likely to support both 0.21 and the patched 0.20 versions of Hadoop. There was a discussion about the naming convention for HBase releases in light of this, which also touches on plans for which features to include in the nearest releases.
- Cloudera’s latest release now includes HBase-0.20.3.
- Exploring possible solutions to “write only top N rows from reduce outcome”. Thread…
- A new handy binary comparator was added that only compares up to the length of the supplied byte array.
- These days, HBase developers are working hard on the very sweet “Multi data center replication” feature. It is aimed for 0.21 and will support federated deployment where someone might have terascale (or beyond) clusters in more than one geography and would want the system to handle replication between the clusters/regions.
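To make the prefix-only binary comparator from the list above concrete, here is a minimal plain-Java sketch of the idea. It is not the HBase class itself: the class name, method names, and the sign convention (negative when the supplied prefix sorts before the value) are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a prefix-only comparison; not HBase's own comparator.
final class PrefixCompareSketch {

    /**
     * Compares prefix against only the first prefix.length bytes of value,
     * treating bytes as unsigned. Returns 0 when value starts with prefix.
     */
    static int comparePrefix(byte[] prefix, byte[] value) {
        int n = Math.min(prefix.length, value.length);
        for (int i = 0; i < n; i++) {
            int a = prefix[i] & 0xff;
            int b = value[i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        // A value shorter than the prefix cannot start with it.
        return value.length < prefix.length ? 1 : 0;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Row keys that share the "user42" prefix compare as equal.
        System.out.println(comparePrefix(bytes("user42"), bytes("user42/2010-02-01")) == 0); // true
        System.out.println(comparePrefix(bytes("user42"), bytes("user43/2010-02-01")) < 0);  // true
    }
}
```

The useful property is that everything after the prefix is ignored, so one comparator instance can match a whole family of row keys or values.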
We’d also like to introduce a small FAQ and FA (frequent advice) section to save some time for the HBase dev team, which is very supportive on the mailing lists.
- How to move/copy all data from one HBase cluster to another? If you stop the source cluster then you can distcp the /hbase to the other cluster. Done. A perfect copy.
- Is there a way to get the row count of a table? From the Java API? There is no single-call method to do that, because the actual row count isn’t stored anywhere. It can only be discovered by a table scan or a distributed count (usually a MapReduce job). You can use the “count” command from the HBase shell, which iterates over all records and may take a lot of time to complete.
- I’m editing property X in hbase-default.xml to… You should edit hbase-site.xml, not hbase-default.xml.
- Inserting row keys with an incremental ID is usually not a good idea, since sequential keys send all writes to the same region, which is usually slower than writes spread randomly across the key space. If you can’t find a natural row key (which is good for scans), use a UUID.
- Apply HBASE-2180 patch to increase random read performance in case of multiple concurrent clients.
- How can I perform “select * from tableX where columnY=Z”-like query in HBase? You’ll need to use a Scan along with a SingleColumnValueFilter. But this isn’t quick; it’s like performing a SQL query on a column that isn’t indexed: the more data you have, the longer it will take. There’s no support for secondary indexes in HBase core, so you need to use one of the contribs (2 are available in 0.20.3: src/contrib/indexed and src/contrib/transactional). Another option is maintaining the indexes yourself.
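The last answer is easier to see with a toy model. The sketch below uses plain Java collections in place of HBase (no HBase API is involved, and every name in it is hypothetical): a sorted map plays the table, `scanWhere` plays the filtered full scan, and a hand-maintained map plays the "maintain the indexes yourself" option.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of "filtered scan vs. self-maintained index"; not HBase code.
final class ScanVersusIndexSketch {
    // rowKey -> (column -> value), sorted by row key like an HBase table.
    final TreeMap<String, Map<String, String>> table = new TreeMap<>();
    // Hand-maintained index for one column: value -> row keys.
    final Map<String, List<String>> index = new HashMap<>();
    final String indexedColumn;

    ScanVersusIndexSketch(String indexedColumn) {
        this.indexedColumn = indexedColumn;
    }

    // Every write updates the index too. This sketch ignores overwrites and
    // deletes, which a real index would also have to handle.
    void put(String rowKey, String column, String value) {
        table.computeIfAbsent(rowKey, k -> new HashMap<>()).put(column, value);
        if (column.equals(indexedColumn)) {
            index.computeIfAbsent(value, v -> new ArrayList<>()).add(rowKey);
        }
    }

    // "where column = value" without an index: touches every row.
    List<String> scanWhere(String column, String value) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : table.entrySet()) {
            if (value.equals(row.getValue().get(column))) {
                hits.add(row.getKey());
            }
        }
        return hits;
    }

    // The same question answered from the index: a single lookup.
    List<String> indexedWhere(String value) {
        return index.getOrDefault(value, new ArrayList<>());
    }

    public static void main(String[] args) {
        ScanVersusIndexSketch t = new ScanVersusIndexSketch("city");
        t.put("r1", "city", "NYC");
        t.put("r2", "city", "LA");
        t.put("r3", "city", "NYC");
        System.out.println(t.scanWhere("city", "NYC")); // [r1, r3]
        System.out.println(t.indexedWhere("NYC"));      // [r1, r3]
    }
}
```

Both calls return the same rows; the difference is that the scan's cost grows with the table, while the index trades extra work on every write for a cheap read.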
Some other efforts that you may find interesting:
- Clojure-HBase was introduced. It is a simple library for accessing HBase conveniently from Clojure.
- DataNucleus Access Platform 2.0 now contains plugin for persistence to HBase (HADOOP) datastores.
- HBase Indexing Library aids in building and querying indexes on top of HBase, in Google App Engine datastore-style.
- HBase-dsl is meant to help reduce and simplify your HBase code.
This second installment of Solr Digest (see Solr January Digest) will cover 8 topics, some of which are quite new and some with very long history (and still uncertain future).
So, here we go:
1. solr.ISOLatin1AccentFilterFactory is a commonly used filter which replaces accented characters in the ISO Latin 1 charset with their unaccented versions (for instance, ‘à’ is replaced with ‘a’). However, the underlying Lucene filter ISOLatin1AccentFilter is already deprecated in favor of ASCIIFoldingFilter in Lucene 2.9 (BTW, the Solr 1.4 release uses Lucene 2.9.1, while trunk with the future Solr 1.5 uses Lucene 2.9.2) and has been deleted from Lucene 3.0. Of course, Solr already has a filter factory for the replacement, solr.ASCIIFoldingFilterFactory, so it would probably be wise to start using it in your Solr schemata if you are still using the old ISOLatin1AccentFilter. Functionality-wise, there are no differences between these two filters, except that ASCIIFoldingFilter covers a superset of ISO Latin 1, meaning it converts everything ISOLatin1AccentFilter was converting and some more.
2. DataImportHandler became multithreaded - after being filled with different functionalities, DataImportHandler got a performance boost. Your multicore servers will be happy to try it. As part of JIRA issue SOLR-1352, the patch was created and committed to trunk, so you can expect this functionality in Solr 1.5 release, or you can already try it with one of Solr 1.5 nightly builds.
3. Script based UpdateRequestProcessorFactory – one very interesting feature still in development (JIRA issue SOLR-1725) is adding support for script based UpdateRequestProcessorFactory. It will depend on Java 6 script engine support (so Java 5 based Solr installation will not benefit here, although upgrade to Java 6 is definitely recommended) and be very easy to use. The scripts will have to be placed under SOLR_HOME/conf directory and their names will be defined in solrconfig.xml, like this:
<updateRequestProcessorChain name="script">
  <processor>
    <str name="scripts">updateProcessor.js</str>
    <lst name="params">
      <bool name="boolValue">true</bool>
      <int name="intValue">3</int>
    </lst>
  </processor>
</updateRequestProcessorChain>
Implementations would also be simple; here is an example of updateProcessor.js (copied from the patch which brings this functionality):
function processAdd(cmd) {
  functionMessages.add("processAdd1");
}
function processDelete(cmd) {
  functionMessages.add("processDelete1");
}
function processMergeIndexes(cmd) {
  functionMessages.add("processMergeIndexes1");
}
function processCommit(cmd) {
  functionMessages.add("processCommit1");
}
function processRollback(cmd) {
  functionMessages.add("processRollback1");
}
function finish() {
  functionMessages.add("finish1");
}
4. Similar to SolrJ API for communicating with Solr, there are numerous Solr clients for other languages, especially the dynamic scripting languages. As with all scripting languages, one of the main advantages over using pure Java is simplicity and development speed. You just write a few lines of code and immediately run the script — no need for compiling. At Sematext we find them especially handy when making changes to Solr installations, for quickly testing if Solr behaves as we expect. One excellent solution for all Ruby lovers is RSolr. Coincidentally, RSolr will be covered in the upcoming Solr in Action.
5. Field Collapsing - this is a very frequently needed feature, but one without a satisfactory solution in Solr. This functionality has a long history in Solr: everything started back when Solr was at version 1.3, with issue SOLR-236. It was never committed to svn, so you basically had to pick one of the many patches available in JIRA and apply it to your distribution. Since Solr was constantly developing, patches would pretty quickly become obsolete, so new versions would be created. Even when you found the correct patch for your Solr version, you would get occasional errors, so this surely wasn’t good enough for enterprise customers.
Recently, there have been renewed efforts invested into this issue, and there are plans for this feature to finally be included in Solr 1.5. However, the current implementation still isn’t good enough: some users report OutOfMemory errors, so it seems we’ll have to wait some more to get an enterprise-quality “field collapsing” solution in Solr.
In light of the problems with the SOLR-236 solution, a new JIRA issue, SOLR-1773, was created. The goal of this issue is to provide a “lightweight” implementation of this feature. There is already a patch containing this implementation, along with some measurements which show this approach has potential, but it still isn’t ready for serious deployments. The same approach is also implemented in SOLR-1682.
As you can see, work to provide field collapsing is underway, but we’re still some time away from committed code.
6. SystemStatsRequestHandler – designed to provide statistics from stats.jsp to clients which access Solr with APIs like SolrJ or RSolr, it is being developed as JIRA issue SOLR-1750. It is destined to be included in Solr 1.5 version, but for now it is available as Java class attached to the issue. Before it is committed to svn, it might get another name.
7. While Lucene just saw its 2.9.2 and 3.0.1 versions released, Solr trunk already has the latest Lucene 2.9.*, as you can see described in this thread.
8. We’ve saved the best for last. If you could have one feature in Solr… Check out this informative thread to see what people want from Solr that Solr doesn’t already have. What do you want from Solr? Post your Solr desires in comments.
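A footnote to item 1 above, for readers who want to see what accent folding does to a string. The sketch below is not Lucene's ASCIIFoldingFilter, which uses its own character mapping; it swaps in a different technique, Unicode decomposition via the JDK's java.text.Normalizer, followed by stripping the combining marks. The class and method names are made up for this example.

```java
import java.text.Normalizer;

// Illustration of accent folding via Unicode decomposition (NFD).
// This is NOT how Lucene's ASCIIFoldingFilter is implemented.
final class AccentFoldingSketch {

    static String fold(String input) {
        // NFD splits an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "\u00e0" is 'à', so this prints: a la carte
        System.out.println(fold("\u00e0 la carte"));
        // 'è', 'û', 'é' are folded, so this prints: Creme brulee
        System.out.println(fold("Cr\u00e8me br\u00fbl\u00e9e"));
    }
}
```

One limitation worth knowing: letters such as ‘ø’ or ‘ß’ have no decomposition, so this approach leaves them untouched. That is one reason a table-based filter is the better choice inside an analysis chain.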
We published the HBase Digest last month, but this is our first ever Hadoop Digest, in which we cover the Hadoop HDFS and MapReduce pieces of the Hadoop Ecosystem. Before we get started, let us point out that we recently published a guest post titled Introduction to Cloud MapReduce, which should be interesting to all users of Hadoop, as well as its developers.
As of this writing, there are 34 open issues in JIRA scheduled for the 0.21.0 release, with most of them considered “major” and 4 “critical” or “blockers”. There is quite a lot of work to do before 0.21.0 is out. Hadoop developers are working hard, while at the same time providing tons of very helpful answers & advice on the mailing lists. Please find the summary of the most interesting discussions, along with information on current Hadoop API usage, below.
- After several rejections, the USPTO granted Google a patent for MapReduce. See the community reactions in this thread and in this one.
- What are security mechanisms in HDFS and what should we expect in the near future? Presentation, Design Document, Thread…
- An attempt was made to get Hadoop into the Debian Linux distribution. All relevant links and summary can be found in this thread.
- Consider using LZO compression, which allows a compressed file to be split for Map jobs. GZIP is not splittable.
- Use Python-based scripts to utilize EBS for NameNode and DataNode storage to have persistent, restartable Hadoop clusters running on EC2. Old scripts (in src/contrib/ec2) will be deprecated.
- Do not rely on uniqueness of objects in the “values” parameter when implementing reduce(T key, Iterable<T> values, Context context); the same object instances can be reused. Thread…
- To keep long-running tasks from being aborted, use the Reporter object to provide a “task heartbeat”. If the map task takes longer than 600 seconds (default) to complete an iteration, map/reduce assumes the task is stalled and axes it.
- Setting up DNS lookup properly (caching DNS servers, reverse DNS setup) for a big cluster to avoid DNS requests traffic flood is discussed in this thread.
- Setting a non-default output compression codec programmatically is explained in this thread.
- What are the version compatibility rules for Hadoop distributions? Read the hot discussion here.
- Critical issue HDFS-101 (DFS write pipeline: DFSClient sometimes does not detect second DataNode failure) was reported and fixed (and compatible with DFSClient 0.20.1) and will be included in 0.20.2.
- Text type is meant for use when you have a UTF8-encoded string. Creating a Text object from a byte array that is not proper UTF-8 is likely to result in an exception or data mangling. BytesWritable should be used for this purpose.
- How to make particular “section of code” run only in any one of the mappers? (or how to share some flag state between jobs running on the different machines). Thread…
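The object-reuse warning in the list above is easy to trip over, so here is a small simulation of it. This is plain Java with no Hadoop classes: `IntBox` stands in for a mutable Writable, and `reusingValues` is a made-up iterable that, like the framework may do, hands back the same instance on every step.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Simulation of value-object reuse in reduce(); no Hadoop API involved.
final class ValueReuseSketch {

    // Mutable holder standing in for a Writable such as IntWritable.
    static final class IntBox {
        int value;
        IntBox(int value) { this.value = value; }
    }

    // Hands out the same IntBox on every step, overwriting its contents.
    static Iterable<IntBox> reusingValues(final int... raw) {
        final IntBox shared = new IntBox(0);
        return () -> new Iterator<IntBox>() {
            private int index = 0;
            @Override public boolean hasNext() { return index < raw.length; }
            @Override public IntBox next() {
                if (!hasNext()) { throw new NoSuchElementException(); }
                shared.value = raw[index++];
                return shared;
            }
        };
    }

    // Buggy: keeps references, so every kept element shows the last value.
    static List<Integer> keepReferences(Iterable<IntBox> values) {
        List<IntBox> kept = new ArrayList<>();
        for (IntBox v : values) { kept.add(v); }
        List<Integer> out = new ArrayList<>();
        for (IntBox v : kept) { out.add(v.value); }
        return out;
    }

    // Correct: copies the contents out before the iterator advances.
    static List<Integer> copyValues(Iterable<IntBox> values) {
        List<Integer> out = new ArrayList<>();
        for (IntBox v : values) { out.add(v.value); }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(keepReferences(reusingValues(1, 2, 3))); // [3, 3, 3]
        System.out.println(copyValues(reusingValues(1, 2, 3)));     // [1, 2, 3]
    }
}
```

The fix in a real reducer has the same shape: copy what you need out of each value before asking the iterator for the next one.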
We would also like to add a small FAQ section here to highlight common user questions.
- MR. Is there a way to cancel/kill the job?
Invoke command: hadoop job -kill jobID
- MR. How to get the name of the file that is being used for the map task?
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String sFileName = fileSplit.getPath().getName();
- MR. When the framework splits a file, can some part of a line fall in one split and the other part in some other split?
In general, the file split may break records; it is the responsibility of the record reader to present each record as a whole. If you use the standard available InputFormats, the framework will make sure complete records are presented in <key,value>.
- HDFS. How to view the text content of a SequenceFile?
A SequenceFile is not a text file, so you cannot see its content by invoking the UNIX command cat. Use the hadoop command: hadoop fs -text <src>
- HDFS. How to move a file from one dir to another using the Hadoop API?
Use FileSystem#rename(Path, Path) to move files. The copy methods will leave you with two of the same file.
- Cluster setup. Some of my nodes are in the blacklist, and I want to reuse them again. How can I do that?
Restarting the trackers removes them from the blacklist.
- General. What command should I use to…? How should I use command X?
Please refer to the Commands Guide page.
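The FAQ answer about lines that straddle two splits may be clearer with a sketch. The code below is a simplified plain-Java illustration, not Hadoop's actual record reader. It assumes one possible ownership rule, namely that a split owns every line whose first byte falls inside its byte range, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of reading whole lines from a byte-range split.
final class SplitLineReaderSketch {

    /** Returns the complete lines owned by the byte range [start, end) of data. */
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        // A split that begins mid-line skips ahead: that line belongs to the previous split.
        if (pos > 0 && data[pos - 1] != '\n') {
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // Own every line that starts before 'end', reading past 'end' to finish it.
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at byte 8 falls in the middle of "bravo".
        System.out.println(readSplit(data, 0, 8));  // [alpha, bravo]
        System.out.println(readSplit(data, 8, 20)); // [charlie]
    }
}
```

The first split reads past its end to finish "bravo", and the second split skips the partial line it starts in, so every line is seen exactly once.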
There were also several efforts (be patient, some of them are still somewhat rough) that might be of interest:
- JRuby on Hadoop is a thin wrapper around Hadoop Mapper / Reducer for JRuby, not to be confused with Hadoop Streaming.
- Stream-to-hdfs is a simple utility for streaming stdin to a file in HDFS.
- Crane manages Hadoop cluster using Clojure.
- Piglet is a DSL for writing Pig Latin scripts in Ruby.
Thank you for reading us! We highly appreciate all feedback to our Digests, so tell us what you like or dislike.
Ever since Microsoft announced they are halting all development of FAST for Linux starting now, every single organization involved in search had something to say on this topic. Here are our thoughts on that.
Apparently 80% of FAST users are Linux users. We won’t speculate what is really behind this seemingly crazy decision to turn off the FAST@Linux dollar faucet. Over at Kellblog, the Mark Logic CEO already itemized whose door the current FAST@Linux customers might want to knock on next depending on what they are using FAST for. Unfortunately, he forgot one important angle there, one key option for FAST@Linux users that in today’s day and age should absolutely not be ignored. After every storm comes sunshine. The same is happening here. While being forced to start thinking about changing one’s search solution probably isn’t pleasant, it does open another important door, a big opportunity one should not ignore. The question we would pose to FAST customers is the following: Is there something you’ve always wanted to do with FAST, but never could? Here is your chance to change that! That door that just opened… walk through it, take a look around, you just may like it. Read on.
While FAST@Linux users may be experiencing mental turbulence now, they know there are open-source tools and solutions out there that are more scalable and faster than FAST and don’t involve any crazy per-document, per-query, per-xyz licensing fees. (The part about superior performance is referring to Lucene and solutions built on top of it, like Solr. Several years ago the former C*O of FAST called me up with a search business proposal. One of the bits he mentioned is that his/FAST engineers were testing Lucene and found it faster than their own search components.) More importantly, these tools are completely open and give FAST@Linux customers the opportunity to finally get whatever they could never get from FAST. Sure, there will be some trade-offs (e.g. some of these tools may not have some of the nice GUI pieces FAST has, but that is changing…eh, fast). But the key bit is that these tools are not only free to get and use, they are nearly infinitely flexible. They have their source opened to all the eyeballs of the world. They have open-minded communities (or at least the ones we are involved in at Apache Lucene, Solr, Nutch, etc. are). Development plans are all an open book. Any organization’s engineers are welcome to jump in and either contribute (nearly finished) new components, or collaborate to develop new functionality, or simply explain use cases and request functionality to support them, or even fork the whole project. One would be crazy to choose the last option, of course – it would be a sub-optimal use of the benefits of choosing an open-source solution. Flexibility is one of the key benefits of using open-source. You don’t like how relevance is computed? Plug your own secret sauce! You don’t like how data is stored on disk? Change it to your liking!
If you have to get off FAST@Linux in the coming years, think very, very, VERY hard before you go to another closed enterprise search vendor. By going to another closed enterprise search vendor, what have you achieved? You may be able to do what you never could do with FAST before, but there might be some other functionality you’ll have to give up because the new vendor does not have it, does not plan on having, and that you cannot add yourself because most of the solution is a black box with tiny holes poked on some of its sides, so you can kind of take a pretend peek inside. This would not be an improvement. This would be a missed opportunity!
Here are some of the key benefits of going with open-source search tools and solutions:
- No up-front fees
- No increasing fees due to growth
- Flexibility to change anything and everything
- Large and growing user and development communities
- Security of commercial-grade support
Microsoft’s decision to drop FAST development for Linux is a blessing in disguise. It may not be pleasant right now, but it is actually a good opportunity for FAST customers. Choose wisely now and in the coming years, and avoid being put on the spot again.
Update: after posting this we came across a blog post that refers to a search service switching from FAST to Solr and cutting costs fourfold. That is, the cost of their new Solr-powered search is 25% of the cost of their old FAST-powered search.
The following post is the introduction to Cloud MapReduce (CMR) written by Huan Liu, CMR’s main author and the Research Manager at Accenture Technology Labs.
MapReduce is a programming model (borrowed from functional programming languages) and its associated implementation. It was first proposed by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year the technology was invented, Google’s production index system was converted to MapReduce. Since then, it has quickly proven to be applicable to a wide range of problems. For example, roughly 10,000 MapReduce programs had been written at Google by June 2007, and there were 2,217,000 MapReduce job runs in the month of September 2007. MapReduce has also found wide application outside of the Google environment.
Cloud MapReduce is another implementation of the MapReduce programming model. Back in late 2008, we saw the emergence of a cloud Operating System (OS): a set of software managing a large cloud infrastructure rather than an individual PC. We asked ourselves the following questions: what if we build systems on top of a cloud OS instead of directly on bare metal? Can we dramatically simplify system design? We thought we would try implementing MapReduce as a proof of concept; thus, the Cloud MapReduce project was born. At the time, Amazon was the only one that had a “complete” cloud OS, so we built on top of it. In the course of the project, we encountered a lot of problems working with the Amazon cloud OS, most of which could be attributed to the weaker consistency model it presents. Fortunately, we were able to work through all issues and successfully built MapReduce on top of the Amazon cloud OS. The end result surprised us somewhat. We were not only able to have a simpler design, but we were also able to make it more scalable, more fault tolerant and faster than Hadoop.
Why Cloud MapReduce
There are already several MapReduce implementations, including Hadoop, which is already widely used. So the natural question to ask is: why another implementation? The answer is that all previous implementations essentially copy what Google have described in their original MapReduce paper, but we need to explore alternative implementation approaches for the following reasons:
1) Patent risk. MapReduce is a core technology in Google. By using MapReduce, Google engineers are able to focus on their core algorithms, rather than being bogged down by parallelization details. As a result, MapReduce greatly increases their productivity. It is no surprise that Google would patent such a technology to maintain its competitive advantage. The MapReduce patent covers the Google implementation as described in its paper. Since CMR only implements the programming model, but has a totally different architecture and implementation, it poses a minimal risk w.r.t. the MapReduce patent. This is particularly important for enterprise customers who are concerned about potential risks.
2) Architectural exploration. Google only described one implementation of the MapReduce programming model in its paper. Is it the best one? What are the tradeoffs if using a different one? CMR is the first to explore a completely different architecture. In the following, I will describe what is unique about CMR’s architecture.
Architectural principle and advantages
CMR advocates component decoupling: separate out common components as independent cloud services. If we can separate out a common component as a stand-alone cloud service, the component not only can be leveraged for other systems, but it can also evolve independently. As we have seen in other contexts (e.g., SOA, virtualization), decoupling enables faster innovation.
CMR currently uses existing components offered by Amazon, including Amazon S3, SimpleDB, and SQS. By leveraging the concept of component decoupling, CMR achieves a couple of key advantages.
- A fully distributed architecture. Since each component is a smaller project, it is easier to build it as a fully distributed system. Amazon has done it for all its services (S3, SimpleDB, SQS). Building on what Amazon has done, we are able to build a fully distributed MapReduce implementation with only 3,000 lines of code. A fully distributed architecture has several advantages over a master/slave architecture. First, it is more fault tolerant. Many enterprise customers are not willing to adopt something with a single point of failure, especially for their mission critical data. Second, it is more scalable. In one comparison study, we are able to stress the master node in Hadoop so much that CMR has a 60x performance advantage.
- More efficient data shuffling between Map and Reduce. CMR uses queues as the intermediate point between Map and Reduce, which enables Map to “push” data to Reduce (rather than having Reduce “pull” data from Map). This design is similar to what is used in parallel databases, so it inherits the benefits of efficient data transfer as a result of pipelining. However, unlike parallel databases, CMR uses tagging, filtering and a commit mechanism to maintain the fine-grained fault tolerance offered by other MapReduce implementations. The majority of CMR’s performance gain (aside from the 60x gain, which comes from stressing the master node) comes from this optimization.
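To give a feel for the "push" style of shuffling, here is a toy, single-process sketch. It is not CMR's code and does not use SQS: in-memory queues stand in for the cloud queues, and the way messages are tagged with a map attempt ID and filtered against a set of committed attempts is only one plausible reading of "tagging, filtering and a commit mechanism". All names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeMap;

// Toy word count with a push-based shuffle; an illustration, not CMR itself.
final class PushShuffleSketch {

    // A message pushed by a map task, tagged with the attempt that produced it.
    static final class Message {
        final String attemptId;
        final String word;
        final int count;
        Message(String attemptId, String word, int count) {
            this.attemptId = attemptId;
            this.word = word;
            this.count = count;
        }
    }

    final List<Queue<Message>> queues = new ArrayList<>();
    final Set<String> committed = new HashSet<>();

    PushShuffleSketch(int reducers) {
        for (int i = 0; i < reducers; i++) {
            queues.add(new ArrayDeque<>());
        }
    }

    // Map side: push each word straight into the queue of the reducer that owns it.
    void map(String attemptId, String line) {
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) { continue; }
            int partition = Math.floorMod(word.hashCode(), queues.size());
            queues.get(partition).add(new Message(attemptId, word, 1));
        }
    }

    // Only output from committed attempts is allowed to count.
    void commit(String attemptId) {
        committed.add(attemptId);
    }

    // Reduce side: drain one queue, filtering out uncommitted (e.g. failed) attempts.
    Map<String, Integer> reduce(int partition) {
        Map<String, Integer> totals = new TreeMap<>();
        Message m;
        while ((m = queues.get(partition).poll()) != null) {
            if (committed.contains(m.attemptId)) {
                totals.merge(m.word, m.count, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        PushShuffleSketch job = new PushShuffleSketch(2);
        job.map("map-0-try-1", "a b a"); // this attempt "fails" and is never committed
        job.map("map-0-try-2", "a b a"); // the retry pushes the same data again
        job.commit("map-0-try-2");
        Map<String, Integer> totals = new TreeMap<>();
        for (int p = 0; p < 2; p++) {
            totals.putAll(job.reduce(p));
        }
        System.out.println(totals); // {a=2, b=1}
    }
}
```

The point of the tag-and-commit step is that a retried map task can push its output again without the reducers double counting, which is how a push design can keep task-level fault tolerance.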
CMR is particularly attractive in a cloud environment due to its native integration with the cloud. Hadoop, on the other hand, is designed for a static environment inside an enterprise. If run in a cloud, Hadoop introduces additional overhead. For example, after launching a cluster of virtual machines, Amazon’s Elastic MapReduce (pay-per-use Hadoop) has to configure and set up Hadoop on the cluster, then copy data from S3 to the Hadoop file system before it can start data processing. At the end, it also has to copy results back to S3. All these steps are additional overhead that CMR does not incur: CMR starts processing S3 data directly and right away, so no cluster configuration or data copying is necessary.
CMR’s vision
Although CMR currently only runs on Amazon components, we envision that it will support a wide range of components in the future, including other clouds, such as Microsoft Windows Azure. There are a number of very interesting open source projects already, such as jclouds, libcloud, deltacloud and Dasein, that are building a layer of abstraction on top of various cloud services to hide their differences. These middleware would make it much easier for CMR to support a large number of cloud components.
At the same time, we are also looking at how to build these components and deploy them locally inside an enterprise. Although several products, such as vCloud and Eucalyptus, provide cloud services inside an enterprise, their current version is limited to the compute capability. There are other cloud services, such as storage and queue, that an enterprise has to deploy to provide a full cloud capability to its internal customers. At Accenture Technology Labs, we are helping to address some pieces of the puzzle. For example, we have started a research project to design a fully distributed and scalable queue service, which is similar to SQS in functionality, but exploring a different tradeoff point.
Synergy with Hadoop
Although, on the surface, CMR may seem to compete with Hadoop, there is actually quite a bit of synergy between the two projects. First, they are both moving toward the vision of component decoupling. In the recent 0.20.1 release of Hadoop, the HDFS file system is separated out as an independent component. This makes a lot of sense because HDFS is useful as a stand-alone component for storing large data sets, even if the users are not interested in MapReduce at all. Second, there are lessons to be learned from each project. For example, CMR points the way on how to “push” data from Map to Reduce to streamline data transfer without sacrificing fine-grained fault tolerance. Similarly, Hadoop supports rich data types beyond simple strings, which is something that CMR will for sure inherit in the near future.
Hopefully, by now, I have convinced you that CMR is something that is at least worth a look. In the next post (coming in two weeks), I will follow up with a practical post showing step by step how to write a CMR application and how to run it. The current thinking is that I will demonstrate how to perform certain data analytics with data from search-hadoop.com, but I am open to suggestions. If you have suggestions, I would appreciate if you could post them in comments below.
What is Cassandra?
- It has no single point of failure, and
- Adding nodes to the cluster is as simple as pointing it to any one live node
Cassandra also has built-in multi-master writes, replication, rack awareness, and can handle downed nodes gracefully. Cassandra has a thriving community and is currently being used at companies like Facebook, Digg and Twitter to name a few.
Enter Lucandra
Lucandra is a Cassandra backend for Lucene. Since Cassandra’s original use within Facebook was for search, integrating Lucene with Cassandra seemed like a “no brainer”. Lucene’s core design makes it fairly simple to strip away and plug in custom Analyzer, Writer, Reader, etc. implementations. Rather than trying to build a Lucene Directory interface on top of Cassandra as some backends do (DbDirectory for example), our approach was to implement an IndexReader and IndexWriter directly on top of Cassandra.
Term Key ColumnName Value
"indexName/field/term" => { documentId , positionVector }
Document Key
"indexName/documentId" => { fieldName , value }
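To make the layout above more concrete, here is a minimal in-memory sketch of the two row types. It is not Lucandra's code and does not talk to Cassandra: the class name, the nested maps standing in for rows and columns, and the whitespace tokenization are all assumptions made for illustration. Only the key shapes ("indexName/field/term" and "indexName/documentId") come from the layout shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the two row types; not Lucandra's actual classes.
final class LucandraLayoutSketch {
    // Row key "indexName/field/term" -> column name (documentId) -> value (term positions).
    final Map<String, Map<String, List<Integer>>> termRows = new TreeMap<>();
    // Row key "indexName/documentId" -> column name (fieldName) -> value (field value).
    final Map<String, Map<String, String>> documentRows = new TreeMap<>();

    static String termKey(String index, String field, String term) {
        return index + "/" + field + "/" + term;
    }

    static String documentKey(String index, String docId) {
        return index + "/" + docId;
    }

    // Stores the field value under the document row and one column per term occurrence.
    // Tokenization here is a plain lowercase whitespace split (an assumption, not an Analyzer).
    void addField(String index, String docId, String field, String value) {
        documentRows.computeIfAbsent(documentKey(index, docId), k -> new TreeMap<>())
                    .put(field, value);
        String[] tokens = value.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) {
                continue;
            }
            termRows.computeIfAbsent(termKey(index, field, tokens[pos]), k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }
}
```

With this shape, finding the documents that contain a term is a single row lookup by term key, and loading a document's fields is a single row lookup by document key.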
Lucandra in Action
Using Lucandra is extremely simple and switching a regular Lucene search application to use Lucandra is a matter of just several lines of code. Let’s have a look.
First, we need to create the connection to Cassandra:
import lucandra.CassandraUtils;
import lucandra.IndexReader;
import lucandra.IndexWriter;
...
Cassandra.Client client = CassandraUtils.createConnection();
Next, we create Lucandra’s IndexWriter and IndexReader, and Lucene’s own IndexSearcher.
IndexWriter indexWriter = new IndexWriter("bookmarks", client);
IndexReader indexReader = new IndexReader("bookmarks", client);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
From here on, you work with IndexWriter and IndexSearcher just like you would in vanilla Lucene. Look at the BookmarksDemo for the complete class.
What’s next? Solandra!
Last month we published the Lucene Digest, followed by the Solr Digest, and wrapped the month with the HBase Digest (just before Yahoo posted their report showing Cassandra beating HBase in their benchmarks!). We are starting February with a fresh Mahout Digest.
When covering Mahout, it seems logical to group topics following Mahout’s own groups of core algorithms. Thus, we’ll follow that grouping in this post, too:
- Recommendation Engine (Taste)
- Clustering
- Classification
There are, of course, some common concepts, some overlap, like n-grams. Let’s talk n-grams for a bit.
N-grams
There has been a lot of talk about n-gram usage across all of the major subject areas on the Mahout mailing lists. This makes sense, since n-gram-based language models are used in various areas of statistical Natural Language Processing. An n-gram is a subsequence of n items from a given sequence of “items”. The “items” in question can be anything, though most commonly n-grams are made up of character or word/token sequences. Lucene’s n-gram support, provided through NGramTokenizer, tokenizes an input String into character n-grams and can be useful when building character n-gram models from text. When there is an existing Lucene TokenStream and a character n-gram model is needed, NGramTokenFilter can be applied to the TokenStream. Word n-grams are sometimes referred to as “shingles”. Lucene helps there, too. When word n-gram statistics or a word n-gram model is needed, ShingleFilter or ShingleMatrixFilter can be used.
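To make the difference between character n-grams and shingles concrete, here is a small plain-Java sketch. It does not use Lucene's NGramTokenizer or ShingleFilter; the class and method names are made up for illustration, and the whitespace tokenization is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene or Mahout.
final class NGramSketch {
    /** Returns all character n-grams of length n, in order of appearance. */
    static List<String> charNGrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** Returns all word n-grams ("shingles") of length n, joined by a single space. */
    static List<String> shingles(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+"); // naive whitespace tokenization
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("lucene", 3));
        System.out.println(shingles("please divide this sentence", 2));
    }
}
```

In a real indexing pipeline the Lucene classes mentioned above would do this work inside the analysis chain; the sketch only shows what the resulting units look like.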
Classification
The use of character n-grams in the context of classification and, more specifically, the possibility of applying Naive Bayes to character n-grams instead of word/term n-grams is discussed here. Since a Naive Bayes classifier is a probabilistic classifier that can work with features of any type, there is no reason it could not be applied to character n-grams, too. Using a character n-gram model instead of a word model in text classification could result in more accurate classification of shorter texts. Our language identifier is a good example of a classifier (though it doesn’t use Mahout) and it provides good results even on short texts. Try it.
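As a rough illustration of the idea, the following is a minimal multinomial Naive Bayes classifier whose features are character n-grams. It is a sketch under stated assumptions, not Mahout's classifier and not our language identifier: the class name is hypothetical, smoothing is plain add-one (Laplace), and text is only lowercased before n-grams are extracted.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy classifier; assumes add-one smoothing and lowercase-only normalization.
final class CharNGramNaiveBayes {
    private final int n;
    private final Map<String, Map<String, Integer>> gramCounts = new HashMap<>(); // label -> n-gram -> count
    private final Map<String, Integer> totalGrams = new HashMap<>();              // label -> n-grams seen
    private final Map<String, Integer> docCounts = new HashMap<>();               // label -> training docs
    private final Set<String> vocabulary = new HashSet<>();                       // distinct n-grams overall
    private int totalDocs = 0;

    CharNGramNaiveBayes(int n) {
        this.n = n;
    }

    void train(String label, String text) {
        Map<String, Integer> counts = gramCounts.computeIfAbsent(label, k -> new HashMap<>());
        String s = text.toLowerCase();
        for (int i = 0; i + n <= s.length(); i++) {
            String gram = s.substring(i, i + n);
            counts.merge(gram, 1, Integer::sum);
            totalGrams.merge(label, 1, Integer::sum);
            vocabulary.add(gram);
        }
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
    }

    /** Returns the label with the highest log-posterior, or null if nothing was trained. */
    String classify(String text) {
        String s = text.toLowerCase();
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : gramCounts.keySet()) {
            // log prior: fraction of training documents carrying this label
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            Map<String, Integer> counts = gramCounts.get(label);
            double denominator = totalGrams.getOrDefault(label, 0) + vocabulary.size();
            for (int i = 0; i + n <= s.length(); i++) {
                int count = counts.getOrDefault(s.substring(i, i + n), 0);
                score += Math.log((count + 1) / denominator); // add-one smoothed likelihood
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

Because the features are overlapping character sequences rather than whole words, even a two-word input yields several features to score, which is one intuition for why this style of model can cope with short texts.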
Clustering
Non-trivial word n-grams (aka shingles) extracted from a document can be useful for document clustering. Similar to the usage of a document’s term vectors, this thread proposes using non-trivial word n-grams as a foundation for clustering. For extracting word n-grams or shingles from a document, Lucene’s ShingleAnalyzerWrapper is suggested. ShingleAnalyzerWrapper wraps the previously mentioned ShingleFilter around another Analyzer. Since clustering (grouping similar items) is an example of an unsupervised type of machine learning, it is always interesting to validate clustering results. In clustering there is no reference training data or, more importantly, reference test data, so evaluating how well a clustering algorithm works is not a trivial task. Although good clustering results are intuitive and often easy to evaluate visually, it is hard to implement an automated test. Here is an older thread about validating Mahout’s clustering output, which resulted in an open JIRA issue.
Recommendation Engine
There is an interesting thread about content-based recommendation: what content-based recommendation really is and how it should be defined. So far, Mahout has only a Collaborative Filtering-based recommendation engine called Taste. Two different approaches are presented in that thread. One approach treats content-based recommendation as a Collaborative Filtering problem or a generalized Machine Learning problem, where item similarity is based on Collaborative Filtering applied to an item’s attributes or a ‘related’ user’s attributes (usual Collaborative Filtering treats an item as a black box). The other approach is to treat content-based recommendation as a “generalized search engine” problem. Here, matching the same or similar queries makes two items similar. Just think of queries as being composed of, say, keywords extracted from a user’s reading or search history and this will start making sense. If items have enough textual content, then content-based analysis (similar items are those that have similar term vectors) seems like a good approach for implementing content-based recommendation. This is actually nothing novel (people have been (ab)using Lucene, Solr, and other search engines as “recommendation engines” for a while), but content-based recommendation is a recently discussed topic of possible Mahout expansion. All algorithms in Mahout tend to run on top of Hadoop as MapReduce jobs, but in the current release Taste does not have a MapReduce version. You can read more about the MapReduce Collaborative Filtering implementation in Mahout’s trunk. If you are in need of a working recommendation engine (that is, a whole application built on top of the core recommendation engine libraries), have a look at Sematext’s recommendation engine.
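The “similar items are those that have similar term vectors” idea can be sketched with raw term frequencies and cosine similarity. This is only an illustration: it is not how Taste or Sematext’s recommendation engine works, the class and method names are hypothetical, and a real system would more likely use weighted vectors (for example TF-IDF) taken from a Lucene index.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; assumes raw term frequencies and a simple non-word-character tokenizer.
final class TermVectorSimilarity {
    /** Builds a term-frequency vector from lowercased tokens. */
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine of the angle between two term vectors; 0.0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

To recommend items similar to one a user has read, one would compute the cosine between that item’s vector and every candidate’s vector and return the highest-scoring candidates.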
In addition to Mahout’s basic machine learning algorithms, there are discussions and development in directions that don’t fall under any of the above categories, such as collocation extraction. Phrase extractors often use a word n-gram model for co-occurrence frequency counts. Check the thread about collocations, which resulted in a JIRA issue and the first implementation. Also, you can find more details on how the log-likelihood ratio can be used in the context of collocation extraction in this thread.
Of course, anyone interested in Mahout should definitely read Mahout in Action (we’ve got ourselves a MEAP copy recently) and keep an eye on the features planned for the upcoming 0.3 release.
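For readers who have not met the log-likelihood ratio before, here is a small sketch of Dunning’s statistic for a 2x2 contingency table of word co-occurrence counts, written in its entropy form. The class and method names are hypothetical and this is not presented as the implementation from the thread or the JIRA issue. The assumed inputs are: k11 = times A and B occur together, k12 = A without B, k21 = B without A, k22 = neither.

```java
// Hypothetical helper illustrating Dunning's log-likelihood ratio (G^2) for a 2x2 table.
final class LogLikelihoodSketch {
    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized Shannon entropy: N*log(N) - sum(x*log(x)) over the given counts. */
    private static double entropy(long... counts) {
        long sum = 0;
        double sumXLogX = 0.0;
        for (long c : counts) {
            sumXLogX += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - sumXLogX;
    }

    /**
     * Returns 2 * (H(rows) + H(columns) - H(table)), which equals
     * 2 * N * (mutual information between the two events).
     */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Guard against tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }
}
```

A score near zero means the two words co-occur about as often as chance would predict; large scores flag candidate collocations, which is why the statistic is handy for ranking word n-grams.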
Here at Sematext we are making more and more use of the Hadoop family of projects. We are expanding our digest post series with this HBase Digest and adding it to the existing Lucene and Solr coverage. For those of you who want to stay up to date with the HBase community and benefit from the knowledge-packed discussions that happen there, but can’t follow all those high-volume Hadoop mailing lists, we also include brief mailing list coverage.
- HBase 0.20.3 has just been released. Nice way to end the month. It includes a huge number of bug fixes, fixes for EC2-related issues, and a good number of improvements. HBase 0.20.3 uses the latest 3.2.2 version of ZooKeeper. We should also note that another distributed and column-oriented database from Apache was released a few days ago, too – Cassandra 0.5.0.
- An alternative indexed HBase implementation (HBASE-2037) was reported as completed (and included in 0.20.3). It speeds up scans by adding indexes to regions rather than secondary tables.
- HBql was announced this month, too. It is an abstraction layer for HBase that introduces an SQL dialect for HBase and JDBC-like bindings, i.e. a more familiar API for HBase users. Thread…
- Ways of integrating with HBase (instantiating HTable) on the client side: Template for HBase Data Access (for integration with the Spring framework), simple Java Beans mapping for HBase, HTablePool class.
- HbaseExplorer – an open-source web application that helps with simple HBase data administration and monitoring.
- There was a discussion about the possibilities of splitting the process of importing very large data volumes into HBase into separate steps. Thread…
- To get any parallelization, you have to start multiple JVMs in the current Hadoop version. Thread…
- Tips for increasing the HBase write speed: use random int keys to distribute the load between RegionServers; use a multi-process client instead of a multi-threaded client; set a higher heap size in conf/hbase-env.sh, giving it as much as you can without swapping; consider LZO compression to hold the same amount of data in fewer regions per server. Thread…
- Some advice on hardware configuration for the case of managing 40-50K records/sec write speed. Thread…
- Secondary index can go out of sync with the base table in case of I/O exceptions during commit (when using transactional contrib). Handling such exceptions in transactional layer should be revised. Thread…
- Configuration instance (as well as an instance of HBaseConfiguration) is not thread-safe, so do not change it when sharing between threads. Thread…
- What is the minimal number of boxes for an HBase deployment? The thread covers both HA and non-HA options, which deployments can share the same boxes, etc. Thread…
- Optimizing random reads: using client-side multi-threading will not improve reads greatly according to some tests, but there is an open JIRA issue HBASE-1845 related to batch operations. Thread…
- Exploring possibilities for server-side data filtering. The thread discusses the classpath requirements for that and the options for hot-deploying filters. Thread…
- How-to: Configure table to keep only one version of data. Thread…
- Recipes and hints for scanning more than one table in Map. Thread…
- Managing timestamps for Put and Delete operations to avoid unwanted overlap between them. Thread…
- The amount of replication should have no effect on the performance for reads using either scanner or random-access. Thread…
Did you really make it this far down?
If you are into search, see January 2010 Digests for Lucene and Solr, too.
Please share your thoughts about the Digest posts as they come. We would really like to know whether they are valuable. Please tell us what format and style you prefer and, really, any other ideas you have would be very welcome.
This coming February, the NY Search & Discovery meetup will be hosting Dr. Pablo Duboue from IBM Research and his talk about UIMA.
The following is an “excerpt from the blurb” about Pablo’s talk:
“… In this talk, I will briefly present UIMA basics before discussing full UIMA systems I have been involved in the past (including our Expert Search system in TREC Enterprise Track 2007). I will be talking about how UIMA supported the construction of our custom NLP tools. I will also sketch the new characteristics of the UIMA Asynchronous Scaleout (UIMA AS) subproject that enable UIMA to run Analysis Engines in thousands of machines… “
The talk/presentation will be on February 24, 2010. Sign up (free) at http://www.meetup.com/NYC-Search-and-Discovery/calendar/12384559/ .
