Mahout Digest, October 2010

We’ve been very busy here at Sematext, so we haven’t covered Mahout during the last few months.  We are pleased with what’s been keeping us busy, but are not happy about our irregular Mahout Digests.  We had covered the last (0.3) release with all of its features and we are not going to miss covering very important milestone for Mahout: release 0.4 is out! In this digest we’ll summarize the most important changes in Mahout from the last digest and add some perspective.

Before we dive into Mahout, please note that we are looking for people with Machine Learning skills and Mahout experience (as well as good Lucene/Solr search people).  See our Hiring Search and Data Analytics Engineers post.

This Mahout release brings overall changes regarding model refactoring and command line interface to Mahout aimed at improving integration and consistency (easier access to Mahout operations via the command line). The command line interface is pretty much standardized for working with all the various options now, which makes it easier to run and use. Interfaces are better and more consistent across algorithms and there have been many small fixes, improvements, refactorings, and clean-ups. Details on what’s included can be found in the release notes and download is available from the Apache Mirrors.

Now let’s add some context to various changes and new features.

GSoC projects

Mahout completed its Google Summer of Code  projects and two completed successfully:

  • EigenCuts spectral clustering implementation on Map-Reduce for Apache Mahout (addresses issue MAHOUT-328), proposal and implementation details can be found in MAHOUT-363
  • Hidden Markov Models based sequence classification (proposal for a summer-term university project), proposal and implementation details in  MAHOUT-396

Two projects did not complete due to lack of student participation and one remains in progress.

Clustering

The biggest addition in clustering department are EigenCuts clustering algorithm (project from GSoC) and MinHash based clustering which we covered as one of possible GSoC suggestions in one of previous digests . MinHash clustering was implemented, but not as a GSoC project. In the first digest from the Mahout series we covered problems related to evaluation of clustering results (unsupervised learning issue), so big addition to Mahout’s clustering are Cluster Evaluation Tools featuring new ClusterEvaluator (uses Mahout In Action code for inter-cluster density and similar code for intra-cluster density over a set of representative points, not the entire clustered data set) and CDbwEvaluator which offers new ways to evaluate clustering effectiveness.

Logistic Regression

Online learning capabilities such as Stochastic Gradient Descent (SGD) algorithm implementation are now part of Mahout. Logistic regression is a model used for prediction of the probability of occurrence of an event. It makes use of several predictor variables that may be either numerical or categories. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person’s age, sex and body mass index. Logistic regression is used extensively in the medical and social sciences as well as marketing applications such as prediction of a customer’s propensity to purchase a product or cease a subscription. The Mahout implementation uses Stochastic Gradient Descent (SGD), check more on initial request and development in MAHOUT-228. New sequential logistic regression training framework supports feature vector encoding framework for high speed vectorization without a pre-built dictionary. You can find more details on Mahout’s logistic regression wiki page.

Math

There has been a lot of cleanup done in the math module (you can check details in Cleanup Math discussion on ML), lot’s of it related to an untested Colt framework integration (and deprecated code in Colt framework). The discussion resulted in several pieces of Colt framework getting promoted to a tested status (QRdecomposition, in particular)

Classification

In addition to speedups and bug fixes, main new features in classification are new classifiers (new classification algorithms) and more open/uniformed input data formats (vectors). Most important changes are:

  • New SGD classifier
  • Experimental new type of Naive bayes classifier (using vectors) and feature reduction options for existing Naive bayes classifier (variable length coding of vectors)
  • New VectorModelClassifier allows any set of clusters to be used for classification (clustering as input for classification)
  • Now random forest can be saved and used to classify new data. Read more on how to build a random forest and how to use it to classify new cases on this dedicated wiki page.

Recommendation Engine

The most important changes in this area are related to distributed similarity computations which can be used in Collaborative Filtering (or other areas like clustering, for example). Implementation of Map-Reduce job, based on algorithm suggested in Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce, which computes item-item similarities for item-based Collaborative Filtering can be found in MAHOUT-362. Generalization of algorithm based on the mailing list discussion led to an implementation of  Map-Reduce job which computes pairwise similarities of the rows of a matrix using a customizable similarity measure (with implementations already provided for Cooccurrence, Euclidean Distance, Loglikelihood, Pearson Correlation, Tanimoto coefficient, Cosine). More on distributed version of any item similarity function (which was available in a non-distributed implementation before) can be found in MAHOUT-393. With pairwise similarity computation defined, RecommenderJob has been evolved to a fully distributed item-based recommender (implementation depends on how the pairwise similarities are computed). You can read more on distributed item-based recommender in MAHOUT-420.

Implementation of distributed operations on very large matrices are very important for a scalable machine learning library which supports large data sets. For example, when term vector is built from textual document/content, terms vectors tend to have high dimension. Now,  if we consider a term-document matrix where each row represents terms from document(s), while a column represents a document we obviously end up with high dimensional matrix. Same/similar thing occurs in Collaborative Filtering: it uses a user-item matrix containing ratings for matrix values, row corresponds to a user and each column represents an item. Again we have large dimension matrix that is sparse.

Now, in both cases (term-document matrix and user-item matrix) we are dealing with high matrix dimensionality which needs to be reduced, but most of information needs to be preserved (in best way possible). Obviously we need to have some sort of matrix operation which will provide lower dimension matrix with important information preserved. For example, large dimensional matrix may be approximated to lower dimensions using Singular Value Decomposition (SVD).

It’s obvious that we need some (java) matrix framework capable of fundamental matrix decompositions. JAMA is a great example of widely used linear algebra package for matrix operations, capable of SVD and other fundamental matrix decompositions (WEKA for example uses JAMA for matrix operations). Operations on highly dimensional matrices always require heavy computation and this requirements produces high HW requirements on any ML production system. This is where Mahout, which features distributed operations on large matrices, should be the production choice for Machine Learning algorithms over frameworks like JAMA, which although great, can not distribute its operations.

In typical recommendation setup users often ‘have’ (used/interacted with) only a few items from the whole item set (item set can be very large) which leads to user-item matrices being sparse matrices. Mahout’s (0.4) distributed Lanczos SVD implementation is particularly useful for finding decompositions of very large sparse matrices.

News and Roadmap

All of the new distributed similarity/recommender implementations we analyzed in previous paragraph were contributed by Sebastian Schelter and as a recognition for this important work he was elected as a new Mahout committer.

The book “Mahout in Action”, published by Manning, has reached 15/16 chapters complete and will soon enter final review.

This is all from us for now.  Any comments/questions/suggestions are more than welcome and until next Mahout digest keep an eye on Mahout’s road map for 0.5 or discussion about what is Mahout missing to become production stabile (1.0) framework.  We’ll see you next month – @sematext.

Mahout Digest, May 2010

As we reported in our previous Digest, Mahout was approved by the board to become Top Level Project (TLP) and TLP-related preparations kept Mahout developers busy in May. Mahout mailing lists (user- and dev-) were moved to their new addresses and all subscribers were automatically moved to the new lists: user@mahout.apache.org and dev@mahout.apache.org. Regarding other TLP-related changes and their progress, check the list of items to complete the move.

We’ll start a quick walk-through of May’s new features and improvements with an effort on reducing Mahout’s Vector representation size on disk, resulting in improvement of 11% lower size on a test data set.

Discussion on how UncenteredCosineSimilarity , an implementation of the cosine similarity which does not “center” its data, should behave in distributed vs. non-distributed version and a patch for distributed version can be found in MAHOUT-389. Furthermore, an implementation of distributed version of any ItemSimilarity currently available in a non-distributed form was committed in MAHOUT-393.

Mahout has utilities to generate Vectors from a directory of text documents and one improvement in terms of consistency and ease of understanding was made on tf/tfidf-vectors outputs. When using bin/mahout seq2sparse command to generate vectors from text (actually, from the sequence file generated from text), depending on the weighting method (term frequency or term frequency–inverse document frequency), output was created in different directories. Now with the fix from MAHOUT-398, tf-vectors and tfidf-vectors output directories at the same level.

We’ll end with Hidden Markov Model and its integration into Mahout. In MAHOUT-396 you’ll find a patch and more information regarding where and how it is used.

That’s all from Mahout’s world for May, please feel free to leave any questions or comments and don’t forget you can follow @sematext on Twitter.

Nutch Digest, May 2010

With May being almost over, it’s time for our regular monthly Nutch Digest. As you know, Nutch became a top-level project and first related changes are already visible/applied. The Nutch site was moved to its new home at nutch.apache.org and the mailing lists (user- and dev-) have been moved to new addresses: user@nutch.apache.org and dev@nutch.apache.org.  Old subscriptions were automatically moved to the new lists, so subscribers don’t have to do anything (apart from changing mail filters, perhaps).  Even though Nutch is not a Lucene sub-project any more, we will continue to index its content (mailing lists, JIRA, Wiki, web site) and searchable over on Search-Lucene.com.  We’ve helped enough organizations with their Nutch clusters that it makes sense for at least us at Sematext to have a handy place to find all things Nutch.

There is a vote on updated Release Candidate (RC3) for the Apache Nutch 1.1 release.  The major differences between this release and RC2 are several bug fixes: NUTCH-732, NUTCH-815, NUTCH-814, NUTCH-812 and one improvement in NUTCH-816.  Nutch 1.1 is expected shortly!

From relevant changes related to Nuch develompent during May it’s important to note than Nutch will be using Ivy in Nutch builds and that there is one change regarding Nutch’s language identification: code for Japanese changed from “jp” to “ja”.  The former is Japan’s country code and the latter is the language code for Japanese.

There’s been a lot of talk on the Nutch mailing list about the first book on Nutch which Dennis Kubes started writing.  We look forward to reading it, Dennis!

Nutch developers were busy with the TLP-related changes and with preparations for the Nutch 1.1 release this month, so this Digest is a bit thinner than usual.  Follow @sematext on Twitter.

Mahout Digest, April 2010

April was a busy month here at Sematext, so we are a few days late with our April Mahout Digest, but better late than never!

The Board has approved Mahout to become Top Level Project (TLP) at the Apache Software Foundation. Check the status thread on change of Mahout mailing lists, svn layout, etc. Several discussions on mailing list, all describing what needs to be done for Mahout to become Top Level Project, resulted in Mahout TLP to-do list

As we reported in previous digest there was a lot of talk on mailing list about Google Summer of Code (GSoC) and here is a follow up on this subject. GSoC announcements are up and Mahout got 5 projects accepted this year. Check the full list of GSoC projects and Mahout’s community reactions on the mailing list.

In the past we have reported about the idea of Mahout/Solr integration, and it seems this is now getting realized. Read more on features and implementation progress of this integration here.

At the beginning of April, there was a proposal to make collections releases independent of the rest of Mahout. After some transition period of loosening the coupling between mahout-collections and the rest of Mahout, mahout-collections were extracted as an independent component.  The vote on the first independent release of collections is in progress. Mahout-collections-1.0 differs from the version released with mahout 0.3 only by removed dependency on slf4j.

Mahout parts that use Lucene are updated to use the latest release of Lucene 3.0.1 and code changes for this migration can be found in this patch.

There was a question that generated a good discussion about cosine similarity between two items and how it is implemented in Mahout. More on cosine similarity between two items which is implemented as PearsonCorrelationSimilarity (source) in Mahout code, can be found in MAHOUT-387.

The goal of clustering is to determine the grouping in a set of data (e.g. a corpus of items or a (search) result set or …).  Often, the problem in clustering implementations is that the data to be clustered has a high number of dimensions, which tend to need reducing (to a smaller number of dimensions) in order to make clustering computationally more feasible (read: faster).  The simplest way of reducing those dimensions is to use some sort of a ‘mapping function’ that takes data presented in n-dimensional space and transforms it to data presented in m-dimensional space, where m is lower than n, hence the reduction. Of course, the key here is to preserve variance of important data features (dimensions) and flatten out unimportant ones. One simple and interesting approach to clustering is the use of several independent hash functions where the probability of collision of similar items is higher. This approach, called Minhash based clustering, was proposed back in March as part of GSoC (see the proposal).  You’ll find more on theory behind it and Mahout applicable implementation in MAHOUT-344.

Those interested in Neural Networks should keep an eye on MAHOUT-383, where Neuroph (a lightweight neural net library) and Mahout will get integrated.

This concludes our short April Mahout Digest.  Once Mahout completes its transition to TLP, we expect the project to flourish even more.  We’ll be sure to report on the progress and new developments later this month.  If you are a Twitter user, you can follow @sematext on Twitter.

Nutch Digest, April 2010

In the first part of this Nutch Digest we’ll go through new and useful features of the upcoming Nutch 1.1 release, while in the second part we’ll focus on developments and plans for next big Nutch milestone, Nutch 2.0. But, let’s start with few informational items.

  • Nutch has been approved by the ASF  board to become Top Level Project (TLP) in the Apache Software Foundation.  The changing of Nutch mailing lists, URL, etc. will start soon.

Nutch 1.1 will be officially released any day now and here is a Nutch 1.1 release features walk through:

  • Nutch release 1.1 uses Tika 0.7 for parsing and MimeType detection
  • Hadoop 0.20.2 is used for job distribution (Map/Reduce) and distributed file system (HDFS)
  • On the indexing and search side, Nutch 1.1 uses either Lucene 3.0.1.with its own search application or Solr 1.4
  • Some of the new features included in release 1.1 were discussed in previous Nutch Digest. For example, alternative generator which can generate several segments in one parse of the crawlDB is included in release 1.1. We used a flavour of this patch in our most recent Nutch engagement that involved super-duper vertical crawl.  Also, improvement of SOLRIndexer, which now commits only once when all reducers have finished, is included in Nutch 1.1.
  • Some of the new and very useful features were not mentioned before. For example, Fetcher2 (now renamed to just Fetcher) was changed to implement Hadoop’s Tool interface. With this change it is possible to override parameters from configuration files, like nutch-site.xml or hadoop-site.xml, on the command line.
  • If you’ve done some focused or vertical crawling you probably know that one or few unresponsive host(s) can slow down entire fetch, so one very useful feature added to Nutch 1.1 is the ability to skip queues (which can be translated to hosts) for URLS getting repeated exceptions.  We made good use of that here at Sematext,  in the Nutch project we just completed in April 2010.
  • Another improvement included in 1.1 release related to Nutch-Solr integration comes in a form of improved Solr schema that allows field mapping from Nutch to Solr index.
  • One useful addition to Nutch’s injector is new functionality which allows user to inject metadata into the CrawlDB. Sometimes you need additional data, related to each URL, to be stored. Such external knowledge can later be used (e.g. indexed) by a custom plug-in. If we can all agree that storing arbitrary data in CrawlDb (with URL as a primary key) can be very useful, then migration to database oriented storage (like HBase) is only a logical step.  This makes a good segue to the second part of this Digest…

In the second half of this Digest we’ll focus on the future of Nutch, starting with Nutch 2.0.  Plans and ideas for the next Nutch release can be found on mailing list under Nutch 2.0 roadmap and on the official wiki page.

Nutch is slowly replacing some of its home-grown functionality with best of breed products — it uses Tika for parsing, Solr for indexing/searching and HBase for storing various types of data.  Migration to Tika is already included in Nutch 1.1. release and exclusive use of Solr as (enterprise) search engine makes sense — for months we have been telling clients and friends we predict Nutch will deprecate its own Lucene-based search web application in favour of Solr, and that time has finally come.  Solr offers much more functionality, configurability, performance and ease of integration than Nutch’s simple search web application.  We are happy Solr users ourselves – we use it to power search-lucene.com.

Storing data in HBase instead of directly in HDFS has all of the usual benefits of storing data in database instead of a files system.  Structured (fetched and parsed) data is not split into segments (in file system directories), so data can be accessed easily and time consuming segment merges can be avoided, among other things.  As a matter of fact, we are about to engage in a project that involves this exact functionality: the marriage of Nutch and HBase.  Naturally, we are hoping we can contribute this work back to Nutch, possibly through NUTCH-650.

Of course, when you add a persistence layer to an application there is always a question if whether it is acceptable for it to be tied to one back-end (database) or whether it is better to have an ORM layer on top of the datastore. Such an ORM layer would be an additional layer which would allow different backends to be used to store data.  And guess what? Such an ORM, initially focused on HBase and Nutch, and then on Cassandra and other column-oriented databases is in the works already!  Check the evaluation of ORM frameworks which support non-relational column-oriented datastores and RDBMs and development of an ORM framework that, while initially using Nutch as the guinea pig, already lives its own decoupled life over at http://github.com/enis/gora.

That’s all from us on Nutch’s present and future for this month, stay tuned for more Nutch news, next month! And of course, as usual, feel free to leave any comments or questions – we appreciate any and all feedback.  You can also follow @sematext on Twitter.

Mahout Digest, March 2010

In this Mahout Digest we’ll summarize what went on in the Mahout world since our last post back in February.

There has been some talk on the mailing list about Mahout becoming a top level project (TLP). Indeed, the decision to go TLP has been made (see Lucene March Digest to find out about other Lucene subprojects going for TLP) and this will probably happen soon, now that Mahout 0.3 is released. Check the discussion thread on Mahout as TLP and follow the discussion on what the PMC will look like. Also, Sean Owen is nominated as Mahout PMC Chair.  There’s been some logo tweaking.

There has been a lot of talk on Mahout mailing list about Google Summer Of Code and project ideas related to Mahout. Check the full list of Mahout’s GSOC project ideas or take on the invitation to write up your GSOC idea for Mahout!

Since Sematext is such a big Solr shop, we find the proposal to integrate Mahout clustering or classification with Solr quite interesting.  Check more details in MAHOUT-343 JIRA issue.  One example of classification integrated with Solr or actually, classifier as a Solr component, is Sematext’s Multilingual Indexer.  Among other things, Multilingual Indexer uses our Language Identifier to classify documents and individual fields based on language.

When talking about classification we should point out a few more interesting developments. There is an interesting thread on implementation of new classification algorithms and overall classifier architecture. In the same context of classifier architecture, there is a MAHOUT-286 JIRA issue on how (and when)  Mahout’s Bayes classifier will be able to support classification of non-text (non-binary) data like numeric features. If you are interested in using decision forests to classify new data, check this wiki page and this JIRA and patch.

In the previous post we discussed application of n-grams in collocation identification and now there is a wiki page where you can read more on how Mahout handles collocations. Also, check memory management improvements in collocation identification here. Of course, if you think you need more features in key phrases identification and extraction, check Sematext’s Key Phrase Extractor demo – it does more than collocations, can be extended, etc.

Finally, two new committers, Drew Farris and Benson Margulies, have been added to the list of Mahout committers.

That’s all for now from the Mahout world.  Please feel free to leave comments or if you have any questions – just ask, we are listening!

Nutch Digest, March 2010

This is the first post in the Nutch Digest series and a little introduction to Nutch seems in order. Nutch is a multi-threaded and, more importantly, a distributed Web crawler with distributed content processing (parsing, filtering), full text indexer and a search runtime. Nutch is at version 1.0 and community is now working towards a 1.1. release. Nutch is a large scale, flexible Web search engine, which includes several types of operations. In this post we’ll present new features and mailing list discussion as we describe each of these operations.

Crawling

Nutch starts crawling from a given “seed list” (a list of seed URLs) and iteratively follows useful/interesting outlinks, thus expanding its link database. When talking about Nutch as a crawler it is important to distinguish between two different approaches: focused or vertical crawling and whole Web or wide crawling. Each approach has a different set-up and issues which need to be addressed.  At Sematext we’ve done both vertical and wide crawling.

When using Nutch at large scale (whole Web crawling and dealing with e.g. billions of URLs), generating a fetchlist (a list of URLs to crawl) from crawlDB (the link database) and updating crawlDB with new URLs tends to take a lot of time. One solution is to limit such operations to a minimum by generating several fetchlists in one parse of the crawlDB and then update the crawlDb only once on several segments (set of data generated by a single fetch iteration). Implementation of a Generator that generates several fetchlists at was created in NUTCH-762. Whether this feature will be included in 1.1 release and when will version 1.1 be released, check here.

One more issue related to whole Web crawling which often pops up on the Nutch mailing list is an authenticated websites crawling, so here is wiki page on this subject.

When using Nutch for vertical/focused crawls, one often ends up with a very slow fetch performance at the end of each fetch iteration. An iteration typically starts with high fetch speed, but it drops significantly over time and keeps dropping, and dropping, and dropping. This is known problem.  It is caused by the fetch run having a small number of sites, some of which may have a lot more pages than others, and may be much slower than others.  Crawler politeness, which means it will politely wait before hitting the same domain again, combined with the fact that the number of distinct domains from fetchlist often drops rapidly during one fetch causes fetcher to wait a lot. More on this and overall Fetcher2 performance (which is a default fetcher in Nutch 1.0) you can find NUTCH-721.

To be able to support whole Web crawling, Nutch needs also needs to have scalable data processing mechanism. For this purpose Nutch uses Hadoop’s MapReduce processing and HDFS for storage.

Content storing

Nutch uses Hadoop’s HDFS as a fully distributed storage which creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable and fast computations even on large data volumes. Currently, Nutch is using Hadoop 0.20.1, but will be upgrading to Hadoop 0.20.2. in version 1.1.

In this NUTCH-650 you can find out more about the ongoing effort (and progress) to use HBase as Nutch storage backend. This should simplify Nutch storage and make URL/page processing work more efficient due to the features of HBase (data is mutable and indexed by keys/columns/timestamps).

Content processing

As we noted before, Nutch does a lot of content processing, like parsing (parsing downloaded content to extract text) and filtering (extracting only URLs which match filtering requirements). Nutch is moving away from its own processing tools and delegating content parsing and MimeType detection to Tika by use of Tika plugin, you can read more NUTCH-766.

Indexing and Searching

Nutch uses Lucene for indexing, currently version 2.9.1, but pushing to upgrade Lucene to 3.0.1. Nutch 1.0 can also index directly to Solr, becuse of Nutch-Solr integration. More on this and how using Solr with Nutch worked before Solr-Nutch integration you can find here. This integration now upgrades to Solr 1.4, because Solr 1.4 has a StreamingUpdateSolrServer which simplifies the way docs are buffered before sending to the Solr instance. Another improvement in this integration was a change to SolrIndexer to commit only once after all reducers have finished, NUTCH-799.

Some of patches discussed here and a number of other high quality patches were contributed by Julien Nioche who was added as a Nutch committer in December 2009.

One more thing, keep an eye on an interesting thread about Nutch becoming a Top Level Project (TLP) at Apache.

Thank you for reading, and if you have any questions or comments leave them in comments and we’ll respond promptly!

Mahout Digest, February 2010

Last month we published the Lucene Digest, followed by the Solr Digest, and wrapped the month with the HBase Digest (just before Yahoo posted their report showing Cassandra beating HBase in their benchmarks!).  We are starting February with a fresh Mahout Digest.

When covering Mahout, it seems logical to group topics following Mahout’s own groups of core algorithms.  Thus, we’ll follow that grouping in this post, too:

  • Recommendation Engine (Taste)
  • Clustering
  • Classification

There are, of course, some common concepts, some overlap, like n-grams.  Let’s talk n-grams for a bit.

N-grams

There has been a lot of talk about n-gram usage through all of the major subject areas on Mahout mailing lists. This makes sense, since n-gram-based language models are used in various areas of statistical Natural Language Processing.  An n-gram is a subsequence of n items from a given sequence of “items”. The “items” in question can be anything, though most commonly n-grams are made up of character or word/token sequenceas.  Lucene’s n-gram support provided through NGramTokenizer tokenizes an input String into character n-grams and can be useful when building character n-gram models from text. When there is an existing Lucene TokenStream and character n-gram model is needed, NGramTokenFilter can be applied to the TokenStream.  Word n-grams are sometimes referred to as “shingles”.  Lucene helps there, too.  When word n-gram statistics or model is needed, ShingleFilter or ShingleMatrixFilter can be used.

Classification

Usage of character n-grams in context of classification and, more specifically, the possibility of applying Naive Bayes to character n-grams instead of word/term n-grams is discussed here. Since Naive Bayes classifier as probabilistic classifier treats features of any type, there is no reason it could not be applied to character n-grams, too. Use of character n-gram model instead of word model in text classification could result in more accurate classification of shorter texts.  Our language identifier is a good example of a classifier (though it doesn’t use Mahout) and it provides good results even on short texts, try it.

Clustering

Non-trivial word n-grams (aka shingles) extracted from a document can be useful for document clustering. Similar to usage of document’s term vectors, this thread proposes usage of non-trivial word n-grams as a foundation for clustering. For extraction of word n-grams or shingles from a document Lucene’s ShingleAnalyzerWrapper is suggested. ShingleAnalyzerWrapper wraps the previously mentioned ShingleFilter around another Analyzer.  Since clustering (grouping similar items) is an example of a unsupervised type of machine learning, it is always interesting to validate clustering results. In clustering there are no referent train or, more importantly, referent test data, so evaluating how well some clustering algorithm works is not a trivial task. Although good clustering results are intuitive and often easily visually evaluated, it is hard to implement an automated test. Here is an older thread about validating Mahout’s clustering output which resulted in an open JIRA issue.

Recommendation Engine

There is an interesting thread about content-based recommendation, what content-based recommendation really is or how it should be defined. So far, Mahout has only Collaborative Filtering based recommendation engine called Taste.  Two different approaches are presented in that thread. One approach treats content-based recommendation as Collaborative Filtering problem or generalized Machine Learning problem, where item similarity is based on Collaborative Filtering applied on item’s attributes or ‘related’ user’s attributes (usual Collaborative Filtering treats an item as a black-box).  The other approach is to treat content-based recommendation as a “generalized search engine” problem. Here, matching same or similar queries makes two items similar. Just think of queries as queries composed of, say, key words extracted from user’s reading or search history and this will start making sense.  If items have enough textual content then content-based analysis (similar items are those that have similar term vectors) seems like a good approach for implementing content-based recommendation.  This is actually nothing novel (people have been (ab)using Lucene, Solr, and other search engines as “recommendation engines” for a while), but content-based recommendations is a recently discussed topic of possible Mahout expansion.  All algorithms in Mahout tend to run on top of Hadoop as MapReduce jobs, but in current release Taste does not have the MapReduce version. You can read more about about MapReduce Collaborative Filtering implementation in Mahout’s trunk. If you are in need of a working recommendation engine (that is, a whole application built on top of the core recommendation engine libraries), have a look at the Sematext’s recommendation engine.

In addition to Mahout’s basic machine learning algorithms there are discussions and development in directions which don’t fall under any of the above categories, such as collocation extraction. Often phrase extractors use word n-gram model for co-occurrence frequency counts. Check the thread about collocations which resulted in JIRA issue and the first implementation. Also, you can find more details on how Log-likelihood ratio can be used in the context of the collocation extraction in this thread.

Of course, anyone interested in Mahout should definitely read Mahout in Action (we’ve got ourselves a MEAP copy recently) and keep an eye on features for next 0.3 release.

Follow

Get every new post delivered to your Inbox.

Join 1,218 other followers