Mahout Digest, April 2010
May 4, 2010
April was a busy month here at Sematext, so we are a few days late with our April Mahout Digest, but better late than never!
The Board has approved Mahout's move to a Top Level Project (TLP) at the Apache Software Foundation. Check the status thread on changes to the Mahout mailing lists, svn layout, etc. Several mailing list discussions about what needs to be done for Mahout to become a Top Level Project resulted in the Mahout TLP to-do list.
As we reported in the previous digest, there was a lot of talk on the mailing list about Google Summer of Code (GSoC), and here is a follow-up on this subject. GSoC announcements are up, and Mahout got 5 projects accepted this year. Check the full list of GSoC projects and the Mahout community's reactions on the mailing list.
In the past we reported on the idea of Mahout/Solr integration, and it seems this is now being realized. Read more on the features and implementation progress of this integration here.
At the beginning of April, there was a proposal to make collections releases independent of the rest of Mahout. After a transition period of loosening the coupling between mahout-collections and the rest of Mahout, mahout-collections was extracted as an independent component. The vote on the first independent release of collections is in progress. Mahout-collections-1.0 differs from the version released with Mahout 0.3 only in that the dependency on slf4j has been removed.
The parts of Mahout that use Lucene have been updated to the latest Lucene release, 3.0.1; the code changes for this migration can be found in this patch.
A question about cosine similarity between two items, and how it is implemented in Mahout, generated a good discussion. More on cosine similarity between two items, which is implemented as PearsonCorrelationSimilarity (source) in the Mahout code, can be found in MAHOUT-387.
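To see why cosine similarity and Pearson correlation can share an implementation, it helps to recall that the Pearson correlation of two vectors is the cosine similarity of those vectors after each has had its mean subtracted. The following is a minimal sketch of that identity in plain Java. It is not Mahout's PearsonCorrelationSimilarity code, and the class and method names (SimilaritySketch, cosine, center, pearson) are our own, chosen for illustration.

```java
import java.util.Arrays;

// Illustrative sketch only; NOT Mahout's PearsonCorrelationSimilarity source.
// All names here are hypothetical.
final class SimilaritySketch {

    // Cosine of the angle between two equal-length vectors.
    // Returns NaN if either vector is all zeros.
    static double cosine(double[] x, double[] y) {
        double dot = 0.0, normX = 0.0, normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }

    // Subtract the vector's mean from every component.
    static double[] center(double[] v) {
        double mean = Arrays.stream(v).average().orElse(0.0);
        return Arrays.stream(v).map(a -> a - mean).toArray();
    }

    // Pearson correlation, computed as the cosine of the mean-centered vectors.
    // Returns NaN if either vector is constant (zero variance).
    static double pearson(double[] x, double[] y) {
        return cosine(center(x), center(y));
    }

    public static void main(String[] args) {
        // Made-up preference values for two items over the same four users.
        double[] itemA = {5, 3, 4, 4};
        double[] itemB = {3, 1, 2, 3};
        System.out.println("cosine  = " + cosine(itemA, itemB));
        System.out.println("pearson = " + pearson(itemA, itemB));
    }
}
```

When the input data is already centered (each vector has mean zero), center() changes nothing and the two measures return the same value.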
The goal of clustering is to determine the grouping in a set of data (e.g. a corpus of items, a set of search results, and so on). A common problem in clustering implementations is that the data to be clustered has a high number of dimensions, which usually has to be reduced to make clustering computationally more feasible (read: faster). The simplest way of reducing those dimensions is to use some sort of 'mapping function' that takes data represented in n-dimensional space and transforms it into data represented in m-dimensional space, where m is lower than n, hence the reduction. Of course, the key here is to preserve the variance of important data features (dimensions) and flatten out unimportant ones. One simple and interesting approach to clustering is to use several independent hash functions chosen so that similar items are more likely to collide than dissimilar ones. This approach, called Minhash based clustering, was proposed back in March as part of GSoC (see the proposal). You'll find more on the theory behind it and on an implementation applicable to Mahout in MAHOUT-344.
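For readers who want a feel for how hash collisions can stand in for similarity, here is a minimal MinHash sketch in plain Java. It is not the code from MAHOUT-344 or the GSoC proposal. The class name MinHashSketch, the linear hash family (a * x + b) mod p, and the clusterKey helper are all assumptions made for illustration, and item ids are assumed to be non-negative integers below 2^31 - 1.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Illustrative sketch only; NOT the MAHOUT-344 implementation.
// All names and the choice of hash family are assumptions.
final class MinHashSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1, a prime
    private final long[] a; // multipliers in [1, PRIME - 1]
    private final long[] b; // offsets in [0, PRIME - 1]

    MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1);
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    // For each hash function, keep the smallest hash value over the set.
    long[] signature(Set<Integer> itemIds) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int id : itemIds) {
            long x = Math.floorMod((long) id, PRIME);
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME; // stays below 2^63, no overflow
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // The fraction of matching positions estimates the Jaccard similarity.
    static double estimateJaccard(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                same++;
            }
        }
        return (double) same / s1.length;
    }

    // Sets whose first k min-hashes all collide get the same cluster key.
    static String clusterKey(long[] sig, int k) {
        return Arrays.toString(Arrays.copyOf(sig, k));
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(128, 42L);
        Set<Integer> u1 = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8));
        Set<Integer> u2 = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 9));
        long[] s1 = mh.signature(u1);
        long[] s2 = mh.signature(u2);
        System.out.println("estimated Jaccard = " + estimateJaccard(s1, s2));
        System.out.println("same cluster key  = " + clusterKey(s1, 2).equals(clusterKey(s2, 2)));
    }
}
```

Each set is reduced from an arbitrary number of dimensions to a fixed-length signature, and the more two sets overlap, the more signature positions they share. Using more hash functions in the cluster key makes clusters tighter; using fewer makes them looser.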
This concludes our short April Mahout Digest. Once Mahout completes its transition to TLP, we expect the project to flourish even more. We’ll be sure to report on the progress and new developments later this month. If you are a Twitter user, you can follow @sematext on Twitter.