Mahout Digest, March 2010
March 30, 2010
In this Mahout Digest we’ll summarize what went on in the Mahout world since our last post back in February.
There has been some talk on the mailing list about Mahout becoming a top-level project (TLP). Indeed, the decision to go TLP has been made (see the Lucene March Digest to find out about other Lucene subprojects going for TLP), and this will probably happen soon, now that Mahout 0.3 has been released. Check the discussion thread on Mahout as TLP and follow the discussion on what the PMC will look like. Also, Sean Owen has been nominated as Mahout PMC Chair. There’s been some logo tweaking, too.
There has been a lot of talk on the Mahout mailing list about Google Summer of Code and project ideas related to Mahout. Check the full list of Mahout’s GSOC project ideas or take up the invitation to write up your own GSOC idea for Mahout!
Since Sematext is such a big Solr shop, we find the proposal to integrate Mahout clustering or classification with Solr quite interesting. See the MAHOUT-343 JIRA issue for more details. One example of classification integrated with Solr, or more precisely of a classifier running as a Solr component, is Sematext’s Multilingual Indexer. Among other things, Multilingual Indexer uses our Language Identifier to classify documents and individual fields by language.
While on the subject of classification, we should point out a few more interesting developments. There is an interesting thread on the implementation of new classification algorithms and the overall classifier architecture. In the same context, the MAHOUT-286 JIRA issue covers how (and when) Mahout’s Bayes classifier will be able to support classification of non-text (non-binary) data, such as numeric features. If you are interested in using decision forests to classify new data, check this wiki page and this JIRA and patch.
In the previous post we discussed the application of n-grams to collocation identification, and now there is a wiki page where you can read more about how Mahout handles collocations. Also, check the memory management improvements in collocation identification here. Of course, if you think you need more features for key phrase identification and extraction, check Sematext’s Key Phrase Extractor demo – it does more than collocations, can be extended, etc.
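For readers who want a feel for what n-gram based collocation identification involves, here is a minimal sketch that scores adjacent word pairs with a log-likelihood ratio (LLR) test, which to our understanding is the kind of statistic Mahout's collocation code uses for ranking. This is not Mahout code: it runs in memory on a toy token list, and the function names (`llr`, `score_bigrams`) are our own, chosen for illustration only.

```python
import math
from collections import Counter


def x_log_x(x):
    """Return x * ln(x), treating 0 * ln(0) as 0."""
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: count of (A, B) together    k12: A followed by something else
    k21: something else before B     k22: neither A first nor B second
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def score_bigrams(tokens):
    """Map each adjacent word pair in `tokens` to its LLR score."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(bigrams.values())
    heads, tails = Counter(), Counter()
    for (a, b), n in bigrams.items():
        heads[a] += n  # how often `a` appears as the first word
        tails[b] += n  # how often `b` appears as the second word
    scores = {}
    for (a, b), n in bigrams.items():
        k12 = heads[a] - n
        k21 = tails[b] - n
        k22 = total - heads[a] - tails[b] + n
        scores[(a, b)] = llr(n, k12, k21, k22)
    return scores


if __name__ == "__main__":
    text = "new york is far and new york is big and old"
    ranked = sorted(score_bigrams(text.split()).items(),
                    key=lambda kv: kv[1], reverse=True)
    for pair, score in ranked[:3]:
        print(pair, round(score, 2))
```

Pairs whose words occur together more often than their individual frequencies would suggest get higher scores, so sorting by score puts likely collocations first. A real corpus would need tokenization, a minimum-count cutoff and a distributed counting step, all of which this sketch leaves out.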
Finally, two new committers, Drew Farris and Benson Margulies, have been added to the list of Mahout committers.
That’s all for now from the Mahout world. Please feel free to leave comments, and if you have any questions, just ask – we are listening!