Google Summer of Code and Intern Sponsoring

Are you a student looking to do some fun and rewarding coding this summer? Then join us for the 2011 Google Summer of Code!

The application deadline is less than a month away! Lucene has identified initial potential projects, but this doesn’t mean you can’t pick your own. If you need additional ideas, look at our Lucene / Solr for Academia: PhD Thesis Ideas (or just the spreadsheet if you don’t want to read the what and the why). Just be sure to discuss your idea with the community first (send an email to dev@lucene.apache.org).

We should also add that, separately from GSoC, Sematext would be happy to sponsor good students and interns interested in working on projects involving search (Lucene, Solr), machine learning & analytics (Mahout), big data (Hadoop, HBase, Hive, Pig, Cassandra), and related areas. We are a virtual, geographically distributed organization whose members are spread over several countries and continents, and we welcome students from all across the globe. For more information please inquire within.

Mahout Digest, March 2010

In this Mahout Digest we’ll summarize what went on in the Mahout world since our last post back in February.

There has been some talk on the mailing list about Mahout becoming a top-level project (TLP). Indeed, the decision to go TLP has been made (see Lucene March Digest to find out about other Lucene subprojects going for TLP) and this will probably happen soon, now that Mahout 0.3 is released. Check the discussion thread on Mahout as TLP and follow the discussion on what the PMC will look like. Also, Sean Owen has been nominated as Mahout PMC Chair, and there has been some logo tweaking.

There has been a lot of talk on the Mahout mailing list about Google Summer of Code and project ideas related to Mahout. Check the full list of Mahout’s GSoC project ideas or take up the invitation to write up your GSoC idea for Mahout!

Since Sematext is such a big Solr shop, we find the proposal to integrate Mahout clustering or classification with Solr quite interesting. See the MAHOUT-343 JIRA issue for more details. One example of classification integrated with Solr, or more precisely of a classifier as a Solr component, is Sematext’s Multilingual Indexer. Among other things, Multilingual Indexer uses our Language Identifier to classify documents and individual fields based on language.

When talking about classification we should point out a few more interesting developments. There is an interesting thread on the implementation of new classification algorithms and the overall classifier architecture. In the same context of classifier architecture, the MAHOUT-286 JIRA issue covers how (and when) Mahout’s Bayes classifier will be able to support classification of non-text (non-binary) data like numeric features. If you are interested in using decision forests to classify new data, check this wiki page and this JIRA and patch.

In the previous post we discussed the application of n-grams in collocation identification, and now there is a wiki page where you can read more on how Mahout handles collocations. Also, check memory management improvements in collocation identification here. Of course, if you think you need more features in key phrase identification and extraction, check Sematext’s Key Phrase Extractor demo: it does more than collocations, can be extended, etc.

Finally, two new committers, Drew Farris and Benson Margulies, have been added to the list of Mahout committers.

That’s all for now from the Mahout world. Please feel free to leave comments, and if you have any questions, just ask. We are listening!
