Solr Digest, August 2010

August brought a lot of activity into the Solr world. There were many important developments, so we again compiled the most interesting ones for you, grouped into 4 categories:

Some new (and already committed) features

  • We already wrote about new work done on CollapsingComponent in June’s digest under SOLR-1682. A lot of work was done on this component and it appears that it is very close to being committed. Patches attached to the issue are functional, so you can give it a try.
  • SpellCheckComponent got an improvement related to recent Lucene changes – SOLR-2053, Add support for specifying Spelling SuggestWord Comparator to Lucene spell checkers for SpellCheckComponent. The issue is already fixed; a patch is attached if you need it, and the change is committed to both trunk and the 3_x branch.
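
For the curious, here is roughly what the new Lucene-side hook looks like – a minimal sketch, assuming a Lucene 3.x build with this change applied (SuggestWordFrequencyComparator ships alongside it):

  import org.apache.lucene.search.spell.SpellChecker;
  import org.apache.lucene.search.spell.SuggestWordFrequencyComparator;
  import org.apache.lucene.store.RAMDirectory;

  public class SpellComparatorSketch {
    public static void main(String[] args) throws Exception {
      // In-memory spelling index, just for illustration
      SpellChecker checker = new SpellChecker(new RAMDirectory());
      // Prefer more frequent terms when suggestion scores tie
      checker.setComparator(new SuggestWordFrequencyComparator());
      // ... index a dictionary and call checker.suggestSimilar(...) as usual
    }
  }
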
  • Another minor feature is an improvement of WordDelimiterFilter in SOLR-2059 – Allow customizing how WordDelimiterFilter tokenizes text. The patch is already there and committed to trunk and 3_x.
  • A performance boost for faceting can be found in SOLR-2089 – Faceting: order term ords before converting to values. Behind this intimidating title hides a very decent speedup in cases when facet.limit is high. A patch is available, and trunk and branch_3x also got this magic committed.

Some new features being discussed and implemented

  • One very important (and probably much wanted) feature just got its Jira issue – SOLR-2080 – Create a Related Search Component. The issue was created by Grant Ingersoll, so we can expect some quality work to be done here. There are no patches (or even discussions) yet as the issue is in its infancy, but you can watch its progress in Jira. In the meantime, if you’re interested in such functionality, you can check Sematext’s RelatedSearches product.
  • Jira issue SOLR-2026 – Need infrastructure support in Solr for requests that perform multiple sequential queries – might add some interesting capabilities to search components, especially if you’re writing some of them on your own. We at Sematext have plenty of experience writing custom Solr components (check, for instance, our DYM ReSearcher or its Relaxer sibling), so we know that it is sometimes not a very pleasant task. If Solr gets better support for executing multiple queries during a single request, writing custom components will become easier. One patch is already posted to the issue, so you can check it out; however, it is still unclear how this feature will evolve. We’re hoping for a flexible and comprehensive solution that would be easily extensible to many other features.
  • Defining QueryComponent’s default query parser can be made configurable with the patch attached to issue SOLR-2031. You probably haven’t encountered many cases where you needed this functionality, but if you did, you had a problem; now that problem will become history.
  • It appears that QueryElevationComponent might get an improvement: Distinguish Editorial Results from “normal” results in the QueryElevationComponent. Jira issue SOLR-2037 will be the place to watch the progress.

Some newly found bugs

  • DataImportHandler has a bug – Multivalued fields with dynamic names does not work properly with DIH. The fix isn’t available yet, but if you have such problems, you can check the status here.
  • Another bug in DataImportHandler points to a connection-leak issue – DIH doesn’t release JDBC connections in conjunction with DB2. There is no fix at the moment, but, as usual, you can check the status in Jira.

Other interesting news

  • One potentially useful tool we recommend checking out is SolrMeter. It is a standalone tool for stress testing your Solr. From their site: The main goal of this open source project is to bring to the solr user community a “generic tool to interact specifically with solr”, firing queries and adding documents to make sure that your Solr implementation will support the real use. With SolrMeter you can simulate your work load over solr index and retrieve statistics graphically.
  • In which IDEs do you work with Solr/Lucene? Here at Sematext, we use both Eclipse and IntelliJ IDEA. If you use the latter and you want to set up Lucene or Solr in it, you can check a very useful description and patch in LUCENE-2611 IntelliJ IDEA setup.

We hope you enjoyed another Solr Digest from @sematext.  Come back and read us next month!

Solr Digest, July 2010

As usual, July was one of the slower months in the Solr world; however, we managed to find a few interesting topics for our readers.

  • An interesting feature might be added with SOLR-1979 – Create LanguageIdentifierUpdateProcessor. It would provide the ability to handle text in different languages differently (think stemming during analysis, for instance) and to do it automatically. The issue was just created, so work on it and any usable patches are coming some time in the future. However, if you need something working now, Sematext has a few products with similar multilingual functionality, for instance, Multilingual Indexer or its cousin Language Identifier.
  • Another interesting feature might come with SOLR-1980 – Implement boundary match support. This will enable one to specify that a query should match only at the start or at the end of a field (or be an exact match), not somewhere in the middle, which could provide more relevant search results in some specific cases. This issue is also in its infancy and has no patches yet, so we’ll have to wait and see how it progresses.
  • Ever wanted Solr to store something other than the raw input value as the value of some field? (Remember: when you search Solr, you search on analyzed and indexed values; when you fetch the content of some field, you get the raw input value added to that field, not its analyzed version.) A patch for that already exists in one rather fresh JIRA issue – SOLR-1997 – Store internal value instead of input one.
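
To make that stored-vs-indexed distinction concrete, here is a minimal sketch using the plain Lucene 3.x API (this is background for the issue, not the SOLR-1997 patch itself):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.RAMDirectory;
  import org.apache.lucene.util.Version;

  public class StoredVsIndexed {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter(new RAMDirectory(),
          new StandardAnalyzer(Version.LUCENE_30),
          IndexWriter.MaxFieldLength.UNLIMITED);
      Document doc = new Document();
      // Indexed as lowercased tokens ("running", "quickly"),
      // but stored - and returned at fetch time - exactly as typed
      doc.add(new Field("title", "Running QUICKLY",
          Field.Store.YES, Field.Index.ANALYZED));
      writer.addDocument(doc);
      writer.close();
    }
  }
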
  • Getting ready to start using Solr, but unsure about which version you should use? Don’t worry, confusion about Solr’s versions started this spring (see the Solr May 2010 Digest), but things have stabilized lately. The latest release is the fairly recent 1.4.1, which is basically the 1.4 version with many bugfixes. The next release version is 3.1, which can be found on the branch_3x branch. You can find its nightly build versions here. The trunk is still used for “unstable” development and the future 4.0 version. To get more information, check these recent threads on the Solr mailing list: here and here.
  • Many will probably agree that Solr’s SpellCheckComponent isn’t very useful in real-life applications. One of the main problems is that it handles multi-word queries poorly: it creates its suggestion as a collated version of the best suggestion for each word of the query, so you often get suggestions which have 0 hits. Also, it doesn’t return important information about the suggested query, like how many hits that query would generate and what results it would give. Some of these issues could be fixed some day with SOLR-2010 – Improvements to SpellCheckComponent Collate functionality. The first version of the patch is already provided. However, if you’d like to use such functionality in your Solr production today, you might consider one much more sophisticated and production-ready component developed by Sematext – DYM ReSearcher – you can see DYM ReSearcher in action on Search-Lucene.com, for example.
  • One minor piece of functionality was added to QueryElevationComponent – Add option to return only the specified results. It was added in JIRA issue SOLR-1966 and is already committed to 3.x and trunk.

We hope that this was enough to satisfy your Solr appetite.  Hopefully, we’ll dig more interesting topics for you in August.  Until then you can keep up with us via @sematext on Twitter.

Solr Digest, June 2010

We have already written about news in the Solr world this month here and here, so you already know that Solr’s 1.4.1 version was released, based on Lucene 2.9.3. Still, one thread from the mailing lists gives some more info about svn branches and how they are related to Solr versions.

Real-time indexing is again one of the hot topics. We already mentioned the Zoie plugin in the Solr March Digest, so this time we’ll point to an interesting discussion on the mailing lists. In case you followed this topic, Zoie Solr Plugin is a great plugin for Solr, but it still has some limitations. For instance, the master-slave architecture (which is the base of almost all big Solr deployments) isn’t well suited for Zoie. Version 2.9 of Lucene brought the interesting addition of Near Real-Time (NRT) Search capabilities. As you probably already know, the Solr 1.4 release was already running on Lucene 2.9 (2.9.1, to be precise), but support for NRT wasn’t implemented. Solr’s next release might have it, since there is a JIRA issue dealing with NRT integration, but don’t hold your breath.
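
For reference, the Lucene entry point behind NRT is IndexWriter.getReader(); here is a minimal sketch with the stock Lucene 2.9/3.x API (this is precisely what Solr does not yet expose):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.RAMDirectory;
  import org.apache.lucene.util.Version;

  public class NrtSketch {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter(new RAMDirectory(),
          new StandardAnalyzer(Version.LUCENE_29),
          IndexWriter.MaxFieldLength.UNLIMITED);
      // A near-real-time reader: sees buffered changes without
      // a full commit + reopen cycle
      IndexReader nrtReader = writer.getReader();
      // ... search with new IndexSearcher(nrtReader) ...
      nrtReader.close();
      writer.close();
    }
  }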

We’ll also mention some new functionalities in Solr:

  • Added relevancy function queries – JIRA issue SOLR-1932 adds function queries for relevancy factors such as tf, idf, etc. This issue is already fixed and committed to trunk.
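
Here is a hedged SolrJ sketch of what querying with one of these function queries might look like (the text field name and the local Solr URL are our assumptions):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class FunctionQuerySketch {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      // Rank documents purely by the term frequency of "solr" in the text field;
      // idf(), docfreq() and friends combine the same way
      SolrQuery query = new SolrQuery("{!func}tf(text,solr)");
      query.setFields("id", "score");
      QueryResponse rsp = server.query(query);
      System.out.println(rsp.getResults());
    }
  }
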
  • Improved Solr response indentation – added with issue SOLR-1933. Solr previously supported only 7 levels of indenting, so this issue fixes that. The downside is a small increase in response size (since blank spaces will be used instead of tabs). The fix is committed not only to trunk, but also to the 3_x branch.
  • Ever wanted to see index files without logging into your servers? This patch makes them visible from the Solr admin pages or by using the LukeRequestHandler.
  • Another related issue also got a patch and is already committed to the trunk – SOLR-1946 – misc enhancements to SystemInfoHandler. Here is a brief list of additions: include CWD in directory info, include a raw-bytes version of memory stats, and include a list of all system properties.

We’ll end with the short overview of interesting issues which are still in development:

  • Use Lucene’s Field Cache To Retrieve Stored Fields From Memory – issue SOLR-1961 isn’t finished yet, although there is a patch. When it is finished, it might give a new boost to the performance of your Solr server, thanks to developers from Cisco.
  • If you want to track performance improvements prepared for 4.0 release, you can just follow JIRA issue SOLR-1965. Some stuff is already listed there, so you can go and check what is in store for the future versions.
  • For anyone using PHP to talk to Solr, there is a new PHP Response Writer – currently, it is available as a Jar that has to be added to your Solr’s classpath. For more details check JIRA issue comments.
  • Field collapsing is one of the longest still-unresolved issues in the Solr world. SOLR-236 (many people probably easily recognize this JIRA issue number :)) was created more than 3 years ago and over time it has grown into a “monster” – a huge number of comments, patches, problems, parameters… you name it.  Integrating it with your Solr version was never fun (we tried it!). New hope appeared on the field collapsing horizon with the opening of SOLR-1682 (that’s a new JIRA issue for you to commit to your memory!). Some work had already been done there in the past, but now Yonik has decided to dedicate some of his time to this issue, which means we might soon have a non-monster implementation that will be committed to Solr.

That’s all for this month. As you can see, there was no mention of the new 1.4.1 release in the Solr May Digest, yet it happened, almost unexpectedly. So stay tuned (and follow @sematext) – you never know if something unexpected might happen this month too…

Lucene Digest, May 2010

Last month we were busy with work and didn’t publish our monthly Lucene Digest.  To make up for it, this month’s Lucene Digest really covers all Lucene developments in May from A to Z.

  • Mark Harwood had a busy month.  In LUCENE-2454 he contributed production-tested and often-needed functionality for properly indexing parent-child entities (or, more generally, any form of hierarchical search).  He introduced his work in Adding another dimension to Lucene searches.  Joaquin Delgado has been talking about the merge of unstructured and structured search (not surprising, considering his old company, with its Lucene-based Federated Search product, was acquired by Oracle several years ago!), so he quickly related this to the ability to perform XQuery + Full-Text searches.  MarkLogic, watch your back! ;)
  • Mark also contributed a match spotter for all query types in LUCENE-1999.  This patch makes it possible to figure out which field(s) a match for a particular hit was on, which is functionality people ask about on the Lucene and Solr mailing lists every so often.  A warning, though: spotting the match and encoding that information causes some score precision loss.
  • While Lucene already has TimeLimitedCollector, it’s not perfect and offers room for improvement.  Back in 2009, Mark came up with TimeLimitedIndexReader, as you can tell from his messages in the Improving TimeLimitedCollector thread, and created a patch with it in LUCENE-1720, which filled some of the TimeLimitedCollector’s gaps (see the collector sketch after this list):
    • Any reader activity can be time-limited rather than just single searches, e.g., the document retrieval phase.
    • Times out faster (i.e., runaway queries such as fuzzy queries are detected quickly, before the last “collect” stage of query processing)
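
For context, here is how the existing collector-based time limiting is used – a minimal sketch with the stock Lucene API (where the class is spelled TimeLimitingCollector); only the collect phase is bounded, which is exactly the gap TimeLimitedIndexReader aims to close:

  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TimeLimitingCollector;
  import org.apache.lucene.search.TopScoreDocCollector;

  public class TimeLimitedSearch {
    // Give the query a 1-second collection budget; partial hits survive a timeout
    static TopScoreDocCollector search(IndexSearcher searcher, Query query) throws Exception {
      TopScoreDocCollector topDocs = TopScoreDocCollector.create(10, true);
      try {
        searcher.search(query, new TimeLimitingCollector(topDocs, 1000));
      } catch (TimeLimitingCollector.TimeExceededException e) {
        // Runaway query: whatever was collected before the timeout is still usable
      }
      return topDocs;
    }
  }
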
  • Robert Muir, who gave a well-received presentation on Finite State Queries in Lucene at New York Search & Discovery Meetup (see slides) back in April 2010, has been busy consolidating Lucene and Solr analyzers, tokenizers, token filters, character filters, etc. and moving them to their new home: modules/analysis, under Lucene root dir in svn.  The plan is to produce separate and standalone artifacts (read: jars) for this analysis module.  Here at Sematext we will make use of this module immediately for some of our products that currently list Lucene as a dependency, even though they really only need Lucene’s analyzers.  Solr, too, will be another customer for the new analysis module, as described by Robert in solr and analyzers module (yes, we’re showing off Search-Lucene.com’s in-document search-term highlighting, which we find very useful).
  • Robert also worked on and committed an ICU-based tokenizer for Unicode-based text segmentation in LUCENE-2414.  This translates to having the ability to properly tokenize text that doesn’t use spaces as token separators.  If you’ve ever had to deal with searching Chinese, for example, you’ll know that word segmentation is one of the initial challenges one has to deal with.
  • Talking about splitting on space, another task Robert took upon himself was to stop the Lucene QueryParser from splitting queries on space: LUCENE-2458.  This problem of tokenizing queries on space comes up quite often, so this is going to be a very welcome improvement in Lucene.
  • One day Robert was super bored, so he decided to write a Lucene analyzer for Indonesian: LUCENE-2437.
  • Andrzej and Israel Ekpo (the author of one of the Solr PHP clients) both decided to add support for search-time bitwise operations on integer fields at around the same time.  Israel’s work is in LUCENE-2460, with an accompanying SOLR-1913 issue, while Andrzej’s is in SOLR-1918 and has no pure Lucene patch.  The difference is that Israel’s patch offers only filtering, while Andrzej’s patch performs scoring, which allows finding the best matching inexact bit patterns. This has applications in, e.g., near-duplicate detection.
  • In one of our current engagements we are working with a large, household-name organization and a big U.S. government contractor.  Their index is heavily sharded and is well over 2 TB.  Working with such large indices is no joke (though I’m happy to say we were able to immediately improve their search performance by 40% in the first performance tuning iteration). What if we could make their indices smaller?  Would that make their search even faster?  Of course!  In LUCENE-1812 (nice number), Andrzej implemented a static index pruning tool that removes posting data from indices for terms with in-document frequency lower than some threshold.  We haven’t used this tool, and it looks like we may not use it for a while, because IBM apparently holds a patent on the exact same algorithm used in this tool.
  • Phrase queries got a little performance boost in LUCENE-2410.  Every little bit counts!
  • Tom Burton-West created and contributed a handy tool that outputs total term frequency and document frequency from a Lucene index: LUCENE-2393.  This tool can be useful for estimating the sizes of some of the Lucene index files, and thus getting a better grasp on disk IO needs.
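
We haven’t reproduced Tom’s code here, but a sketch of what such a tool computes, written against the stock Lucene 3.x API, looks roughly like this:

  import java.io.File;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;
  import org.apache.lucene.index.TermEnum;
  import org.apache.lucene.store.FSDirectory;

  public class TermStatsSketch {
    public static void main(String[] args) throws Exception {
      IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
      TermEnum terms = reader.terms();
      TermDocs termDocs = reader.termDocs();
      while (terms.next()) {
        Term term = terms.term();
        long totalFreq = 0;
        termDocs.seek(term);
        while (termDocs.next()) {
          totalFreq += termDocs.freq(); // occurrences within this document
        }
        System.out.println(term + "\tdocFreq=" + terms.docFreq()
            + "\ttotalTermFreq=" + totalFreq);
      }
      reader.close();
    }
  }
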
  • On both the Lucene and Solr lists we often see people asking about updating individual Document fields instead of simply deleting and re-adding the whole Document.  The delete-and-re-add approach is not necessarily a problem for Lucene/Solr itself, but it is for an external system from which data for the complete new Document needs to be fetched.  Shai Erera, another recently added Lucene committer, proposed a new approach for incremental field updates that was well received.  Once implemented, this will be a big boon for Lucene and Solr!  If that thread or message is too long for you to read, let us at least highlight (pun intended) the two great use cases from this message.
  • Lucandra is a Cassandra backend for Lucene.  But no, it’s not a Lucene Directory implementation.  Lucandra has its own IndexReader and IndexWriter that read from Cassandra and write to it.  But in LUCENE-2456 we now have another option: a Cassandra-based Lucene Directory.  We hope to have a post on this in the near future!
  • The author of the Cassandra-based Lucene Directory also opened LUCENE-2425 for an Anti-Merging Multi-Directory Indexing Framework that splits an index (at run-time) into multiple sub-indices, based on a “split policy”, several of which have also been added to Lucene’s JIRA.  This is somewhat similar to Lucene’s ParallelWriter, but has some differences, as described in the issue.
  • Michael McCandless is working on prototyping a multi-stage pipeline sub-system that aims to further decouple analysis from indexing.  In this pipeline, indexing would be just one step, one stage in the pipeline.  Based on the work done so far, this may even bring some performance improvements.
  • LUCENE-2295 added a LimitTokenCountAnalyzer / LimitTokenCountFilter to wrap any other Analyzer and provide the same functionality that MaxFieldLength provided on IndexWriter.
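
A minimal usage sketch, assuming the trunk API as added by the issue:

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.LimitTokenCountAnalyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.util.Version;

  public class LimitTokenCountSketch {
    public static void main(String[] args) {
      // Index at most 10,000 tokens per field, whatever the wrapped analyzer -
      // the role IndexWriter.MaxFieldLength used to play
      Analyzer limited = new LimitTokenCountAnalyzer(
          new StandardAnalyzer(Version.LUCENE_CURRENT), 10000);
      System.out.println(limited);
    }
  }
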
  • Shay Banon, the author of Elastic Search, contributed LUCENE-2468 (can you complete this hard to figure out numeric sequence?), which allows one to specify how new Document deletions should be handled in CachingWrapperFilter and CachingSpanFilter.  We recently did work for another large organization and a household name (in the U.S. at least) where we improved their Lucene-based search performance by over 30%.  One of the things we did was making good use of CachingWrapperFilter.
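
Here is a hedged sketch of the new knob, assuming the DeletesMode option as described in the issue:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.CachingWrapperFilter;
  import org.apache.lucene.search.Filter;
  import org.apache.lucene.search.QueryWrapperFilter;
  import org.apache.lucene.search.TermQuery;

  public class CachedFilterSketch {
    public static void main(String[] args) {
      Filter expensive = new QueryWrapperFilter(new TermQuery(new Term("type", "article")));
      // Cache the filter's bits per reader, and rebuild the cached bits when
      // deletions arrive instead of serving hits on deleted documents
      Filter cached = new CachingWrapperFilter(expensive,
          CachingWrapperFilter.DeletesMode.RECACHE);
      System.out.println(cached);
    }
  }
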
  • LUCENE-2480 removes support for pre-Lucene 3.* indices from Lucene 4.*.  Thus, if you are still on Lucene 1.* or Lucene 2.*, we suggest moving to Lucene 3.* soon.  But, due to radical Lucene changes, even moving from Lucene 3.x to Lucene 4.0 won’t be as seamless as with previous Lucene upgrades.  Lucene 4.0 will include a Lucene 3.x to Lucene 4.0 migration tool: LUCENE-2441.

That’s it for this month.  Remember that you can also follow @sematext on Twitter.

Mahout Digest, May 2010

As we reported in our previous Digest, Mahout was approved by the board to become a Top Level Project (TLP), and TLP-related preparations kept Mahout developers busy in May. The Mahout mailing lists (user- and dev-) were moved to their new addresses, and all subscribers were automatically moved to the new lists: user@mahout.apache.org and dev@mahout.apache.org. Regarding other TLP-related changes and their progress, check the list of items to complete the move.

We’ll start a quick walk-through of May’s new features and improvements with an effort to reduce Mahout’s Vector representation size on disk, which resulted in an 11% size reduction on a test data set.

A discussion of how UncenteredCosineSimilarity, an implementation of cosine similarity which does not “center” its data, should behave in the distributed vs. non-distributed version, along with a patch for the distributed version, can be found in MAHOUT-389. Furthermore, a distributed implementation of any ItemSimilarity currently available in a non-distributed form was committed in MAHOUT-393.
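
For readers who want the math: in LaTeX notation, the uncentered cosine similarity of two preference vectors x and y is

  \mathrm{sim}(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}

The “centered” variant first subtracts each vector’s mean (x_i \to x_i - \bar{x}), which turns the same formula into the Pearson correlation.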

Mahout has utilities to generate Vectors from a directory of text documents, and one improvement in terms of consistency and ease of understanding was made to the tf/tfidf-vectors outputs. When using the bin/mahout seq2sparse command to generate vectors from text (actually, from the sequence file generated from text), the output was created in different directories depending on the weighting method (term frequency or term frequency–inverse document frequency). Now, with the fix from MAHOUT-398, the tf-vectors and tfidf-vectors output directories are at the same level.

We’ll end with the Hidden Markov Model and its integration into Mahout. In MAHOUT-396 you’ll find a patch and more information regarding where and how it is used.

That’s all from Mahout’s world for May, please feel free to leave any questions or comments and don’t forget you can follow @sematext on Twitter.

Solr Digest, May 2010

May’s Solr Digest brings another review of interesting Solr developments and a short look at the current state of Solr’s branches and versions. Confused about which versions to use and which to avoid? Don’t worry, many people are.  We’ll try to clear it up in this Digest.

  • In April’s edition of Solr Digest, we mentioned two JIRA issues dealing with document level security (SOLR-1834 and SOLR-1872). Now another issue (SOLR-1895) deals with LCF security and aims to work with SOLR-1834 to provide a complete solution.
  • One ancient JIRA issue, SOLR-397, finally got resolved and its code is now committed. Solr now has the ability to control how date range end points in facets are dealt with.  You can use this functionality by specifying the facet.date.include HTTP request parameter, which can have the values “all”, “lower”, “upper”, “edge”, “between”, “before”, or “after”. More details about this can be found in SOLR-397.
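
A hedged SolrJ sketch of a date facet request using the new parameter (the timestamp field name and the ranges are made up for illustration):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class DateFacetSketch {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrQuery q = new SolrQuery("*:*");
      q.set("facet", "true");
      q.set("facet.date", "timestamp");
      q.set("facet.date.start", "NOW/DAY-30DAYS");
      q.set("facet.date.end", "NOW/DAY");
      q.set("facet.date.gap", "+1DAY");
      // Count values on a bucket's lower bound and on gap edges into the bucket
      q.set("facet.date.include", "lower", "edge");
      System.out.println(server.query(q).getFacetDates());
    }
  }
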
  • Another issue related to date ranges was created.  This one aims to add a Date Range QParser to Solr, which would simplify the definition of date range filters resulting from date faceting.   This issue is still in its infancy and has no patches attached to it as of this writing, but it looks useful and promising.  When we add date faceting to Search-Lucene.com and Search-Hadoop.com we’ll be looking to use this date range query parser.
  • Some errors in Solr will be much easier to trace after JIRA issue SOLR-1898 gets resolved. Everyone using Solr has probably encountered exceptions like java.lang.NumberFormatException: For input string: “0.0”.  The message itself lacks some crucial details, such as information about the document and field that triggered the exception.  SOLR-1898 will solve that problem, and we are looking forward to this patch!
  • Have you recently been in the situation where you were unsure about which branch or version of Solr you should use on your projects? If so, you’re certainly not alone! After the recent merge of Solr and Lucene (covered in the Solr March Digest and Lucene March Digest), things became confusing, especially for casual observers of Lucene and Solr. Here are some facts about the current state of Solr:
  1. latest stable release version of Solr is still 1.4
  2. 1.4 version was released more than 6 months ago, so many new features, patches and bug fixes aren’t included in it
  3. however, it was a stable release, so if you’re planning your production very soon, one low-risk choice would be using the 1.4 version, on top of which you could apply the patches that you find necessary for your deployment
  4. current development is ongoing on trunk (considered the unstable version and slated for the future Solr 4.0 release) and on the branch named branch_3x. This branch is the most likely candidate for the next version of Solr (named 3.1) and is considered a (stable) development version which could be usable, though you have to be careful with your testing, as always.
  5. another choice could be an old 1.5 nightly build, but 1.5 is abandoned and, in our opinion, it makes more sense to use nightly builds from branch_3x

Here are a couple of threads where you can get more information:

  1. Lucene 3.x branch created
  2. Which Solr to use?
  • To show one of the dangers of unstable versions, we’ll immediately point to one recently opened JIRA issue related to a “file descriptor leak” while indexing.
  • Although we at Sematext have been using Java 6 for a very long time, both for our products and with our clients, some people might still be stuck with Java 5. It appears that they will never be able to use Solr 4.0 once it is released, since the Solr trunk now requires Java 6 to compile.

We’ll finish this month’s Solr Digest with two new Solr features:

  • For anyone wanting to use the JSON format when sending documents to Solr, the JSON update handler is now committed to trunk
  • On the other hand, if you need CSV as an output format from Solr, you might benefit from the work on the new CSV Response Writer. Currently, there are no patches for it, but you can watch the issue and see when one is added.

Thanks for reading another Solr Digest!  Help us spread the word, please Re-Tweet it, and follow @sematext on Twitter.

Nutch Digest, May 2010

With May almost over, it’s time for our regular monthly Nutch Digest. As you know, Nutch became a top-level project and the first related changes are already visible/applied. The Nutch site was moved to its new home at nutch.apache.org and the mailing lists (user- and dev-) have been moved to new addresses: user@nutch.apache.org and dev@nutch.apache.org.  Old subscriptions were automatically moved to the new lists, so subscribers don’t have to do anything (apart from changing mail filters, perhaps).  Even though Nutch is not a Lucene sub-project any more, we will continue to index its content (mailing lists, JIRA, Wiki, web site) and keep it searchable over on Search-Lucene.com.  We’ve helped enough organizations with their Nutch clusters that it makes sense for at least us at Sematext to have a handy place to find all things Nutch.

There is a vote on an updated Release Candidate (RC3) for the Apache Nutch 1.1 release.  The major differences between this release and RC2 are several bug fixes – NUTCH-732, NUTCH-815, NUTCH-814, NUTCH-812 – and one improvement in NUTCH-816.  Nutch 1.1 is expected shortly!

Among the relevant changes related to Nutch development during May, it’s important to note that Nutch will be using Ivy in its builds and that there is one change regarding Nutch’s language identification: the code for Japanese changed from “jp” to “ja”.  The former is Japan’s country code and the latter is the language code for Japanese.

There’s been a lot of talk on the Nutch mailing list about the first book on Nutch which Dennis Kubes started writing.  We look forward to reading it, Dennis!

Nutch developers were busy with the TLP-related changes and with preparations for the Nutch 1.1 release this month, so this Digest is a bit thinner than usual.  Follow @sematext on Twitter.

Mahout Digest, April 2010

April was a busy month here at Sematext, so we are a few days late with our April Mahout Digest, but better late than never!

The Board has approved Mahout to become a Top Level Project (TLP) at the Apache Software Foundation. Check the status thread on the change of Mahout mailing lists, svn layout, etc. Several discussions on the mailing list, all describing what needs to be done for Mahout to become a Top Level Project, resulted in the Mahout TLP to-do list.

As we reported in the previous digest, there was a lot of talk on the mailing list about Google Summer of Code (GSoC), and here is a follow-up on this subject. GSoC announcements are up and Mahout got 5 projects accepted this year. Check the full list of GSoC projects and the Mahout community’s reactions on the mailing list.

In the past we have reported on the idea of Mahout/Solr integration, and it seems this is now being realized. Read more on the features and implementation progress of this integration here.

At the beginning of April, there was a proposal to make collections releases independent of the rest of Mahout. After some transition period of loosening the coupling between mahout-collections and the rest of Mahout, mahout-collections was extracted as an independent component.  The vote on the first independent release of collections is in progress. Mahout-collections-1.0 differs from the version released with Mahout 0.3 only by the removed dependency on slf4j.

The Mahout parts that use Lucene were updated to use the latest Lucene release, 3.0.1; the code changes for this migration can be found in this patch.

There was a question that generated a good discussion about cosine similarity between two items and how it is implemented in Mahout. More on cosine similarity, which is implemented as PearsonCorrelationSimilarity (source) in the Mahout code, can be found in MAHOUT-387.
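
The relationship behind that implementation detail, in LaTeX notation – Pearson correlation is cosine similarity computed on mean-centered vectors:

  r(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}

Drop the centering terms \bar{x} and \bar{y} and you are left with the plain (uncentered) cosine formula shown in the May Digest above.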

The goal of clustering is to determine the grouping in a set of data (e.g., a corpus of items, a (search) result set, etc.).  Often, the problem in clustering implementations is that the data to be clustered has a high number of dimensions, which tend to need reducing (to a smaller number of dimensions) in order to make clustering computationally more feasible (read: faster).  The simplest way of reducing those dimensions is to use some sort of a ‘mapping function’ that takes data presented in n-dimensional space and transforms it to data presented in m-dimensional space, where m is lower than n, hence the reduction. Of course, the key here is to preserve the variance of important data features (dimensions) and flatten out unimportant ones. One simple and interesting approach to clustering is the use of several independent hash functions where the probability of collision of similar items is higher. This approach, called MinHash-based clustering, was proposed back in March as part of GSoC (see the proposal).  You’ll find more on the theory behind it (sketched just below) and a Mahout-applicable implementation in MAHOUT-344.
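
The theory in one formula: for a random hash function h and item sets A and B, the probability that the two sets agree on their minimum hash value is exactly their Jaccard similarity,

  \Pr\Big[\min_{a \in A} h(a) = \min_{b \in B} h(b)\Big] = \frac{|A \cap B|}{|A \cup B|} = J(A, B)

so a signature of k independent hash minima estimates J(A, B) by its agreement rate, and items that frequently land in the same buckets become cluster candidates.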

Those interested in Neural Networks should keep an eye on MAHOUT-383, where Neuroph (a lightweight neural net library) and Mahout will get integrated.

This concludes our short April Mahout Digest.  Once Mahout completes its transition to TLP, we expect the project to flourish even more.  We’ll be sure to report on the progress and new developments later this month.  If you are a Twitter user, you can follow @sematext on Twitter.

Nutch Digest, April 2010

In the first part of this Nutch Digest we’ll go through the new and useful features of the upcoming Nutch 1.1 release, while in the second part we’ll focus on developments and plans for the next big Nutch milestone, Nutch 2.0. But let’s start with a few informational items.

  • Nutch has been approved by the ASF board to become a Top Level Project (TLP) in the Apache Software Foundation.  The changing of Nutch mailing lists, URL, etc. will start soon.

Nutch 1.1 will be officially released any day now, so here is a walk-through of the Nutch 1.1 release features:

  • Nutch release 1.1 uses Tika 0.7 for parsing and MimeType detection
  • Hadoop 0.20.2 is used for job distribution (Map/Reduce) and distributed file system (HDFS)
  • On the indexing and search side, Nutch 1.1 uses either Lucene 3.0.1 with its own search application, or Solr 1.4
  • Some of the new features included in release 1.1 were discussed in a previous Nutch Digest. For example, the alternative generator, which can generate several segments in one pass over the crawlDB, is included in release 1.1. We used a flavour of this patch in our most recent Nutch engagement that involved a super-duper vertical crawl.  Also, the improvement to SOLRIndexer, which now commits only once when all reducers have finished, is included in Nutch 1.1.
  • Some of the new and very useful features were not mentioned before. For example, Fetcher2 (now renamed to just Fetcher) was changed to implement Hadoop’s Tool interface. With this change it is possible to override parameters from configuration files, like nutch-site.xml or hadoop-site.xml, on the command line.
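
To see why that works, here is the generic shape of a Hadoop Tool (an illustration of the interface, not Nutch’s actual Fetcher code); ToolRunner parses -D key=value arguments and merges them into the job Configuration before run() executes:

  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class ToolSketch extends Configured implements Tool {
    public int run(String[] args) throws Exception {
      // getConf() already reflects any -D overrides from the command line
      System.out.println("fetcher.threads.fetch = "
          + getConf().get("fetcher.threads.fetch"));
      return 0;
    }

    public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new ToolSketch(), args));
    }
  }

So running, say, the fetch job with -D fetcher.threads.fetch=20 makes the command-line value win over nutch-site.xml.
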
  • If you’ve done some focused or vertical crawling, you probably know that one or a few unresponsive hosts can slow down an entire fetch, so one very useful feature added to Nutch 1.1 is the ability to skip queues (which can be translated to hosts) for URLs getting repeated exceptions.  We made good use of that here at Sematext, in the Nutch project we just completed in April 2010.
  • Another improvement included in the 1.1 release, related to Nutch-Solr integration, comes in the form of an improved Solr schema that allows field mapping from Nutch to the Solr index.
  • One useful addition to Nutch’s injector is new functionality which allows users to inject metadata into the CrawlDB. Sometimes you need additional data, related to each URL, to be stored. Such external knowledge can later be used (e.g., indexed) by a custom plug-in. If we can all agree that storing arbitrary data in the CrawlDb (with URL as a primary key) can be very useful, then migration to database-oriented storage (like HBase) is only a logical step.  This makes a good segue to the second part of this Digest…

In the second half of this Digest we’ll focus on the future of Nutch, starting with Nutch 2.0.  Plans and ideas for the next Nutch release can be found on the mailing list under Nutch 2.0 roadmap and on the official wiki page.

Nutch is slowly replacing some of its home-grown functionality with best-of-breed products — it uses Tika for parsing, Solr for indexing/searching, and HBase for storing various types of data.  The migration to Tika is already included in the Nutch 1.1 release, and exclusive use of Solr as the (enterprise) search engine makes sense — for months we have been telling clients and friends we predict Nutch will deprecate its own Lucene-based search web application in favour of Solr, and that time has finally come.  Solr offers much more functionality, configurability, performance and ease of integration than Nutch’s simple search web application.  We are happy Solr users ourselves – we use it to power search-lucene.com.

Storing data in HBase instead of directly in HDFS has all of the usual benefits of storing data in a database instead of a file system.  Structured (fetched and parsed) data is not split into segments (in file system directories), so data can be accessed easily and time-consuming segment merges can be avoided, among other things.  As a matter of fact, we are about to engage in a project that involves this exact functionality: the marriage of Nutch and HBase.  Naturally, we are hoping we can contribute this work back to Nutch, possibly through NUTCH-650.

Of course, when you add a persistence layer to an application, there is always the question of whether it is acceptable for it to be tied to one back-end (database) or whether it is better to have an ORM layer on top of the datastore. Such an ORM layer would be an additional layer allowing different backends to be used to store data.  And guess what? Such an ORM, initially focused on HBase and Nutch, and then on Cassandra and other column-oriented databases, is in the works already!  Check the evaluation of ORM frameworks which support non-relational column-oriented datastores and RDBMSs, and the development of an ORM framework that, while initially using Nutch as the guinea pig, already lives its own decoupled life over at http://github.com/enis/gora.

That’s all from us on Nutch’s present and future for this month; stay tuned for more Nutch news next month! And of course, as usual, feel free to leave any comments or questions – we appreciate any and all feedback.  You can also follow @sematext on Twitter.

Solr Digest, April 2010

Another month is almost over, so it is time for our regular monthly Solr Digest. This time we’ll focus on interesting JIRA issues, so let’s start:

  • Issue SOLR-1860 intends to improve stopword list handling in Solr, based on the recent addition of stopword lists to all of Lucene’s language analyzers. The work hasn’t started just yet (there are no patches to try), so we’ll need to be patient before actually using it.
  • Ever had problems with HTTP authentication in a distributed Solr environment? Previously, it worked only when querying a single Solr server. Now JIRA issue SOLR-1861 solves such problems and allows specification of credentials for each shard, while in the absence of credential info it falls back to the default functionality (no credentials). The patch is already attached to the issue and it can be used with Solr 1.4.
  • If you have used Solr’s MoreLikeThisComponent, you may have noticed that its output lacks any info explaining why it recommended some item. The patch in issue SOLR-860 deals with that and improves the MLT Component by adding debug info, like this (copied from JIRA):

"debug":{
"moreLikeThis":{
"IW-02":{
"rawMLTQuery":"",
"boostedMLTQuery":"",
"realMLTQuery":"+() -id:IW-02"},
"SOLR1000":{
"rawMLTQuery":"",
"boostedMLTQuery":"",
"realMLTQuery":"+() -id:SOLR1000"},
"F8V7067-APL-KIT":{
"rawMLTQuery":"",
"boostedMLTQuery":"",
"realMLTQuery":"+() -id:F8V7067-APL-KIT"},
"MA147LL/A":{
"rawMLTQuery":"features:2 features:0 features:lcd features:x features:3",
"boostedMLTQuery":"features:2 features:0 features:lcd features:x features:3",
"realMLTQuery":"+(features:2 features:0 features:lcd features:x features:3) -id:MA147LL/A"}},

This issue is marked to be included in Solr 3.1.

  • If you ever got a requirement like “some users should be able to access these documents while being forbidden from accessing others”, Solr wasn’t able to help you much. Recently, document-level security has been the subject of 2 JIRA issues. In SOLR-1834 you can find a patch which is already running in a production environment, while another approach to the same problem (also with an attached patch) is presented in SOLR-1872 (the latter currently adds security only on select queries; delete is not supported yet).
  • SolrCloud brings exciting new capabilities to Solr, some of them already mentioned in our Solr Digest posts (for instance, check the Solr Digest, January 2010). SolrCloud functionality is getting committed to trunk; you can monitor the progress in SOLR-1873.  This is big!
  • When working with Solr, you have to explicitly configure Solr to take care of lowercasing indexed tokens and query strings, so that uppercase versions of words match their lowercase versions (for instance, so the query Sematext matches SEMATEXT, sematext and Sematext). However, there is one old JIRA issue, SOLR-219, designated to be fixed in Solr 1.5, which would automatically make Solr smart enough for searches to be case-insensitive.
  • One common source of confusion for first-time Solr users is dismax and its relation to the default query operator defined in schema.xml. In reality, the default query operator has no effect on how dismax works. Also, with dismax you can’t use the AND and OR operators directly, but you can achieve such functionality by using dismax’s mm (minimum should match) parameter. The default value for it is 100% (meaning that all clauses must match, which is equal to using the AND operator between all clauses). If you want OR-operator functionality, you would just set its value to 1 (meaning one matching clause is enough); see the sketch below. The confusion with the default operator arises from the fact that if your default query operator in schema.xml is OR, dismax would still by default behave as if it were AND. Issue SOLR-1889 should deal with that and assign the default mm value for dismax depending on the default query operator from schema.xml, which will make Solr behave more consistently for new users.
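
A small SolrJ sketch of the two mm extremes (the field names are illustrative):

  import org.apache.solr.client.solrj.SolrQuery;

  public class DismaxMmSketch {
    public static void main(String[] args) {
      // mm=100%: every clause must match - AND-like behavior
      SolrQuery andLike = new SolrQuery("solr digest monthly");
      andLike.set("defType", "dismax");
      andLike.set("qf", "title body");
      andLike.set("mm", "100%");

      // mm=1: a single matching clause suffices - OR-like behavior
      SolrQuery orLike = new SolrQuery("solr digest monthly");
      orLike.set("defType", "dismax");
      orLike.set("qf", "title body");
      orLike.set("mm", "1");

      System.out.println(andLike + "\n" + orLike);
    }
  }
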
  • Another old JIRA issue, SOLR-571, got its first patch a few days ago. This patch allows autowarmCount values to be specified as percentages of the cache size (for instance, 50% would mean that only the top half of cached queries is autowarmed) instead of as an absolute amount.
  • Solr 1.4 introduced the ClusteringComponent, which can cluster search results and documents. Through plugins, it allows any clustering engine to be implemented. One such engine was recently unveiled: lsa4solr, which is based on Latent Semantic Analysis. This engine depends on a development version of Solr 1.3 and Clojure 1.2, so take a look if you’re interested in clustering.
  • And last, but not least, for all Solr enthusiasts, an interesting webinar is scheduled for April 29th: “Practical Search with Solr: Beyond just looking it up”. You can find more about it here.

Remember, you can also follow us on Twitter: @sematext.  Until next month!
