Solr Digest, October 2010

Another busy month is behind us.  There were plenty of interesting topics, so let’s get started:

Already committed functionality

Interesting functionality in development

  • Faceting is heavily used functionality, but occasionally people find a form of faceting they’re missing. Hierarchical faceting is one such feature. It has been in development for a very long time but, despite a few posted patches, is still not part of the Solr distribution, and it hasn’t seen much activity lately. There is another, similar issue – Pivot (aka Decision Tree) Faceting Component – which should come to life as a separate component. However, there is a renewed effort to make it usable, so eventually we’ll see expanded faceting support in Solr.

Interesting new functionality

  • Extending SchemaField with custom attributes is being dealt with in the issue Custom SchemaField object.
  • Improving search relevance is always a big issue (and represents a good part of what Sematext does in client engagements), no matter how good Solr’s and Lucene’s out-of-the-box relevance is.  One very useful addition to our search relevance arsenal could come from the Anti-phrasing feature. The idea is that some word sequences in a query are irrelevant to its meaning (like “Where can I find” or “Where is”) and could/should be ignored while searching the index. This JIRA issue is still very fresh, so don’t hold your breath waiting for the implementation to become available next week, although we are bound to see this feature in one of the future Solr releases.
  • If you often work with financial data, you might find the patches from the Money FieldType issue useful. The new field type will support point and range queries, sorting, and exchange rates.
  • Lucene’s ICUTokenizer is useful for multilingual tokenizing, but until recently there was no support for it in Solr. The issue Provide Solr FilterFactory for Lucene ICUTokenizer will provide a factory that enables its use from Solr. Bingo! The patch already exists, so you can try it now – see the sketch below for what the tokenizer itself does. Additional new functionality will be added over time.  If you need multilingual support in Solr, have a look at Sematext‘s popular Multilingual Indexer.
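To get a feel for what the tokenizer itself does before the Solr factory lands, here is a minimal sketch against the Lucene analysis module. The package and attribute names are taken from the trunk of the time, so treat them as assumptions and verify them against your checkout:

import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ICUTokenizerSketch {
  public static void main(String[] args) throws Exception {
    // ICUTokenizer segments text according to Unicode rules (UAX #29),
    // so scripts that don't separate words with spaces (Thai, CJK, ...)
    // come out as proper tokens instead of one long run of characters.
    ICUTokenizer tokenizer =
        new ICUTokenizer(new StringReader("ภาษาไทย mixed with English"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}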

Miscellaneous

  • One of the favorite topics, which we also cover frequently, is the ongoing confusion about Solr versions. October didn’t disappoint: the topic was discussed on the mailing lists again. Here is one such thread – Which version of Solr to use?. Let us summarize the key parts. Solr 1.5 will probably never be released.  The branch_3x is a stable branch from which the next release, Solr 3.1, will likely be cut.  The trunk contains a relatively stable, but still development, version of what will one day become Solr 4.0.
  • If you provide faceting functionality in your application, here is a small (but interesting) discussion that might give you a few ideas about how to optimize it – Faceting and first letter of fields.
  • It appears that Solr has problems running on Tomcat 7. These problems are not tied to a particular version of Solr – they affect all versions. To learn more, start with Problems running on tomcat and SOLR-2022.
  • Replication between a Solr master and slave running different versions of Solr is broken, as you can see in the issue Cross-version replication broken by new javabin format. The cause is the new javabin format, so in setups like the one described in the issue (master on 1.4.1, slave on 3x) you’ll encounter problems. Keep that in mind if you plan cross-version replication for some reason.

These were the most interesting highlights for the month of October. Thank you for reading Sematext Blog and following @sematext on Twitter.

Mahout Digest, October 2010

We’ve been very busy here at Sematext, so we haven’t covered Mahout during the last few months.  We are pleased with what’s been keeping us busy, but not happy about our irregular Mahout Digests.  We covered the last (0.3) release with all of its features, and we are not going to miss covering a very important milestone for Mahout: release 0.4 is out! In this digest we’ll summarize the most important changes in Mahout since the last digest and add some perspective.

Before we dive into Mahout, please note that we are looking for people with Machine Learning skills and Mahout experience (as well as good Lucene/Solr search people).  See our Hiring Search and Data Analytics Engineers post.

This Mahout release brings overall changes regarding model refactoring and the command-line interface, aimed at improving integration and consistency (easier access to Mahout operations via the command line). The command-line interface is now pretty much standardized for working with all the various options, which makes it easier to run and use. Interfaces are better and more consistent across algorithms, and there have been many small fixes, improvements, refactorings, and clean-ups. Details on what’s included can be found in the release notes, and the download is available from the Apache Mirrors.

Now let’s add some context to various changes and new features.

GSoC projects

Mahout wrapped up its Google Summer of Code projects, and two completed successfully:

  • EigenCuts spectral clustering implementation on Map-Reduce for Apache Mahout (addresses issue MAHOUT-328), proposal and implementation details can be found in MAHOUT-363
  • Hidden Markov Models based sequence classification (proposal for a summer-term university project), proposal and implementation details in MAHOUT-396

Two projects did not complete due to lack of student participation and one remains in progress.

Clustering

The biggest additions in the clustering department are the EigenCuts clustering algorithm (a GSoC project) and MinHash-based clustering, which we covered as a possible GSoC suggestion in one of our previous digests. MinHash clustering was implemented, but not as a GSoC project. In the first digest of the Mahout series we covered problems related to the evaluation of clustering results (an unsupervised learning issue), so another big addition to Mahout’s clustering are the Cluster Evaluation Tools, featuring the new ClusterEvaluator (uses Mahout In Action code for inter-cluster density and similar code for intra-cluster density over a set of representative points, not the entire clustered data set) and CDbwEvaluator, which offers new ways to evaluate clustering effectiveness.

Logistic Regression

Online learning capabilities such as a Stochastic Gradient Descent (SGD) algorithm implementation are now part of Mahout. Logistic regression is a model used for predicting the probability of occurrence of an event. It makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person will have a heart attack within a specified time period might be predicted from knowledge of the person’s age, sex, and body mass index. Logistic regression is used extensively in the medical and social sciences, as well as in marketing applications such as predicting a customer’s propensity to purchase a product or cancel a subscription. The Mahout implementation uses Stochastic Gradient Descent (SGD); see the initial request and development in MAHOUT-228. The new sequential logistic regression training framework supports a feature-vector encoding framework for high-speed vectorization without a pre-built dictionary. You can find more details on Mahout’s logistic regression wiki page.
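To make this concrete, here is a minimal sketch against the 0.4-era SGD API. The feature values below are invented for illustration, and real use would typically go through the new feature-vector encoders rather than hand-built vectors:

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class SgdSketch {
  public static void main(String[] args) {
    // Two categories (heart attack within the period: yes/no) and
    // three predictor variables: age, sex, body mass index.
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, 3, new L1());

    // Hypothetical training examples: {age, sex, bmi} -> outcome.
    Vector patientA = new DenseVector(new double[] {63, 1, 31.2});
    Vector patientB = new DenseVector(new double[] {25, 0, 22.4});
    learner.train(1, patientA); // had an event
    learner.train(0, patientB); // did not

    // For a two-category model, classifyScalar() returns the learned
    // probability of category 1 -- the "probability of occurrence".
    System.out.println(learner.classifyScalar(patientA));
  }
}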

Math

There has been a lot of cleanup in the math module (you can check the details in the Cleanup Math discussion on the ML), lots of it related to the untested Colt framework integration (and deprecated code in the Colt framework). The discussion resulted in several pieces of the Colt framework being promoted to tested status (QRDecomposition, in particular).

Classification

In addition to speedups and bug fixes, the main new features in classification are new classifiers (new classification algorithms) and more open/uniform input data formats (vectors). The most important changes are:

  • New SGD classifier
  • Experimental new type of Naive Bayes classifier (using vectors) and feature reduction options for the existing Naive Bayes classifier (variable length coding of vectors)
  • New VectorModelClassifier allows any set of clusters to be used for classification (clustering as input for classification)
  • Random forests can now be saved and used to classify new data. Read more on how to build a random forest and how to use it to classify new cases on this dedicated wiki page.

Recommendation Engine

The most important changes in this area are related to distributed similarity computations, which can be used in Collaborative Filtering (or other areas such as clustering). An implementation of a Map-Reduce job that computes item-item similarities for item-based Collaborative Filtering, based on the algorithm suggested in Elsayed et al.’s Pairwise Document Similarity in Large Collections with MapReduce, can be found in MAHOUT-362. A generalization of this algorithm, driven by the mailing list discussion, led to a Map-Reduce job which computes pairwise similarities of the rows of a matrix using a customizable similarity measure (with implementations already provided for Cooccurrence, Euclidean Distance, Loglikelihood, Pearson Correlation, Tanimoto coefficient, and Cosine). More on the distributed version of any item similarity function (previously available only in a non-distributed implementation) can be found in MAHOUT-393. With pairwise similarity computation defined, RecommenderJob has evolved into a fully distributed item-based recommender (the implementation depends on how the pairwise similarities are computed). You can read more on the distributed item-based recommender in MAHOUT-420.
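As a rough idea of how the pairwise row-similarity job is driven, here is a sketch. The class lives in Mahout’s math.hadoop.similarity package, but the flag names and the SIMILARITY_COSINE identifier are our assumptions, so consult the job’s --help output before relying on them:

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.similarity.RowSimilarityJob;

public class PairwiseSimilaritySketch {
  public static void main(String[] args) throws Exception {
    // Computes pairwise similarities between the rows of a distributed
    // matrix, e.g. item vectors for item-based Collaborative Filtering.
    // Flag names and the similarity identifier are assumptions.
    ToolRunner.run(new RowSimilarityJob(), new String[] {
        "--input", "/mahout/item-vectors",
        "--output", "/mahout/item-similarities",
        "--numberOfColumns", "100000",
        "--similarityClassname", "SIMILARITY_COSINE",
        "--maxSimilaritiesPerRow", "100"
    });
  }
}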

Implementations of distributed operations on very large matrices are very important for a scalable machine learning library that supports large data sets. For example, when term vectors are built from textual documents/content, they tend to have high dimensionality. If we consider a term-document matrix where each row represents the terms from a document and each column represents a document, we obviously end up with a high-dimensional matrix. The same thing occurs in Collaborative Filtering: it uses a user-item matrix containing ratings as matrix values, where each row corresponds to a user and each column represents an item. Again, we have a large, sparse matrix.

Now, in both cases (term-document matrix and user-item matrix) we are dealing with high matrix dimensionality which needs to be reduced while preserving most of the information as well as possible. Obviously, we need some sort of matrix operation which produces a lower-dimensional matrix with the important information preserved. For example, a large matrix may be approximated in lower dimensions using Singular Value Decomposition (SVD).
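In LaTeX terms, the truncated SVD keeps only the $k$ largest singular values of the $m \times n$ matrix $A$, which gives the best rank-$k$ approximation in the Frobenius-norm sense:

A \;\approx\; A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad A_k = \operatorname*{arg\,min}_{\operatorname{rank}(B) \le k} \lVert A - B \rVert_F

Each row (a term or a user) is then represented by $k$ numbers instead of the original $n$, with the dominant structure of the data preserved.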

It’s obvious that we need a (Java) matrix framework capable of fundamental matrix decompositions. JAMA is a great example of a widely used linear algebra package for matrix operations, capable of SVD and other fundamental matrix decompositions (WEKA, for example, uses JAMA for its matrix operations). Operations on high-dimensional matrices always require heavy computation, and this puts high hardware requirements on any ML production system. This is where Mahout, which features distributed operations on large matrices, should be the production choice for Machine Learning algorithms over frameworks like JAMA, which, although great, cannot distribute its operations.

In a typical recommendation setup, users often ‘have’ (have used or interacted with) only a few items from the whole item set (which can be very large), which makes user-item matrices sparse. Mahout’s (0.4) distributed Lanczos SVD implementation is particularly useful for finding decompositions of very large sparse matrices.
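For the curious, the distributed solver is driven through Hadoop’s ToolRunner as well. A sketch along these lines, where the flag names are assumptions on our part and should be checked against the solver’s --help output for your Mahout version:

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver;

public class LanczosSvdSketch {
  public static void main(String[] args) throws Exception {
    // Finds the top singular vectors of a large, sparse, distributed
    // user-item matrix; paths, sizes, and flag names are made up.
    ToolRunner.run(new DistributedLanczosSolver().job(), new String[] {
        "--input", "/mahout/user-item-matrix",
        "--output", "/mahout/svd",
        "--numRows", "1000000",
        "--numCols", "500000",
        "--rank", "100"   // number of singular vectors to compute
    });
  }
}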

News and Roadmap

All of the new distributed similarity/recommender implementations we analyzed in the previous paragraphs were contributed by Sebastian Schelter, and in recognition of this important work he was elected a new Mahout committer.

The book “Mahout in Action”, published by Manning, has reached 15/16 chapters complete and will soon enter final review.

This is all from us for now.  Any comments/questions/suggestions are more than welcome, and until the next Mahout Digest keep an eye on Mahout’s roadmap for 0.5 or the discussion about what Mahout is missing to become a production-stable (1.0) framework.  We’ll see you next month – @sematext.

Solr Digest, September 2010

It is a busy time of year here at Sematext – we have 3 different presentations to prepare for 3 different conferences (2 down, 1 more to go!), so we’re a bit late with our digests. Nevertheless, we managed to compile a list of interesting topics in the Solr world:

Already committed functionality

  • Solr was upgraded to use Tika 0.7 (SOLR-1819) – the fix was applied to the 1.4.2, 3.1, and 4.0 versions.  Of course, Tika 0.8 is going to happen in the not-too-distant future.
  • If you’re still using the old rsync-based replication and need to throttle the transfer rate, have a look at a patch contributed in JIRA issue SOLR-2099. Unfortunately, if you’re using the 1.4 Java-based replication, there is currently no way to throttle replication.
  • If you are using the new spatial capabilities in Solr, you might have noticed some incorrect calculations. One of them is now fixed – Spatial filter is not accurate – on the 3.1 and 4.0 branches.
  • Another minor but useful addition – function queries can now be defined in terms of other request parameters. Check the JIRA issue “full parameter dereferencing for function queries”. It is already implemented in 3.1 and 4.0 and ready to be used. Here is a short example from JIRA (check how the add function is defined, and note the v1 and v2 request parameters):

http://localhost:8983/solr/select?defType=func&fl=id,score&q=add($v1,$v2)&v1=mul(2,3)&v2=10

Can we say, Solr Calculator, eh?
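For those who prefer SolrJ to hand-built URLs, the same request might look roughly like the sketch below (assuming a stock SolrJ client; the parameter names simply mirror the URL above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FuncQuerySketch {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("add($v1,$v2)");
    q.set("defType", "func");
    q.set("fl", "id,score");
    // $v1 and $v2 are dereferenced from these request parameters.
    q.set("v1", "mul(2,3)");
    q.set("v2", "10");
    QueryResponse rsp = solr.query(q);
    // Every document scores add(mul(2,3),10) = 16.
    System.out.println(rsp.getResults().getMaxScore());
  }
}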

Interesting functionalities in development for some time

  • Ever wanted to add some custom fields to a response, even though they were not stored in your Solr index? You could always create a custom response writer which would add those fields (although it would probably be a “dirty” copy of some existing Solr response writer). However, we all know that this doesn’t sound like the right way to do it.  One JIRA issue might deliver a correct way some day – Allow components to add fields to outgoing documents. We say “some day“, since this functionality has been in development for quite some time now and, although it has some patches (currently non-functional, it seems), it is probably not very near completion.  But it will be handy to have once it’s done.

Interesting new functionalities

  • Highlighter could get one frequently requested improvement – Highlighter fragement/formatter for returning just the matching terms – we believe this will be a useful addition, although we don’t expect it very soon.
  • One potentially useful feature for all of you who use HDFS – DIH should be able to read data directly from HDFS for indexing.  This issue already contains some working code, although it is an open question whether the fix will become a part of the standard Solr distribution.  Still, if you’re using Solr 1.4.1 and have data in HDFS that you want to index with Solr, have a look at this contribution.
  • Another improvement related to replication is in SOLR-2117 – Allow slaves to replicate at different times. This should be useful to anyone who has long (and therefore heavy) warmup periods on their slaves after replication. This way, you can have your slaves replicate at different times and, at replication time, just take the replicating slave offline (to avoid degradation of response times). Be careful, though, as there is a downside: for some (limited, but still) time, your slaves will serve different data. A patch is available for the 4.0 version.

Miscellaneous

So, we had a little bit of everything from Solr this month. Until late October (or the start of November), when the new Solr Digest arrives, stay tuned to @sematext, where we tweet other interesting stuff on a wider set of topics from time to time.

Solr Digest, August 2010

August brought a lot of activity to the Solr world. There were many important developments, so we again compiled the most interesting ones for you, grouped into 4 categories:

Some new (and already committed) features

  • We already wrote about the new work done on CollapsingComponent in June’s digest under SOLR-1682. A lot of work has been done on this component, and it appears to be very close to being committed. The patches attached to the issue are functional, so you can give it a try.
  • SpellCheckComponent got an improvement related to recent Lucene changes – Add support for specifying Spelling SuggestWord Comparator to Lucene spell checkers for SpellCheckComponent. Issue SOLR-2053 is already fixed; the patch is attached if you need it, and it is also committed to trunk and the 3_x branch.
  • Another minor feature is an improvement to WordDelimiterFilter in SOLR-2059 – Allow customizing how WordDelimiterFilter tokenizes text. The patch is already there and committed to trunk and 3_x.
  • A performance boost for faceting can be found in SOLR-2089 – Faceting: order term ords before converting to values. Behind this intimidating title hides a very decent speedup in cases where facet.limit is high. A patch is available, and trunk and branch_3x also got this magic committed.

Some new features being discussed and implemented

  • One very important (and probably much wanted) feature just got its Jira issue – SOLR-2080 – Create a Related Search Component. The issue was created by Grant Ingersoll, so we can expect some quality work to be done here. There are no patches (or even discussions) yet, as the issue is in its infancy, but you can watch its progress in Jira. In the meantime, if you’re interested in such functionality, you can check Sematext’s RelatedSearches product.
  • Jira issue SOLR-2026 – Need infrastructure support in Solr for requests that perform multiple sequential queries – might add some interesting capabilities to search components, especially if you’re writing some of them on your own. We at Sematext have plenty of experience writing custom Solr components (check, for instance, our DYM ReSearcher or its Relaxer sibling), so we know that it is sometimes not a very pleasant task. If Solr gets better support for executing multiple queries during a single request, writing custom components will become easier. One patch is already posted to this issue, so you can check it out; however, it is still unclear how this feature will evolve. We’re hoping for a flexible and comprehensive solution that would be easily extensible to many other features.
  • Defining QueryComponent’s default query parser can be made configurable with the patch attached to issue SOLR-2031. You probably haven’t encountered many cases where you needed this functionality, but if you did, you had a problem before, and now that problem will become history.
  • It appears that QueryElevationComponent might get an improvement: Distinguish Editorial Results from “normal” results in the QueryElevationComponent. Jira issue SOLR-2037 will be the place to watch the progress.

Some newly found bugs

  • DataImportHandler has a bug – Multivalued fields with dynamic names does not work properly with DIH – the fix isn’t available yet, but if you have such problems, you can check the status here.
  • Another bug in DataImportHandler points to a connection-leak issue – DIH doesn’t release JDBC connections in conjunction with DB2. There is no fix at the moment but, as usual, you can check the status in Jira.

Other interesting news

  • One potentially useful tool we recommend checking out is SolrMeter. It is a standalone tool for stress testing your Solr. From their site: The main goal of this open source project is to bring to the solr user community a “generic tool to interact specifically with solr”, firing queries and adding documents to make sure that your Solr implementation will support the real use. With SolrMeter you can simulate your work load over solr index and retrieve statistics graphically.
  • In which IDEs do you work with Solr/Lucene? Here at Sematext, we use both Eclipse and IntelliJ IDEA. If you use the latter and want to set up Lucene or Solr in it, check out the very useful description and patch in LUCENE-2611 – IntelliJ IDEA setup.

We hope you enjoyed another Solr Digest from @sematext.  Come back and read us next month!

Solr Digest, July 2010

As usual, July was one of the slower months in the Solr world; however, we managed to find a few interesting topics for our readers.

  • An interesting feature might be added with SOLR-1979 – Create LanguageIdentifierUpdateProcessor. It would provide the ability to handle text differently in different languages (think stemming in analysis, for instance) and to do it automatically. This issue was just created, so the work on it and any usable patches are still some time away. However, if you need something working now, Sematext has a few products with similar multilingual functionality, for instance, Multilingual Indexer or its cousin Language Identifier.
  • Another interesting feature might come with SOLR-1980 – Implement boundary match support. This will let one specify that a query should match only at the start or at the end of a field (or be an exact match), not somewhere in the middle, which could provide more relevant search results in some specific cases. This issue is also in its infancy and has no patches yet, so we’ll have to wait and see how it progresses.
  • Ever wanted Solr to store, as the value of some field, something other than the raw input value (remember: when you search Solr, you search analyzed and indexed values; when you fetch the content of some field, you get the raw input value added to that field, not its analyzed version)? A patch for that already exists in one rather fresh JIRA issue – SOLR-1997 – Store internal value instead of input one.
  • Getting ready to start using Solr, but unsure about which version to use? Don’t worry, confusion about Solr’s versions started this spring (see the Solr May 2010 Digest), but things have stabilized lately. The latest release is the fairly recent 1.4.1, which is basically the 1.4 version with many bugfixes. The next release will be 3.1, which can be found on the branch_3x branch. You can find its nightly build versions here. The trunk is still used for “unstable” development and the future 4.0 version. For more information, check these recent threads on the Solr mailing list: here and here.
  • Many will probably agree that Solr’s SpellCheckComponent isn’t very useful in real-life applications. One of the main problems is that it handles multi-word queries poorly: it creates its suggestion as a collated version of the best suggestion for each word of the query, so you often get suggestions which have 0 hits. Also, it doesn’t return important information about the suggested query, like how many hits that query would generate and what results it would give. Some of these issues could be fixed some day with SOLR-2010 – Improvements to SpellCheckComponent Collate functionality. The first version of the patch is already provided. However, if you’d like to use such functionality in your Solr production today, you might consider a much more sophisticated and production-ready component developed by Sematext – DYM ReSearcher – which you can see in action on Search-Lucene.com, for example.
  • One minor piece of functionality was added to QueryElevationComponent – Add option to return only the specified results. It was added with JIRA issue SOLR-1966 and is already committed to 3.x and trunk.

We hope that this was enough to satisfy your Solr appetite.  Hopefully, we’ll dig up more interesting topics for you in August.  Until then you can keep up with us via @sematext on Twitter.

Solr Digest, June 2010

We have already written about news in the Solr world this month here and here, so you already know that Solr 1.4.1 was released, based on Lucene 2.9.3. Still, one thread from the mailing lists gives some more info about svn branches and how they relate to Solr versions.

Real-time indexing is again one of the hot topics. We already mentioned the Zoie plugin in the Solr March Digest, so this time we’ll point to an interesting discussion on the mailing lists. In case you followed this topic, the Zoie Solr Plugin is a great plugin for Solr, but it still has some limitations. For instance, the master-slave architecture (the basis of almost all big Solr deployments) isn’t well suited to Zoie. Version 2.9 of Lucene brought an interesting addition: Near Real-Time Search capabilities. As you probably already know, the Solr 1.4 release was already running on Lucene 2.9 (2.9.1, to be precise), but support for NRT wasn’t implemented. Solr’s next release might have it, since there is a JIRA issue dealing with NRT integration, but don’t hold your breath.
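For reference, the Lucene 2.9 NRT hook that Solr doesn’t yet take advantage of is IndexWriter.getReader(); a minimal sketch of what it buys you:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NrtSketch {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_29),
        IndexWriter.MaxFieldLength.UNLIMITED);

    Document doc = new Document();
    doc.add(new Field("title", "hello, near-real-time",
        Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);

    // getReader() hands back a near-real-time reader that already sees
    // the uncommitted document above, without a full commit/reopen.
    IndexReader reader = writer.getReader();
    System.out.println(reader.numDocs()); // prints 1
    reader.close();
    writer.close();
  }
}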

We’ll also mention some new functionalities in Solr:

  • Added relevancy function queries – JIRA issue SOLR-1932 adds function queries for relevancy factors such as tf, idf, etc. This issue is already fixed and committed to trunk.
  • Improved Solr response indentation – added with issue SOLR-1933. Solr previously supported only 7 levels of indentation; this issue fixes that. The downside is a small increase in response size (since blank spaces are used instead of tabs). The fix is committed not only to trunk, but also to the 3_x branch.
  • Ever wanted to see index files without logging into your servers? This patch will make them visible from the Solr admin pages or by using LukeRequestHandler.
  • Another related issue also got a patch and is already committed to the trunk – SOLR-1946 – misc enhancements to SystemInfoHandler. Here is a brief list of additions: include CWD in directory info, include raw-bytes versions of memory stats, and include a list of all system properties.

We’ll end with a short overview of interesting issues still in development:

  • Use Lucene’s Field Cache To Retrieve Stored Fields From Memory – issue SOLR-1961 isn’t finished yet, although there is a patch. When it is finished, it might give a new boost to the performance of your Solr server, thanks to developers from Cisco.
  • If you want to track performance improvements prepared for the 4.0 release, you can just follow JIRA issue SOLR-1965. Some stuff is already listed there, so you can go and check what is in store for future versions.
  • For anyone using PHP to talk to Solr, there is a new PHP Response Writer – currently, it is available as a Jar that has to be added to your Solr classpath. For more details, check the JIRA issue comments.
  • Field collapsing is one of the longest-unresolved issues in the Solr world. SOLR-236 (many people probably recognize this JIRA issue number :)) was created more than 3 years ago, and over time it has grown into a “monster” – a huge number of comments, patches, problems, parameters… you name it.  Integrating it with your Solr version was never fun (we tried it!). New hope appeared on the field-collapsing horizon with the opening of SOLR-1682 (that’s a new JIRA issue number for you to commit to memory!). Some work had already been done there in the past, but now Yonik has decided to dedicate some of his time to this issue, which means we might soon have a non-monster implementation that will be committed to Solr.

That’s all for this month. As you can see, in the Solr May Digest there was no mention of a new 1.4.1 release, but it happened, almost unexpectedly. So stay tuned (and follow @sematext) – you never know if something unexpected might happen this month, too…

Lucene Digest, May 2010

Last month we were busy with work and didn’t publish our monthly Lucene Digest.  To make up for it, this month’s Lucene Digest really covers all Lucene developments in May from A to Z.

  • Mark Harwood had a busy month.  In LUCENE-2454 he contributed production-tested and often-needed functionality for properly indexing parent-child entities (or, more generally, any form of hierarchical search).  He introduced his work in Adding another dimension to Lucene searches.  Joaquin Delgado has been talking about the merger of unstructured and structured search (not surprising, considering his old company, with its Lucene-based Federated Search product, was acquired by Oracle several years ago!), so he quickly related this to the ability to perform XQuery + Full-Text searches.  MarkLogic, watch your back! ;)
  • Mark also contributed a match spotter for all query types in LUCENE-1999.  This patch makes it possible to figure out which field(s) a match for a particular hit was on, which is functionality people ask about on the Lucene and Solr mailing lists every so often.  A warning, though: spotting the match and encoding that information causes some loss of score precision.
  • While Lucene already has TimeLimitedCollector, it’s not perfect and leaves room for improvement.  Back in 2009, Mark came up with TimeLimitedIndexReader, as you can tell from his messages in the Improving TimeLimitedCollector thread, and created a patch with it in LUCENE-1720, which filled some of TimeLimitedCollector’s gaps:
    • Any reader activity can be time-limited rather than just single searches, e.g. the document retrieval phase.
    • It times out faster (i.e. runaway queries such as fuzzies are detected quickly, before the last “collect” stage of query processing).
  • Robert Muir, who gave a well-received presentation on Finite State Queries in Lucene at New York Search & Discovery Meetup (see slides) back in April 2010, has been busy consolidating Lucene and Solr analyzers, tokenizers, token filters, character filters, etc. and moving them to their new home: modules/analysis, under Lucene root dir in svn.  The plan is to produce separate and standalone artifacts (read: jars) for this analysis module.  Here at Sematext we will make use of this module immediately for some of our products that currently list Lucene as a dependency, even though they really only need Lucene’s analyzers.  Solr, too, will be another customer for the new analysis module, as described by Robert in solr and analyzers module (yes, we’re showing off Search-Lucene.com’s in-document search-term highlighting, which we find very useful).
  • Robert also worked on and committed an ICU-based tokenizer for Unicode-based text segmentation in LUCENE-2414.  This translates to having the ability to properly tokenize text that doesn’t use spaces as token separators.  If you’ve ever had to deal with searching Chinese, for example, you’ll know that word segmentation is one of the initial challenges one has to deal with.
  • Talking about splitting on spaces, another task Robert took upon himself was to stop the Lucene QueryParser from splitting queries on whitespace: LUCENE-2458.  This problem of queries being tokenized on spaces comes up quite often, so this is going to be a very welcome improvement in Lucene.
  • One day Robert was super bored, so he decided to write a Lucene analyzer for Indonesian: LUCENE-2437.
  • Andrzej and Israel Ekpo (the author of one of the Solr PHP clients) both decided to add support for search-time bitwise operations on integer fields at around the same time.  Israel’s work is in LUCENE-2460, with an accompanying SOLR-1913 issue, while Andrzej’s is in SOLR-1918 and has no pure Lucene patch.  The difference is that Israel’s patch offers only filtering, while Andrzej’s patch performs scoring, which allows finding the best-matching inexact bit patterns. This has applications in e.g. near-duplicate detection.
  • In one of our current engagements we are working with a large, household-name organization and a big U.S. government contractor.  Their index is heavily sharded and is well over 2 TB.  Working with such large indices is no joke (though I’m happy to say we were able to immediately improve their search performance by 40% in the first performance-tuning iteration). What if we could make their indices smaller?  Would that make their search even faster?  Of course!  In LUCENE-1812 (nice number), Andrzej implemented a static index pruning tool that removes posting data from indices for terms with in-document frequency lower than some threshold.  We haven’t used this tool, and it looks like we may not use it for a while, because IBM apparently holds a patent on the exact algorithm used in this tool.
  • Phrase queries got a little performance boost in LUCENE-2410.  Every little bit counts!
  • Tom Burton-West created and contributed a handy tool that outputs total term frequency and document frequency from a Lucene index: LUCENE-2393.  This tool can be useful for estimating the sizes of some of the Lucene index files, and thus getting a better grasp on disk IO needs.
  • On both the Lucene and Solr lists we often see people asking about updating individual Document fields instead of simply deleting and re-adding the whole Document.  The delete-and-re-add approach is not necessarily a problem for Lucene/Solr, but it is for an external system from which the data for the complete new Document needs to be fetched.  Shai Erera, another recently added Lucene committer, proposed a new approach for incremental field updates that was well received.  Once implemented, this will be a big boon for Lucene and Solr!  If that thread or message is too long for you to read, let us at least highlight (pun intended) the two great use cases from this message.
  • Lucandra is a Cassandra backend for Lucene.  But no, it’s not a Lucene Directory implementation.  Lucandra has its own IndexReader and IndexWriter that read from Cassandra and write to it.  But in LUCENE-2456 we now have another option: a Cassandra-based Lucene Directory.  We hope to have a post on this in the near future!
  • The author of Cassandra-based Lucene Directory also opened LUCENE-2425 for Anti-Merging Multi-Directory Indexing Framework that splits an index (at run-time) into multiple sub-indices, based on a “split policy”, several of which have also been added to Lucene’s JIRA.  This is somewhat similar to Lucene’s ParallelWriter, but has some differences, as described in the issue.
  • Michael McCandless is working on prototyping a multi-stage pipeline sub-system that aims to further decouple analysis from indexing.  In this pipeline, indexing would be just one step, one stage in the pipeline.  Based on the work done so far, this may even bring some performance improvements.
  • LUCENE-2295 added a LimitTokenCountAnalyzer / LimitTokenCountFilter to wrap any other Analyzer and provide the same functionality that MaxFieldLength provided on IndexWriter.
  • Shay Banon, the author of Elastic Search, contributed LUCENE-2468 (can you complete this hard-to-figure-out numeric sequence?), which allows one to specify how Document deletions should be handled in CachingWrapperFilter and CachingSpanFilter.  We recently did work for another large organization and household name (in the U.S., at least) where we improved their Lucene-based search performance by over 30%.  One of the things we did was make good use of CachingWrapperFilter – see the sketch after this list.
  • LUCENE-2480 removes support for pre-Lucene 3.* indices from Lucene 4.*.  Thus, if you are still on Lucene 1.* or Lucene 2.*, we suggest moving to Lucene 3.* soon.  But, due to radical Lucene changes, even moving from Lucene 3.x to Lucene 4.0 won’t be as seamless as with previous Lucene upgrades.  Lucene 4.0 will include a Lucene 3.x to Lucene 4.0 migration tool: LUCENE-2441.
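Since CachingWrapperFilter came up above, here is a minimal sketch of the pattern (the field and term are made up; the point is that the wrapped filter’s DocIdSet is computed once per index segment and then reused across searches):

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class CachedFilterSketch {
  // Build the filter once and reuse it: the first search computes the
  // underlying DocIdSet, subsequent searches hit the cache.
  private static final Filter BOOKS_ONLY = new CachingWrapperFilter(
      new QueryWrapperFilter(new TermQuery(new Term("category", "books"))));

  public static TopDocs booksOnly(IndexSearcher searcher) throws IOException {
    return searcher.search(new MatchAllDocsQuery(), BOOKS_ONLY, 10);
  }
}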

That’s it for this month.  Remember that you can also follow @sematext on Twitter.
