Lucene & Solr Year 2011 in Review

The year 2011 is coming to an end and it’s time to reflect on the past 12 months. Without further fluff, let’s look back and summarize the significant events in the Lucene and Solr world over the past twelve months. In the next few paragraphs we’ll go over major changes in Lucene and Solr, new blood, relevant conferences and books.

We should start by pointing out that this year Apache Lucene celebrated its 10th anniversary as an Apache Software Foundation project. Lucene itself is actually over 10 years old. Otis is one of the very few people from the early years who is still around. While we didn’t celebrate any Solr anniversaries this year, we should note that Solr, too, has been around for quite a while and is in fact approaching its 6th year at the ASF!

This year saw numerous changes and additions both in Lucene and Solr. As a matter of fact, we’d venture to say we saw more changes in Lucene & Solr this year than in any previous year. In that sense, both projects are very much like wine – getting better with time. Let’s take a look at a few of the most significant changes in 2011.

The much anticipated Near Real-Time search (NRT) functionality has arrived. What this means is that documents that were just added to a Lucene/Solr index can be made visible in search results almost immediately. This is big! Of course, work on NRT is still in progress, but NRT is ready, and you, like a number of our clients, should start using it.

Field Collapsing was one of the most watched and voted-for JIRA issues for many months. This functionality was implemented this year, and Lucene and Solr users can now group results by a field or a query. In addition, you can control the groups and even compute facets on the groups rather than on individual documents. A rather powerful feature.
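As a sketch of what a grouped request looks like in Solr parameter form (the query and field values here are made up for illustration; `group`, `group.field` and `group.limit` are the grouping parameters):

```python
from urllib.parse import urlencode

# A hypothetical grouped query: "camera" and "manufacturer" are made-up
# values, but the group/group.field/group.limit parameters are Solr's
# result-grouping (field collapsing) parameters.
params = {
    "q": "camera",
    "group": "true",                # enable result grouping
    "group.field": "manufacturer",  # one group per manufacturer value
    "group.limit": "3",             # return the top 3 documents per group
}
print(urlencode(params))
```

Each group then comes back with its own top documents, instead of one flat result list.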

From the Lucene users’ perspective it is also worth noting that Lucene finally got a faceting module. Until now, faceting was available only in Solr. If you are a pure Lucene user, you no longer need Solr to calculate facets.

In the past, modeling parent-child relationships in Lucene and Solr indices was not really possible – one had to flatten everything. No longer – if you need to model a parent-child relationship in your index, you can use the Join contrib module. This Join functionality lets you join parent and child documents at query time, while relying on some assumptions about how documents were indexed.
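To make that concrete, here is a hedged sketch of what such a query-time join request might look like using the {!join} query parser syntax; the field names (parent_id, id) and the child query are hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical schema: child documents carry the parent's key in a
# "parent_id" field. The {!join} parser matches child documents and
# returns their parent documents instead.
child_query = "color:red"
q = "{!join from=parent_id to=id}" + child_query
print(q)                    # the local-params join query
print(urlencode({"q": q}))  # the same query, URL-encoded for a request
```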

Good and broad language support is hugely important for any search solution, and this year was good for Lucene and Solr in that department: the KStemFilter English stemmer was added, full Unicode 4 support was added, new Japanese and Chinese support was added, a new stemmer-protection mechanism was added, work was done on reducing the synonym filter’s RAM consumption, etc. Another big addition was the integration with Hunspell, which enables language-specific processing for all languages supported by OpenOffice. That’s a lot of new languages we can now handle with Lucene and Solr! And there is more.

Lucene 3.5.0 significantly reduced the term dictionary memory footprint. Big! Lucene now uses 3 to 5 times less memory when dealing with the term dictionary, so it is even less RAM-hungry.

If you use Lucene and need to page through a lot of results, you may run into problems. That’s why Lucene 3.5.0 introduced the searchAfter method, which solves the deep paging problem once and for all!
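The idea behind searchAfter can be sketched in plain Python (a conceptual illustration, not the Lucene API): instead of re-collecting and discarding the first N hits on every page, you remember the last hit of the previous page and only collect hits that sort after it:

```python
# Hits as (score, doc_id), already in descending-score, ascending-doc
# order, the way a search engine returns them.
hits = [(0.95, 1), (0.90, 2), (0.90, 3), (0.85, 4), (0.80, 5)]

def sorts_after(hit, last):
    # "after" in descending-score order, doc id as the tie-breaker
    return (-hit[0], hit[1]) > (-last[0], last[1])

def search_after(last_hit, page_size):
    # Offset paging would collect and skip everything before the page;
    # cursor paging only considers hits past the remembered last hit.
    candidates = hits if last_hit is None else [h for h in hits if sorts_after(h, last_hit)]
    return candidates[:page_size]

page1 = search_after(None, 2)        # [(0.95, 1), (0.90, 2)]
page2 = search_after(page1[-1], 2)   # [(0.90, 3), (0.85, 4)]
print(page1, page2)
```

The cost of fetching page N no longer grows with N, which is exactly what deep paging needs.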

There is also a new, fast and reliable Term Vector-based highlighter that both Lucene and Solr can use.

Dismax is great, but the Extended Dismax query parser added to Solr is even better – it extends the Dismax query parser’s functionality and can further improve the quality of search results.
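As a sketch, a typical edismax request might look like the following in parameter form; the field names and boosts are hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical edismax request: query several fields at once with
# per-field boosts and a multiplicative boost function.
params = {
    "defType": "edismax",
    "q": "apache lucene",
    "qf": "title^3 body",        # search these fields, boosting title matches
    "boost": "log(popularity)",  # multiply scores by a function of a field
}
print(urlencode(params))
```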

You can now also sort by function (imagine sorting results by distance from a point) and use the new spatial search with filtering.

Solr also got new suggest/autocomplete functionality based on an FST automaton, which significantly reduced the memory needed for such functionality. If you need this for your search application, have a look at Sematext’s AutoComplete – it has additional functionality that lots of our customers like.

While not yet officially released, the new transaction log support provides Solr with a real-time get operation – as soon as you add a document you can retrieve it by ID.  This will also be used for recovering nodes in SolrCloud.

And talking about SolrCloud… We’ve covered SolrCloud on this blog before in Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search, and we’ll be covering it again soon. In short, SolrCloud will make it easier to operate larger Solr clusters by making use of more modern design principles and software components, such as ZooKeeper, that make the creation of distributed, cluster-based software and services easier. Among its core features: there will be no single point of failure, any node will be able to handle any operation, there will be no traditional master-slave setup, cluster management and configuration will be centralized, failovers will be automatic, and in general things will be much more dynamic. SolrCloud has not been released yet, but Solr developers are working on it and the codebase is seeing good progress. We’ve used SolrCloud in a few recent engagements with our customers and were pleased by what we saw.

After the development of the two projects was merged back in 2010, we saw a speed-up in development and releases: Lucene and Solr committers put out five(!) new versions of both projects this year. In March, Lucene and Solr 3.1 were released with Unicode 4 support, ReusableTokenStream, spatial search, the Vector-based Highlighter, the Extended Dismax parser, and many more features and bug fixes. Then, less than 3 months later(!), on June 4th, version 3.2 was released. This release introduced the new and much desired results grouping module, NRTCachingDirectory, and highlighting performance improvements. Just one month later, on July 1st, Lucene and Solr 3.3 were introduced. That release included the KStem stemmer, new Spellchecker implementations, Field Collapsing in Solr, and RAM usage reduction in the autocomplete mechanism. By the end of summer there was another release: version 3.4, released on the 14th of September. Pure Lucene users got what Solr had been able to do for a very long time – the long-awaited faceting module contributed by IBM. Version 3.4 also included the new Join functionality, the ability to turn off the query and filter caches, and faceting calculation for Field Collapsing. The last release of Lucene and Solr saw the light of day in late November. Version 3.5.0 brought a huge memory reduction when dealing with term dictionaries, deep paging support, the SearcherManager and SearcherLifetimeManager classes, language identification provided by Tika, as well as sortMissingFirst and sortMissingLast support for TrieFields.

During the last 12 months we attended three major conferences focused on search and big data. Lucene Revolution took place in San Francisco in May. Otis gave a talk titled “Search Analytics: What? Why? How?” (slides) during the first day. There were a number of other good talks there, and the complete conference agenda is available at http://lucenerevolution.com/2011/agenda. Some videos are available as well. Next came Berlin Buzzwords, a more grass-roots conference which took place between the 4th and 10th of June. Otis gave an updated version of his “Search Analytics: What? Why? How?”. If you want to know more, check the conference’s official site – http://berlinbuzzwords.de. The last conference, focused exclusively on Lucene and Solr, was Lucene Eurocon 2011 in sunny and tourist-filled Barcelona, between the 17th and 20th of October. And guess what – we were there again (surprise!), this time in slightly larger numbers. Otis gave a talk about “Search Analytics: Business Value & BigData NoSQL Backend” (video, slides) and Rafał gave a talk on a pretty popular topic – “Explaining & Visualizing Solr ‘explain’ information” (video, slides).

No open source project can endure without regular injections of new blood. This year, the Lucene and Solr development team was joined by a number of new people whose names may look familiar to you:

These seven men are now Lucene and Solr committers, and we look forward to next year’s Year in Review post, where we hope to go over the good things these people will have brought to Lucene and Solr in 2012.

You know an open source project is successful when a whole book is dedicated to it. You know a project is very successful when more than one book and more than one publisher cover it. There were no new editions of Lucene in Action (amazon, manning) this year, but our own Rafał Kuć published his Solr 3.1 Cookbook (amazon) in July. Rafał’s cookbook includes a number of recipes that can make your life easier when it comes to solving common problems with Apache Solr. Another book, Apache Solr 3 Enterprise Search Server (amazon) by David Smiley and Eric Pugh, was published in November. It is a major update to the first edition and covers a wide range of Apache Solr functionality.

@sematext

Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search

As promised in Part 1 of Solr Digest, Spring-Summer 2011, in this Part 2 post we’ll summarize what’s new with Solr’s Near-Real-Time Search support and Solr Cloud (if you love clouds and search with some big data on the side, get in touch). Let’s first examine what is being worked on for Solr Cloud and what else is in the queue for the near future. A good overview of what is currently functional can be found in the old Solr Cloud wiki page. Also, there is now another wiki page covering New Solr Cloud Design, which we find quite useful.  The individual pieces of Solr Cloud functionality that are being worked on are as follows:

  • Work is still in progress on Distributed Indexing and Shard distribution policy. Patches exist, although they are now over 6 months old, so you can expect to see them updated soon.
  • As part of the Distributed Indexing effort, the shard leader functionality deals with leader election and with publishing in ZooKeeper the information about which node is the leader of which shard, in order to notify all interested parties. Development is pretty active here and initial patches already exist.
  • At some point in the future, Replication Handler may become cloud aware, which means it should be possible to switch the roles of masters and slaves, master URLs will be able to change based on cluster state, etc. The work hasn’t started on this issue.
  • Another feature Solr Cloud will have is automatic splitting and migration of indices. The idea is that when some shard’s index becomes too large, or the shard itself starts having bad query response times, we should be able to split off parts of that index and migrate them to (or merge them with) indices on other, less loaded nodes. Again, work on this hasn’t started yet. Once this is implemented, one will be able to split and move/merge indices using the Solr Core Admin, as described in SOLR-2593.
  • To achieve more efficiency in search and gain control over where exactly each document gets indexed to, you will be able to define a custom shard lookup mechanism. This way, you’ll be able to limit execution of search requests to only some shards that are known to hold target documents, thus making the query more efficient and faster.  This, along with the above mentioned shard distribution policy, is akin to routing functionality in ElasticSearch.

On to NRT:

  • There is now a new wiki page dedicated to Solr NRT Search. In short, NRT Search will be available in Solr 4.0, and the work currently in progress is already available on the trunk. The first new piece of functionality that enables NRT Search in Solr is called “soft-commit”. A soft commit is a light version of a regular commit: it avoids the costly parts of a regular commit, namely the flushing of documents from memory to disk, while still allowing searches to see new documents. It appears that a common way of using this will be to issue a soft commit every second or so, to make Solr behave as NRT as possible, while also issuing a “hard-commit” automatically every 1-10 minutes. The hard commit will still be needed so that the latest index changes are persisted to storage; otherwise, in case of a crash, changes since the last hard commit would be lost.
  • Initial steps in supporting NRT Search in Solr were taken in Re-architect Update Handler. Some old Solr issues were dealt with, like waiting for background merges to finish before opening a new IndexReader, the blocking of new updates while a commit is in progress, and a problem where multiple IndexWriters could be open on the same index. The work was done on the solr2193 branch, and that is where the spinoffs of this issue will continue to move Solr even closer to NRT.
  • One of the spinoffs of the Update Handler rearchitecture is SOLR-2565, which provides further improvements on the above mentioned issue.  New issues to deal with other related functionality will be opened along the way, while SOLR-2566 looks to serve as an umbrella issue for NRT Search in Solr.
  • Partially related to NRT Search is the new Transaction Log implemented in Solr under SOLR-2700. The goal is to provide durability of updates, while also supporting features like the already committed Realtime get.  Transaction logs are implemented in various other search solutions such as ElasticSearch and Zoie, so Simon Willnauer started a good thread about the possibility of generalizing this new Transaction Log functionality so that it is not limited to Solr, but exposed to other users and applications, such as Lucene, too.
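The soft/hard commit pattern described above can be sketched as request parameters; note that the parameter names follow the Solr trunk at the time and may still change, so treat this as an illustration rather than a reference:

```python
from urllib.parse import urlencode

# A soft commit makes recently added documents searchable without flushing
# them to disk; a hard commit persists them. A hypothetical client might
# issue these two update requests on very different schedules:
soft = "/solr/update?" + urlencode({"softCommit": "true"})  # cheap, every ~1s
hard = "/solr/update?" + urlencode({"commit": "true"})      # costly, every 1-10 min
print(soft)
print(hard)
```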

We hope you found this post useful.  If you have any questions or suggestions, please leave a comment, and if you want to follow us, we are @sematext on Twitter.

Solr Digest, Spring-Summer 2011, Part 1

No, Solr Digests are not dead, we’ve just been crazily busy at Sematext (yes, we are hiring!). Since our last Solr Digest not one but two new Solr releases have been made: 3.2 in June and 3.3 in July, and version 3.4 is imminent – voting is already in progress, so you can expect a new release pretty soon. Also, there were a number of interesting developments on the trunk (future 3.x and 4.0 versions). Therefore, we will be publishing two Solr Digests this time. This first Digest covers general developments in the Solr world, while the sequel will be more focused on two features drawing a lot of attention: Solr Cloud and Near Real Time search.

Let’s get started with a short overview of announced news in 3.2 and 3.3. First, 3.2 brought us:

  • Ability to specify overwrite and commitWithin as request parameters when using the JSON update format
  • TermQParserPlugin, useful when generating filter queries from terms returned by field faceting or terms component
  • DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString
  • Improvements to the UIMA and Carrot2 integrations
  • Highlighting performance improvements
  • A test-framework jar for easy testing of Solr extensions
  • Bugfixes and improvements from Apache Lucene 3.2
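For example, the TermQParserPlugin from the 3.2 list above lets you turn a raw value returned by field faceting straight into a filter query without worrying about query-parser escaping; the field name used here is made up:

```python
from urllib.parse import urlencode

# {!term} treats the value as a single raw term, so the spaces and the
# trailing dot need no escaping ("manu" is a hypothetical field name).
facet_value = "Apple Computer Inc."
fq = "{!term f=manu}" + facet_value
print(fq)
print(urlencode({"q": "*:*", "fq": fq}))
```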

With 3.3 we got:

  • Grouping / Field Collapsing
  • A new, automaton-based suggest/autocomplete implementation offering an order of magnitude smaller RAM consumption
  • KStemFilterFactory, an optimized implementation of a less aggressive stemmer for English
  • Solr defaults to a new, more efficient merge policy (TieredMergePolicy). See Mike’s cool Lucene segment merging video
  • Important bugfixes, including extremely high RAM usage in spellchecking
  • Bugfixes and improvements from Apache Lucene 3.3

Let’s now look at other interesting stuff. We’ll start with DataImportHandler and its bug fixes. As you’ll notice, there are quite a few of them (and we didn’t even list them all!) so we advise using all available patches.

Already committed features

  • A bug-fix for DataImportHandler – “replication reserves commit-point forever if using replicateAfter=startup”. SOLR-2469 brought a fix to version 3.2 and future 4.0 (trunk). This problem caused unnecessary (and huge) buildup in the number of index files on the slaves.
  • Another bug-fix for DataImportHandler – DIH does not commit if only Deletes are processed. When the special commands $deleteDocById and/or $deleteDocByQuery were used and there were no document updates, DIH didn’t call commit. The fix is available in 3.4 and 4.0.
  • Also – DataImportHandler multi-threaded option throws exception. The problem would happen when the threads attribute was used. The fix for this is available in 3.4 and 4.0. Related to this is another fixed issue – DIH multi threaded mode does not resolves attributes correctly – also available in 3.4 and 4.0.
  • The Join feature got committed to the trunk (future 4.0 version). It can also perform cross-core joins now, which can be very useful. However, this feature also initiated some heated discussions, which can be seen in SOLR-2272. The root cause was the fact that this feature was committed only to Solr, while Lucene got none of it. Of course, it might get refactored and included in Lucene too in the future, but this discussion shows the divisions that still existed between the Solr and Lucene communities back then.
  • While we’re talking about Join feature, it might be worth mentioning a patch in SOLR-2604 which back-ports it to 3.x version. Be careful though, it was created for version 3.2 more than two months ago, so a few more adjustments after applying this patch might be needed.
  • Function Queries got new if(), exists(), and(), or(), not(), xor() and def() functions. The fix is committed to trunk so you’ll be able to use it in 4.0.
  • As can be seen from the Solr 3.3 announcement, one of the longest-lived Solr issues is finally closed for good :). SOLR-236 – Field Collapsing – along with SOLR-2524 finally brings field collapsing to 3_x and future 4.0 versions.
  • Since grouping/field collapsing was added to Solr, we should be able to use faceting in combination with it. Issue SOLR-2665 – Solr Post Group Faceting – brought exactly that to 3.4 and 4.0.
  • Ever wanted to have more control over what gets stored in the cache? SOLR-2429 will bring exactly that starting with the next Solr release – 3.4. It is simple to use, just add cache=false to your queries like this: fq={!frange l=10 u=100 cache=false}mul(popularity,price). Note that with this new functionality you can prevent either a filter or a query from being cached, while document caching still remains out of request-time control.
  • If you’re using JMX to observe the state of your Solr installation, you might have encountered a problem when reloading Solr cores – it appears that JMX beans didn’t survive those reloads in past versions. A fix has been created and is available in future 3_x and trunk releases.
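Tying together the cache control and the new function queries from the list above, a request might combine the two like this; the frange filter is the one quoted above, while the if()/exists() usage and all field names are hypothetical:

```python
from urllib.parse import urlencode

# An uncached frange filter over a product of two fields, combined with a
# function query using the new if()/exists() functions; "popularity",
# "price" and "discount" are made-up field names.
fq = "{!frange l=10 u=100 cache=false}mul(popularity,price)"
q = "{!func}if(exists(discount),discount,price)"
print(urlencode({"q": q, "fq": fq}))
```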

Interesting features in development

  • To achieve case-insensitive search with wildcard queries, you could use a patch supplied under issue SOLR-2438. It has to be said that this isn’t committed to svn, and it is hard to say whether it ever will be, since there is a similar issue, SOLR-219, on which work started 4 years ago.
  • Multithreaded faceting might bring some performance improvements. At the moment an initial patch exists, but more work will be needed here, and it still isn’t clear how big an improvement we could expect in real-world conditions, but it is worth keeping an eye on this issue.
  • We all know that Solr’s spatial support has its limitations. One of them is the inability to specify a bounding box that isn’t based on point distance, which effectively limits filtering to a circular shape. Under SOLR-2609 we might get support for exactly this.
  • For anyone interested in which direction Spatial support might evolve, we suggest checking Lucene Spatial Playground. It continues the great work done in SOLR-2155 which provided extension to initial GeoSpatial support in Solr by adding multivalued spatial fields. At some point, SOLR-2155 might get the goodness from LSP. Also, another thing to check would be a thread on Lucene Spatial Future.

Interesting new features

  • Support for Lucene’s Surround Parser is added to Solr in issue SOLR-2703. The patch is already committed to the trunk.
  • Solr will get the ability to use configuration like analyzer type=”phrase”. Lucene’s query parsers recently got a simpler way to use a different analyzer based on the query string. One example is the usage of double quotes: instead of their current meaning in the Lucene/Solr world – specifying a phrase to be searched for – one could decide they should have the meaning they do in Google’s search engine – find this exact wording. A patch for this exists and can be applied to the trunk (it depends on Lucene trunk).
  • SOLR-2593 aims to provide a new Solr core admin action – ‘split’ – for splitting an index. It would be used in case some core gets too big, or in any other case where you might find it necessary. Lucene already has a similar function.

Miscellaneous

  • Oracle released Java 7 about a month ago, but we advise against using it yet. JVM crashes and index corruption are issues likely to be encountered with it. For more information, visit this URL
  • As anticipated for some time, Java 5 support got axed from Lucene 4.0 (trunk). You can expect the same for Solr, too.
  • Solr’s build system has been reworked. Among other things, this implies changes to the directory structure of the Solr project. For example, solr/src/ doesn’t exist any more and its old subdirs /java and /test are now under solr/core/. The changes are already applied to the trunk and 3_x, which holds the next 3.4 version. For more details, see SOLR-2452.
  • A handy Solr architecture diagram can be found in this ML thread
  • Solr’s Admin UI is being refreshed with the work in JIRA issue SOLR-2399 (we already wrote about it) and its spin-off SOLR-2667. Some of this stuff is already committed (on the trunk), so you may want to inspect the changes. More details can be found in the wiki where you can also get the sneak-peak of the upcoming changes.

And that would be all for part one of the Solr Spring-Summer 2011 Digest edition from @sematext. Part two of the Spring-Summer Digest is coming in a few days – stay tuned!

Solr Digest, February-March 2011

We Sematexters have been very busy over the past few months, so we missed Solr’s February Digest. This one will therefore be a bit longer than usual.  Let’s get started…

First, some major news: Solr 3.1 is officially released! The details of the announcement can be found here. We covered most of the new features in our digests already, so we’ll keep it short.

You can start your download :).

Already committed features

  • post.jar got improved – the JIRA issue improve post.jar to handle non UTF-8 files removed some of its very old limitations
  • the Jetty server included in the Solr distribution didn’t support UTF-8; this is now solved, and the fresh 3.1 release already contains the fix

Interesting features in development

  • as part of SolrCloud, distributed indexing is being implemented in JIRA issue SOLR-2358. You can already see the work in progress in the initial patch, and you can also check SOLR-2341, which deals with the shard distribution policies that will be available in Solr 4.0
  • If you ever wanted to add custom fields (not existing in the index) to Solr responses, you couldn’t do that from Solr components. There were other ways to achieve such functionality (for instance, customizing the response writer class), but it looks like we’ll get such ability inside components, too. Needless to say, that would be much more natural. Anyway, the issue Allow components to add fields to outgoing documents provides the umbrella for this new functionality. Although it is already closed, there are a few sub-issues in which the actual pieces of logic will be implemented.
  • if you have problem with case sensitive searches in wildcard queries, you might take a look at a patch provided in JIRA issue Case Insensitive Search for Wildcard Queries
  • although Solr got its first solid spatial implementation in version 3.1, many people have run into its limitations. One of them is surely the case where documents have multivalued spatial fields. We already wrote about SOLR-2155 in our December digest, but work on that issue hasn’t stopped and keeps evolving. It is likely that it will become part of the standard Solr distribution, and Lucene could get it incorporated, too. If you need spatial search, you may want to watch this issue.

Interesting new features

  • one common problem when using Solr’s default spellchecker or auto-suggest is filtering suggestions based on what a given user can see (for instance, depending on the region in which the user resides). The JIRA issue Doc Filters for Auto-suggest, spell checking, terms component, etc. proposes a feature which would help here. Currently, no work has been done there, though we believe we’ll see some progress in the future. While we are at it, in case you need such a feature in auto-suggest now, you might take a look at our in-house Search Auto-Complete solution, which you can see in action on search-lucene.com and search-hadoop.com.
  • just like there are default components for SearchHandlers (used for every new search handler unless overridden), update processors will get a similar feature. The JIRA issue Let some UpdateProcessors be default without explicitly configuring them will ensure that some important update processors are available by default in your UpdateRequestProcessorChain.
  • one great new feature could be added to Solr – the ability not to cache a filter. JIRA issue SOLR-2429 will deal with this. Many Solr users will be happy to optimize their cache performance when this feature becomes available some day.

Miscellaneous

  • some interesting thoughts on spellchecker can be found in ML thread My spellchecker experiment and much more on that topic in the related blog
  • should you use ASCIIFoldingFilter or MappingCharFilter when dealing with accents? Interesting discussion in thread Should ASCIIFoldingFilter be deprecated? could help you decide which one is right for you
  • an interesting idea for Solr’s admin UI can be found in this ML thread. The community’s reception was very good, so we also got the Solr Admin Interface, reworked issue as the home for this new work.
  • anyone using Solr’s UIMA (Unstructured Information Management Architecture) contrib might be interested to know that its wiki page got a major improvement – more docs to read!
  • we might be a bit late on this, but there is still some time left – Google’s Summer of Code applications can be submitted until 8th April. Check this ML thread for details. And don’t forget that Sematext is sponsoring interns, too!
  • new Solr/Lucene users should take a look at the Refcard provided by Erik Hatcher in ML thread [infomercial] Lucene Refcard at DZone
  • some deep thoughts on the Solr/Lucene release process by some of the key people can be found here: Brainstorming on Improving the Release Process. Related to that is a JIRA issue, Define Test Plan for 4.0, which will… eh, contain some info about the test plan for the 4.0 release, obviously. Also, check the TestPlans wiki page that’s in the making.

Although there were some other interesting topics, we have to stop somewhere. Until next month, you’ll find us on Twitter.

Solr Digest, January 2011

Welcome to the second season of Sematext’s monthly Solr Digests. Once again, we compiled a list of the most interesting topics in the Solr world from the previous month:

Already committed features

  • A bug related to using the PHPSerialized response writer in a sharded environment was fixed and committed in SOLR-2307. It affected all recent Solr versions (trunk, 3_x, 1.4.1, …) and the fix is committed to the 3_x branch and trunk. In case you’re stuck with an older version of Solr, you can try applying the patch manually; it should be doable.
  • One old JIRA issue Enable sorting by Function Query is finally closed and committed to 3_x and trunk.
  • A problem with a race condition in StreamingUpdateSolrServer got its fixes before; however, it appears the issue wasn’t fixed completely. Another fix is now committed to 3_x and trunk, so if you use this feature, we advise picking up the fix.

Interesting features in development

  • Support for complex syntax (e.g. wildcards) in phrase queries is being brought to Lucene. In case you’re interested, you can take a look at LUCENE-1823 or LUCENE-1486, which was another try at similar functionality. These issues have been in development for a long time and still aren’t finished, although patches exist. A similar feature for Solr is being developed under SOLR-1604, where you can also find some patches. However, we think it is a bit unclear whether any of these issues will ever be committed to Lucene/Solr, so if you’re interested, check their progress occasionally and don’t hold your breath.

Interesting new features

Miscellaneous

And that’s all for January.

Solr Digest, December 2010

Just the other day, we posted the Lucene & Solr highlights in our Lucene & Solr: 2010 in Review post, and now it’s time to really conclude 2010 in the Solr world with the December Solr Digest. Although one might expect the festive period to take its toll on Solr development velocity, it wasn’t like that at all. Open source never sleeps. Here are the most interesting highlights:

Interesting features in development

  • In our July Digest we mentioned the LanguageIdentifierUpdateProcessor feature proposed under JIRA issue SOLR-1979. Some artifacts in the form of patches are starting to appear attached to that issue, so if you’re interested in this feature, take a look.
  • Solr’s spatial capabilities are being further refined. With issue SOLR-2268, Solr will get “support for Point in Polygon searches”. This should enable features like “for a given point, return all documents which contain a polygon inside of which that point lies” and “for a given polygon, return all documents which have a point contained inside of that polygon”. Of course, negated versions of this feature will be supported. The work is in its early stages; one patch is attached, but it can be used only as a general pointer to how this will be implemented, nothing else.
  • Support for “ColognePhonetic” encoder was added to PhoneticFilterFactory. Since “ColognePhonetic” will be added to Commons Codec 1.4.1, the patch provided in SOLR-2276 will wait until that version gets released.

Interesting new features

  • Solr is getting JOIN functionality – sort of. As part of SOLR-2272, Solr got a working patch that provides SQL-JOIN-like functionality. Of course, this is not exactly the JOIN you might know from SQL, but it is probably the closest thing to it that can be implemented in Solr. It is likely that this feature will be integrated into Lucene as well – it makes no sense to have it strictly in Solr. It is also likely that this feature will be expanded in the future; currently it has only one algorithm and supports the many-to-many type of JOIN.
  • As part of SolrCloud, a new feature, SolrCloud distributed indexing, will be added to Solr some day. SOLR-2293 will likely be the JIRA home for this feature. Before SolrCloud, anyone using distributed indexing had to create custom logic to handle the distribution of documents over the various shards in a cluster. With SolrCloud, this will be transparent to clients. Also, SolrCloud will include some out-of-the-box distribution algorithms, while plugging in custom algorithms will, of course, be easy to accomplish. However, don’t hold your breath waiting for this feature. At the moment, only the JIRA issue (and some guidelines in the wiki) exists.

Miscellaneous

  • Solr’s Jetty has been upgraded to the latest 6.1.26 version. The change was committed to the 3_x branch and trunk but, of course, this doesn’t mean you have to upgrade your own Jetty. It just means that the artifacts (various jars and a few XML files) under /solr/example got upgraded. Details of this change can be found in SOLR-2265.
  • Did you experience any problems with DataImportHandler and its multi-threaded option? If so, you are not the only one. More details about the nature of the problem can be found in SOLR-2186, while SOLR-2233 appears to already contain a patch which might help in the case of a JDBC data source. That patch contains a few other DataImportHandler multithreading fixes as well. Nothing related to this issue has been committed to trunk or the 3_x branch yet.
  • Many problems people encounter with Solr are related to OutOfMemory errors. There were many interesting discussions on the mailing lists in December; two of them on this topic that you might find interesting are OutOfMemory GC: GC overhead limit exceeded – Why isn’t WeakHashMap getting collected? and Memory use during merges (OOM).
  • If you have found the “bf” parameter somewhat limited – it doesn’t accept complex nested expressions with lots of whitespace – you might find the patch from SOLR-2014 useful. It appears that this is a common problem: another similar report (although this one actually isn’t a bug) was also opened recently – SOLR-2267. However, keep in mind that the “bf” parameter will likely be deprecated in the future, since “bq” can be used to achieve everything “bf” can, and more.
  • In our November Digest, we mentioned SOLR-2154 and Solr’s problem with multi-valued spatial fields. What we missed was SOLR-2155, which actually provides a patch for those problems. Although it isn’t committed, the patch appears to be functional, so give it a try.
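To illustrate the “bf” vs. “bq” point above, here is a hedged sketch of two ways to express the same additive function boost; the field name `popularity` and the surrounding parameters are invented for this example.

```python
from urllib.parse import urlencode

# "bf" takes a bare function query and, as noted above, is picky about
# whitespace in complex nested expressions:
bf_params = {"defType": "dismax", "q": "solr",
             "bf": "log(sum(popularity,1))"}

# The same boost expressed through "bq" using the {!func} local-params
# syntax, which handles more complex nested expressions:
bq_params = {"defType": "dismax", "q": "solr",
             "bq": "{!func}log(sum(popularity,1))"}

print(urlencode(bf_params))
print(urlencode(bq_params))
```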

And that would be all for Solr in 2010.  We’ll be back with a new Solr Digest in a month.  Follow @sematext for other interesting search news.

Solr Digest, November 2010

It is time for the last Solr Digest of 2010; the next Digest will be published some time in January 2011. This was not a month with many interesting developments, so we bring to your attention only a few of the more interesting bits. Here we go…

Already committed features

  • Anyone working with the Polish language will be happy to hear that a factory for the Polish stemmer has been committed to 3_x and trunk.

Miscellaneous

  • A fix for a feature that was committed earlier this year – Enable sorting by Function Query – is close to being committed. This is a big one!  There were some problems with it: functions weren’t weighted, the function query wasn’t being properly parsed, some deprecated bits of code were used, etc. A patch is already posted, so if you are eager to use this functionality you can start by applying the patch yourself.
  • Many people are using the Spatial Search features recently introduced in Solr. If you’re considering that too, be careful about one limitation: there is no spatial support for multi-valued fields. So, if you have multi-valued spatial fields and you’d like to do some sorting on them, you’ll end up with incorrect results. The feature we’re describing here can be found in some other search tools, like ElasticSearch, so Solr might be getting it some day too. You can check for progress in JIRA issues like SOLR-2154.
  • There is a major bug in DataImportHandler – it doesn’t release JDBC connections. It appears that this issue isn’t related to any particular database, so this is an obvious bug in DIH. Check this JIRA issue for updates.
  • If you prefer git over svn, you might be interested in Solr’s git repository recently set up. Check this ML thread to learn more about it.

So long until 2011, Solr Digest readers!  Follow @sematext on Twitter for other stuff from Sematext.

Solr Digest, October 2010

Another busy month is behind us.  There were plenty of interesting topics, so let’s get started:

Interesting functionality in development

  • Faceting is heavily used functionality, but occasionally people find they’re missing some form of faceting. Hierarchical faceting is one such form. It has been in development for a very long time but, despite a few posted patches, is still not part of the Solr distribution, and it hasn’t seen much activity lately. There is another similar issue – the Pivot (aka Decision Tree) Faceting Component, which should come to life as a separate component. However, there is a renewed effort to make it usable, so eventually we’ll see expanded faceting support in Solr.

Interesting new functionality

  • Extending SchemaField with custom attributes is being dealt with in the issue Custom SchemaField object.
  • Improving search relevance is always a big issue (and represents a good part of what Sematext does in client engagements), no matter how good out-of-the-box Solr and Lucene relevance is.  One very useful addition to our search relevance arsenal could come from the Anti-phrasing feature. The idea is that some word sequences in a query are irrelevant to the query’s meaning (like “Where can I find” or “Where is”) and could/should be ignored while searching the index. This JIRA issue is still very fresh, so don’t hold your breath waiting for the implementation to become available next week, although we are bound to see this feature in one of the future Solr releases.
  • If you often work with financial data, you might find the patches from the Money FieldType issue useful. The new field type will support point and range queries, sorting, and exchange rates.
  • Lucene’s ICUTokenizer is useful for multilingual tokenizing, but until recently there was no support for it in Solr. The issue Provide Solr FilterFactory for Lucene ICUTokenizer will provide a factory which will let us use it from Solr. Bingo! The patch already exists, so it can be tried out already. Additional functionality will be added over time.  If you need multilingual support in Solr, have a look at Sematext’s popular Multilingual Indexer.
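The Anti-phrasing idea mentioned above can be sketched in a few lines. The phrase list and the query-rewriting approach here are our own illustration; the actual JIRA implementation may work quite differently (for example, inside the analysis chain).

```python
# Minimal anti-phrasing sketch: strip known "noise" prefixes from a
# user's query before searching. The ANTI_PHRASES list is invented
# for illustration.
ANTI_PHRASES = ["where can i find", "where is", "how do i"]

def strip_anti_phrases(query: str) -> str:
    lowered = query.lower()
    for phrase in ANTI_PHRASES:
        if lowered.startswith(phrase):
            return query[len(phrase):].strip()
    return query

print(strip_anti_phrases("Where can I find cheap flights"))  # -> "cheap flights"
```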

Miscellaneous

  • One of the favorite topics, which we also cover frequently, is the ongoing confusion about Solr versions. October didn’t disappoint; this topic was discussed on the mailing lists again. Here is one such thread – Which version of Solr to use?. Let us summarize the key parts: Solr 1.5 will probably never be released.  The branch_3x is a stable branch from which the next Solr version, 3.1, will likely be released.  The trunk contains a relatively stable, but still in-development, version of what will one day become Solr 4.0.
  • If you provide faceting functionality in your application, here is a small (but interesting) discussion that might give you a few ideas about how to optimize it – Faceting and first letter of fields.
  • It appears that Solr has problems running on Tomcat 7. These problems are not related to a particular version of Solr, but affect all versions. To learn more, start with Problems running on tomcat and SOLR-2022.
  • The replication between Solr master and slave when they’re running different versions of Solr is broken, as you can see in issue Cross-version replication broken by new javabin format. The cause is the new javabin format, so in cases like the one described in this issue (master 1.4.1, slave 3x), you’ll encounter problems. Keep that in mind if you plan cross-version replication for some reason.

These were the most interesting highlights for the month of October. Thank you for reading Sematext Blog and following @sematext on Twitter.

Mahout Digest, October 2010

We’ve been very busy here at Sematext, so we haven’t covered Mahout during the last few months.  We are pleased with what’s been keeping us busy, but not happy about our irregular Mahout Digests.  We covered the last (0.3) release with all of its features, and we are not going to miss covering a very important milestone for Mahout: release 0.4 is out! In this digest we’ll summarize the most important changes in Mahout since the last digest and add some perspective.

Before we dive into Mahout, please note that we are looking for people with Machine Learning skills and Mahout experience (as well as good Lucene/Solr search people).  See our Hiring Search and Data Analytics Engineers post.

This Mahout release brings overall changes to the model code and the command line interface, aimed at improving integration and consistency (easier access to Mahout operations via the command line). The command line interface is now pretty much standardized for working with all the various options, which makes Mahout easier to run and use. Interfaces are better and more consistent across algorithms, and there have been many small fixes, improvements, refactorings, and clean-ups. Details on what’s included can be found in the release notes, and the download is available from the Apache Mirrors.

Now let’s add some context to various changes and new features.

GSoC projects

Mahout wrapped up its Google Summer of Code projects, and two completed successfully:

  • EigenCuts spectral clustering implementation on Map-Reduce for Apache Mahout (addresses issue MAHOUT-328); the proposal and implementation details can be found in MAHOUT-363
  • Hidden Markov Models based sequence classification (a proposal for a summer-term university project); the proposal and implementation details are in MAHOUT-396

Two projects did not complete due to lack of student participation and one remains in progress.

Clustering

The biggest additions in the clustering department are the EigenCuts clustering algorithm (a GSoC project) and MinHash-based clustering, which we covered as one of the possible GSoC suggestions in one of our previous digests. MinHash clustering was implemented, but not as a GSoC project. In the first digest in the Mahout series we covered problems related to the evaluation of clustering results (an unsupervised learning issue), so a big addition to Mahout’s clustering is the new Cluster Evaluation Tools featuring the new ClusterEvaluator (which uses Mahout In Action code for inter-cluster density and similar code for intra-cluster density over a set of representative points, not the entire clustered data set) and CDbwEvaluator, which offers new ways to evaluate clustering effectiveness.
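As a rough illustration of what an intra-cluster density measure captures, here is a toy sketch: points closer to their centroid mean a “tighter” (denser) cluster. This is a simplification of what Mahout’s evaluators actually compute over representative points.

```python
import math

# Toy intra-cluster density: average distance of points to their
# centroid. Lower is denser. Illustrative only; Mahout's
# ClusterEvaluator and CDbwEvaluator are more elaborate.
def centroid(points):
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def avg_distance_to_centroid(points):
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
loose = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
print(avg_distance_to_centroid(tight), avg_distance_to_centroid(loose))
```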

Logistic Regression

Online learning capabilities such as a Stochastic Gradient Descent (SGD) implementation are now part of Mahout. Logistic regression is a model used to predict the probability of occurrence of an event. It makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person’s age, sex, and body mass index. Logistic regression is used extensively in the medical and social sciences, as well as in marketing applications such as predicting a customer’s propensity to purchase a product or cancel a subscription. The Mahout implementation uses Stochastic Gradient Descent; see MAHOUT-228 for the initial request and development history. The new sequential logistic regression training framework supports a feature vector encoding framework for high-speed vectorization without a pre-built dictionary. You can find more details on Mahout’s logistic regression wiki page.
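A from-scratch sketch may help make this concrete: logistic regression trained with SGD on a tiny invented data set. Mahout’s actual trainer adds feature encoding, regularization, learning-rate schedules, and much more; this is just the core idea.

```python
import math
import random

# Core of SGD-trained logistic regression: for each example, nudge the
# weights against the gradient of the log-loss, (p - y) * x.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sgd(examples, lr=0.1, epochs=200, seed=42):
    random.seed(seed)
    dim = len(examples[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        random.shuffle(examples)
        for x, y in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i in range(dim):
                w[i] -= lr * (p - y) * x[i]
    return w

# Tiny synthetic data: first feature is a bias term, label is 1 when
# the second feature is large.
data = [([1.0, 0.2], 0), ([1.0, 0.9], 1), ([1.0, 0.1], 0), ([1.0, 0.8], 1)]
w = train_sgd(data)
p = sigmoid(w[0] + w[1] * 0.9)
print(p)
```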

Math

There has been a lot of cleanup done in the math module (you can check the details in the Cleanup Math discussion on the ML), much of it related to the untested Colt framework integration (and deprecated code in the Colt framework). The discussion resulted in several pieces of the Colt framework getting promoted to tested status (QRDecomposition, in particular).

Classification

In addition to speedups and bug fixes, the main new features in classification are new classifiers (new classification algorithms) and more open/uniform input data formats (vectors). The most important changes are:

  • New SGD classifier
  • Experimental new type of Naive Bayes classifier (using vectors) and feature reduction options for the existing Naive Bayes classifier (variable-length coding of vectors)
  • New VectorModelClassifier allows any set of clusters to be used for classification (clustering as input for classification)
  • A random forest can now be saved and used to classify new data. Read more on how to build a random forest and how to use it to classify new cases on this dedicated wiki page.

Recommendation Engine

The most important changes in this area are related to distributed similarity computations, which can be used in Collaborative Filtering (or in other areas, like clustering). An implementation of a Map-Reduce job which computes item-item similarities for item-based Collaborative Filtering, based on the algorithm suggested in Elsayed et al.: Pairwise Document Similarity in Large Collections with MapReduce, can be found in MAHOUT-362. A generalization of the algorithm, based on the mailing list discussion, led to an implementation of a Map-Reduce job which computes pairwise similarities of the rows of a matrix using a customizable similarity measure (with implementations already provided for Cooccurrence, Euclidean Distance, Loglikelihood, Pearson Correlation, Tanimoto coefficient, and Cosine). More on the distributed version of any item similarity function (previously available only in a non-distributed implementation) can be found in MAHOUT-393. With pairwise similarity computation defined, RecommenderJob has evolved into a fully distributed item-based recommender (the implementation depends on how the pairwise similarities are computed). You can read more on the distributed item-based recommender in MAHOUT-420.
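To give a feel for one of those pluggable similarity measures, here is an in-memory sketch of cosine similarity over sparse item vectors (dicts mapping user id to rating). Mahout streams such computations through a Map-Reduce job over matrix rows rather than computing them in memory like this; the item and user names are invented.

```python
import math

# Cosine similarity over sparse vectors represented as dicts.
# One of several measures (Cooccurrence, Euclidean, Loglikelihood,
# Pearson, Tanimoto, Cosine) the pairwise-similarity job can plug in.
def cosine(a: dict, b: dict) -> float:
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

item_x = {"alice": 5.0, "bob": 3.0}
item_y = {"alice": 5.0, "bob": 3.0}
item_z = {"carol": 4.0}
print(cosine(item_x, item_y), cosine(item_x, item_z))
```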

Implementations of distributed operations on very large matrices are very important for a scalable machine learning library that supports large data sets. For example, when a term vector is built from textual documents/content, term vectors tend to have high dimensionality. Now, if we consider a term-document matrix where each row represents the terms from a document, while each column represents a document, we obviously end up with a high-dimensional matrix. The same thing occurs in Collaborative Filtering: it uses a user-item matrix containing ratings as matrix values, where each row corresponds to a user and each column represents an item. Again we have a high-dimensional matrix that is sparse.

In both cases (term-document matrix and user-item matrix) we are dealing with high matrix dimensionality that needs to be reduced, while preserving most of the information (in the best way possible). In other words, we need some sort of matrix operation that produces a lower-dimensional matrix with the important information preserved. For example, a high-dimensional matrix may be approximated in lower dimensions using Singular Value Decomposition (SVD).
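To illustrate the idea behind SVD-based reduction, here is a toy power-iteration sketch: iterating with A^T A recovers the dominant right singular vector and singular value. Mahout’s distributed Lanczos solver does this at scale for many singular vectors; this sketch recovers only the first.

```python
import math

# Power iteration on A^T A: v converges to the dominant right singular
# vector; ||A v|| is then the dominant singular value. Toy sketch of
# the idea behind SVD, not Mahout's Lanczos implementation.
def top_singular(A, iterations=100):
    cols = len(A[0])
    v = [1.0] * cols
    for _ in range(iterations):
        Av = [sum(row[j] * v[j] for j in range(cols)) for row in A]
        w = [sum(A[i][j] * Av[i] for i in range(len(A))) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(row[j] * v[j] for j in range(cols)) for row in A]
    sigma = math.sqrt(sum(x * x for x in Av))
    return sigma, v

A = [[3.0, 0.0], [0.0, 1.0]]  # singular values of diag(3, 1) are 3 and 1
sigma, v = top_singular(A)
print(sigma)
```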

Obviously, we need a (Java) matrix framework capable of fundamental matrix decompositions. JAMA is a great example of a widely used linear algebra package for matrix operations, capable of SVD and other fundamental matrix decompositions (WEKA, for example, uses JAMA for matrix operations). Operations on high-dimensional matrices always require heavy computation, and this requirement places high hardware demands on any ML production system. This is where Mahout, which features distributed operations on large matrices, should be the production choice for Machine Learning algorithms over frameworks like JAMA, which, although great, cannot distribute its operations.

In a typical recommendation setup users often ‘have’ (used/interacted with) only a few items from the whole item set (which can be very large), which leads to sparse user-item matrices. Mahout’s (0.4) distributed Lanczos SVD implementation is particularly useful for finding decompositions of very large sparse matrices.

News and Roadmap

All of the new distributed similarity/recommender implementations we analyzed in the previous paragraphs were contributed by Sebastian Schelter, and in recognition of this important work he was elected a new Mahout committer.

The book “Mahout in Action”, published by Manning, has reached 15/16 chapters complete and will soon enter final review.

That is all from us for now.  Any comments/questions/suggestions are more than welcome, and until the next Mahout Digest keep an eye on Mahout’s road map for 0.5 or the discussion about what Mahout is missing to become a production-stable (1.0) framework.  We’ll see you next month – @sematext.

Solr Digest, September 2010

It is a busy time of year here at Sematext – we have 3 different presentations to prepare for 3 different conferences (2 down, 1 more to go!), so we’re a bit late with our digests. Nevertheless, we managed to compile a list of interesting topics in the Solr world:

Already committed functionality

  • Solr was upgraded to use Tika 0.7 (SOLR-1819) – the fix was applied to the 1.4.2, 3.1 and 4.0 versions.  Of course, Tika 0.8 is going to happen in the not very distant future.
  • If you’re still using the old rsync-based replication and need to throttle the transfer rate, have a look at a patch contributed in JIRA issue SOLR-2099. Unfortunately, if you’re using the 1.4 Java-based replication, there is currently no way to throttle replication.
  • If you are using the new spatial capabilities in Solr, you might have noticed some incorrect calculations. One of them is fixed – Spatial filter is not accurate – on the 3.1 and 4.0 branches.
  • Another minor but useful addition – function queries can now be defined in terms of other request parameters. Check the JIRA issue “full parameter dereferencing for function queries”. It is already implemented in 3.1 and 4.0 and is ready to be used. Here is a short example from JIRA (check how the add function is defined, and note the v1 and v2 request parameters):

http://localhost:8983/solr/select?defType=func&fl=id,score&q=add($v1,$v2)&v1=mul(2,3)&v2=10

Can we say, Solr Calculator, eh?

Interesting functionalities in development for some time

  • Ever wanted to add some custom fields to a response, even though they were not stored in your Solr index? You could always create a custom response writer to add those fields (although it would probably be a “dirty” copy of one of Solr’s existing response writers). However, we all know that doesn’t sound like the right way to do it.  One JIRA issue might deliver the correct way some day – Allow components to add fields to outgoing documents. We say “some day”, since this functionality has been in development for quite some time now and, although it has some patches (currently non-functional, it seems), it is probably not very near completion.  But it will be handy to have once it’s done.

Interesting new functionalities

  • The Highlighter could get one frequently requested improvement – Highlighter fragment/formatter for returning just the matching terms – we believe this would be a useful addition, although we don’t expect it very soon.
  • One potentially useful feature for all of you who use HDFS – DIH should be able to read data directly from HDFS for indexing.  The issue already contains some working code, although it is an open question whether the fix will become part of the standard Solr distribution.  Still, if you’re using Solr 1.4.1 and have data in HDFS that you want to index with Solr, have a look at this contribution.
  • Another improvement related to replication is in SOLR-2117 – Allow slaves to replicate at different times. This should be useful to anyone who has long (and therefore heavy) warmup periods on their slaves after replication. This way, you can have your slaves replicate at different times and, at the time of replication, take the replicating slave offline (to avoid degradation of response times). Be careful though, there is a downside: for some (limited, but still non-zero) time, your slaves will serve different data. A patch is available for the 4.0 version.

So, we had a little bit of everything from Solr this month. Until late October (or start of November) when new Solr Digest arrives, stay tuned to @sematext, where we tweet other interesting stuff on a wider set of topics from time to time.
