Lucene & Solr Year 2011 in Review

The year 2011 is coming to an end and it’s time to reflect on the past 12 months.  Without further fluff, let’s look back and summarize all significant events that happened in the Lucene and Solr world over the course of the last dozen months. In the next few paragraphs we’ll go over major changes in Lucene and Solr, new blood, relevant conferences and books.

We should start by pointing out that this year Apache Lucene celebrated its 10-year anniversary as an Apache Software Foundation project.  Lucene itself is actually over 10 years old.  Otis is one of the very few people from the early years who is still around.  While we didn’t celebrate any Solr anniversaries this year, we should note that Solr, too, has been around for quite a while and is in fact approaching its 6th year at ASF!

This year saw numerous changes and additions both in Lucene and Solr.  As a matter of fact, we’d venture to say we saw more changes in Lucene & Solr this year than in any one year before.  In that sense, both projects are very much like wine – getting better with time. Let’s take a look at a few of the most significant changes in 2011.

The much anticipated Near Real-Time search (NRT) functionality has arrived.  What this means is that documents that were just added to a Lucene/Solr index can immediately be made visible in search results.  This is big!  Of course, work on NRT is still in progress, but NRT is ready and you, like a number of our clients, should start using it.

Field Collapsing was one of the most watched and voted for JIRA issues for many months.  This functionality was implemented this year and now Lucene and Solr users can group results on the basis of a field or a query. In addition, you can control the groups and even calculate facets on the groups rather than on single documents. A rather powerful feature.
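To make this a bit more concrete, here is a minimal sketch of how a grouped Solr request might be put together. The `group`, `group.field` and `group.limit` request parameters are Solr's result grouping parameters; the host, the path and the `manufacturer` field are placeholders we made up for the example.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def grouped_query_url(base_url, query, group_field, per_group=3):
    """Build a Solr select URL that groups results by one field."""
    params = [
        ("q", query),
        ("group", "true"),                # turn result grouping on
        ("group.field", group_field),     # field whose values define the groups
        ("group.limit", str(per_group)),  # documents returned per group
    ]
    return base_url + "?" + urlencode(params)

# Hypothetical host and field name, for illustration only.
grouped_url = grouped_query_url(
    "http://localhost:8983/solr/select", "laptop", "manufacturer"
)
print(grouped_url)
```

Each group in the response would then hold at most three documents that share the same `manufacturer` value.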

From Lucene users’ perspective it is also worth noting that Lucene finally got a faceting module.  Until now, faceting was available only in Solr.  If you are a pure Lucene user, you now don’t need Solr to calculate facets.
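For readers new to the term, faceting boils down to counting how many matching documents carry each value of a field. The toy sketch below shows only that idea in plain Python; it is not the Lucene faceting module's API, and the documents and the `brand` field are invented for the example.

```python
from collections import Counter

def facet_counts(docs, field):
    """Count how many documents carry each value of `field`."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is not None:
            counts[value] += 1
    # Most frequent values first, as facet lists are usually displayed.
    return counts.most_common()

# Made-up search results, for illustration only.
matching_docs = [
    {"id": 1, "brand": "acme"},
    {"id": 2, "brand": "acme"},
    {"id": 3, "brand": "globex"},
    {"id": 4},  # no brand, so it is not counted
]
brand_facets = facet_counts(matching_docs, "brand")
print(brand_facets)
```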

In the past, modeling parent-child relationships in Lucene and Solr indices was not really possible – one had to flatten everything.  No longer – if you need to model a parent-child relationship in your index you can use the Join contrib module.  This Join functionality lets you join parent and child documents at query-time, while relying on some assumptions about how documents were indexed.

Good and broad language support is hugely important for any search solution and this year was good for Lucene and Solr in that department: the KStemFilter English stemmer was added, full Unicode 4 support was added, new Japanese and Chinese support was added, a new stemmer-protection mechanism was added, work on reducing synonym filter RAM consumption was done, etc.  Another big addition was integration with Hunspell, which enables language-specific processing for all languages supported by Open Office.  That’s a lot of new languages we can now handle with Lucene and Solr! There is more.

Lucene 3.5.0 significantly reduced the term dictionary memory footprint. Big!  Lucene now uses 3 to 5 times less memory when dealing with the term dictionary, so it consumes even less RAM.

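If you use Lucene and need to page through a lot of results you may run into performance and memory problems, because fetching a deep page means collecting every hit that ranks before it. That’s why Lucene 3.5.0 introduced the searchAfter method, which lets a search continue from the last hit of the previous page and so addresses the deep paging problem.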
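The idea behind searchAfter can be sketched in a few lines of plain Python. This is only a toy model of the technique, not Lucene code: instead of asking for "hits 10,000 to 10,010", the caller hands back the last hit of the previous page and only hits ranking after it are collected. The hit tuples and the tie-breaking rule are assumptions made for the example.

```python
import heapq

def search_after(hits, after, n):
    """Return the next n hits that rank after `after`.

    hits  : iterable of (score, doc_id) pairs, in any order
    after : (score, doc_id) of the last hit on the previous page, or None
    Ranking is by descending score, ties broken by ascending doc_id.
    """
    def rank(hit):
        score, doc_id = hit
        return (-score, doc_id)

    candidates = (h for h in hits if after is None or rank(h) > rank(after))
    # Only n entries are held at a time, however deep the page is.
    return heapq.nsmallest(n, candidates, key=rank)

all_hits = [(0.9, 4), (0.5, 1), (0.9, 2), (0.7, 3), (0.5, 0)]
page1 = search_after(all_hits, None, 2)
page2 = search_after(all_hits, page1[-1], 2)
page3 = search_after(all_hits, page2[-1], 2)
print(page1, page2, page3)
```

The memory needed per page stays proportional to the page size, which is the property that makes deep paging cheap.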

There is also a new, fast and reliable Term Vector-based highlighter that both Lucene and Solr can use.

Dismax is great, but the Extended Dismax query parser added to Solr is even better – it extends the Dismax query parser’s functionality and can further improve the quality of search results.

You can now also sort by function (imagine sorting the results by distance from a point) and use the new spatial search with filtering.
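As a rough sketch, a "nearest first, within a radius" request could be assembled like this. The `sfield`, `pt` and `d` parameters, the `{!geofilt}` filter and the `geodist()` function are part of Solr's spatial support; the host, the `store` field name and the coordinates are made up for the example.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def nearby_query_url(base_url, query, field, lat, lon, radius_km):
    """Build a Solr URL that filters by distance and sorts nearest first."""
    params = [
        ("q", query),
        ("sfield", field),          # spatial field holding the location
        ("pt", f"{lat},{lon}"),     # centre point as "lat,lon"
        ("d", str(radius_km)),      # radius in kilometres
        ("fq", "{!geofilt}"),       # keep only documents within the radius
        ("sort", "geodist() asc"),  # sort by the distance function
    ]
    return base_url + "?" + urlencode(params)

# Hypothetical host, field and coordinates, for illustration only.
nearby_url = nearby_query_url(
    "http://localhost:8983/solr/select", "*:*", "store", 45.15, -93.85, 5
)
print(nearby_url)
```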

Solr also got the new suggest/autocomplete functionality based on FST automaton which significantly reduced the memory needed for such functionality.  If you need this for your search application, have a look at Sematext’s AutoComplete – it has additional functionality that lots of our customers like.

While not yet officially released, the new transaction log support provides Solr with a real-time get operation – as soon as you add a document you can retrieve it by ID.  This will also be used for recovering nodes in SolrCloud.

And talking about SolrCloud…  We’ve covered SolrCloud on this blog before in Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search, and we’ll be covering it again soon.  In short, SolrCloud will make it easier for people to operate larger Solr clusters by making use of more modern design principles and software components such as ZooKeeper, that make creation of distributed, cluster-based software/services easier.  Some of the core functionality is that there will be no single point of failure, any node will be able to handle any operation, there will be no traditional master-slave setup, there will be centralized cluster management and configuration, failovers will be automatic and in general things will be much more dynamic.  SolrCloud has not been released yet, but Solr developers are working on it and the codebase is seeing good progress.  We’ve used SolrCloud in a few recent engagements with our customers and were pleased by what we saw.

After the development of the two projects was merged back in 2010, we saw a speed-up in development and releases. Lucene and Solr committers introduced five(!) new versions of both projects! In March, Lucene and Solr 3.1 were released with Unicode 4 support, ReusableTokenStream, Spatial search, Vector-based Highlighter, Extended Dismax parser, and many more features and bug fixes. Then, after less than 3 months(!), on June 4th, version 3.2 was released. This release introduced a new and much desired results grouping module, NRTCachingDirectory, and highlighting performance improvements. Just one month later, on July 1st, Lucene and Solr 3.3 were introduced. That release included the KStem stemmer, new Spellchecker implementations, Field Collapsing in Solr and RAM usage reduction for the autocomplete mechanism. By the end of summer there was another release: version 3.4, released on the 14th of September. Pure Lucene users got what Solr had been able to do for a very long time – the long awaited faceting module contributed by IBM. Version 3.4 also included the new Join functionality, the ability to turn off query and filter caches, and faceting calculation for Field Collapsing. The last release of Lucene and Solr saw the light of day in late November. Version 3.5.0 brought a huge memory reduction when dealing with term dictionaries, deep paging support, the SearcherManager and SearcherLifetimeManager classes, language identification provided by Tika, and sortMissingFirst and sortMissingLast support for TrieFields.

During the last 12 months we attended three major conferences focused on search and big data themes. Lucene Revolution took place in San Francisco in May. Otis gave a talk titled “Search Analytics: What? Why? How?” (slides) during the first day. There were a number of other good talks there and the complete conference agenda is available at http://lucenerevolution.com/2011/agenda. Some videos are available as well. Next came the Berlin Buzzwords conference, a more grass-roots conference which took place between 4th and 10th of June. Otis gave an updated version of his “Search Analytics: What? Why? How?” talk. If you want to know more, check the conference’s official site – http://berlinbuzzwords.de. The last conference, focused exclusively on Lucene and Solr, was Lucene Eurocon 2011 in sunny and tourist-filled Barcelona between 17th and 20th of October. And guess what – we were there again (surprise!), this time in slightly larger numbers. Otis gave a talk about “Search Analytics: Business Value & BigData NoSQL Backend” (video, slides) and Rafał gave a talk on a pretty popular topic - “Explaining & Visualizing Solr ‘explain’ information” (video, slides).

No open source project can endure without regular injections of new blood. This year, the Lucene and Solr development team was joined by a number of new people whose names may look familiar to you:

These 7 men are now Lucene and Solr committers and we look forward to our next year’s Year in Review post, where we hope to go over the good things these people will have brought to Lucene and Solr in 2012.

You know an open source project is successful when a whole book is dedicated to it.  You know a project is very successful when more than one book and more than one publisher cover it.  There were no new editions of Lucene in Action (amazon, manning) this year, but our own Rafał Kuć published his Solr 3.1 Cookbook (amazon) in July.  Rafał’s cookbook includes a number of recipes that can make your life easier when it comes to solving common problems with Apache Solr. Another book, Apache Solr 3 Enterprise Search Server (amazon) by David Smiley and Eric Pugh was published in November. This is a major update to the first edition of the book and it covers a wide range of functionalities of Apache Solr.

@sematext

Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search

As promised in Part 1 of Solr Digest, Spring-Summer 2011, in this Part 2 post we’ll summarize what’s new with Solr’s Near-Real-Time Search support and Solr Cloud (if you love clouds and search with some big data on the side, get in touch). Let’s first examine what is being worked on for Solr Cloud and what else is in the queue for the near future. A good overview of what is currently functional can be found in the old Solr Cloud wiki page. Also, there is now another wiki page covering New Solr Cloud Design, which we find quite useful.  The individual pieces of Solr Cloud functionality that are being worked on are as follows:

  • Work is still in progress on Distributed Indexing and Shard distribution policy. Patches exist, although they are now over 6 months old, so you can expect to see them updated soon.
  • As part of the Distributed Indexing effort, shard leader functionality deals with leader election and with publishing the information about which node is the leader of which shard in ZooKeeper in order to notify all interested parties.  The development is pretty active here and initial patches already exist.
  • At some point in the future, Replication Handler may become cloud aware, which means it should be possible to switch the roles of masters and slaves, master URLs will be able to change based on cluster state, etc. The work hasn’t started on this issue.
  • Another feature Solr Cloud will have is automatic splitting and migrating of indices. The idea is that when some shard’s index becomes too large or the shard itself starts having bad query response times, we should be able to split off parts of that index and migrate them to (or merge them with) indices on other (less loaded) nodes. Again, the work on this hasn’t started yet.  Once this is implemented one will be able to split and move/merge indices using a Solr Core Admin action as described in SOLR-2593.
  • To achieve more efficiency in search and gain control over where exactly each document gets indexed to, you will be able to define a custom shard lookup mechanism. This way, you’ll be able to limit execution of search requests to only some shards that are known to hold target documents, thus making the query more efficient and faster.  This, along with the above mentioned shard distribution policy, is akin to routing functionality in ElasticSearch.
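The custom shard lookup idea in the last bullet can be illustrated with a small, hypothetical sketch. This is not Solr Cloud's implementation (that work was still in progress); it only shows how a stable hash of a routing key lets both indexing and searching agree on which shard holds a document. The function names and the `customer-42` style keys are our own inventions.

```python
import zlib

def shard_for(routing_key, num_shards):
    """Map a routing key to a shard number in a stable, repeatable way."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards

def shards_to_query(routing_keys, num_shards):
    """Shards that can hold documents for the given routing keys."""
    return sorted({shard_for(key, num_shards) for key in routing_keys})

# Hypothetical routing keys: all documents of one customer share a key,
# so a search for that customer only needs to visit one shard.
index_shard = shard_for("customer-42", 4)
search_shards = shards_to_query(["customer-42"], 4)
print(index_shard, search_shards)
```

A search restricted to one customer then touches a single shard instead of fanning out to all four.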

On to NRT:

  • There is now a new wiki page dedicated to Solr NRT Search. In short, NRT Search will be available in Solr 4.0 and the work currently in progress is already available on the trunk. The first new functionality that enables NRT Search in Solr is called “soft-commit”.  A soft commit is a light version of a regular commit, which means that it avoids the costly parts of a regular commit, namely the flushing of documents from memory to disk, while still allowing searches to see new documents. It appears that a common way of using this will be having a soft-commit every second or so, to make Solr behave as NRT as possible, while also having a “hard-commit” automatically every 1-10 minutes. “Hard-commit” will still be needed so the latest index changes are persisted to the storage. Otherwise, in case of a crash, changes since the last “hard-commit” would be lost.
  • Initial steps in supporting NRT Search in Solr were done in Re-architect Update Handler. Some old issues Solr had were dealt with, like waiting for background merges to finish before opening a new IndexReader, blocking of new updates while a commit is in progress and a problem where it was possible that multiple IndexWriters were open on the same index. The work was done on the solr2193 branch and that is the place where the spinoffs of this issue will continue to move Solr even closer to NRT.
  • One of the spinoffs of the Update Handler rearchitecture is SOLR-2565, which provides further improvements on the above mentioned issue.  New issues to deal with other related functionality will be opened along the way, while SOLR-2566 looks to serve as an umbrella issue for NRT Search in Solr.
  • Partially related to NRT Search is the new Transaction Log implemented in Solr under SOLR-2700. The goal is to provide durability of updates, while also supporting features like the already committed Realtime get.  Transaction logs are implemented in various other search solutions such as ElasticSearch and Zoie, so Simon Willnauer started a good thread about the possibility of generalizing this new Transaction Log functionality so that it is not limited to Solr, but exposed to other users and applications, such as Lucene, too.
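The difference between the two kinds of commit described above can be shown with a toy model. It is a plain Python sketch of the semantics only (visibility versus durability), not of how Solr implements them, and the class and method names are invented for the example.

```python
class ToyIndex:
    """Toy model of soft vs. hard commits; not Solr's implementation."""

    def __init__(self):
        self.pending = []     # added, not yet visible to searches
        self.searchable = []  # visible to searches, held in memory
        self.durable = []     # flushed to stable storage

    def add(self, doc):
        self.pending.append(doc)

    def soft_commit(self):
        # Cheap: make new documents visible without flushing to disk.
        self.searchable.extend(self.pending)
        self.pending.clear()

    def hard_commit(self):
        # Expensive: make everything visible and persist it.
        self.soft_commit()
        self.durable = list(self.searchable)

    def crash_and_restart(self):
        # Only what reached stable storage survives.
        self.pending = []
        self.searchable = list(self.durable)

index = ToyIndex()
index.add("doc1")
index.hard_commit()
index.add("doc2")
index.soft_commit()
visible_before_crash = list(index.searchable)
index.crash_and_restart()
visible_after_crash = list(index.searchable)
print(visible_before_crash, visible_after_crash)
```

In this model `doc2` is searchable right after the soft commit but is lost in the crash, which is the gap a transaction log is meant to close.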

We hope you found this post useful.  If you have any questions or suggestions, please leave a comment, and if you want to follow us, we are @sematext on Twitter.

Solr Digest, Spring-Summer 2011, Part 1

No, Solr Digests are not dead, we’ve just been crazily busy at Sematext (yes, we are hiring!). Since our last Solr Digest not one, but 2 new Solr releases have been made: 3.2 in June, 3.3 in July and version 3.4 is imminent – voting is already in progress, so you can expect a new release pretty soon. Also, there were a number of interesting developments on the trunk (future 3.x and 4.0 versions). Therefore, we will be publishing two Solr Digests this time. This first Digest covers general developments in Solr world, while the sequel will be more focused on two features drawing a lot of attention: Solr Cloud and Near Real Time search.

Let’s get started with a short overview of announced news in 3.2 and 3.3. First, 3.2 brought us:

  • Ability to specify overwrite and commitWithin as request parameters when using the JSON update format
  • TermQParserPlugin, useful when generating filter queries from terms returned by field faceting or terms component
  • DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString
  • Improvements to the UIMA and Carrot2 integrations
  • Highlighting performance improvements
  • A test-framework jar for easy testing of Solr extensions
  • Bugfixes and improvements from Apache Lucene 3.2
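As an illustration of the first item in the 3.2 list, the sketch below prepares a JSON update request with `commitWithin` and `overwrite` passed as request parameters. Those two parameter names come from the release notes above; the host, the `/update/json` path and the shape of the JSON body are assumptions made for the example, and nothing is actually sent over the network.

```python
import json
from urllib.parse import urlencode

def json_update_request(base_url, docs, commit_within_ms, overwrite=True):
    """Prepare (url, body, headers) for a JSON update; does not send it."""
    params = urlencode([
        ("commitWithin", str(commit_within_ms)),  # commit within N ms
        ("overwrite", "true" if overwrite else "false"),
    ])
    url = f"{base_url}?{params}"
    body = json.dumps(docs)
    headers = {"Content-Type": "application/json"}
    return url, body, headers

# Hypothetical host, path and document, for illustration only.
update_url, update_body, update_headers = json_update_request(
    "http://localhost:8983/solr/update/json",
    [{"id": "doc1", "title": "hello"}],
    commit_within_ms=5000,
    overwrite=False,
)
print(update_url)
```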

With 3.3 we got:

  • Grouping / Field Collapsing
  • A new, automaton-based suggest/autocomplete implementation offering an order of magnitude smaller RAM consumption
  • KStemFilterFactory, an optimized implementation of a less aggressive stemmer for English
  • Solr defaults to a new, more efficient merge policy (TieredMergePolicy). See Mike’s cool Lucene segment merging video
  • Important bugfixes, including extremely high RAM usage in spellchecking
  • Bugfixes and improvements from Apache Lucene 3.3

Let’s now look at other interesting stuff. We’ll start with DataImportHandler and its bug fixes. As you’ll notice, there are quite a few of them (and we didn’t even list them all!) so we advise using all available patches.

Already committed features

  • A replication bug-fix – “replication reserves commit-point forever if using replicateAfter=startup”. SOLR-2469 brought a fix to version 3.2 and future 4.0 (trunk). This problem caused unnecessary (and huge) buildup in the number of index files on the slaves.
  • Another bug-fix for DataImportHandler – DIH does not commit if only Deletes are processed. When using special commands $deleteDocById and/or $deleteDocByQuery, when there were no updates of documents, commit wasn’t called by the DIH. Fix is available in 3.4 and 4.0.
  • Also – DataImportHandler multi-threaded option throws exception. The problem would happen when the threads attribute was used. The fix for this is available in 3.4 and 4.0. Related to this is another fixed issue – DIH multi threaded mode does not resolves attributes correctly, also available in 3.4 and 4.0.
  • Join feature got committed to the trunk (future 4.0 version). It can also perform cross-core joins now, which can be very useful. However, this feature also initiated some heated discussions which can be seen in SOLR-2272. The root cause was the fact that this feature was committed only to Solr while Lucene got none of it. Of course, it might get refactored and included in Lucene too in the future, but this discussion shows the divisions which still existed between Solr and Lucene communities back then.
  • While we’re talking about Join feature, it might be worth mentioning a patch in SOLR-2604 which back-ports it to 3.x version. Be careful though, it was created for version 3.2 more than two months ago, so a few more adjustments after applying this patch might be needed.
  • Function Queries got new if(), exists(), and(), or(), not(), xor() and def() functions. The fix is committed to trunk so you’ll be able to use it in 4.0.
  • As can be seen from the Solr 3.3 announcement, one of the longest living Solr issues is finally closed for good :). SOLR-236 – Field Collapsing – along with SOLR-2524 finally bring field collapsing to 3_x and future 4.0 versions.
  • Since grouping/field collapsing was added to Solr, we should be able to use faceting in combination with it. Issue SOLR-2665 – Solr Post Group Faceting – brought exactly that to 3.4 and 4.0.
  • Ever wanted to have more control over what gets stored in the cache? SOLR-2429 will bring exactly that starting with the next Solr release – 3.4. It is simple to use, just add cache=false to your queries like this: fq={!frange l=10 u=100 cache=false}mul(popularity,price).  Note that with this new functionality you can prevent either a filter or a query from being cached, while document caching still remains out of request-time control.
  • If you’re using JMX to observe the state of your Solr installation, you might have encountered a problem when reloading Solr cores – it appears that JMX beans didn’t survive those reloads in the past versions. The fix is created and is available in future 3_x and trunk releases.
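For the cache=false item in the list above, a tiny helper shows how such a filter string is put together from its parts. The `{!frange l=10 u=100 cache=false}mul(popularity,price)` filter is the one quoted in that bullet; the helper function itself is a hypothetical convenience we wrote for the example, not part of any Solr client.

```python
def local_params_filter(parser, expression, **local_params):
    """Format a filter query as {!parser key=value ...}expression."""
    parts = [parser] + [f"{key}={value}" for key, value in local_params.items()]
    return "{!" + " ".join(parts) + "}" + expression

# Rebuilds the filter from the bullet above; keyword order is preserved.
uncached_fq = local_params_filter(
    "frange", "mul(popularity,price)", l=10, u=100, cache="false"
)
print(uncached_fq)
```

Filters like this one, whose bounds change from request to request, are typical candidates for skipping the cache.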

Interesting features in development

  • To achieve case-insensitive search with wildcard queries you could use a patch supplied under issue SOLR-2438. It has to be said that this isn’t committed to svn and it is hard to say whether it ever will be since there is a similar issue SOLR-219 on which work started 4 years ago.
  • Multithreaded faceting might bring some performance improvements. At the moment, an initial patch exists, but more work will be needed here and it still isn’t clear how big an improvement we could expect in real-world conditions, but it is worth keeping an eye on this issue.
  • We all know that Solr’s Spatial support has its limitations. One of them is specifying a bounding box which isn’t based on point distance, effectively making it limited to a circular shape. Under SOLR-2609 we might get support for exactly this.
  • For anyone interested in which direction Spatial support might evolve, we suggest checking Lucene Spatial Playground. It continues the great work done in SOLR-2155 which provided extension to initial GeoSpatial support in Solr by adding multivalued spatial fields. At some point, SOLR-2155 might get the goodness from LSP. Also, another thing to check would be a thread on Lucene Spatial Future.

Interesting new features

  • Support for Lucene’s Surround Parser is added to Solr in issue SOLR-2703. The patch is already committed to the trunk.
  • Solr will get the ability to use configuration like analyzer type=”phrase”. Lucene’s Query Parsers recently got a simpler way to use a different analyzer based on the query string. One example is the usage of double quotes, where one can decide that instead of the current meaning in the Lucene/Solr world – specifying a phrase to be searched for – it should have a meaning like in Google’s search engine – find this exact wording. A patch for this exists and can be applied on the trunk (it depends on Lucene trunk).
  • SOLR-2593 aims to provide a new Solr core admin action – ‘split’ – for splitting index. It would be used in case some core got too big or in any other case you might find it necessary.  Lucene already has a similar function.

Miscellaneous

  • Oracle released Java 7 about a month ago, but we advise against using it yet. JVM crashes and index corruption are issues likely to be encountered with it. For more information, visit this URL
  • As anticipated for some time, Java 5 support got axed from Lucene 4.0 (trunk). You can expect similar stuff for Solr too.
  • Solr’s build system has been reworked now. Among other things, this implies changes in directory structure in Solr project. For example, solr/src/ doesn’t exist any more and its old subdirs /java and /test are now in solr/core/. The changes are already applied to the trunk and 3_x which holds the next 3.4 version. For more details, see SOLR-2452.
  • A handy Solr architecture diagram can be found in this ML thread
  • Solr’s Admin UI is being refreshed with the work in JIRA issue SOLR-2399 (we already wrote about it) and its spin-off SOLR-2667. Some of this stuff is already committed (on the trunk), so you may want to inspect the changes. More details can be found in the wiki where you can also get a sneak peek of the upcoming changes.

And that would be all for part one of the Solr Spring-Summer 2011 Digest edition from @sematext. Part two of the Spring-Summer Digest is coming in a few days – stay tuned!

Solr Digest, February-March 2011

We Sematexters have been very busy over the past few months, so we missed Solr’s February Digest. This one will therefore be a bit longer than usual.  Let’s get started…

First, some major news: Solr 3.1 is officially released! The details of the announcement can be found here. We covered most of the new features in our digests already, so we’ll keep it short:

You can start your download :).

Already committed features

  • post.jar got improved – JIRA issue improve post.jar to handle non UTF-8 files removed some of its very old limitations
  • The Jetty server included in the Solr distribution didn’t support UTF-8. This is now solved and the fresh 3.1 version already contains the fix

Interesting features in development

  • as part of SolrCloud, distributed indexing is being implemented in JIRA issue SOLR-2358. You can already see the work in progress in the initial patch, but you can also check SOLR-2341 which deals with shard distribution policies which will be available in Solr 4.0
  • If you ever wanted to add custom fields (not existing in the index) to Solr responses, you couldn’t have done that from Solr components. There were other ways to achieve such functionality (for instance, customizing the response writer class), but it looks like we’ll get such ability inside of components, too. No need to say how much more natural that would be. Anyway, issue Allow components to add fields to outgoing documents provides the umbrella for this new functionality. Although it is already closed, there are a few sub-issues in which the actual pieces of logic will be implemented.
  • if you have problem with case sensitive searches in wildcard queries, you might take a look at a patch provided in JIRA issue Case Insensitive Search for Wildcard Queries
  • although Solr got its first solid spatial implementation in version 3.1, many people found its limitations. One of them is surely a case where documents have multivalued spatial fields. We already wrote about SOLR-2155 in our December digest, but work under that issue hasn’t stopped and keeps evolving. It is likely that it will become a part of the standard Solr distribution and Lucene could get it incorporated, too. If you need spatial search you may want to watch this issue.

Interesting new features

  • one common problem when using Solr’s default spellchecker or auto-suggest is filtering of suggestions based on what some user can see (for instance, depending on the region in which your user resides). JIRA issue Doc Filters for Auto-suggest, spell checking, terms component, etc. proposes a feature which would help here. Currently, no work has been done there, though we believe we’ll get to see some progress in the future. While we are at it, in case you need such a feature in Auto-suggest now, you might take a look at our in-house Search Auto-Complete solution, which you can see in action on search-lucene.com and search-hadoop.com.
  • just like there are default components for SearchHandlers (which are used by default for every new search handler, unless overridden), update processors will get a similar feature. JIRA issue Let some UpdateProcessors be default without explicitly configuring them will take care that some important update processors are available by default to your UpdateRequestProcessorChain.
  • one great new feature could be added to Solr – ability not to cache a filter. JIRA issue SOLR-2429 will deal with this. Many Solr users will be happy to optimize their cache performance when this feature is available some day.

Miscellaneous

  • some interesting thoughts on spellchecker can be found in ML thread My spellchecker experiment and much more on that topic in the related blog
  • should you use ASCIIFoldingFilter or MappingCharFilter when dealing with accents? Interesting discussion in thread Should ASCIIFoldingFilter be deprecated? could help you decide which one is right for you
  • interesting idea for Solr’s admin UI can be found in this ML thread. Community’s reception was very good so we also got Solr Admin Interface, reworked issue as the home for this new work.
  • anyone using Solr’s UIMA (Unstructured Information Management Architecture) contrib might be interested to know that its wiki page got a major improvement – more docs to read!
  • we might be a bit late on this, but there is still some time left – Google’s Summer of Code applications can be submitted until 8th April. Check this ML thread for some detail.  And don’t forget that Sematext is sponsoring interns, too!
  • new Solr/Lucene users should take a look at the Refcard provided by Erik Hatcher in ML thread [infomercial] Lucene Refcard at DZone
  • some deep thoughts on Solr/Lucene’s release process by some of the key people can be found here Brainstorming on Improving the Release Process. Related to that is a JIRA issue Define Test Plan for 4.0 which will… eh, contain some info about Test plan for 4.0 release, obviously. Also, check the TestPlans wiki page that’s in the making.

Although there were some other interesting topics, we have to stop somewhere. Until next month, you’ll find us on Twitter.

Solr Digest, January 2011

Welcome to the second season of Sematext’s monthly Solr Digests. Once again, we compiled a list of the most interesting topics in the Solr world for the previous month:

Already committed features

  • A bug related to using PHPSerialized response writer in a sharded environment was fixed and committed in SOLR-2307. It affected all recent Solr versions (trunk, 3_x, 1.4.1,…) and the fix is committed to the 3_x branch and trunk. In case you’re stuck with an older version of Solr, you can try applying the patch manually; it should be doable.
  • One old JIRA issue Enable sorting by Function Query is finally closed and committed to 3_x and trunk.
  • A problem with a race condition in StreamingUpdateSolrServer got its fixes before, however it appears that the issue wasn’t fixed completely. Now another fix is committed to 3_x and trunk, so if you use this feature, we advise picking up the fix.

Interesting features in development

  • Support for complex syntax (e.g. wildcards) in phrase queries is being brought to Lucene. In case you’re interested, you can take a look at LUCENE-1823 or LUCENE-1486 which was another try at similar functionality. These issues have been in development for a long time and still aren’t finished, although patches exist. Similar feature for Solr is developed under SOLR-1604, where you can also find some patches. However, we think it is a bit unclear if any of these issues will ever be committed to Lucene/Solr, so if you’re interested, check the progress on them occasionally and don’t hold your breath.

Interesting new features

Miscellaneous

And that’s all for January.

Solr Digest, December 2010

Just the other day, we posted the Lucene & Solr highlights in our Lucene & Solr: 2010 in Review post, and now it’s time to really conclude 2010 in the Solr world with the December Solr Digest. Although one might expect the festive period to take its toll on the Solr development velocity, it wasn’t like that at all. Open source never sleeps.  Here are the most interesting highlights:

Interesting features in development

  • In our July Digest we mentioned the LanguageIdentifierUpdateProcessor feature which is being proposed under JIRA issue SOLR-1979. Some artifacts in the form of patches are starting to appear attached to that issue, so if you’re interested in this feature, take a look.
  • Solr’s spatial capabilities are being further refined. With issue SOLR-2268, Solr will get “support for Point in Polygon searches”. This should enable features like “for a given point, return all documents which contain a polygon inside of which that point lies” and “for a given polygon, return all documents which have a point contained inside of that polygon”. Of course, negated versions of such features will be supported. The work is in early stages, one patch is attached, but it can be used only as a general pointer about how this thing will be implemented, nothing else.
  • Support for the “ColognePhonetic” encoder is being added to PhoneticFilterFactory. Since “ColognePhonetic” will be added to Commons Codec 1.4.1, the patch provided in SOLR-2276 will wait until that version gets released.
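The point-in-polygon test mentioned in the list above is a classic geometry problem, and a minimal version of it can be sketched with the ray-casting technique. This is our own illustration in plain Python, not the code from the SOLR-2268 patch, and it treats coordinates as flat x/y values, ignoring the complications of real geography. Points that lie exactly on an edge are not handled specially.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: does `point` fall inside `polygon`?

    polygon is a list of (x, y) vertices in order; the last vertex
    is implicitly connected back to the first.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
docs_with_points = {"inside_doc": (2, 2), "outside_doc": (5, 2)}
matches = sorted(
    doc_id for doc_id, pt in docs_with_points.items()
    if point_in_polygon(pt, square)
)
print(matches)
```

The last few lines mimic the "for a given polygon, return all documents which have a point inside it" case from the bullet.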

Interesting new features

  • Solr is getting JOIN functionality – sort of. As part of SOLR-2272, Solr got a working patch that provides SQL JOIN-like functionality. Of course, this is not exactly the JOIN you might know from SQL, but it is probably the closest thing to it which can be implemented in Solr. It is likely that this feature will be integrated into Lucene as well – it makes no sense to have it strictly in Solr. It is also likely that this feature will be expanded in the future; currently it has only one algorithm and supports the many-to-many type of JOIN.
  • As part of SolrCloud, a new feature, SolrCloud distributed indexing, will be added to Solr some day. SOLR-2293 will likely be the JIRA home for this feature. Before SolrCloud, anyone using distributed indexing had to create custom logic which handled distribution of documents over various shards in the cluster. With SolrCloud, this will be transparent to the clients. Also, SolrCloud will include some out-of-the-box distribution algorithms, while, of course, plugging in custom algorithms will be easy to accomplish. However, don’t hold your breath waiting for this feature. At the moment, only a JIRA issue (and some guidelines in the Wiki) exists for this feature.
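To show what the JOIN-like behaviour from the first bullet means in practice, here is a toy model in plain Python. It is a sketch of the semantics only, under our own assumptions, and not the SOLR-2272 code: documents matching an inner query contribute the values of a "from" field, and the result is every document whose "to" field contains one of those values. The documents and field names are invented.

```python
def join(docs, from_field, to_field, matches):
    """Return docs whose to_field shares a value with the from_field
    of any document accepted by `matches`."""
    def values(doc, field):
        value = doc.get(field, [])
        return value if isinstance(value, list) else [value]

    keys = {v for d in docs if matches(d) for v in values(d, from_field)}
    return [d for d in docs if keys.intersection(values(d, to_field))]

# Made-up documents: products point at their maker through maker_id.
catalog = [
    {"id": "p1", "type": "product", "maker_id": "m1"},
    {"id": "p2", "type": "product", "maker_id": "m2"},
    {"id": "m1", "type": "maker", "country": "DE"},
    {"id": "m2", "type": "maker", "country": "US"},
]
german_products = join(
    catalog, "id", "maker_id", lambda d: d.get("country") == "DE"
)
print([d["id"] for d in german_products])
```

In Solr's query syntax the same request would look roughly like `{!join from=id to=maker_id}country:DE`. Because both fields may be multivalued, the relation is many-to-many, as the bullet notes.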

Miscellaneous

  • Solr’s Jetty has been upgraded to the latest 6.1.26 version. The change was committed to the 3_x branch and trunk but, of course, that doesn’t mean you have to upgrade your Jetty too. It just means that the files (various jars and a few XML files) under /solr/example got upgraded. Details of this change can be found in SOLR-2265.
  • Have you experienced any problems with DataImportHandler and its multi-threaded option? If so, you are not the only one. More details about the nature of the problem can be found in SOLR-2186, while SOLR-2233 appears to already contain a patch that might help in the case of a JDBC data source. That patch also contains a few other DataImportHandler multithreading fixes. Nothing related to this issue has been committed to trunk or the 3_x branch yet.
  • Many problems people encounter with Solr are related to OutOfMemory errors. There were many interesting discussions on the ML in December, and two threads on that topic stand out as worth reading: “OutOfMemory GC: GC overhead limit exceeded – Why isn’t WeakHashMap getting collected?” and “Memory use during merges (OOM)”.
  • If you have found the “bf” parameter somewhat limited (it doesn’t accept complex nested expressions with lots of whitespace), you might find the patch from SOLR-2014 useful. It appears that this is a common problem: another similar bug report (although this actually isn’t a bug) was also opened recently, SOLR-2267. However, keep in mind that the “bf” parameter will likely be deprecated in the future, since “bq” can be used to achieve everything “bf” can and more.
  • In our November Digest, we mentioned SOLR-2154 and Solr’s problem with multi-valued spatial fields. What we missed was SOLR-2155, which actually provides a patch for such problems. Although it isn’t committed yet, the patch appears to be functional, so give it a try.
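
The whitespace limitation of “bf” mentioned above is easier to see with a small example. The sketch below assumes that the parameter value is treated as a whitespace-separated list of functions, which is what the limitation suggests; it is not the SOLR-2014 patch, and the field names (popularity, created) are hypothetical. It contrasts a naive whitespace split with a split that ignores whitespace inside parentheses.

```python
from typing import List


def split_naive(bf: str) -> List[str]:
    """Split on any whitespace, which breaks functions containing spaces."""
    return bf.split()


def split_paren_aware(bf: str) -> List[str]:
    """Split on whitespace only when not inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in bf:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch.isspace() and depth == 0:
            # Top-level whitespace ends the current function
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


bf = "sum(popularity, 1)^2 recip(rord(created), 1, 1000, 1000)"
print(len(split_naive(bf)))    # 6 broken fragments
print(split_paren_aware(bf))   # 2 intact functions
```

Until something like this is available, the practical workaround is to write boost functions without any spaces inside them.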

And that would be all for Solr in 2010.  We’ll be back with the new Solr Digest in a month.  Follow @sematext for other interesting search news.

Solr Digest, November 2010

It is time for the last Solr Digest of 2010; the next Digest will be published some time in January 2011. This was not a month with too many interesting developments, so we bring to your attention only a few of the more interesting bits. Here we go…

Already committed features

  • Anyone working with the Polish language will be happy to hear that a factory for the Polish stemmer has been committed to 3_x and trunk.

Interesting features in development

Miscellaneous

  • A fix for a feature that was committed earlier this year, Enable sorting by Function Query, is close to being committed. This is a big one! There were some problems with the feature: functions weren’t weighted, the function query wasn’t being parsed properly, some deprecated bits of code were used, etc. A patch is already posted, so if you are eager to use this functionality you can start by applying the patch yourself.
  • Many people are using the Spatial Search features recently introduced in Solr. If you’re considering that too, be careful about one limitation: there is no Spatial support for multi-valued fields. So, if you have multi-valued spatial fields and you’d like to do some sorting on them, you’ll end up with incorrect results. This feature can be found in some other search tools, though, like Elastic Search, so Solr might be getting it too some day. You can check whether there is any progress on this in JIRA issues like SOLR-2154.
  • There is a major bug in DataImportHandler: it doesn’t release JDBC connections. It appears that this issue isn’t related to any particular database, so this is an obvious bug in DIH. Check this JIRA issue for updates.
  • If you prefer git over svn, you might be interested in the git repository recently set up for Solr. Check this ML thread to learn more about it.
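
To illustrate why sorting on multi-valued spatial fields needs special handling, here is a minimal sketch of one possible definition: order documents by the distance from the query point to the nearest of each document’s points. This is only an assumption about what sensible behaviour could look like, not a description of Solr or of any patch; the function names and sample documents are made up.

```python
import math
from typing import Dict, List, Tuple

LatLon = Tuple[float, float]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two lat/lon points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def sort_by_nearest_point(docs: Dict[str, List[LatLon]], lat: float, lon: float) -> List[str]:
    """Order doc ids by the distance of their closest point to (lat, lon)."""
    def nearest(points: List[LatLon]) -> float:
        return min(haversine_km(lat, lon, p_lat, p_lon) for p_lat, p_lon in points)
    return sorted(docs, key=lambda doc_id: nearest(docs[doc_id]))


# Hypothetical documents, each with one or more (lat, lon) values
docs = {
    "chain-store": [(40.71, -74.00), (34.05, -118.24)],  # New York, Los Angeles
    "local-shop": [(41.88, -87.63)],                      # Chicago
}
print(sort_by_nearest_point(docs, 34.0, -118.0))  # ['chain-store', 'local-shop']
```

A sort that considers only one of the values of a multi-valued field (for example, only the New York location above) would rank the chain store behind the Chicago shop for a query near Los Angeles, which is the kind of incorrect result the limitation leads to.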

So long until 2011, Solr Digest readers!  Follow @sematext on Twitter for other stuff from Sematext.
