Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search

As promised in Part 1 of Solr Digest, Spring-Summer 2011, in this Part 2 post we’ll summarize what’s new with Solr’s Near-Real-Time Search support and Solr Cloud (if you love clouds and search with some big data on the side, get in touch). Let’s first examine what is being worked on for Solr Cloud and what else is in the queue for the near future. A good overview of what is currently functional can be found in the old Solr Cloud wiki page. Also, there is now another wiki page covering New Solr Cloud Design, which we find quite useful.  The individual pieces of Solr Cloud functionality that are being worked on are as follows:

  • Work is still in progress on Distributed Indexing and Shard distribution policy. Patches exist, although they are now over 6 months old, so you can expect to see them updated soon.
  • As part of the Distributed Indexing effort, shard leader functionality deals with leader election and with publishing the information about which node is a leader of which shard and in Zookeeper in order to notify all interested parties.  The development is pretty active here and initial patches already exist.
  • At some point in the future, Replication Handler may become cloud aware, which means it should be possible to switch the roles of masters and slaves, master URLs will be able to change based on cluster state, etc. The work hasn’t started on this issue.
  • Another feature Solr Cloud will have is automatic Spliting and migrating of Indices. The idea is that when some shard’s index becomes too large or the shard itself starts having bad query response times, we should be able to split parts of that index and migrate it (or merge) with indices on other (less loaded) nodes. Again, the work on this hasn’t started yet.  Once this is implemented one will be able to split and move/merge indices using a Solr Core Admin as described in SOLR-2593.
  • To achieve more efficiency in search and gain control over where exactly each document gets indexed to, you will be able to define a custom shard lookup mechanism. This way, you’ll be able to limit execution of search requests to only some shards that are known to hold target documents, thus making the query more efficient and faster.  This, along with the above mentioned shard distribution policy, is akin to routing functionality in ElasticSearch.

On to NRT:

  • There is a now a new wiki page dedicated to Solr NRT Search. In short, NRT Search will be available in Solr 4.0 and the work currently in progress is already available on the trunk. The first new functionality that enables NRT Search in Solr is called “soft-commit”.  A soft commit is a light version of a regular commit, which means that it avoids the costly parts of a regular commit, namely the flushing of documents from memory to disk, while still allowing searches to see new documents. It appears that a common way of using this will be having a soft-commit every second or so, to make Solr behave as NRT as possible, while also having a “hard-commit” automatically every 1-10 minutes. “Hard-commit” will still be needed so the latest index changes are persisted to the storage. Otherwise, in case of crash, changes since last “hard-commit” would be lost.
  • Initial steps in supporting NRT Search in Solr were done in Re-architect Update Handler. Some old issues Solr had were dealt with, like waiting for background merges to finish before opening a new IndexReader, blocking of new updates while commit is in progress and a problem where it was possible that multiple IndexWriters were open on the same index. The work was done on solr2193 branch and that is the place where the spinoffs of this issue will continue to move Solr even closer to NRT.
  • One of the spinoffs of the Update Handler rearchitecture is SOLR-2565, which provides further improvements on the above mentioned issue.  New issues to deal with other related functionality will be opened along the way, while SOLR-2566 looks to serve as an umbrella issue for NRT Search in Solr.
  • Partially related to NRT Search is the new Transaction Log implemented in Solr under SOLR-2700. The goal is to provide durability of updates, while also supporting features like the already committed Realtime get.  Transaction logs are implemented in various other search solutions such as ElasticSearch and Zoie, so Simon Willnauer started a good thread about the possibility of generalizing this new Transaction Log functionality so that it is not limited to Solr, but exposed to other users and applications, such as Lucene, too.

We hope you found this post useful.  If you have any questions or suggestions, please leave a comment, and if you want to follow us, we are @sematext on Twitter.

Follow

Get every new post delivered to your Inbox.

Join 1,564 other followers