The New SolrCloud: Overview

Just the other day we wrote about Sensei, the new distributed, real-time full-text search database built on top of Lucene and here we are again writing about another “new” distributed, real-time, full-text search server also built on top of Lucene: SolrCloud.

In this post we’ll share some interesting SolrCloud bits and pieces that matter mostly to those working with large data and query volumes, but that all search lovers should find interesting, too. If you have any questions about what we wrote (or did not write!) in this post, please leave a comment – we’re good at following up on comments! Or just ask @sematext!

Please note that the functionality described in this post is now part of trunk in the Lucene and Solr SVN repository. This means that it will be available when Lucene and Solr 4.0 are released, but you can also use the trunk version just like we did, if you don’t mind living on the bleeding edge.

Recently, we were given the opportunity to once again combine big data (massive may actually be more descriptive of this data volume) stored in an HBase cluster with search. We needed to design a scalable search cluster capable of elastically handling future data volume growth. Because of the huge data volume and high search rates, our search system required the index to be sharded. We also wanted the indexing to be as simple as possible, and we wanted a stable, reliable, and very fast solution. The one thing we did not want to do was reinvent the wheel. At this point you may ask why we didn’t choose ElasticSearch, especially since we use ElasticSearch a lot at Sematext. The answer is that we started the engagement with this particular client a whiiiiile back, when ElasticSearch wasn’t where it is today. And while ElasticSearch does have a number of advantages over the old master-slave Solr, with SolrCloud being in the trunk now, Solr is again a valid choice for very large search clusters.

And so we took the opportunity to use SolrCloud and some of its features not present in previous versions of Solr.  In particular, we wanted to make use of Distributed Indexing and Distributed Searching, both of which SolrCloud makes possible. In the process we looked at a few JIRA issues, such as SOLR-2358 and SOLR-2355, and we got familiar with relevant portions of SolrCloud source code.  This confirmed SolrCloud would indeed satisfy our needs for the project and here we are sharing what we’ve learned.

Our Search Cluster Architecture

Basically, we wanted the search cluster to look like this:

[Diagram: SolrCloud App Architecture]

Simple? Yes, we like simple.  Who doesn’t!  But let’s peek inside that “Solr cluster” box now.

SolrCloud Features and Architecture

Some of the nice things about SolrCloud are:

  • centralized cluster configuration
  • automatic node fail-over
  • near real time search
  • leader election
  • durable writes

Furthermore, SolrCloud can be configured to:

  • have multiple index shards
  • have one or more replicas of each shard

Shards and Replicas are arranged into Collections. Multiple Collections can be deployed in a single SolrCloud cluster.  A single search request can search multiple Collections at once, as long as they are compatible. The diagram below shows a high-level picture of how SolrCloud indexing works.

[Diagram: SolrCloud Shards, Replicas, Replication]

As the above diagram shows, documents can be sent to any SolrCloud node/instance in the SolrCloud cluster.  Documents are automatically forwarded to the appropriate Shard Leader (labeled as Shard 1 and Shard 2 in the diagram). This is done automatically and documents are sent in batches between Shards. If a Shard has one or more replicas (labeled Shard 1 replica and Shard 2 replica in the diagram) a document will get replicated to one or more replicas.  Unlike in traditional master-slave Solr setups where index/shard replication is performed periodically in batches, replication in SolrCloud is done in real-time.  This is how Distributed Indexing works at the high level.  We simplified things a bit, of course – for example, there is no ZooKeeper or overseer shown in our diagram.
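To make the flow above a bit more concrete, here is a toy, in-memory model of it. This is only a sketch under loud assumptions: the class names (`Shard`, `ToyCluster`) are invented for illustration, `zlib.crc32` merely stands in for whatever hash function SolrCloud really uses to pick a shard, and nothing here talks to a real Solr node or to ZooKeeper.

```python
import zlib


class Shard:
    """One logical shard: a leader index plus zero or more replica indexes."""

    def __init__(self, name, replica_count=1):
        self.name = name
        self.leader = {}  # doc id -> document
        self.replicas = [{} for _ in range(replica_count)]

    def index(self, doc):
        # The leader indexes first, then pushes the document to every replica
        # right away (no periodic batch replication as in master-slave Solr).
        self.leader[doc["id"]] = doc
        for replica in self.replicas:
            replica[doc["id"]] = doc


class ToyCluster:
    def __init__(self, num_shards=2, replicas_per_shard=1):
        self.shards = [Shard(f"shard{i + 1}", replicas_per_shard)
                       for i in range(num_shards)]

    def shard_for(self, doc_id):
        # Assumption: a stable hash of the document id, modulo the shard count.
        h = zlib.crc32(doc_id.encode("utf-8"))
        return self.shards[h % len(self.shards)]

    def add(self, doc):
        # Any node may receive the document; it is forwarded to the shard leader.
        shard = self.shard_for(doc["id"])
        shard.index(doc)
        return shard.name


cluster = ToyCluster(num_shards=2, replicas_per_shard=1)
for i in range(10):
    cluster.add({"id": f"doc-{i}", "title": f"Document {i}"})
```

The point of the sketch is that the caller never chooses a shard: the same id always hashes to the same shard, and every replica of that shard ends up with the same documents as its leader.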

Setup Details

All configuration files are stored in ZooKeeper. If you are not familiar with ZooKeeper, you can think of it as a distributed file system where SolrCloud configuration files are stored. When the first Solr instance in a SolrCloud cluster is started, configuration files need to be sent to ZooKeeper and one needs to specify how many shards there should be in the cluster. Then, once this Solr instance/node is running, one can start additional Solr instances/nodes and point them to the ZooKeeper instance (ZooKeeper is actually typically deployed as a quorum of 3, 5, or more instances in production environments). And voilà – the SolrCloud cluster is up! I must say, it’s quite simple and straightforward.
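The startup sequence described above can be sketched roughly as follows. The system properties used here (`zkHost`, `bootstrap_confdir`, `collection.configName`, `numShards`, `jetty.port`) are the ones we recall from the SolrCloud wiki for the trunk builds of that time, so please treat them as assumptions and verify them against the version you run. Host names, paths, and the config name are made up, and the helper functions only build command lines; they do not start anything.

```python
def zk_connect_string(hosts, port=2181):
    """Join ZooKeeper hosts into a comma-separated host:port list."""
    return ",".join(f"{host}:{port}" for host in hosts)


def first_node_command(zk_host, conf_dir, config_name, num_shards):
    # The first node uploads the configuration files to ZooKeeper and
    # states how many shards the cluster should have.
    return [
        "java",
        f"-DzkHost={zk_host}",
        f"-Dbootstrap_confdir={conf_dir}",
        f"-Dcollection.configName={config_name}",
        f"-DnumShards={num_shards}",
        "-jar", "start.jar",
    ]


def additional_node_command(zk_host, jetty_port):
    # Later nodes only need to be pointed at ZooKeeper.
    return ["java", f"-DzkHost={zk_host}", f"-Djetty.port={jetty_port}",
            "-jar", "start.jar"]


# Hypothetical three-node ZooKeeper quorum.
zk = zk_connect_string(["zk1.example.com", "zk2.example.com", "zk3.example.com"])
print(" ".join(first_node_command(zk, "./solr/conf", "myconf", 2)))
print(" ".join(additional_node_command(zk, 7574)))
```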

Shard Replicas in SolrCloud serve multiple purposes. They provide fault tolerance in the sense that when (not if!) a single Solr instance/node containing a portion of the index goes down, you still have one or more replicas of the data that was served by that instance elsewhere in the cluster, and thus you still have the whole data set and no data loss. They also allow you to spread query load over more servers, thus making the cluster capable of handling higher query rates.
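Both purposes can be illustrated with a few lines of code. The sketch below is a toy model, not how SolrCloud itself picks nodes: `ShardReplicas` and the node names are hypothetical, and simple round-robin is just one way to spread queries over the copies of a shard while skipping copies that are down.

```python
import itertools


class ShardReplicas:
    """Tracks the nodes holding copies of one shard and which of them are up."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.live = set(nodes)
        self._cycle = itertools.cycle(self.nodes)

    def mark_down(self, node):
        self.live.discard(node)

    def pick(self):
        # Round-robin over the copies, skipping any that are currently down.
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if node in self.live:
                return node
        raise RuntimeError("no live copy of this shard")


shard1 = ShardReplicas(["node-a", "node-b"])
picks_before = [shard1.pick() for _ in range(4)]  # load is spread over both nodes
shard1.mark_down("node-a")
picks_after = [shard1.pick() for _ in range(4)]   # queries still get answered
```

While both nodes are up, queries alternate between them; once `node-a` is marked down, every query for this shard goes to `node-b`, so the data set stays complete.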

Indexing

As you saw above, the new SolrCloud really simplifies Distributed Indexing.  Document distribution between Shards and Replicas is automatic and real-time.  There is no master server one needs to send all documents to. A document can be sent to any SolrCloud instance and SolrCloud takes care of the rest. Because of this, there is no longer a SPOF (Single Point of Failure) in Solr.  Previously, Solr master was a SPOF in all but the most elaborate setups.

Querying

One can query SolrCloud a few different ways:

  • One can query a single Shard, which is just like querying a single Solr instance.
  • The second option is to query a single Collection (i.e., search all shards holding pieces of a given Collection’s index).
  • The third option is to only query some of the Shards by specifying their addresses or names.
  • Finally, one can query multiple Collections assuming they are compatible and Solr can merge results they return.

As you can see, lots of choices!
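The four options above map to request URLs roughly like the ones built below. This is a sketch under assumptions: the node address and the collection and shard names are made up, and the request parameters (`distrib`, `shards`, `collection`) are the ones we recall from the SolrCloud documentation, so double-check them against your Solr version. The code only builds URLs and does not send any requests.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr"  # hypothetical node address


def select_url(core_or_collection, **params):
    """Build a match-all /select URL with any extra request parameters."""
    query = {"q": "*:*"}
    query.update(params)
    return f"{BASE}/{core_or_collection}/select?{urlencode(query)}"


# 1. A single Shard only: ask the node not to distribute the request.
single_shard = select_url("collection1", distrib="false")

# 2. A whole Collection: a plain request is fanned out to all of its shards.
whole_collection = select_url("collection1")

# 3. Only some of the Shards, here given by name.
some_shards = select_url("collection1", shards="shard1,shard2")

# 4. Several compatible Collections at once.
multi_collection = select_url("collection1",
                              collection="collection1,collection2")

for url in (single_shard, whole_collection, some_shards, multi_collection):
    print(url)
```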

Administration with Core Admin

In addition to the standard core admin parameters there are some new ones available in SolrCloud. These new parameters let one:

  • create new Shards for an existing Collection
  • create a new Collection
  • add more nodes
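As a rough illustration of the list above, the snippet below builds a Core Admin `CREATE` request that carries the SolrCloud-specific `collection` and `shard` parameters. The parameter names are given as we remember them from the Core Admin documentation, so treat them as assumptions to verify. The node address and the core, collection, and shard names are invented, and the helper `core_create_url` is ours, not part of Solr.

```python
from urllib.parse import urlencode


def core_create_url(node, name, collection, shard=None):
    """Build a Core Admin CREATE URL for a core that joins a collection."""
    params = {"action": "CREATE", "name": name, "collection": collection}
    if shard is not None:
        # Naming a shard that does not exist yet is how a new shard would be
        # added in this sketch; omit it to let the cluster assign one.
        params["shard"] = shard
    return f"{node}/admin/cores?{urlencode(params)}"


# Hypothetical new node hosting a core for a new shard of collection1.
new_shard_core = core_create_url(
    "http://node3.example.com:8983/solr",
    name="collection1_shard3",
    collection="collection1",
    shard="shard3",
)

# Hypothetical first core of a brand new collection.
new_collection_core = core_create_url(
    "http://node3.example.com:8983/solr",
    name="collection2_core",
    collection="collection2",
)
print(new_shard_core)
print(new_collection_core)
```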

The Future

If you look at the New SolrCloud Design wiki page (http://wiki.apache.org/solr/NewSolrCloudDesign) you will notice that not all planned features have been implemented yet. There are still things like cluster re-balancing or monitoring (if you are using SolrCloud already and want to monitor its performance, let us know if you want early access to SPM for SolrCloud) to be done. Now that SolrCloud is in the Solr trunk, it should see more user and more developer attention. We look forward to using SolrCloud in more projects in the future!

@sematext

About Rafał Kuć
Sematext engineer, books author, speaker.

25 Responses to The New SolrCloud: Overview

  1. Mark Kerzner says:

    gro, what role does the HBase play?

  2. gro says:

    Mark, we got the data there that we needed to index and process.

  3. Very nice to see that Solr is taking the step to the cloud. In our company we have the traditional Solr setup with masters and slaves. We are currently investigating if ElasticSearch was the thing for us. It seems to have everything built in. How would you compare ES with SolrCloud? Which has the most potential, which is more stable, which supports the most queries/indexing operations?

  4. Thanks for posting this. How do the shard leaders decide which shard should get the document? I’m also curious about how document updates are handled and what happens when a document gets reindexed.

    • Rafał Kuć says:

      There is a hashing algorithm that is used to decide which shard the document should be put at. There is a default implementation and it’s also designed to be pluggable, so You can develop Your own method of hashing. As far as I’m aware, the default implementation is Murmur hash ported by Yonik. Also, the updates are distributed to the correct shards so You don’t have to worry about documents not being updated or deleted.

  5. Dun says:

    We are also currently investigating if ElasticSearch was the thing for us. It seems to have everything built in. How would you compare ES with Solr 4 ? Thanks !

    • sematext says:

      This is a big question many people are asking today. The complete answer, if you wanted to get into details, could easily take an hour. The short answer is that ElasticSearch is here today, while Solr 4.0 is still in development.

  6. Luke says:

    Is it correct to say that Solr itself can manage replication between clustered nodes within a DC and that SolrCloud would be more appropriate to use when managing particular collections across multiple clusters, DCs and regions?

    • sematext says:

      A pre-4.0 Solr could also be set up to replicate across clusters, DCs, and regions. But since Solr 4.0 is upon us, everyone should move towards it anyway.

  7. Pingback: Solr vs. ElasticSearch: Part 1 – Overview « Sematext Blog

  8. Esteban says:

    great intro article. One question: how does SolrCloud handle data consistency considering that the docs to be indexed need to be forwarded to the correct leader shard and then replicated to all the replicas. I imagine these are all async operations. What happen with queries involving this new doc in the meantime?

    • Rafał Kuć says:

      That’s a good question Esteban. If you would like the changes to be visible almost in real time, auto soft commit should be enabled, with very fast refresh rates (for example every one second), but that’s only the part of the work. When the document hits the leader shard it is automatically updated, but I imagine it can happen, that the changes are not yet visible. However there is a new handler called Realtime Get which should take care of such situations – you can include the realtime get component in your handler configuration and always see the latest version of the document.

      • Esteban says:

        thanks for your response. But what would happen if, for instance, the replication with one of the nodes is delayed? Subsequent queries will return different version of the doc (depending on what node resolves the query) until all the replicas get updated, right? Or even if the replication with one node fails. That node will keep and return an outdated version? Does SolrCloud have any conflict resolution mechanism for these situations?

        • Rafał Kuć says:

          The question is why replication is delayed, because it shouldn’t be. If there is a problem with a Solr node, then Zookeeper should remove that node from the cluster and stop sending both queries and data to it.

          • Esteban says:

            it could be delayed due to a network or server overload. Also even if Zookeeper removes a failed Solr node, if I have a load balancer upfront my cluster I can still deliver queries to that outdated node. Am I missing something?

          • Rafał Kuć says:

            I can’t nest replies anymore, so I’ll reply to my post :)

            If server is overloaded or network is the issue it shouldn’t be able to communicate with Zookeeper and thus will be removed from the cluster. If it is removed from the cluster it shouldn’t be able to serve queries. When talking about SolrCloud it will stop serving search data if it can’t see the whole view of the index, so if a node gets disconnected it won’t respond to a search request.

            There is a very nice thread on Solr ML that describes how Solr behaves in case of node disconnections http://search-lucene.com/m/joyiaxuETB1/SolrCloud+and+split-brain&subj=Re+SolrCloud+and+split+brain

  9. Esteban says:

    great! thanks for the info.

  10. Jason says:

    Thanks for the post. Do you mind sharing more information on how you integrate HBase and Solr (i.e. how do you send the HBase tables to SolrCloud and index them there?) thanks!

  11. denizdurmus87 says:

    thank you for this useful post… I am trying to learn about solrcloud having some questions in my mind… first I wanna ask what happens if one of the shards go down? the replica will be automatically active? if we are manually giving the shard name in url, how to handle the url for a down shard?

    as for my second, is it possible to use different directory types for a shard and its replica? like shard is using ram and replica is using mmap?

    • sematext says:

      @Deniz – if you have shard replicas, then if one of those replicas goes down, you don’t have to worry, indexing and searching will keep working. That’s the point of replicas in systems like SolrCloud. As for the second question – I think the answer is negative, but you should ask on the mailing list. Yes, I know you’ve been asking there already. :)

  12. Atp says:

    hi thanks for this ,very helpful. can you please confirm , if we use external zookeepers for solrcloud instead of hbase zookeeper , how to configure conf/hbase-indexer-site.xml in hbase-indexer? the installation step says ,
    In the hbase-indexer directory, edit the file conf/hbase-indexer-site.xml and configure the ZooKeeper connection string, (twice, once for hbase-indexer, and once for hbase, )

    but here we used 3 external zookeepers in 3 machines , how to set properties here? please guide

    hbaseindexer.zookeeper.connectstring = zookeeperhost
    hbase.zookeeper.quorum = zookeeperhost

    Thanks,
    Atp
