ElasticSearch Shard Placement Control

In this post you will learn how to control ElasticSearch shard placement.  If you are coming to our ElasticSearch talk at BerlinBuzzwords, we’ll be talking about this and more.

If you’ve ever used ElasticSearch you probably know that you can set it up to have multiple shards and replicas of each index it serves. This can be very handy in many situations. The ability to have multiple shards of a single index lets us deal with indices that are too large for efficient serving by a single machine. The ability to have multiple replicas of each shard lets us handle higher query load by spreading replicas over multiple servers. Of course, replicas are useful for more than just spreading the query load, but that’s another topic. In order to shard and replicate, ElasticSearch has to figure out where in the cluster to place shards and replicas – that is, which node each shard or replica should end up on.

Let’s Start

Imagine the following situation. We have 6 nodes in our ElasticSearch cluster. We decided to have our index split into 3 shards and, in addition to that, we want 1 replica of each shard. That gives us 6 shards total for our index – 3 primary shards plus 3*1 replica shards. In order to have shards evenly distributed between nodes we would ideally like to have them placed one on each node. However, until version 0.19.0 of ElasticSearch this was not always the case out of the box. For instance, one could end up with an unbalanced cluster as shown in the following figure.

As you can see, this is far from perfect. Of course, one may say that rebalancing could be done to fix this. True, but as we’ve seen in our client engagements involving very large ElasticSearch clusters, shard/replica rebalancing is not something one should take lightly. It ain’t cheap or super quick – depending on your cluster size, index size, and shard/replica placement, rebalancing can take time and make your cluster sweat. So what can we do? Can we control it? Of course we can, otherwise we’d have nothing to write about. Moreover, not only can we control it, in complex and large ElasticSearch clusters it is arguable that one should control it. So how do we do it? ElasticSearch lets you control shard and replica allocation at the cluster and index level, and we’ll show you what you can achieve using ElasticSearch configuration and the REST API.

Specifying the Number of Shards Per Node

The first thing we can control is the total number of shards of an index that can be placed on a single node. In our simple 6-node deployment example we would set this value to 1, which means we want a maximum of 1 shard of a given index on a single node. This comes in handy when you want shards distributed evenly, as in our example above. But we can imagine situations where that alone is not enough.
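
The setting in question is index.routing.allocation.total_shards_per_node (also discussed in the comments below). As a minimal sketch – assuming an index named ‘sematext’, which is just the example name used throughout this post – setting that limit at index creation time could look like this:

curl -XPUT 'localhost:9200/sematext' -d '{
   "index.routing.allocation.total_shards_per_node" : 1
}'

With 3 primary shards, 1 replica of each, and 6 nodes, this should nudge ElasticSearch into placing exactly one shard of the index on each node.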

Allocating Nodes for New Shards

ElasticSearch lets you specify which nodes to include when deciding where shards should be placed. To use that functionality you need to specify an additional configuration parameter that lets ElasticSearch recognize your nodes or groups of nodes in the cluster. Let’s dig deeper into our example above and divide the nodes into 2 zones. The first 3 nodes will be associated with zone_one and the remaining nodes will be associated with zone_two. So for the first 3 nodes you would add the following value to the ElasticSearch configuration file:

node.zone: zone_one

and for the other 3 nodes the following value:

node.zone: zone_two

The attribute name can, of course, be different – you can use anything you want as long as it doesn’t interfere with other ElasticSearch configuration parameters.
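
For example – node.rack and rack_one are purely hypothetical names used here for illustration – you could just as well tag nodes by rack:

node.rack: rack_one

and then use index.routing.allocation.include.rack (or exclude.rack) in the settings shown below instead of the zone variants.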

So we divided our cluster into two ‘zones’. Now, if we would like to have two indices, one placed on zone_one nodes and the other placed on zone_two nodes, we have to include an additional setting when creating each index. To create an index on zone_one nodes we’d use the following command:

curl -XPUT 'localhost:9200/sematext1' -d '{
   "index.routing.allocation.include.zone" : "zone_one"
}'

Then to create the second index on zone_two nodes we’d use the following command:

curl -XPUT 'localhost:9200/sematext2' -d '{
   "index.routing.allocation.include.zone" : "zone_two"
}'

By specifying the index.routing.allocation.include.zone parameter we tell ElasticSearch that we want only nodes with the given attribute value to be included in shard allocation for that index. (Note the index prefix – the cluster-wide equivalent uses the cluster.routing.allocation prefix, as we’ll see below.) Of course, we can include multiple zone values for a single index. The following is also correct and will tell ElasticSearch to create an index on all nodes from our example:

curl -XPUT 'localhost:9200/sematext2' -d '{
   "index.routing.allocation.include.zone" : "zone_one,zone_two"
}'

Remember that ElasticSearch will only place shards of such an index on the nodes we included – no other nodes will be used.
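
If you want to verify where the shards actually ended up, one simple way (a sketch, not the only option) is to look at the routing table in the cluster state:

curl -XGET 'localhost:9200/_cluster/state?pretty=true'

The routing_table section of the response lists every shard of every index together with the node it has been allocated to.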

Excluding Nodes from Shard Placement

Just like we’ve included nodes for shard placement, we can also exclude them. This is done exactly the same way as including, but instead of the index.routing.allocation.include parameter we specify the index.routing.allocation.exclude parameter. Let’s assume we have more than 2 zones defined and we want to exclude a specific zone, for example zone_one, from shard placement for one of our indices. We could specify the index.routing.allocation.include parameters for all the zones except zone_one, or we can just use the following command:

curl -XPUT 'localhost:9200/sematext' -d '{
   "index.routing.allocation.exclude.zone" : "zone_one"
}'

The above command will tell ElasticSearch to exclude zone_one from shard placement and place shards and replicas of that index only on the rest of the cluster.

IP Addresses Instead of Node Attributes

The include and exclude definitions can also be specified without using node attributes. You can use node IP addresses instead. In order to do that, you specify the special _ip attribute instead of an attribute name. For example, if you would like to exclude the node with IP 10.1.1.2 from shard placement you would set the following configuration value:

"cluster.routing.allocation.exclude._ip" : "10.1.1.2"

Cluster and Index Level Settings

All the settings described above can be specified at the cluster level or at the index level. This way you can control how your shards are placed either globally for the whole cluster or differently for each index. In addition, all of the above settings can be changed at runtime using the ElasticSearch REST API. For example, to include only nodes with IP 10.1.1.2 and 10.1.1.3 for the ‘sematext’ index, we’d use the following command:

curl -XPUT 'localhost:9200/sematext' -d '{
   "index.routing.allocation.include._ip" : "10.1.1.2,10.1.1.3"
}'

To specify this at the cluster level we’d use the cluster update settings API instead, wrapping the settings in "transient" (not preserved across a full cluster restart) or "persistent" (preserved):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
   "transient" : {
      "cluster.routing.allocation.include._ip" : "10.1.1.2,10.1.1.3"
   }
}'
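
If your ElasticSearch version also exposes a GET variant of this endpoint (an assumption worth checking against your version’s docs), you can read the current cluster-level settings back as a quick sanity check:

curl -XGET 'localhost:9200/_cluster/settings?pretty=true'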

Visualizing Shard Placement

It’s 2012 and people like to see information presented visually. If you’d like to see how your shards are placed in your ElasticSearch cluster, have a look at Scalable Performance Monitoring, more specifically SPM for ElasticSearch. This is what shard placement looks like in SPM:

What you can see in the screenshot above is that the number of shards and their placement was constant during the selected period of time. OK. But, more importantly, what happens when ElasticSearch is rebalancing shards and replicas across the cluster? This is what SPM shows when things are not balanced and shards and replicas are being moved around the cluster:

The Future

Right now the situation with shard placement is not bad, though it is not ideal either. On the other hand, the future is bright – a new shard placement mechanism is expected in ElasticSearch 0.20, so let’s cross our fingers, because we may get an even better search engine soon. In the meantime, you can plan and control where your index shards and replicas are placed by using ElasticSearch’s shard allocation controls, which do their job well.

We’re Hiring

If you are looking for an opportunity to work with ElasticSearch, we’re hiring.  We are also looking for talented engineers to join our SPM development team.

@sematext

About Rafał Kuć
Sematext engineer, book author, speaker.

16 Responses to ElasticSearch Shard Placement Control

  1. Great post. Note that setting index level allocation uses the _index_ prefix, see more here: http://www.elasticsearch.org/guide/reference/index-modules/allocation.html.

  2. Rafał Kuć says:

    Thanks Shay :) And you are right, will correct that shortly (this shows that copy+paste is a bad idea even when writing blog post) :)

  3. Ivan Brusic says:

    I was wondering if it was possible to move shards of an existing index off of an existing node. ElasticSearch complains when changing the settings on an open index, but it does not work with a closed index either. For example, trying to remove the 10.1.1.4 node from the foobar index:

    curl -XPUT 'localhost:9200/foobar/_settings' -d '{
       "index" : {
          "index.routing.allocation.include._ip" : "10.1.1.2,10.1.1.3"
       }
    }'

    Probably wishful thinking on my part, but it would be an interesting feature to have (needed for rolling upgrades).

  4. Pingback: Solr vs. ElasticSearch: Part 1 – Overview « Sematext Blog

  5. Mark Harwood says:

    Terminology is really important. I took exception to the statement in the opening “Let’s start” paragraph that states 3 shards with 1 replica of each becomes 6 shards. A replica should mean a replica of a shard not *another* shard. If I had 50 replicas of 3 shards I still have my content divided into only 3 shards – with a whole heap of replicas.

    • sematext says:

      @Mark – agreed about importance of correct and consistent terminology usage. Another related one is the usage of # replicas in ES vs. replication factor in Solr. One doesn’t count the original shard, the other one just counts the total number of instances of a shard. So the paragraph in question uses the Solr naming approach here, but yes, we should use the ES one here. See https://issues.apache.org/jira/browse/SOLR-4118

      • Mark Harwood says:

        The off-by-one counting of replicas (i.e. do I include the original?) is a lesser issue and perhaps excusable.
        Calling a replica “another shard” fundamentally undoes the meaning of a shard.
        Sharding = dividing content to deal with data volumes
        Replication = multiplying content to cope with users.
        The number of shards in a system is a function of your sharding policy not your replication policy.

        • Rafał Kuć says:

          Thanks for the comment, Mark. But I would say that, according to this: http://www.elasticsearch.org/guide/appendix/glossary.html, we shouldn’t use shard and replica, but primary shard and replica shard, because the glossary says ‘A shard is a single Lucene instance’.

          • Mark Harwood says:

            OK. So if “shard” in ES-world means “Lucene instance” what can I use to describe the logical subdivision of an index that is represented by either a “Primary shard” or any one of its “replica shards”?

            Let’s go with “Sliver”.

            How do I ensure that all of the slivers that make up an index are spread evenly across nodes in order to achieve high parallelism of each user’s query? Does the “shards per node” placement setting select a more-or-less random choice of Lucene instances to home on a node or is it in any way “sliver-aware”?

  6. Rafał Kuć says:

    I can’t reply to your reply Mark, so I’ll just post here. The index in ES is divided into shards and those can be either primary shards or replica shards – that’s what the glossary says and let’s stick to that.

    So the index.routing.allocation.total_shards_per_node property specifies the maximum number of shards of a given index that can be placed on the same node. Again, getting back to the glossary, that means it doesn’t matter whether the shard is a primary or a replica. So if you know how many nodes your cluster is built of, you can choose the right value for that property to have your index more-or-less evenly distributed.

    • Mark Harwood says:

      Thanks Rafał

      Consistency of terminology could certainly do with tightening up in some places. If “shard” is supposed to mean “physical or replica Lucene instance” then for example the config entry “index.number_of_shards” should arguably be renamed to “index.number_of_primary_shards” to make things clearer.

      From your response it sounds like ES doesn’t take any particular care to ensure the splits/slivers that make up an index are represented evenly by an appropriate choice of “Lucene instances” across nodes, but I expect/hope in practice this balances out so it may not be an issue.

      • Rafał Kuć says:

        Yeah, I would be very happy if Solr/Lucene/ElasticSearch and others out there used the same terminology, but I doubt that will happen – for instance in SolrCloud we have a collection, while in ElasticSearch we have an index :) And I think you are right, index.number_of_primary_shards would be more appropriate in terms of terminology.

        As to the second thing – the primaries placement (we are talking about the primaries, right? ;)) – ES will try to distribute them as well as possible and will make sure that a primary shard and its replica(s) won’t be placed on the same node. For example, you can look at the EvenShardsCountAllocator to see how it’s happening – https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/EvenShardsCountAllocator.java

        • Mark Harwood says:

          Let’s take this back to user requirements and then understand ES capabilities:

          As a user I want 2 things from a placement policy:
          1) Resilience – in the interests of failure management replicas should not be placed on the same server as the primaries which they replicate.
          2) Performance – each user query should be serviced by many servers in parallel to maximize parallelism of query execution.

          So as I understand it the ES placement policy only addresses point 1 and does not address point 2.
          This is from the EvenShardsCountAllocator javadocs……

          “a single node can hold multiple shards from a single index such that the shard(s) of an index
          are not necessarily balanced across nodes. Yet, due to high-level AllocationDecider decisions multiple instances of the same shard won’t be allocated on the same node.”

          • Rafał Kuć says:

            You are right, sometimes we notice that ES tends to put shards of the same index on a single machine. It can be painful, but it can also come in handy at other times – not that I wouldn’t like to have a perfect distribution mechanism in ES. And with the newest releases of ES you can manually move shards around with a simple API call.

            I said that having multiple shards of the same index on a single node can be handy, so let me go into this a bit. For example, when you start with fewer nodes than the number of primary shards and you plan to add more nodes later. This way you can have multiple primary shards on a single node and rebalance the cluster when the new machines are added.
