Solr vs. ElasticSearch: Part 1 – Overview
August 23, 2012 13 Comments
A good Solr vs. ElasticSearch coverage is long overdue. At Sematext we make good use of our own Search Analytics and pay attention to what people search for. Not surprisingly, lots of people are wondering when to choose Solr and when ElasticSearch, and this SolrCloud vs. ElasticSearch question is something we regularly address in our search consulting engagements.
As the Apache Lucene 4.0 release approaches and with it Solr 4.0 release as well, we thought it would be beneficial to take a deeper look and compare the two leading open source search engines built on top of Lucene – Apache Solr and ElasticSearch. Because the topic is very wide and can go deep, we are publishing our research as a series of blog posts starting with this post, which provides the general overview of the functionality provided by both search engines.
- Solr vs. ElasticSearch: Part 1 – Overview
- Solr vs. ElasticSearch: Part 2 – Indexing and Language Handling
- Solr vs. ElasticSearch: Part 3 – Searching
- Solr vs. ElasticSearch: Part 4 – Faceting
- Solr vs. ElasticSearch: Part 5 - Management API Capabilities
- Solr vs. ElasticSearch: Part 6 – User & Dev Communities Compared
Before We Start
This post is based on released versions of Solr and ElasticSearch. For Solr, all the functionality description is based on version 4.0 beta and all of the ElasticSearch functionality is based on 0.19.8. Because we are comparing ElasticSearch and Solr, on the Solr side the focus is on Solr 4.0 (aka SolrCloud) functionality functionality and not Solr 3.*, so we could call this series as SolrCloud vs. ElasticSearch, too.
Under the Hood
For indexing and searching both Solr and ElasticSearch use Lucene. As you may suspect, Solr 4.0 beta uses the 4.0 version of Lucene, while ElasticSearch 0.19.8 still uses version 3.6. Of course, that doesn’t mean much when it comes to future versions of ElasticSearch because you can be sure that ElasticSearch will start using Lucene 4.0 once it’s GA release is ready, or maybe even before that.
There are a few differences in the way Solr and ElasticSearch name certain concepts. Let’s start with the basics – many servers connected together forms a cluster for both ElasticSearch and Solr. A single instance of Solr or ElasticSearch is called a node. That’s about it for nomenclature overlap.
The main logical data structure for Solr is called the Collection. A Collection is composed of Shards that are really Lucene indices. A single Collection can have multiple Shards and Shards can live on different Nodes. Because a Collection is composed of one or more Shards, a single Collection can be spread across multiple Nodes giving you a distributed environment. In addition to that, a Collection can have Replicas – basically an exact copy of the Shard whose main purpose is to enable scaling and data duplication in case of Node failures (i.e., High Availability).
On the other hand we have ElasticSearch where the top logical data structure is called an Index. Similar to a Collection in Solr, ElasticSearch Index can have multiple Shards and Replicas. And here, too, Shards and Replicas are small Lucene indices, that can be spread across the Cluster in order to create a distributed environment. But that’s not all – in ElasticSearch you can have multiple Types of documents in a single Index. This means that you can index documents of different index structure (for example users and their documents) in a single Index. ElasticSearch is able to distinguish those Types during indexing as well as querying. In order to achieve the same with Solr you would have to simulate that inside your application or develop a custom search component.
Lets take a quick look at how Solr and ElasticSearch are configured. Let’s start with the index structure.
In Solr you need the schema.xml file in order to define how your index structure, to define fields and their types. Of course, you can have all fields defined as dynamic fields and create them on the fly, but you still need at least some degree of index configuration. In most cases though, you’ll create a schema.xml to match your data structure.
ElasticSearch is a bit different – it can be called schemaless. What exactly does this mean, you may ask. In short, it means one can launch ElasticSearch and start sending documents to it in order to have them indexed without creating any sort of index schema and ElasticSearch will try to guess field types. It is not always 100% accurate, at least when comparing to the manual creation of the index mappings, but it works quite well. Of course, you can also define the index structure (so called mappings) and then create the index with those mappings, or even create the mappings files for each type that will exist in the index and let ElasticSearch use it when a new index is created. Sounds pretty cool, right? In addition to than, when a new, previously unseen field is found in a document being indexed, ElasticSearch will try to create that field and will try to guess its type. As you may imagine, this behavior can be turned off.
Let’s talk about the actual configuration of Solr and ElasticSearch for a bit. In Solr, the configuration of all components, search handlers, index specific things such as merge factor or buffers, caches, etc. are defined in the solrconfig.xml file. After each change you need to restart Solr node or reload it. All configs in ElasticSearch are written to elasticsearch.yml file, which is just another configuration file. However, that’s not the only way to store and change ElasticSearch settings. Many settings exposed by ElasticSearch (not all though) can be changed on the live cluster – for example you can change how your shards and replicas are placed inside your cluster and ElasticSearch nodes don’t need to be restarted. Learn more about this in ElasticSearch Shard Placement Control.
Discovery and Cluster Management
Solr and ElasticSearch have a different approach to cluster node discovery and cluster management in general. The main purpose of discovery is to monitor nodes’ states, choose master nodes, and in some cases also store shared configuration files.
By default ElasticSearch uses the so called Zen Discovery, which has two methods of node discovery: multicast and unicast. With multicast a single node sends a multicast request and all nodes that receive that request respond to it. So if your nodes can see each other at the network layer with the use of multicast method your nodes will be able to form a cluster. On the other hand, unicast depends on the list of hosts that should be pinged in order to form the cluster. In addition to that, the Zen Discovery module is also responsible for detecting the master node for the cluster and for fault discovery. The fault discovery is done in two ways – the master node pings all the other nodes to see if they are healthy and the nodes ping the master in order to see if the master is still working as it should. We should note that there is an ElasticSearch plugin that makes ElasticSearch use Apache Zookeeper instead of its own Zen Discovery.
Apache Solr uses a different approach for handling search cluster. Solr uses Apache Zookeeper ensemble – which is basically one or more Zookeeper instances running together. Zookeeper is used to store the configuration files and monitoring – for keeping track of the status of all nodes and of the overall cluster state. In order for a new node to join an existing cluster Solr needs to know which Zookeeper ensemble to connect to.
There is one thing worth noting when it comes to cluster handling – the split brain situation. Imagine a situation, where you cluster is divided into half, so half of your nodes don’t see the other half, for example because of the network failure. In such cases ElasticSearch will try to elect a new master in the cluster part that doesn’t have one and this will lead to creation of two independent clusters running at the same time. This can be limited with a small degree of configuration, but it can still happen. On the other hand, Solr 4.0 is immune to split brain situations, because it uses Zookeeper, which prevents such ill situations. If half of your Solr cluster is disconnected, it wouldn’t be visible by Zookeeper and thus data and queries wouldn’t be forwarded there.
If you know Apache Solr or ElasticSearch you know that they expose an HTTP API.
Those of you familiar with Solr know that in order to get search results from it you need to query one of the defined request handlers and pass in the parameters that define your query criteria. Depending on which query parser you choose to use, these parameters will be different, but the method is still the same – an HTTP GET request is sent to Solr in order to fetch search results. The good thing is that you are not limited to a single response format – you may choose to get results in XML, in JSON in JavaBin format and several other formats that have response writers developed for them. You can thus choose the format that is the most convenient for you and your search application. Of course, Solr API is not only about querying as you can also get some statistics about different search components or control Solr behavior, such as collection creation for example.
And what about ElasticSearch? ElasticSearch exposes a REST API which can be accessed using HTTP GET, DELETE, POST and PUT methods. Its API allows one not only to query or delete documents, but also to create indices, manage them, control analysis and get all the metrics describing current state and configuration of ElasticSearch. If you need to know anything about ElasticSearch, you can get it through the REST API (we use it in our Scalable Performance Monitoring for ElasticSearch, too!). If you are used to Solr there is one thing that may be strange for you in the beginning – the only format ElasticSearch can respond in JSON – there is no XML response for example. Another big difference between ElasticSearch and Solr is querying. While with Solr all query parameters are passed in as URL parameters, in ElasticSearch queries are structured in JSON representation. Queries structured as JSON objects give one a lot of control over how ElasticSearch should understand the query and thus what results to return.
Of course, both Solr and ElasticSearch leverage Lucene near real-time capabilities. This makes it possible for queries to match documents right after they’ve been indexed. In addition to that, both Solr (since 4.0) and ElasticSearch (since 0.15) allow versioning of documents in the index. This feature allows them to support optimistic locking and thus enable prevention of overwriting updates. Let’s look at how distributed indexing is done in Solr vs. ElastiSearch.
Let’s start with ElasticSearch this time. In order to add a document to the index in a distributed environment ElasticSearch needs to choose which shard each document should be sent to. By default a document is placed in a shard that is calculated as a hash from the documents identifier. Because this default behavior is not always desired, one can control and alter this behavior by using a feature called routing. This is controlled via the routing parameter, which can take any value you would like it to have. Imagine that you have a single logical index divided into multiple shards and you index multiple users’ data in it. On the search side you know queries are narrowed mostly to a single user’s data. With the use of the routing parameter you can index all documents belonging to a single user within a single shard by using the same routing value for all his/her documents. On the search side you can then use the same routing value when querying. This would result in a single shard being queried instead of the query being spread across all shards in the index, which would be more expensive and slower. In case each index shard contains multiple users’ data we could additionally use a filter to limit matches to only one user’s documents. In cases like this, routing functionality allows one to think of some nice optimization for both indexing and querying. If you want to hear some more about distributed indexing capabilities of ElasticSearch please take a look at my Berlin Buzzwords 2012 talk – Scaling Massive ElasticSearch Clusters (video).
Details of Solr’s implementation of distributed indexing (and searching) capabilities can be found in our The New SolrCloud: Overview post. But let’s recall some of those details. In order to forward a document to a proper shard Solr uses Murmur hashing algorithm which calculates the hash for the given document on the basis of its unique identifier. This part is similar to default ElasticSearch behavior. However, Solr doesn’t yet let you specify explicitly to which shard the document should be sent - there is no document and query routing equivalent in Solr yet.
Of course, both Solr and ElasticSearch allow one to configure replicas of indices (ElasticSearch) or collections (Solr). This is crucial because replicas enable creation of highly available clusters – even if some of nodes are down, for example because of hardware failure or maintenance, the cluster and data within it can remain available. Without replicas if one nodes is lost, you lose (access to) the data that were on the missing node. If you have replicas present in your configuration both search engines will automatically copy documents to replicas, so that you don’t need to worry about data loss.
We hope that after reading this post you have the basic understanding of what you can expect from both Solr 4.0 and ElasticSearch 0.19.* and you can start to get the feeling for differences and similarities between them. Of course, both Solr and ElasticSearch have very strong and active user and development communities and are constantly evolving and improving, and are doing that rather fast. In pre-Solr 4.0 (aka SolrCloud) world the difference between Solr and ElasticSearch was quite stark. Since then, and under the pressure from ElasticSearch, the gap has narrowed and both projects are moving forward quite quickly. At Sematext our clients often ask us to recommend the search engine for their use and we recommend both of them. Which one we recommend for a particular project depends on project requirements, which we always go through at the beginning of every engagement. If you need help deciding, let us know.
Also please keep in mind this post is not meant to be the most comprehensive guide to all the similarities and differences between ElasticSearch and Solr. We wanted to start with a general overview of how these two great search engines work and cover the big picture. In subsequent parts of the “Solr vs. ElasticSearch” series we’ll describe how the most frequently used features of both search engines work, what the differences between them are, and we’ll get into details of those features showing you pros and cons of Solr vs. ElasticSearch approach (for example approaches used in faceting or caching). Just as a sneak peak into the next post in the series – you can expect information about language handling capabilities, analysis configuration, and querying.