Side by Side with Elasticsearch and Solr: Performance and Scalability

[Note: this post has been updated to include video and slides from the June 2 presentation]

Back by popular demand!  Sematext engineers Radu Gheorghe and Rafal Kuc returned to Berlin Buzzwords on Tuesday, June 2, with the second installment of their “Side by Side with Elasticsearch and Solr” talk.  (You can check out Part 1 here.)

Elasticsearch and Solr Performance and Scalability

This brand new talk — which included a live demo, a video demo and slides — dove deeper into into how Elasticsearch and Solr scale and perform. And, of course, they took into account all the goodies that came with these search platforms since last year.

Radu and Rafal showed attendees how to tune Elasticsearch and Solr for two common use-cases: logging and product search.  Then they showed what numbers they got after tuning. There was also some sharing of best practices for scaling out massive Elasticsearch and Solr clusters; for example, how to divide data into shards and indices/collections that account for growth, when to use routing, and how to make sure that coordinated nodes don’t become unresponsive.

Here is the video:


…and here are the slides:


Feedback & Questions — Bring It On

If you’ve got feedback or questions about topics like Elasticsearch vs. Solr (here’s a detailed comparison) and what’s new and exciting with both applications, just drop us a line.  We live and breathe this stuff, so we’re always happy to hear from like-minded people.

Solr Cookbook, 3rd Edition — Now Available and includes Solr 5.0

Hot off the press: a brand new Solr Cookbook!  One of Sematext’s Solr and Elasticsearch experts — and authorsRafał Kuć, has just published the third and latest edition of Solr Cookbook.  This edition covers both Solr 4.x (based on the newest 4.10.3 version of Solr) and the just-released Solr 5.0.

Similar to previous Solr Cookbooks, Rafal updated the book significantly — half of the previous content has been changed — and rewrote all of the recipes.


Chapter List

Here’s a list of the chapters:

  1. Apache Solr Configuration
  2. Indexing Your Data
  3. Analyzing Your Text Data
  4. Querying Solr
  5. Faceting
  6. Improving Solr Performance
  7. In the Cloud
  8. Using Additional Solr Functionalities
  9. Dealing with Problems
  10. Real-life Situations

For more information about Solr Cookbook, Third Edition — including info on getting a free chapter — check out the Packt Publishing web page dedicated to it.  The book is available in both electronic and paperback versions.  Even better, here is a discount code you can use for 20% off (valid until March 22, 2015; see details for applying code below*): scte20

Need Some Solr Expertise?

Rafal isn’t the only Solr expert at Sematext; we’ve got several more who have helped 100+ clients to architect, scale, tune, and successfully deploy their Solr-based products.  We also offer 24/7 production support for Solr and Elasticsearch.  Here’s more info about our professional services, which also include Elasticsearch and Logging consulting.  You can also monitor Solr performance (and many other platforms) with SPM Performance Monitoring.

Have some feedback or questions for Rafal?

He’d love to hear from you — get him @kucrafal


* Using discount code:

  1. Set up a free Packt account or log into your existing account
  2. Add the title “Solr Cookbook – Third Edition” in the cart
  3. Click on ‘View Cart’
  4. Then in the “Do you have a promo code?” field enter scte20
  5. Click on the “Apply” button for the discount to get applied


Solr vs. Elasticsearch — How to Decide?

by Otis Gospodnetić

[Otis is a Lucene, Solr, and Elasticsearch expert and co-author of “Lucene in Action” (1st and 2nd editions).  He is also the founder and CEO of Sematext. See full bio below.]

“Solr or Elasticsearch?”…well, at least that is the common question I hear from Sematext’s consulting services clients and prospects.  Which one is better, Solr or Elasticsearch?  Which one is faster?  Which one scales better?  Which one can do X, and Y, and Z?  Which one is easier to manage?  Which one should we use?  Which one do you recommend? etc., etc.

These are all great questions, though not always with clear and definite, universally applicable answers. So which one do we recommend you use? How do you choose in the end?  Well, let me share how I see Solr and Elasticsearch past, present, and future, let’s do a bit of comparing and contrasting, and hopefully help you make the right choice for your particular needs.


Early Days: Youth Vs. Experience

Apache Solr is a mature project with a large and active development and user community behind it, as well as the Apache brand.  First released to open-source in 2006, Solr has long dominated the search engine space and was the go-to engine for anyone needing search functionality.  Its maturity translates to rich functionality beyond vanilla text indexing and searching; such as faceting, grouping (aka field collapsing), powerful filtering, pluggable document processing, pluggable search chain components, language detection, etc.

Read more of this post

Solr 5: Replication Throttling

With the release of Solr 5.0, the most recent major version of this great search server, we didn’t only get improvements and changes from the Lucene library.  Of course, we did get features like:

  • segments control sum
  • segments identifiers
  • Lucene using only classes from Java NIO.2 package to access files
  • lowered heap usage because of new Lucene50Codec

…but those features came from the Lucene core itself.  Solr introduced:

  • improved usability for start-up scripts
  • scripts for Linux service installation and running
  • distributed IDF calculation
  • ability to register new handlers using the API (with jar uploads)
  • replication throttling
  • …and so on

All of these features come with the first release of branch 5 of Solr, and we can expect even more from future releases — like cross data center replication! We want to start sharing what we know about those features and, today, we start with replication throttling.

Read more of this post

Solr Redis Plugin Use Cases and Performance Tests

The Solr Redis Plugin is an extension for Solr that provides a query parser that uses data stored in Redis. It is open-sourced on Github by Sematext. This tool is basically a QParserPlugin that establishes a connection to Redis and takes data stored in SET, ZRANGE and other Redis data structures in order to build a query. Data fetched from Redis is used in RedisQParser and is responsible for building a query. Moreover, this plugin provides a highlighter extension which can be used to highlight parts of aliased Solr Redis queries (this will be described in a future).

Use Case: Social Network

Imagine you have a social network and you want to implement a search solution that can search things like: events, interests, photos, and all your friends’ events, interests, and photos. A naive, Solr-only-based implementation would search over all documents and filter by a “friends” field. This requires denormalization and indexing the full list of friends into each document that belongs to a user. Building a query like this is just searching over documents and adding something like a “friends:1234” clause to the query. It seems simple to implement, but the reality is that this is a terrible solution when you need to update a list of friends because it requires a modification of each document. So when the number of documents (e.g., photos, events, interests, friends and their items) connected with a user grows, the number of potential updates rises dramatically and each modification of connections between users becomes a nightmare. Imagine a person with 10 photos and 100 friends (all of which have their photos, events, interests, etc.).  When this person gets the 101th friend, the naive system with flattened data would have to update a lot of documents/rows.  As we all know, in a social network connections between people are constantly being created and removed, so such a naive Solr-only system could not really scale.

Social networks also have one very important attribute: the number of connections of a single user is typically not expressed in millions. That number is typically relatively small — tens, hundreds, sometimes thousands. This begs the question: why not carry information about user connections in each query sent to a search engine? That way, instead of sending queries with clause “friends:1234,” we can simply send queries with multiple user IDs connected by an “OR” operator. When a query has all the information needed to search entities that belong to a user’s friends, there is no need to store a list of friends in each user’s document. Storing user connections in each query leads to sending of rather large queries to a search engine; each of them containing multiple terms containing user ID (e.g., id:5 OR id:10 OR id:100 OR …) connected by a disjunction operator. When the number of terms grows the query requests become very big. And that’s a problem, because preparing it and sending it to a search engine over the network becomes awkward and slow.

How Does It Work?

The image below presents how Solr Redis Plugin works.

Read more of this post

Videos: Tuning Solr for Logs and Solr Anti-Patterns

If you’re an avid Solr user you’ll want to check out these Lucene / Solr Revolution videos from two of Sematext’s Solr experts: Rafal Kuc and Radu Gheorghe.

Tuning Solr for Logs

Radu talked about Solr performance tuning, which is always nice for keeping your applications snappy and your costs down. This is especially true for logs, social media and other stream-like data that can easily grow into terabyte territory.

(note: there’s no audio between 3:30 and 4:30; we hope to have this fixed soon and it doesn’t materially affect the talk)

Solr Anti-Patterns

Rafal points out common mistakes and roads that should be avoided at all costs when dealing with Solr.

Slides and Summaries

You can find slides of the Solr presentations in this blog post and summaries in this blog post.


Solr Presentations from Lucene/Solr Revolution 2014

Thanks to everyone who stopped by the Sematext booth at last week’s Lucene/Solr Revolution event in Washington, DC and attended our two talks:

The attendance, questions and interest are very much appreciated.  As a company that prides itself on its Solr expertise (and Elasticsearch expertise too, for that matter), it was nice to spend a couple days talking about search and Big Data challenges, performance monitoring and logging with fellow experts from around the world. Here are the slides for the two talks we gave (summaries of the talks can be found here):


  Videos of the talks will be posted here soon.  Hope to see everyone again next year!


Get every new post delivered to your Inbox.

Join 181 other followers