SolrCloud Rebalance API

This post describes work done at BloomReach on smarter index and data management in SolrCloud.

Authors: Nitin Sharma – Search Platform Engineer & Suruchi Shah –  Engineering Intern

 


Introduction

In a multi-tenant search architecture, manually managing collections and ranking/search configurations becomes cumbersome as the data grows. This post describes the approach we implemented at BloomReach for effective index management and dynamic config management in a massive multi-tenant SolrCloud search infrastructure.

Problem

The lack of granular control over index and config management for Solr collections introduces complexity in geographically distributed, massive multi-tenant architectures. Common scenarios such as adding and removing nodes or growing collections and their configs make cluster management a significant challenge. Currently, Solr doesn’t offer a scaling framework for these operations. Although there are some basic Solr APIs for simple core manipulation, they don’t satisfy the scaling requirements at BloomReach.

Innovative Data Management in SolrCloud

To address the scaling and index management issues, we have designed and implemented the Rebalance API, as shown in Figure 1. This API allows robust index and config manipulation in SolrCloud, while guaranteeing zero downtime using various scaling and allocation strategies. It has two dimensions:

[Figure 1: Rebalance API scaling and allocation strategies]

The seven scaling strategies are as follows:

  1. Auto Shard allows re-sharding an entire collection to any number of destination shards. The process includes re-distributing the index and configs consistently across the new shards, while avoiding any heavy re-indexing processes.  It also offers the following flavors:
    • Flip Alias Flag controls whether or not the alias name of a collection (if it already had an alias) should automatically switch to the new collection.
    • Size-based sharding allows the user to specify the desired size of the destination shards for the collection. As a result, the system defines the final number of shards depending on the total index size.
  2. Redistribute enables distribution of cores/replicas across unused nodes. Oftentimes, the cores are concentrated within a few nodes. Redistribute allows load sharing by balancing the replicas across all nodes.
  3. Replace allows migrating all the cores from a source node to a destination node. It is useful in cases requiring replacement of an entire node.
  4. Scale Up adds new replicas for a shard. The default allocation strategy for scaling up is unused nodes. Scale Up can also replicate custom per-merchant configs in addition to the index (as an extension to the existing replication handler, which only syncs the index files).
  5. Scale Down removes the given number of replicas from a shard.
  6. Remove Dead Nodes is an extension of Scale Down that removes the replicas/shards on dead nodes for a given collection. In the process, the logic unregisters the replicas from ZooKeeper. This in turn saves a lot of back-and-forth communication between Solr and ZooKeeper in their constant attempt to find the replicas on dead nodes.
  7. Discovery-based Redistribution allows distribution of all collections as new nodes are introduced into a cluster. Currently, when a node is added to a cluster, no operations take place by default. With redistribution, we introduce the ability to rearrange the existing collections across all the nodes evenly.
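To make two of the strategies above more concrete, here is a minimal Python sketch of the arithmetic behind size-based sharding and of an even spread of cores as in Redistribute. It is not the Rebalance API implementation: the function names, node names and core names are hypothetical, and the balancing loop deliberately ignores real-world constraints such as keeping two replicas of the same shard off the same node.

```python
from typing import Dict, List


def shard_count_for_size(total_index_bytes: int, target_shard_bytes: int) -> int:
    """Number of destination shards needed so no shard exceeds the target size.

    Hypothetical helper: ceiling division of total index size by desired shard size.
    """
    if target_shard_bytes <= 0:
        raise ValueError("target_shard_bytes must be positive")
    # Integer ceiling division avoids float rounding on very large indexes.
    return max(1, (total_index_bytes + target_shard_bytes - 1) // target_shard_bytes)


def redistribute(placement: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Spread cores evenly across nodes.

    `placement` maps node name -> list of core names; unused nodes must be
    present with an empty list. Returns a new mapping, the input is untouched.
    Assumption: any core may live on any node (no replica-placement rules).
    """
    result = {node: list(cores) for node, cores in placement.items()}
    if not result:
        return result
    while True:
        busiest = max(result, key=lambda n: len(result[n]))
        idlest = min(result, key=lambda n: len(result[n]))
        # Stop once no node holds more than one core above any other node.
        if len(result[busiest]) - len(result[idlest]) <= 1:
            return result
        result[idlest].append(result[busiest].pop())


if __name__ == "__main__":
    gib = 1024 ** 3
    print(shard_count_for_size(100 * gib, 30 * gib))
    print(redistribute({"node1": ["s1r1", "s2r1", "s1r2", "s2r2"], "node2": [], "node3": []}))
```

In a real cluster the move step would be a core migration followed by a cluster state update, which is where the zero-downtime guarantees described above have to be enforced.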


Top 10 Mistakes Made While Learning Solr


  1. Upgrading to the new major version right after its release without waiting for the inevitable .1 release
  2. Explaining your, “I don’t need backups, I can always reindex” statement to your manager during an 8-hour reindexing session
  3. Taking down the whole Data Center with a single rows=1000000000000000 request while singing, “I want it all / I want it now”
  4. In a room full of Solr users wondering out loud why you’re not using Elasticsearch instead
  5. Splitting shards like it’s 1999
  6. Giving Solr’s JVM all the memory you’ve got and getting paged in the middle of the night
  7. Running hundreds of queries with facet.mincount=0 and facet.limit=-1 and wondering why the YouTube videos you’re trying to watch are being buffered
  8. Using shards=1 and replicationFactor=1 and wondering why only a single node in your hundred-node cluster is being used
  9. Optimizing after commits, hard committing every 5 seconds, using openSearcher=true and still wondering why your terminal is all slow
  10. …and last but not least: not taking Sematext Solr guru @kucrafal’s upcoming Solr Training course in October in NYC!
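For anyone who wants to avoid mistakes 3, 7 and 8, here is a small Python sketch that builds a Collections API CREATE request with more than one shard and replica, and a query parameter set with bounded rows and facets. The parameter names (action, name, numShards, replicationFactor, rows, facet.mincount, facet.limit) are standard Solr parameters; the host, the collection name, the field name and the MAX_ROWS cap are assumptions made up for illustration.

```python
from urllib.parse import urlencode

MAX_ROWS = 100  # hypothetical application-level cap on page size


def create_collection_url(base_url: str, name: str, num_shards: int, replication_factor: int) -> str:
    """Build a Collections API CREATE URL that spreads a collection over several nodes."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


def bounded_facet_params(query: str, facet_field: str, rows: int, facet_limit: int = 20) -> dict:
    """Query parameters that never ask Solr for unbounded rows or facet values."""
    return {
        "q": query,
        "rows": min(max(rows, 0), MAX_ROWS),  # clamp whatever the caller asked for
        "facet": "true",
        "facet.field": facet_field,
        "facet.mincount": 1,          # skip zero-count buckets
        "facet.limit": facet_limit,   # instead of -1 (unlimited)
    }


if __name__ == "__main__":
    print(create_collection_url("http://localhost:8983/solr", "products", 4, 2))
    print(bounded_facet_params("*:*", "brand", rows=1000000000000000))
```

The clamp sits in the client here only to keep the sketch short; a shared search service would usually enforce the same limit on the server side as well.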


Solr Training in New York City — October 19-20

For those of you interested in some comprehensive Solr training taught by an expert from Sematext who knows it inside and out, we’re running a super hands-on training workshop in New York City from October 19-20.

This two-day workshop will be taught by Sematext engineer — and author of Solr books — Rafal Kuc.

Target audience:

Developers and DevOps who want to configure, tune and manage Solr at scale.

What you’ll get out of it:

In two days of training Rafal will help:

  • bring Solr novices to the level where they would be comfortable taking Solr to production
  • give experienced Solr users proven and practical advice based on years of experience designing, tuning, and operating numerous Solr clusters to help with their most advanced and pressing issues

* See the Course Outline at the bottom of this post for details

When & Where:

  • Dates:        October 19 & 20 (Monday & Tuesday)
  • Time:         9:00 a.m. — 5:00 p.m.
  • Location:     New Horizons Computer Learning Center in Midtown Manhattan (map)
  • Cost:         $1,200 “early bird rate” (valid through September 1) and $1,500 afterward.  And…we’re also offering a 50% discount for the purchase of a 2nd seat!
  • Food/Drinks: Light breakfast and lunch will be provided


Attendees will go through several sequences of short lectures followed by interactive, group, hands-on exercises. There will be a Q&A session after each such lecture-practicum block.

Got any questions or suggestions for the course? Just drop us a line or hit us @sematext!

Lastly, if you can’t make it…watch this space or follow @sematext — we’ll be adding more Solr training workshops in the US, Europe and possibly other locations in the coming months.  We are also known worldwide for our Solr Consulting Services and Solr Production Support.

Hope to see you in the Big Apple in October!

——-

Solr Training Workshop – Course Outline

  • Introduction to Solr
  1. What is Solr and use cases
  2. Solr master-slave architecture
  3. SolrCloud architecture
  4. Why & When SolrCloud
  5. Solr master-slave vs SolrCloud
  6. Starting Solr with schema-less configuration
  7. Indexing documents
  8. Retrieving documents using URI request
  9. Deleting documents
  • Indexing data


Large Scale Log Analytics with Solr – Presentation Upvoting

If topics like log analytics and Solr are your thing then we may have a treat for you at the upcoming Lucene / Solr Revolution conference in Austin in October.  Two of Sematext’s engineers and Solr, Elasticsearch and ELK stack experts — Rafal Kuc and Radu Gheorghe — have proposed a talk called “Large Scale Log Analytics with Solr” and could use some upvoting from the community to get in on this year’s agenda.

To show your support for “Large Scale Log Analytics with Solr” just click here to vote.  Takes less than a minute!  Even if you don’t attend the conference, we’ll post the slides and video here on the blog…assuming it gets on the agenda.  Voting will close at 11:59pm EDT on Thursday, June 25th.


Talk Summary

This talk is about searching and analyzing time-based data at scale. Documents ranging from blog posts and social media to application logs and metrics generated by smart watches and other “smart” things share a similar pattern: they carry a timestamp among their fields, rarely change once written, and are deleted when they become obsolete.

Very often this kind of data is so large that it causes scaling and performance challenges. We’ll address precisely these challenges, which include:

  1. Properly designing collections architecture
  2. Indexing data fast and without documents waiting in queues for processing
  3. Being able to run queries that include time-based sorting and faceting on enormous amounts of indexed data without killing Solr
  4. …and many more

We’ll start with the indexing pipeline, where you do all your ETL. We’ll show you how to maximize throughput through various ETL tools, such as Flume, Kafka, Logstash and rsyslog, and make them scale and send data to Solr.

On the Solr side, we’ll show all sorts of tricks to optimize indexing and searching: from tuning merge policies to slicing collections based on timestamp. While scaling out, we’ll show how to improve the performance/cost ratio.
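As a rough illustration of what slicing collections based on timestamp can look like, the Python sketch below maps a document timestamp to a daily collection name and lists the collections a time-range query would need to touch. It is an assumption-laden example rather than material from the talk: the daily granularity, the "logs" prefix and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def collection_for(ts: datetime, prefix: str = "logs") -> str:
    """Name of the daily collection a document with this (timezone-aware) timestamp goes to."""
    return f"{prefix}_{ts.astimezone(timezone.utc):%Y-%m-%d}"


def collections_for_range(start: datetime, end: datetime, prefix: str = "logs") -> List[str]:
    """Daily collections covering the inclusive range [start, end], oldest first."""
    if end < start:
        raise ValueError("end must not be before start")
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    names = []
    while day <= last:
        names.append(f"{prefix}_{day:%Y-%m-%d}")
        day += timedelta(days=1)
    return names


if __name__ == "__main__":
    start = datetime(2015, 6, 24, 22, 0, tzinfo=timezone.utc)
    end = datetime(2015, 6, 26, 1, 0, tzinfo=timezone.utc)
    print(collection_for(end))
    print(collections_for_range(start, end))
```

With this layout, recent data lives in small, hot collections, and obsolete data can be dropped by deleting a whole collection instead of deleting documents one by one.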

Thanks for your support!

Side by Side with Elasticsearch and Solr: Performance and Scalability

[Note: We’re holding 2-day, super hands-on training workshops for Elasticsearch AND Solr in New York City on October 19 & 20, 2015.  Click here for Elasticsearch training and click here for Solr training]

[Note #2: post has been updated to include video and slides from the June 2 presentation]

Back by popular demand!  Sematext engineers Radu Gheorghe and Rafal Kuc returned to Berlin Buzzwords on Tuesday, June 2, with the second installment of their “Side by Side with Elasticsearch and Solr” talk.  (You can check out Part 1 here.)

Elasticsearch and Solr Performance and Scalability

This brand new talk, which included a live demo, a video demo and slides, dove deeper into how Elasticsearch and Solr scale and perform. And, of course, they took into account all the goodies that came with these search platforms since last year.

Radu and Rafal showed attendees how to tune Elasticsearch and Solr for two common use-cases: logging and product search.  Then they showed what numbers they got after tuning. There was also some sharing of best practices for scaling out massive Elasticsearch and Solr clusters; for example, how to divide data into shards and indices/collections that account for growth, when to use routing, and how to make sure that coordinating nodes don’t become unresponsive.

Here is the video:

 

…and here are the slides:

 

Feedback & Questions — Bring It On

If you’ve got feedback or questions about topics like Elasticsearch vs. Solr (here’s a detailed comparison) and what’s new and exciting with both applications, just drop us a line.  We live and breathe this stuff, so we’re always happy to hear from like-minded people.

Solr Cookbook, 3rd Edition — Now Available and includes Solr 5.0

[Note: We’re holding a 2-day, super hands-on Solr training workshop in New York City on October 19 & 20, 2015. Click here for details!]

——-

Hot off the press: a brand new Solr Cookbook!  One of Sematext’s Solr and Elasticsearch experts and authors, Rafał Kuć, has just published the third and latest edition of Solr Cookbook.  This edition covers both Solr 4.x (based on the newest 4.10.3 version of Solr) and the just-released Solr 5.0.

Similar to previous Solr Cookbooks, Rafal updated the book significantly — half of the previous content has been changed — and rewrote all of the recipes.


Chapter List

Here’s a list of the chapters:

  1. Apache Solr Configuration
  2. Indexing Your Data
  3. Analyzing Your Text Data
  4. Querying Solr
  5. Faceting
  6. Improving Solr Performance
  7. In the Cloud
  8. Using Additional Solr Functionalities
  9. Dealing with Problems
  10. Real-life Situations

For more information about Solr Cookbook, Third Edition — including info on getting a free chapter — check out the Packt Publishing web page dedicated to it.  The book is available in both electronic and paperback versions.  Even better, here is a discount code you can use for 20% off (valid until March 22, 2015; see details for applying code below*): scte20

Need Some Solr Expertise?

Rafal isn’t the only Solr expert at Sematext; we’ve got several more who have helped 100+ clients to architect, scale, tune, and successfully deploy their Solr-based products.  We also offer 24/7 production support for Solr and Elasticsearch.  Here’s more info about our professional services, which also include Elasticsearch and Logging consulting.  You can also monitor Solr performance (and many other platforms) with SPM Performance Monitoring.

Have some feedback or questions for Rafal?

He’d love to hear from you — get him @kucrafal

——-

* Using discount code:

  1. Set up a free Packt account or log into your existing account
  2. Add the title “Solr Cookbook – Third Edition” in the cart
  3. Click on ‘View Cart’
  4. Then in the “Do you have a promo code?” field enter scte20
  5. Click on the “Apply” button for the discount to get applied

 

Solr vs. Elasticsearch — How to Decide?


by Otis Gospodnetić

[Otis is a Lucene, Solr, and Elasticsearch expert and co-author of “Lucene in Action” (1st and 2nd editions).  He is also the founder and CEO of Sematext. See full bio below.]

“Solr or Elasticsearch?”…well, at least that is the common question I hear from Sematext’s consulting services clients and prospects.  Which one is better, Solr or Elasticsearch?  Which one is faster?  Which one scales better?  Which one can do X, and Y, and Z?  Which one is easier to manage?  Which one should we use?  Which one do you recommend? etc., etc.

These are all great questions, though not always with clear and definite, universally applicable answers. So which one do we recommend you use? How do you choose in the end?  Well, let me share how I see Solr and Elasticsearch past, present, and future, do a bit of comparing and contrasting, and hopefully help you make the right choice for your particular needs.


Early Days: Youth Vs. Experience

Apache Solr is a mature project with a large and active development and user community behind it, as well as the Apache brand.  First released as open source in 2006, Solr has long dominated the search engine space and was the go-to engine for anyone needing search functionality.  Its maturity translates to rich functionality beyond vanilla text indexing and searching, such as faceting, grouping (aka field collapsing), powerful filtering, pluggable document processing, pluggable search chain components, language detection, etc.

