Solr Cookbook, 3rd Edition: Now Available and Covering Solr 5.0

Hot off the press: a brand new Solr Cookbook!  One of Sematext’s Solr and Elasticsearch experts and authors, Rafał Kuć, has just published the third and latest edition of Solr Cookbook.  This edition covers both Solr 4.x (based on 4.10.3, the newest 4.x release) and the just-released Solr 5.0.

As with previous editions of the Solr Cookbook, Rafal updated the book significantly: half of the previous content has changed, and all of the recipes have been rewritten.


Chapter List

Here’s a list of the chapters:

  1. Apache Solr Configuration
  2. Indexing Your Data
  3. Analyzing Your Text Data
  4. Querying Solr
  5. Faceting
  6. Improving Solr Performance
  7. In the Cloud
  8. Using Additional Solr Functionalities
  9. Dealing with Problems
  10. Real-life Situations

For more information about Solr Cookbook, Third Edition — including info on getting a free chapter — check out the Packt Publishing web page dedicated to it.  The book is available in both electronic and paperback versions.  Even better, here is a discount code you can use for 20% off (valid until March 22, 2015; see details for applying code below*): scte20

Need Some Solr Expertise?

Rafal isn’t the only Solr expert at Sematext; we’ve got several more who have helped 100+ clients to architect, scale, tune, and successfully deploy their Solr-based products.  We also offer 24/7 production support for Solr and Elasticsearch.  Here’s more info about our professional services, which also include Elasticsearch and Logging consulting.  You can also monitor Solr performance (and many other platforms) with SPM Performance Monitoring.

Have some feedback or questions for Rafal?

He’d love to hear from you. Reach him at @kucrafal

——-

* Using discount code:

  1. Set up a free Packt account or log into your existing account
  2. Add the title “Solr Cookbook – Third Edition” to the cart
  3. Click on ‘View Cart’
  4. Then, in the “Do you have a promo code?” field, enter scte20
  5. Click the “Apply” button to apply the discount

 

Solr vs. Elasticsearch — How to Decide?

by Otis Gospodnetić

[Otis is a Lucene, Solr, and Elasticsearch expert and co-author of “Lucene in Action” (1st and 2nd editions).  He is also the founder and CEO of Sematext. See full bio below.]

“Solr or Elasticsearch?”…well, at least that is the common question I hear from Sematext’s consulting services clients and prospects.  Which one is better, Solr or Elasticsearch?  Which one is faster?  Which one scales better?  Which one can do X, and Y, and Z?  Which one is easier to manage?  Which one should we use?  Which one do you recommend? etc., etc.

These are all great questions, though not always with clear and definite, universally applicable answers. So which one do we recommend you use? How do you choose in the end?  Well, let me share how I see Solr and Elasticsearch past, present, and future, let’s do a bit of comparing and contrasting, and hopefully help you make the right choice for your particular needs.


Early Days: Youth Vs. Experience

Apache Solr is a mature project with a large and active development and user community behind it, as well as the Apache brand.  First released as open source in 2006, Solr has long dominated the search engine space and was the go-to engine for anyone needing search functionality.  Its maturity translates to rich functionality beyond vanilla text indexing and searching, such as faceting, grouping (aka field collapsing), powerful filtering, pluggable document processing, pluggable search chain components, language detection, etc.


Solr 5: Replication Throttling

With the release of Solr 5.0, the most recent major version of this great search server, we didn’t only get improvements and changes from the Lucene library.  Of course, we did get features like:

  • segment checksums
  • segment identifiers
  • Lucene using only classes from Java NIO.2 package to access files
  • lowered heap usage because of new Lucene50Codec

…but those features came from the Lucene core itself.  Solr introduced:

  • improved usability for start-up scripts
  • scripts for Linux service installation and running
  • distributed IDF calculation
  • ability to register new handlers using the API (with jar uploads)
  • replication throttling
  • …and so on

All of these features come with the first release of branch 5 of Solr, and we can expect even more from future releases — like cross data center replication! We want to start sharing what we know about those features and, today, we start with replication throttling.


Solr Redis Plugin Use Cases and Performance Tests

The Solr Redis Plugin is an extension for Solr that provides a query parser that uses data stored in Redis. It is open-sourced on GitHub by Sematext. This tool is basically a QParserPlugin that establishes a connection to Redis and takes data stored in Redis data structures (sets, sorted sets read via ZRANGE, and others) in order to build a query. The data fetched from Redis is passed to RedisQParser, which is responsible for building the query. Moreover, this plugin provides a highlighter extension which can be used to highlight parts of aliased Solr Redis queries (this will be described in a future post).

Use Case: Social Network

Imagine you have a social network and you want to implement a search solution that can search things like events, interests, photos, and all your friends’ events, interests, and photos. A naive, Solr-only implementation would search over all documents and filter by a “friends” field. This requires denormalizing the data and indexing the full list of friends into each document that belongs to a user. Building such a query is simple: search over the documents and add something like a “friends:1234” clause to the query. It seems simple to implement, but in reality this is a terrible solution when you need to update a list of friends, because every update requires modifying each of the user’s documents. So when the number of documents (e.g., photos, events, interests, friends and their items) connected with a user grows, the number of potential updates rises dramatically and each modification of connections between users becomes a nightmare. Imagine a person with 10 photos and 100 friends (all of whom have their own photos, events, interests, etc.).  When this person gets a 101st friend, the naive system with flattened data would have to update a lot of documents/rows.  As we all know, in a social network connections between people are constantly being created and removed, so such a naive Solr-only system could not really scale.
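To get a rough feel for that update cost, here is a minimal sketch under a simplified model. The model is an assumption, not a measurement: every document stores its owner's full friend list, so a new friendship forces a rewrite of every document owned by either party. The function name and the document counts are hypothetical.

```python
def docs_to_reindex(docs_owned_by_a: int, docs_owned_by_b: int) -> int:
    """Documents that must be rewritten when users A and B become friends.

    Assumed model: each document carries its owner's full friend list,
    so the documents of both owners change when the friendship is created.
    """
    return docs_owned_by_a + docs_owned_by_b


# Hypothetical numbers: A owns 10 photos, B owns 250 items of various kinds.
print(docs_to_reindex(10, 250))  # 260
```

Under this model a single new connection costs as many document updates as the two users own items, and the same cost is paid again when the connection is removed.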

Social networks also have one very important attribute: the number of connections of a single user is typically not expressed in millions. That number is typically relatively small: tens, hundreds, sometimes thousands. This raises the question: why not carry information about user connections in each query sent to the search engine? That way, instead of sending queries with a “friends:1234” clause, we can simply send queries with multiple user IDs connected by an “OR” operator. When a query has all the information needed to search entities that belong to a user’s friends, there is no need to store a list of friends in each user’s document. However, carrying user connections in each query means sending rather large queries to the search engine, each containing many user ID terms (e.g., id:5 OR id:10 OR id:100 OR …) connected by a disjunction operator. As the number of terms grows, the query requests become very big. And that’s a problem, because preparing such a query and sending it to the search engine over the network becomes awkward and slow.
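The query-side approach can be sketched in a few lines of Python. This is only an illustration of how such a disjunction grows with the friend list. The helper name, the "id" field default, and the friend counts are assumptions made for the example, and nothing here talks to Solr.

```python
def build_friends_query(friend_ids, field="id"):
    """Build a Lucene-style disjunction such as 'id:5 OR id:10 OR id:100'."""
    clauses = [f"{field}:{int(fid)}" for fid in friend_ids]
    if not clauses:
        raise ValueError("at least one friend ID is required")
    return " OR ".join(clauses)


# A user with three friends produces a short query string.
small = build_friends_query([5, 10, 100])
print(small)  # id:5 OR id:10 OR id:100

# A hypothetical user with 5,000 connections produces a much larger request.
large = build_friends_query(range(1, 5001))
print(len(large.encode("utf-8")), "bytes of query text")
```

Every clause adds to the request size, so the query text grows linearly with the number of connections, and it has to be rebuilt and sent over the network for each search.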

How Does It Work?

The image below presents how Solr Redis Plugin works.


Videos: Tuning Solr for Logs and Solr Anti-Patterns

If you’re an avid Solr user you’ll want to check out these Lucene / Solr Revolution videos from two of Sematext’s Solr experts: Rafal Kuc and Radu Gheorghe.

Tuning Solr for Logs

Radu talked about Solr performance tuning, which is always nice for keeping your applications snappy and your costs down. This is especially true for logs, social media and other stream-like data that can easily grow into terabyte territory.

(note: there’s no audio between 3:30 and 4:30; we hope to have this fixed soon and it doesn’t materially affect the talk)

Solr Anti-Patterns

Rafal points out common mistakes and roads that should be avoided at all costs when dealing with Solr.

Slides and Summaries

You can find slides of the Solr presentations in this blog post and summaries in this blog post.

Enjoy!

Solr Presentations from Lucene/Solr Revolution 2014

Thanks to everyone who stopped by the Sematext booth at last week’s Lucene/Solr Revolution event in Washington, DC and attended our two talks.

The attendance, questions and interest are very much appreciated.  As a company that prides itself on its Solr expertise (and Elasticsearch expertise too, for that matter), it was nice to spend a couple days talking about search and Big Data challenges, performance monitoring and logging with fellow experts from around the world. Here are the slides for the two talks we gave (summaries of the talks can be found here):

 

  Videos of the talks will be posted here soon.  Hope to see everyone again next year!

Sematext at Lucene/Solr Revolution 2014

Going to Lucene/Solr Revolution next week — November 11-14 — in Washington, DC?  If so…Sematext will be there exhibiting AND giving two talks!  If you are going, stop by our table to say hello.  We can show you the latest versions of SPM Performance Monitoring, Logsene Log Management and Analytics, Site Search Analytics, and, of course, talk about metrics, centralized log management, Lucene, Solr, Elasticsearch, and just about any other search-related topic you might be interested in.  After all, not only have we blogged, given talks and spread the word in all sorts of ways, we’ve also written books on these subjects!

Both of the Sematext engineer talks take place on Friday, November 14.  They are:

Radu Gheorghe will talk about “Tuning Solr for Logs” at 10:15 am

Summary:  Performance tuning is always nice for keeping your applications snappy and your costs down. This is especially the case for logs, social media and other stream-like data that can easily grow into terabyte territory. While you can always use SolrCloud to scale out of performance issues, this talk is about optimizing. The following questions about Solr settings will be answered. How often should you commit and merge? How can you have one collection per day/month/year/etc? What are the performance trade-offs for these options?  There will also be a discussion around choosing the appropriate hardware.  Radu will talk about optimizing the infrastructure when pushing logs to Solr. This includes tuning Apache Flume to handle large flows of logs and overall design options that also apply to other shippers, like Logstash.

Rafal Kuc will talk about “Solr Anti-Patterns” at 10:55 am

Summary:  Working as a consultant, software engineer and helping people in various ways, Rafał has seen multiple patterns in how Solr is used and how it should be used. Consulting on best practices is common, but talking about what NOT to do is not. This talk will point out common mistakes and roads that should be avoided at all costs, covering use cases and guidelines around general configuration pitfalls, data modeling and what to avoid when making your data indexable, and mistakes made when it comes to queries and searching for indexed data. Each use case will be illustrated by a before and after analysis where changes in metrics will be shown to bring a know-how worth remembering.

20% Discount Code

If you currently use a Sematext product or have been a client in the past and want to go, drop us a line for more info.

Hope to see you in DC!
