Solr 5: Replication Throttling

[Note: We’re holding a 2-day, super hands-on Solr training workshop in New York City on October 19 & 20, 2015. Click here for details!]

——-

With the release of Solr 5.0, the most recent major version of this great search server, we didn’t only get improvements and changes from the Lucene library.  Of course, we did get features like:

  • segment checksums
  • segment identifiers
  • Lucene accessing files only through classes from the Java NIO.2 package
  • lower heap usage thanks to the new Lucene50Codec

…but those features came from the Lucene core itself.  Solr introduced:

  • improved usability for start-up scripts
  • scripts for Linux service installation and running
  • distributed IDF calculation
  • ability to register new handlers using the API (with jar uploads)
  • replication throttling
  • …and so on

All of these features come with the first release of branch 5 of Solr, and we can expect even more from future releases — like cross data center replication! We want to start sharing what we know about those features and, today, we start with replication throttling.

Solr Redis Plugin Use Cases and Performance Tests

The Solr Redis Plugin is an extension for Solr that provides a query parser that uses data stored in Redis. It is open-sourced on GitHub by Sematext. The tool is basically a QParserPlugin that establishes a connection to Redis and takes data stored in sets, sorted sets, and other Redis data structures in order to build a query. RedisQParser uses the data fetched from Redis to build the query. The plugin also provides a highlighter extension that can be used to highlight parts of aliased Solr Redis queries (this will be described in a future post).

Use Case: Social Network

Imagine you have a social network and you want to implement a search solution that can search things like events, interests, and photos, both your own and all your friends’. A naive, Solr-only implementation would search over all documents and filter by a “friends” field. This requires denormalization: the full list of friends has to be indexed into each document that belongs to a user. Building a query is then just a matter of searching over documents and adding something like a “friends:1234” clause. It seems simple to implement, but it is a terrible solution when a list of friends changes, because every one of the user’s documents has to be modified. As the number of documents (e.g., photos, events, interests, friends and their items) connected with a user grows, the number of potential updates rises dramatically and each modification of connections between users becomes a nightmare. Imagine a person with 10 photos and 100 friends (all of whom have their own photos, events, interests, etc.). When this person gets a 101st friend, the naive system with flattened data would have to update a lot of documents/rows. Connections between people in a social network are constantly being created and removed, so such a naive Solr-only system could not really scale.

Social networks also have one very important attribute: the number of connections of a single user is typically not expressed in millions. It is usually relatively small: tens, hundreds, sometimes thousands. This raises the question: why not carry information about user connections in each query sent to the search engine? That way, instead of sending queries with a clause like “friends:1234,” we can simply send queries with multiple user IDs connected by an “OR” operator. When a query has all the information needed to search entities that belong to a user’s friends, there is no need to store a list of friends in each user’s document. The downside is that carrying user connections in each query means sending rather large queries to the search engine, each containing many user ID terms (e.g., id:5 OR id:10 OR id:100 OR …) connected by a disjunction operator. As the number of terms grows, the requests become very big. And that’s a problem, because preparing such a query and sending it to the search engine over the network becomes awkward and slow.
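To make the size problem concrete, here is a minimal Python sketch of the "carry the friends in the query" approach. It is only an illustration: the function name `build_friends_query`, the search term "concert", and the way the text clause is combined with the ID clause are assumptions made for this example; only the `id:5 OR id:10 OR …` shape comes from the description above.

```python
from urllib.parse import urlencode


def build_friends_query(text, friend_ids, id_field="id"):
    """Build a query that restricts `text` matches to the given friend IDs.

    Hypothetical helper for illustration: the friend IDs are joined
    with OR, as in "id:5 OR id:10 OR id:100".
    """
    ids = list(friend_ids)
    if not ids:
        raise ValueError("friend_ids must not be empty")
    id_clause = " OR ".join(f"{id_field}:{fid}" for fid in ids)
    return f"({text}) AND ({id_clause})"


# A user with three friends yields a short, readable query.
print(build_friends_query("concert", [5, 10, 100]))
# (concert) AND (id:5 OR id:10 OR id:100)

# The URL-encoded request grows with every additional friend.
for friend_count in (10, 100, 1000):
    query = build_friends_query("concert", range(1, friend_count + 1))
    encoded = urlencode({"q": query})
    print(friend_count, "friends ->", len(encoded), "characters in the request")
```

Running the loop shows the request length growing roughly linearly with the number of friends, which is the overhead that makes this approach awkward once every search request has to carry the full list.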

How Does It Work?

The image below presents how Solr Redis Plugin works.

Videos: Tuning Solr for Logs and Solr Anti-Patterns

If you’re an avid Solr user you’ll want to check out these Lucene / Solr Revolution videos from two of Sematext’s Solr experts: Rafal Kuc and Radu Gheorghe.

Tuning Solr for Logs

Radu talked about Solr performance tuning, which is always nice for keeping your applications snappy and your costs down. This is especially true for logs, social media and other stream-like data that can easily grow into terabyte territory.

(note: there’s no audio between 3:30 and 4:30; we hope to have this fixed soon and it doesn’t materially affect the talk)

Solr Anti-Patterns

Rafal points out common mistakes and roads that should be avoided at all costs when dealing with Solr.

Slides and Summaries

You can find slides of the Solr presentations in this blog post and summaries in this blog post.

Enjoy!

Solr Presentations from Lucene/Solr Revolution 2014

Thanks to everyone who stopped by the Sematext booth at last week’s Lucene/Solr Revolution event in Washington, DC and attended our two talks:

The attendance, questions and interest are very much appreciated.  As a company that prides itself on its Solr expertise (and Elasticsearch expertise too, for that matter), it was nice to spend a couple days talking about search and Big Data challenges, performance monitoring and logging with fellow experts from around the world. Here are the slides for the two talks we gave (summaries of the talks can be found here):

 

Videos of the talks will be posted here soon.  Hope to see everyone again next year!

Sematext at Lucene/Solr Revolution 2014

Going to Lucene/Solr Revolution next week — November 11-14 — in Washington, DC?  If so…Sematext will be there exhibiting AND giving two talks!  If you are going, stop by our table to say hello.  We can show you the latest versions of SPM Performance Monitoring, Logsene Log Management and Analytics, Site Search Analytics, and, of course, talk about metrics, centralized log management, Lucene, Solr, Elasticsearch, and just about any other search-related topic you might be interested in.  After all, not only have we blogged, given talks and spread the word in all sorts of ways, we’ve also written books on these subjects!

Both of the Sematext engineer talks take place on Friday, November 14.  They are:

Radu Gheorghe will talk about “Tuning Solr for Logs” at 10:15 am

Summary:  Performance tuning is always nice for keeping your applications snappy and your costs down. This is especially the case for logs, social media and other stream-like data that can easily grow into terabyte territory. While you can always use SolrCloud to scale out of performance issues, this talk is about optimizing. The following questions about Solr settings will be answered. How often should you commit and merge? How can you have one collection per day/month/year/etc? What are the performance trade-offs for these options?  There will also be a discussion around choosing the appropriate hardware.  Radu will talk about optimizing the infrastructure when pushing logs to Solr. This includes tuning Apache Flume to handle large flows of logs and overall design options that also apply to other shippers, like Logstash.

Rafal Kuc will talk about “Solr Anti-Patterns” at 10:55 am

Summary:  Working as a consultant, software engineer and helping people in various ways, Rafał has seen multiple patterns in how Solr is used and how it should be used. Consulting on best practices is common, but talking about what NOT to do is not. This talk will point out common mistakes and roads that should be avoided at all costs, covering use cases and guidelines around general configuration pitfalls, data modeling and what to avoid when making your data indexable, and mistakes made when it comes to queries and searching for indexed data. Each use case will be illustrated by a before and after analysis where changes in metrics will be shown to bring a know-how worth remembering.

20% Discount Code

If you currently use a Sematext product or have been a client in the past and want to go, drop us a line for more info.

Hope to see you in DC!

Two Lucene/Solr Revolution 2014 Talks Accepted!

We recently got word from Lucene/Solr Revolution 2014 (in Washington, DC from Nov. 11-14) that talks submitted by two Sematext engineers were accepted as part of the Tutorial track!  They are:

In “Tuning Solr for Logs” Radu will discuss Solr settings, hardware options and optimizing the infrastructure pushing logs to Solr.

In “Solr Anti-Patterns” Rafal will point out common Solr mistakes and roads that should be avoided at all costs.  Each of the talk’s use cases will be illustrated with a before and after analysis — including changes in metrics.

You can see more details about both talks in this recent blog post.

The full agenda, including dates and times for the talks, will be available soon on the Lucene/Solr Revolution 2014 web site.

If you do attend one of these talks please stop by and say hello to Radu and Rafal.  Not only do they know Solr inside and out, but they are good guys as well!

Love Solr Enough to Even Want to Attend One of These Talks?

If you enjoy Solr enough to even think of attending these talks — and you’re looking for a new opportunity — then Sematext might be the place for you.  We’re hiring planet-wide and currently looking for Solr and Elasticsearch Engineers, Front end and JavaScript Developers, Developer Evangelists, Full-stack Engineers, and Mobile App Developers.

JOB: Elasticsearch / Solr Engineer

We’ve grown nicely this year.  Our team has a new UI Developer, a new Solr/Elasticsearch Engineer, a new Marketing person, a new Automation Engineer, and this summer we have the first ever Intern.

Like all healthy organizations, we keep growing, and we are now looking for good Search Engineers who know Elasticsearch and/or Solr to join our geographically distributed search consulting team.  You will work remotely, from wherever you are, with smart people spread out across the planet and with an amazing array of companies world-wide on projects that range from just a week or two to several months.

At Sematext, we’ve built several exciting products – from smaller, search-focused products that work with Solr and Elasticsearch, to larger ones like SPM, Search Analytics, and most recently Logsene.  When not building products and running services, we help organizations world-wide with their search and big data needs – from fixing issues and providing production support to building complex search systems from scratch.  Our client list is long with a number of household names on it – from Instagram (Facebook) and Tumblr (Yahoo), Etsy and Shutterstock, to The BBC, Elsevier, Lockheed Martin, Reuters, Library of Congress, etc.  We did this without raising any money.  The demand for our products and services is growing and we are looking for good engineers and good people to join our adventure!

More formally:

Sematext is looking for a responsible, professional individual to join our team of search engineers.

Sematext is a New York-based startup with people spread over multiple continents and several hundred customers from Instagram and Tumblr, Etsy and Shutterstock, to The BBC, Elsevier, Lockheed Martin, Reuters, Library of Congress, etc. We’ve built systems handling over 10,000 QPS and have worked with multi-billion document indices. Our core products are:

In addition to the above products we offer consulting services around open source search and big data.

We are looking for a person who is:

  • Enthusiastic and positive
  • Driven, independent, and professional
  • A good communicator, both written and oral
  • Good with Solr and/or Elasticsearch and hungry to learn more
  • Happy to help organizations make the best of search

As a member of our search team you will get to:

  • Interact with clients world-wide
  • Provide guidance, architecture design, implementation, and support
  • Participate in Solr, Lucene, and Elasticsearch user and development communities
  • Work on Sematext’s search and data analytics products and participate in open-source search projects

This position:

  • Offers a lot of independence, learning, and growth
  • May require a bit of travel here and there, typically in the US and Europe
  • Is open world-wide

Our search team members have written several books about search, regularly give talks at conferences, blog, and participate in open-source projects.
For more info, see 19 things you may like about Sematext.

Interested? Please send your resume to jobs@sematext.com.

For other job openings please see Jobs @ Sematext or even our previous job listings.
