Two Lucene/Solr Revolution 2014 Talks Accepted!

We recently got word from Lucene/Solr Revolution 2014 (in Washington, DC from Nov. 11-14) that talks submitted by two Sematext engineers were accepted as part of the Tutorial track!  They are:

In “Tuning Solr for Logs” Radu will discuss Solr settings, hardware options, and how to optimize the infrastructure pushing logs to Solr.

In “Solr Anti-Patterns” Rafal will point out common Solr mistakes and roads that should be avoided at all costs.  Each of the talk’s use cases will be illustrated with a before and after analysis — including changes in metrics.

You can see more details about both talks in this recent blog post.

The full agenda, including dates and times for the talks, will be available soon on the Lucene/Solr Revolution 2014 web site.

If you do attend one of these talks, please stop by and say hello to Radu and Rafal.  Not only do they know Solr inside and out, but they are good guys as well!

Love Solr Enough to Even Want to Attend One of These Talks?

If you enjoy Solr enough to even think of attending these talks — and you’re looking for a new opportunity — then Sematext might be the place for you.  We’re hiring planet-wide and currently looking for Solr and Elasticsearch Engineers, Front-end and JavaScript Developers, Developer Evangelists, Full-stack Engineers, and Mobile App Developers.

JOB: Elasticsearch / Lucene Engineer (starts in the Netherlands)

In addition to looking for an Elasticsearch / Solr Engineer to join the Sematext team, we are also looking for a Lucene / Elasticsearch Engineer in the EU for a specific project.  This project calls for 6 months of on-site work with our client in the Netherlands.  After 6 months the collaboration with our client would continue remotely if there is more work to be done for the client or, if the client project(s) are over, this person would join our global team of Engineers and Search Consultants and work remotely (we are all very distributed over several countries and continents). This is a position focused on search – it involves working with Elasticsearch, but also requires enough understanding of Lucene to write custom Elasticsearch/Lucene components, such as tokenizers. Here are some of the skills one should have for this job:

  •  knowledge of different types of Lucene queries/filters (boolean, spans, etc.) and their capabilities
  •  experience in extending out-of-the-box Lucene functionality via developing custom queries, scorers, collectors
  •  understanding of how Lucene analyzes documents during indexing, and experience writing custom analyzers (see the sketch after this list)
  •  experience in mapping advanced hierarchical data structures to Lucene fields
  •  experience in scalable distributed open-source search technologies such as Elasticsearch or Solr
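
To give a concrete, purely illustrative idea of the kind of custom component this role involves, here is a minimal sketch of a custom Lucene analyzer that chains a whitespace tokenizer with a lowercase filter. It assumes a Lucene 4.x-style API, and the class name SimpleLowercaseAnalyzer is hypothetical:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical example: split on whitespace, then lowercase each token.
public class SimpleLowercaseAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_40, reader);
    TokenStream result = new LowerCaseFilter(Version.LUCENE_40, source);
    return new TokenStreamComponents(source, result);
  }
}

In an Elasticsearch deployment, a component like this would typically be packaged as a plugin so it can be referenced by name in the index settings.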

The above is not much information to go by, but if this piqued your interest and if you think you are a good match, please fix up your resume and send it to jobs@sematext.com quickly.

Community Voting for Sematext Talks at Lucene/Solr Revolution 2014

The biggest open source conference dedicated to Apache Lucene/Solr takes place in November in Washington, DC.  If you are planning to attend — and even if you are not — you can help improve the conference’s content by voting for your favorite talk topics.  The top vote-getters for each track will be added to the Lucene/Solr Revolution 2014 agenda.

Not surprisingly for one of the leading Lucene/Solr products and services organizations, Sematext has two contenders in the Tutorial track:

We’d love your support to help us contribute our expertise to this year’s conference.  To vote, simply click on the above talk links and you’ll see a “Vote” button in the upper left corner.  That’s it!

To give you a better sense of what Radu and Rafal would like to present, here are their talk summaries:

Tuning Solr for Logs – by Radu Gheorghe

Performance tuning is always nice for keeping your applications snappy and your costs down. This is especially the case for logs, social media and other stream-like data that can easily grow into terabyte territory.

While you can always use SolrCloud to scale out of performance issues, this talk is about optimizing. First, we’ll talk about Solr settings by answering the following questions:

  • How often should you commit and merge?
  • How can you have one collection per day/month/year/etc?
  • What are the performance trade-offs for these options?

Then, we’ll turn to hardware. We know SSDs are fast, especially on cold-cache searches, but are they worth the price? We’ll give you some numbers and let you decide what’s best for your use case.

The last part is about optimizing the infrastructure pushing logs to Solr. We’ll talk about tuning Apache Flume for handling large flows of logs and about overall design options that also apply to other shippers, like Logstash. As always, there are trade-offs, and we’ll discuss the pros and cons of each option.

Solr Anti-Patterns – by Rafal Kuc

Working as consultants and software engineers, and helping people in various ways, we see many patterns in how Solr is used and how it should be used. We usually say what should be done, but we rarely point out which roads should not be taken. That’s why I would like to highlight common mistakes and roads that should be avoided at all costs. During the talk I would like not only to show the bad patterns, but also to show the difference before and after.

The talk is divided into three major sections:

  1. We will start with general configuration pitfalls that people tend to make. We will discuss different use cases, showing the proper path one should take.
  2. Next we will focus on data modeling and what to avoid when making your data indexable. Again, we will see real-life use cases, followed by a description of how to handle them properly.
  3. Finally, we will talk about queries and all the juicy mistakes made when searching indexed data.

Each use case will be illustrated with a before-and-after analysis – we will see how the metrics change, so the talk will bring not only pure facts, but hopefully also know-how worth remembering.

Thank you for your support!

4 Lucene Revolution Talks from Sematext

Bingo! We’re 4 of 4 at Lucene Revolution 2013 – 4 talk proposals and all 4 accepted!  We are hiring just so next year we can attempt getting 5 talks in. ;)  We’ll also be exhibiting at the conference, so stop by.  We will be giving away Solr and Elasticsearch books.  Here’s what we’ll be talking about in Dublin on November 6th and 7th:

In Using Solr to Search and Analyze Logs Radu will be talking about … well, you guessed it – using Solr to analyze logs.  After this talk you may want to run home (or back to the hotel), hack on LogStash or Flume and Solr, and get Solr to eat your logs… but don’t forget we have to keep Logsene well fed.  Feed this beast your logs like we feed it ours and help us avoid getting eaten by our own creation.

Abstract:

Many of us tend to hate or simply ignore logs, and rightfully so: they’re typically hard to find, difficult to handle, and are cryptic to the human eye. But can we make logs more valuable and more usable if we index them in Solr, so we can search and run real-time statistics on them? Indeed we can, and in this session you’ll learn how to make that happen. In the first part of the session we’ll explain why centralized logging is important, what valuable information one can extract from logs, and we’ll introduce the leading tools from the logging ecosystems everyone should be aware of – from syslog and log4j to LogStash and Flume. In the second part we’ll teach you how to use these tools in tandem with Solr. We’ll show how to use Solr in a SolrCloud setup to index large volumes of logs continuously and efficiently. Then, we’ll look at how to scale the Solr cluster as your data volume grows. Finally, we’ll see how you can parse your unstructured logs and convert them to nicely structured Solr documents suitable for analytical queries.

Rafal will teach about Scaling Solr with SolrCloud in a 75-minute session.  Prepare to take lots of notes and to scale your brain both horizontally and vertically while at the same time avoiding split-brain.

Abstract:

Configure your Solr cluster to handle hundreds of millions of documents without even noticing, handle queries in milliseconds, and use Near Real Time indexing and searching with document versioning. Scale your cluster both horizontally and vertically by using shards and replicas. In this session you’ll learn how to make your indexing process blazing fast and make your queries efficient even with large amounts of data in your collections. You’ll also see how to optimize your queries to leverage caches as much as your deployment allows and how to observe your cluster with the Solr administration panel, JMX, and third-party tools. Finally, learn how to make changes to already deployed collections – split their shards and alter their schema by using the Solr API.

Rafal doesn’t like to sleep.  He prefers to write multiple books at the same time and give multiple talks at the same conference.  His second talk is about Administering and Monitoring SolrCloud Clusters – something we and our customers do with SPM all the time.

Abstract:

Even though Solr can run without causing any trouble for long periods of time, it is very important to monitor and understand what is happening in your cluster. In this session you will learn how to use various tools to monitor how Solr is behaving at a high level, but also at the Lucene, JVM, and operating system levels. You’ll see how to react to what you see and how to make changes to configuration, index structure, and shard layout using the Solr API. We will also discuss different performance metrics to which you ought to pay extra attention. Finally, you’ll learn what to do when things go awry – we will share a few examples of troubleshooting and then dissect what was wrong and what had to be done to make things work again.

Otis has aggregation coming out of his ears and dreams about data visualization, timeseries graphs, and other romantic visuals. In Solr for Analytics: Metrics Aggregations at Sematext we’ll share our experience running SPM on top of SolrCloud (vs. HBase, which we currently use).

Abstract:

While Solr and Lucene were originally written for full-text search, they are capable of much more and are increasingly used for Analytics, as Key Value Stores, NoSQL databases, and more. In this session we’ll describe our experience with Solr for Analytics. More specifically, we will describe a couple of different approaches we have taken with SolrCloud for aggregating massive amounts of performance metrics, we’ll share our findings, and we’ll compare SolrCloud with HBase for large-scale, write-intensive aggregations. We’ll also look at several new Solr features in the works that will make Solr even more suitable for Analytics workloads.

See you in Dublin!

ElasticSearch Cache Usage

We’ve been doing a ton of work with ElasticSearch. Not long ago, we had a few situations where ElasticSearch would “eat” all the JVM heap memory we gave it.  It was so hungry, we could not feed it enough memory to keep it happy.  It was insatiable.  After some troubleshooting and looking at SPM for ElasticSearch (btw. we released a new version of the SPM agent earlier this week, so if you don’t have it, go grab agent v1.5.0) we figured out the cause – the default ElasticSearch field data cache settings were not quite right for our deployment. In this post we’ll share our experience on this topic, explain why this was happening, and show how to minimize the negative effect of large field data caches.

ElasticSearch Cache Types

There are two types of caches in ElasticSearch whose behavior you can control. The first is the filter cache. This cache is responsible for caching the results of filters used in your queries. This is very handy, because after a filter is run once, ElasticSearch will subsequently use the values stored in the filter cache, saving precious disk I/O and thus speeding up query execution. There are two main filter cache implementations in ElasticSearch:

  1. node filter cache (default)
  2. index filter cache

The node filter cache is an LRU cache, which means that the least recently used items are evicted when the cache is full. Its size can be limited either to a percentage of the total memory allocated to the Java process or to an exact amount of memory. The second type is the index filter cache. It is not recommended, because in most cases you can’t predict how much memory it will use – that depends on which shards end up on which node. In addition, you can’t control the amount of memory used by the index filter cache; you can only set its expiration time and the maximum number of entries.
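
As an illustration (and assuming the indices.cache.filter.size setting present in ElasticSearch versions of that era; check the documentation of your version for the exact name), limiting the node filter cache size in the configuration file could look like this:

indices.cache.filter.size: 20%

An absolute value such as 512mb can be used instead of a percentage.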

The second type of cache in ElasticSearch is the field data cache. This cache is used mostly for sorting and faceting. It loads all the values of the field you sort or facet on and then performs calculations on the loaded values. You can imagine that the cost of building such a cache for a large amount of data might be very high.  And it is.  Apart from its type (which can be either resident or soft) you can control two additional parameters of the field data cache – the maximum number of entries and the expiration time.

The Defaults

The default filter cache implementation is the node filter cache, with its size capped at 20% of the memory allocated to the Java process. As you can imagine, there is not much to worry about here – if the cache fills up, the appropriate entries are simply evicted.  You can then consider adding more RAM to make the filter cache bigger, or live with the evictions. That’s perfectly acceptable.

On the other hand, the default for the ElasticSearch field data cache is a resident cache with unlimited size. Yes, unlimited. The cost of rebuilding this cache is very high, and you must know how much memory it can use – you have to control your queries and watch which fields you sort on and which fields you facet on.

What Happens When You Don’t Control Your Cache Size?

This is what can happen when you don’t control your field data cache size:

As you can see in the chart above, the field data cache jumped to more than 58 GB, which is enormous. Yes, we got an OutOfMemory exception during that time.

What Can You Do?

There are actually three things you can do to make your field data cache use less memory:

Control its Size and Expiration Time

When using the default, resident field data cache type, you can set its size and expiration time. However, please remember that there are situations where the field data cache needs to hold the values of the very field you are sorting or faceting on. In order to change the field data cache size, you set the following property:

index.cache.field.max_size

It specifies the maximum number of entries in that cache per Lucene segment. It doesn’t limit the amount of memory the field data cache can use, so you have to do some testing to ensure the queries you are running won’t result in an OutOfMemory exception.

The other property you can set is the expiration time.  It defaults to -1, which means that cache entries never expire. In order to change that, you set the following property:

index.cache.field.expire

So if, for example, you would like to have a maximum of 50k field data cache entries per segment and you would like those entries to expire after 10 minutes, you would set the following property values in the ElasticSearch configuration file:

index.cache.field.max_size: 50000
index.cache.field.expire: 10m

Change its Type

The other thing you can do is change the field data cache type from the default resident to soft. Why does that matter? ElasticSearch uses the Google Guava libraries to implement its caches. The soft type wraps cache values in soft references, which means that when memory runs low the garbage collector will clear those references, even if the cached values are still in use. This means that when you start hitting the heap memory limit, the JVM won’t throw an OutOfMemory exception, but will instead release those soft references via the garbage collector. More about soft references can be found at:

http://docs.oracle.com/javase/6/docs/api/java/lang/ref/SoftReference.html
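
If you want to see that behavior in isolation, here is a small, self-contained Java sketch (not ElasticSearch code) that shows a soft reference being cleared as the heap fills up; run it with a small heap, e.g. -Xmx32m, to see the effect:

import java.lang.ref.SoftReference;
import java.util.ArrayList;
import java.util.List;

public class SoftReferenceDemo {
  public static void main(String[] args) {
    // Softly reference a 1 MB buffer, the way a soft cache holds its values.
    SoftReference<byte[]> cached = new SoftReference<byte[]>(new byte[1024 * 1024]);

    // Put pressure on the heap with strongly referenced allocations.
    List<byte[]> pressure = new ArrayList<byte[]>();
    try {
      while (cached.get() != null) {
        pressure.add(new byte[1024 * 1024]);
      }
      System.out.println("Soft reference cleared before any OutOfMemoryError");
    } catch (OutOfMemoryError e) {
      pressure.clear(); // free memory so we can still print
      System.out.println("OutOfMemoryError; soft reference cleared = " + (cached.get() == null));
    }
  }
}

The trade-off is predictability: under memory pressure the cached field data simply disappears and has to be rebuilt, which costs performance, but it keeps the node from dying with an OutOfMemory exception.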

So, in order to change the default field data cache type to soft, you should add the following property to the ElasticSearch configuration file:

index.cache.field.type: soft

Change Your Data

The last thing you can do requires much more effort than just changing the ElasticSearch configuration – you may want to change your data. Look at your index structure, look at your queries, and think. Maybe you can lowercase some string data and thus reduce the number of unique values in a field? Maybe you don’t need your dates to be precise down to the second – perhaps a minute, or even an hour, is enough? Of course, when doing faceting operations you can set the granularity, but the data will still be loaded into memory. So if there are parts of your data that can be changed in a way that lowers memory consumption, you should consider it. As a matter of fact, that is exactly what we did!
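
As a trivial illustration of the last two points, here is a hypothetical helper (not ElasticSearch code) that lowercases string values and truncates timestamps to the hour before indexing; both changes shrink the number of unique values the field data cache has to hold:

import java.util.Locale;

public class FieldValueNormalizer {
  private static final long MILLIS_PER_HOUR = 60L * 60L * 1000L;

  // "New York", "NEW YORK" and "new york" collapse into a single unique term.
  static String normalizeCity(String raw) {
    return raw.trim().toLowerCase(Locale.ROOT);
  }

  // Every timestamp within the same hour collapses into one value, so a field
  // with up to 3,600 unique per-second values per hour keeps just one.
  static long truncateToHour(long epochMillis) {
    return (epochMillis / MILLIS_PER_HOUR) * MILLIS_PER_HOUR;
  }

  public static void main(String[] args) {
    System.out.println(normalizeCity("  NEW York "));
    System.out.println(truncateToHour(System.currentTimeMillis()));
  }
}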

Caches After Some Changes

After we made some changes in our ElasticSearch configuration/deployment, this is what the field data cache usage looked like:

As you can see, the cache dropped from 58 GB down to 37 GB.  Impressive drop!  After these changes we stopped running into OutOfMemory exceptions.

Summary

You have to remember that the default field data cache settings in ElasticSearch may not be appropriate for your deployment. Of course, they may be perfectly fine for yours. You may not need sorting beyond the default based on Lucene scoring, and you may not need faceting on fields with many unique terms. If that’s the case, don’t worry about the field data cache. Likewise, if you have enough memory to hold the field data for your facets and sorting, you don’t need to change anything in the default ElasticSearch cache setup. What you do need to remember is to monitor your JVM heap usage and cache statistics, so you know what is happening in your cluster and can react before things get worse.

One More Thing

The charts you see in the post are taken from SPM (Scalable Performance Monitoring) for ElasticSearch.  SPM is currently free and, as you can see, we use it extensively in our client engagements.  If you give it a try, please let us know what you think and what else you would like to see in it.

@sematext (Like working with ElasticSearch?  We’re hiring!)

Wanted Dead or Alive: Search Engineer with Client-facing Skills

We are on the lookout for a strong Search Engineer with an interest in and ability to interact with clients, and with the potential to build and lead local and/or remote development teams.  By “client-facing” we really mean primarily email, phone, and Skype.

A person in this role needs to be able to:

  • design large scale search systems
  • demonstrate solid knowledge of Solr, ElasticSearch, or both
  • efficiently troubleshoot performance, relevance, and other search-related issues
  • speak and interact with clients

Pluses – beyond pure engineering:

  • ability and desire to expand and lead development/consulting teams
  • ability to think in both business and engineering terms
  • ability to build products and services based on observed client needs
  • ability to present in public, at meetups, conferences, etc.
  • ability to contribute to blog.sematext.com
  • active participation in online search communities
  • attention to detail
  • desire to share knowledge and teach
  • positive attitude, humor, agility

We’re small and growing.  Our HQ is in Brooklyn, but our team is spread over 4 continents.  If you follow this blog you know we have deep expertise in search and big data analytics and that our team members are conference speakers, book authors, Apache members, open-source contributors, etc.  While we are truly international, this particular opening is in New York.  Speaking of New York, some of our New York City clients that we are allowed to mention are Etsy, Gilt, Tumblr, Thomson Reuters, Simon & Schuster (more on http://sematext.com/clients/index.html).

Relevant pointers:

If you are interested, please send over some information about yourself along with your CV, and let’s talk.

Sematext Presenting Open Source Search Safari at ESS 2012

We are continuing our “new tradition” of presenting at Enterprise Search Summit (ESS) conferences.  We presented at ESS in 2011 (see http://blog.sematext.com/2011/11/02/search-analytics-at-enterprise-search-summit-fall-2011-presentation/).  This year we’ll be giving a talk titled Open Source Search Safari, in which Otis (@otisg) will present a number of open-source search solutions – Lucene, Solr, ElasticSearch, and Sensei, plus maybe one or two others (suggestions?).  We’ll also be chained to booth 26, where we’ll be showcasing our search-vendor-neutral Search Analytics service (the service is currently still free – feel free to use it all you want), along with some of our other search-related products.

ESS East will be held May 15-16 in our own New York City.  If you are a past or prospective client or customer of ours, please get in touch if you are interested in attending ESS at a discount.
