Presentation: Tuning Elasticsearch Indexing Pipeline for Logs

Fresh from GeeCON in Krakow…we have another Elasticsearch and Logging manifesto from Sematext engineers — and book authors — Rafal Kuc and Radu Gheorghe.  As with many of their previous presentations, Radu and Rafal go into detail on Elasticsearch, Logstash and Rsyslog topics like:

  • How Elasticsearch, Logstash and Rsyslog work
  • Tuning Elasticsearch
  • Using, scaling, and tuning Logstash
  • Using and tuning Rsyslog
  • Rsyslog with JSON parsing
  • Hardware and data tests
  • …and lots more along these lines

[Note: Video of the talk coming soon to this post!]

If you find this stuff interesting and have similar challenges, then drop us a line to chat about our Elasticsearch and Logging consulting services and Elasticsearch (and Solr, too) production support.  Oh yeah, and we’re hiring worldwide if you are into Logging, Monitoring, Search, or Big Data Analytics as much as Radu and Rafal!

Elasticsearch Training in Berlin – Wednesday, June 3

For those of you interested in some comprehensive Elasticsearch training taught by experts from Sematext who know it inside and out, we’re running an Elasticsearch Intro workshop in Berlin on Wednesday, June 3 (the day after Berlin Buzzwords ends). This full-day, hands-on training workshop will be taught by Sematext engineer — and author of several Elasticsearch booksRafal Kuc.  The workshop is open to anyone, not just folks who attended Berlin Buzzwords.

ES_intro_2

Here are the details:

  • Date:  Wednesday, June 3
  • Time:  9:00 a.m. to 5:00 p.m.
  • Location:  idealo internet GmbH, Ritterstraße 11, 10969 Berlin, Germany (less than 2 km from Buzzwords site)
  • Cost:  EUR 400 (early bird rate, valid through May 25) – EUR 500 afterward – 50% off 2nd seat!
  • Food/Drinks:  Light breakfast, lunch and post-workshop snacks & beverages

Register_Now_2 In this training workshop attendees will go through a series of short lectures followed by exercises and Q&A sessions covering the many aspects of Elasticsearch.  There will also be plenty of opportunities to get production tips & tricks that make things smoother. We are also considering an Elasticsearch Advanced class to be taught simultaneously at the same location.  If this is of interest to you and/or your colleagues, please drop us a line and it could happen! Lastly, if you can’t make it…watch this space.  We’ll be adding more Elasticsearch training workshops in the US, Europe and possibly other locations in the coming months.  We are also known worldwide for our Elasticsearch consulting services and production support if you need help asap. Hope to see you in Berlin!

Top 10 Elasticsearch Metrics to Watch

Elasticsearch is booming.  Together with Logstash, a tool for collecting and processing logs, and Kibana, a tool for searching and visualizing data in Elasticsearch (aka the “ELK” stack), adoption of Elasticsearch continues to grow by leaps and bounds.  When it comes to actually using Elasticsearch, there are tons of metrics generated.  Instead of taking on the formidable task of tackling all-things-metrics in one blog post, we’re going to serve up something that we at Sematext have found to be extremely useful in our work as Elasticsearch consultants, production support providers, and monitoring solution builders: the top 10 Elasticsearch metrics to watch.  This should be especially helpful to those readers new to Elasticsearch, and also to experienced users who want a quick start into performance monitoring of Elasticsearch.

Side note: we’re @sematext, if you’d like to follow us.

Here are the Top 10 Elasticsearch metrics:

  1. Cluster Health – Nodes and Shards
  2. Node Performance – CPU
  3. Node Performance – Memory Usage
  4. Node Performance – Disk I/O
  5. Java – Heap Usage and Garbage Collection
  6. Java – JVM Pool Size
  7. Search Performance – Request Latency and Request Rate
  8. Search Performance – Filter Cache
  9. Search Performance – Field Data Cache
  10. Indexing Performance – Refresh Times and Merge Times

Most of the charts in this piece group metrics either by displaying multiple metrics in one chart or organizing them into dashboards. This is done to provide context for each of the metrics we’re exploring.

To start, here’s a dashboard view of the 10 Elasticsearch metrics we’re going to discuss.

Top_10_dashboard

This dashboard image, and all images in this post, are from Sematext’s SPM Performance Monitoring tool.

Now, let’s dig each of the top 10 metrics one by one and see how to interpret them.

Read more of this post

Recipe: Reindexing Elasticsearch Documents with Logstash

If you’re working with Elasticsearch, it’s very likely that you’ll need to reindex data at some point. The most popular reason is because you need a mapping change that is incompatible with your current mapping. New fields can be added by default, but many changes are not allowed, for example:

  • Want to switch to doc values because field data is taking too much heap? Reindex!
  • Want to change the analyzer of a given field? Reindex!
  • Want to break one great big index into time-based indices? Reindex!

Enter Logstash

A while ago I was using stream2es for reindexing, but if you look at the GitHub page it recommends using Logstash instead. Why? In general, Logstash can do more stuff, here are my top three reasons:

  1. On the input side, you can filter only a subset of documents to reindex
  2. You can add filters to transform documents on their way to the new index (or indices)
  3. It should perform better, as you can add more filter threads (using the -w parameter) and multiple output worker threads (using the workers configuration option)

Show Me the Configuration!

In short, you’ll use the elasticsearch input to read existing data and the elasticsearch output to write it. In between, you can use various filters to change how documents look like.

Input

To read documents, you’ll use the elasticsearch input. You’ll probably want to specify the host(s) to connect to and the index (check the documentation for more options like query):

input {
  elasticsearch {
   hosts => ["localhost"]
   index => "old-index"
  }
}

By default, this will run a match_all query that does a scan through all the documents of the index, fetch pages of 1000, and times out in a minute (i.e. after a minute it won’t know where it left off). All this is configurable, but the defaults are sensible. Scan is good for deep paging (as normally when you fetch a page from 1000000 to 1000020, Elasticsearch fetches 1000020, sorts them, and gives back the last 20) and also works with a “snapshot” of the index (updates after the scan started won’t be taken into account).

Filter

Next, you might want to change documents in their way to the new index. For example, if the data you’re reindexing wasn’t originally indexed with Logstash, you probably want to remove the @version and/or @timestamp fields that are automatically added. To do that, you’ll use the mutate filter:

filter {
 mutate {
  remove_field => [ "@version" ]
 }
}

Output

Finally, you’ll use the elasticsearch output to send data to a new index. The defaults are once again geared towards the logging use-case. If this is not your setup, you might want to disable the default Logstash template (manage_template=false) and use yours:

output {
 elasticsearch {
   host => "localhost"
   protocol => "http"
   manage_template => false
   index => "new-index"
   index_type => "new-type"
   workers => 5
 }
}

Final Remarks

If you want to use time-based indices, you can change index to something like “logstash-%{+YYYY.MM.dd}” (this is the default), and the date would be taken from the @timestamp field. This is by default populated with the time Logstash processes the document, but you can use the date filter to replace it with a timestamp from the document itself:

filter {
 date {
   "match" => [ "custom_timestamp", "MM/dd/YYYY HH:mm:ss" ]
   target => "@timestamp"
 }
}

If your Logstash configuration contains only these snippets, it will nicely shut down when it’s done reindexing.

That’s it! We are happy answer questions or receive feedback – please drop us a line or get us @sematext. And, yes, we’re hiring!

How to use Kibana 4 with Logsene Log Management

Did you know that Logsene provides a complete ELK Stack; i.e., a complete Log management, analytics, exploration, and visualization solution? Logsene currently supports Kibana 3 with complete Kibana 4 support about to be released soon.

Can’t wait to use Kibana 4 with Logsene? No problem – part of the integration is already done and we’ve prepared instructions to run your own Kibana 4 with Logsene:

  • Open Kibana 4 configuration file config/kibana.yml and add Logsene server and Kibana-Index:
    elasticsearch_url: “https://logsene-receiver.sematext.com”
    kibana_index: “LOGSENE_TOKEN_kibana
  • Start Kibana 4 (./bin/kibana) and open the web browser http://localhost:5601 – Kibana 4 asks for an index pattern. Here you need to enter the Logsene token and a daily date pattern separated by an underscore:
    [YOUR-LOGSENE-TOKEN_]YYYY-MM-DD
  • Enter Kibana Index-PatternNow you are ready to set up your visualizations and dashboards in Kibana 4:
Kibana4-Logsene-Syslog

Kibana 4 Dashboard for data stored in Logsene

Perhaps you prefer automation of tasks? We prepared it for you:

That’s all there is to it.  Like what you see here?  Sound like something that could benefit your organization?  Then try Logsene for Free by registering here.  There’s no commitment and no credit card required.  And, if you are a young startup, a small or non-profit organization, or an educational institution, ask us for a discount (see special pricing)!

We are happy answer questions or receive feedback – please drop us a line or get us @sematext.

Elasticsearch Training at GeeCON 2015

[Note: Early Bird pricing ends on Tuesday, May 5!]

For those of you interested in some comprehensive Elasticsearch training taught by experts (and authors of several Elasticsearch books!) who know it inside and out, you are in luck if you are attending — or considering — the GeeCON conference taking place in Krakow from May 13-15.

There will be two full-day training workshops held on May 12 — Elasticsearch Intro and Elasticsearch Advanced — run by Sematext engineers Radu Gheorghe and Rafał Kuć.

You can find the details for each session here, including costs and topics covered:

Elasticsearch Intro

ES_intro_2

Elasticsearch Advanced

ES_advanced_2

In both training workshops attendees will go through a series of short lectures followed by exercises and Q&A sessions covering the many aspects of Elasticsearch.  There will also be plenty of opportunities to get production tips & tricks that make things smoother.

If you can’t make it…watch this space.  We’ll be adding more Elasticsearch training workshops in the US, Europe and possibly other locations in the coming months.  We are also known worldwide for our Elasticsearch consulting services and production support if you need help asap.

Hope to see you in Krakow!

Monitoring rsyslog’s Performance with impstats and Elasticsearch

If you’re using rsyslog for processing lots of logs (and, as we’ve shown before, rsyslog is good at processing lots of logs), you’re probably interested in monitoring it. To do that, you can use impstats, which comes from input module for process stats. impstats produces information like:
input stats, like how many events went through each input
queue stats, like the maximum size of a queue
– action (output or message modification) stats, like how many events were forwarded by each action
– general stats, like CPU time or memory usage

In this post, we’ll show you how to send those stats to Elasticsearch (or Logsene — essentially hosted ELK, our log analytics service) that exposes the Elasticsearch API), where you can explore them with a nice UI, like Kibana. For example get the number of logs going through each input/output per hour:
kibana_graph
More precisely, we’ll look at:
– the useful options around impstats
– how to use those stats and what they’re about
– how to ship stats to Elasticsearch/Logsene by using rsyslog’s Elasticsearch output
– how to do this shipping in a fast and reliable way. This will apply to most rsyslog use-cases, not only impstats

Read more of this post

Follow

Get every new post delivered to your Inbox.

Join 166 other followers