Handling Stack Traces with Logstash

Here at Sematext we use Java (8!) and rely on centralized logging a lot. We like both so much that we regularly share our logging experience with everyone and help others with logging, especially with the ELK stack. Centralized logging plays nice with Java (and anything else that can write pretty logs). However, there is one tricky thing that can be hard to get right: properly capturing exception stack traces. Unlike your typical single-line log events, stack traces span multiple lines and they aren’t always perfectly uniform. As such, most log shippers don’t handle them properly out of the box and typically treat each stack trace line as a separate event – clearly the wrong thing to do (n.b., if you are sending logs to Logsene this way, you are going to hit your plan limits faster, so make sure you send stack traces properly!). Thus, in most cases, a special configuration is needed to get stack traces right. Let’s see how to configure Logstash for this task. We’ll be using Logstash 1.5.

Prerequisites

We start from the following assumptions:

  • you have a working Logstash setup
  • local logs are written to a file named: /var/log/test.log
  • the conversion pattern for log4j/logback/log4j2 is: “%d %p %m%n”

Based on the above pattern, logs should look like this, for example:

2015-03-23 11:11:50,778 INFO Initializing Systems Cache
2015-03-23 11:11:58,666 ERROR Processing request failed
kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.
	at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)
	at kafka.producer.Producer.send(Producer.scala:77)
	at kafka.javaapi.producer.Producer.send(Producer.scala:42)
	...

Extracting Exception Stack Traces Correctly

Stack traces are multiline messages or events. Logstash has the ability to parse a log file and merge multiple log lines into a single event. You can do this using either the multiline codec or the multiline filter, depending on the desired effect. A codec is attached to a single input, while a filter can process events coming from multiple inputs.
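
For completeness, here is a minimal sketch of the filter variant (assuming the multiline filter accepts the same grok-style, timestamp-based pattern used for the codec later in this post):

filter {
  multiline {
    # lines that do NOT start with a timestamp are merged into the previous event
    pattern => "^%{TIMESTAMP_ISO8601}"
    negate => true
    what => "previous"
  }
}

The rest of this post uses the codec, since it keeps the line-merging logic right next to the input that produces the lines.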

The most popular example, which matches log lines that are part of an exception, uses a not-so-simple regexp:

input {
  file {
    path => "/var/log/test.log"
    start_position => "beginning"
    codec => multiline {
      pattern => "(^\d+\serror)|(^.+Exception: .+)|(^\s+at .+)|(^\s+... \d+ more)|(^\s*Caused by:.+)"
      what => "previous"
    }   
  } 
}

If the log line is part of an exception, it is appended to the last received log line. Once a new log line is received, the complete exception is sent to be processed. The downside of this approach is that it might not match all exceptions correctly. Sure, you could keep fine-tuning and expanding the regexp to capture all possible variants of stack trace formats, but… there is a better way:

input {
  file {
    path => "/var/log/test.log"
    start_position => "beginning"
    codec => multiline {
      pattern => "^%{TIMESTAMP_ISO8601}"
      negate => true
      what => "previous"
    }   
  } 
}

The pattern matches log lines that start with a timestamp. Because the negate parameter is set to true, any line that does not match the pattern is appended to the previous one, so everything up to the next timestamped line becomes part of the same event.

Structuring Logs with Grok

To structure the information before storing the event, a filter section should be used for parsing the logs.

filter {
  mutate {
    gsub => [ "message", "\r", "" ]
  }
  grok {
    match => [ "message", "(?m)%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:severity} %{GREEDYDATA:message}" ]
    overwrite => [ "message" ]
  }
  date {
    match => [ "timestamp" , "yyyy-MM-dd HH:mm:ss,SSS" ]
  }
}

output {
  stdout { codec => rubydebug }
}

The output section, in this case, is only used for debugging.  In the next section we show how to actually ship your logs.

First, the mutate filter removes any “\r” characters from the event. Usually, this is something you want to do in order to prevent issues later, when storing and visualizing the logs, where “\r” could be interpreted as a “\n”.

The grok filter splits the event content into 3 parts: timestamp, severity and message (which overwrites the original message). The (?m) at the beginning of the regexp enables multiline matching; without it, only the first line would be matched. The patterns used in the regexp are provided with Logstash and should be used whenever possible to simplify regexps.

By default, the timestamp of the event is the moment when the log line is read from the file. The date filter fixes that by replacing it with the timestamp matched earlier by the grok filter.
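
For illustration only, the ERROR event from the sample log above would come out of this pipeline looking roughly like the following in rubydebug format (the host value is hypothetical and the exact @timestamp depends on your timezone):

{
       "message" => "Processing request failed\nkafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.\n\tat kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)\n\tat kafka.producer.Producer.send(Producer.scala:77)\n\tat kafka.javaapi.producer.Producer.send(Producer.scala:42)\n\t...",
      "@version" => "1",
    "@timestamp" => "2015-03-23T11:11:58.666Z",
          "host" => "example-host",
          "path" => "/var/log/test.log",
     "timestamp" => "2015-03-23 11:11:58,666",
      "severity" => "ERROR"
}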

Shipping the Logs

Finally, now that you’ve got all your logs extracted, even those with multiple lines like exception stack traces, it’s time to ship them to wherever you keep your logs.  Here’s how you’d ship them to Logsene:

output {
  elasticsearch {
    host => "logsene-receiver.sematext.com"
    ssl => true
    port => 443
    index => "YOUR LOGSENE APP TOKEN GOES HERE"
    protocol => "http"
    manage_template => false
  }
}

Final Words

If you are looking for a way to ship logs containing stack traces or other complicated multiline events, Logstash is the simplest way to do it at the moment.

It is written in JRuby, which makes it possible for many people to contribute to the project. The downside of this ease of use and maintainability is that it is not the fastest tool for the job and it is also quite resource hungry (both CPU and memory). Still, depending on the log volume that needs to be shipped, this might not be a problem.

For those looking for tips on how to ship multiline logs with rsyslog or syslog-ng, subscribe to this blog or follow @sematext – tips for handling stack traces with rsyslog and syslog-ng are coming.

If your organization needs help with logging (the ELK stack, but also rsyslog, Fluentd, Flume…), let us know. If you’d like to build tools and products around logs and logging, join us – we’re always looking for good toolmakers worldwide.

Presentation: Tuning Elasticsearch Indexing Pipeline for Logs

Fresh from GeeCON in Krakow…we have another Elasticsearch and Logging manifesto from Sematext engineers — and book authors — Rafal Kuc and Radu Gheorghe.  As with many of their previous presentations, Radu and Rafal go into detail on Elasticsearch, Logstash and Rsyslog topics like:

  • How Elasticsearch, Logstash and Rsyslog work
  • Tuning Elasticsearch
  • Using, scaling, and tuning Logstash
  • Using and tuning Rsyslog
  • Rsyslog with JSON parsing
  • Hardware and data tests
  • …and lots more along these lines

[Note: Video of the talk coming soon to this post!]

If you find this stuff interesting and have similar challenges, then drop us a line to chat about our Elasticsearch and Logging consulting services and Elasticsearch (and Solr, too) production support.  Oh yeah, and we’re hiring worldwide if you are into Logging, Monitoring, Search, or Big Data Analytics as much as Radu and Rafal!

Elasticsearch Training in Berlin – Wednesday, June 3

For those of you interested in some comprehensive Elasticsearch training taught by experts from Sematext who know it inside and out, we’re running an Elasticsearch Intro workshop in Berlin on Wednesday, June 3 (the day after Berlin Buzzwords ends). This full-day, hands-on training workshop will be taught by Sematext engineer and author of several Elasticsearch books, Rafal Kuc. The workshop is open to anyone, not just folks who attended Berlin Buzzwords.

Here are the details:

  • Date:  Wednesday, June 3
  • Time:  9:00 a.m. to 5:00 p.m.
  • Location:  idealo internet GmbH, Ritterstraße 11, 10969 Berlin, Germany (less than 2 km from Buzzwords site)
  • Cost:  EUR 400 (early bird rate, valid through May 25) – EUR 500 afterward – 50% off 2nd seat!
  • Food/Drinks:  Light breakfast, lunch and post-workshop snacks & beverages

In this training workshop attendees will go through a series of short lectures followed by exercises and Q&A sessions covering the many aspects of Elasticsearch. There will also be plenty of opportunities to get production tips & tricks that make things smoother.

We are also considering an Elasticsearch Advanced class to be taught simultaneously at the same location. If this is of interest to you and/or your colleagues, please drop us a line and it could happen!

Lastly, if you can’t make it… watch this space. We’ll be adding more Elasticsearch training workshops in the US, Europe and possibly other locations in the coming months. We are also known worldwide for our Elasticsearch consulting services and production support if you need help asap.

Hope to see you in Berlin!

Top 10 Elasticsearch Metrics to Watch

Elasticsearch is booming.  Together with Logstash, a tool for collecting and processing logs, and Kibana, a tool for searching and visualizing data in Elasticsearch (aka the “ELK” stack), adoption of Elasticsearch continues to grow by leaps and bounds.  When it comes to actually using Elasticsearch, there are tons of metrics generated.  Instead of taking on the formidable task of tackling all-things-metrics in one blog post, we’re going to serve up something that we at Sematext have found to be extremely useful in our work as Elasticsearch consultants, production support providers, and monitoring solution builders: the top 10 Elasticsearch metrics to watch.  This should be especially helpful to those readers new to Elasticsearch, and also to experienced users who want a quick start into performance monitoring of Elasticsearch.

Side note: we’re @sematext, if you’d like to follow us.

Here are the Top 10 Elasticsearch metrics:

  1. Cluster Health – Nodes and Shards
  2. Node Performance – CPU
  3. Node Performance – Memory Usage
  4. Node Performance – Disk I/O
  5. Java – Heap Usage and Garbage Collection
  6. Java – JVM Pool Size
  7. Search Performance – Request Latency and Request Rate
  8. Search Performance – Filter Cache
  9. Search Performance – Field Data Cache
  10. Indexing Performance – Refresh Times and Merge Times

Most of the charts in this piece group metrics either by displaying multiple metrics in one chart or organizing them into dashboards. This is done to provide context for each of the metrics we’re exploring.

To start, here’s a dashboard view of the 10 Elasticsearch metrics we’re going to discuss.

[Dashboard image: the top 10 Elasticsearch metrics in a single view]

This dashboard image, and all images in this post, are from Sematext’s SPM Performance Monitoring tool.

Now, let’s dig into each of the top 10 metrics, one by one, and see how to interpret them.

Recipe: Reindexing Elasticsearch Documents with Logstash

If you’re working with Elasticsearch, it’s very likely that you’ll need to reindex data at some point. The most popular reason is because you need a mapping change that is incompatible with your current mapping. New fields can be added by default, but many changes are not allowed, for example:

  • Want to switch to doc values because field data is taking too much heap? Reindex!
  • Want to change the analyzer of a given field? Reindex!
  • Want to break one great big index into time-based indices? Reindex!

Enter Logstash

A while ago I was using stream2es for reindexing, but if you look at its GitHub page it recommends using Logstash instead. Why? In general, Logstash can do more; here are my top three reasons:

  1. On the input side, you can filter only a subset of documents to reindex
  2. You can add filters to transform documents on their way to the new index (or indices)
  3. It should perform better, as you can add more filter threads (using the -w parameter) and multiple output worker threads (using the workers configuration option)

Show Me the Configuration!

In short, you’ll use the elasticsearch input to read existing data and the elasticsearch output to write it. In between, you can use various filters to change what the documents look like.

Input

To read documents, you’ll use the elasticsearch input. You’ll probably want to specify the host(s) to connect to and the index (check the documentation for more options like query):

input {
  elasticsearch {
   hosts => ["localhost"]
   index => "old-index"
  }
}

By default, this will run a match_all query that does a scan through all the documents of the index, fetches pages of 1000 documents, and times out in a minute (i.e. after a minute it won’t know where it left off). All this is configurable, but the defaults are sensible. Scan is good for deep paging (normally, when you fetch a page from 1,000,000 to 1,000,020, Elasticsearch fetches all 1,000,020 documents, sorts them, and gives back the last 20) and it also works with a “snapshot” of the index (updates made after the scan started won’t be taken into account).
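
For example, a minimal sketch of reading only a subset of documents (reason #1 above) might look like the one below; it assumes the query option accepts the Elasticsearch query DSL as a JSON string, and the type field is purely hypothetical:

input {
  elasticsearch {
    hosts => ["localhost"]
    index => "old-index"
    # hypothetical filter: only reindex documents whose type field is "error"
    query => '{ "query": { "term": { "type": "error" } } }'
  }
}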

Filter

Next, you might want to change documents on their way to the new index. For example, if the data you’re reindexing wasn’t originally indexed with Logstash, you probably want to remove the @version and/or @timestamp fields that are automatically added. To do that, you’ll use the mutate filter:

filter {
 mutate {
  remove_field => [ "@version" ]
 }
}

Output

Finally, you’ll use the elasticsearch output to send data to a new index. The defaults are once again geared towards the logging use-case. If this is not your setup, you might want to disable the default Logstash template (manage_template => false) and use your own:

output {
 elasticsearch {
   host => "localhost"
   protocol => "http"
   manage_template => false
   index => "new-index"
   index_type => "new-type"
   workers => 5
 }
}

Final Remarks

If you want to use time-based indices, you can change index to something like “logstash-%{+YYYY.MM.dd}” (this is the default), and the date would be taken from the @timestamp field. This is by default populated with the time Logstash processes the document, but you can use the date filter to replace it with a timestamp from the document itself:

filter {
 date {
   match => [ "custom_timestamp", "MM/dd/YYYY HH:mm:ss" ]
   target => "@timestamp"
 }
}
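
Putting that together, a minimal sketch of a time-based output (reusing the connection settings from the output above) could look like this:

output {
  elasticsearch {
    host => "localhost"
    protocol => "http"
    manage_template => false
    # the daily index name is derived from the event's @timestamp field
    index => "logstash-%{+YYYY.MM.dd}"
  }
}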

If your Logstash configuration contains only these snippets, it will nicely shut down when it’s done reindexing.

That’s it! We are happy to answer questions or receive feedback – please drop us a line or get us @sematext. And, yes, we’re hiring!

Cassandra Use Case – including Performance Monitoring

For those of you using Cassandra or considering it, we’ve got another Sematext user’s Planet Cassandra case study for your reading pleasure.  In this Q&A/use case, Alain Rodriguez, main Data Architect for Teads.tv, talks about the business and technical needs that drove their decision to use Cassandra, how they are using Cassandra, how their Cassandra monitoring is done, and more.

Here’s an excerpt related to Cassandra performance monitoring that’s especially dear to our heart:

“Thanks to SPM we now have a very good view of what is happening in terms of Cassandra performance.  This kind of application is fully realized on outage or slowness issues.  With any outage we’ve had since we started using SPM, I always found the root cause and have been able to fix or mitigate things.  Any downtime — even for just a few minutes — can lead to hundreds of thousands of dollars lost, plus a negative opinion of Teads by our customers and a negative impact on our image.  Today it is really worth it to invest in a good monitoring solution. I believe SPM belongs to be one of them.”

Here’s the full post: Video Advertising Platform Teads Chose Cassandra, SPM, and OpsCenter to Monitor a Personalized Ad Experience

And here’s our previous client Cassandra use case for Recruiting.com if you’d like another informed take on selecting Cassandra and using SPM for Cassandra monitoring.

If you’d like to try SPM to monitor Cassandra yourself (or any number of applications like Hadoop, HBase, Spark, Kafka, Elasticsearch, Solr, etc.), check out a free 30-day trial by registering here. There’s no commitment and no credit card required. Small startups, startups with no or very little outside funding, non-profit and educational institutions get special pricing – just get in touch with us.

Monitoring Stream Processing Tools: Cassandra, Kafka and Spark

One of the trends we see in our Elasticsearch and Solr consulting work is that everyone is processing one kind of data stream or another. Including us, actually – we process endless streams of metrics, continuous log and event streams, high-volume clickstreams, etc. Kafka is clearly the de facto messaging standard. In the data processing layer, Storm used to be a big hit and, while we still encounter it, we see more and more Spark. What comes out of Spark typically ends up in Cassandra, or Elasticsearch, or HBase, or some other scalable distributed data store capable of high write rates and analytical read loads.

[Figure: a Kafka-Spark-Cassandra pipeline monitored with a mish-mash of tools]

The left side of this figure shows a typical data processing pipeline, and while this is the core of a lot of modern data processing applications, there are often additional pieces of technology involved – you may have Nginx or Apache exposing REST APIs, perhaps Redis for caching, and maybe MySQL for storing information about users of your system.

Once you put an application like this in production, you better keep an eye on it – there are lots of moving pieces and if any one of them isn’t working well you may start getting angry emails from unhappy users or worse – losing customers.

Imagine you had to use multiple different tools, open-source or commercial, to monitor your full stack, and perhaps an additional tool (ELK or Splunk or …) to handle your logs (you don’t just write them to the local FS, compress, and rotate them, do you?). Yuck! But that is what the whole right side of the above figure is about. Each of the tools on that right side is different – they are different open-source projects with different authors, have different versions, are possibly written in different languages, are released on different cycles, use different deployment and configuration mechanisms, etc. There are also a number of arrows there. Some carry metrics to Graphite (or Ganglia or …), others connect to Nagios, which provides alerting, while another set of arrows represents log shipping to the ELK stack. One could then further ask – well, what/who monitors your ELK cluster then?!? (Don’t say Marvel, because then the question is who watches Marvel’s own ES cluster and we’ll get a stack overflow!) That’s another set of arrows going from the Elasticsearch object. This is a common picture. We’ve all seen similar setups!

Our goal with SPM and Logsene is to simplify this picture and thus the lives of the DevOps, Ops, and Engineers who need to manage deployments like this stream processing pipeline. We do that by providing a monitoring and log management solution that can handle all these technologies really well. Those multiple tools, multiple types of UIs, multiple logins… they are a bit of a nightmare, or at least not a very efficient setup. We don’t want that. We want this:

[Figure: the same Kafka-Spark-Cassandra pipeline monitored with SPM and Logsene]

Once you have something like SPM & Logsene in place you can see your complete application stack, all tiers from frontend to backend, in a single pane of glass. Nirvana… almost, because what comes after monitoring? Alerting, anomaly detection, notifications via email, HipChat, Slack, PagerDuty, WebHooks, etc. The reality of DevOps life is that we can’t just watch pretty, colourful, real-time charts – we also need to set up and handle various alerts. But in the case of SPM & Logsene, at least this is all in one place – you don’t need to fiddle with Nagios or some other alerting tool that has to be integrated with the rest of the toolchain.

So there you have it.  If you like what we described here, try SPM and/or Logsene – simply sign up here – there’s no commitment and no credit card required.  Small startups, startups with no or very little outside investment money, non-profit and educational institutions get special pricing – just get in touch with us.  If you’d like to help us make SPM and Logsene even better, we are hiring!
