Encrypting Logs on Their Way to Elasticsearch Part 2: TLS Syslog

In part 1 of the “encrypted logs” series we discussed sending logs to Elasticsearch over HTTPS. This second part is about TLS syslog.

If you wonder what this has to do with Elasticsearch, the point is that TLS syslog is a standard (RFC-5425): any decent version of rsyslog, syslog-ng or nxlog works with it. So you can forward logs over TLS to a recent, “intermediary” rsyslog. Then, you can either use omelasticsearch with HTTPS to ship your logs to Elasticsearch, or you can install rsyslog on an Elasticsearch node (and index logs over HTTP to localhost).

Such a setup will give you the following benefits:

  • it will work with most syslog daemons, because TLS syslog is so widely supported
  • the “intermediate” rsyslog can act as a buffer, taking that pressure off your application servers
  • the “intermediate” rsyslog can be used for processing, like parsing CEE-formatted JSON over syslog, again taking load off your application servers (see the sketch below)
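
To illustrate that last point, here's a minimal sketch (not Logsene's actual setup) of what such an intermediary rsyslog could do: accept remote syslog over TCP and parse CEE-formatted JSON with the mmjsonparse module. The port is a placeholder:

module(load="imtcp")               # accept remote syslog over TCP
module(load="mmjsonparse")         # parses @cee:-tagged JSON payloads into properties

input(type="imtcp" port="10514")   # placeholder port

action(type="mmjsonparse")         # run the JSON parser on every incoming message
# parsed fields are now available as $!fieldname for templates and outputs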

Our log analytics SaaS, Logsene, gives you all the benefits listed above through the syslog endpoint:

TLS syslog flow in Logsene

Client Setup

Before you start, you’ll need a Certificate Authority’s public key, which will be used to validate the encryption certificate from the syslog destination (more about the server side later).

If you’re using Logsene, you can download the CA certificates directly. If you’re on a local setup, or you just want to consolidate your logs before shipping them to Logsene, you can use your own certificates or generate self-signed ones. Here’s a guide to generating certificates that will work with TLS syslog.

With the CA certificate(s) in hand, you can start configuring your syslog daemon. For example, the rsyslog configuration can look like this:

module(load="imuxsock")  # listens for local logs on /dev/log

global(  # global settings
 defaultNetstreamDriver="gtls"  # use TLS driver when it comes to transporting over TCP
 defaultNetstreamDriverCAFile="/opt/rsyslog/ca_bundle.pem"  # CA certificate. Concatenate if you have more
)

action(  # how to send logs
  type="omfwd"                                    # Forward them
  target="logsene-receiver-syslog.sematext.com"   # to Logsene's syslog endpoint
  port="10514"                                    # on port X
  protocol="tcp"                                  # over TCP
  template="RSYSLOG_SyslogProtocol23Format"       # using the RFC-5424 syslog format
  StreamDriverMode="1"                            # via the TLS mode of the driver defined above.
  StreamDriverAuthMode="x509/name"                # Request the machine certificate of the server
  StreamDriverPermittedPeers="*.sematext.com"     # and based on it, just allow Sematext hosts
)

This is the new-style configuration format for rsyslog, which works with version 6 or above. For the pre-v6 format (BSD-style), check out the Logsene documentation. You can also find the syslog-ng equivalent there.

Server Setup

If you’re using Logsene, you might as well stop here, because it handles everything from buffering and indexing to parsing JSON-formatted syslog.

If you’re consolidating logs before sending them to Logsene, or you’re running your local setup, here’s an excellent end-to-end guide to setting up TLS with rsyslog. The basic steps for the server are:

  • use the same CA certificate(s) as the client, so both sides trust the same authority
  • generate the machine’s public-private key pair. You’ll have to provide both keys in the rsyslog configuration
  • set up the TLS rsyslog configuration (a minimal sketch follows below)
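
If the receiving end is also rsyslog, the result can look roughly like the sketch below. It's a minimal example rather than a drop-in config: file paths and the port are placeholders, client authentication is left anonymous, and you'd still add actions to do something with the received logs.

module(load="imtcp"                   # TCP listener for the TLS syslog traffic
       StreamDriver.Mode="1"          # run the stream driver in TLS-only mode
       StreamDriver.AuthMode="anon")  # don't require client certificates

global(
  defaultNetstreamDriver="gtls"                        # use the TLS driver
  defaultNetstreamDriverCAFile="/path/to/ca.pem"       # same CA as on the clients
  defaultNetstreamDriverCertFile="/path/to/cert.pem"   # the machine's public certificate
  defaultNetstreamDriverKeyFile="/path/to/key.pem"     # and its private key
)

input(type="imtcp" port="10514")      # listen for TLS syslog here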

Explore

Once you start logging, the end result should be just like in part 1. You can use Logsene’s hosted Kibana, your own Kibana or the Logsene UI to explore your logs:

Logsene Screenshot

As always, feel free to contact us if you need any help.

Encrypting Logs on Their Way to Elasticsearch

Let’s assume you want to send your logs to Elasticsearch, so you can search or analyze them in real time. If your Elasticsearch cluster is in a remote location (EC2?) or is our log analytics service, Logsene (which exposes the Elasticsearch API), you might need to forward your data over an encrypted channel.

There’s more than one way to forward over SSL, and this post is part 1 of a series explaining how.

Update: part 2 is now available!

Today’s method is about sending data over HTTPS to Elasticsearch (or Logsene), instead of plain HTTP. You’ll need two pieces to achieve this:

  1. a tool that can send logs over HTTPS
  2. the Elasticsearch REST API exposed over HTTPS

You can build your own tool or use existing ones. In this post we’ll show you how to use rsyslog’s Elasticsearch output to do that. For the API, you can use Nginx or Apache as a reverse proxy for HTTPS in front of your Elasticsearch, or you can use Logsene’s HTTPS endpoint.

Rsyslog Configuration

To get rsyslog’s omelasticsearch plugin, you need at least version 6.6. HTTPS support was just added to master, and it’s expected to land in version 8.2.0. Once that is up, you’ll be able to use the Ubuntu, Debian or RHEL/CentOS packages to install both the base rsyslog and the rsyslog-elasticsearch packages you need. Otherwise, you can always install from sources:
- clone from the rsyslog github repository
- run `autogen.sh --enable-elasticsearch && make && make install` (depending on your system, it might ask for some dependencies)

With omelasticsearch in place (the om part comes from output module, if you’re wondering about the weird name), you can try the configuration below to take all your logs from your local /dev/log and forward them to Elasticsearch/Logsene:

# load needed input and output modules
module(load="imuxsock.so") # listen to /dev/log
module(load="omelasticsearch.so") # provides Elasticsearch output capability

# template that will build a JSON out of syslog
# properties. Resulting JSON will be in Logstash format
# so it plays nicely with Logsene and Kibana
template(name="plain-syslog"
         type="list") {
           constant(value="{")
             constant(value="\"@timestamp\":\"")
                 property(name="timereported" dateFormat="rfc3339")
             constant(value="\",\"host\":\"")
                 property(name="hostname")
             constant(value="\",\"severity\":\"")
                 property(name="syslogseverity-text")
             constant(value="\",\"facility\":\"")
                 property(name="syslogfacility-text")
             constant(value="\",\"syslogtag\":\"")
                 property(name="syslogtag" format="json")
             constant(value="\",\"message\":\"")
                 property(name="msg" format="json")
             constant(value="\"}")
         }

# send resulting JSON documents to Elasticsearch
action(type="omelasticsearch"
       template="plain-syslog"
 # Elasticsearch index (or Logsene token)
       searchIndex="YOUR-LOGSENE-TOKEN-GOES-HERE"
 # bulk requests
       bulkmode="on"  
       queue.dequeuebatchsize="100"
 # buffer and retry indefinitely if Elasticsearch is unreachable
       action.resumeretrycount="-1"
 # Elasticsearch/Logsene endpoint
       server="logsene-receiver.sematext.com"
       serverport="443"
       usehttps="on"
)

Exploring Your Data

After restarting rsyslog, you should be able to see your logs flowing in the Logsene UI, where you can search and graph them:

Logsene Screenshot

If you prefer Logsene’s Kibana UI, or you run your own Elasticsearch cluster, you can make your own Kibana connect to the HTTPS endpoint, just like rsyslog or Logsene’s native UI do.

Wrapping Up

If you’re using Logsene, all you need to do is to make sure you add your Logsene application token as the Elasticsearch index name in rsyslog’s configuration.

If you’re running your own Elasticsearch cluster, there are some nice tutorials about setting up reverse HTTPS proxies with Nginx and Apache respectively. You can also try Elasticsearch plugins that support HTTPS, such as the jetty and security plugins.
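
To give an idea of the Nginx variant, terminating SSL in front of Elasticsearch boils down to something like the sketch below. It's only a rough outline (no authentication or other hardening); the server name, certificate paths and upstream address are placeholders:

server {
    listen 443 ssl;
    server_name es.example.com;                      # placeholder hostname

    ssl_certificate     /etc/nginx/ssl/server.crt;   # the proxy's certificate
    ssl_certificate_key /etc/nginx/ssl/server.key;   # and its private key

    location / {
        proxy_pass http://127.0.0.1:9200;            # plain HTTP to the local Elasticsearch node
        proxy_set_header Host $host;
    }
}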

Feel free to contact us if you need any help. We’d be happy to answer any Logsene questions you may have, as well as help you with your local setup through professional services and production support. If you just find this stuff exciting, you may want to join us, wherever you are.

Stay tuned for part 2, which will show you how to use RFC-5425 TLS syslog to encrypt your messages from one syslog daemon to the other.

Video and Presentation: Indexing and Searching Logs with Elasticsearch or Solr

Interested in log indexing using Elasticsearch or Solr?  Also interested in searching and analyzing logs in real time?

This topic really hits home for us, since we released our log analytics tool, Logsene, and we also offer consulting services for logging infrastructure. If you are reading this and looking for a new opportunity, you might be interested to hear that we are hiring worldwide.

If you are into logging like we are, then you will want to check out this presentation delivered by Sematext’s own Radu Gheorghe to the NYC Search, Discovery and Analytics Meetup held recently at Pivotal Labs.  For the purposes of this presentation the term “logs” ranges from server logs and application events to metrics and even social media information.

The presentation has three parts:

  1. Overview of logging tools that play nicely with Elasticsearch and Solr (like Logstash, Apache Flume or rsyslog)
  2. Performance tuning and scaling Elasticsearch and Solr
  3. Demo of an end-to-end solution

Here you go – enjoy!

JOB: Professional Services Lead – Solr and Elasticsearch

We have a great opportunity at Sematext for a person who wants to take the Professional Services Lead role and grow both him/herself in this role as well as grow the whole Professional Services side of the house.  The person in this role will get to learn all aspects of the business from engineering, to speaking with numerous clients and customers, to working with remote team members, even touching on sales and marketing.  This position offers a truly multifaceted view into Sematext and the space that Sematext is in, which is a rich blend of search, big data, analytics, open source, products, services, engineering, support, etc.  The ideal candidate would already be in New York, where Sematext HQ is located, but we are open to people from other locations as well.

REQUIREMENTS
• Experience working with Solr or Elasticsearch
• Plan and coordinate customer engagements from business and technical perspective
• Identify customer pain points, needs, and success criteria at the onset of each engagement
• Provide expert-level consulting and support services and strive to be a trustworthy advisor to a wide range of customers
• Resolve complex search issues involving Solr or Elasticsearch
• Identify opportunities to provide customers with additional value through our products or services
• Communicate high-value use cases and customer feedback to our Product teams
• Participate in open source community by contributing bug fixes, improvements, answering questions, etc.

EXPERIENCE
• BS or higher in Engineering or Computer Science preferred
• 2 or more years of IT Consulting and/or Professional Services experience required
• Exposure to other related open source projects (Hadoop, Nutch, Kafka, Storm, Mahout, etc.) a plus
• Experience with other commercial and open source search technologies a plus
• Enterprise Search, eCommerce, and/or Business Intelligence experience a plus
• Experience working in a startup a plus

Interested? Please send your resume to jobs@sematext.com.

For other job openings please see Jobs @ Sematext or even our previous job listings.

Meetup: Indexing and Searching Logs with Elasticsearch and Solr

If you are into logging and search like we are, and if you are in New York, like some of us are, come to Indexing and Searching Logs with Elasticsearch and Solr on Wednesday at Pivotal Labs office in Manhattan.

Rsyslog 8.1 Elasticsearch Output Performance

Recently, the first rsyslog version 8 release was announced. Major changes in its core should give outputs better performance, and the one for Elasticsearch should benefit a lot. Since we’re using rsyslog and Elasticsearch in Sematext‘s own Logsene, we had to take the new version for a spin.

The Weapon and the Target

For testing, we used a good old i3 laptop with 8GB of RAM. We generated 20 million logs, sent them to rsyslog via TCP and from there to Elasticsearch in the Logstash format, so they could be explored with Kibana. The objective was to stuff as many events per second into Elasticsearch as possible.

Rsyslog Architecture Overview

In order to tweak rsyslog effectively, one needs to understand its architecture, which is not that obvious (although there’s an ongoing effort to improve the documentation). The gist of its architecture is represented in the figure below.

  • you have input modules taking messages (from files, network, journal, etc.) and pushing them to a main queue
  • one or more main queue threads take those events and parse them. By default, they parse syslog formats, but you can configure rsyslog to use message modifier modules to do additional parsing (e.g. CEE-formatted JSON messages). Either way, this parsing generates structured events, made out of properties
  • after parsing, the main queue threads push events to the action queue (or queues, if there are multiple actions and you want to fan out)
  • for each defined action, one or more action queue threads take properties from events according to templates, and make messages that are sent to the destination. In Elasticsearch’s case, the template should build Elasticsearch JSON documents, and the destination is the REST API endpoint

rsyslog message flow

There are two more things to say about rsyslog’s architecture before we move on to the actual test:

  • you can have multiple independent flows (like the one in the figure above) in the same rsyslog process by using rulesets. Think of rulesets as swim-lanes. They’re useful, for example, when you want to process local logs and remote logs in completely separate ways
  • queues can be in-memory, on disk, or a combination called disk-assisted. Here, we’ll use in-memory queues because they’re the fastest. For more information about how queues work, take a look here (a short ruleset-and-queue sketch follows below)
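
To make both points more concrete, here's a rough sketch (not part of the benchmark configuration) of a ruleset with its own in-memory queue, bound to a TCP input so that remote logs get their own swim-lane. The port, queue sizing and the file destination are placeholders:

module(load="imtcp")

# remote logs get their own flow, separate from anything else in this rsyslog
ruleset(name="remote"
        queue.type="LinkedList"      # an in-memory queue dedicated to this ruleset
        queue.size="100000") {
  action(type="omfile" file="/var/log/remote.log")   # or omelasticsearch, etc.
}

input(type="imtcp" port="13514" ruleset="remote")    # bind the input to the ruleset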

Configuration

To generate messages, we used tcpflood, a small and light tool that’s part of rsyslog’s testbench. It generates messages and sends them over to the local syslog via TCP.

Rsyslog received those messages with the imtcp input module, queued them, and forwarded them to Elasticsearch 0.90.7, which was also installed locally. We also tried Elasticsearch 1.0 Beta 1 and got the same results (see below).

The flow of messages in this test is represented in the following figure:

Flow of messages in this test

The actual rsyslog config is listed below. It can be tuned further (for example by using the multithreaded imptcp input module), but we didn’t get significant improvements.

module(load="imtcp")   # TCP input module
module(load="omelasticsearch") # Elasticsearch output module

input(type="imtcp" port="13514")  # where to listen for TCP messages

$MainMsgQueueSize 1000000
$MainMsgQueueDequeueBatchSize 1000
$MainMsgQueueWorkerThreads 2

# template to generate JSON documents for Elasticsearch
template(name="plain-syslog"
         type="list") {
           constant(value="{")
             constant(value="\"@timestamp\":\"")      property(name="timereported" dateFormat="rfc3339")
             constant(value="\",\"host\":\"")        property(name="hostname")
             constant(value="\",\"severity\":\"")    property(name="syslogseverity-text")
             constant(value="\",\"facility\":\"")    property(name="syslogfacility-text")
             constant(value="\",\"syslogtag\":\"")   property(name="syslogtag" format="json")
             constant(value="\",\"message\":\"")    property(name="msg" format="json")
             constant(value="\"}")
         }

action(type="omelasticsearch"
       template="plain-syslog"  # use the template defined earlier
       searchIndex="test-index"
       bulkmode="on"
       queue.dequeuebatchsize="5000"   # ES bulk size
       queue.size="100000"
       queue.workerthreads="5"
       action.resumeretrycount="-1"  # retry indefinitely if ES is unreachable
)

You can see from the configuration that:

  • both main and action queues have a defined size in number of messages
  • both have a number of worker threads that deliver messages to the next step. The action needs more threads because it has to wait for Elasticsearch to reply
  • moving of messages from the queues happens in batches. For the Elasticsearch output, the batch of messages is sent through the Bulk API, which makes queue.dequeuebatchsize effectively the bulk size

Results

We started with default Elasticsearch settings. Then we tuned them to leave rsyslog with a more significant slice of the CPU. We measured the indexing rate with SPM. Here are the average results over 20 million indexed events:

  • with the default Elasticsearch settings, we got 8,000 events per second
  • after setting Elasticsearch up in a more production-like way (5-second refresh interval, increased index buffer size, translog thresholds, etc.; see the sketch after this list), throughput went up to an average of 20,000 events per second
  • in the end, we went berserk: we used in-memory indices and updated the mapping to disable any storing or indexing for any field, to have Elasticsearch do as little work as possible and make room for rsyslog. We got an average of 30,000 events per second. In this scenario, rsyslog was using between 1 and 1.5 of the 4 virtual CPU cores, with tcpflood using 0.5 and Elasticsearch using 2 to 2.5
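
For reference, “production-like” here means settings along these lines, shown in elasticsearch.yml form for the 0.90.x series. The values are illustrative, not the exact ones used in the test:

index.refresh_interval: 5s                   # refresh every 5 seconds instead of every second
indices.memory.index_buffer_size: 30%        # give indexing a bigger share of the heap
index.translog.flush_threshold_ops: 50000    # flush the transaction log less often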

Conclusion

20K EPS on a low-end machine with production-like configuration means Elasticsearch is quick at indexing. This is very good for logs, where you typically have lots of messages being generated, compared to how often you search.

If you need some tool to ship your logs to Elasticsearch with minimum overhead, rsyslog with its new version 8 may well be your best bet.

NOTE: We’re hiring, so if you find this interesting, please check us out. Search, logging and performance is our thing!

5-minute Logstash: Parsing and Sending a Log File

We like Logstash a lot at Sematext, because it’s a good (if not the) Swiss Army knife for logs. Plus, it’s one of the easiest logging tools to get started with, which is exactly what this post is about. In less than 5 minutes, you’ll learn how to send logs from a file, parse them to extract structured fields, and ship them to Logsene, our logging SaaS.

NOTE: Because Logsene exposes the Elasticsearch API, the same steps will work if you have a local Elasticsearch cluster.

Overview

As an example, we’ll take an Apache log, written in its combined logging format. Your Logstash configuration would be made up of three parts:

The Input

The first part of your configuration file would be about your inputs. Inputs are the Logstash modules responsible for ingesting data. You can use the file input to tail your files. There are a lot of options around this input, and the full documentation can be found here. For now, let’s assume you want to send the existing contents of that file, in addition to the new content. To do that, you’d set start_position to beginning. Here’s what the whole input configuration will look like:

input {
  file {
    path => "/var/log/apache.log"
    type => "apache-access"       # a type to identify those logs (will need this later)
    start_position => "beginning"
  }
}

The Filter

Filters are modules that can take your raw data and try to make sense of it. Logstash has lots of such plugins, and one of the most useful is grok. Grok makes it easy for you to parse logs with regular expressions, by assigning labels to commonly used patterns. One such label is called COMBINEDAPACHELOG, which is exactly what we need:

filter {
  if [type] == "apache-access" {   # this is where we use the type from the input section
    grok {
      match => [ "message", "%{COMBINEDAPACHELOG}" ]
    }
  }
}

If you need to use more complicated grok patterns, we suggest trying the grok debugger.
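
For example, a custom pattern for a hypothetical log line such as user=jane action=login status=ok could be built out of the stock USERNAME and WORD patterns (the field names here are made up for illustration):

filter {
  grok {
    match => [ "message", "user=%{USERNAME:user} action=%{WORD:action} status=%{WORD:status}" ]
  }
}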

The Output

To send logs to Logsene (or your own Elasticsearch cluster) via HTTP, you can use the elasticsearch_http output. You’ll need to specify the host and port of an Elasticsearch server.

For Logsene, those would be logsene-receiver.sematext.com and port 80. Another Logsene-specific requirement is to specify the access token for your Logsene app. You can find that token in your Sematext account, under Services -> Logsene.

The complete output configuration would be:

output {
  elasticsearch_http {
    host => "logsene-receiver.sematext.com"
    port => 80
    index => "your Logsene app token goes here"
  }
}

Wrapping Up

To start sending your logs, you’d have to put the three configuration snippets in a file (let’s say, logstash.conf), download Logstash and start the agent:

java -jar logstash-1.3.2-flatjar.jar agent -f logstash.conf

You can add -v to see more verbose output. Once your logs are in, you can start exploring your data by using Kibana or the native Logsene UI.

Video: Using Solr for Logs with Rsyslog, Flume, Fluentd and Logstash

A while ago we published the slides from our talk at Lucene Revolution about using Solr for indexing and searching logs. This topic is of special interest for us, since we’ve released Logsene and we’re also offering consulting services for logging infrastructure. If you’re also into working with search engines or logs, please note that we’re hiring worldwide.

The video for that talk is now available, and you can watch it below. The talk is made of three parts:

  • one that discusses the general concepts of what a log is, structured logging and indexing logs in general, whether it’s Solr or Elasticsearch
  • one that shows how to use existing tools to send logs to Solr: Rsyslog and Fluentd to send structured events (yes, structured syslog!); Apache Flume and Logstash to take unstructured data, make it structured via Morphlines and Grok, and then send it to Solr
  • one that shows how to optimize Solr’s performance for handling logs. From tuning the commit frequency and merge factor to using time-based collections with aliases

Presentation: Introduction to Elasticsearch

During the New York Search, Discovery and Analytics group in November, we did a presentation on Introduction to Elasticsearch. It was about how you can use Elasticsearch to store, search and analyze your data in near-realtime. It included a demo on the basic functionality (index, retrieve, update, delete, search, facet) as well as scaling, and some tips for performance tuning and monitoring your Elasticsearch cluster.

If you attended, you might want to see Gilt’s blog post about it, which includes some nice pictures. If you didn’t attend, or if you just want to refresh your memory, you can find the video (courtesy of g33ktalk) and the slides below.

Logstash Performance Monitoring (1.2.2 vs 1.2.1)

We’re using Logstash to collect some of our own logs here at Sematext, because it can easily forward events to Logsene via the elasticsearch_http output. And we recently upgraded to the latest version 1.2.2.

Since we’re obsessed with performance monitoring, the first question that came up was: did the upgrade make any difference in terms of load? So we did a test to find out.

Context

Logstash runs on a JVM, so we were already monitoring it with SPM for Java Apps. We put it through a steady, moderate load of about 70 events per second for a while, running both 1.2.1 and 1.2.2 on the same machine.

The configuration remained the same for both versions:

  • a file input that was tailing a single file using the multiline codec
  • a couple of grok filters and a geoIP filter, which we’ll talk about in later posts
  • resulting events are fed to Logsene through the elasticsearch_http output, because Logsene exposes the Elasticsearch API (a sketch of this pipeline is shown below)
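
Put together, that configuration looks roughly like the sketch below. It's a simplified outline rather than our exact setup: the path, multiline rules and grok pattern are placeholders, and the Logsene token is elided:

input {
  file {
    path => "/var/log/our-app.log"          # placeholder path
    type => "our-app"
    codec => multiline {                    # join multi-line events (e.g. stack traces)
      pattern => "^\s"                      # lines starting with whitespace...
      what => "previous"                    # ...belong to the previous event
    }
  }
}

filter {
  grok {
    match => [ "message", "%{COMBINEDAPACHELOG}" ]   # placeholder pattern
  }
  geoip {
    source => "clientip"                    # resolve the client IP to a location
  }
}

output {
  elasticsearch_http {
    host => "logsene-receiver.sematext.com"
    port => 80
    index => "LOGSENE-APP-TOKEN-GOES-HERE"  # your Logsene application token
  }
}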

Results

The biggest difference we saw was in memory usage. The new version uses about 30% less memory:

logstash_spm_memory

Next, there was a significant difference in the amount of garbage collection going on. Again, some 30% difference in favor of 1.2.2:

logstash_spm_gc

We also had slightly less CPU usage. The difference wasn’t significant in our case because we didn’t have a lot of traffic: Logstash is installed on every host and there’s no “central” Logstash to process lots of data. I’m sure that with a more complex configuration and/or more traffic, we would have seen a much bigger difference in CPU usage.

Conclusions

Logstash is getting lighter and lighter, which seems to address its only criticism. You definitely get less memory usage and less garbage collection, so we definitely recommend upgrading to 1.2.2. And if you want to monitor its usage, you can always use our SPM. Happy stashing!
