Kafka Poll: Which Version Do You Use?

With Kafka 0.8.2 being released and with the updated SPM for Kafka now monitoring over 100 Kafka metrics, we thought it would be good to see which Kafka versions are being used in the wild.  Kafka 0.7.x was a strong and stable release used by many.  The 0.8.1.x release has been out since March 2014.  Kafka 0.8.2.x has been out for just a little while, but are there any people who are either already using it (we are!) or are about to upgrade to it?

Please tweet this poll and help us spread the word, so we can get good, statistically significant results.  We’ll publish the results here and via @sematext (follow us!) in a week.

Solr Cookbook, 3rd Edition: Now Available and Includes Solr 5.0

Hot off the press: a brand new Solr Cookbook!  One of Sematext’s Solr and Elasticsearch experts and authors, Rafał Kuć, has just published the third and latest edition of Solr Cookbook.  This edition covers both Solr 4.x (based on the newest 4.10.3 version of Solr) and the just-released Solr 5.0.

As with previous editions, Rafal updated the book significantly: half of the previous content has changed, and he rewrote all of the recipes.


Chapter List

Here’s a list of the chapters:

  1. Apache Solr Configuration
  2. Indexing Your Data
  3. Analyzing Your Text Data
  4. Querying Solr
  5. Faceting
  6. Improving Solr Performance
  7. In the Cloud
  8. Using Additional Solr Functionalities
  9. Dealing with Problems
  10. Real-life Situations

For more information about Solr Cookbook, Third Edition — including info on getting a free chapter — check out the Packt Publishing web page dedicated to it.  The book is available in both electronic and paperback versions.  Even better, here is a discount code you can use for 20% off (valid until March 22, 2015; see details for applying code below*): scte20

Need Some Solr Expertise?

Rafal isn’t the only Solr expert at Sematext; we’ve got several more who have helped 100+ clients to architect, scale, tune, and successfully deploy their Solr-based products.  We also offer 24/7 production support for Solr and Elasticsearch.  Here’s more info about our professional services, which also include Elasticsearch and Logging consulting.  You can also monitor Solr performance (and many other platforms) with SPM Performance Monitoring.

Have some feedback or questions for Rafal?

He’d love to hear from you. Reach him at @kucrafal


* Using discount code:

  1. Set up a free Packt account or log into your existing account
  2. Add the title “Solr Cookbook – Third Edition” to the cart
  3. Click on ‘View Cart’
  4. Then in the “Do you have a promo code?” field enter scte20
  5. Click on the “Apply” button to apply the discount


Kafka 0.8.2 Monitoring Support

SPM Performance Monitoring is the first Apache Kafka monitoring tool to support Kafka 0.8.2.  Here are all the details:

Shiny, New Kafka Metrics

Kafka 0.8.2 has a pile of new metrics for all three main Kafka components: Producers, Brokers, and Consumers.  Not only does it have a lot of new metrics, the whole metrics part of Kafka has been redone — we worked closely with Kafka developers for several weeks to bring order and structure to all Kafka metrics and make them easy to collect, parse and interpret.

We could list all the Kafka metrics you can get via SPM, but in short — SPM monitors all Kafka metrics and, as with all things SPM monitors, all these metrics are nicely graphed and are filterable by server name, topic, partition, and everything else that makes sense in Kafka deployments.

103 Kafka metrics:

  • Broker: 43 metrics
  • Producer: 9 metrics
  • New Producer: 38 metrics
  • Consumer: 13 metrics

You will be hard-pressed to find another solution that can monitor that many Kafka metrics out of the box! And if you want to do something with your Kafka logs, Logsene will gladly make them searchable for you!

Needless to say, SPM shows the most sought-after Kafka metric: the Consumer Lag (see the screenshot below).

Screenshot – Kafka Metrics Overview


Screenshot – Consumer Lag
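
Consumer Lag is simply the gap between the newest offset in a partition and the offset a consumer has reached. Here is a minimal Python sketch of that arithmetic, assuming the offsets have already been fetched from somewhere. The function name, the "events" topic, and the offset values are made up for illustration, and this is not how SPM collects the metric:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Return per-partition lag: newest broker offset minus consumer offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption: a partition with no committed offset counts from offset 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical offsets for a topic named "events" with three partitions.
log_end = {("events", 0): 1500, ("events", 1): 1480, ("events", 2): 900}
committed = {("events", 0): 1500, ("events", 1): 1200}

lag = consumer_lag(log_end, committed)
total_lag = sum(lag.values())
print(lag, total_lag)
```

A lag that keeps growing usually means consumers are not keeping up with producers, which is why this is the number people tend to graph and alert on first.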


Monitoring Kafka in Context

Running Kafka alone is pointless. On one side you process or collect data and push it into Kafka.  On the other side you consume that data (maybe processing it some more) and in the end this data typically ends up landing in some data store. Kafka is often used with data processing frameworks like Spark, Storm and Hadoop, or data stores like Cassandra and HBase, search engines like Elasticsearch and Solr, and so on.  Wouldn’t it be nice to have a single place to monitor all of these systems?  With alerts and anomaly detection?  And letting you collect and search all their logs?  Guess what?  SPM and Logsene do exactly that — they can monitor all of these technologies and make all their logs searchable!

Take a Test Drive — It’s Easy and Free to Get Started

Like what you see here?  Sound like something that could benefit your organization?  Then try SPM for Free for 30 days by registering here.  There’s no commitment and no credit card required.

HAProxy Monitoring Support

New functionality is rolling out in SPM Performance Monitoring!  Watch this space for future posts on Transaction Tracing, Global and App-specific Server Views, Kafka 0.8.2 monitoring and other cool stuff.  For this post, those of you who use HAProxy are in luck as we just added monitoring support for this popular TCP/HTTP load balancer.

See also: Apache monitoring, and Nginx & Nginx Plus monitoring.

Screenshot – HAProxy Session Rate

HAProxy Metrics

SPM collects key HAProxy metrics for the load balancer and its underlying proxies/servers, as you can see in the table below.

Metric Name   Description
status        1 (UP/OPEN), 0 (DOWN)
downtime      total downtime (in seconds)
rate          number of sessions per second over last elapsed second
rate_max      max number of new sessions per second
rate_lim      limit on new sessions per second
scur          current sessions
smax          max sessions
slimit        sessions limit
stot          total sessions
lbtot         total number of times a server was selected
bin           bytes in
bout          bytes out
dreq          denied requests
dresp         denied responses
ereq          error requests
eresp         response errors
econ          connection errors
wretr         retries (warning)
wredis        redispatches (warning)
weight        server weight (server), total weight (backend)
act           server is active (server), number of active servers (backend)
bck           server is backup (server), number of backup servers (backend)

You can create threshold-based alerts or machine learning-based anomaly detection rules on any of these metrics, of course, and you can also rely on heartbeat alerts to detect any HAProxy daemon going down.  Alerts can be emailed, or you can use any of the SPM Alerts Integrations such as PagerDuty, HipChat, Slack, Nagios, or any other WebHook.
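
To give a feel for what a threshold check on these metrics looks like, here is a rough Python sketch that parses HAProxy-style CSV stats and flags rows that are down or close to their session limit. The sample data is hypothetical and trimmed to a handful of columns, the column names are HAProxy's own CSV names (so the session limit appears as slim rather than slimit), and the 80% ratio is an arbitrary assumption. It illustrates the idea only and is not SPM's alerting logic:

```python
import csv
import io

# Hypothetical, trimmed-down sample of HAProxy's CSV stats output.
# Real output has many more columns; only a few are kept here.
SAMPLE_STATS = """# pxname,svname,scur,smax,slim,stot,status
web,FRONTEND,12,40,2000,5310,OPEN
app,server1,5,20,100,2600,UP
app,server2,0,18,100,2400,DOWN
app,BACKEND,5,38,200,5000,UP
"""


def parse_stats(text):
    """Parse CSV stats text into a list of dicts keyed by column name."""
    # The header line starts with "# ", which is stripped before parsing.
    cleaned = text.lstrip("# ")
    return list(csv.DictReader(io.StringIO(cleaned)))


def find_problems(rows, session_ratio=0.8):
    """Return (proxy, server, reason) tuples for rows that breach a threshold."""
    problems = []
    for row in rows:
        if row["status"] == "DOWN":
            problems.append((row["pxname"], row["svname"], "down"))
        limit = int(row["slim"] or 0)
        if limit and int(row["scur"]) >= session_ratio * limit:
            problems.append((row["pxname"], row["svname"], "sessions near limit"))
    return problems


rows = parse_stats(SAMPLE_STATS)
print(find_problems(rows))
```

With the sample above, only server2 in the app backend is reported, because it is marked DOWN and no row is near its session limit.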

See for Yourself

You can check out SPM’s live demo and see some more of SPM’s monitoring, alerting and anomaly detection functionality.  In addition to native monitoring for apps like Solr, Elasticsearch, Hadoop, HBase, Spark, Cassandra, Kafka, Storm, and many more, SPM also integrates with Logsene Log Management and Analytics to add centralized logging functionality and correlation of metrics, logs, alerts, anomalies, and events.

Take a Test Drive — It’s Easy and Free to Get Started

Like what you see here?  Sound like something that could benefit your organization?  Then try SPM (and Logsene, too) for Free for 30 days by registering here.  There’s no commitment and no credit card required.

Using Elasticsearch Mapping Types to Handle Different JSON Logs

By default, Elasticsearch does a good job of figuring out the type of data in each field of your logs. But if you like your logs structured like we do, you probably want more control over how they’re indexed: is time_elapsed an integer or a float? Do you want your tags analyzed so you can search for big in big data? Or do you need them not_analyzed, so you can show top tags via the terms aggregation? Or maybe both?

In this post, we’ll look at how to use index templates to manage multiple types of logs across multiple indices. Also, we’ll explain how to use logging tools (such as Logstash and rsyslog) to handle JSON logging and specify types.

Elasticsearch Mapping and Logs

As you may already know, to control these things in Elasticsearch you’ll need to define a mapping. This works similarly in Logsene, our log analytics SaaS, because it uses Elasticsearch and exposes its API.

With logs you’ll probably use time-based indices, because they scale better (in Logsene, for instance, you get daily indices). That said, to make sure the mapping you define today applies to the index you create tomorrow, you need to define it in an index template.
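
To see why a template covers tomorrow's index, it helps to think of the template's pattern as a wildcard that is checked against each new index name. Here's a small Python sketch of that idea. The "logs-" prefix and the date format are hypothetical (your index naming may differ), and fnmatch only approximates the simple wildcard matching Elasticsearch does:

```python
from datetime import date, timedelta
from fnmatch import fnmatch

TEMPLATE_PATTERN = "logs-*"  # hypothetical index pattern


def daily_index(day, prefix="logs-"):
    """Build a hypothetical daily index name such as logs-2015.02.20."""
    return prefix + day.strftime("%Y.%m.%d")


today = date(2015, 2, 20)
tomorrow = today + timedelta(days=1)

for day in (today, tomorrow):
    name = daily_index(day)
    # A template applies to any newly created index whose name matches its pattern.
    print(name, fnmatch(name, TEMPLATE_PATTERN))
```

Both names match the pattern, so a mapping defined once in the template is picked up by every daily index as it gets created.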

Managing Multiple Types

Mappings provide a nice abstraction when you have to deal with multiple types of structured data. Let’s say you have two apps generating logs of different structures: both have a timestamp field, but one recording logins has a user field, and another one recording purchases has an amount field.

To deal with this, you can define the timestamp field in the _default_ mapping which applies to all types. Then, in each type’s own mapping we’ll define fields unique to that mapping. The following snippet is an example that works with Logsene, provided that aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee is your Logsene app token. If you roll your own Elasticsearch, you can use whichever name you want, and make sure the template applies to your index pattern.

curl -XPUT 'logsene-receiver.sematext.com/_template/aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee_MyTemplate' -d '{
 "template" : "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee*",
 "order" : 21,
 "mappings" : {
  "_default_" : {
   "properties" : {
    "timestamp" : { "type" : "date" }
   }
  },
  "firstapp" : {
   "properties" : {
    "user" : { "type" : "string" }
   }
  },
  "secondapp" : {
   "properties" : {
    "amount" : { "type" : "long" }
   }
  }
 }
}'

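If you have more than a couple of log types, you may prefer to generate the template body instead of writing it by hand. Here's one way that could look in Python. The build_template helper and its parameter names are ours, invented for this example; the structure it produces mirrors the template shown above:

```python
import json


def build_template(index_pattern, shared_fields, type_fields, order=21):
    """Build an index template body with a _default_ mapping plus one mapping per type."""
    mappings = {"_default_": {"properties": shared_fields}}
    for type_name, fields in type_fields.items():
        mappings[type_name] = {"properties": fields}
    return {"template": index_pattern, "order": order, "mappings": mappings}


template = build_template(
    "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee*",
    shared_fields={"timestamp": {"type": "date"}},
    type_fields={
        "firstapp": {"user": {"type": "string"}},
        "secondapp": {"amount": {"type": "long"}},
    },
)

# The resulting JSON string can serve as the body of the template PUT request.
print(json.dumps(template, indent=1))
```

Adding a third app then means adding one more entry to type_fields, while the shared timestamp field stays defined in a single place.
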
Sending JSON Logs to Specific Types

When you send a document to Elasticsearch by using the API, you have to provide an index and a type. You can use an Elasticsearch client for your preferred language to log directly to Elasticsearch or Logsene this way. But I wouldn’t recommend this, because then you’d have to manage things like buffering if the destination is unreachable.
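
To make the buffering concern concrete, here's a rough Python sketch of what logging straight from the application would involve. The index_request and BufferedSender names, the send callable, and the 1000-document cap are all hypothetical. The URL shape (index, then type) is the part that comes from the Elasticsearch API:

```python
import json
from collections import deque


def index_request(host, index, doc_type, document):
    """Return the (url, body) pair for indexing one document with an auto-generated ID."""
    url = "http://{}/{}/{}".format(host, index, doc_type)
    return url, json.dumps(document)


class BufferedSender:
    """Keep documents in memory until the caller-supplied send function succeeds."""

    def __init__(self, send, max_buffered=1000):
        self.send = send
        self.buffer = deque(maxlen=max_buffered)  # oldest entries drop when full

    def log(self, url, body):
        self.buffer.append((url, body))
        self.flush()

    def flush(self):
        while self.buffer:
            url, body = self.buffer[0]
            try:
                self.send(url, body)
            except OSError:
                return  # destination unreachable: keep everything for a later retry
            self.buffer.popleft()
```

Even this toy version has to decide what to drop when the buffer fills up and when to retry, and it still loses everything if the application restarts. That is the kind of work a dedicated logging tool already does for you.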

Instead, I’d keep my logging simple and use a specialized logging tool, such as Logstash or rsyslog, to do the hard work for me. Logging to a file is usually the easiest option. It’s local, and you can have your logging tool tail the file and send contents over the network. I usually prefer sockets (like syslog) because they let me configure Logstash/rsyslog to:

  • write events in a human-readable format to a local file I can tail if I need to (usually in development)
  • forward logs without hitting disk if I need to (usually in production)

Whatever you prefer, I think writing to local files or sockets is better than sending logs over the network from your application, unless you’re willing to make a reliability trade-off and use UDP, which gets rid of most complexities.

Opinions aside, here’s a Logstash configuration for tailing a file with JSON logs separated by a newline. Here’s how you’d send those documents to Logsene via the Elasticsearch API:

input {
 file {
  path => "/var/log/test"
  codec => "json"
 }
}

output {
 elasticsearch {
  host => "logsene-receiver.sematext.com"
  port => 80
  index_type => "fileapp"
  protocol => "http"
  manage_template => false
 }
}

Note how the JSON codec does the parsing here, instead of the more expensive and maintenance-heavy approach with grok that we’ve shown in an earlier post on getting started with Logstash. Some applications let you configure the log format, so you can make them write JSON (Apache httpd, for example).
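
If your application doesn't have a JSON log format built in, producing one is usually a few lines of code. As an illustration, here's a minimal Python sketch using the standard logging module that writes one JSON object per line, which is the shape the json codec expects. The field names (timestamp, severity, message) and the "fields" key used to pass extra data are our own choices for this example, not something Logstash requires:

```python
import json
import logging
import time


class JsonLineFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record):
        event = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        # Extra structured fields are passed via logging's `extra` argument
        # under the hypothetical key "fields".
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


def make_logger(path):
    """Return a logger that appends JSON lines to the given file."""
    logger = logging.getLogger("jsonapp")
    handler = logging.FileHandler(path)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

For example, make_logger("/var/log/test").info("purchase", extra={"fields": {"amount": 50}}) would append a line containing the message, the severity, a timestamp, and an amount field that Elasticsearch can index as a number.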

If you want to send JSON over syslog, there’s the JSON-over-syslog (CEE) format that we detailed in a previous post. You can use rsyslog’s JSON parser module to take your structured logs and forward them to Logsene:

module(load="imuxsock")        # can listen to local syslog socket
module(load="omelasticsearch") # can forward to Elasticsearch
module(load="mmjsonparse")     # can parse JSON

action(type="mmjsonparse")  # parse CEE-formatted messages

template(name="syslog-cee" type="list") {  # Elasticsearch documents will contain
  property(name="$!all-json")              # all JSON fields that were parsed
}

action(
  type="omelasticsearch"                       # forward to Elasticsearch or Logsene
  server="logsene-receiver.sematext.com"       # same receiver as in the Logstash example
  serverport="80"
  searchIndex="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"  # placeholder: your Logsene app token
  template="syslog-cee"                        # use the template defined earlier
  bulkmode="on"                                # send logs in batches
  queue.dequeuebatchsize="1000"                # of up to 1000
  action.resumeretrycount="-1"    # retry indefinitely (buffer) if destination is unreachable
)

To send a CEE-formatted syslog message, you can run logger '@cee: {"amount": 50}', for example. Rsyslog would forward this JSON to Elasticsearch or Logsene via HTTP. Note that Logsene also supports CEE-formatted JSON over syslog out of the box if you want to use a syslog protocol instead of the Elasticsearch API.
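
The CEE format itself is tiny: the message starts with the @cee: cookie, and the rest is JSON. Here's a short Python sketch of producing and recognizing such messages. The to_cee and from_cee helper names are ours, and the parsing is a simplified stand-in for what a JSON parser module does, not its actual implementation:

```python
import json

CEE_COOKIE = "@cee:"


def to_cee(event):
    """Serialize a dict as a CEE-formatted syslog message body."""
    return CEE_COOKIE + " " + json.dumps(event)


def from_cee(message):
    """Return the parsed JSON of a CEE message, or None if the cookie is missing."""
    if not message.startswith(CEE_COOKIE):
        return None
    return json.loads(message[len(CEE_COOKIE):])


message = to_cee({"amount": 50})
print(message)
print(from_cee(message))
print(from_cee("plain syslog text"))
```

Messages without the cookie are left alone, which is what lets structured and plain syslog messages share the same socket.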

Filtering by Type

Once your logs are in, you can filter them by type (via the _type field) in Kibana:

Screenshot – Type Filtering with Kibana

However, if you want more refined filtering by source, we suggest using a separate field for storing the application name. This can be useful when you have different applications using the same logging format. For example, both crond and postfix use plain syslog.
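
The same filtering can be done through the search API. Below is a minimal Python sketch that builds the request body, assuming the Elasticsearch 1.x query DSL (the filtered query). The filter_query helper and the "app" field name are hypothetical, and an application-name field like that would need to be not_analyzed for exact matches to work as expected:

```python
import json


def filter_query(field, value):
    """Build a filtered query that keeps only documents where field equals value."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"term": {field: value}},
            }
        }
    }


by_type = filter_query("_type", "firstapp")   # filter on the mapping type
by_app = filter_query("app", "postfix")       # filter on a hypothetical application field

print(json.dumps(by_type))
print(json.dumps(by_app))
```

Filtering on a dedicated field works the same way as filtering on _type, which is why adding one is a cheap way to tell apart applications that share a log format.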

If you’re looking for a place to send your logs to, check out Logsene!

Hiring: Full-stack Java Developers

Sematext is looking for strong full-stack developers (remote work is cool!) who:

  • Find creative and elegant solutions, build tools, avoid repetition and boilerplate code
  • Take ownership and push forward; want to help build the team and the organization
  • Like working with data-intense applications, continuous data streams (e.g. metrics, logs, events), visualization and data analytics
  • Want to have fun, enjoy building new things and improving existing ones

Some info about our tech:

  • Java in the backend, with a bit of Akka
  • Java and NodeJS for various SPM agents
  • A series of Machine Learning algorithms for Anomaly Detection
  • HBase and Elasticsearch for storing massive volumes of data (many billions of “rows”… stopped counting long ago)
  • Jetty and Kafka that handle hundreds of thousands of events/metrics/logs/messages per second
  • MySQL, ZooKeeper (obviously)
  • Apache Flume, rsyslog, Logstash, and Kibana
  • Lots of JavaScript in the Bootstrap-based UI layer – jQuery and various other usual suspects
  • Flot for charting (and looking to replace that with something more modern, yes)
  • Solr, but just for search-lucene.com and search-hadoop.com, which are not related to this opening
  • Everything runs on AWS – we own a total of 2 physical servers

Products / Services you’d be building:

  • SPM – monitoring, alerting, anomaly detection.  A lot has been done, and a lot more is in the queue.
  • Logsene – log collection, indexing, searching, alerting, anomaly detection.  A lot of new features are waiting to be built.
  • Search Analytics – it works, it runs, it has customers, but there is so much more value we can extract from query and click data!
  • NewAppHere – can’t talk about it, but it’s going to be big… and not only in Japan

Good-to-have skills:

  • Java – for various backend components
  • JavaScript (frameworks/libraries) – if you’re truly full-stack
  • We can teach you or at least help you catch up quickly with everything else mentioned above as well as custom bits not mentioned here

A bit about Sematext:

  • HQ in NYC, with people in North America, Europe, and Asia
  • Developers with strong open-source backgrounds
  • Deep expertise in Solr and Elasticsearch – we are also a leading provider of consulting and support for tons of clients
  • Some of our engineers give talks at conferences around the world and write books
  • We are totally self-funded, financially independent, and profitable
  • Chirping via @sematext

Got more questions?  Send them our way!  Better yet, send your resume!

