Solr Redis Plugin Use Cases and Performance Tests

The Solr Redis Plugin is an extension for Solr that provides a query parser that uses data stored in Redis. It is open-sourced on Github by Sematext. This tool is basically a QParserPlugin that establishes a connection to Redis and takes data stored in SET, ZRANGE and other Redis data structures in order to build a query. Data fetched from Redis is used in RedisQParser and is responsible for building a query. Moreover, this plugin provides a highlighter extension which can be used to highlight parts of aliased Solr Redis queries (this will be described in a future).

Use Case: Social Network

Imagine you have a social network and you want to implement a search solution that can search things like: events, interests, photos, and all your friends’ events, interests, and photos. A naive, Solr-only-based implementation would search over all documents and filter by a “friends” field. This requires denormalization and indexing the full list of friends into each document that belongs to a user. Building a query like this is just searching over documents and adding something like a “friends:1234″ clause to the query. It seems simple to implement, but the reality is that this is a terrible solution when you need to update a list of friends because it requires a modification of each document. So when the number of documents (e.g., photos, events, interests, friends and their items) connected with a user grows, the number of potential updates rises dramatically and each modification of connections between users becomes a nightmare. Imagine a person with 10 photos and 100 friends (all of which have their photos, events, interests, etc.).  When this person gets the 101th friend, the naive system with flattened data would have to update a lot of documents/rows.  As we all know, in a social network connections between people are constantly being created and removed, so such a naive Solr-only system could not really scale.

Social networks also have one very important attribute: the number of connections of a single user is typically not expressed in millions. That number is typically relatively small — tens, hundreds, sometimes thousands. This begs the question: why not carry information about user connections in each query sent to a search engine? That way, instead of sending queries with clause “friends:1234,” we can simply send queries with multiple user IDs connected by an “OR” operator. When a query has all the information needed to search entities that belong to a user’s friends, there is no need to store a list of friends in each user’s document. Storing user connections in each query leads to sending of rather large queries to a search engine; each of them containing multiple terms containing user ID (e.g., id:5 OR id:10 OR id:100 OR …) connected by a disjunction operator. When the number of terms grows the query requests become very big. And that’s a problem, because preparing it and sending it to a search engine over the network becomes awkward and slow.

How Does It Work?

The image below presents how Solr Redis Plugin works.

Read more of this post

Apache Spark Monitoring in SPM

Apache Spark is an open-source, large-scale data processing engine built on top of the Hadoop Distributed File System (HDFS) and enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk.  So it’s not surprising the usage of Spark is booming as this Google Trends graph shows.

And while Spark usage has been going through the roof, Engineers and DevOps handling Spark have not had a good monitoring tool at their disposal.  Well, that is, until now.  By releasing the first Spark monitoring product to market Sematext has, with the addition of Spark monitoring to SPM Performance Monitoring, Alerting and Anomaly Detection, just filled a big hole in the Spark ecosystem.

Having just been added — along with other goodies — to the latest SPM release, SPM for Spark monitors all Spark metrics.  It includes alerting, anomaly detection, log correlation, custom dashboards, events graphing, custom metrics, and a ton more.  SPM can be installed On Premises or one can use the Cloud version run by Sematext, in which case the setup takes less than 5 minutes before graphs with performance metrics start appearing in real-time.

Enough with the words – Show me what Spark Monitoring looks like!

Have a look at a few screenshots to see how we graph Spark metrics in SPM.  While we don’t use Spark at Sematext at this time and thus don’t have a live demo to show you, you can check out SPM’s live demo and see some other types of apps we monitor, such as Hadoop, HBase, Cassandra, Kafka, Storm, ZooKeeper, Elasticsearch, Solr, NGINX and NGINX Plus, Apache, MySQL, Redis, Java webapps and generic Java applications, as well as custom metrics.

Screenshot – Spark Executor metrics [click to enlarge]


Screenshot – Spark Worker metrics  [click to enlarge]


And One More Thing…

SPM now works hand-in-hand with Logsene Log Management and Analytics.  This makes the integration of performance metrics, logs, events and anomalies more robust for those of you looking to combine performance monitoring and centralized log management in one place — not only knowing that SOMETHING affected performance of your Spark cluster when you look at your performance metrics graphs or get an alert, but also exactly WHAT happened with the cluster by having immediate access to all relevant Spark event logs right there!

Take a Test Drive — It’s Easy and Free to Get Started

Like what you see here?  Sound like something that could benefit your organization?  Then try SPM and/or Logsene for Free for 30 days by registering here.  There’s no commitment and no credit card required.

Announcement: Percentiles added to SPM

In the spirit of continuous improvement, we are happy to announce that percentiles have recently been added to SPM’s arsenal of measurement tools.  Percentiles provide more accurate statistics than averages, and users are able to see 50%, 95% and 99% percentiles for specific metrics and set both regular threshold-based as well as anomaly detection alerts.  We will go more into the details about how the percentiles are computed in another post, but for now we want to put the word out and show some of the related graphs — click on them to enlarge them.  Enjoy!

Elasticsearch – Request Rate and Latency


Garbage Collectors Time


Kafka – Flush Time


Kafka – Fetch/Produce Latency 1


Kafka – Fetch/Produce Latency 2


Solr Req. Rate and Latency 1


Solr – Req. Rate and Latency 2


If you enjoy performance monitoring, log analytics, or search analytics, working with projects like Elasticsearch, Solr, HBase, Hadoop, Kafka, Storm, we’re hiring planet-wide!

Rsyslog 8.1 Elasticsearch Output Performance

Recently, the first rsyslog version 8 release was announced. Major changes in its core should give outputs better performance, and the one for Elasticsearch should benefit a lot. Since we’re using rsyslog and Elasticsearch in Sematext‘s own Logsene, we had to take the new version for a spin.

The Weapon and the Target

For testing, we used a good-old i3 laptop, with 8GB of RAM. We generated 20 million logs, sent them to rsyslog via TCP and from there to Elasticsearch in the Logstash format, so they can get explored with Kibana. The objective was to stuff as many events per second into Elasticsearch as possible.

Rsyslog Architecture Overview

In order to tweak rsyslog effectively, one needs to understand its architecture, which is not that obvious (although there’s an ongoing effort to improve the documentation). The gist of it its architecture represented in the figure below.

  • you have input modules taking messages (from files, network, journal, etc.) and pushing them to a main queue
  • one or more main queue threads take those events and parse them. By default, they parse syslog formats, but you can configure rsyslog to use message modifier modules to do additional parsing (e.g. CEE-formatted JSON messages). Either way, this parsing generates structured events, made out of properties
  • after parsing, the main queue threads push events to the action queue. Or queues, if there are multiple actions and you want to fan-out
  • for each defined action, one or more action queue threads takes properties from events according to templates, and makes messages that would be sent to the destination. In Elasticsearch’s case, a template should make Elasticsearch JSON documents, and the destination would be the REST API endpoint
rsyslog message flow

rsyslog message flow

There are two more things to say about rsyslog’s architecture before we move on to the actual test:

  • you can have multiple independent flows (like the one in the figure above) in the same rsyslog process by using rulesets. Think of rulesets as swim-lanes. They’re useful for example when you want to process local logs and remote logs in a completely separate manner
  • queues can be in-memory, on disk, or a combination called disk-assisted. Here, we’ll use in-memory because they’re the fastest. For more information about how queues work, take a look here


To generate messages, we used tcpflood, a small and light tool that’s part of rsyslog’s testbench. It generates messages and sends them over to the local syslog via TCP.

Rsyslog took received those messages with the imtcp input module, queued them and forwarded them to Elasticsearch 0.90.7, which was also installed locally. We also tried with Elasticsearch 1.0 Beta 1 and got the same results (see below).

The flow of messages in this test is represented in the following figure:

Flow of messages in this test

Flow of messages in this test

The actual rsyslog config is listed below. It can be tuned further (for example by using the multithreaded imptcp input module), but we didn’t get significant improvements.

module(load="imtcp")   # TCP input module
module(load="omelasticsearch") # Elasticsearch output module

input(type="imtcp" port="13514")  # where to listen for TCP messages

    queue.size="1000000" # capacity of the main queue
    queue.dequeuebatchsize="1000" # process messages in batches of 1000
    queue.workerthreads="2" # 2 threads for the main queue

# template to generate JSON documents for Elasticsearch
         type="list") {
             constant(value="\"@timestamp\":\"")      property(name="timereported" dateFormat="rfc3339")
             constant(value="\",\"host\":\"")        property(name="hostname")
             constant(value="\",\"severity\":\"")    property(name="syslogseverity-text")
             constant(value="\",\"facility\":\"")    property(name="syslogfacility-text")
             constant(value="\",\"syslogtag\":\"")   property(name="syslogtag" format="json")
             constant(value="\",\"message\":\"")    property(name="msg" format="json")

       template="plain-syslog"  # use the template defined earlier
       queue.dequeuebatchsize="5000"   # ES bulk size
       action.resumeretrycount="-1"  # retry indefinitely if ES is unreachable

You can see from the configuration that:

  • both main and action queues have a defined size in number of messages
  • both have number of threads that deliver messages to the next step. The action needs more because it has to wait for Elasticsearch to reply
  • moving of messages from the queues happens in batches. For the Elasticsearch output, the batch of messages is sent through the Bulk API, which makes queue.dequeuebatchsize effectively the bulk size


We started with default Elasticsearch settings. Then we tuned them to leave rsyslog with a more significant slice of the CPU. We measured the indexing rate with SPM. Here are the average results over 20 million indexed events:

  • with default Elasticsearch settings, we got 8,000 events per second
  • after setting Elasticsearch up more production-like (5 second refresh interval, increased index buffer size, translog thresholds, etc), and the throughput went up to average of 20,000 events per second
  • in the end, we went berserk and used in-memory indices, updated the mapping to disable any storing or indexing for any field, to have Elasticsearch do as little work as possible and make room for rsyslog. Got an average of 30,000 events per second. In this scenario, rsyslog was using between 1 and 1.5 of the 4 virtual CPU cores, with tcpflood using 0.5 and Elasticsearch using from 2 to 2.5


20K EPS on a low-end machine with production-like configuration means Elasticsearch is quick at indexing. This is very good for logs, where you typically have lots of messages being generated, compared to how often you search.

If you need some tool to ship your logs to Elasticsearch with minimum overhead, rsyslog with its new version 8 may well be your best bet.

Additional resources:

NOTE: We’re hiring, so if you find this interesting, please check us out. Search, logging and performance is our thing!

Video: Using Solr for Logs with Rsyslog, Flume, Fluentd and Logstash

A while ago we published the slides from our talk at Lucene Revolution about using Solr for indexing and searching logs. This topic is of special interest for us, since we’ve released Logsene and we’re also offering consulting services for logging infrastructure. If you’re also into working with search engines or logs, please note that we’re hiring worldwide.

The video for that talk is now available, and you can watch it below. The talk is made of three parts:

  • one that discusses the general concepts of what a log is, structured logging and indexing logs in general, whether it’s Solr or Elasticsearch
  • one that shows how to use existing tools to send logs to Solr: Rsyslog and Fluentd to send structured events (yes, structured syslog!); Apache Flume and Logstash to take unstructured data, make it structured via Morphlines and Grok, and then send it to Solr
  • one that shows how to optimize Solr’s performance for handling logs. From tuning the commit frequency and merge factor to using time-based collections with aliases

Announcement: ZooKeeper Performance Monitoring in SPM

You don’t see him, but he is present.  He is all around us.  He keeps things running.  No, we are not talking about Him, nor about The Force.  We are talking about Apache ZooKeeper, the under-appreciated, often not talked-about, yet super-critical component of almost all distributed systems we’ve come to rely on – Hadoop, HBase, Solr, Kafka, Storm, and so on.  Our SPM, Search Analytics, and Logsene, all use ZooKeeper, and we are not alone – check our ZooKeeper poll.

We’re happy to announce that SPM can now monitor Apache ZooKeeper!  This means everyone using SPM to monitor Hadoop HBase, Solr, Kafka, Sensei, and other applications that rely on ZooKeeper can now use the same monitoring and alerting tool – SPM – to monitor their ZooKeeper instances.

Please tweet about Performance Monitoring for ZooKeeper

Here’s a glimpse into what SPM for ZooKeeper provides – click on the image to see the full view or look at the actual SPM live demo:

SPM for ZooKeeper Overview

SPM for ZooKeeper Overview

Please tell us what you think – @sematext is always listening!  Is there something SPM doesn’t monitor that you would really like to monitor?  Please vote for tech to monitor!

Want to build highly distributed big data apps with us?  We’re hiring good engineers (not just for positions listed on our jobs page), and we’re sitting on a heap of some pretty juicy big data!

Presentation: Scaling Solr with SolrCloud

Squeezing the maximal possible performance out of Solr / SolrCloud, and Elasticsearch and making them scale well is what we do on a daily basis for our clients.  We make sure their servers are optimally configured and maximally utilized.  Rafal Kuć gave a long, 75-minute talk on the topic of Scaling Solr with SolrCloud at Lucene Revolution 2013 conference in Dublin. Enjoy!

If you are interesting in working with Solr and/or Elasticsearch, we are looking for good people to join our team.


Get every new post delivered to your Inbox.

Join 146 other followers