Presentation: Solr for Analytics

Last week, a bunch of Sematextans were at the Lucene Revolution conference in Dublin, where we were both sponsors and presenters.  There were a number of interesting talks, and we saw great interest in SPM from people who want to use it to monitor Solr (and more) and who want to send their logs to Logsene.  That confirmed that Sematext is going in the right direction and is creating products and services that are in demand and solve real-world problems.

Below are the slides from one of our four talks from the conference.  This talk was about our experience using Solr as an alternative data store for SPM.  In it we share our findings and observations about using Solr for large scale aggregations, analytical queries, applications with high write throughput, performance improvements in Solr 4.5, the lower memory footprint of DocValues, and more.

If you are interested in this sort of stuff, we are looking for good people at all levels – from JavaScript Developers and Backend Engineers, to Evangelists, Sales, Marketing, etc.


Announcement: Logstash Support for SolrCloud

Using Elasticsearch for log indexing is all the rage these days (and this is one of the reasons Logsene exposes an Elasticsearch API), especially from Logstash, which has had an Elasticsearch output for a long time now.  Still, there is no reason one could not index logs into Solr – SolrCloud, more specifically.

To help people get their logs into Solr(Cloud) we wrote up a simple Logstash output for Solr(Cloud) and made it available in LOGSTASH-1405, with the accompanying pull request 675.
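
For readers who just want to see what "getting a log event into Solr" looks like on the wire, here is a minimal Python sketch. It is not the Logstash output from LOGSTASH-1405 (that one is a Logstash plugin); it only builds a request for Solr's JSON update handler. The host, the `logs` collection name, and the field names are assumptions made for illustration, and the fields would need to exist in your schema.

```python
import json
import urllib.request


def build_update_request(solr_url, collection, events):
    """Build (but do not send) a request that adds log events to Solr.

    Posts a JSON array of documents to the collection's /update handler.
    """
    url = f"{solr_url}/{collection}/update?commit=true"
    body = json.dumps(events).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical log event; field names are illustrative only.
events = [
    {
        "id": "log-1",
        "timestamp": "2013-11-06T09:00:00Z",
        "severity": "ERROR",
        "message": "disk full",
    },
]
request = build_update_request("http://localhost:8983/solr", "logs", events)
# urllib.request.urlopen(request) would send it to a running Solr node.
```

Committing on every request, as `commit=true` does here, is convenient for a quick test but is usually too expensive for continuous log indexing.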

Give it a try and ping @sematext – we’d love to know if anyone finds this useful!

And if this is of interest, consider coming to Dublin to hear Using Solr to Search and Analyze Logs, which is one of our four talks at this year’s Lucene Revolution conference in Dublin.

Job: Solr / Elasticsearch Engineer @ Sematext

Sematext is 100% engineers and there are about 10 of us now.  In addition to being on the lookout for our Head of Marketing, we are looking for solid search engineers who love Solr and/or Elasticsearch, who want to use their search skills to work with our growing list of international clients, and who want to join our awesome, super distributed engineering team.

Together, we’ve built several exciting products – from smaller, search-focused products that work with Solr and Elasticsearch, to larger ones like SPM, Search Analytics, and most recently Logsene.  When not building products and running services, we help organizations world-wide with their search and big data needs – from fixing issues and providing production support to building complex search systems from scratch.  Our client list is long, with a number of household names on it – from Instagram (Facebook) and Tumblr (Yahoo), Etsy and Shutterstock, to The BBC, Elsevier, Lockheed Martin, Reuters, Library of Congress, etc.  We did this without raising any money.  To date, virtually all of our business came to us without us doing much real marketing – fama volat in action.  The demand for our products and services is growing and we are looking for good engineers and good people to join our adventure!

More formally:

Sematext is looking for a responsible, professional individual to join our team of search engineers.

Sematext is a New York-based startup with people spread over multiple continents and several hundred customers from Instagram and Tumblr, Etsy and Shutterstock, to The BBC, Elsevier, Lockheed Martin, Reuters, Library of Congress, etc. We’ve built systems handling over 10,000 QPS and have worked with multi-billion document indices. Our core products are:

  • SPM – performance monitoring
  • Search Analytics
  • Logsene – log and data analytics
  • Several search-focused products

In addition to the above products we offer consulting services around open source search and big data.

We are looking for a person who is:

  • Enthusiastic and positive
  • Driven, independent, and professional
  • A good communicator, both written and oral
  • Good with Solr and/or Elasticsearch and is hungry to learn more
  • Enjoys helping organizations make the best out of search

As a member of our search team you will get to:

  • Interact with clients world-wide
  • Provide guidance, architecture design, implementation, and support
  • Participate in Solr, Lucene, and Elasticsearch user and development communities
  • Work on Sematext’s search and data analytics products and participate in open-source search projects

This position:

  • Offers a lot of independence, learning, and growth
  • Does not require travel, but does offer the opportunity for travel for those who want that
  • Is open world-wide

Our search team members have written several books about search, regularly give talks at conferences, blog, and participate in open-source projects.
For more info, see 19 things you may like about Sematext.
Come join us and build cool products!

4 Lucene Revolution Talks from Sematext

Bingo! We’re 4 of 4 at Lucene Revolution 2013 – 4 talk proposals and all 4 accepted!  We are hiring just so next year we can attempt getting 5 talks in. ;)  We’ll also be exhibiting at the conference, so stop by.  We will be giving away Solr and Elasticsearch books.  Here’s what we’ll be talking about in Dublin on November 6th and 7th:

In Using Solr to Search and Analyze Logs, Radu will be talking about … well, you guessed it – using Solr to analyze logs.  After this talk you may want to run home (or back to the hotel), hack on LogStash or Flume and Solr, and get Solr to eat your logs… but don’t forget we have to keep Logsene well fed.  Feed this beast your logs like we feed it ours and help us avoid getting eaten by our own creation.


Many of us tend to hate or simply ignore logs, and rightfully so: they’re typically hard to find, difficult to handle, and are cryptic to the human eye. But can we make logs more valuable and more usable if we index them in Solr, so we can search and run real-time statistics on them? Indeed we can, and in this session you’ll learn how to make that happen. In the first part of the session we’ll explain why centralized logging is important, what valuable information one can extract from logs, and we’ll introduce the leading tools from the logging ecosystems everyone should be aware of – from syslog and log4j to LogStash and Flume. In the second part we’ll teach you how to use these tools in tandem with Solr. We’ll show how to use Solr in a SolrCloud setup to index large volumes of logs continuously and efficiently. Then, we’ll look at how to scale the Solr cluster as your data volume grows. Finally, we’ll see how you can parse your unstructured logs and convert them to nicely structured Solr documents suitable for analytical queries.
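
To give a flavor of the last step in that abstract, turning an unstructured log line into a structured document, here is a small Python sketch. The log layout, the regular expression, and the field names are assumptions chosen for illustration; they are not taken from the talk, and real logs will need their own patterns.

```python
import re

# Assumed layout: "<date> <time> <SEVERITY> [<logger>] <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+ \S+) (?P<severity>[A-Z]+) "
    r"\[(?P<logger>[^\]]+)\] (?P<message>.*)"
)


def parse_log_line(line):
    """Turn one raw log line into a dict of fields.

    Lines that do not match the assumed layout are kept whole in
    the "message" field instead of being dropped.
    """
    match = LOG_PATTERN.match(line)
    if match is None:
        return {"message": line}
    return match.groupdict()


doc = parse_log_line(
    "2013-11-06 09:15:02 ERROR [com.example.Indexer] Commit failed"
)
```

Once severity and logger are separate fields, they can be filtered and faceted on, which is what makes the analytical queries mentioned above possible.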

Rafal will teach about Scaling Solr with SolrCloud in a 75-minute session.  Prepare for taking lots of notes and for scaling your brain both horizontally and vertically while at the same time avoiding split-brain.


Configure your Solr cluster to handle hundreds of millions of documents without even noticing, handle queries in milliseconds, use Near Real Time indexing and searching with document versioning. Scale your cluster both horizontally and vertically by using shards and replicas. In this session you’ll learn how to make your indexing process blazing fast and make your queries efficient even with large amounts of data in your collections. You’ll also see how to optimize your queries to leverage caches as much as your deployment allows and how to observe your cluster with the Solr administration panel, JMX, and third party tools. Finally, learn how to make changes to already deployed collections – split their shards and alter their schema by using the Solr API.
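
As a rough illustration of the shard and replica operations mentioned in this abstract, the Python sketch below builds two SolrCloud Collections API calls: one that creates a collection with a given number of shards and replicas, and one that splits an existing shard. The host, collection name, and shard counts are assumptions for the example; this is not material from the talk itself.

```python
from urllib.parse import urlencode


def collections_api_url(solr_url, action, **params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"{solr_url}/admin/collections?{query}"


# Hypothetical cluster address and collection name.
SOLR = "http://localhost:8983/solr"

# Create a collection with 2 shards, each with 2 copies.
create_url = collections_api_url(
    SOLR, "CREATE", name="logs", numShards=2, replicationFactor=2
)

# Split one shard of that collection into two.
split_url = collections_api_url(
    SOLR, "SPLITSHARD", collection="logs", shard="shard1"
)
```

Requesting either URL against a running SolrCloud node would trigger the operation; building the URLs separately makes them easy to inspect before anything is changed.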

Rafal doesn’t like to sleep.  He prefers to write multiple books at the same time and give multiple talks at the same conference.  His second talk is about Administering and Monitoring SolrCloud Clusters – something we and our customers do with SPM all the time.


Even though Solr can run without causing any troubles for long periods of time it is very important to monitor and understand what is happening in your cluster. In this session you will learn how to use various tools to monitor how Solr is behaving at a high level, but also on Lucene, JVM, and operating system level. You’ll see how to react to what you see and how to make changes to configuration, index structure and shards layout using Solr API. We will also discuss different performance metrics to which you ought to pay extra attention. Finally, you’ll learn what to do when things go awry – we will share a few examples of troubleshooting and then dissect what was wrong and what had to be done to make things work again.

Otis has aggregation coming out of his ears and dreams about data visualization, timeseries graphs, and other romantic visuals. In Solr for Analytics: Metrics Aggregations at Sematext we’ll share our experience running SPM on top of SolrCloud (vs. HBase, which we currently use).


While Solr and Lucene were originally written for full-text search, they are capable of, and increasingly used for, Analytics, as Key Value Stores, NoSQL databases, and more. In this session we’ll describe our experience with Solr for Analytics. More specifically, we will describe a couple of different approaches we have taken with SolrCloud for aggregation of massive amounts of performance metrics, share our findings, and compare SolrCloud with HBase for large-scale, write-intensive aggregations. We’ll also look at several new Solr features in the works that will make Solr even more suitable for Analytics workloads.

See you in Dublin!

Berlin Buzzwords 2013 – Two Talks from Sematext

Last year at Berlin Buzzwords we were proud to give three talks. Alex talked about “Real-time Analytics with HBase” (slides, video), Otis talked about large scale monitoring in his talk titled “Large Scale ElasticSearch, Solr & HBase Performance Monitoring” (slides, video), and Rafał talked about how we scale ElasticSearch clusters in “Scaling Massive ElasticSearch Clusters” (slides, video). We were also very happy to be one of the sponsors of this great conference :) Because we really enjoyed the conference, we decided to submit a few proposals this year and they got accepted. In this year’s schedule we will be giving the following talks:

Radu: JSON Logging with ElasticSearch

This talk is about aggregating loooots of logs – searching of seriously big data. We’ll go through everything we can possibly go through in 20 minutes. We’ll look at how, where, when, why, and what to log. We’ll show how to use Elasticsearch as a data store for logs and what the benefits of doing so are. We’ll discuss advantages and disadvantages of logging in JSON, which is easily processed by machines, over traditional logging, which is easily processed by humans. Finally, we’ll explore how you can get your logs – JSON or not – into Elasticsearch, run searches and statistics on them, and create pretty graphs you can’t stop staring at.
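
To make the "logging in JSON" idea concrete, here is a minimal Python sketch of a log formatter that emits one JSON object per event. It uses only the standard `logging` and `json` modules; the choice of fields is an assumption for illustration and is not taken from the talk.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "severity": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


# Build a record by hand so the example runs without any handlers.
record = logging.LogRecord(
    name="indexer",
    level=logging.WARNING,
    pathname="example.py",
    lineno=1,
    msg="slow query: %d ms",
    args=(950,),
    exc_info=None,
)
line = JsonFormatter().format(record)
```

In a real application the formatter would be attached to a handler with `handler.setFormatter(JsonFormatter())`, and each emitted line could be shipped to the search engine without any further parsing.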

Rafał: Battle of the Giants, Round 2


Learn about how both of these great enterprise search servers are evolving and adding new features. We will be comparing the latest and greatest versions of Solr and ES, both of which are using Lucene 4.x and bringing different approaches to handling codecs, per field similarities, and more. Of course, we’ll not only look at technical aspects of both Apache Solr and ElasticSearch, but will also dig into the makeup of their contributors, compare the code and of course the user community. By the end of the talk you’ll learn the main differences when it comes to these two search servers, how they handle shard and replica distribution, automatic data replication, and different query types. In addition, you’ll learn what the admin APIs for both Solr and ElasticSearch look like and how to use them to control and alter your cluster state. Last, but not least, you’ll learn what to avoid when using ElasticSearch or Apache Solr.

We hope to see some of you in Berlin.  If these topics are of interest to you, but you won’t be coming to Berlin, feel free to get in touch, leave comments, or ping @sematext. As usual we’ll be posting slides after the talks and the organizers will probably record the talk and publish it after the conference. And if you love working with things our talks are about, we are hiring world-wide!

Poll: Using SolrCloud or Not?

We know that as of February 2013, about 75% of those Solr users who follow the Sematext Blog use some version of Solr 4.x.  But today we are trying to get to another interesting stat:

What portion of Solr 4.x users use SolrCloud?

Let’s find out!  Please tweet this to help us get more votes and better stats.

Please vote only if you are using Solr 4.x.  Please do NOT vote if you are using 1.x or 3.x version of Solr.

Poll: Which Solr version are you using?

With Solr 4.1 recently released, let’s see which version(s) of Solr people are using.  Please tweet it to help us get more votes and better stats.

Solr vs. ElasticSearch: Part 6 – User & Dev Communities

One of the questions after my talk during the recent ApacheCon EU was what I thought about the communities of the two search engines I was comparing. Not surprisingly, this is also a question we often address in our consulting engagements.  As a part of our Apache Solr vs ElasticSearch post series we decided to step away from the technical aspects of SolrCloud vs. ElasticSearch and look at the communities gathered around these two projects. If you haven’t read the previous posts about Apache Solr vs. ElasticSearch here are pointers to all of them:

Read more of this post

Solr vs ElasticSearch: Part 5 – Management API Capabilities

In previous posts, all listed below, we’ve discussed general architecture, full text search capabilities and facet aggregations possibilities. However, till now we have not discussed any of the administration and management options and things you can do on a live cluster without any restart. So let’s get into it and see what Apache Solr and ElasticSearch have to offer.

Read more of this post

SPM Discountorama Announcement

We are happy to announce the General Availability of SPM, our performance monitoring solution for Apache Solr, ElasticSearch, HBase, SenseiDB, and Java applications, and of course all system metrics. You can also vote for what else you want SPM to monitor.  Over the last N months that we’ve been running SPM we’ve received a lot of good feedback (thanks!), a lot of words of encouragement (thanks!), and even a few nice quotes (another thanks!). Here is one from Jerry Yang, a Software Engineer at Walmart Labs: “I have been using SPM for couple of days and it has been amazing. I learned a lot about my Solr services and was able to optimize based on the results on SPM. Great work.”

Discount Codes

Since the holiday season is coming up, we thought we’d offer some discounts every week between now and the end of the year.  Each of the following discounts can be used only during “its week” specified below.  There is a limit to the number of people who can use each discount, so if you want it, don’t waste too much time.  Each discount will reduce the price of SPM SaaS for 365 days after you’ve used it, which effectively means you will get the discount until the end of 2013.  Note that when you register for SPM you do not need to enter your credit card information.  You also don’t need to provide it when you create the SPM application for the system you want to monitor.  It is when you create your SPM application that you can enter the discount code.

  • 20% for the remainder of this week until the end of this Sunday, December 9: NY201320
  • 15% for the week of December 10, 2012: NY201315
  • 10% for the week of December 17, 2012: NY201310
  • 5% for the week of December 24, 2012: NY201305

Note that each discount code expires on Sunday at 00:00 UTC.

SPM Flavours

The above discounts are good for our SPM SaaS.  However, if you’d rather run SPM on your own servers, we do offer SPM on Premises – please get in touch if you are interested in the on premises version.  You can also vote for SPM SaaS vs. On Premise and that way tell us which version you prefer or want.

SPM Plans

There are a few different subscription plans available in SPM SaaS:

  • Basic plan that is free and shows the last 30 minutes of performance data
  • Standard plan that shows the last 30 days of data and costs $0.035/server/hour
  • Pro plan that shows the last 60 days of performance data and costs $0.070/server/hour
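
To get a feel for what the hourly rates above add up to, here is a small Python sketch that estimates the cost for a number of servers over a number of days. It assumes a discount is a simple percentage off the listed rate and that servers are monitored around the clock; the server count and period are made-up inputs, not figures from us.

```python
HOURS_PER_DAY = 24


def spm_cost(rate_per_server_hour, servers, days, discount_percent=0):
    """Estimate cost in dollars, rounded to cents.

    Assumes every server is monitored 24 hours a day and that the
    discount is a flat percentage off the listed rate.
    """
    full_price = rate_per_server_hour * servers * days * HOURS_PER_DAY
    return round(full_price * (100 - discount_percent) / 100, 2)


# Standard plan, 10 servers, 30 days: without and with a 20% discount.
standard = spm_cost(0.035, servers=10, days=30)
discounted = spm_cost(0.035, servers=10, days=30, discount_percent=20)
```

With these inputs the Standard plan works out to roughly $252 for the 30 days, or roughly $201.60 with the 20% code.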

If you have not used SPM before, here is what you can expect to see – click on the image to see a large, non-fuzzy version:

We hope you will find SPM useful and fun to use.  We are always looking for feedback – just email or ping @sematext and tell us what you like or don’t like about SPM.

