Poll Results: Kafka Producer/Consumer

About 10 days ago we ran a poll about which languages/APIs people use when writing their Apache Kafka Producers and Consumers.  See Kafka Poll: Producer & Consumer Client.  We collected 130 votes.  The results were actually somewhat surprising!  Let’s share the numbers first!

[Chart: Kafka Producer/Consumer Languages]

What do you think?  Is that the breakdown you expected?  Here is what surprised us:

  • Java is the dominant language on the planet today, but less than 50% of people use it with Kafka! Read: possible explanation for Java & Kafka.
  • Python is clearly popular and gaining in popularity generally, but at 13% it looks like it’s extra popular in the Kafka context.
  • Go at 10-11% seems quite popular for a relatively young language.  One might expect Ruby to have more adoption here than Go because Ruby has been around much longer.
  • We put C/C++ in the poll because these languages are still in use, though we didn’t expect them to get 6% of the votes.  However, considering C/C++ are still quite heavily used generally speaking, that’s actually a pretty low percentage.
  • JavaScript and NodeJS are surprisingly low at just 4%.  Any idea why?  Is the JavaScript Kafka API out of date, lacking, or something else?
  • The “Other” category is relatively big, at a bit over 12%.  Did we forget some major languages people often use with Kafka?  Scala?  See info about the Kafka Scala API here.

Everyone and their cousin is using Kafka nowadays, or at least that’s what it looks like from where we at Sematext sit.  However, because of the relatively high percentage of people using Python and Go, we’d venture to say Kafka adoption is much stronger among younger, smaller companies, where Python and Go are used more than “enterprise languages”, like Java, C#, and C/C++.

Kafka Poll: Producer & Consumer Client

Kafka has become the de-facto standard for handling real-time streams in high-volume, data-intensive applications, and there are certainly a lot of those out there.  We thought it would be valuable to conduct a quick poll to find out which implementation of Kafka Producers and Consumers people use – specifically, which programming languages do you use to produce and consume Kafka messages?
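For anyone who hasn’t written one of these clients, producing a message takes only a few lines of code.  Here is a minimal sketch using the official Java client; the broker address, topic name, key, and value are made up for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MinimalProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Broker address and topic below are placeholders for illustration
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // send() is asynchronous; closing the producer flushes pending messages
      producer.send(new ProducerRecord<>("events", "user-1234", "page_view"));
    }
  }
}
```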

Please tweet this poll and help us spread the word, so we can get good, statistically significant results.  We’ll publish the results here and via @sematext (follow us!) in a week.

NOTE #1: If you choose “Other”, please leave a comment with additional info, so we can share this when we publish the results, too!

NOTE #2: The results are in! See http://blog.sematext.com/2015/01/28/kafka-poll-results-producer-consumer/


Solr Redis Plugin Use Cases and Performance Tests

The Solr Redis Plugin is an extension for Solr that provides a query parser that uses data stored in Redis. It is open-sourced on GitHub by Sematext. This tool is basically a QParserPlugin that establishes a connection to Redis and takes data stored in SETs, sorted sets, and other Redis data structures in order to build a query. Data fetched from Redis is used by the RedisQParser, which is responsible for building the query. Moreover, this plugin provides a highlighter extension which can be used to highlight parts of aliased Solr Redis queries (this will be described in a future post).
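To make that concrete, here is a rough sketch of what a Redis-backed filter query can look like from SolrJ.  The {!redis …} local params (command, key) follow the plugin’s README, but the collection URL, field name, and Redis key below are invented for illustration, so treat this as a sketch rather than a verified configuration:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RedisFilterQueryExample {
  public static void main(String[] args) throws Exception {
    // Collection URL is a placeholder
    try (HttpSolrClient solr =
        new HttpSolrClient.Builder("http://localhost:8983/solr/photos").build()) {
      SolrQuery query = new SolrQuery("*:*");
      // The Redis SET "friends_of_1234" holds the user IDs to match against owner_id;
      // the plugin fetches its members at query time and builds the query from them
      query.addFilterQuery("{!redis command=smembers key=friends_of_1234}owner_id");
      QueryResponse response = solr.query(query);
      System.out.println("Matched documents: " + response.getResults().getNumFound());
    }
  }
}
```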

Use Case: Social Network

Imagine you have a social network and you want to implement a search solution that can search things like: events, interests, photos, and all your friends’ events, interests, and photos. A naive, Solr-only-based implementation would search over all documents and filter by a “friends” field. This requires denormalization and indexing the full list of friends into each document that belongs to a user. Building a query like this is just searching over documents and adding something like a “friends:1234” clause to the query. It seems simple to implement, but in reality this is a terrible solution when you need to update a list of friends, because it requires a modification of each document. So when the number of documents (e.g., photos, events, interests, friends and their items) connected with a user grows, the number of potential updates rises dramatically and each modification of connections between users becomes a nightmare. Imagine a person with 10 photos and 100 friends (all of whom have their own photos, events, interests, etc.).  When this person gets their 101st friend, the naive system with flattened data would have to update a lot of documents/rows.  As we all know, connections between people in a social network are constantly being created and removed, so such a naive Solr-only system could not really scale.
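As a hedged sketch of that naive approach (collection and field names are invented for illustration), indexing and querying with a denormalized friends list could look roughly like this in SolrJ:

```java
import java.util.Arrays;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class NaiveFriendsIndexing {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr =
        new HttpSolrClient.Builder("http://localhost:8983/solr/photos").build()) {
      // Every photo document carries the full, denormalized list of the owner's friends
      SolrInputDocument photo = new SolrInputDocument();
      photo.addField("id", "photo-42");
      photo.addField("owner_id", 1001);
      photo.addField("friends", Arrays.asList(1234, 1235, 1236)); // multi-valued field
      solr.add(photo);
      solr.commit();

      // Searching "my friends' photos" is then just a filter on that field...
      SolrQuery query = new SolrQuery("*:*");
      query.addFilterQuery("friends:1234");
      solr.query(query);
      // ...but every time user 1001 adds or removes a friend, ALL of that user's
      // documents have to be re-indexed with the new friends list.
    }
  }
}
```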

Social networks also have one very important attribute: the number of connections of a single user is typically not expressed in millions. That number is typically relatively small: tens, hundreds, sometimes thousands. This begs the question: why not carry information about user connections in each query sent to the search engine? That way, instead of sending queries with a “friends:1234” clause, we can simply send queries with multiple user IDs connected by an “OR” operator. When a query has all the information needed to search entities that belong to a user’s friends, there is no need to store a list of friends in each user’s document. Storing user connections in each query does mean sending rather large queries to the search engine, each containing multiple user-ID terms (e.g., id:5 OR id:10 OR id:100 OR …) connected by a disjunction operator. As the number of terms grows, the query requests become very big, and that’s a problem, because preparing them and sending them to the search engine over the network becomes awkward and slow.
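Here is a rough sketch of that query-side approach: the client builds one big disjunction over the friend IDs it already knows. The field name is invented for illustration, and, as noted above, the filter string (and thus the request) grows with the number of friends:

```java
import java.util.List;
import java.util.stream.Collectors;
import org.apache.solr.client.solrj.SolrQuery;

public class FriendsDisjunctionQuery {
  static SolrQuery buildQuery(List<Integer> friendIds) {
    // Build "owner_id:(5 OR 10 OR 100 OR ...)" from the caller's friend list;
    // with thousands of friends this string becomes very large
    String filter = friendIds.stream()
        .map(String::valueOf)
        .collect(Collectors.joining(" OR ", "owner_id:(", ")"));
    SolrQuery query = new SolrQuery("*:*");
    query.addFilterQuery(filter);
    return query;
  }
}
```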

How Does It Work?

[Diagram: how the Solr Redis Plugin works]


Solr Presentations from Lucene/Solr Revolution 2014

Thanks to everyone who stopped by the Sematext booth at last week’s Lucene/Solr Revolution event in Washington, DC and attended our two talks: “Tuning Solr for Logs” and “Solr Anti-Patterns”.

The attendance, questions and interest are very much appreciated.  As a company that prides itself on its Solr expertise (and Elasticsearch expertise too, for that matter), it was nice to spend a couple days talking about search and Big Data challenges, performance monitoring and logging with fellow experts from around the world. Here are the slides for the two talks we gave (summaries of the talks can be found here):

 

Videos of the talks will be posted here soon.  Hope to see everyone again next year!

Sematext at Lucene/Solr Revolution 2014

Going to Lucene/Solr Revolution next week (November 11-14) in Washington, DC?  If so, Sematext will be there exhibiting AND giving two talks!  If you are going, stop by our table to say hello.  We can show you the latest versions of SPM Performance Monitoring, Logsene Log Management and Analytics, Site Search Analytics, and, of course, talk about metrics, centralized log management, Lucene, Solr, Elasticsearch, and just about any other search-related topic you might be interested in.  After all, not only have we blogged, given talks and spread the word in all sorts of ways, we’ve also written books on these subjects!

Both of the Sematext engineer talks take place on Friday, November 14.  They are:

Radu Gheorghe will talk about “Tuning Solr for Logs” at 10:15 am

Summary:  Performance tuning is always nice for keeping your applications snappy and your costs down. This is especially the case for logs, social media and other stream-like data that can easily grow into terabyte territory. While you can always use SolrCloud to scale out of performance issues, this talk is about optimizing. The following questions about Solr settings will be answered: How often should you commit and merge? How can you have one collection per day/month/year/etc.? What are the performance trade-offs for these options?  There will also be a discussion around choosing the appropriate hardware.  Radu will talk about optimizing the infrastructure when pushing logs to Solr. This includes tuning Apache Flume to handle large flows of logs and overall design options that also apply to other shippers, like Logstash.
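As a small illustration of the commit-frequency question (this is our own hedged sketch, not material from the talk), one client-side knob in SolrJ is commitWithin, which bounds how long newly indexed log events can wait before being committed; the collection and field names below are invented:

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class LogIndexingWithCommitWithin {
  public static void main(String[] args) throws Exception {
    // One collection per day (e.g., "logs-2014.11.14") is one of the design options discussed
    try (HttpSolrClient solr =
        new HttpSolrClient.Builder("http://localhost:8983/solr/logs-2014.11.14").build()) {
      SolrInputDocument logEvent = new SolrInputDocument();
      logEvent.addField("id", "event-1");
      logEvent.addField("message", "example log line");

      // Ask Solr to commit within 30 seconds rather than committing per batch;
      // less frequent commits generally mean higher indexing throughput for logs
      UpdateRequest update = new UpdateRequest();
      update.add(logEvent);
      update.setCommitWithin(30_000);
      update.process(solr);
    }
  }
}
```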

Rafal Kuc will talk about “Solr Anti-Patterns” at 10:55 am

Summary:  Working as a consultant and software engineer, and helping people in various other ways, Rafał has seen multiple patterns in how Solr is used and how it should be used. Consulting on best practices is common, but talking about what NOT to do is not. This talk will point out common mistakes and roads that should be avoided at all costs, covering use cases and guidelines around general configuration pitfalls, data modeling and what to avoid when making your data indexable, and mistakes made when it comes to queries and searching for indexed data. Each use case will be illustrated by a before-and-after analysis where changes in metrics will be shown, providing know-how worth remembering.

20% Discount Code

If you currently use a Sematext product or have been a client in the past and want to go, drop us a line for more info.

Hope to see you in DC!

Sematext on Google+

Quick shout-out to all G+ fans — you can find us on G+, too, and follow us there if you prefer that over the more traditional blog subscription: https://plus.google.com/+SematextGroup

Of course, @sematext is an option, too!

Two Lucene/Solr Revolution 2014 Talks Accepted!

We recently got word from Lucene/Solr Revolution 2014 (in Washington, DC from Nov. 11-14) that talks submitted by two Sematext engineers were accepted as part of the Tutorial track!  They are:

In “Tuning Solr for Logs” Radu will discuss Solr settings, hardware options and optimizing the infrastructure pushing logs to Solr.

In “Solr Anti-Patterns” Rafal will point out common Solr mistakes and roads that should be avoided at all costs.  Each of the talk’s use cases will be illustrated with a before and after analysis — including changes in metrics.

You can see more details about both talks in this recent blog post.

The full agenda, including dates and times for the talks, will be available soon on the Lucene/Solr Revolution 2014 web site.

If you do attend one of these talks please stop by and say hello to Radu and Rafal.  Not only do they know Solr inside and out, but they are good guys as well!

Love Solr Enough to Even Want to Attend One of These Talks?

If you enjoy Solr enough to even think of attending these talks — and you’re looking for a new opportunity — then Sematext might be the place for you.  We’re hiring planet-wide and currently looking for Solr and Elasticsearch Engineers, Front end and JavaScript Developers, Developer Evangelists, Full-stack Engineers, and Mobile App Developers.
