Berlin Buzzwords 2012 – Three Talks from Sematext
April 12, 2012
Last year was our first time at Berlin Buzzwords. We gave one full talk about Search Analytics (video) and two lightning talks (video, video). We saw a number of good talks, too. We also took part in an HBase Hackathon organized by Lars George in Groupon’s Berlin offices and even found time to go clubbing. So, in hopes of paying Berlin another visit this year, a few of us at Sematext (@sematext) submitted talk proposals. Last week we all got acceptance emails, so this year there will be 3 talks from 3 Sematextans at Berlin Buzzwords! Here is what we’ll be talking about:
Rafał: Scaling Massive ElasticSearch Clusters
This talk describes how we’ve used ElasticSearch to build massive search clusters capable of indexing several thousand documents per second while at the same time serving a few hundred QPS over billions of documents in well under a second. We’ll talk about building clusters that continuously grow in terms of both indexing and search rates. You will learn about finding cluster nodes that can handle more documents, about managing shard and replica allocation and preventing unwanted shard rebalancing, about avoiding expensive distributed queries, etc. We’ll also describe our experience with performance testing several ElasticSearch clusters and share our observations about which settings affect search performance and by how much. In this talk you’ll also learn how to monitor large ElasticSearch clusters, what various metrics mean, and which ones to pay extra attention to.
Alex: Real-time Analytics with HBase
HBase can store massive amounts of data and allow random access to it – great. MapReduce jobs can be used to perform data analytics on a large scale – great. MapReduce jobs are batch jobs – not so great if you are after Real-time Analytics. Meet the append-only writes approach, which allows going real-time where it wasn’t possible before.
In this talk we’ll explain how we implemented “update-less updates” (not a typo!) for HBase using an append-only approach. This approach shines in situations where high data volume and velocity make random updates (aka Get+Put) prohibitively expensive. We’ll show how the append-only approach to updates, apart from making Real-time Analytics possible, also makes it possible to roll back data changes and to avoid data inconsistency problems caused by tasks in MapReduce jobs that fail after only partially updating data in HBase. The talk is based on Sematext’s success story of building a highly scalable, general-purpose data aggregation framework, which was used to build Search Analytics and Performance Monitoring services. Most of the generic code needed for the append-only approach described in this talk is implemented in our HBaseHUT open-source project.
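To make the idea of “update-less updates” a bit more concrete, here is a minimal Python sketch. It is only an illustration under our own assumptions: it uses an in-memory dictionary instead of HBase, and the class and method names (AppendOnlyCounterStore, append, read, compact, rollback) are hypothetical and are not the HBase or HBaseHUT API. It shows the general pattern: writes never read the current value, reads merge the pending entries, and entries written by a failed job can be dropped before they are merged for good.

```python
from collections import defaultdict
from itertools import count


class AppendOnlyCounterStore:
    """In-memory stand-in for a table (hypothetical; not the HBase/HBaseHUT API)."""

    def __init__(self):
        self._rows = defaultdict(list)  # key -> list of (sequence id, delta)
        self._seq = count(1)            # monotonically increasing write id

    def append(self, key, delta):
        """Record a change as a new entry without reading the existing value."""
        seq = next(self._seq)
        self._rows[key].append((seq, delta))
        return seq

    def read(self, key):
        """Merge all entries for the key at read time."""
        return sum(delta for _, delta in self._rows[key])

    def compact(self, key):
        """Fold all entries into a single one, as a periodic background job might."""
        entries = self._rows[key]
        if len(entries) > 1:
            last_seq = entries[-1][0]
            self._rows[key] = [(last_seq, sum(d for _, d in entries))]

    def rollback(self, key, after_seq):
        """Drop entries written after after_seq (only possible before compaction)."""
        self._rows[key] = [(s, d) for s, d in self._rows[key] if s <= after_seq]


store = AppendOnlyCounterStore()
store.append("page:/home", 3)
checkpoint = store.append("page:/home", 2)
store.append("page:/home", 10)              # say this came from a task that later failed
print(store.read("page:/home"))             # merged value, including the bad write
store.rollback("page:/home", checkpoint)    # discard everything after the checkpoint
store.compact("page:/home")
print(store.read("page:/home"))             # value without the rolled-back write
```

The trade-off in this sketch is that writes become cheap appends while reads and the compaction step do the merging work, which is why it suits write-heavy workloads.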
Otis: Large Scale ElasticSearch, Solr & HBase Performance Monitoring
This talk has all the buzzwords covered: big data, search, analytics, realtime, large scale, multi-tenant, SaaS, cloud, performance… and here is why:
In this talk we’ll share the “behind the scenes” details about SPM for HBase, ElasticSearch, and Solr, a large scale, multi-tenant performance monitoring SaaS built on top of Hadoop and HBase running in the cloud. We will describe all its backend components, from the agent used for performance metrics gathering, to how metrics get sent to SPM in the cloud, how they get aggregated and stored in HBase, how alerting is implemented and how it’s triggered, how we graph performance data, etc. We’ll also point out the key metrics to watch for each system type. We’ll go over various pain-points we’ve encountered while building and running SPM, how we’ve dealt with them, and we’ll discuss our plans for SPM in the future.