JOB: Devops Evangelist – Monitoring, Logging, Analytics

DEVELOPER / DEVOPS EVANGELIST

Sematext is looking for someone with a business and marketing bent and enough technical background to be able to put together bits of code and demos of SPM, Logsene, and other products we are working on.  A good fit for this role is a person who likes to teach and share, knows how to connect with people and their needs, is passionate, and is considered (or wants to become) a thought leader in at least one area — Monitoring, Logging, Data Analytics and/or Business Intelligence.  Our ideal evangelist also enjoys the agility and challenge of a startup.  For a good description of the type of person we are looking for, watch this video.  Sematext is a young, fast-growing, highly distributed and agile team, and our developer evangelist will work in many different capacities and contribute to the company’s success in a variety of ways.

 

RESPONSIBILITIES:

  • Create technical content and demos for publication on our blog and other channels to show developers, devops, and others how to implement specific solutions or use new technologies
  • Prepare and deliver presentations and webinars, speak at industry conferences, local meetups and other events
  • Build relationships with tech bloggers, open source contributors and product community leaders, journalists and analysts
  • Educate and empower developers, giving technical workshops and brown bags
  • Build partnerships with individuals, companies and organizations that serve the same communities we do (Elasticsearch, Solr, Kafka, Hadoop, Storm, Spark, etc.)
  • Gather and socialize product feedback that informs engineering, sales, and marketing decision making

 

REQUIREMENTS:

  • BS or higher in Computer Science or professional experience as a developer, sys admin, sales engineer or other technical role
  • Strong verbal and written communication skills, with the ability to write for engineers or high-level management
  • Entrepreneurial thinking and the ability to act effectively with only high-level direction

 

BONUS:

  • Participation in open-source community
  • Experience with other commercial and open source monitoring, logging, or analytics technologies
  • Experience working in a startup

 

You can check out our products, our services, our clients, and our team to get a better sense of what Sematext is all about.  Also worth a look are the 19 things you may like about Sematext.  Interested? Please send your resume to jobs@sematext.com.  For other job openings please see Jobs @ Sematext or even our previous job listings.

 

Announcement: Coming Up in Site Search Analytics

Have you checked out Site Search Analytics yet?  If not, and if you think that gaining insight into user search behavior and experience is valuable information, then we’ve got something for you that’s battle-tested and ready to go.

This year we are adding some killer new features that will make SSA even more useful.  So, if you want to enjoy benefits like:

  • Viewing real-time graphs showing search and click-through rates
  • Awareness of your top queries, top zero-hit queries, most seen and clicked on hits, etc.
  • Having a mechanism for search relevance A/B tests and for relevance feedback
  • Not having to develop, set up, manage or scale all the infrastructure needed for query and click log analysis
  • And many others — here is a full list of features and benefits

…then you will love the new functionality we have on the way.  After all, how can you improve search quality if you don’t measure it first and keep track of it?

Site Search Analytics

Sound interesting?  Then check out a live demo.  SSA is 100% focused on helping you to improve the search experience of your customers and prospects.  And a better search experience translates into more traffic to your web site and greater awareness of your business.

Presentation: Solr for Analytics

Last week, a bunch of Sematextans were at the Lucene Revolution conference in Dublin, where we were both sponsors and presenters.  There were a number of interesting talks, and we saw great interest in SPM from people who want to use it to monitor Solr (and more) and to send their logs to Logsene.  This confirmed that Sematext is going in the right direction and creating products and services that are in demand and solve real-world problems.

Below are the slides from one of our four talks at the conference.  This talk was about our experience using Solr as an alternative data store for SPM; in it we share our findings and observations about using Solr for large-scale aggregations, analytical queries, and applications with high write throughput, as well as the performance improvements in Solr 4.5, the lower memory footprint of DocValues, and more.

If you are interested in this sort of stuff, we are looking for good people at all levels – from JavaScript Developers and Backend Engineers, to Evangelists, Sales, Marketing, etc.

 

What’s New in Sematext Search Analytics

We’ve been busy adding functionality to and improving SPM, our Performance Monitoring service, but we’ve also been quietly working on our free Search Analytics service (internally known as SA).  As a matter of fact, and not coincidentally, SPM and SA share a lot of backend components, as well as UI-level pieces.  This allows a good amount of software reuse and lets us improve both services without doubling the effort.

Here are some of the new things in Search Analytics:

Live Demo. Before you create your Sematext Apps account (it’s free, no need to take out your credit card) you can check out the live demo and see both SPM and SA in action.

Real-time. Previously, SA used MapReduce jobs to process the collected data and make them available as reports.  That is no longer the case. We’ve put SA on the same real-time OLAP engine that powers SPM.  This means you’ll see your graphs refresh and change before your eyes.

Dashboards. Just like we’ve added Dashboards to SPM, we’ve added them to SA, too.  You can now create custom Dashboards, pick which graphs you want on them, and where you want to put them on a Dashboard.  This is great if you want to display your search stats and trends on a large office monitor, as some SA users are already doing.  Moreover, you can put widgets from multiple SA and SPM Apps all on the same Dashboard, so you can see your performance metrics, SPM custom metrics (e.g. your KPIs), and your SA metrics on a single Dashboard, side by side.

Report/URL Sharing. Just like in SPM, we’ve made it possible to copy the URL from the browser and give it to anyone who has access to your SA reports.  When this URL is opened, any filters or time selection will be applied automatically.  This makes it easy for multiple people to share their “SA view” by sharing the URL, instead of having to tell others which report to look at, which time range to pick, which filters to select, etc.

Graph Embedding and Sharing. Similar to URL Sharing (but different!), you can now share and embed individual SA graphs.  Each graph has a short URL that you can Tweet or share elsewhere.  You can also get a URL/HTML snippet and embed SA graphs in your blog, wiki, web site, etc.

User Sessions. We’ve added a User Sessions report. This report shows you the number of search sessions over time, the number of queries per session, as well as the number of distinct users using your search.  If anyone asked you to provide these numbers for your site, would you know them?  Most people would say no.  These metrics are good to know and with SA everyone will now be able to say yes to that question.

Distinct Queries. We’ve added the number of Distinct Queries to the Rate & Volume report.  Another nice metric.

Hourly Granularity. All graphs in SA now go down to hourly granularity.  This lets you see how trends change over the course of each day, which can lead to insights about how your users use your search in the morning vs. during work hours vs. in the evening.

HTTPS/SSL. The SA JavaScript beacon can now use HTTPS.  This is important if your site uses HTTPS when displaying search results.  To send your search and clickstream data via HTTPS, just replace http:// with https:// in the SA JavaScript beacon.

We hope you like these changes.  Please leave a comment or let us know if you have suggestions for other improvements or new features you would like to see in Sematext Search Analytics.

- @sematext

Slides: Real-time Analytics with HBase

Here are slides from another talk we gave at both Berlin Buzzwords and HBaseCon in San Francisco last month.  In this presentation Alex describes one approach to real-time analytics with HBase, which we use at Sematext via HBaseHUT.  If you like these slides you will also like HBase Real-time Analytics & Rollbacks via Append-based Updates.

Note: we are actively looking for people with strong interest and/or experience with HBase and/or Analytics, OLAP, etc.  If that’s you, please get in touch.

The short version is from Buzzwords, while the version with more slides is from HBaseCon:

HBase Real-time Analytics & Rollbacks via Append-based Updates (Part 2)

This is the second part of a 3-part post series in which we describe how we use HBase at Sematext for real-time analytics with an append-only updates approach.

In our previous post we explained the problem in detail with the help of an example and touched on the idea behind the suggested solution. In this post we will go through the details of the solution and briefly introduce the open-sourced implementation of the described approach.

Suggested Solution

The suggested solution can be described as follows:

  1. replace update (Get+Put) operations at write time with simple append-only writes
  2. defer processing of updates to periodic compaction jobs (not to be confused with minor/major HBase compactions)
  3. perform on-the-fly updates processing only if the user asks for data before the updates have been compacted

Before (standard Get+Put updates approach):

The picture below shows an example of updating search query metrics collected by a Search Analytics system (something we do at Sematext); a code sketch of this read-modify-write pattern follows the list.

  • each new piece of data (blue box) is processed individually
  • to apply an update based on the new piece of data:
    • the existing data (green box) is first read,
    • the data is changed, and
    • written back
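
To make this concrete, here is a minimal sketch of the read-modify-write pattern using the HBase client API; the table schema, column family, and key layout (query metrics keyed by query and hour bucket) are assumptions made for this illustration, not the actual SPM/SA schema.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetPutUpdateSketch {
  private static final byte[] CF = Bytes.toBytes("d");        // assumed column family
  private static final byte[] COUNT = Bytes.toBytes("count"); // assumed qualifier

  // Standard update: read the existing metrics record, modify it, write it back.
  static void addToQueryCount(Table table, String query, long hourBucket, long delta)
      throws IOException {
    byte[] rowKey = Bytes.toBytes(query + "|" + hourBucket);  // assumed key layout

    Result existing = table.get(new Get(rowKey));             // 1) read existing data
    long current = existing.isEmpty()
        ? 0L
        : Bytes.toLong(existing.getValue(CF, COUNT));

    Put put = new Put(rowKey);                                // 2) change the data...
    put.addColumn(CF, COUNT, Bytes.toBytes(current + delta));
    table.put(put);                                           // 3) ...and write it back
  }
}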

After (append-only updates approach):

1. Writing updates:

2. Periodic updates processing (compacting):

3. Performing on-the-fly updates processing (only if the user asks for data before the updates have been compacted):

Note: the result of processing updates on the fly can optionally be stored back right away, so that the next time the same data is requested no compaction is needed.
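
Here is a minimal sketch of steps 1 and 3 using a recent HBase client API; the row key layout (the original key plus a unique, time-ordered suffix per update) and the simple sum merge are assumptions made for this illustration, not how HBaseHUT itself is implemented.

import java.io.IOException;
import java.util.UUID;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AppendOnlyUpdateSketch {
  private static final byte[] CF = Bytes.toBytes("d");
  private static final byte[] COUNT = Bytes.toBytes("count");

  // Step 1: each update is appended under the original key plus a unique, time-ordered suffix,
  // so no Get is needed at write time.
  static void writeUpdate(Table table, byte[] originalKey, long delta) throws IOException {
    String suffix = "#" + String.format("%019d", System.currentTimeMillis()) + "-" + UUID.randomUUID();
    Put put = new Put(Bytes.add(originalKey, Bytes.toBytes(suffix)));
    put.addColumn(CF, COUNT, Bytes.toBytes(delta));
    table.put(put);
  }

  // Step 3: on-the-fly processing at read time. Scan the compacted record (stored under the
  // original key, if any) plus all appended updates, and merge them (here a simple sum).
  static long readMerged(Table table, byte[] originalKey) throws IOException {
    Scan scan = new Scan()
        .withStartRow(originalKey)
        .withStopRow(Bytes.add(originalKey, Bytes.toBytes("$"))); // "$" sorts after the "#" suffixes
    long sum = 0;
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result r : scanner) {
        sum += Bytes.toLong(r.getValue(CF, COUNT));
      }
    }
    return sum;
  }
}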

The idea is simple and not a new one, but given HBase’s specific qualities, namely fast range scans and high write throughput, it works especially well here. So, what we gain is:

  • high update throughput
  • real-time updates visibility: despite deferring the actual updates processing, user always sees the latest data changes
  • efficient updates processing by replacing random Get+Put operations with processing sets of records at a time (during the fast scan) and eliminating redundant Get+Put attempts when writing the very first data item
  • better handling of update load peaks
  • ability to roll back any range of updates
  • avoid data inconsistency problems caused by tasks that fail after only partially updating data in HBase without doing a rollback (when used with MapReduce, for example)

Let’s take a closer look at each of the above points.

High Update Throughput

Higher update throughput is achieved by not doing a Get operation for every record update. Thus, Get+Put operations are replaced with Puts (which can be further optimized by using a client-side buffer), and Puts are really fast in HBase. The processing of updates (compaction) is still needed to perform the actual data merge, but it is done much more efficiently than doing a Get+Put for each update operation (more details below).
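
As a hypothetical illustration of such client-side buffering, here is a sketch using the newer HBase client API (BufferedMutator); the table and column names are made up for the example.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedAppendsSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         // BufferedMutator batches Puts on the client and sends them in fewer, larger RPCs.
         BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("metrics"))) {
      for (int i = 0; i < 10000; i++) {
        Put put = new Put(Bytes.toBytes("sensor1|" + System.currentTimeMillis() + "|" + i));
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes((long) i));
        mutator.mutate(put);  // buffered append-only write: no Get, no RPC per record
      }
      mutator.flush();        // push any remaining buffered mutations
    }
  }
}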

Real-time Updates Visibility

Users always see the latest data changes. Even if updates have not yet been processed and are still stored as a series of records, they will be merged on the fly.

By doing periodic merges of appended records we ensure that the amount of data that has to be processed on the fly stays small enough for this merging to be fast.

Efficient Updates Processing

  • N Get+Put operations at write time are replaced with N Puts + 1 Scan (shared) + 1 Put operations
  • Processing N changes at once is usually more effective than applying N individual changes

Let’s say we get 360 update requests for the same record: e.g. the record keeps track of some sensor value for a 1-hour interval and we collect data points every 10 seconds. These measurements need to be merged into a single record that represents the whole 1-hour interval. With the standard approach we would perform 360 Get+Put operations (we could use a client-side buffer to perform partial aggregation and reduce the number of actual Get+Put operations, but we want data to be sent immediately as it arrives instead of asking the user to wait 10*N seconds). With the append-only approach, we perform 360 Put operations plus 1 scan (which is actually meant to process not only these updates) that goes through the 360 records (stored in sequence), calculates the resulting record, and performs 1 Put operation to store the result back. Fewer operations means using fewer resources, which leads to more efficient processing. Moreover, if the value that needs to be updated is a complex one (takes time to load into memory, etc.) it is much more efficient to apply all updates at once than one by one.
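
Here is a rough sketch of such a compaction pass for a single record, again using plain HBase API calls and the assumed key layout and sum merge from the earlier sketch; a real compaction job would share one scan across many records, and the actual implementation lives in HBaseHUT.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CompactionSketch {
  private static final byte[] CF = Bytes.toBytes("d");
  private static final byte[] COUNT = Bytes.toBytes("count");

  // Merge all appended updates for one original key into a single compacted record.
  static void compactRecord(Table table, byte[] originalKey) throws IOException {
    Scan scan = new Scan()
        .withStartRow(originalKey)
        .withStopRow(Bytes.add(originalKey, Bytes.toBytes("$"))); // same assumed key range
    long sum = 0;
    List<Delete> processed = new ArrayList<>();
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result r : scanner) {                      // one scan over the appended records
        sum += Bytes.toLong(r.getValue(CF, COUNT));
        if (!Bytes.equals(r.getRow(), originalKey)) { // keep the row we are about to rewrite
          processed.add(new Delete(r.getRow()));
        }
      }
    }
    Put compacted = new Put(originalKey);             // one Put with the merged result
    compacted.addColumn(CF, COUNT, Bytes.toBytes(sum));
    table.put(compacted);
    table.delete(processed);                          // drop the now-redundant update records;
                                                      // updates arriving meanwhile keep their own
                                                      // rows and are picked up by the next pass
  }
}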

Deferring the processing of updates is especially effective when a large portion of the operations are, in essence, insertions of new data rather than updates of stored data. In that case a lot of Get operations (checking whether there is something to update) are redundant.

Better Handling of Update Load Peaks

Deferring the processing of updates helps handle load peaks without major performance degradation. The actual (periodic) processing of updates can be scheduled for off-peak times (e.g. nights or weekends).

Ability to Rollback Updates

Since updates don’t change the existing data (until they are processed), rolling back is easy.

Preserving the ability to roll back even after updates have been compacted is also not hard. Updates can be grouped and compacted within time periods of a given length, as shown in the picture below. That means the client that reads data will still have to merge updates on the fly even right after compaction has finished. However, with proper configuration this isn’t a problem, as the number of records to be merged on the fly will be small enough.

Consider the following example, where the goal is to keep an all-time average value for a particular sensor. Let’s say data is collected every 10 seconds for 30 days, which gives 259,200 separately written data points. While compacting this many values on the fly might still be acceptably fast on a medium-to-large HBase cluster, periodic compaction improves reading speed a lot. Let’s say we process updates every 4 hours and use a 1-hour interval as the compaction base (as shown in the picture above). This gives us, at any point in time, fewer than 24*30 + 4*60*6 = 2,160 non-compacted records to process on the fly when fetching the resulting record for the 30 days: at most 720 hourly compacted records plus up to 4 hours of raw data points (one every 10 seconds, i.e. up to 1,440). This is a small number of records and can be processed very fast. At the same time, it is still possible to roll back to any point in time with 1-hour granularity.
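
To make the rollback idea a bit more concrete, here is a hypothetical sketch; it assumes that each update (or compacted group) row key carries a zero-padded interval start timestamp as its suffix, which is an assumption of this example rather than the HBaseHUT key format.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RollbackSketch {
  // Roll back all updates for a given key whose interval starts at or after rollbackToMillis.
  // Since the base data was never modified in place, dropping these rows is all that is needed.
  static void rollback(Table table, byte[] originalKey, long rollbackToMillis) throws IOException {
    byte[] start = Bytes.add(originalKey,
        Bytes.toBytes("#" + String.format("%019d", rollbackToMillis))); // assumed zero-padded suffix
    byte[] stop = Bytes.add(originalKey, Bytes.toBytes("$"));
    Scan scan = new Scan().withStartRow(start).withStopRow(stop);
    List<Delete> toDrop = new ArrayList<>();
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result r : scanner) {
        toDrop.add(new Delete(r.getRow()));
      }
    }
    table.delete(toDrop);
  }
}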

If the system should store more historical data, but we don’t care about rolling it back (if nothing wrong was found within 30 days, the data is probably OK), then compaction can be configured to process all data older than 30 days as one group (i.e. merge it into one record).

Automatic Handling of Failures in Tasks that Write Data to HBase

Typical scenario: a task updating HBase data fails in the middle of writing – some data was written, some not. Ideally, we should be able to simply restart the same task (on the same input data) so that the new attempt performs the needed updates without corrupting data via duplicated write operations.

In the suggested solution every update is written as a new record. To make sure that performing the same (literally the same, not just similar) update operation multiple times does not result in multiple separate updates (which would corrupt the data), every update operation should write to a record with the same row key from any task attempt. A retried attempt then simply overwrites the single record created by the failed attempt, the same update is never applied more than once, and the data stays consistent.
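
A sketch of that idea: derive the update row key deterministically from the input being processed (the inputId below is an assumed stable identifier, e.g. source file plus offset), so a restarted task attempt overwrites the very same row instead of appending a duplicate.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IdempotentUpdateWriteSketch {
  private static final byte[] CF = Bytes.toBytes("d");
  private static final byte[] COUNT = Bytes.toBytes("count");

  // inputId must identify the piece of input being processed (e.g. "<inputFile>:<offset>"),
  // so every task attempt that processes the same input produces the same row key.
  static void writeUpdate(Table table, byte[] originalKey, String inputId, long delta)
      throws IOException {
    Put put = new Put(Bytes.add(originalKey, Bytes.toBytes("#" + inputId)));
    put.addColumn(CF, COUNT, Bytes.toBytes(delta));
    table.put(put); // a retried attempt overwrites this exact row instead of adding a new one
  }
}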

This is especially convenient when we write to an HBase table from MapReduce tasks, as the MapReduce framework restarts failed tasks for you. With this approach, handling task failures happens automatically: no extra effort is needed to manually roll back the changes of a previous task attempt and start a new one.

Cons

Below are the major drawbacks of the suggested solution. There are usually ways to reduce their effect on your system depending on the specific case (e.g. by tuning HBase appropriately or by adjusting the parameters involved in the data compaction logic).

  • merging on the fly takes time. Properly configuring the periodic updates processing is key to keeping data fetching fast.
  • when performing compaction, many records that don’t need to be compacted (already compacted or “stand-alone” records) may end up being scanned. Compaction can usually be limited to data written after the previous compaction, which makes it possible to use efficient time-based filters to reduce the impact here.

Solving these issues may be implementation-specific. We’ll bring them up again when talking about our implementation in the follow-up post.

Implementation: Meet HBaseHUT

The suggested solution was implemented and open-sourced as the HBaseHUT project. HBaseHUT will be covered in the follow-up post shortly.

 

If you like this sort of stuff, we’re looking for Data Engineers!

HBase Real-time Analytics & Rollbacks via Append-based Updates

In this part 1 of a 3-part post series we’ll describe how we use HBase at Sematext for real-time analytics and how we can perform data rollbacks by using an append-only updates approach.

Some bits of this topic were already covered in Deferring Processing Updates to Increase HBase Write Performance and some were briefly presented at BerlinBuzzwords 2011 (video). We will also talk about some of the ideas below during HBaseCon-2012 in late May (see Real-time Analytics with HBase). The approach described in this post is used in our production systems (SPM & SA) and the implementation was open-sourced as HBaseHUT project.

Problem we are Solving

While HDFS & MapReduce are designed for massive batch processing and with the idea of data being immutable (write once, read many times), HBase includes support for additional operations such as real-time and random read/write/delete access to data records. HBase performs its basic job very well, but there are times when developers have to think at a higher level about how to utilize HBase capabilities for specific use-cases.  HBase is a great tool with good core functionality and implementation, but it does require one to do some thinking to ensure this core functionality is used properly and optimally. The use-case we’ll be working with in this post is a typical data analytics system where:

  • new data are continuously streaming in
  • data are processed and stored in HBase, usually as time-series data
  • processed data are served to users who can navigate through most recent data as well as dig deep into historical data

Although the above points frame the use-case relatively narrowly, the approach and its implementation that we’ll describe here are really more general and applicable to a number of other systems, too. The basic issues we want to solve are the following:

  • increase record update throughput. Ideally, despite the high volume of incoming data, changes can be applied in real-time. Usually, due to the limitations of the “normal HBase update”, which requires Get+Put operations, updates are applied using a batch-processing approach (e.g. as MapReduce jobs).  This, of course, is anything but real-time: incoming data is not immediately seen.  It is seen only after it has been processed.
  • ability to roll back changes in the served data. Human errors or any other issues should not permanently corrupt data that system serves.
  • ability to fetch data interactively (i.e. fast enough for impatient humans).  Whether one navigates through a small amount of recent data or the selected time interval spans years, retrieval should be fast.

Here is what we consider an “update”:

  • addition of a new record if no record with the same key exists
  • update of an existing record with a particular key

Let’s take a look at the following example to better understand the problem we are solving.

Example Description

Briefly, here are the details of an example system:

  • System collects metrics from a large number of sensors (N) very frequently (each second) and displays them on chart(s) over time
  • User needs to be able to select small time intervals to display on a chart (e.g. several minutes) as well as very large spans (e.g. several years)
  • Ideally, data shown to user should be updated in real-time (i.e. user can see the most recent state of the sensors)

Note that even if some of the above points are not applicable to your system the ideas that follow may still be relevant and applicable.

Possible “direct” Implementation Steps

The following steps are by no means the only possible approach.

Step 1: Write every data point as a new record or new column(s) in some record in HBase. In other words, use a simple append-only approach. While this works well for displaying charts with data from short time intervals, showing a year’s worth of data (there are about 31,536,000 seconds in one year) may be too slow to call the experience “interactive”.
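
A minimal sketch of Step 1 with the HBase client API, assuming a row key made of the sensor id plus the reading’s timestamp (the schema is invented for this example):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AppendOnlyDataPointWriter {
  private static final byte[] CF = Bytes.toBytes("d");
  private static final byte[] VALUE = Bytes.toBytes("value");

  // Every reading becomes a brand new row; nothing existing is read or modified.
  static void writeDataPoint(Table table, String sensorId, long timestampMillis, double reading)
      throws IOException {
    Put put = new Put(Bytes.toBytes(sensorId + "|" + timestampMillis));
    put.addColumn(CF, VALUE, Bytes.toBytes(reading));
    table.put(put);
  }
}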

Step 2: Store extra records with aggregated data for larger time intervals (say 1 hour, so that 1 year = 8,760 data points). Since new data comes in continuously, we want it to be visible in real-time, and we cannot rely on data arriving in strict order (say, because one sensor had network connectivity issues, or because we want to be able to import historical data from a new data source), we have to use update operations on the records that hold data for longer intervals. This requires a lot of Get+Put operations to update the aggregated records, which means degraded performance: writing to HBase in this fashion will be significantly slower than the append-only approach described in Step 1. This may slow writes so much that a system like this may not actually be able to keep up with the volume of the incoming data.  Not good.

Step 3: Compromise on real-time data analysis and process data in small batches (near real-time). This will decrease the load on HBase, as we can process (aggregate) data more efficiently in batches and reduce the number of update (Get+Put) operations. But do we really want to compromise real-time analytics? No, of course not.  While it may seem OK in this specific example to show data for bigger intervals with some delay (near real-time), in real-world systems this usually affects other charts/reports, such as reports that need to show total, up-to-date figures. So no, we really don’t want to compromise on real-time analytics if we don’t have to. In addition, imagine what happens if something goes wrong (e.g. wrong data was fed as input, or the application aggregated data incorrectly due to a bug or human error).  If that happens we will not be able to easily roll back recently written data. Utilizing native HBase column versions may help in some cases, but in general, when we want greater control over the rollback operation, a better solution is needed.

Use Versions for Rolling Back?

Recent additions to managing cell versions make cell versioning even more powerful than before. Things like HBASE-4071 make it easy to store historical data without big overhead by cleaning old data efficiently. While it may seem obvious to use versions (a native HBase feature) to allow rolling back data changes, we cannot (and do not want to) rely heavily on cell versions here. The main reason is that versions are just not very efficient once a cell has a lot of them: when the update history for a record/cell becomes very long, it requires many versions for that cell, and versions are managed and navigated as a simple list in HBase (as opposed to the Map-like structure used for records and columns), so managing long lists of versions is less efficient than having a bunch of separate records/columns. Besides, using versions would not help with the Get+Put situation, and we are aiming to kill these two birds with one stone with the solution we are about to describe. One could try to use the append-only updates approach described below with cell versions as the update log, but this would again bring us back to managing long lists inefficiently.

Suggested Solution

Given the example above, our suggested solution can be described as follows:

  • replace update (Get+Put) operations at write time with simple append-only writes and defer the processing of updates to periodic jobs, or perform aggregations on the fly if the user asks for data before the individual additions have been processed.

The idea is simple and not necessarily novel, but given the specific qualities of HBase, namely fast range scans and high write throughput, this approach works very well.  So well, in fact, that we’ve implemented it in HBaseHUT and have been using it with success in our production systems (SPM & SA).

So, what we gain here is:

  • high update throughput
  • real-time updates visibility: despite deferring the actual updates processing, user always sees the latest data changes
  • efficient updates processing by replacing random Get+Put operations with processing whole sets of records at a time (during the fast scan) and eliminating redundant Get+Put attempts when writing the first data item
  • ability to roll back any range of updates
  • avoid data inconsistency problems caused by tasks that fail after only partially updating data in HBase without doing rollback (when using with MapReduce, for example)

In part 2 post we’ll dig into the details around each of the above points and we’ll talk more about HBaseHUT, which makes all of the above possible. If you like this sort of stuff, we’re looking for Data Engineers!

Data Engineer Position at Sematext International

If you’ve always wanted to work with Hadoop, HBase, Flume, and friends and build massively scalable, high-throughput distributed systems (like our Search Analytics and SPM), we have a Data Engineer position that is all about that!  If you are interested, please send your resume to jobs@sematext.com.

Responsibilities:

  • Versatile architect and developer – design and build large, high-performance, scalable data processing systems using Hadoop, HBase, and other big data technologies
  • DevOps fan – run and tune large data processing production clusters
  • Tool maker – develop ops and management tools
  • Open source participant – keep up with development in areas of cloud and distributed computing, NoSQL, Big Data, Analytics, etc.

Pluses:

  • solid background in Math, Statistics, Machine Learning, or Data Mining is not required but is a big plus
  • experience with Analytics, OLAP, Data Warehouse or related technologies is a big plus
  • ability and desire to expand and lead a data engineering team
  • ability to think both business and engineering
  • ability to build products and services based on observed client needs
  • ability to present in public, at meetups, conferences, etc.
  • ability to contribute to blog.sematext.com
  • active participation in open-source communities
  • desire to share knowledge and teach
  • positive attitude, humor, agility, high integrity, low ego, and attention to detail

Location:

  • New York

We’re small and growing.  Our HQ is in Brooklyn, but our team is spread over 4 continents.  If you follow this blog you know we have deep expertise in search and big data analytics and that our team members are conference speakers, book authors, Apache members, open-source contributors, etc.

Relevant pointers:

Berlin Buzzwords 2012 – Three Talks from Sematext

Last year was our first time at Berlin Buzzwords.  We gave 1 full talk about Search Analytics (video) and 2 lightning talks (video, video).  We saw a number of good talks, too.  We also took part in an HBase Hackathon organized by Lars George in Groupon’s Berlin offices and even found time to go clubbing.  So, in hopes of paying Berlin another visit this year, a few of us at Sematext (@sematext) submitted talk proposals.  Last week we all got acceptance emails, so this year there will be 3 talks from 3 Sematextans at Berlin Buzzwords!  Here is what we’ll be talking about:

Rafał: Scaling Massive ElasticSearch Clusters

This talk describes how we’ve used ElasticSearch to build massive search clusters capable of indexing several thousand documents per second while at the same time serving a few hundred QPS over billions of documents in well under a second.  We’ll talk about building clusters that continuously grow in terms of both indexing and search rates. You will learn about finding cluster nodes that can handle more documents, about managing shard and replica allocation and prevention of unwanted shard rebalancing, about avoiding expensive distributed queries, etc.  We’ll also describe our experience doing performance testing of several ElasticSearch clusters and will share our observations about what settings affect search performance and how much.  In this talk you’ll also learn how to monitor large ElasticSearch clusters, what various metrics mean, and which ones to pay extra attention to.

Alex: Real-time Analytics with HBase

HBase can store massive amounts of data and allow random access to it – great. MapReduce jobs can be used to perform data analytics on a large scale – great. MapReduce jobs are batch jobs – not so great if you are after Real-time Analytics. Meet append-only writes approach that allows going real-time where it wasn’t possible before.

In this talk we’ll explain how we implemented “update-less updates” (not a typo!) for HBase using append-only approach. This approach shines in situations where high data volume and velocity make random updates (aka Get+Put) prohibitively expensive.  Apart from making Real-time Analytics possible, we’ll show how the append-only approach to updates makes it possible to perform rollbacks of data changes, and avoid data inconsistency problems caused by tasks in MapReduce jobs that fail after only partially updating data in HBase.  The talk is based on Sematext’s success story of building a highly scalable, general purpose data aggregation framework which was used to build Search Analytics and Performance Monitoring services. Most of the generic code needed for append-only approach described in this talk is implemented in our HBaseHUT open-source project.

Otis: Large Scale ElasticSearch, Solr & HBase Performance Monitoring

This talk has all the buzzwords covered: big data, search, analytics, realtime, large scale, multi-tenant, SaaS, cloud, performance… and here is why:

In this talk we’ll share the “behind the scenes” details about SPM for HBase, ElasticSearch, and Solr, a large scale, multi-tenant performance monitoring SaaS built on top of Hadoop and HBase running in the cloud.  We will describe all its backend components, from the agent used for performance metrics gathering, to how metrics get sent to SPM in the cloud, how they get aggregated and stored in HBase, how alerting is implemented and how it’s triggered, how we graph performance data, etc.  We’ll also point out the key metrics to watch for each system type.  We’ll go over various pain-points we’ve encountered while building and running SPM, how we’ve dealt with them, and we’ll discuss our plans for SPM in the future.

We hope to see some of you in Berlin.  If these topics are of interest to you, but you won’t be coming to Berlin, feel free to get in touch, leave comments, or ping @sematext.  And if you love working with things our talks are about, we are hiring world-wide!

Relevance Tuning and Competitive Advantage via Search Analytics

Here are two cool things about Search Analytics that I’d like to point out.  The slides are stolen from our Search Analytics presentation at Enterprise Search Summit 2011 in Washington DC.

Search Analytics for A/B testing, relevance tuning and improvements

This slide shows how Search Analytics can be used to help with A/B testing.  Concretely, in this slide we see two Solr Dismax handlers selected on the right side.  If you are not familiar with Solr, think of a Dismax handler as an API that search applications call to execute searches.  In this example, each Dismax handler is configured differently, and thus each of them ranks search hits slightly differently.  On the graph we see the MRR (see the Wikipedia page on Mean Reciprocal Rank for details) for both Dismax handlers, and we can see that the one corresponding to the blue line is performing much better.  That is, users are clicking on search hits closer to the top of the search results page, which is one of several signals that this Dismax handler provides better relevance ranking than the other one.  Once you have a system like this in place you can add more Dismax handlers and compare two or more of them at a time.  As a result, with the help of Search Analytics you get actual, real feedback about any changes you make to your search engine.  Without a tool like this, you cannot really tune your search engine’s relevance well and will be doing it blindly.
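
As a quick refresher on that metric: MRR is simply the average of 1/rank of the first clicked result across a set of queries.  For example, if users’ first clicks across three searches landed at positions 1, 3, and 2, then MRR = (1/1 + 1/3 + 1/2) / 3 ≈ 0.61; a handler that gets clicks higher up the results page scores closer to 1.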


A/B Testing with Search Analytics

Note: while in this slide we see two Solr Dismax handlers, Sematext Search Analytics is search vendor agnostic – the same thing can be done with search powered by FAST/Microsoft Search, Attivio, Endeca/Oracle, Autonomy/HP, Vivisimo, Dieselpoint, Coveo, ElasticSearch, vanilla Lucene, Xapian, or Sphinx, or …

Gaining Competitive Advantage with Search Analytics

As you can see, the only way to fix or improve things (in this case, various aspects of the search experience) is to have something with which you can measure that experience and your changes.  You need something to tell you when it’s time to improve things, and you need something that gives you feedback about your changes: Did the key metrics improve after your changes?  If so, by how much?  Did any metrics degrade? etc.

Search Analytics Key Takeaways

I can’t emphasize enough how important Search Analytics is and how few organizations use it, or use it well and consistently.  While this may be a bit mind-boggling for those of us who live and breathe search, from your perspective this is a great thing: it means that if you are smart about using Search Analytics to improve your search engine and your users’ search experience, you will gain a competitive advantage and be ahead of your competitors who still don’t have or don’t use Search Analytics!

If you have any questions or feedback about Search Analytics in general, please leave a comment and we’ll follow up as soon as possible!

@sematext
