SPM Discountorama Announcement

We are happy to announce the General Availability of SPM, our performance monitoring solution for Apache Solr, ElasticSearch, HBase, SenseiDB, and Java applications, and of course all system metrics. You can also vote for what else you want SPM to monitor.  Over the last N months that we’ve been running SPM we’ve received a lot of good feedback (thanks!), a lot of words of encouragement (thanks!), and even a few nice quotes (another thanks!). Here is one from Jerry Yang, a Software Engineer at Walmart Labs: “I have been using SPM for couple of days and it has been amazing. I learned a lot about my Solr services and was able to optimize based on the results on SPM. Great work.”

Discount Codes

Since holiday season is coming up, we thought we’d offer some discounts every week between now until the end of the year.  Each of the following discounts can be used only during “its week” specified below.  There is a limit to the number of people who can use each discount, so if you want it, don’t waste too much time.  Each discount will reduce the price of SPM SaaS for 365 days after you’ve used it, which effectively means you will get discount until the end of 2013.  Note that when you register for SPM you do not need to enter your credit card information.  You also don’t need to provide it when you create the SPM application for the system you want to monitor.  And it is when you create your SPM application that you can enter the discount code.

  • 20% for the remainder of this week until the end of this Sunday, December 9: NY201320
  • 15% for the week of December 10, 2012: NY201315
  • 10% for the week of December 17, 2012: NY201310
  • 5% for the week of December 24, 2012: NY201305

Note that each discount code expires on Sunday at 00:00 UTC.

SPM Flavours

The above discounts are good for our SPM SaaS.  However, if you’d rather run SPM on your own servers, we do offer SPM on Premises – please get in touch if you are interested in the on premises version.  You can also vote for SPM SaaS vs. On Premise and that way tell us which version you prefer or want.

SPM Plans

There are a few different subscription plans available in SPM SaaS:

  • Basic plan that is free and shows the last 30 minutes of performance data
  • Standard plan that shows the last 30 days of data and costs $0.035/server/hour
  • Pro plan that shows the last 60 days of performance data and costs $0.070/server/hour

If you have not used SPM before, here is what you can expect to see – click on the image to see a large, non-fuzzy version:

We hope you will find SPM useful and fun to use.  We are always looking for feedback – just email spm-support@sematext.com or ping @sematext and tell us what you like or don’t like about SPM.

Slides: Large Scale Performance Monitoring for ElasticSearch, HBase, Solr, SenseiDB…

In this presentation from Berlin Buzzwords 2012 we show how the SPM, our Performance Monitoring service is built.  How metrics are collected, how they are processed, and how they are presented.  We share a few findings along the way, too.

Note: we are actively looking for people with strong Java engineers. If that’s you, please get in touch. Separately, if you have interest and/or experience with HBase and/or Analytics, OLAP, and related areas, or  if you are looking to work with ElasticSearch, Solr, and search in general please get in touch, too.

 

See also:

Sematext Presenting Open Source Search Safari at ESS 2012

We are continuing our “new tradition” of presenting at Enterprise Search Summit (ESS) conferences.  We presented at ESS in 2011 (see http://blog.sematext.com/2011/11/02/search-analytics-at-enterprise-search-summit-fall-2011-presentation/).  This year we’ll be giving a talk titled Open Source Search Safari, in which Otis (@otisg) will present a number of open-source search solutions – Lucene, Solr, ElasticSearch, and Sensei, plus maybe one or two others (suggestions?).  We’ll also be chained to booth 26 where we’ll be showcasing our search-vendor neutral Search Analytics service (service is currently still free, feel free to use it all you want), along with some of our other search-related products.

ESS East will be held May 15-16 in our own New York City.  If you are a past or prospective client or customer of ours, please get in touch if you are interested in attending ESS at a discount.

Sensei: distributed, realtime, semi-structured database

Once upon a time there was no decent open-source search engine.  Then, at the very beginning of this millennium Doug Cutting gave us Lucene.  Several years later Yonik Seeley wrote Solr.  In 2010 Shay Banon released ElasticSearch.  And just a few days ago John Wang and his team at LinkedIn announed Sensei 1.0.0 (also known as SenseiDB).  Here at Sematext we’ve been aware of Sensei for a while now (2 years?) and are happy to have one more great piece of search software available for our own needs and those of our customers.  As a matter of fact, we are so excited about Sensei that we’ve already started hacking on adding support for Sensei to SPM, our Scalable Performance Monitoring tool and service!  Since Sensei is brand new, we asked John to tell us a little more about it.

Please tell us a bit about yourself.

My name is John Wang, and I am the search architect at LinkedIn.com. I am the creator and the current lead for the Sensei project.

Could you describe Sensei for us?

Sensei is an open-source, elastic, realtime, distributed database with native support for searching and navigating both unstructured text and structured data. Sensei is designed to handle complex semi-structured queries on very large, and rapidly changing datasets.

It was written by the content search team at LinkedIn to support LinkedIn Homepage and Signal.

The core engine is also used for LinkedIn search properties, e.g. people search, recruiter system, job and company search pages.

Why did you write Sensei instead of using Lucene or Solr?

Sensei leverages Lucene.

We weren’t able to leverage Solr because of the following requirements:

  • High update requirement, 10’s of thousands updates per second in to the system
  • Real distributed solution, current Solr’s distributed story has a SPOF at the master, and Solr Cloud is not yet completed.
  • Complex faceting support. Not just your standard terms based faceting. We needed to facet on social graph, dynamic time ranges and many other interesting faceting scenarios. Faceting behavior also needs to be highly customizable, which is not available via Solr.

What does Sensei do that existing open-source search solutions don’t provide?

Consider Sensei if your application has the following characteristics:

  • High update rates
  • Non-trivial semi-structured query support

Who should stick with Solr or ElasticSearch instead of using Sensei?

The feature set, as well as limitations of all these system don’t overlap fully. Depending on your application, if you are building on certain features in one system and it is working out, then I would suggest you stick with it. But for Sensei, our philosophy is to consider performance ahead of features.

Are there currently any Sensei users other than LinkedIn that you are aware of?

We have seen some activities on the mailing list indicating deployments outside of LinkedIn, but I don’t know the specifics. This is a new project and we are keeping track of its usage on http://senseidb.github.com/sensei/usage.html, so let us know if you are using Sensei and want to be listed there.

What are Sensei’s biggest weaknesses and how and when do you plan on addressing them?

Let me address this question by providing a few limitations of Sensei:

  • Each document inserted into Sensei must have a unique identifier (UID) of type long. Though this can be inconvenient, this is a decision made for performance reasons. We have no immediate plans for addressing this, but we are thinking about it.
  • For columns defined in numeric format, e.g. int, float, long…, we don’t yet support negative numbers. We will have support for negative numbers very soon.
  • Static schema. Dynamic schema is something we find useful, and we will support it in the near future.

What’s next for Sensei as a project?

We will continue iterating on Sensei on both the performance and feature front. See below for a list of things we are looking at.

What are some of the key features you plan on adding to Sensei in the coming months?

This may not be a comprehensive list, but gives you an idea areas we are working on:

  • Relevance toolkit
  • Built-in time and geo type columns
  • Parent-node type documents
  • Attribute type faceting (name-value pairs)
  • Online rebalancing
  • Online reindexing
  • Parameter secondary store (e.g. activities on a document, social gestures, etc.)
  • Dynamic schemata
  • Support for aggregation functions, e.g. AVG, MIN, MAX, etc.

The Relevance toolkit sounds interesting.  Could you tell us a bit about what this will do and if, by any chance, this might in any way be close to the idea behind Apache Open Relevance?

This is a big feature for 1.1.0. I am not familiar with Open Relevance to comment. The idea behind relevance toolkit is to allow you to specify a model with the query. One important usage for us is to be able to perform relevance tuning against fast-flowing production data.  Waiting for things to be redeployed to production after relevance model changes does not work if the production data is changing in real-time, like tweets.

Maybe some specific tech questions – feel free to skip the ones that you think are not important or you just don’t feel like answering.

What is the role of Norbert in Sensei?

Sensei currently uses Norbert , whose maintainer is one of our main developers ,as a cluster manager and RPC between a Broker and Sensei nodes. A Broker is servlet embedded in each Sensei node. Norbert is used as a message transport to Sensei nodes.. Norbert is an elegant wrapper around Zookeeper for cluster management. We do have plans to create abstraction around this component to allow pluggability for other cluster managers.

When I first saw SQL-like query on Sensei’s home page I thought it was purely for illustration purposes, but now I realize it is actually very real!

BQL – Browse Query Language, is a SQL-variant to query Senesi. It is very real, we plan for BQL to be a standard way to query Sensei.

Can you share with us any Sensei performance numbers?

We have published some performance numbers at http://senseidb.com/performance.html

We have created a separate Github repository containing all our performance evaluation code at:

https://github.com/kwei/search-perf

Does Sensei have a SPOF (Single Point Of Failure)?

No – assuming a Sensei cluster contains more than 1 replica of each document. This is one important design goal of Sensei: every Sensei node in the cluster acts independently in both consuming data as well as handling queries. See the following answers for details.

What has to happen for data loss to occur?

Data loss occurs only if you have data store corruption on all replicas.  If only 1 replica is corrupted, you can always recover from other replicas.

Sensei by design assumes a data source that is ordered and versioned, e.g., a persistent queue. Each Sensei node persists the version for each commit. Thus, to recover data events can be replayed from that version.

In production at Linkedin, this is very handy to ensure data consistency when bouncing nodes.

You mention recovery from other replicas and recovery by replaying data events from a specific version.  Does that mean once a copy of a document makes it into Sensei in order to recover lost replicas for that document Sensei does not need to reach out to the originator of the data and is self-sufficient, so to speak?  Or does replaying mean getting the lost data from an external data store?

The data stream is external. So to catch-up from an older version, Sensei would just re-play the data events from the stream using this version. But if an entire data replica is lost, a manual copy from other replicas is required (for now).

What happens if a node in a cluster fails?

When a node fails, Zookeeper notifies other cluster event listeners in the cluster, which means the Broker. Broker keeps a state of the current cluster node topology, and subsequent queries will be routed to the live replicas, thus avoiding sending requests to the failed node. If all nodes for one replica are down, then partial results are returned.

What happens when the cluster reaches its capacity?  Can one simply add more nodes to the cluster and Sensei auto-magically takes care of the rest or does one have to manually rebalance shards, or…?

Depending on how data is sharded:

If over-sharding technique is used, then adding nodes to the cluster is trivial. New nodes would just specify which shards they want to handle – every node in sensei.properties indicates partitions it should handle, e.g., sensei.node.partitions=1,3,8

If using sharding strategy where data migration is not needed as new data is flowing into the system, e.g., sharding by time or consecutive UID, then expanding the cluster is also trivial.

If such sharding strategy requires data migration, e.g. mod based sharding. Then cluster rebalancing is required. This is coming in a future release, where we already have designs for online data rebalancing.  For now, one has to reindex all data in order to reshard and rebalance.

Since Sensei is an eventually consistent system, how does one insure the search client gets consistent results (e.g. when paging through results or filtering results with facets)?

On the Sensei request object, there is a routing parameter. Typically this routing parameter is set to the value of the search session. By default, Sensei applies consistent hashing on the routing parameter to make sure the same replica is used for queries with the same routing parameter or search session.

How does one upgrade Sensei? Can one perform an online upgrade of a Sensei cluster without shutting down the whole cluster? Are new versions of Sensei backwards compatible?

Yes, subsets of the cluster can be shut down, and dynamic routing via Zookeeper would take place. This is very useful when we are pushing out new builds in canary mode to ensure stability and compatibility.

Does Sensei require a schema or is it schemaless?

Sensei requires a schema. But we do have plans to make Sensei schema dynamic like ElasticSearch.

Does Sensei have support for things like Spatial Search, Function Queries, Parent-Child data, or JOIN?

We have in the works a relevance toolkit which should cover features of Solr’s Function Queries.

We also have plans to support Spatial Search and Parent-Child data.

We don’t have immediate plans to support Joins.

How does one talk to Sensei?  Are there existing client libraries?

The Sensei cluster exposes 2 rest end-points: REST/JSON and BQL.
The packaging also include Java and Python clients, (also Ruby if resourcing works out), along with JavaScript helpers for using the REST/JSON API in a web application.

Does Sensei have an administrative/management UI?

Sensei comes with a web application for helping with building queries against the cluster. We use it to tweak relevance models as well as instrumenting an online cluster.

JMX is also exposed to administer the cluster.

In the configuration users can turn on other types of data reporting to other clusters, e.g. RRD, log etc.

Big thank you to John and his team for releasing and open-sourcing Sensei and for taking the time to answer all our questions.

@sematext

Follow

Get every new post delivered to your Inbox.

Join 1,640 other followers