HBaseWD and HBaseHUT: Handy HBase Libraries Available in Public Maven Repo

HBaseWD is aimed at helping distribute writes of records with sequential row keys in HBase (and thus avoid RegionServer hotspotting). A good introduction can be found here.

We recently published version 0.1.0 of the library to the Sonatype public Maven repository, which makes integrating it into your project much easier:

  <repositories>
    <repository>
      <id>sonatype release</id>
      <url>https://oss.sonatype.org/content/repositories/releases/</url>
    </repository>
  </repositories>
  <dependency>
    <groupId>com.sematext.hbasewd</groupId>
    <artifactId>hbasewd</artifactId>
    <version>0.1.0</version>
  </dependency>

HBaseHUT is aimed at helping in situations when you need to update a lot of records in HBase in a read-modify-write style. A good introduction can be found here.

We recently published version 0.1.0 of this library to the Sonatype public Maven repository as well. Integration info:

  <repositories>
    <repository>
      <id>sonatype release</id>
      <url>https://oss.sonatype.org/content/repositories/releases/</url>
    </repository>
  </repositories>
  <dependency>
    <groupId>com.sematext.hbasehut</groupId>
    <artifactId>hbasehut</artifactId>
    <version>0.1.0</version>
  </dependency>

For running MapReduce jobs on hadoop-2.0+ (which is part of CDH4.1+), use the 0.1.0-hadoop-2.0 version:

  <dependency>
    <groupId>com.sematext.hbasehut</groupId>
    <artifactId>hbasehut</artifactId>
    <version>0.1.0-hadoop-2.0</version>
  </dependency>

Thank you to all contributors and users of the libraries!

ElasticSearch Shard Placement Control

In this post you will learn how to control ElasticSearch shard placement.  If you are coming to our ElasticSearch talk at BerlinBuzzwords, we’ll be talking about this and more.

If you’ve ever used ElasticSearch you probably know that you can set it up to have multiple shards and replicas of each index it serves. This can be very handy in many situations. The ability to have multiple shards of a single index lets us deal with indices that are too large for efficient serving by a single machine. The ability to have multiple replicas of each shard lets us handle higher query load by spreading replicas over multiple servers. Of course, replicas are useful for more than just spreading the query load, but that’s another topic. In order to shard and replicate, ElasticSearch has to figure out where in the cluster to place shards and replicas, i.e. which server/node each shard or replica should be placed on.

HBaseWD: Avoid RegionServer Hotspotting Despite Sequential Keys

In the HBase world, RegionServer hotspotting is a common problem. We can describe it in a single sentence: while writing records with sequential row keys allows the most efficient reading of a data range given the start and stop keys, it causes undesirable RegionServer hotspotting at write time. In this 2-part post series we’ll discuss the problem and show you how to avoid it.

Problem Description

Records in HBase are sorted lexicographically by row key. This allows fast access to an individual record by its key and fast fetching of a range of data given start and stop keys. There are common cases where you would think row keys forming a natural sequence at write time would be a good choice because of the types of queries that will fetch the data later. For example, we may want to associate each record with a timestamp so that later we can fetch records from a particular time range. Examples of such keys are:

  • time-based format: Long.MAX_VALUE - new Date().getTime()
  • increasing/decreasing sequence: ”001”, ”002”, ”003”,… or ”499”, ”498”, ”497”, …

But writing records with such naive keys will cause hotspotting because of how HBase writes data to its Regions.

RegionServer Hotspotting

When records with sequential keys are being written to HBase all writes hit one Region. This would not be a problem if a Region was served by multiple RegionServers, but that is not the case: each Region lives on just one RegionServer. Each Region has a pre-defined maximal size, so after a Region reaches that size it is split into two smaller Regions. Following that, one of these new Regions takes all new records, and then this Region and the RegionServer that serves it become the new hotspot victims. Obviously, this uneven write load distribution is highly undesirable because it limits write throughput to the capacity of a single server instead of making use of multiple/all nodes in the HBase cluster. The uneven load distribution can be seen in Figure 1 (chart courtesy of SPM for HBase):

Figure 1. HBase RegionServer hotspotting

We can see that while one server was sweating trying to keep up with writes, the others were “resting”. You can find some more information about this problem in the HBase Reference Guide.

Solution Approach

So how do we solve this problem? The cases discussed here assume that we don’t have all the data we want to write to HBase at once, but rather that the data is arriving continuously. For bulk import of data into HBase the best solutions, including those that avoid hotspotting, are described in the bulk load section of the HBase documentation. However, if you are like us at Sematext, and many organizations nowadays are, the data keeps streaming in and needs processing and storing. The simplest way to avoid single RegionServer hotspotting in case of continuously arriving data would be to simply distribute writes over multiple Regions by using random row keys. Unfortunately, this would compromise the ability to do fast range scans using start and stop keys. But that is not the only solution. The following simple approach solves the hotspotting issue while preserving the ability to fetch data by start and stop key. This solution, mentioned multiple times on the HBase mailing lists and elsewhere, is to salt row keys with a prefix. For example, consider constructing the row key using this:

new_row_key = (++index % BUCKETS_NUMBER) + original_key

For the visual types among us, that may result in keys looking as shown in Figure 2.

Figure 2. HBase row key prefix salting

Here we have:

  • index is the numeric (or any sequential) part of the specific record/row ID that we later want to use for record fetching (e.g. 1, 2, 3 ….)
  • BUCKETS_NUMBER is the number of “buckets” we want our new row keys to be spread across. As records are written, each bucket preserves the sequential notion of original records’ IDs
  • original_key is the original key of the record we want to write
  • new_row_key is the actual key that will be used when writing a new record (i.e. “distributed key” or “prefixed key”). Later in the post the “distributed records” term is used for records which were written with this “distributed key”.
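
For illustration, here is a minimal Java sketch (our own example code, not part of any library) of how such a prefixed key could be built; the bucket count and class/method names below are just assumptions:

// Hypothetical helper: prepends a one-byte bucket prefix to the original key.
// The bucket is picked round-robin, mirroring (++index % BUCKETS_NUMBER) above.
public class PrefixedKeyBuilder {
  private static final int BUCKETS_NUMBER = 16; // assumption: 16 buckets
  private long index = 0;

  public byte[] nextDistributedKey(byte[] originalKey) {
    byte bucket = (byte) (++index % BUCKETS_NUMBER);
    byte[] newRowKey = new byte[originalKey.length + 1];
    newRowKey[0] = bucket;                                               // prefix
    System.arraycopy(originalKey, 0, newRowKey, 1, originalKey.length);  // original key
    return newRowKey;
  }
}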

Thus, new records will be split into multiple buckets, each (hopefully) ending up in a different Region in the HBase cluster. New row keys of bucketed records will no longer be in one sequence, but records within each bucket will preserve their original sequence. Of course, if you start writing into an empty HTable, you’ll have to wait some time (depending on the volume and velocity of incoming data, compression, and maximal Region size) before you have several Regions for the table. Hint: use the pre-splitting feature for the new table to avoid the wait time. Once writes using the above approach kick in and start hitting multiple Regions, your “slaves load” chart should look better.

Figure 3. HBase RegionServer evenly distributed write load

Scan

Since data is placed in multiple buckets during writes, we have to read from all of those buckets when doing scans based on “original” start and stop keys and merge the data so that it preserves the “sorted” attribute. That means BUCKETS_NUMBER times more Scans, which can affect performance. Luckily, these scans can be run in parallel, so performance should not degrade and might even improve: compare reading 100K sequential records from one Region (and thus one RegionServer) with reading 10K records from each of 10 Regions on 10 RegionServers in parallel!

Get/Delete

Getting or deleting a single record by its original key may require anywhere from 1 to BUCKETS_NUMBER Get operations, depending on the logic used for prefix generation. E.g. when using a “static” hash as the prefix, given the original key we can precisely identify the prefixed key. If we used a random prefix, we have to perform a Get for each possible bucket. The same goes for Delete operations.
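
To make this concrete, here is a minimal sketch (our own helper method, not part of HBase or HBaseWD) of fetching a record by its original key when a random one-byte prefix in the range 0..BUCKETS_NUMBER-1 was used at write time, so we don’t know which bucket the record landed in:

// Tries every possible bucket prefix until the record is found (or all buckets are checked).
private static Result getByOriginalKey(HTable hTable, byte[] originalKey, int bucketsNumber)
    throws IOException {
  for (int bucket = 0; bucket < bucketsNumber; bucket++) {
    byte[] prefixedKey = new byte[originalKey.length + 1];
    prefixedKey[0] = (byte) bucket;
    System.arraycopy(originalKey, 0, prefixedKey, 1, originalKey.length);
    Result result = hTable.get(new Get(prefixedKey));
    if (!result.isEmpty()) {
      return result; // found the bucket holding this record
    }
  }
  return null; // record not present in any bucket
}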

MapReduce Input

Since we still want to benefit from data locality, the implementation of feeding “distributed” data to a MapReduce job will likely break the order in which data comes to mappers. This is at least true for the current HBaseWD implementation (see below). Each map task will process data for a particular bucket. Of course, records within a bucket will be in the same order based on their original keys. However, since two records that were meant to be “near each other” based on their original keys may have fallen into different buckets, they will be fed to different map tasks. Thus, if a mapper assumes records come in the strict/original sequence, it will be affected, since the order is preserved only within each bucket, not globally.

Increased Number of Map Tasks

When data written using the suggested approach is used as MapReduce input (with a start and/or stop key provided), the number of splits will likely increase (this depends on the implementation). For the current HBaseWD implementation you’ll get BUCKETS_NUMBER times more splits compared to a “regular” MapReduce job with the same parameters. This is due to the same data selection logic as for the simple Scan operation described above. As a result, MapReduce jobs will have BUCKETS_NUMBER times more map tasks. This should not decrease performance as long as BUCKETS_NUMBER is not unreasonably high (i.e. not so high that MR job initialization & cleanup take more time than the processing itself). Moreover, in many use-cases having more mappers helps improve performance. Many users reported more mappers having a positive impact, given that a standard HTable-input-based MapReduce job usually has too few map tasks (one per Region), which cannot be changed without extra coding.

Another strong signal that the suggested approach and its implementation could help is if your application, in addition to writing records with sequential keys, also continuously processes the newly written data delta using MapReduce. In such use-cases, when data is written sequentially (without any artificial distribution) and is processed relatively frequently, the delta to be processed resides in just a few Regions, or perhaps even in just one Region if the write load is not high, the maximal Region size is high, and processing batches are very frequent.

Solution Implementation: HBaseWD

We implemented the solution described above and open-sourced it as the small HBaseWD project. We say small because HBaseWD is really self-contained and really simple to integrate into existing code thanks to its support for the native HBase client API (see examples below). The HBaseWD project was first presented at BerlinBuzzwords 2011 (video) and is currently used in a number of production systems.

Configuring Distribution

Simple Even Distribution

Distributing records with sequential keys into up to Byte.MAX_VALUE buckets (a single byte is added in front of each key):

byte bucketsCount = (byte) 32; // distributing into 32 buckets
RowKeyDistributor keyDistributor = new RowKeyDistributorByOneBytePrefix(bucketsCount);
Put put = new Put(keyDistributor.getDistributedKey(originalKey));
... // add values
hTable.put(put);

Hash-Based Distribution

Another useful RowKeyDistributor is RowKeyDistributorByHashPrefix (see the example below). It creates the “distributed key” based on the original key value, so that later, when you have the original key and want to update the record, you can calculate the distributed key without having to call HBase (to see which bucket it is in). Or, you can perform a single Get operation when the original key is known (instead of reading from all buckets).

AbstractRowKeyDistributor keyDistributor =
     new RowKeyDistributorByHashPrefix(
            new RowKeyDistributorByHashPrefix.OneByteSimpleHash(15));
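
For example, since the hash-based distributor derives the bucket from the key itself, a single Get is enough when the original key is known (a sketch that reuses the getDistributedKey method shown earlier; hTable and originalKey are assumed to exist):

Get get = new Get(keyDistributor.getDistributedKey(originalKey));
Result result = hTable.get(get);
... // process the result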

You can use your own hashing logic here by implementing this simple interface:

public static interface Hasher extends Parametrizable {
  byte[] getHashPrefix(byte[] originalKey);
  byte[][] getAllPossiblePrefixes();
}
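
For illustration, here is a sketch of the kind of hashing logic one might plug in: a one-byte prefix computed as the sum of the key bytes modulo the number of buckets. The class name and bucket count are our own, and the Parametrizable part of the interface is not shown here:

public class SimpleModuloHashing {
  private static final int MAX_BUCKETS = 16; // assumption: 16 buckets

  // Mirrors Hasher.getHashPrefix(byte[]): one-byte prefix = (sum of key bytes) % MAX_BUCKETS
  public static byte[] getHashPrefix(byte[] originalKey) {
    int sum = 0;
    for (byte b : originalKey) {
      sum += b & 0xFF;
    }
    return new byte[] { (byte) (sum % MAX_BUCKETS) };
  }

  // Mirrors Hasher.getAllPossiblePrefixes(): every prefix the hash above could produce
  public static byte[][] getAllPossiblePrefixes() {
    byte[][] prefixes = new byte[MAX_BUCKETS][];
    for (int i = 0; i < MAX_BUCKETS; i++) {
      prefixes[i] = new byte[] { (byte) i };
    }
    return prefixes;
  }
}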

Custom Distribution Logic

HBaseWD is designed to be flexible, especially when it comes to supporting custom row key distribution approaches. In addition to the above-mentioned ability to implement custom hashing logic to be used with RowKeyDistributorByHashPrefix, one can define custom row key distribution logic by extending the AbstractRowKeyDistributor abstract class, whose interface is super simple:

public abstract class AbstractRowKeyDistributor implements Parametrizable {
  public abstract byte[] getDistributedKey(byte[] originalKey);
  public abstract byte[] getOriginalKey(byte[] adjustedKey);
  public abstract byte[][] getAllDistributedKeys(byte[] originalKey);
  ... // some utility methods
}

Common Operations

Scan

Performing a range scan over data:

Scan scan = new Scan(startKey, stopKey);
ResultScanner rs = DistributedScanner.create(hTable, scan, keyDistributor);
for (Result current : rs) {
  ...
}

Configuring MapReduce Job

Performing MapReduce job over the data chunk specified by Scan:

Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "testMapreduceJob");
Scan scan = new Scan(startKey, stopKey);
TableMapReduceUtil.initTableMapperJob("table", scan,
    RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
// Substituting standard TableInputFormat which was set in
// TableMapReduceUtil.initTableMapperJob(...)
job.setInputFormatClass(WdTableInputFormat.class);
keyDistributor.addInfo(job.getConfiguration());

What’s Next?

In the next post we’ll cover:

  • Integration into already running production systems
  • Changing distribution logic in running systems
  • Other “advanced topics”

If you’ve read this far you must be interested in HBase. And since we (@sematext) are interested in people with interest in HBase, here are some open positions at Sematext, some nice advantages we offer, and “problems” we are into that you may want to check out.

Oh, and all the HBase performance charts you see in this post are from our SPM service, which uses HBase and, you guessed it, HBaseWD too! ;)

The New SolrCloud: Overview

Just the other day we wrote about Sensei, the new distributed, real-time, full-text search database built on top of Lucene, and here we are again writing about another “new” distributed, real-time, full-text search server also built on top of Lucene: SolrCloud.

In this post we’ll share some interesting SolrCloud bits and pieces that matter mostly to those working with large data and query volumes, but that all search lovers should find really interesting, too.  If you have any questions about what we wrote (or did not write!) in this post, please leave a comment – we’re good at following up to comments!  Or just ask @sematext!

Please note that the functionality described in this post is now part of trunk in the Lucene and Solr SVN repository. This means that it will be available when Lucene and Solr 4.0 are released, but you can also use the trunk version just like we did, if you don’t mind living on the bleeding edge.

Recently, we were given the opportunity to once again work with big data (massive may actually be more descriptive of this data volume) stored in an HBase cluster and make it searchable. We needed to design a scalable search cluster capable of elastically handling future data volume growth. Because of the huge data volume and high search rates our search system required the index to be sharded. We also wanted the indexing to be as simple as possible, and we wanted a stable, reliable, and very fast solution. The one thing we did not want to do is reinvent the wheel. At this point you may ask why we didn’t choose ElasticSearch, especially since we use ElasticSearch a lot at Sematext. The answer is that we started the engagement with this particular client a whiiiiile back, when ElasticSearch wasn’t where it is today. And while ElasticSearch does have a number of advantages over the old master-slave Solr, with SolrCloud being in the trunk now, Solr is again a valid choice for very large search clusters.

And so we took the opportunity to use SolrCloud and some of its features not present in previous versions of Solr.  In particular, we wanted to make use of Distributed Indexing and Distributed Searching, both of which SolrCloud makes possible. In the process we looked at a few JIRA issues, such as SOLR-2358 and SOLR-2355, and we got familiar with relevant portions of SolrCloud source code.  This confirmed SolrCloud would indeed satisfy our needs for the project and here we are sharing what we’ve learned.

Our Search Cluster Architecture

Basically, we wanted the search cluster to look like this:

SolrCloud App Architecture

Simple? Yes, we like simple.  Who doesn’t!  But let’s peek inside that “Solr cluster” box now.

SolrCloud Features and Architecture

Some of the nice things about SolrCloud are:

  • centralized cluster configuration
  • automatic node fail-over
  • near real time search
  • leader election
  • durable writes

Furthermore, SolrCloud can be configured to:

  • have multiple index shards
  • have one or more replicas of each shard

Shards and Replicas are arranged into Collections. Multiple Collections can be deployed in a single SolrCloud cluster.  A single search request can search multiple Collections at once, as long as they are compatible. The diagram below shows a high-level picture of how SolrCloud indexing works.

SolrCloud Shards, Replicas, Replication

As the above diagram shows, documents can be sent to any SolrCloud node/instance in the SolrCloud cluster.  Documents are automatically forwarded to the appropriate Shard Leader (labeled as Shard 1 and Shard 2 in the diagram). This is done automatically and documents are sent in batches between Shards. If a Shard has one or more replicas (labeled Shard 1 replica and Shard 2 replica in the diagram) a document will get replicated to one or more replicas.  Unlike in traditional master-slave Solr setups where index/shard replication is performed periodically in batches, replication in SolrCloud is done in real-time.  This is how Distributed Indexing works at the high level.  We simplified things a bit, of course – for example, there is no ZooKeeper or overseer shown in our diagram.

Setup Details

All configuration files are stored in ZooKeeper. If you are not familiar with ZooKeeper, you can think of it as a distributed file system where SolrCloud configuration files are stored. When the first Solr instance in a SolrCloud cluster is started, configuration files need to be sent to ZooKeeper, and one needs to specify how many shards there should be in the cluster. Then, once this Solr instance/node is running, one can start additional Solr instances/nodes and point them to the ZooKeeper instance (ZooKeeper is actually typically deployed as a quorum of 3, 5, or more instances in production environments). And voilà – the SolrCloud cluster is up! I must say, it’s quite simple and straightforward.

Shard Replicas in SolrCloud serve multiple purposes. They provide fault tolerance in the sense that when (not if!) a single Solr instance/node containing a portion of the index goes down, you still have one or more replicas of the data that was served by that instance elsewhere in the cluster, and thus you still have the whole data set and no data loss. They also allow you to spread query load over more servers, thus making the cluster capable of handling higher query rates.

Indexing

As you saw above, the new SolrCloud really simplifies Distributed Indexing.  Document distribution between Shards and Replicas is automatic and real-time.  There is no master server one needs to send all documents to. A document can be sent to any SolrCloud instance and SolrCloud takes care of the rest. Because of this, there is no longer a SPOF (Single Point of Failure) in Solr.  Previously, Solr master was a SPOF in all but the most elaborate setups.
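
As an illustration only, here is a minimal SolrJ sketch of what sending a document to a SolrCloud cluster can look like with the CloudSolrServer client from trunk (the ZooKeeper addresses and collection name below are made up):

CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181,zkhost3:2181");
server.setDefaultCollection("collection1");

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "1");
doc.addField("title", "Hello SolrCloud");
server.add(doc);    // the document is forwarded to the right Shard Leader automatically
server.commit();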

Querying

One can query SolrCloud a few different ways:

  • One can query a single Shard, which is just like querying a single Solr instance.
  • The second option is to query a single Collection (i.e., search all shards holding pieces of a given Collection’s index).
  • The third option is to only query some of the Shards by specifying their addresses or names.
  • Finally, one can query multiple Collections assuming they are compatible and Solr can merge results they return.

As you can see, lots of choices!
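
To illustrate the last three options, here is a short SolrJ sketch; the host, collection, and shard names are made up, and the shards/collection request parameters are our assumption of how this is exposed on trunk:

CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181,zkhost3:2181");
server.setDefaultCollection("collection1");

SolrQuery query = new SolrQuery("some query terms");
// query only some of the shards:
// query.set("shards", "shard1,shard2");
// query several compatible collections at once:
// query.set("collection", "collection1,collection2");
QueryResponse response = server.query(query);
System.out.println("Matches found: " + response.getResults().getNumFound());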

Administration with Core Admin

In addition to the standard core admin parameters there are some new ones available in SolrCloud. These new parameters let one:

  • create new Shards for an existing Collection
  • create a new Collection
  • add more nodes

The Future

If you look at the New SolrCloud Design wiki page (http://wiki.apache.org/solr/NewSolrCloudDesign) you will notice that not all planned features have been implemented yet. There are still things like cluster re-balancing and monitoring to be done (if you are using SolrCloud already and want to monitor its performance, let us know if you want early access to SPM for SolrCloud). Now that SolrCloud is in the Solr trunk, it should see more user and developer attention. We look forward to using SolrCloud in more projects in the future!

@sematext

Sensei: distributed, realtime, semi-structured database

Once upon a time there was no decent open-source search engine. Then, at the very beginning of this millennium, Doug Cutting gave us Lucene. Several years later Yonik Seeley wrote Solr. In 2010 Shay Banon released ElasticSearch. And just a few days ago John Wang and his team at LinkedIn announced Sensei 1.0.0 (also known as SenseiDB). Here at Sematext we’ve been aware of Sensei for a while now (2 years?) and are happy to have one more great piece of search software available for our own needs and those of our customers. As a matter of fact, we are so excited about Sensei that we’ve already started hacking on adding support for Sensei to SPM, our Scalable Performance Monitoring tool and service! Since Sensei is brand new, we asked John to tell us a little more about it.

Please tell us a bit about yourself.

My name is John Wang, and I am the search architect at LinkedIn.com. I am the creator and the current lead for the Sensei project.

Could you describe Sensei for us?

Sensei is an open-source, elastic, realtime, distributed database with native support for searching and navigating both unstructured text and structured data. Sensei is designed to handle complex semi-structured queries on very large and rapidly changing datasets.

It was written by the content search team at LinkedIn to support LinkedIn Homepage and Signal.

The core engine is also used for LinkedIn search properties, e.g. people search, recruiter system, job and company search pages.

Why did you write Sensei instead of using Lucene or Solr?

Sensei leverages Lucene.

We weren’t able to leverage Solr because of the following requirements:

  • High update requirements: tens of thousands of updates per second into the system
  • A real distributed solution: the current Solr distributed story has a SPOF at the master, and SolrCloud is not yet completed.
  • Complex faceting support, not just your standard terms-based faceting. We needed to facet on the social graph, dynamic time ranges, and many other interesting faceting scenarios. Faceting behavior also needs to be highly customizable, which is not available via Solr.

What does Sensei do that existing open-source search solutions don’t provide?

Consider Sensei if your application has the following characteristics:

  • High update rates
  • Non-trivial semi-structured query support

Who should stick with Solr or ElasticSearch instead of using Sensei?

The feature sets, as well as the limitations, of all these systems don’t overlap fully. Depending on your application, if you are building on certain features in one system and it is working out, then I would suggest you stick with it. But for Sensei, our philosophy is to consider performance ahead of features.

Are there currently any Sensei users other than LinkedIn that you are aware of?

We have seen some activity on the mailing list indicating deployments outside of LinkedIn, but I don’t know the specifics. This is a new project and we are keeping track of its usage on http://senseidb.github.com/sensei/usage.html, so let us know if you are using Sensei and want to be listed there.

What are Sensei’s biggest weaknesses and how and when do you plan on addressing them?

Let me address this question by providing a few limitations of Sensei:

  • Each document inserted into Sensei must have a unique identifier (UID) of type long. Though this can be inconvenient, this is a decision made for performance reasons. We have no immediate plans for addressing this, but we are thinking about it.
  • For columns defined in numeric format, e.g. int, float, long…, we don’t yet support negative numbers. We will have support for negative numbers very soon.
  • Static schema. Dynamic schema is something we find useful, and we will support it in the near future.

What’s next for Sensei as a project?

We will continue iterating on Sensei on both the performance and feature front. See below for a list of things we are looking at.

What are some of the key features you plan on adding to Sensei in the coming months?

This may not be a comprehensive list, but it gives you an idea of the areas we are working on:

  • Relevance toolkit
  • Built-in time and geo type columns
  • Parent-node type documents
  • Attribute type faceting (name-value pairs)
  • Online rebalancing
  • Online reindexing
  • Parameter secondary store (e.g. activities on a document, social gestures, etc.)
  • Dynamic schemata
  • Support for aggregation functions, e.g. AVG, MIN, MAX, etc.

The Relevance toolkit sounds interesting.  Could you tell us a bit about what this will do and if, by any chance, this might in any way be close to the idea behind Apache Open Relevance?

This is a big feature for 1.1.0. I am not familiar enough with Open Relevance to comment. The idea behind the relevance toolkit is to allow you to specify a model with the query. One important usage for us is to be able to perform relevance tuning against fast-flowing production data. Waiting for things to be redeployed to production after relevance model changes does not work if the production data is changing in real-time, like tweets.

Maybe some specific tech questions – feel free to skip the ones that you think are not important or you just don’t feel like answering.

What is the role of Norbert in Sensei?

Sensei currently uses Norbert (whose maintainer is one of our main developers) as a cluster manager and for RPC between a Broker and Sensei nodes. A Broker is a servlet embedded in each Sensei node. Norbert is used as a message transport to Sensei nodes. Norbert is an elegant wrapper around ZooKeeper for cluster management. We do have plans to create an abstraction around this component to allow pluggability of other cluster managers.

When I first saw the SQL-like query on Sensei’s home page I thought it was purely for illustration purposes, but now I realize it is actually very real!

BQL (Browse Query Language) is a SQL variant for querying Sensei. It is very real; we plan for BQL to be a standard way to query Sensei.

Can you share with us any Sensei performance numbers?

We have published some performance numbers at http://senseidb.com/performance.html

We have created a separate Github repository containing all our performance evaluation code at:

https://github.com/kwei/search-perf

Does Sensei have a SPOF (Single Point Of Failure)?

No – assuming a Sensei cluster contains more than 1 replica of each document. This is one important design goal of Sensei: every Sensei node in the cluster acts independently in both consuming data as well as handling queries. See the following answers for details.

What has to happen for data loss to occur?

Data loss occurs only if you have data store corruption on all replicas.  If only 1 replica is corrupted, you can always recover from other replicas.

Sensei by design assumes a data source that is ordered and versioned, e.g., a persistent queue. Each Sensei node persists the version for each commit. Thus, to recover data events can be replayed from that version.

In production at LinkedIn, this is very handy to ensure data consistency when bouncing nodes.

You mention recovery from other replicas and recovery by replaying data events from a specific version.  Does that mean once a copy of a document makes it into Sensei in order to recover lost replicas for that document Sensei does not need to reach out to the originator of the data and is self-sufficient, so to speak?  Or does replaying mean getting the lost data from an external data store?

The data stream is external. So to catch up from an older version, Sensei just replays the data events from the stream using that version. But if an entire data replica is lost, a manual copy from other replicas is required (for now).

What happens if a node in a cluster fails?

When a node fails, ZooKeeper notifies the other cluster event listeners in the cluster, which includes the Broker. The Broker keeps the state of the current cluster node topology, and subsequent queries will be routed to the live replicas, thus avoiding sending requests to the failed node. If all nodes for one replica are down, then partial results are returned.

What happens when the cluster reaches its capacity?  Can one simply add more nodes to the cluster and Sensei auto-magically takes care of the rest or does one have to manually rebalance shards, or…?

Depending on how data is sharded:

If the over-sharding technique is used, then adding nodes to the cluster is trivial. New nodes would just specify which shards they want to handle: each node’s sensei.properties indicates the partitions it should handle, e.g., sensei.node.partitions=1,3,8

If using a sharding strategy where data migration is not needed as new data flows into the system, e.g., sharding by time or by consecutive UID, then expanding the cluster is also trivial.

If the sharding strategy does require data migration, e.g., mod-based sharding, then cluster rebalancing is required. This is coming in a future release; we already have designs for online data rebalancing. For now, one has to reindex all data in order to reshard and rebalance.

Since Sensei is an eventually consistent system, how does one ensure the search client gets consistent results (e.g. when paging through results or filtering results with facets)?

On the Sensei request object, there is a routing parameter. Typically this routing parameter is set to the value of the search session. By default, Sensei applies consistent hashing on the routing parameter to make sure the same replica is used for queries with the same routing parameter or search session.

How does one upgrade Sensei? Can one perform an online upgrade of a Sensei cluster without shutting down the whole cluster? Are new versions of Sensei backwards compatible?

Yes, subsets of the cluster can be shut down, and dynamic routing via Zookeeper would take place. This is very useful when we are pushing out new builds in canary mode to ensure stability and compatibility.

Does Sensei require a schema or is it schemaless?

Sensei requires a schema. But we do have plans to make Sensei schema dynamic like ElasticSearch.

Does Sensei have support for things like Spatial Search, Function Queries, Parent-Child data, or JOIN?

We have in the works a relevance toolkit which should cover features of Solr’s Function Queries.

We also have plans to support Spatial Search and Parent-Child data.

We don’t have immediate plans to support Joins.

How does one talk to Sensei?  Are there existing client libraries?

The Sensei cluster exposes two REST endpoints: REST/JSON and BQL.
The packaging also includes Java and Python clients (also Ruby if resourcing works out), along with JavaScript helpers for using the REST/JSON API in a web application.

Does Sensei have an administrative/management UI?

Sensei comes with a web application that helps with building queries against the cluster. We use it to tweak relevance models as well as to instrument an online cluster.

JMX is also exposed to administer the cluster.

In the configuration, users can turn on other types of data reporting to other clusters, e.g., RRD, logs, etc.

Big thank you to John and his team for releasing and open-sourcing Sensei and for taking the time to answer all our questions.

@sematext
