Lucandra / Solandra: A Cassandra-based Lucene backend

In this guest post Jake Luciani (@tjake) introduces Lucandra (Update: now known as Solandra – see our State of Solandra post), a Cassandra-based backend for Lucene (Update: now integrated with Solr instead).
Update: Jake will be giving a talk about Lucandra in the April 2010 NY Search & Discovery Meetup.  Sign up!
Update 2: Slides from the Lucandra meetup talk are online: http://www.jroller.com/otis/entry/lucandra_presentation
For most users, the trickiest part of deploying a Lucene-based solution is managing and scaling storage, reads, writes and index optimization. Solr and Katta (among others) offer ways to address these, but still require quite a lot of administration and maintenance. This problem is not specific to Lucene. In fact, most data management applications require a significant amount of administration.
In response to this problem of managing and scaling large amounts of data, the “nosql” movement has become increasingly popular. One of the most popular and widely used “nosql” systems is Cassandra, an Apache Software Foundation project originally developed at Facebook.

What is Cassandra?

Cassandra is a scalable and easy to administer column-oriented data store, modeled after Google’s BigTable, but co-created by one of the authors of Amazon’s Dynamo. One of the big differentiators of Cassandra is that it does not rely on a global file system as HBase and BigTable do. Rather, Cassandra uses a decentralized peer-to-peer “Gossip” protocol, which means two things:
  1. It has no single point of failure, and
  2. Adding a node to the cluster is as simple as pointing it at any one live node

Cassandra also has built-in multi-master writes, replication, rack awareness, and can handle downed nodes gracefully. Cassandra has a thriving community and is currently being used at companies like Facebook, Digg and Twitter to name a few.

Enter Lucandra

Lucandra is a Cassandra backend for Lucene. Since Cassandra’s original use within Facebook was for search, integrating Lucene with Cassandra seemed like a “no brainer”. Lucene’s core design makes it fairly simple to strip away and plug in custom Analyzer, Writer, Reader, etc. implementations. Rather than implementing Lucene’s Directory interface on top of the data store, as some backends do (DbDirectory, for example), our approach was to implement an IndexReader and IndexWriter directly on top of Cassandra.

Here’s how Terms and Documents are stored in Cassandra. A Term is a composite key made up from the index, field and term with the document id as the column name and position vector as the column value.
      Term Key                    ColumnName   Value
      "indexName/field/term" => { documentId , positionVector }
      Document Key
      "indexName/documentId" => { fieldName , value }
Cassandra allows us to pull ranges of keys and groups of columns so we can really tune the performance of reads as well as minimize network IO for each query. Also, since writes are indexed and replicated by Cassandra we don’t need to worry about optimizing the indexes or reopening the index to see new writes. This means we get a soft real-time distributed search engine.
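To make the key layout and the range reads concrete, here is a minimal in-memory sketch. It is not Lucandra’s code and does not call Cassandra: the class TermRangeSketch and its methods are hypothetical, a sorted TreeMap stands in for a key space whose row keys are kept in order, and the "/" delimiter is simply taken from the diagram above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory stand-in for a sorted key space; hypothetical names, not the Cassandra or Lucandra API.
final class TermRangeSketch {
    // Row key -> (column name = document id -> column value = term positions).
    private final TreeMap<String, TreeMap<String, List<Integer>>> rows = new TreeMap<>();

    // Composite row key in the "indexName/field/term" layout from the diagram.
    static String termKey(String indexName, String field, String term) {
        return indexName + "/" + field + "/" + term;
    }

    // Records that a term occurs in a document at the given positions.
    void addPosting(String indexName, String field, String term, String documentId, Integer... positions) {
        rows.computeIfAbsent(termKey(indexName, field, term), k -> new TreeMap<>())
            .put(documentId, Arrays.asList(positions));
    }

    // A prefix query such as title:book* maps to one contiguous slice of the sorted keys.
    // Assumes terms never contain '\uffff', which serves as the exclusive upper bound.
    SortedMap<String, TreeMap<String, List<Integer>>> prefixRange(String indexName, String field, String prefix) {
        String from = termKey(indexName, field, prefix);
        return rows.subMap(from, from + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TermRangeSketch index = new TermRangeSketch();
        index.addPosting("bookmarks", "title", "book", "doc1", 0);
        index.addPosting("bookmarks", "title", "bookmark", "doc2", 1, 4);
        index.addPosting("bookmarks", "title", "boot", "doc3", 2);
        index.addPosting("bookmarks", "url", "book", "doc4", 0);
        // Only the rows for title terms starting with "book" are returned.
        System.out.println(index.prefixRange("bookmarks", "title", "book").keySet());
    }
}
```

Because the index name and field come first in the key, all terms of one field sit next to each other, which is what lets a prefix or range query be answered with a single slice instead of one lookup per term.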
There is an impact on Lucandra searches when compared to native Lucene searches. In our testing, Lucandra’s IndexReader is ~10% slower than the default IndexReader. However, this is still quite acceptable to us given what you get in return.
Lucandra writes are slow compared to regular Lucene, since every term is effectively written under its own key. Luckily, this will be fixed in the next version of Cassandra, which will allow batched writes for keys.
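The write fan-out can be illustrated with a small sketch. It is an assumption-laden model rather than Lucandra’s write path: WriteBatchSketch and its methods are hypothetical, nothing here talks to Cassandra, and the nested map is only one plausible shape for a batch that groups pending columns by row key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; it does not call Cassandra or Lucandra.
final class WriteBatchSketch {
    // Groups term positions by row key, then by document id, so the whole map could be sent as one batch.
    static Map<String, Map<String, List<Integer>>> buildBatch(String indexName, String field,
                                                              Map<String, String[]> tokensByDocId) {
        Map<String, Map<String, List<Integer>>> batch = new TreeMap<>();
        for (Map.Entry<String, String[]> doc : tokensByDocId.entrySet()) {
            String[] tokens = doc.getValue();
            for (int position = 0; position < tokens.length; position++) {
                String rowKey = indexName + "/" + field + "/" + tokens[position];
                batch.computeIfAbsent(rowKey, k -> new TreeMap<>())
                     .computeIfAbsent(doc.getKey(), k -> new ArrayList<>())
                     .add(position);
            }
        }
        return batch;
    }

    // Without batching, each (row key, document id) pair is a separate write request.
    static int unbatchedWriteCount(Map<String, Map<String, List<Integer>>> batch) {
        int writes = 0;
        for (Map<String, List<Integer>> columns : batch.values()) {
            writes += columns.size();
        }
        return writes;
    }

    public static void main(String[] args) {
        Map<String, String[]> docs = new TreeMap<>();
        docs.put("doc1", new String[] {"lucene", "on", "cassandra"});
        docs.put("doc2", new String[] {"cassandra", "cluster"});
        Map<String, Map<String, List<Integer>>> batch = buildBatch("bookmarks", "title", docs);
        System.out.println("row keys touched: " + batch.size());
        System.out.println("unbatched write requests: " + unbatchedWriteCount(batch));
    }
}
```

The number of unbatched requests grows with the number of distinct terms per document, which is why a short text field already costs several round trips, while a batch keeps the same data in a single request.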
One other major caveat is that there is no term scoring in the current code. This simply hasn’t been needed yet. Adding it is relatively trivial, via another column.
To see Lucandra in action you can try out the Twitter search app http://sparse.ly that is built on Lucandra. This service uses the Lucandra store exclusively and does not use any sort of relational or other type of database.

Lucandra in Action

Using Lucandra is extremely simple, and switching a regular Lucene search application to use Lucandra is a matter of just a few lines of code. Let’s have a look.

First, we need to create the connection to Cassandra:


import lucandra.CassandraUtils;
import lucandra.IndexReader;
import lucandra.IndexWriter;
...
Cassandra.Client client = CassandraUtils.createConnection();

Next, we create Lucandra’s IndexWriter and IndexReader, and Lucene’s own IndexSearcher.


IndexWriter indexWriter = new IndexWriter("bookmarks", client);
IndexReader indexReader = new IndexReader("bookmarks", client);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);

From here on, you work with IndexWriter and IndexSearcher just like you would in vanilla Lucene. Look at the BookmarksDemo for the complete class.

What’s next? Solandra!

Now that we have Lucandra, we can use it with anything built on Lucene. For example, we can integrate Lucandra with Solr and simplify our Solr administration. In fact, this has already been attempted and we plan to support this in our code soon.

29 Responses to Lucandra / Solandra: A Cassandra-based Lucene backend

  1. Here is a question or two for Jake.

    “Cassandra allows us to pull ranges of keys and groups of columns so we can really tune the performance of reads as well as minimize network IO for each query.”

    The idea here being similar to Lucene’s FieldSelector, right?

    “Also, since writes are indexed and replicated by Cassandra we don’t need to worry about optimizing the indexes or reopening the index to see new writes. This means we get a soft real-time distributed search engine.”

    Could you elaborate more on these 2 points?

    Regarding no need for optimization – is that simply because the Directory is different, so with Cassandra as the Lucene Directory, there is no such thing as index files, segments, and such?
    What about deletes? I skimmed a recent post about distributed deletes and saw a mention of tombstones, which I think is similar to what Lucene does with deleted docs: they are first just marked as deleted, then later truly removed during segment merging (“organic” or triggered by index optimization). Who controls this expunging in Lucandra?

    And what do you mean by seeing writes immediately and not needing to reopen the searcher? What makes that work automatically? Again simply the fact that the storage is Cassandra? Lucandra’s IndexReader doesn’t do anything under the hood to re-read any data or some such? And what about something like FieldCache, which is used for sorting and which needs to be populated once, typically when IndexReader is first opened – how does that work in Lucandra?

    Oh, and one more question: what about Cassandra’s eventual consistency – how does that work with Lucandra seeing writes and documents immediately and consistently?

    Thanks.

    • Jake Luciani says:

      Regarding key ranges: The performance gain is that we can pick and choose just the keys we want in the case of a simple search with a single term. Or we can fetch a range in one step with only the columns we need, for example (field1:(+book*) field2:-another). Once we add term scoring we can fetch term positions and/or scores depending on the call.

      Regarding optimization/deletes: Yes, since there are no segments, index files, etc., there is no need to optimize. Cassandra takes care of this for us and has pre-indexed the data based on the Cassandra config included with Lucandra.
      For deletes, Cassandra deletes the tombstones every N seconds, based on the config setting.

      Regarding seeing writes immediately: Yes, IndexReader does cache, but just for the length of any one query. If you call IndexReader.reopen() it will flush its internal query cache (for sparse.ly we do this on every call). This means that on the next query we see any new info in Cassandra for this index. If you use a Lucene FieldCache, I imagine you will see issues. I’ll need to test this case, but Lucene may drop any new items, since they won’t match any docs in the field cache.

      Last Point: We currently don’t worry about eventual consistency, so different readers may see different terms, though as long as each reader is consistently connected to the same cassandra instance they’ll see a consistent view. There are ways to force consistency across the cluster, but it obviously affects performance.

  2. Here’s another Q for Jake:

    Why Cassandra?

    Why did you not choose, say, HBase, as the underlying storage for your Lucene index? Is there something about Cassandra that makes it more suitable than other options?

    • Jake Luciani says:

      I’m sure something similar could be built on HBase. All that’s really needed is support for key ranges and columns.

      But I think cassandra is a simpler piece of software and has relatively few dependencies (for now).

      HBase requires hdfs, zk and hbase to be setup and maintained.

      The other benefit of cassandra is really the ability to scale down as well as up and rebalance the load across the cluster pretty painlessly. HBase may do this too, but I’m not sure.

  3. Matt Grogan says:

    I’m investigating the possibility of using Cassandra for an online game/social network for kids and Lucandra seems like it might be useful for search within the application.

    Presumably Lucandra requires the Cassandra cluster to define an order-preserving partitioner so that key range scans are possible.

    However the application requires the cluster to use the RandomPartitioner, so the implication is that Lucandra would need to be run on a different cluster.

    Can you let me know if this is indeed the case?

    Thanks

    • sematext says:

      I think where you keep your game/SN data can be independent of where your search indices live. You can store your game/SN data in Cassandra (or any other database), but it doesn’t mean you also have to use Cassandra to store your (Lucene) search indices.

      • Matt Grogan says:

        We would use Cassandra for the game data.

        The question is then can we use Lucandra for searching because it seems it would be easier to manage than maintaining a lucene index manually or using Katta as you argue in your post.

        However it seems we would need separate clusters for game data and Lucandra which is undesirable because of the increased management overhead.

    • You can turn OPP into RP yourself for parts of your data by just prepending the key w/ its md5 hash in hex.

  4. Pingback: Cassandra: RandomPartitioner vs OrderPreservingPartitioner « Bits and Bytes.

  5. Pingback: HBase vs Cassandra: why we moved « Bits and Bytes.

  6. Pingback: Lucene Digest, February 2010 « Sematext Blog

  7. Pingback: Cassandra ライブ情報がテンコ盛り – Jonathan Ellis @ Rackspace [ #cassandra #nosql ] « Agile Cat — Azure & Hadoop — Talking Book

  8. Thomas Koch says:

    There’s another port to HBase from another guy from NYC:
    http://github.com/akkumar/hbasene

    And a mailing list. It’s surely also interesting for lucandra users:
    http://groups.google.com/group/hbasene-user

    Question: What do you think about having one column family for every field instead of putting the field in front of the term? For HBase this means of course, that you can’t easily add additional fields.

  9. philip andrew says:

    Does this index the content in Cassandra and store the index in Cassandra?

    OR

    Is it just a way to store Lucene index in Cassandra for data stored elsewhere?

    • sematext says:

      It’s the latter, though it doesn’t mean that the source of data can’t be Cassandra itself, too.

  10. Pingback: Lucandra: A Cassandra-based Lucene backend « Loses weight the hall according to Iraq

  11. Pingback: links for 2010-04-26 « Daniel Harrison's Personal Blog

  12. Wang Xiaojun says:

    Thanks, man ! I’m trying to do the same thing in hbase, and your work gives me a great illumination!

    • sematext says:

      @Wang I suggest you look at HBasene before you attempt your own variant.

      We’ll have a post on HBasene on Sematext Blog soon.

  13. Pingback: Lucandra,when Lucene meet Cassandra | 旁门左道

  14. Pingback: HBase Digest, May 2010 « Sematext Blog

  15. Pingback: Lucene Digest, May 2010 « Sematext Blog

  16. Pingback: Cassandra勉強会 第8回に参加しました! - Aoyama Media Laboratory

  17. Rauf says:

    Can I store data to be indexed and an index itself in the same Cassandra cluster?

    • sematext says:

      Yes, I believe so, but let’s see what @tjake says.

    • Jake says:

      You mean index data already in Cassandra? No. If you want data in Cassandra to be accessible via the Cassandra APIs, you need to submit it to both Cassandra and Solandra. The Solandra keyspace isn’t meant to be accessed directly.
