The New SolrCloud: Overview

Just the other day we wrote about Sensei, the new distributed, real-time full-text search database built on top of Lucene and here we are again writing about another “new” distributed, real-time, full-text search server also built on top of Lucene: SolrCloud.

In this post we’ll share some interesting SolrCloud bits and pieces that matter mostly to those working with large data and query volumes, but that all search lovers should find really interesting, too.  If you have any questions about what we wrote (or did not write!) in this post, please leave a comment – we’re good at following up to comments!  Or just ask @sematext!

Please note that the functionality described in this post is now part of trunk in the Lucene and Solr SVN repository.  This means that it will be available when Lucene and Solr 4.0 are released, but you can also use the trunk version just like we did, if you don’t mind living on the bleeding edge.

Recently, we were given the opportunity to once again combine big data (massive may actually be more descriptive of this data volume) stored in an HBase cluster with search. We needed to design a scalable search cluster capable of elastically handling future data volume growth.  Because of the huge data volume and high search rates our search system required the index to be sharded.  We also wanted the indexing to be as simple as possible and we wanted a stable, reliable, and very fast solution. The one thing we did not want to do is reinvent the wheel.  At this point you may ask why we didn’t choose ElasticSearch, especially since we use ElasticSearch a lot at Sematext.  The answer is that we started the engagement with this particular client a whiiiiile back, when ElasticSearch wasn’t where it is today.  And while ElasticSearch does have a number of advantages over the old master-slave Solr, with SolrCloud being in the trunk now, Solr is again a valid choice for very large search clusters.

And so we took the opportunity to use SolrCloud and some of its features not present in previous versions of Solr.  In particular, we wanted to make use of Distributed Indexing and Distributed Searching, both of which SolrCloud makes possible. In the process we looked at a few JIRA issues, such as SOLR-2358 and SOLR-2355, and we got familiar with relevant portions of SolrCloud source code.  This confirmed SolrCloud would indeed satisfy our needs for the project and here we are sharing what we’ve learned.

Our Search Cluster Architecture

Basically, we wanted the search cluster to look like this:

SolrCloud App Architecture

Simple? Yes, we like simple.  Who doesn’t!  But let’s peek inside that “Solr cluster” box now.

SolrCloud Features and Architecture

Some of the nice things about SolrCloud are:

  • centralized cluster configuration
  • automatic node fail-over
  • near real time search
  • leader election
  • durable writes

Furthermore, SolrCloud can be configured to:

  • have multiple index shards
  • have one or more replicas of each shard

Shards and Replicas are arranged into Collections. Multiple Collections can be deployed in a single SolrCloud cluster.  A single search request can search multiple Collections at once, as long as they are compatible. The diagram below shows a high-level picture of how SolrCloud indexing works.

SolrCloud Shards, Replicas, Replication

As the above diagram shows, documents can be sent to any SolrCloud node/instance in the SolrCloud cluster.  Documents are automatically forwarded to the appropriate Shard Leader (labeled as Shard 1 and Shard 2 in the diagram). This is done automatically and documents are sent in batches between Shards. If a Shard has one or more replicas (labeled Shard 1 replica and Shard 2 replica in the diagram) a document will get replicated to one or more replicas.  Unlike in traditional master-slave Solr setups where index/shard replication is performed periodically in batches, replication in SolrCloud is done in real-time.  This is how Distributed Indexing works at the high level.  We simplified things a bit, of course – for example, there is no ZooKeeper or overseer shown in our diagram.
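The flow just described can be sketched as a toy model in Python. This is not SolrCloud code: the class and method names are made up for illustration, and choosing a shard by hashing the document ID is our assumption (the text above only says that documents reach the appropriate Shard Leader).

```python
import hashlib


class ToyShard:
    """One shard: a leader index plus zero or more replica indexes."""

    def __init__(self, name, replica_count=1):
        self.name = name
        self.leader = {}
        self.replicas = [{} for _ in range(replica_count)]

    def index(self, doc):
        # The leader indexes first, then the document goes to every replica
        # right away (no periodic batch replication as in master-slave Solr).
        self.leader[doc["id"]] = doc
        for replica in self.replicas:
            replica[doc["id"]] = doc


class ToyCluster:
    def __init__(self, shard_count=2, replica_count=1):
        self.shards = [
            ToyShard(f"shard{i + 1}", replica_count) for i in range(shard_count)
        ]

    def shard_for(self, doc_id):
        # Assumption: route by a stable hash of the document ID.
        digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def add(self, doc):
        # Any node may receive the document; it is forwarded to the leader
        # of the shard that owns it.
        shard = self.shard_for(doc["id"])
        shard.index(doc)
        return shard.name


cluster = ToyCluster(shard_count=2, replica_count=1)
for n in range(10):
    cluster.add({"id": f"doc-{n}", "title": f"Document {n}"})
```

After the loop every document lives on exactly one shard leader and on that leader's replica, which is the property the diagram illustrates.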

Setup Details

All configuration files are stored in ZooKeeper.  If you are not familiar with ZooKeeper you can think of it as a distributed file system where SolrCloud configuration files are stored. When the first Solr instance in a SolrCloud cluster is started, configuration files need to be sent to ZooKeeper and one needs to specify how many shards there should be in the cluster. Then, once this Solr instance/node is running, one can start additional Solr instances/nodes and point them to the ZooKeeper instance (ZooKeeper is actually typically deployed as a quorum of 3, 5, or more instances in production environments).  And voilà – the SolrCloud cluster is up!  I must say, it’s quite simple and straightforward.
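To make the startup sequence concrete, here is a small Python sketch that only assembles the command lines. The system properties (bootstrap_confdir, collection.configName, numShards, zkRun, zkHost, jetty.port) are the ones we remember from the SolrCloud wiki example for trunk, so treat them as assumptions and check them against your checkout. The paths, the config name and the ZooKeeper address are hypothetical.

```python
def first_node_command(conf_dir, config_name, num_shards, zk_host=None):
    """Command for the first node: uploads config files and fixes the shard count."""
    cmd = [
        "java",
        f"-Dbootstrap_confdir={conf_dir}",
        f"-Dcollection.configName={config_name}",
        f"-DnumShards={num_shards}",
    ]
    # With no external ZooKeeper given, ask Solr to run an embedded one.
    cmd.append(f"-DzkHost={zk_host}" if zk_host else "-DzkRun")
    return cmd + ["-jar", "start.jar"]


def additional_node_command(zk_host, jetty_port):
    """Command for every further node: it only needs to know where ZooKeeper is."""
    return [
        "java",
        f"-Djetty.port={jetty_port}",
        f"-DzkHost={zk_host}",
        "-jar",
        "start.jar",
    ]


print(" ".join(first_node_command("./solr/conf", "myconf", 2)))
print(" ".join(additional_node_command("localhost:9983", 7574)))
```

Only the first command carries the configuration directory and the shard count. Later nodes pick up everything else from ZooKeeper.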

Shard Replicas in SolrCloud serve multiple purposes.  They provide fault tolerance in the sense that when (not if!) a single Solr instance/node containing a portion of the index goes down, you still have one or more replicas of data that was served by that instance elsewhere in the cluster and thus you still have the whole data set and no data loss.  They also allow you to spread query load over more servers, thus making the cluster capable of handling higher query rates.

Indexing

As you saw above, the new SolrCloud really simplifies Distributed Indexing.  Document distribution between Shards and Replicas is automatic and real-time.  There is no master server one needs to send all documents to. A document can be sent to any SolrCloud instance and SolrCloud takes care of the rest. Because of this, there is no longer a SPOF (Single Point of Failure) in Solr.  Previously, Solr master was a SPOF in all but the most elaborate setups.

Querying

One can query SolrCloud a few different ways:

  • One can query a single Shard, which is just like querying a single Solr instance.
  • The second option is to query a single Collection (i.e., search all shards holding pieces of a given Collection’s index).
  • The third option is to only query some of the Shards by specifying their addresses or names.
  • Finally, one can query multiple Collections assuming they are compatible and Solr can merge results they return.

As you can see, lots of choices!

Administration with Core Admin

In addition to the standard core admin parameters there are some new ones available in SolrCloud. These new parameters let one:

  • create new Shards for an existing Collection
  • create a new Collection
  • add more nodes
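A minimal sketch of what such Core Admin calls might look like, again in Python and again only building URLs. We are assuming the CREATE action accepts collection and shard parameters as shown on the SolrCloud wiki. The node address, core names and shard name are made up.

```python
from urllib.parse import urlencode


def create_core_url(node, core_name, collection, shard=None):
    """Build a Core Admin CREATE request for a core in the given Collection."""
    params = {"action": "CREATE", "name": core_name, "collection": collection}
    if shard is not None:
        # Place the new core in a specific Shard of the Collection.
        params["shard"] = shard
    return f"http://{node}/solr/admin/cores?{urlencode(params)}"


# A new Shard for an existing Collection.
new_shard_url = create_core_url(
    "localhost:8983", "mycore", "collection1", shard="shard3"
)

# A core for a Collection that does not exist yet.
new_collection_url = create_core_url("localhost:8983", "logs_core", "logs")

print(new_shard_url)
print(new_collection_url)
```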

The Future

If you look at the New SolrCloud Design wiki page (http://wiki.apache.org/solr/NewSolrCloudDesign) you will notice that not all planned features have been implemented yet. There are still things like cluster re-balancing or monitoring (if you are using SolrCloud already and want to monitor its performance, let us know if you want early access to SPM for SolrCloud) to be done.  Now that SolrCloud is in the Solr trunk, it should see more user and more developer attention.  We look forward to using SolrCloud in more projects in the future!

@sematext

Lucene & Solr Year 2011 in Review

The year 2011 is coming to an end and it’s time to reflect on the past 12 months.  Without further fluff, let’s look back and summarize all significant events that happened in Lucene and Solr world over the course of last dozen months. In the next few paragraphs we’ll go over major changes in Lucene and Solr, new blood, relevant conferences and books.

We should start by pointing out that this year Apache Lucene celebrated its 10 year anniversary as an Apache Software Foundation project.  Lucene itself is actually over 10 years old.  Otis is one of the very few people from the early years who is still around.  While we didn’t celebrate any Solr anniversaries this year, we should note that Solr, too, has been around for quite a while and is in fact approaching its 6th year at ASF!

This year saw numerous changes and additions both in Lucene and Solr.  As a matter of fact, we’d venture to say we saw more changes in Lucene & Solr this year than in any one year before.  In that sense, both projects are very much like wine – getting better with time. Let’s take a look at a few of the most significant changes in 2011.

The much anticipated Near Real-Time search (NRT) functionality has arrived.  What this means is that documents that were just added to a Lucene/Solr index can immediately be made visible in search results.  This is big!  Of course, work on NRT is still in progress, but NRT is ready and you, like a number of our clients, should start using it.

Field Collapsing was one of the most watched and voted for JIRA issues for many months.  This functionality was implemented this year and now Lucene and Solr users can group results on the basis of a field or a query. In addition, you can control the groups and even do faceting calculation on the groups, not single documents. A rather powerful feature.
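For Solr users, grouping is driven by request parameters. The Python sketch below assembles them: group, group.field, group.query and group.limit are Solr's result grouping parameters, while the field name and query are hypothetical.

```python
from urllib.parse import urlencode


def grouping_params(q, field=None, group_query=None, per_group=1):
    """Request parameters for grouping results by a field and/or a query."""
    params = [("q", q), ("group", "true"), ("group.limit", per_group)]
    if field:
        params.append(("group.field", field))
    if group_query:
        params.append(("group.query", group_query))
    return params


# Up to 3 hits per author for the query "solr".
query_string = urlencode(grouping_params("solr", field="author", per_group=3))
print(query_string)
```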

From Lucene users’ perspective it is also worth noting that Lucene finally got a faceting module.  Until now, faceting was available only in Solr.  If you are a pure Lucene user, you now don’t need Solr to calculate facets.

In the past, modeling parent-child relationships in Lucene and Solr indices was not really possible – one had to flatten everything.  No longer – if you need to model a parent-child relationship in your index you can use the Join contrib module.  This Join functionality lets you join parent and child documents at query-time, while relying on some assumptions about how documents were indexed.

Good and broad language support is hugely important for any search solution and this year was good for Lucene and Solr in that department: the KStemFilter English stemmer was added, full Unicode 4 support was added, new Japanese and Chinese support was added, a new stemmer-protection mechanism was added, work on synonym filter RAM consumption reduction was done, etc.  Another big addition was integration with Hunspell, which enables language-specific processing for all languages supported by Open Office.  That’s a lot of new languages we can now handle with Lucene and Solr! There is more.

Lucene 3.5.0 significantly reduced the term dictionary memory footprint. Big!  Right now, Lucene uses 3 to 5 times less memory when dealing with the term dictionary, so it consumes even less RAM.

If you use Lucene and need to page through a lot of results you may run into problems. That’s why in Lucene 3.5.0 the searchAfter method was introduced which solves the deep paging problem once and for all!
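The idea behind searchAfter can be shown with a small Python model. This is not the Lucene API. It only mimics the concept: instead of asking for hits offset..offset+n (which forces the collector to keep offset+n entries), the caller passes the last hit of the previous page and the searcher collects only the next n hits that sort after it. Hits are assumed to be ordered by descending score, with ties broken by ascending document ID.

```python
import heapq


def sort_key(hit):
    """Order hits by descending score, then ascending document ID."""
    doc_id, score = hit
    return (-score, doc_id)


def search_after(hits, after, n):
    """Return the next n hits that sort strictly after `after`.

    hits  -- iterable of (doc_id, score) pairs matching the query
    after -- last (doc_id, score) of the previous page, or None for page one
    """
    candidates = (
        hit for hit in hits if after is None or sort_key(hit) > sort_key(after)
    )
    # Only n entries are kept at any time, however deep the page is.
    return heapq.nsmallest(n, candidates, key=sort_key)


# Twenty fake hits with repeating scores, paged six at a time.
all_hits = [(doc, ((doc * 7) % 5) / 4.0) for doc in range(20)]
pages, last = [], None
while True:
    page = search_after(all_hits, last, 6)
    if not page:
        break
    pages.append(page)
    last = page[-1]
```

Concatenating the pages gives the same order as sorting the whole result set once, but no single request ever has to hold more than one page of hits.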

There is also a new, fast and reliable Term Vector-based highlighter that both Lucene and Solr can use.

Dismax is great, but Extended Dismax query parser added to Solr is even better – it extends Dismax query parser functionality and can further improve the quality of search results.

You can now also sort by function (imagine sorting the results by distance from a point) and use the new spatial search with filtering.

Solr also got the new suggest/autocomplete functionality based on FST automaton which significantly reduced the memory needed for such functionality.  If you need this for your search application, have a look at Sematext’s AutoComplete – it has additional functionality that lots of our customers like.

While not yet officially released, the new transaction log support provides Solr with a real-time get operation – as soon as you add a document you can retrieve it by ID.  This will also be used for recovering nodes in SolrCloud.

And talking about SolrCloud…  We’ve covered SolrCloud on this blog before in Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search, and we’ll be covering it again soon.  In short, SolrCloud will make it easier for people to operate larger Solr clusters by making use of more modern design principles and software components such as ZooKeeper, that make creation of distributed, cluster-based software/services easier.  Some of the core functionality is that there will be no single point of failure, any node will be able to handle any operation, there will be no traditional master-slave setup, there will be centralized cluster management and configuration, failovers will be automatic and in general things will be much more dynamic.  SolrCloud has not been released yet, but Solr developers are working on it and the codebase is seeing good progress.  We’ve used SolrCloud in a few recent engagements with our customers and were pleased by what we saw.

After merging the development of those two projects back in 2010, we saw a speed up in development and releases. Lucene and Solr committers introduced five(!) new versions of both projects! In March, Lucene and Solr 3.1 were released with Unicode 4 support, ReusableTokenStream, Spatial search, Vector-based Highlighter, Extended Dismax parser, and many more features and bug fixes. Then, after less than 3 months(!) on June 4th, version 3.2 was released. This release introduced a new and much desired results grouping module, NRTCachingDirectory, and highlighting performance improvements. Just one month later, on July 1st, Lucene and Solr 3.3 were introduced. That release included the KStem stemmer, new implementations of Spellchecker, Field Collapsing in Solr and RAM usage reduction for the autocomplete mechanism. By the end of summer there was another release, this time it was version 3.4 released on the 14th of September. Pure Lucene users got what Solr could do for a very long time – the long awaited faceting module contributed by IBM. Version 3.4 also included the new Join functionality, the ability to turn off query and filter caches and faceting calculation for Field Collapsing. The last release of Lucene and Solr saw the light of day in late November. The 3.5.0 version consisted of huge memory reduction when dealing with term dictionaries, deep paging support, SearcherManager and SearcherLifetimeManager classes along with language identification provided by Tika, as well as sortMissingFirst and sortMissingLast support for TrieFields.

During the last 12 months we attended three major conferences focused on search and big data themes. Lucene Revolution took place in San Francisco in May. Otis gave a talk titled “Search Analytics: What? Why? How?” (slides) during the first day. There were a number of other good talks there and the complete conference agenda is available at http://lucenerevolution.com/2011/agenda. Some videos are available as well. Next came the Berlin Buzzwords conference, a more grass-roots conference which took place between 4th and 10th of June. Otis gave the updated version of his “Search Analytics: What? Why? How?”. If you want to know more, check the official conference site – http://berlinbuzzwords.de. The last conference focused exclusively on Lucene and Solr was Lucene Eurocon 2011 in sunny and tourist-filled Barcelona between 17th and 20th of October. And guess what – we were there again (surprise!), this time in slightly larger numbers. Otis gave a talk about “Search Analytics: Business Value & BigData NoSQL Backend” (video, slides) and Rafał gave a talk on a pretty popular topic - “Explaining & Visualizing Solr ‘explain’ information” (video, slides).

No open source project can endure without regular injections of new blood. This year, the Lucene and Solr development team was joined by a number of new people whose names may look familiar to you:

These 7 men are now Lucene and Solr committers and we look forward to our next year’s Year in Review post, where we hope to go over the good things these people will have brought to Lucene and Solr in 2012.

You know an open source project is successful when a whole book is dedicated to it.  You know a project is very successful when more than one book and more than one publisher cover it.  There were no new editions of Lucene in Action (amazon, manning) this year, but our own Rafał Kuć published his Solr 3.1 Cookbook (amazon) in July.  Rafał’s cookbook includes a number of recipes that can make your life easier when it comes to solving common problems with Apache Solr. Another book, Apache Solr 3 Enterprise Search Server (amazon) by David Smiley and Eric Pugh was published in November. This is a major update to the first edition of the book and it covers a wide range of functionalities of Apache Solr.

@sematext

Solr Performance Monitoring with SPM

Originally delivered as Lightning Talk at Lucene Eurocon 2011 in Barcelona, this quick presentation shows how to use Sematext’s SPM service, currently free to use for unlimited time, to monitor Solr, OS, JVM, and more.

We built SPM because we wanted to have a good and easy to use tool to help us with Solr performance tuning during engagements with our numerous Solr customers.  We hope you find our Scalable Performance Monitoring service useful!  Please let us know if you have any sort of feedback, from SPM functionality and usability to its speed.  Enjoy!

AutoComplete with Suggestion Groups

While Otis is talking about our new Search Analytics (it’s open and free now!) and Scalable Performance Monitoring (it’s also open and free now!) services at Lucene Eurocon in Barcelona, Pascal, one of the new members of our multinational team at Sematext, is working on making improvements to our search-lucene.com and search-hadoop.com sites.  One of the recent improvements is to the AutoComplete functionality we have there.  If you’ve used these sites before, you may have noticed that the AutoComplete there now groups suggestions.  In the screen capture below you can see several suggestion groups divided with pink lines.  Suggestions can be grouped by any criterion, and here we have them grouped by the source of suggestions.  The very first suggestion is from “mail # general”, which is our name for the “general” mailing list that some of the projects we index have.  The next two suggestions are from “mail # user”, followed by two suggestions from “mail # dev”, and so on.  On the left side of each suggestion you can see icons that signify the type of suggestion and help people more quickly focus on types of suggestions they are after.

Sematext AutoComplete Segments

Sematext AutoComplete Suggestion Grouping

The other interesting thing you can see in this AutoComplete implementation is a custom footer, which allows one to show any sort of messaging or even advertising to the end user.  We should also point out that one can use Sematext AutoComplete to embed custom suggestions, such as ads, and have them show up in the list of suggestions only when their (meta) data matches the input, yet displayed differently from the rest of the suggestions in order to make it clear they are ads and not ordinary suggestions.

We hope you find this functionality useful.  Please let us know what you think, and if you have other suggestions for how to improve either AutoComplete or search-lucene.com and search-hadoop.com, tell us – we do listen!

Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search

As promised in Part 1 of Solr Digest, Spring-Summer 2011, in this Part 2 post we’ll summarize what’s new with Solr’s Near-Real-Time Search support and Solr Cloud (if you love clouds and search with some big data on the side, get in touch). Let’s first examine what is being worked on for Solr Cloud and what else is in the queue for the near future. A good overview of what is currently functional can be found in the old Solr Cloud wiki page. Also, there is now another wiki page covering New Solr Cloud Design, which we find quite useful.  The individual pieces of Solr Cloud functionality that are being worked on are as follows:

  • Work is still in progress on Distributed Indexing and Shard distribution policy. Patches exist, although they are now over 6 months old, so you can expect to see them updated soon.
  • As part of the Distributed Indexing effort, shard leader functionality deals with leader election and with publishing the information about which node is a leader of which shard in ZooKeeper in order to notify all interested parties.  The development is pretty active here and initial patches already exist.
  • At some point in the future, Replication Handler may become cloud aware, which means it should be possible to switch the roles of masters and slaves, master URLs will be able to change based on cluster state, etc. The work hasn’t started on this issue.
  • Another feature Solr Cloud will have is automatic Splitting and migrating of Indices. The idea is that when some shard’s index becomes too large or the shard itself starts having bad query response times, we should be able to split parts of that index and migrate it (or merge) with indices on other (less loaded) nodes. Again, the work on this hasn’t started yet.  Once this is implemented one will be able to split and move/merge indices using a Solr Core Admin as described in SOLR-2593.
  • To achieve more efficiency in search and gain control over where exactly each document gets indexed to, you will be able to define a custom shard lookup mechanism. This way, you’ll be able to limit execution of search requests to only some shards that are known to hold target documents, thus making the query more efficient and faster.  This, along with the above mentioned shard distribution policy, is akin to routing functionality in ElasticSearch.
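The custom shard lookup idea from the last item can be illustrated with a short Python sketch. Since this functionality did not exist yet, everything here is hypothetical: the shard names, the choice of a user name as the routing key and the CRC32-based lookup are ours. The point is only that the same function decides where a document is indexed and which shard a search needs to visit.

```python
import zlib

SHARDS = ["shard1", "shard2", "shard3", "shard4"]  # hypothetical shard names


def shard_for_key(routing_key, shards=SHARDS):
    """Map a routing key (here: a user name) to exactly one shard."""
    return shards[zlib.crc32(routing_key.encode("utf-8")) % len(shards)]


def shards_param_for(routing_key):
    """Restrict a search to the one shard that can hold this key's documents."""
    return {"shards": shard_for_key(routing_key)}


# All of a user's documents are indexed to one shard ...
target = shard_for_key("otis")
# ... so a search for that user only has to visit that shard.
print(target, shards_param_for("otis"))
```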

On to NRT:

  • There is now a new wiki page dedicated to Solr NRT Search. In short, NRT Search will be available in Solr 4.0 and the work currently in progress is already available on the trunk. The first new functionality that enables NRT Search in Solr is called “soft-commit”.  A soft commit is a light version of a regular commit, which means that it avoids the costly parts of a regular commit, namely the flushing of documents from memory to disk, while still allowing searches to see new documents. It appears that a common way of using this will be having a soft-commit every second or so, to make Solr behave as NRT as possible, while also having a “hard-commit” automatically every 1-10 minutes. “Hard-commit” will still be needed so the latest index changes are persisted to the storage. Otherwise, in case of crash, changes since last “hard-commit” would be lost.
  • Initial steps in supporting NRT Search in Solr were done in Re-architect Update Handler. Some old issues Solr had were dealt with, like waiting for background merges to finish before opening a new IndexReader, blocking of new updates while commit is in progress and a problem where it was possible that multiple IndexWriters were open on the same index. The work was done on solr2193 branch and that is the place where the spinoffs of this issue will continue to move Solr even closer to NRT.
  • One of the spinoffs of the Update Handler rearchitecture is SOLR-2565, which provides further improvements on the above mentioned issue.  New issues to deal with other related functionality will be opened along the way, while SOLR-2566 looks to serve as an umbrella issue for NRT Search in Solr.
  • Partially related to NRT Search is the new Transaction Log implemented in Solr under SOLR-2700. The goal is to provide durability of updates, while also supporting features like the already committed Realtime get.  Transaction logs are implemented in various other search solutions such as ElasticSearch and Zoie, so Simon Willnauer started a good thread about the possibility of generalizing this new Transaction Log functionality so that it is not limited to Solr, but exposed to other users and applications, such as Lucene, too.
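The difference between a soft-commit and a hard-commit described in the first NRT item can be modeled with a toy Python class. It is a conceptual sketch only and says nothing about how Solr implements commits: the class and attribute names are made up.

```python
class ToyIndex:
    """Tracks which documents are searchable and which would survive a crash."""

    def __init__(self):
        self.pending = []     # added, but not visible to searches yet
        self.searchable = []  # visible to searches (possibly only in memory)
        self.on_disk = []     # persisted by the last hard commit

    def add(self, doc):
        self.pending.append(doc)

    def soft_commit(self):
        # Cheap: makes new documents visible without flushing them to disk.
        self.searchable.extend(self.pending)
        self.pending.clear()

    def hard_commit(self):
        # Costly: also persists everything that is searchable.
        self.soft_commit()
        self.on_disk = list(self.searchable)

    def crash(self):
        # Anything added since the last hard commit is gone.
        self.pending = []
        self.searchable = list(self.on_disk)


index = ToyIndex()
index.add("doc-a")
index.soft_commit()   # doc-a is searchable, but not yet durable
index.hard_commit()   # doc-a is now durable
index.add("doc-b")
index.soft_commit()   # doc-b is searchable ...
index.crash()         # ... but lost, since no hard commit followed
```

This is also why the transaction log mentioned in the last item matters: it is meant to give updates durability between hard commits.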

We hope you found this post useful.  If you have any questions or suggestions, please leave a comment, and if you want to follow us, we are @sematext on Twitter.

The State of Solandra – Summer 2011

A little over 18 months ago we talked to Jake Luciani about Lucandra – a Cassandra-based Lucene backend.  Since then Jake has moved away from raw Lucene and married Cassandra with Solr, which is why Lucandra now goes by Solandra.  Let’s see what Jake and Solandra are up to these days.

 

What is the current status of Solandra in terms of features and stability?

Solandra has gone through a few iterations. First as Lucandra, which partitioned data by terms and used Thrift to communicate with Cassandra.  This worked for a few big use cases, mainly how to manage an index per user, and garnered a number of adopters.  But it performed poorly when you had very large indexes with many dense terms, due to the number and size of remote calls needed to fulfill a query. Last summer I started off on a new approach based on Solr that would address Lucandra’s shortcomings: Solandra.  The core idea of Solandra is to use Cassandra as a foundation for scaling Solr.  It achieves this by embedding Solr in the Cassandra runtime and using the Cassandra routing layer to auto shard an index across the ring (by document).  This means good random distribution of data for writes (using Cassandra’s RandomPartitioner) and good search performance since individual shards can be searched in parallel across nodes (using SolrDistributedSearch).  Cassandra is responsible for sharding, replication, failover and compaction.  The end user now gets a single scalable component for search, without changing APIs, which will scale in the background for them.  Since search functionality is performed by Solr, it will support anything Solr does.

I gave a talk recently on Solandra and how it works: http://blip.tv/datastax/scaling-solr-with-cassandra-5491642

Are you still the sole developer of Solandra?  How much time do you spend on Solandra?
Have there been any external contributions to Solandra?

I still am responsible for the majority of the code, however the Solandra community is quite large with over 500 github followers and 60 forks.  I receive many useful bug reports and patches through the community.  Late last year I started working at DataStax (formerly Riptano) to focus on Apache Cassandra.   DataStax is building a suite of products and services to help customers use Cassandra in production and incorporate Cassandra into existing enterprise infrastructure.  Solandra is a great example of this. We currently have a number of customers using Solandra and we encourage people interested in using Solandra to reach out to us for support.

What are the most notable differences with Solandra and Solr?

The primary difference is the ability to grow your Solr infrastructure seamlessly using Cassandra. I purposely want to avoid altering the Solr functionality since the primary goal here is to make it easy for users to migrate to and from Solandra and Solr.   That being said Solandra does offer some unique features regarding managing millions of indexes.  One is different Solr schemas can be injected at runtime using a RESTful interface and Solandra supports the concept of virtual Solr Cores which share the same core but are treated as different indexes. For example, if you have a core called “inbox” you can create an index per user like “inbox.otis” or “inbox.jake” just by changing the endpoint URL.

Finally, Solandra has a bulk loading interface that makes it easy to index large amounts of data at a time (one known cluster indexes at ~4-5MB of text per second).

What are the most notable differences with Solandra and Elastic Search?

ElasticSearch is more mature and offers a similar architecture for scaling search though not based on Cassandra or Solr.  I think ElasticSearch’s main weakness is it requires users to scrap their existing code and tools to use it.  On the other hand, Solandra provides a scalable platform built on Solr and lets you grow with it.

Solandra doesn’t use the Lucene index file format so it will grow to support millions of indexes. Systems like Solr and ElasticSearch create a directory per index which makes managing millions of indexes very hard. The flipside is there are a lot of performance tweaks lost by not using the native file format; most of the current work on Solandra relates to improving single node performance.

Solandra is a single component that gives you search AND NoSQL database, and is therefore much easier to manage from the operations perspective IMO.

What do you plan on adding to Solandra that will make it clearly stand out from Solr or Elastic Search?

Solandra will continue to grow with Solr (4.0 will be out in the future), as well as with Cassandra. Right now Solandra’s real-time search is limited by the performance of Solr’s field cache implementation. By incorporating Cassandra triggers I think we can remove this bottleneck and get really impressive real-time performance at scale, due to how Solandra pre-allocates shards.

Also, since the Solr index is stored in the Cassandra datamodel, you can now apply some really interesting features of Cassandra to Solr, such as expiring indexes and triggered searches.

When should one use Solandra?

If you say yes to any of the following you should use Solandra:

  • I have more data than can fit on a single box
  • I have potentially millions of indexes
  • I need improved indexing with multi-master writes
  • I need multi-datacenter search
  • I am already using Cassandra and Solr
  • I am having trouble managing my Solr cluster

When should one not use Solandra?

If you are happy with your Solr system today and you have enough capacity to scale the size and number of indexes comfortably then there is no need to use Solandra.  Also, Solandra is under active development so you should be prepared to help diagnose unknown issues.  Also note that if you require search features that are currently not supported by Solr distributed search, you should not use Solandra.

Are there known problems with Solandra that users should be aware of?

Yes, currently the index sizes can be much larger in Solandra than Solr (in some cases 10x); this is due to how Solandra indexes data as well as Cassandra’s file format. Cassandra 1.0 includes compression so that will help quite a bit. Also, since consistency in Solandra is tunable it requires your application to consider the implications of writing data at lower consistencies. Finally, one thing that keeps coming up quite often is users assuming Solandra auto indexes the data you already have in Cassandra, since Solandra builds on Cassandra.  This is not the case.  Data must be indexed and searched through the traditional Solr APIs.

Is anyone using Solandra in production? What is the biggest production deployment in terms of # docs, data size on filesystem, query rate?

Solandra is now in production with a handful of users I know of.  Some others are in the testing/pre-production stage. But it’s still a small number AFAIK. The largest Solandra cluster I know of is in the order of ~5 nodes, ~10TB of data with ~100k indexes and ~2B documents.

If you had to do it all over, what would you do differently?

I’m really excited with the way Lucandra/Solandra has evolved over the past year. It’s been a great learning experience and has allowed me to work with technologies and people I’m really, really excited about. I don’t think I’d change a thing, great software takes time.

When is Solandra 1.0 coming out and what is the functionality/issues that remain to be implemented before 1.0?

I don’t really use the 1.0 moniker as people tend to assume too much when they read that. I think once Solandra is fully documented, supports things like Cassandra based triggers for indexing and search, and has an improved on disk format, I’d be comfortable calling Solandra 0.9 :)

Thank you Jake.  We are looking forward to Solandra 0.9 then.

Solr Digest, Spring-Summer 2011, Part 1

No, Solr Digests are not dead, we’ve just been crazily busy at Sematext (yes, we are hiring!). Since our last Solr Digest not one, but 2 new Solr releases have been made: 3.2 in June, 3.3 in July and version 3.4 is imminent – voting is already in progress, so you can expect a new release pretty soon. Also, there were a number of interesting developments on the trunk (future 3.x and 4.0 versions). Therefore, we will be publishing two Solr Digests this time. This first Digest covers general developments in Solr world, while the sequel will be more focused on two features drawing a lot of attention: Solr Cloud and Near Real Time search.

Let’s get started with a short overview of announced news in 3.2 and 3.3. First, 3.2 brought us:

  • Ability to specify overwrite and commitWithin as request parameters when using the JSON update format
  • TermQParserPlugin, useful when generating filter queries from terms returned by field faceting or terms component
  • DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString
  • Improvements to the UIMA and Carrot2 integrations
  • Highlighting performance improvements
  • A test-framework jar for easy testing of Solr extensions
  • Bugfixes and improvements from Apache Lucene 3.2
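
To make the first item concrete, here is a minimal sketch of how a client might pass overwrite and commitWithin as request parameters alongside a JSON update. The host, port, handler path (/solr/update/json), document fields, and the build_json_update helper are our assumptions for illustration only; the code builds the request without sending it.

```python
import json
from urllib.parse import urlencode


def build_json_update(base_url, doc, commit_within_ms=None, overwrite=None):
    """Return (url, body) for a single-document JSON update request.

    commitWithin and overwrite travel as request parameters rather than
    inside the JSON body. base_url is a hypothetical handler location.
    """
    params = {}
    if commit_within_ms is not None:
        params["commitWithin"] = str(commit_within_ms)
    if overwrite is not None:
        params["overwrite"] = "true" if overwrite else "false"
    url = base_url + ("?" + urlencode(params) if params else "")
    # One "add" command wrapping one document.
    body = json.dumps({"add": {"doc": doc}})
    return url, body


url, body = build_json_update(
    "http://localhost:8983/solr/update/json",  # assumed local Solr instance
    {"id": "doc1", "title": "hello"},          # hypothetical document
    commit_within_ms=5000,
    overwrite=False,
)
print(url)
print(body)
```

The body would then be POSTed to the URL with a JSON content type using whatever HTTP client you prefer.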

With 3.3 we got:

  • Grouping / Field Collapsing
  • A new, automaton-based suggest/autocomplete implementation offering an order of magnitude smaller RAM consumption
  • KStemFilterFactory, an optimized implementation of a less aggressive stemmer for English
  • Solr defaults to a new, more efficient merge policy (TieredMergePolicy). See Mike’s cool Lucene segment merging video
  • Important bugfixes, including extremely high RAM usage in spellchecking
  • Bugfixes and improvements from Apache Lucene 3.3
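
As a rough illustration of Grouping / Field Collapsing, the sketch below assembles the request parameters for a grouped query. The group, group.field and group.limit parameters are the ones the grouping feature uses; the field name manufacturer, the query text, and the grouped_query_params helper are hypothetical.

```python
from urllib.parse import urlencode


def grouped_query_params(query, group_field, docs_per_group=1, rows=10):
    """Build request parameters that collapse results on one field."""
    return [
        ("q", query),
        ("group", "true"),                    # turn grouping on
        ("group.field", group_field),         # field to collapse on
        ("group.limit", str(docs_per_group)), # documents returned per group
        ("rows", str(rows)),                  # number of groups returned
    ]


params = grouped_query_params("solr", "manufacturer", docs_per_group=3)
query_string = urlencode(params)
print(query_string)
```

Appending this query string to your select handler URL should return up to ten groups with at most three documents each, assuming the field exists in your schema.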

Let’s now look at other interesting stuff. We’ll start with DataImportHandler and its bug fixes. As you’ll notice, there are quite a few of them (and we didn’t even list them all!) so we advise using all available patches.

Already committed features

  • A replication bug fix – “replication reserves commit-point forever if using replicateAfter=startup”. SOLR-2469 brought a fix to version 3.2 and future 4.0 (trunk). This problem caused unnecessary (and huge) buildup in the number of index files on the slaves.
  • A bug fix for DataImportHandler – “DIH does not commit if only Deletes are processed”. When the special commands $deleteDocById and/or $deleteDocByQuery were used and no documents were updated, DIH did not call commit. The fix is available in 3.4 and 4.0.
  • Also – “DataImportHandler multi-threaded option throws exception”. The problem occurred when the threads attribute was used. The fix is available in 3.4 and 4.0. Related to this is another fixed issue, “DIH multi threaded mode does not resolves attributes correctly”, also available in 3.4 and 4.0.
  • The Join feature got committed to the trunk (future 4.0 version). It can also perform cross-core joins now, which can be very useful. However, this feature also initiated some heated discussions, which can be seen in SOLR-2272. The root cause was the fact that this feature was committed only to Solr, while Lucene got none of it. Of course, it might get refactored and included in Lucene too in the future, but this discussion shows the divisions which still existed between the Solr and Lucene communities back then.
  • While we’re talking about the Join feature, it might be worth mentioning a patch in SOLR-2604 which back-ports it to the 3.x version. Be careful though: it was created for version 3.2 more than two months ago, so a few more adjustments might be needed after applying this patch.
  • Function Queries got new if(), exists(), and(), or(), not(), xor() and def() functions. The fix is committed to trunk so you’ll be able to use it in 4.0.
  • As can be seen from the Solr 3.3 announcement, one of the longest living Solr issues is finally closed for good :) . SOLR-236 – Field Collapsing – along with SOLR-2524 finally brings field collapsing to 3_x and future 4.0 versions.
  • Since grouping/field collapsing was added to Solr, we should be able to use faceting in combination with it. Issue SOLR-2665 – Solr Post Group Faceting – brought exactly that to 3.4 and 4.0.
  • Ever wanted to have more control over what gets stored in the cache? SOLR-2429 will bring exactly that starting with the next Solr release – 3.4. It is simple to use, just add cache=false to your queries like this: fq={!frange l=10 u=100 cache=false}mul(popularity,price).  Note that with this new functionality you can prevent either a filter or a query from being cached, while document caching still remains out of request-time control.
  • If you’re using JMX to observe the state of your Solr installation, you might have encountered a problem when reloading Solr cores – it appears that JMX beans didn’t survive those reloads in past versions. A fix has been created and is available in future 3_x and trunk releases.
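
For the cache=false item above, here is a small sketch of how a client could build the uncached filter from the example and URL-encode it. Only the fq value itself comes from the text; the uncached_frange_filter helper and the match-all q are assumptions for illustration.

```python
from urllib.parse import urlencode, parse_qs


def uncached_frange_filter(function, lower, upper):
    """Return an frange filter query with the cache=false local param."""
    return "{!frange l=%s u=%s cache=false}%s" % (lower, upper, function)


fq = uncached_frange_filter("mul(popularity,price)", 10, 100)
# Local params contain braces, spaces and '=' so they must be URL-encoded.
query_string = urlencode({"q": "*:*", "fq": fq})
print(fq)
print(query_string)
```

Building the filter in one place makes it easy to switch caching off only for the expensive, rarely repeated filters while leaving the rest cached.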

Interesting features in development

  • To achieve case-insensitive search with wildcard queries you could use a patch supplied under issue SOLR-2438. It has to be said that this isn’t committed to svn, and it is hard to say whether it ever will be, since there is a similar issue, SOLR-219, on which work started 4 years ago.
  • Multithreaded faceting might bring some performance improvements. At the moment an initial patch exists, but more work is needed and it still isn’t clear how big an improvement we could expect in real-world conditions. Still, it is worth keeping an eye on this issue.
  • We all know that Solr’s Spatial support has its limitations. One of them is specifying a bounding box which isn’t based on point distance, effectively making it limited to a circular shape. Under SOLR-2609 we might get support for exactly this.
  • For anyone interested in which direction Spatial support might evolve, we suggest checking Lucene Spatial Playground. It continues the great work done in SOLR-2155, which provided an extension to the initial GeoSpatial support in Solr by adding multivalued spatial fields. At some point, SOLR-2155 might get the goodness from LSP. Another thing to check would be a thread on Lucene Spatial Future.

Interesting new features

  • Support for Lucene’s Surround Parser is added to Solr in issue SOLR-2703. The patch is already committed to the trunk.
  • Solr will get the ability to use configuration like analyzer type="phrase". Lucene’s Query Parsers recently got a simpler way to use a different analyzer based on the query string. One example is the usage of double quotes, where one can decide that instead of the current meaning in the Lucene/Solr world – specifying a phrase to be searched for – they should have a meaning like in Google’s search engine – find this exact wording. A patch for this exists and can be applied on the trunk (it depends on Lucene trunk).
  • SOLR-2593 aims to provide a new Solr core admin action – ‘split’ – for splitting an index. It would be used when some core gets too big, or in any other case where you might find it necessary.  Lucene already has a similar function.

Miscellaneous

  • Oracle released Java 7 about a month ago, but we advise against using it yet. JVM crashes and index corruption are issues likely to be encountered with it. For more information, visit this URL
  • As anticipated for some time, Java 5 support got axed from Lucene 4.0 (trunk). You can expect similar stuff for Solr too.
  • Solr’s build system has been reworked now. Among other things, this implies changes in directory structure in Solr project. For example, solr/src/ doesn’t exist any more and its old subdirs /java and /test are now in solr/core/. The changes are already applied to the trunk and 3_x which holds the next 3.4 version. For more details, see SOLR-2452.
  • A handy Solr architecture diagram can be found in ML thread
  • Solr’s Admin UI is being refreshed with the work in JIRA issue SOLR-2399 (we already wrote about it) and its spin-off SOLR-2667. Some of this stuff is already committed (on the trunk), so you may want to inspect the changes. More details can be found in the wiki, where you can also get a sneak peek of the upcoming changes.

And that would be all for part one of the Solr Spring-Summer 2011 Digest edition from @sematext. Part two of the Spring-Summer Digest is coming in a few days – stay tuned!

Opening: HBase and Lucene / Solr / Elastic Search Developer

We are once again looking for smart people.  This time we are looking to hire a person who likes working with HBase and Lucene (or Solr or ElasticSearch).  This particular combination is important to us because the very first target for this person might be the integration of HBase and Lucene / Solr / ElasticSearch.  More specifically, we have our eyes on HBASE-3529, which we closely examined during a recent HBase Hackathon that took place after BerlinBuzzwords.  Of course, we are also open to alternative approaches if the one taken in HBASE-3529 turns out to be problematic.  The work around the marriage of HBase and full-text search is to be done “in the open”, meaning in collaboration with HBase as well as Lucene, Solr, or Elastic Search developers, which makes this project that much more exciting.

Beyond HBase and search integration, we do other interesting stuff with HBase (and Flume and MapReduce and …), so this person would get to work on our Search Analytics and Scalable Performance Monitoring services.

Interested?  Please get in touch and see what else we like on our jobs page.

Search Analytics – Video Interview with Otis Gospodnetić

“I’m shocked companies aren’t using these tools.”

This video interview about Search Analytics is from Techilicious by Josette Rigsby: http://techielicous.com/2011/06/04/search-and-analytics/

“…we had a chance to speak with Otis Gospodnetić, co-author of Lucene in Action and Founder of Sematext regarding search analytics and searching big data.”

Enterprise search is growing in importance along with data sizes; there is simply too much content to locate without the aid of a search tool. But are users really finding what they need? Unfortunately, many companies cannot answer that question. Gospodnetić advised that organizations should be collecting at least a minimum set of metrics about search performance and user behavior. However, the majority are not.

Unlike click stream analysis, search analytics provides insight into how users are actually using search – the actual terms they specify – instead of just what they clicked. Key metrics organizations should collect on an on-going basis include:

  • Search failure (zero results)
  • Low click-through rate
  • Most popular searches (words and phrases)
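
As a rough sketch of how the three metrics above could be derived from a query log, consider the following. The log format (one record per search with query, hits and clicks fields), the sample data, and the search_metrics helper are hypothetical, not taken from any particular analytics tool.

```python
from collections import Counter


def search_metrics(log, top_n=3):
    """Compute zero-result rate, click-through rate and top queries.

    Each log entry is assumed to be a dict with 'query' (str),
    'hits' (number of results) and 'clicks' (results clicked).
    """
    total = len(log)
    if total == 0:
        return {"zero_result_rate": 0.0, "click_through_rate": 0.0,
                "top_queries": []}
    zero_results = sum(1 for entry in log if entry["hits"] == 0)
    with_clicks = sum(1 for entry in log if entry["clicks"] > 0)
    # Normalize case and whitespace so variants count as one query.
    counts = Counter(entry["query"].strip().lower() for entry in log)
    return {
        "zero_result_rate": zero_results / total,
        "click_through_rate": with_clicks / total,
        "top_queries": counts.most_common(top_n),
    }


sample_log = [
    {"query": "solr cloud", "hits": 12, "clicks": 1},
    {"query": "Solr Cloud", "hits": 12, "clicks": 0},
    {"query": "solandra", "hits": 0, "clicks": 0},
    {"query": "hbase", "hits": 40, "clicks": 2},
]
metrics = search_metrics(sample_log, top_n=1)
print(metrics)
```

On this sample, a quarter of the searches fail outright and half lead to a click, which is the kind of baseline worth tracking over time.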

Once the metrics are collected, organizations should analyze the data to improve the search experience. For example, if a significant percentage of queries are failing, organizations can use the data from search analytics to find out why. Is it due to misspellings? Are there synonyms? Gospodnetić said,

“I’m shocked companies aren’t using these tools.”

For more information on this topic read about Sematext Search Analytics service.

Training: Solr Performance Tuning and Monitoring

Quick announcement!

In addition to presenting at Open Source Search Conference in June, we’ll also be doing a super-cheap half-day training on Solr Performance Tuning & Monitoring.  You can sign up here.

In this tutorial you will learn how to squeeze the most performance out of your Solr cluster. We’ll cover performance at both indexing and query time; dealing with large volumes of data versus high query rates, the combination of the two; and various index sharding architectures possible to gain on search performance, in multi-data center setups, etc. We’ll cover an array of best practices, tips and tricks we regularly use in our engagements with clients, from various configuration settings to querying efficiently, all of which one should employ to get the most out of Solr. You will also learn how to monitor your Solr cluster’s performance with command-line tools and a visual monitoring solution specifically designed for Solr performance monitoring.

Prerequisites:

Basic knowledge of Solr, its configuration and setup.

Details:

  • Cost: $100
  • When: June 14, 2011, 9:00 a.m.-1:00 p.m.
  • Bonus: Lunch will be provided.
  • Register here

If you are interested in Solr Performance Monitoring, please read about Sematext Scalable Performance Monitoring service.
