Lucene & Solr Year 2011 in Review

The year 2011 is coming to an end and it’s time to reflect on the past 12 months.  Without further fluff, let’s look back and summarize all significant events that happened in the Lucene and Solr world over the course of the last dozen months. In the next few paragraphs we’ll go over major changes in Lucene and Solr, new blood, relevant conferences and books.

We should start by pointing out that this year Apache Lucene celebrated its 10-year anniversary as an Apache Software Foundation project.  Lucene itself is actually over 10 years old.  Otis is one of the very few people from the early years who is still around.  While we didn’t celebrate any Solr anniversaries this year, we should note that Solr, too, has been around for quite a while and is in fact approaching its 6th year at the ASF!

This year saw numerous changes and additions both in Lucene and Solr.  As a matter of fact, we’d venture to say we saw more changes in Lucene & Solr this year than in any one year before.  In that sense, both projects are very much like wine – getting better with time. Let’s take a look at a few of the most significant changes in 2011.

The much anticipated Near Real-Time search (NRT) functionality has arrived.  What this means is that documents that were just added to a Lucene/Solr index can immediately be made visible in search results.  This is big!  Of course, work on NRT is still in progress, but NRT is ready and you, like a number of our clients, should start using it.

Field Collapsing was one of the most watched and voted-for JIRA issues for many months.  This functionality was implemented this year and now Lucene and Solr users can group results on the basis of a field or a query. In addition, you can control the groups and even calculate facets on the groups rather than on single documents. A rather powerful feature.
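To make the grouping idea a bit more concrete, here is a minimal sketch of how a grouped Solr request might be put together. The `group`, `group.field`, `group.limit` and `rows` request parameters are Solr's result grouping parameters; the host, the query and the `manufacturer` field are assumptions made up for this illustration, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def grouped_query_url(base_url, query, group_field,
                      groups_per_page=10, docs_per_group=3):
    """Build a Solr select URL that groups results by one field.

    base_url and group_field are illustrative assumptions; adjust them
    to your own Solr instance and schema.
    """
    params = [
        ("q", query),
        ("group", "true"),               # switch result grouping on
        ("group.field", group_field),    # field whose values define the groups
        ("group.limit", docs_per_group), # documents returned per group
        ("rows", groups_per_page),       # with grouping on, rows counts groups
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)


url = grouped_query_url("http://localhost:8983/solr/select",
                        "ipod", "manufacturer")
print(url)
```

Note that with grouping turned on, `rows` limits the number of groups rather than the number of documents, which is why the two limits are kept as separate arguments.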

From Lucene users’ perspective it is also worth noting that Lucene finally got a faceting module.  Until now, faceting was available only in Solr.  If you are a pure Lucene user, you no longer need Solr to calculate facets.

In the past, modeling parent-child relationships in Lucene and Solr indices was not really possible – one had to flatten everything.  No longer – if you need to model a parent-child relationship in your index you can use the Join contrib module.  This Join functionality lets you join parent and child documents at query-time, while relying on some assumptions about how documents were indexed.

Good and broad language support is hugely important for any search solution and this year was good for Lucene and Solr in that department: the KStemFilter English stemmer was added, full Unicode 4 support was added, new Japanese and Chinese support was added, a new stemmer-protection mechanism was added, work on reducing the synonym filter’s RAM consumption was done, etc.  Another big addition was integration with Hunspell, which enables language-specific processing for all languages supported by Open Office.  That’s a lot of new languages we can now handle with Lucene and Solr! There is more.

Lucene 3.5.0 significantly reduced the term dictionary memory footprint. Big!  Lucene now uses 3 to 5 times less memory when dealing with the terms dictionary, so it consumes even less RAM.

If you use Lucene and need to page through a lot of results, you may run into problems. That’s why the searchAfter method was introduced in Lucene 3.5.0, which solves the deep paging problem once and for all!
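The idea behind search-after paging can be shown without Lucene at all. The sketch below is plain Python, not Lucene code: instead of collecting `offset + n` hits and throwing away the first `offset`, each page only collects the `n` best hits that rank strictly after the last hit of the previous page. The `(score, doc_id)` tuples and the `top_n` helper are hypothetical names used only for this illustration; the ordering (higher score first, ties broken by lower document id) is an assumption modeled on a relevance sort.

```python
import heapq


def top_n(hits, n, after=None):
    """Return the n best hits that rank strictly after `after`.

    hits  -- iterable of (score, doc_id) tuples
    after -- last hit of the previous page, or None for the first page
    """
    def rank_key(hit):
        score, doc_id = hit
        return (-score, doc_id)  # higher score first, then lower doc id

    if after is None:
        candidates = hits
    else:
        # Skip everything that was already returned on earlier pages.
        candidates = (h for h in hits if rank_key(h) > rank_key(after))
    # Only n hits are ever kept in memory, however deep the page is.
    return heapq.nsmallest(n, candidates, key=rank_key)


hits = [(0.9, 1), (0.4, 2), (0.9, 3), (0.7, 4), (0.1, 5)]
page1 = top_n(hits, 2)
page2 = top_n(hits, 2, after=page1[-1])
page3 = top_n(hits, 2, after=page2[-1])
print(page1, page2, page3)
```

The cost per page stays proportional to the page size instead of growing with the page number, which is the property that makes deep paging cheap.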

There is also a new, fast and reliable Term Vector-based highlighter that both Lucene and Solr can use.

Dismax is great, but the Extended Dismax query parser added to Solr is even better – it extends the Dismax query parser functionality and can further improve the quality of search results.

You can now also sort by function (imagine sorting the results by distance from a point), and there is a new spatial search with filtering.
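As a rough illustration of sorting by distance combined with a spatial filter, the sketch below builds a Solr request that uses the `geodist()` function for sorting and the `{!geofilt}` filter with the `sfield`, `pt` and `d` parameters. The host, the `store` location field, the coordinates and the helper name are assumptions chosen for the example, not values from any real deployment.

```python
from urllib.parse import urlencode


def nearby_query_url(base_url, query, lat, lon, radius_km,
                     location_field="store"):
    """Build a Solr select URL filtered and sorted by distance from a point.

    location_field is an assumed name for a location-typed field in the schema.
    """
    params = [
        ("q", query),
        ("sfield", location_field),  # spatial field used by geofilt and geodist
        ("pt", f"{lat},{lon}"),      # the point to measure distance from
        ("d", radius_km),            # filter radius in kilometers
        ("fq", "{!geofilt}"),        # keep only documents within d of pt
        ("sort", "geodist() asc"),   # closest documents first
    ]
    return base_url + "?" + urlencode(params)


url = nearby_query_url("http://localhost:8983/solr/select",
                       "*:*", 45.15, -93.85, 5)
print(url)
```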

Solr also got the new suggest/autocomplete functionality based on FST automaton which significantly reduced the memory needed for such functionality.  If you need this for your search application, have a look at Sematext’s AutoComplete – it has additional functionality that lots of our customers like.

While not yet officially released, the new transaction log support provides Solr with a real-time get operation – as soon as you add a document you can retrieve it by ID.  This will also be used for recovering nodes in SolrCloud.

And talking about SolrCloud…  We’ve covered SolrCloud on this blog before in Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search, and we’ll be covering it again soon.  In short, SolrCloud will make it easier for people to operate larger Solr clusters by making use of more modern design principles and software components such as ZooKeeper, that make creation of distributed, cluster-based software/services easier.  Some of the core functionality is that there will be no single point of failure, any node will be able to handle any operation, there will be no traditional master-slave setup, there will be centralized cluster management and configuration, failovers will be automatic and in general things will be much more dynamic.  SolrCloud has not been released yet, but Solr developers are working on it and the codebase is seeing good progress.  We’ve used SolrCloud in a few recent engagements with our customers and were pleased by what we saw.

After the development of the two projects was merged back in 2010, we saw a speed-up in development and releases. Lucene and Solr committers released five(!) new versions of both projects this year! In March, Lucene and Solr 3.1 were released with Unicode 4 support, ReusableTokenStream, Spatial search, Vector-based Highlighter, Extended Dismax parser, and many more features and bug fixes. Then, after less than 3 months(!), on June 4th, version 3.2 was released. This release introduced a new and much desired results grouping module, NRTCachingDirectory, and highlighting performance improvements. Just one month later, on July 1st, Lucene and Solr 3.3 were released. That release included the KStem stemmer, new implementations of Spellchecker, Field Collapsing in Solr and RAM usage reduction for the autocomplete mechanism. By the end of summer there was another release, this time version 3.4, released on the 14th of September. Pure Lucene users got what Solr had been able to do for a very long time – the long-awaited faceting module contributed by IBM. Version 3.4 also included the new Join functionality, the ability to turn off query and filter caches, and faceting calculation for Field Collapsing. The last release of Lucene and Solr saw the light of day in late November. Version 3.5.0 brought a huge memory reduction when dealing with term dictionaries, deep paging support, the SearcherManager and SearcherLifetimeManager classes, language identification provided by Tika, and sortMissingFirst and sortMissingLast support for TrieFields.

During the last 12 months we attended three major conferences focused on search and big data themes. Lucene Revolution took place in San Francisco in May. Otis gave a talk titled “Search Analytics: What? Why? How?” (slides) during the first day. There were a number of other good talks there and the complete conference agenda is available at http://lucenerevolution.com/2011/agenda. Some videos are available as well. Next came the Berlin Buzzwords conference, a more grass-roots conference which took place between the 4th and 10th of June. Otis gave an updated version of his “Search Analytics: What? Why? How?” talk. If you want to know more, check the official conference site – http://berlinbuzzwords.de. The last conference, focused exclusively on Lucene and Solr, was Lucene Eurocon 2011 in sunny and tourist-filled Barcelona between the 17th and 20th of October. And guess what – we were there again (surprise!), this time in slightly larger numbers. Otis gave a talk about “Search Analytics: Business Value & BigData NoSQL Backend” (video, slides) and Rafał gave a talk on a pretty popular topic - “Explaining & Visualizing Solr ‘explain’ information” (video, slides).

No open source project can endure without regular injections of new blood. This year, the Lucene and Solr development team was joined by a number of new people whose names may look familiar to you:

These 7 men are now Lucene and Solr committers and we look forward to our next year’s Year in Review post, where we hope to go over the good things these people will have brought to Lucene and Solr in 2012.

You know an open source project is successful when a whole book is dedicated to it.  You know a project is very successful when more than one book and more than one publisher cover it.  There were no new editions of Lucene in Action (amazon, manning) this year, but our own Rafał Kuć published his Solr 3.1 Cookbook (amazon) in July.  Rafał’s cookbook includes a number of recipes that can make your life easier when it comes to solving common problems with Apache Solr. Another book, Apache Solr 3 Enterprise Search Server (amazon) by David Smiley and Eric Pugh was published in November. This is a major update to the first edition of the book and it covers a wide range of functionalities of Apache Solr.

@sematext

Solr Performance Monitoring with SPM

Originally delivered as a Lightning Talk at Lucene Eurocon 2011 in Barcelona, this quick presentation shows how to use Sematext’s SPM service, currently free to use for unlimited time, to monitor Solr, OS, JVM, and more.

We built SPM because we wanted to have a good and easy to use tool to help us with Solr performance tuning during engagements with our numerous Solr customers.  We hope you find our Scalable Performance Monitoring service useful!  Please let us know if you have any sort of feedback, from SPM functionality and usability to its speed.  Enjoy!

AutoComplete with Suggestion Groups

While Otis is talking about our new Search Analytics (it’s open and free now!) and Scalable Performance Monitoring (it’s also open and free now!) services at Lucene Eurocon in Barcelona, Pascal, one of the new members of our multinational team at Sematext, is working on making improvements to our search-lucene.com and search-hadoop.com sites.  One of the recent improvements is to the AutoComplete functionality we have there.  If you’ve used these sites before, you may have noticed that the AutoComplete there now groups suggestions.  In the screen capture below you can see several suggestion groups divided with pink lines.  Suggestions can be grouped by any criterion, and here we have them grouped by the source of suggestions.  The very first suggestion is from “mail # general”, which is our name for the “general” mailing list that some of the projects we index have.  The next two suggestions are from “mail # user”, followed by two suggestions from “mail # dev”, and so on.  On the left side of each suggestion you can see icons that signify the type of suggestion and help people more quickly focus on the types of suggestions they are after.

[Screen capture: Sematext AutoComplete Suggestion Grouping]

The other interesting thing you can see in this AutoComplete implementation is a custom footer, which allows one to show any sort of messaging or even advertising to the end user.  We should also point out that one can use Sematext AutoComplete to embed custom suggestions, such as ads, and have them show up in the list of suggestions only when their (meta) data matches the input, yet displayed differently from the rest of the suggestions in order to make it clear they are ads and not ordinary suggestions.

We hope you find this functionality useful.  Please let us know what you think, and if you have other suggestions for how to improve either AutoComplete or search-lucene.com and search-hadoop.com, tell us – we do listen!

The State of Solandra – Summer 2011

A little over 18 months ago we talked to Jake Luciani about Lucandra – a Cassandra-based Lucene backend.  Since then Jake has moved away from raw Lucene and married Cassandra with Solr, which is why Lucandra now goes by Solandra.  Let’s see what Jake and Solandra are up to these days.

 

What is the current status of Solandra in terms of features and stability?

Solandra has gone through a few iterations. First as Lucandra, which partitioned data by terms and used thrift to communicate with Cassandra.  This worked for a few big use cases, mainly how to manage an index per user, and garnered a number of adopters.  But it performed poorly when you had very large indexes with many dense terms, due to the number and size of remote calls needed to fulfill a query.

Last summer I started off on a new approach based on Solr that would address Lucandra’s shortcomings: Solandra.  The core idea of Solandra is to use Cassandra as a foundation for scaling Solr.  It achieves this by embedding Solr in the Cassandra runtime and using the Cassandra routing layer to auto shard an index across the ring (by document).  This means good random distribution of data for writes (using Cassandra’s RandomPartitioner) and good search performance since individual shards can be searched in parallel across nodes (using SolrDistributedSearch).  Cassandra is responsible for sharding, replication, failover and compaction.  The end user now gets a single scalable component for search without changing APIs, which will scale in the background for them.  Since search functionality is performed by Solr, it will support anything Solr does.

I gave a talk recently on Solandra and how it works: http://blip.tv/datastax/scaling-solr-with-cassandra-5491642

Are you still the sole developer of Solandra?  How much time do you spend on Solandra?
Have there been any external contributions to Solandra?

I still am responsible for the majority of the code, however the Solandra community is quite large with over 500 github followers and 60 forks.  I receive many useful bug reports and patches through the community.  Late last year I started working at DataStax (formerly Riptano) to focus on Apache Cassandra.   DataStax is building a suite of products and services to help customers use Cassandra in production and incorporate Cassandra into existing enterprise infrastructure.  Solandra is a great example of this. We currently have a number of customers using Solandra and we encourage people interested in using Solandra to reach out to us for support.

What are the most notable differences between Solandra and Solr?

The primary difference is the ability to grow your Solr infrastructure seamlessly using Cassandra. I purposely want to avoid altering the Solr functionality since the primary goal here is to make it easy for users to migrate to and from Solandra and Solr.   That being said, Solandra does offer some unique features regarding managing millions of indexes.  One is that different Solr schemas can be injected at runtime using a RESTful interface; another is that Solandra supports the concept of virtual Solr Cores which share the same core but are treated as different indexes. For example, if you have a core called “inbox” you can create an index per user like “inbox.otis” or “inbox.jake” just by changing the endpoint URL.

Finally, Solandra has a bulk loading interface that makes it easy to index large amounts of data at a time (one known cluster indexes at ~4-5MB of text per second).

What are the most notable differences between Solandra and Elastic Search?

ElasticSearch is more mature and offers a similar architecture for scaling search though not based on Cassandra or Solr.  I think ElasticSearch’s main weakness is it requires users to scrap their existing code and tools to use it.  On the other hand, Solandra provides a scalable platform built on Solr and lets you grow with it.

Solandra doesn’t use the Lucene index file format so it will grow to support millions of indexes. Systems like Solr and ElasticSearch create a directory per index, which makes managing millions of indexes very hard. The flipside is there are a lot of performance tweaks lost by not using the native file format; most of the current work on Solandra relates to improving single node performance.

Solandra is a single component that gives you search AND NoSQL database, and is therefore much easier to manage from the operations perspective IMO.

What do you plan on adding to Solandra that will make it clearly stand out from Solr or Elastic Search?

Solandra will continue to grow with Solr (4.0 will be out in the future), as well as with Cassandra. Right now Solandra’s real-time search is limited by the performance of Solr’s field cache implementation. By incorporating Cassandra triggers I think we can remove this bottleneck and get really impressive real-time performance at scale, due to how Solandra pre-allocates shards.

Also, since the Solr index is stored in the Cassandra datamodel, you can now apply some really interesting features of Cassandra to Solr, such as expiring indexes and triggered searches.

When should one use Solandra?

If you say yes to any of the following you should use Solandra:

  • I have more data than can fit on a single box
  • I have potentially millions of indexes
  • I need improved indexing with multi-master writes
  • I need multi-datacenter search
  • I am already using Cassandra and Solr
  • I am having trouble managing my Solr cluster

When should one not use Solandra?

If you are happy with your Solr system today and you have enough capacity to scale the size and number of indexes comfortably then there is no need to use Solandra.  Also, Solandra is under active development so you should be prepared to help diagnose unknown issues.  Also note that if you require search features that are currently not supported by Solr distributed search, you should not use Solandra.

Are there known problems with Solandra that users should be aware of?

Yes, currently the index sizes can be much larger in Solandra than Solr (in some cases 10x). This is due to how Solandra indexes data as well as Cassandra’s file format. Cassandra 1.0 includes compression so that will help quite a bit.

Also, since consistency in Solandra is tunable, it requires your application to consider the implications of writing data at lower consistencies.

Finally, one thing that keeps coming up quite often is users assuming Solandra auto indexes the data you already have in Cassandra, since Solandra builds on Cassandra.  This is not the case.  Data must be indexed and searched through the traditional Solr APIs.

Is anyone using Solandra in production? What is the biggest production deployment in terms of # docs, data size on filesystem, query rate?

Solandra is now in production with a handful of users I know of.  Some others are in the testing/pre-production stage. But it’s still a small number AFAIK.

The largest Solandra cluster I know of is in the order of ~5 nodes, ~10TB of data with ~100k indexes and ~2B documents.

If you had to do it all over, what would you do differently?

I’m really excited with the way Lucandra/Solandra has evolved over the past year. It’s been a great learning experience and has allowed me to work with technologies and people I’m really, really excited about. I don’t think I’d change a thing; great software takes time.

When is Solandra 1.0 coming out and what is the functionality/issues that remain to be implemented before 1.0?

I don’t really use the 1.0 moniker as people tend to assume too much when they read that. I think once Solandra is fully documented, supports things like Cassandra based triggers for indexing and search, and has an improved on disk format, I’d be comfortable calling Solandra 0.9 :)

Thank you Jake.  We are looking forward to Solandra 0.9 then.

Opening: HBase and Lucene / Solr / Elastic Search Developer

We are once again looking for smart people.  This time we are looking to hire a person who likes working with HBase and Lucene (or Solr or ElasticSearch).  This particular combination is important to us because the very first target for this person might be the integration of HBase and Lucene / Solr / ElasticSearch.  More specifically, we have our eyes on HBASE-3529, which we closely examined during a recent HBase Hackathon that took place after BerlinBuzzwords.  Of course, we are also open to alternative approaches if the one taken in HBASE-3529 turns out to be problematic.  The work around the marriage of HBase and full-text search is to be done “in the open”, meaning in collaboration with HBase as well as Lucene, Solr, or Elastic Search developers, which makes this project that much more exciting.

Beyond HBase and search integration, we do other interesting stuff with HBase (and Flume and MapReduce and …), so this person would get to work on our Search Analytics and Scalable Performance Monitoring services.

Interested?  Please get in touch and see what else we like on our jobs page.

Search Analytics – Video Interview with Otis Gospodnetić

“I’m shocked companies aren’t using these tools.”

This video interview about Search Analytics is from Techilicious by Josette Rigsby: http://techielicous.com/2011/06/04/search-and-analytics/

“…we had a chance to speak with Otis Gospodnetić, co-author of Lucene in Action and Founder of Sematext regarding search analytics and searching big data.”

Enterprise search is growing in importance along with data sizes; there is simply too much content to locate without the aid of a search tool. But are users really finding what they need? Unfortunately, many companies cannot answer that question. Gospodnetić advised that organizations should be collecting at least a minimum set of metrics about search performance and user behavior. However, the majority are not.

Unlike click stream analysis, search analytics provides insight into how users are actually using search – the actual terms they specify – instead of just what they clicked. Key metrics organizations should collect on an on-going basis include:

  • Search failure (zero results)
  • Low click-through rate
  • Most popular searches (words and phrases)
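As a minimal sketch of how these three metrics could be derived from a query log, consider the snippet below. The log format (one dictionary per search with `query`, `hits` and `clicks` keys) and the function name are assumptions invented for this example; this is not how the Sematext Search Analytics service is implemented.

```python
from collections import Counter


def search_metrics(log, top_n=3):
    """Compute basic search analytics from a list of search log entries.

    Each entry is assumed to look like:
    {"query": str, "hits": int (results returned), "clicks": int}
    """
    total = len(log)
    zero_results = sum(1 for entry in log if entry["hits"] == 0)
    with_click = sum(1 for entry in log if entry["clicks"] > 0)
    # Normalize case and whitespace so trivial variants count together.
    popular = Counter(entry["query"].strip().lower() for entry in log)
    return {
        "zero_result_rate": zero_results / total if total else 0.0,
        "click_through_rate": with_click / total if total else 0.0,
        "popular_queries": popular.most_common(top_n),
    }


log = [
    {"query": "solr faceting", "hits": 12, "clicks": 1},
    {"query": "Solr Faceting", "hits": 12, "clicks": 0},
    {"query": "lucene serchAfter", "hits": 0, "clicks": 0},
    {"query": "hbase", "hits": 40, "clicks": 2},
]
metrics = search_metrics(log, top_n=1)
print(metrics)
```

In this made-up log the misspelled query is the one that returns zero results, which is the kind of signal that points toward spell checking or synonyms.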

Once the metrics are collected, organizations should analyze the data to improve the search experience. For example, if a significant percentage of queries are failing, organizations can use the data from search analytics to find out why. Is it due to misspellings? Are there synonyms? Gospodnetić said,

“I’m shocked companies aren’t using these tools.”

For more information on this topic read about Sematext Search Analytics service.

Sematext at Lucene Revolution 2011

Last year at Lucene Revolution in October in Boston, we shared how we built search-lucene.com and search-hadoop.com.  In May of this year, we’ll again be talking at Lucene Revolution about another topic very dear to us at Sematext – Search Analytics (abstract).  The full conference agenda is available.  Start picking sessions to attend.

This year’s Lucene Revolution is extra interesting because Sematext is also sponsoring the conference.  In addition to that, it’s great to see that a couple of our customers will be presenting this year!

If you are coming to the conference don’t be afraid to say hello.  And if San Francisco is too far this year and you are on the east coast of the US in mid-June, you can also catch us at the Open Source Search Conference.  And if you are in Europe, you’ll see us there in June of this year, too.  Until then, so long from @sematext.

Wanted: Devops to run Search-Hadoop.com and Search-Lucene.com

If you are dreaming about working on search, big data, analytics, data mining, and machine learning, and are a positive, proactive, independent devops creature, inquire within!

We are a small and highly distributed team who likes to eat a little bit of everything: search for breakfast, mapreduce for lunch, and bigtable for dinner.  We are looking for a part-time-to-grow-into-full-time devops to work on the popular search-hadoop.com and search-lucene.com sites and take them to the next level. As such, you’ll need to be on top of Lucene, Solr, and Elastic Search.  Similarly, you must be completely at $HOME on the UNIX command line.  Working knowledge of Mahout or statistics/machine learning/data mining background would be a major plus, but is not required.  Experience with productive web frameworks and slick modern front-end frameworks is another plus, as is familiarity with EC2 and EBS.

More about the ideal you:

  • You are well organized, disciplined, and efficient
  • You don’t wait to be told what to do and don’t need hand-holding
  • You are reliable, friendly, have a positive attitude, and don’t act like a prima donna
  • You have an eye for detail – no sloppy code, no poor spelelling and typous
  • You are able to communicate complex ideas in a clear fashion in English (or pretty diagrams)
  • You have experience with (large scale) search or data analysis
  • You like to write about technologies relevant to what we do
  • You are an open-source software contributor

Not all of the above are required, of course – the closer the match, the higher the relevance score, that’s all.

Interested?  Please get in touch.

Sematext at Open Source Search Conference 2011

We are happy to report that we have been selected to present at the upcoming Open Source Search Conference 2011 in McLean, Virginia, this coming June.  We’ll be talking about Search Analytics.

Search Analytics: What? Why? How?

Search is increasingly the primary information access mechanism, so knowing how your search is doing often has direct business impact. You’ve indexed your data and people are searching it. But how do you know if they are happy with the results? How do you know if they are finding what they need?  Regardless of whether you are using Solr, Lucene, or some other search solution, you should be paying attention to what your users are searching for and clicking on – through those actions they are telling you what you can do to improve your search.

In this talk we’ll cover search analytics and how it can be used to answer questions like:

  • Are too many users getting the dreaded “no matches” results?
  • How deep into search results do people dig?
  • Which hits are they clicking on, or what percentage of them don’t click on any hits?
  • How much do they use the “Did You Mean” or “Auto-Complete” suggestions?

We’ll explore what specific search analytics reports tell us and what specific actions you should take based on those reports.
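To give a feel for the click-related questions above, here is a small sketch that reports how deep users click and what share of searches end without any click. The input format (a list of searches, each carrying the 1-based ranks of the hits that were clicked) and all names are assumptions made for this illustration only.

```python
from collections import Counter
from statistics import median


def click_depth_report(searches):
    """Summarize click depth for a list of searches.

    Each search is assumed to look like:
    {"query": str, "clicked_ranks": [1-based rank of every clicked hit]}
    """
    ranks = [rank for s in searches for rank in s["clicked_ranks"]]
    without_click = sum(1 for s in searches if not s["clicked_ranks"])
    return {
        "median_click_rank": median(ranks) if ranks else None,
        "clicks_by_rank": sorted(Counter(ranks).items()),
        "no_click_share": without_click / len(searches) if searches else 0.0,
    }


searches = [
    {"query": "solr join", "clicked_ranks": [1]},
    {"query": "nrt search", "clicked_ranks": [1, 3]},
    {"query": "solandra", "clicked_ranks": []},
    {"query": "hunspell", "clicked_ranks": [12]},
]
report = click_depth_report(searches)
print(report)
```

A median click rank that drifts upward, or a growing share of searches without clicks, is a hint that relevance needs attention.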

You can browse through other presentations, too.  If you will be attending the conference, please do not hesitate to tap Otis’ shoulder.  We’d also be happy to talk business if you think your organization may benefit from one of Sematext’s products or services or if you simply want to chat.  To keep up with us, follow @sematext on Twitter.

Google Summer of Code and Intern Sponsoring

Are you a student and looking to do some fun and rewarding coding this summer? Then join us for the 2011 Google Summer of Code!

The application deadline is in less than a month! Lucene has identified initial potential projects, but this doesn’t mean you can’t also pick your own.  If you need additional ideas, look at our Lucene / Solr for Academia: PhD Thesis Ideas (or just the spreadsheet if you don’t want to read the what and the why); just be sure to discuss with the community first (send an email to dev@lucene.apache.org).

We should also add that, separately from GSoC, Sematext would be happy to sponsor good students and interns interested in work on projects involving search (Lucene, Solr), machine learning & analytics (Mahout), big data (Hadoop, HBase, Hive, Pig, Cassandra), and related areas. We are a virtual and geographically distributed organization whose members are spread over several countries and continents and we welcome students from all across the globe.  For more information please inquire within.
