Solr Digest, Spring-Summer 2011, Part 1

No, Solr Digests are not dead, we’ve just been crazily busy at Sematext (yes, we are hiring!). Since our last Solr Digest, not one but two new Solr releases have been made, 3.2 in June and 3.3 in July, and version 3.4 is imminent – voting is already in progress, so you can expect a new release pretty soon. Also, there were a number of interesting developments on the trunk (future 3.x and 4.0 versions). Therefore, we will be publishing two Solr Digests this time. This first Digest covers general developments in the Solr world, while the sequel will be more focused on two features drawing a lot of attention: Solr Cloud and Near Real Time search.

Let’s get started with a short overview of what was announced in 3.2 and 3.3. First, 3.2 brought us:

  • Ability to specify overwrite and commitWithin as request parameters when using the JSON update format
  • TermQParserPlugin, useful when generating filter queries from terms returned by field faceting or terms component
  • DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString
  • Improvements to the UIMA and Carrot2 integrations
  • Highlighting performance improvements
  • A test-framework jar for easy testing of Solr extensions
  • Bugfixes and improvements from Apache Lucene 3.2
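To make the TermQParserPlugin item above more concrete, here is a minimal sketch of turning a facet value into a filter query with the {!term f=...} local-params syntax. The field name, the facet value and the helper function are made up for illustration; only the {!term f=field}value form comes from the parser itself.

```python
from urllib.parse import urlencode


def term_filter(field, value):
    """Build an fq value that treats `value` as a single term of `field`.

    The value is not run through the regular query syntax, so facet values
    containing spaces or colons can be used as they were returned.
    """
    return "{!term f=%s}%s" % (field, value)


# Hypothetical facet value returned for a hypothetical "category" field.
params = [("q", "*:*"), ("fq", term_filter("category", "sci-fi: classics"))]

# URL-encoded query string, ready to append to a select handler URL.
query_string = urlencode(params)
print(query_string)
```

The point of the sketch is that the facet value is passed along as-is, without hand-written escaping of query syntax characters.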

With 3.3 we got:

  • Grouping / Field Collapsing
  • A new, automaton-based suggest/autocomplete implementation offering an order of magnitude smaller RAM consumption
  • KStemFilterFactory, an optimized implementation of a less aggressive stemmer for English
  • Solr defaults to a new, more efficient merge policy (TieredMergePolicy). See Mike’s cool Lucene segment merging video
  • Important bugfixes, including extremely high RAM usage in spellchecking
  • Bugfixes and improvements from Apache Lucene 3.3
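For readers who have not tried Grouping / Field Collapsing yet, the sketch below shows roughly what a grouped request looks like. The group, group.field and group.limit parameters are the standard grouping parameters; the query, the field name and the helper function are assumptions made up for this example.

```python
from urllib.parse import urlencode


def grouped_query(q, group_field, per_group=1):
    """Return a query string that groups results by `group_field`.

    `per_group` is the number of documents returned inside each group.
    """
    return urlencode({
        "q": q,
        "group": "true",
        "group.field": group_field,
        "group.limit": per_group,
    })


# Hypothetical query and field name, for illustration only.
print(grouped_query("ipod", "manu_exact", per_group=2))
```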

Let’s now look at other interesting stuff. We’ll start with DataImportHandler and its bug fixes. As you’ll notice, there are quite a few of them (and we didn’t even list them all!) so we advise using all available patches.

Already committed features

  • A bug-fix for DataImportHandler – “replication reserves commit-point forever if using replicateAfter=startup”. SOLR-2469 brought a fix to version 3.2 and future 4.0 (trunk). This problem caused unnecessary (and huge) buildup in the number of index files on the slaves.
  • Another bug-fix for DataImportHandler – DIH does not commit if only Deletes are processed. When using special commands $deleteDocById and/or $deleteDocByQuery, when there were no updates of documents, commit wasn’t called by the DIH. Fix is available in 3.4 and 4.0.
  • Also – DataImportHandler multi-threaded option throws exception. The problem would happen when the threads attribute was used. The fix for this is available in 3.4 and 4.0. Related to this is another fixed issue – DIH multi threaded mode does not resolves attributes correctly – which is also available in 3.4 and 4.0.
  • Join feature got committed to the trunk (future 4.0 version). It can also perform cross-core joins now, which can be very useful. However, this feature also initiated some heated discussions which can be seen in SOLR-2272. The root cause was the fact that this feature was committed only to Solr while Lucene got none of it. Of course, it might get refactored and included in Lucene too in the future, but this discussion shows the divisions which still existed between Solr and Lucene communities back then.
  • While we’re talking about Join feature, it might be worth mentioning a patch in SOLR-2604 which back-ports it to 3.x version. Be careful though, it was created for version 3.2 more than two months ago, so a few more adjustments after applying this patch might be needed.
  • Function Queries got new if(), exists(), and(), or(), not(), xor() and def() functions. The fix is committed to trunk so you’ll be able to use it in 4.0.
  • As can be seen from the Solr 3.3 announcement, one of the longest-living Solr issues is finally closed for good :) . SOLR-236 – Field Collapsing – along with SOLR-2524 finally brings field collapsing to 3_x and future 4.0 versions.
  • Since grouping/field collapsing was added to Solr, we should be able to use faceting in combination with it. Issue SOLR-2665 – Solr Post Group Faceting – brought exactly that to 3.4 and 4.0.
  • Ever wanted to have more control over what gets stored in the cache? SOLR-2429 will bring exactly that starting with the next Solr release – 3.4. It is simple to use, just add cache=false to your queries like this: fq={!frange l=10 u=100 cache=false}mul(popularity,price). Note that with this new functionality you can prevent either a filter or a query from being cached, while document caching still remains out of request-time control.
  • If you’re using JMX to observe the state of your Solr installation, you might have encountered a problem when reloading Solr cores – it appears that JMX beans didn’t survive those reloads in the past versions. The fix is created and is available in future 3_x and trunk releases.
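The cache=false example from the SOLR-2429 item above can be assembled programmatically. This is only a sketch: the helper function is hypothetical, and the filter it builds is the same frange filter quoted in the text.

```python
from urllib.parse import urlencode, parse_qs


def uncached_frange(lower, upper, func):
    """Build a function-range filter that asks Solr not to cache it."""
    return "{!frange l=%s u=%s cache=false}%s" % (lower, upper, func)


fq = uncached_frange(10, 100, "mul(popularity,price)")

# URL-encode the filter together with a match-all main query.
query_string = urlencode({"q": "*:*", "fq": fq})

# Decoding the query string gives back the original filter unchanged.
decoded_fq = parse_qs(query_string)["fq"][0]
print(decoded_fq)
```

Filters like this one, which are expensive and rarely repeated with the same bounds, are the typical candidates for skipping the filter cache.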

Interesting features in development

  • To achieve case-insensitive search with wildcard queries you could use a patch supplied under issue SOLR-2438. It has to be said that this isn’t committed to svn and it is hard to say whether it ever will be since there is a similar issue SOLR-219 on which work started 4 years ago.
  • Multithreaded faceting might bring some performance improvements. An initial patch exists, but more work is needed and it still isn’t clear how big an improvement we could expect in real-world conditions. Still, it is worth keeping an eye on this issue.
  • We all know that Solr’s Spatial support has its limitations. One of them is that a bounding box can only be specified through a point and a distance, which effectively limits it to a circular shape. Under SOLR-2609 we might get support for bounding boxes that are not based on point distance.
  • For anyone interested in which direction Spatial support might evolve, we suggest checking Lucene Spatial Playground. It continues the great work done in SOLR-2155 which provided extension to initial GeoSpatial support in Solr by adding multivalued spatial fields. At some point, SOLR-2155 might get the goodness from LSP. Also, another thing to check would be a thread on Lucene Spatial Future.

Interesting new features

  • Support for Lucene’s Surround Parser is added to Solr in issue SOLR-2703. The patch is already committed to the trunk.
  • Solr will get the ability to use configuration like analyzer type="phrase". Lucene’s Query Parsers recently got a simpler way to use a different analyzer based on the query string. One example is usage of double quotes where one can decide that instead of current meaning in Lucene/Solr world – specifying a phrase to be searched for – it should have a meaning like in Google’s search engine – find this exact wording. Patch for this exists and can be applied on the trunk (it depends on Lucene trunk).
  • SOLR-2593 aims to provide a new Solr core admin action – ‘split’ – for splitting index. It would be used in case some core got too big or in any other case you might find it necessary.  Lucene already has a similar function.

Miscellaneous

  • Oracle released Java 7 about a month ago, but we advise against using it yet. JVM crashes and index corruption are issues likely to be encountered with it. For more information, visit this URL
  • As anticipated for some time, Java 5 support got axed from Lucene 4.0 (trunk). You can expect similar stuff for Solr too.
  • Solr’s build system has been reworked now. Among other things, this implies changes in directory structure in Solr project. For example, solr/src/ doesn’t exist any more and its old subdirs /java and /test are now in solr/core/. The changes are already applied to the trunk and 3_x which holds the next 3.4 version. For more details, see SOLR-2452.
  • A handy Solr architecture diagram can be found in ML thread
  • Solr’s Admin UI is being refreshed with the work in JIRA issue SOLR-2399 (we already wrote about it) and its spin-off SOLR-2667. Some of this stuff is already committed (on the trunk), so you may want to inspect the changes. More details can be found in the wiki where you can also get a sneak peek of the upcoming changes.

And that would be all for part one of the Solr Spring-Summer 2011 Digest edition from @sematext. Part two of the Spring-Summer Digest is coming in a few days – stay tuned!

Opening: HBase and Lucene / Solr / Elastic Search Developer

We are once again looking for smart people.  This time we are looking to hire a person who likes working with HBase and Lucene (or Solr or ElasticSearch).  This particular combination is important to us because the very first target for this person might be the integration of HBase and Lucene / Solr / ElasticSearch.  More specifically, we have our eyes on HBASE-3529, which we’ve closely examined during a recent HBase Hackathon that took place after BerlinBuzzwords.  Of course, we are also open to alternative approaches if the one taken in HBASE-3529 turns out to be problematic.  The work around the marriage of HBase and full-text search is to be done “in the open”, meaning in collaboration with HBase as well as Lucene, Solr, or Elastic Search developers, which makes this project that much more exciting.

Beyond HBase and search integration, we do other interesting stuff with HBase (and Flume and MapReduce and …), so this person would get to work on our Search Analytics and Scalable Performance Monitoring services.

Interested?  Please get in touch and see what else we like on our jobs page.

Search Analytics – Video Interview with Otis Gospodnetić

“I’m shocked companies aren’t using these tools.”

This video interview about Search Analytics is from Techilicious by Josette Rigsby: http://techielicous.com/2011/06/04/search-and-analytics/

“…we had a chance to speak with Otis Gospodnetić, co-author of Lucene in Action and Founder of Sematext regarding search analytics and searching big data.”

Enterprise search is growing in importance along with data sizes; there is simply too much content to locate without the aid of a search tool. But are users really finding what they need? Unfortunately, many companies cannot answer that question. Gospodnetić advised that organizations should be collecting at least a minimum set of metrics about search performance and user behavior. However, the majority are not.

Unlike click stream analysis, search analytics provides insight into how users are actually using search – the actual terms they specify – instead of just what they clicked. Key metrics organizations should collect on an on-going basis include:

  • Search failure (zero results)
  • Low click-through rate
  • Most popular searches (words and phrases)
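As a rough illustration of the three metrics listed above, the following sketch computes them from a tiny search log. The log format (a query string, a hit count and a click count per search) and all names are assumptions made up for this example, not the format of any particular product.

```python
from collections import Counter


def search_metrics(log, top_n=3):
    """Compute zero-result rate, click-through rate and most popular queries."""
    total = len(log)
    zero_results = sum(1 for entry in log if entry["hits"] == 0)
    with_clicks = sum(1 for entry in log if entry["clicks"] > 0)
    # Normalize case and whitespace so variants of one query count together.
    popular = Counter(entry["query"].strip().lower() for entry in log)
    return {
        "zero_result_rate": zero_results / total if total else 0.0,
        "click_through_rate": with_clicks / total if total else 0.0,
        "top_queries": popular.most_common(top_n),
    }


# Hypothetical log entries.
log = [
    {"query": "solr grouping", "hits": 12, "clicks": 1},
    {"query": "Solr Grouping", "hits": 12, "clicks": 0},
    {"query": "lucnee", "hits": 0, "clicks": 0},
    {"query": "hbase", "hits": 40, "clicks": 2},
]

metrics = search_metrics(log)
print(metrics)
```

In this toy log the misspelled query is the one that fails, which is exactly the kind of pattern such reports are meant to surface.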

Once the metrics are collected, organizations should analyze the data to improve the search experience. For example, if a significant percentage of queries are failing, organizations can use the data from search analytics to find out why. Is it due to misspellings? Are there synonyms? Gospodnetić said,

“I’m shocked companies aren’t using these tools.”

For more information on this topic read about Sematext Search Analytics service.

Training: Solr Performance Tuning and Monitoring

Quick announcement!

In addition to presenting at Open Source Search Conference in June, we’ll also be doing a super-cheap half-day training on Solr Performance Tuning & Monitoring.  You can sign up here.

In this tutorial you will learn how to squeeze the most performance out of your Solr cluster. We’ll cover performance at both indexing and query time; dealing with large volumes of data versus high query rates, the combination of the two; and various index sharding architectures possible to gain on search performance, in multi-data center setups, etc. We’ll cover an array of best practices, tips and tricks we regularly use in our engagements with clients, from various configuration settings to querying efficiently, all of which one should employ to get the most out of Solr. You will also learn how to monitor your Solr cluster’s performance with command-line tools and a visual monitoring solution specifically designed for Solr performance monitoring.

Prerequisites:

Basic knowledge of Solr, its configuration and setup.

Details:

  • Cost: $100
  • When: June 14, 2011, 9:00 a.m.-1:00 p.m.
  • Bonus: Lunch will be provided.
  • Register here

If you are interested in Solr Performance Monitoring, please read about Sematext Scalable Performance Monitoring service.

Solr Digest, February-March 2011

We Sematexters have been very busy over the past few months, so we missed Solr’s February Digest. This one will therefore be a bit longer than usual.  Let’s get started…

First, some major news : Solr 3.1 is officially released! The details of the announcement can be found here. We covered most of the new features in our digests already, so we’ll keep it short:

You can start your download :) .

Already committed features

  • post.jar got improved – JIRA issue improve post.jar to handle non UTF-8 files removed some of its very old limitations
  • jetty server included in Solr distribution didn’t support UTF-8. This is now solved, and the fresh 3.1 version already contains the fix

Interesting features in development

  • as part of SolrCloud, distributed indexing is being implemented in JIRA issue SOLR-2358. You can already see the work in progress in the initial patch, but you can also check SOLR-2341 which deals with shard distribution policies which will be available in Solr 4.0
  • If you ever wanted to add custom fields (not existing in the index) to Solr responses, you couldn’t have done that from Solr components. There were other ways to achieve such functionality (for instance, customizing response writer class), but it looks like we’ll get such ability inside of components, too. No need to say how much more natural that would be. Anyway, issue Allow components to add fields to outgoing documents provides the umbrella for this new functionality. Although it is already closed, there are a few sub-issues in which actual pieces of logic will be implemented.
  • if you have problem with case sensitive searches in wildcard queries, you might take a look at a patch provided in JIRA issue Case Insensitive Search for Wildcard Queries
  • although Solr got its first solid spatial implementation in version 3.1, many people found its limitations. One of them is surely a case where documents have multivalued spatial fields. We already wrote about SOLR-2155 in our December digest, but work under that issue hasn’t stopped and keeps evolving. It is likely that it will become a part of the standard Solr distribution and Lucene could get it incorporated, too. If you need spatial search you may want to watch this issue.

Interesting new features

  • one common problem when using Solr’s default spellchecker or auto-suggest is filtering of suggestions based on what some user can see (for instance, depending on the region in which your user resides). JIRA issue Doc Filters for Auto-suggest, spell checking, terms component, etc. proposes a feature which would help here. Currently, no work was done there, though we believe we’ll get to see some progress in the future. While we are at it, in case you need such feature in Auto-suggest now, you might take a look at our in-house Search Auto-Complete solution, which you can see in action on search-lucene.com and search-hadoop.com.
  • just like there are default components for SearchHandlers (which are used by default for every new search handler, unless overridden), update processors will get a similar feature. JIRA issue Let some UpdateProcessors be default without explicitly configuring them will take care that some important update processors are available by default to your UpdateRequestProcessorChain.
  • one great new feature could be added to Solr – ability not to cache a filter. JIRA issue SOLR-2429 will deal with this. Many Solr users will be happy to optimize their cache performance when this feature is available some day.

Miscellaneous

  • some interesting thoughts on spellchecker can be found in ML thread My spellchecker experiment and much more on that topic in the related blog
  • should you use ASCIIFoldingFilter or MappingCharFilter when dealing with accents? Interesting discussion in thread Should ASCIIFoldingFilter be deprecated? could help you decide which one is right for you
  • interesting idea for Solr’s admin UI can be found in this ML thread. Community’s reception was very good so we also got Solr Admin Interface, reworked issue as the home for this new work.
  • anyone using Solr’s UIMA (Unstructured Information Management Architecture) contrib might be interested to know that its wiki page got a major improvement – more docs to read!
  • we might be a bit late on this, but there is still some time left – Google’s Summer of Code applications can be submitted until 8th April. Check this ML thread for some detail.  And don’t forget that Sematext is sponsoring interns, too!
  • new Solr/Lucene users should take a look at the Refcard provided by Erik Hatcher in ML thread [infomercial] Lucene Refcard at DZone
  • some deep thoughts on Solr/Lucene’s release process by some of the key people can be found here Brainstorming on Improving the Release Process. Related to that is a JIRA issue Define Test Plan for 4.0 which will… eh, contain some info about Test plan for 4.0 release, obviously. Also, check the TestPlans wiki page that’s in the making.

Although there were some other interesting topics, we have to stop somewhere. Until next month, you’ll find us on Twitter.

Sematext at Lucene Revolution 2011

Last year at Lucene Revolution in October in Boston, we shared how we built search-lucene.com and search-hadoop.com.  In May of this year, we’ll again be talking at Lucene Revolution about another topic very dear to us at Sematext – Search Analytics (abstract).  The full conference agenda is available.  Start picking sessions to attend.

This year’s Lucene Revolution is extra interesting because Sematext is also sponsoring the conference.  In addition to that, it’s great to see a couple of our customers presenting this year!

If you are coming to the conference don’t be afraid to say hello.  And if San Francisco is too far this year and you are on the east coast of the US in mid-June, you can also catch us at the Open Source Search Conference.  And if you are in Europe, you’ll see us there in June of this year, too.  Until then, so long from @sematext.

Wanted: Devops to run Search-Hadoop.com and Search-Lucene.com

If you are dreaming about working on search, big data, analytics, data mining, and machine learning, and are a positive, proactive, independent devops creature, inquire within!

We are a small and highly distributed team who likes to eat a little bit of everything: search for breakfast, mapreduce for lunch, and bigtable for dinner.  We are looking for a part-time-to-grow-into-full-time devops to work on the popular search-hadoop.com and search-lucene.com sites and take them to the next level. As such, you’ll need to be on top of Lucene, Solr, and Elastic Search.  Similarly, you must be completely at $HOME on the UNIX command line.  Working knowledge of Mahout or statistics/machine learning/data mining background would be a major plus, but is not required.  Experience with productive web frameworks and slick modern front-end frameworks is another plus, as is familiarity with EC2 and EBS.

More about the ideal you:

  • You are well organized, disciplined, and efficient
  • You don’t wait to be told what to do and don’t need hand-holding
  • You are reliable, friendly, have a positive attitude, and don’t act like a prima donna
  • You have an eye for detail – no sloppy code, no poor spelelling and typous
  • You are able to communicate complex ideas in a clear fashion in English (or pretty diagrams)
  • You have experience with (large scale) search or data analysis
  • You like to write about technologies relevant to what we do
  • You are an open-source software contributor

Not all of the above are required, of course – the closer the match, the higher the relevance score, that’s all.

Interested?  Please get in touch.

Sematext at Open Source Search Conference 2011

We are happy to report that we have been selected to present at the upcoming Open Source Search Conference 2011 in McLean, Virginia, this coming June.  We’ll be talking about Search Analytics.

Search Analytics: What? Why? How?

Search is increasingly the primary information access mechanism, so knowing how your search is doing often has direct business impact. You’ve indexed your data and people are searching it. But how do you know if they are happy with the results? How do you know if they are finding what they need?  Regardless of whether you are using Solr, Lucene, or some other search solution, you should be paying attention to what your users are searching for and clicking on – through those actions they are telling you what you can do to improve your search.

In this talk we’ll talk about search analytics and how it can be used to answer questions like:

  • Are too many users getting the dreaded “no matches” results?
  • How deep into search results do people dig?
  • Which hits are they clicking on, or what percentage of them don’t click on any hits?
  • How much do they use the “Did You Mean” or “Auto-Complete” suggestions?

We’ll explore what specific search analytics reports tell us and what specific actions you should take based on those reports.

You can browse through other presentations, too.  If you will be attending the conference, please do not hesitate to tap Otis’ shoulder.  We’d also be happy to talk business if you think your organization may benefit from one of Sematext’s products or services or if you simply want to chat.  To keep up with us, follow @sematext on Twitter.

Google Summer of Code and Intern Sponsoring

Are you a student and looking to do some fun and rewarding coding this summer? Then join us for the 2011 Google Summer of Code!

The application deadline is in less than a month! Lucene has identified initial potential projects, but this doesn’t mean you can’t also pick your own.  If you need additional ideas, look at our Lucene / Solr for Academia: PhD Thesis Ideas (or just the spreadsheet if you don’t want to read the what and the why); just be sure to discuss with the community first (send an email to dev@lucene.apache.org).

We should also add that, separately from GSoC, Sematext would be happy to sponsor good students and interns interested in work on projects involving search (Lucene, Solr), machine learning & analytics (Mahout), big data (Hadoop, HBase, Hive, Pig, Cassandra), and related areas. We are a virtual and geographically distributed organization whose members are spread over several countries and continents and we welcome students from all across the globe.  For more information please inquire within.

Solr Digest, January 2011

Welcome to the second season of Sematext’s monthly Solr Digests. Once again, we compiled a list of the most interesting topics in the Solr world for the previous month:

Already committed features

  • A bug related to using PHPSerialized response writer in sharded environment was fixed and committed in SOLR-2307. It affected all recent Solr versions (trunk, 3_x, 1.4.1,…) and the fix is committed to 3_x branch and trunk. In case you’re stuck with an older version of Solr, you can try applying the patch manually; it should be doable.
  • One old JIRA issue Enable sorting by Function Query is finally closed and committed to 3_x and trunk.
  • A problem with race condition in StreamingUpdateSolrServer got its fixes before, however it appears that issue wasn’t fixed completely. Now another fix is committed to 3_x and trunk, so if you use this feature, we advise picking up the fix.

Interesting features in development

  • Support for complex syntax (e.g. wildcards) in phrase queries is being brought to Lucene. In case you’re interested, you can take a look at LUCENE-1823 or LUCENE-1486 which was another try at similar functionality. These issues have been in development for a long time and still aren’t finished, although patches exist. A similar feature for Solr is being developed under SOLR-1604, where you can also find some patches. However, we think it is a bit unclear if any of these issues will ever be committed to Lucene/Solr, so if you’re interested, check the progress on them occasionally and don’t hold your breath.

Interesting new features

Miscellaneous

And that’s all for January.
