Solr vs ElasticSearch: Part 4 – Faceting

Solr 4 (aka SolrCloud) has just been released, so it’s the perfect time to continue our ElasticSearch vs. Solr series. The previous three parts gave a general overview of the two search engines, covered data handling, and looked at their full-text search capabilities. In this part we look at how these two engines handle faceting.

Faceting

When it comes to faceting, both Solr and ElasticSearch offer some faceting methods that other search engines do not. Both search engines allow you to calculate facets for a given field, numerical range, or date range. The key differences are in the details, of course: in how much control you have over how the facets are calculated, in the memory footprint, and in whether you can change the calculation method. In most cases ElasticSearch allows more control over faceting; however, Solr has some serious advantages, too. Let’s get into the details of each method.

Term Faceting

This method of faceting allows one to get information about the number of term occurrences in a certain field.

Solr

Solr lets you control how many facets are returned, how they are sorted, the minimum count required for a term to be returned, and so on. In addition, Solr field faceting lets you choose between a couple of different methods for computing facets. One of these methods should be used for fields with a high number of distinct terms, while the other is best used in the opposite scenario, when you expect relatively few distinct terms in the field being faceted on.
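As a rough illustration of the options described above, the following Python sketch builds the query string for a Solr field-faceting request. The parameter names (facet.field, facet.limit, facet.mincount, facet.sort, facet.method) are standard Solr faceting parameters; the field name "category" and all the values are assumptions made up for this example.

```python
from urllib.parse import urlencode


def solr_term_facet_params(field, limit=10, mincount=1, sort="count", method="fc"):
    """Build query parameters for a Solr field-faceting request.

    method="fc" is meant for fields with many distinct terms,
    method="enum" for fields with only a few distinct terms.
    """
    return [
        ("q", "*:*"),
        ("rows", "0"),                      # we only want facet counts, no documents
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", str(limit)),        # how many facet values to return
        ("facet.mincount", str(mincount)),  # minimum count for a value to be returned
        ("facet.sort", sort),               # "count" or "index"
        ("facet.method", method),
    ]


# "category" is a hypothetical field name.
params = solr_term_facet_params("category", limit=5, mincount=2, method="enum")
query_string = urlencode(params)
print(query_string)
```

The resulting string can be appended to the select handler URL of your own Solr core.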

ElasticSearch

ElasticSearch, on the other hand, can do everything Solr can in terms of how facets are calculated (though not in terms of choosing the calculation method), and in addition it lets us exclude specific terms we are not interested in and use regular expressions to define which terms will be included in the faceting results. We can also combine term faceting results from different fields automatically, or use scripts to modify the field values before the calculation takes place.
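To make these options more concrete, here is a minimal Python sketch that assembles a search body with a terms facet. The key names (field, fields, size, exclude, regex, script) follow the ElasticSearch facets API as it existed around the time of this post, so treat them as an assumption to verify against the documentation of your version. The facet name and the field names are hypothetical.

```python
import json


def es_terms_facet(name, fields, size=10, exclude=None, regex=None, script=None):
    """Build a search body with a single terms facet (old facets API)."""
    terms = {"size": size}
    # One field uses "field"; several fields are combined into one facet via "fields".
    if len(fields) == 1:
        terms["field"] = fields[0]
    else:
        terms["fields"] = list(fields)
    if exclude:
        terms["exclude"] = list(exclude)  # terms to leave out of the result
    if regex:
        terms["regex"] = regex            # only terms matching this pattern are counted
    if script:
        terms["script"] = script          # modifies each term before it is counted
    return {"query": {"match_all": {}}, "facets": {name: {"terms": terms}}}


# "tags" and "category" are hypothetical field names.
body = es_terms_facet("tag_counts", ["tags", "category"], size=5,
                      exclude=["misc"], regex="sh.*")
print(json.dumps(body, indent=1))
```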

Query Faceting

Both Solr and ElasticSearch can calculate facets for arbitrary queries. In both cases the queries are expressed in the query language of the engine in use. For example, in ElasticSearch you can use the whole query DSL to define the queries whose matching document counts are returned as facets.
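A small Python sketch of what query faceting might look like on both sides. On the Solr side each facet.query parameter produces one count; on the ElasticSearch side a facet of type "query" wraps a query DSL clause. The field names ("price", "tags") and the facet name are assumptions for illustration only.

```python
from urllib.parse import urlencode

# Solr: every facet.query parameter yields one count in the response.
solr_params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.query", "price:[0 TO 100]"),
    ("facet.query", "price:[101 TO *]"),
]
solr_query_string = urlencode(solr_params)

# ElasticSearch (old facets API): a "query" facet counts documents matching the clause.
es_body = {
    "query": {"match_all": {}},
    "facets": {
        "flower_docs": {"query": {"term": {"tags": "flower"}}},
    },
}

print(solr_query_string)
print(es_body)
```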

Range Faceting

Range faceting lets you get the number of documents that match a given range in a field. Both engines support range faceting, although in different ways.

Solr

Apache Solr lets you define the start value, end value, and the gap (with some adjustments like inclusion of values at the end of the ranges) and calculate all the ranges defined by that.
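The start, end, and gap approach can be sketched in Python as follows. The facet.range.* parameter names are standard Solr parameters; the field name "price" and the numbers are made up. The bucket functions are only a simplified local model of how the ranges are derived, assuming the default behaviour where each range includes its lower bound and excludes its upper bound.

```python
def solr_range_params(field, start, end, gap):
    """Query parameters for a Solr range-faceting request."""
    return [
        ("q", "*:*"),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.range", field),
        ("facet.range.start", str(start)),
        ("facet.range.end", str(end)),
        ("facet.range.gap", str(gap)),
    ]


def range_buckets(start, end, gap):
    """List the (lower, upper) ranges implied by start, end, and gap.

    Simplified model: every range is gap wide, so the last one may
    reach past end when the gap does not divide the interval evenly.
    """
    buckets = []
    lower = start
    while lower < end:
        buckets.append((lower, lower + gap))
        lower += gap
    return buckets


def count_in_buckets(values, start, end, gap):
    """Count values per range, lower bound inclusive, upper bound exclusive."""
    buckets = range_buckets(start, end, gap)
    counts = {lower: 0 for lower, _ in buckets}
    for value in values:
        for lower, upper in buckets:
            if lower <= value < upper:
                counts[lower] += 1
    return counts


print(solr_range_params("price", 0, 100, 25))
print(count_in_buckets([5, 25, 30, 99, 100], 0, 100, 25))
```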

ElasticSearch

ElasticSearch takes a different approach: it lets you specify a set of ranges and returns document counts as well as aggregated data. In addition, ElasticSearch lets you use one field to check whether a document falls into a given range and a different field to calculate the aggregated data. Furthermore, you can modify both the field values and the aggregated data with a script. And that’s not all: in addition to the above method of range faceting, ElasticSearch also supports the so-called histogram calculation. This is similar to the Apache Solr approach, in that for a given field you get a histogram of values. However, ElasticSearch doesn’t let you control the start and end like Solr does, only the gap.
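The two ElasticSearch variants might be expressed like this in Python. The key names (ranges, key_field, value_field, interval) are taken from the facets API of that era as best I recall it, so check them against your version. The field names "price" and "quantity" are hypothetical, and histogram_key is just a local model of how a value is assigned to a bucket of a given width.

```python
def es_range_facet(key_field, ranges, value_field=None):
    """Range facet: explicit ranges, optionally aggregating a second field."""
    if value_field is None:
        return {"range": {"field": key_field, "ranges": ranges}}
    # key_field decides which range a document falls into,
    # value_field supplies the values used for the aggregated data.
    return {"range": {"key_field": key_field,
                      "value_field": value_field,
                      "ranges": ranges}}


def es_histogram_facet(field, interval):
    """Histogram facet: only the bucket width (interval) is configurable."""
    return {"histogram": {"field": field, "interval": interval}}


def histogram_key(value, interval):
    """Local model: the bucket key is the value rounded down to a multiple of interval."""
    return value - (value % interval)


ranges = [{"to": 50}, {"from": 50, "to": 100}, {"from": 100}]
print(es_range_facet("price", ranges, value_field="quantity"))
print(es_histogram_facet("price", 25))
print(histogram_key(37, 25))
```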

Date Faceting

Again, both search engines support faceting on date based fields.

Solr

Date faceting in Apache Solr is quite similar to range faceting, although it is calculated on fields of the solr.DateField type. You have the same control and use similar parameters as with range faceting, so I’ll omit describing it.

ElasticSearch

ElasticSearch, on the other hand, offers its date faceting, which is an enhancement over the standard histogram faceting. It supports interval specification with date-specific values such as year, month, or day. In addition, ElasticSearch lets you specify the time zone to be used in the computation and, of course, manipulate the calculation with a script.
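A minimal sketch of such a date facet body in Python. The facet type date_histogram and the interval and time_zone keys reflect the facets API of that period as I remember it, so treat them as assumptions to verify. The field name "created_at" and the facet name are hypothetical.

```python
def es_date_histogram_facet(field, interval="month", time_zone=None):
    """Date histogram facet with a date-specific interval and optional time zone."""
    spec = {"field": field, "interval": interval}
    if time_zone is not None:
        spec["time_zone"] = time_zone  # e.g. an offset such as "-02:00"
    return {"date_histogram": spec}


body = {
    "query": {"match_all": {}},
    "facets": {
        "docs_per_day": es_date_histogram_facet("created_at", "day", "-02:00"),
    },
}
print(body)
```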

Decision Tree Faceting – Solr Only

One thing that ElasticSearch lacks and Solr has is pivot faceting, also known as decision tree faceting. It basically lets you calculate facets inside a parent facet. For example, this is what pivot faceting results look like in Solr (n.b. this example is trimmed for this post):

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
    <str name="facet">true</str>
    <str name="indent">true</str>
    <str name="facet.pivot">inStock,cat</str>
    <str name="q">*:*</str>
    <str name="rows">0</str>
  </lst>
</lst>
<result name="response" numFound="32" start="0">
</result>
<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields"/>
  <lst name="facet_dates"/>
  <lst name="facet_ranges"/>
  <lst name="facet_pivot">
    <arr name="inStock,cat">
      <lst>
        <str name="field">inStock</str>
        <bool name="value">true</bool>
        <int name="count">17</int>
        <arr name="pivot">
          <lst>
            <str name="field">cat</str>
            <str name="value">electronics</str>
            <int name="count">10</int>
          </lst>
          <lst>
            <str name="field">cat</str>
            <str name="value">currency</str>
            <int name="count">4</int>
          </lst>
          .
          .
          .
        </arr>
      </lst>
      <lst>
        <str name="field">inStock</str>
        <bool name="value">false</bool>
        <int name="count">4</int>
        <arr name="pivot">
          <lst>
            <str name="field">cat</str>
            <str name="value">electronics</str>
            <int name="count">4</int>
          </lst>
          <lst>
            <str name="field">cat</str>
            <str name="value">connector</str>
            <int name="count">2</int>
          </lst>
          .
          .
          .
        </arr>
      </lst>
    </arr>
  </lst>
</lst>
</response>
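To show how the nested structure of this response can be consumed, here is a small Python sketch that flattens the pivot tree into (path, count) rows using only the standard library. The embedded XML is a shortened copy of the facet_pivot part of the response above; the function name walk_pivot is made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the facet_pivot section of the response above.
SAMPLE = """<response><lst name="facet_counts"><lst name="facet_pivot">
<arr name="inStock,cat">
  <lst>
    <str name="field">inStock</str><bool name="value">true</bool><int name="count">17</int>
    <arr name="pivot">
      <lst><str name="field">cat</str><str name="value">electronics</str><int name="count">10</int></lst>
      <lst><str name="field">cat</str><str name="value">currency</str><int name="count">4</int></lst>
    </arr>
  </lst>
  <lst>
    <str name="field">inStock</str><bool name="value">false</bool><int name="count">4</int>
    <arr name="pivot">
      <lst><str name="field">cat</str><str name="value">electronics</str><int name="count">4</int></lst>
      <lst><str name="field">cat</str><str name="value">connector</str><int name="count">2</int></lst>
    </arr>
  </lst>
</arr>
</lst></lst></response>"""


def walk_pivot(arr, path=()):
    """Recursively flatten a pivot <arr> into ((value, ...), count) rows."""
    rows = []
    for entry in arr.findall("lst"):
        # The value element may be <str>, <bool>, etc., so match on the name attribute.
        value = entry.find("*[@name='value']").text
        count = int(entry.find("int[@name='count']").text)
        current = path + (value,)
        rows.append((current, count))
        nested = entry.find("arr[@name='pivot']")
        if nested is not None:
            rows.extend(walk_pivot(nested, current))
    return rows


root = ET.fromstring(SAMPLE)
pivot = root.find(".//lst[@name='facet_pivot']/arr")
rows = walk_pivot(pivot)
for path, count in rows:
    print(" > ".join(path), count)
```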

Statistical Faceting

Both ElasticSearch and Apache Solr can compute statistical data on numeric fields – values like count, total, minimal value, maximum value, average, etc. can be computed.

Solr

In Apache Solr the functionality that lets you calculate statistics for a numeric field is called the Stats Component. It returns the above-mentioned values as part of the query result, in a separate list, just like faceting results.
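A short Python sketch of a Stats Component request. The stats and stats.field parameters are the standard ones; the field name "price" is an assumption. The summarize helper is not part of Solr, it only mirrors locally the kind of values mentioned above (count, total, minimum, maximum, average).

```python
from urllib.parse import urlencode


def solr_stats_params(field, query="*:*"):
    """Query parameters that enable the Stats Component for one numeric field."""
    return [
        ("q", query),
        ("rows", "0"),
        ("stats", "true"),
        ("stats.field", field),
    ]


def summarize(values):
    """Local illustration of the statistics computed for a numeric field."""
    return {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


print(urlencode(solr_stats_params("price")))
print(summarize([10, 20, 30]))
```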

ElasticSearch

In ElasticSearch this functionality is called the Statistical Facet. You should keep in mind, though, that, as usual, ElasticSearch allows us to calculate this information for values returned by a script or combined for multiple fields, which is very nice if you need combined information for two or more fields or you want to do additional processing before getting the data returned by ElasticSearch.
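The three variants mentioned here (single field, multiple fields, script) could be sketched in Python like this. The facet type name "statistical" and the keys field, fields, and script follow the facets API of that era as I recall it; the field names "price" and "tax" and the script text are hypothetical.

```python
def es_statistical_facet(field=None, fields=None, script=None):
    """Build a statistical facet from exactly one of: field, fields, script."""
    given = [arg for arg in (field, fields, script) if arg is not None]
    if len(given) != 1:
        raise ValueError("pass exactly one of field, fields, script")
    if field is not None:
        spec = {"field": field}
    elif fields is not None:
        spec = {"fields": list(fields)}  # statistics combined over several fields
    else:
        spec = {"script": script}        # statistics over values returned by a script
    return {"statistical": spec}


single = es_statistical_facet(field="price")
combined = es_statistical_facet(fields=["price", "tax"])
scripted = es_statistical_facet(script="doc['price'].value + doc['tax'].value")
print(single, combined, scripted)
```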

Geodistance Faceting

(Geo)spatial search is quite popular nowadays, when we try to provide the best search results we can by considering multiple pieces of information and conditions. Of course both Apache Solr and ElasticSearch provide spatial search capabilities, but we are not talking about searching here, we are talking about faceting. Sometimes there is a need to return the distance from a given point, just to show it in our application, and we can do that in both ElasticSearch and Solr.

Solr

In Solr, to facet by distance from a given point we have to use the facet.query parameter together with frange or geofilt, for example like this:

q=*:*&sfield=location&pt=10.10,11.11&facet=true&facet.query={!geofilt d=10 key=d10}

This would return the number of documents within 10 kilometers of the defined point.

ElasticSearch

ElasticSearch exposes a dedicated geo-distance faceting type that lets us pass the point and the array of ranges we want the distance to be calculated for. An example query might look like this:

{
 "query" : {
  "match_all" : {}
 },
 "facets" : {
  "d10" : {
   "geo_distance" : {
    "doc.location" : {
     "lat" : 10.10,
     "lon" : 11.11
    },
    "ranges" : [
     { "to" : 10 }
    ]
   }
  }
 }
}

In addition to that, we can specify the units to be used in distance calculations (kilometers and miles) and the distance calculation type – arc for better precision and plane for faster execution.

Solr, LocalParams and Faceting

One of the good things about faceting in Solr is that it allows the use of local params. For example, you can exclude some filters when calculating faceting results. Imagine you have a query that gets all results for the term ‘flower’, restricted to the ‘cloth’ category and the ‘shirt’ subcategory, but you would like faceting on the tags field not to be narrowed by any filter. With the help of local params this query may look like this:

q=flower&fq={!tag=facet_cat}category:cloth&fq={!tag=facet_sub}subcategory:shirt&facet=true&facet.field={!ex=facet_cat,facet_sub}tags
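When sending such a request from code, the curly braces and other special characters in the local params need to be URL-encoded. A minimal Python sketch using only the standard library, with the same parameters as in the query above:

```python
from urllib.parse import urlencode, parse_qsl

# A list of pairs is used because fq appears more than once.
params = [
    ("q", "flower"),
    ("fq", "{!tag=facet_cat}category:cloth"),      # filter tagged as facet_cat
    ("fq", "{!tag=facet_sub}subcategory:shirt"),   # filter tagged as facet_sub
    ("facet", "true"),
    # ex= lists the tagged filters to ignore when faceting on tags
    ("facet.field", "{!ex=facet_cat,facet_sub}tags"),
]

query_string = urlencode(params)
print(query_string)

# Decoding gives back exactly the parameters we started with.
print(parse_qsl(query_string) == params)
```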

ElasticSearch Faceting Scope and Filters

By default ElasticSearch facets are restricted to the scope of a given query, which is understandable. However, ElasticSearch also lets us change the scope of faceting to global and thus calculate facets for the whole data set, not just for a given result set. We can also calculate facets for different nested objects by defining a scope that matches the name of the nested object. This can come in handy in many situations, for example when optimizing memory usage when faceting on multivalued fields with many unique terms. Finally, with ElasticSearch we can use filters to narrow down the subset of documents on which faceting is calculated. We define these filters inside the facet and so choose which documents are taken into consideration when calculating it; just remember that filters that narrow down the query results do not restrict faceting. As you may expect, filters for faceting are defined in the same way as filters for queries.
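The following Python sketch puts the scope and filter options side by side in one search body. The keys global and facet_filter, and the top-level filter that narrows hits without affecting facets, follow the facets API of that era as I remember it, so treat them as assumptions to check against your version. All field and facet names are hypothetical, and nested scopes are left out.

```python
terms_on_tags = {"field": "tags"}

body = {
    "query": {"term": {"description": "flower"}},
    # Narrows the returned hits only; facets below are not restricted by it.
    "filter": {"term": {"category": "cloth"}},
    "facets": {
        # Default scope: documents matching the query.
        "tags_in_results": {"terms": terms_on_tags},
        # Global scope: the whole data set, regardless of the query.
        "tags_everywhere": {"terms": terms_on_tags, "global": True},
        # Facet-level filter: restricts the documents counted by this facet.
        "tags_for_shirts": {
            "terms": terms_on_tags,
            "facet_filter": {"term": {"subcategory": "shirt"}},
        },
    },
}

for name, facet in body["facets"].items():
    print(name, sorted(facet))
```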

Summary

In this part of the Apache Solr vs ElasticSearch series we talked about the ability to calculate facet information, and only about that. Of course, this only scratches the surface of faceting, because both Apache Solr and ElasticSearch provide additional parameters and features that we couldn’t cover without turning this post into a tl;dr monster. However, we hope this post gives you some general ideas about what you can expect from each of these search engines. In the next part of the series we will focus on other search features, such as geospatial search and the administration API. If you are going to the upcoming ApacheCon EU and are interested in hearing more about how ElasticSearch and Apache Solr compare, please come to my talk titled “Battle of the giants: Apache Solr 4.0 vs ElasticSearch”. See you there!

@kucrafal, @sematext

About Rafał Kuć
Sematext engineer, book author, speaker.

4 Responses to Solr vs ElasticSearch: Part 4 – Faceting

  1. Wilco Boumans says:

    Great analysis! One question, though. In the intro you mention differences in memory footprint. However, it’s not detailed further in the rest… are there significant differences in memory footprint between the two, and if so, are they general or specific to certain functionalities?

  2. Excellent. Thanks for the nice write up.

    How does the performance of this feature compare between the two?

    Do Solr and ElasticSearch support faceting dynamically (at search time) only? Since Lucene 4.x has the option to specify facets at index time, why has this not been integrated?

    • sematext says:

      Good questions. We have not compared performance of Solr vs. ES head to head. Benchmarking fairly is very tough. And dangerous. People at LinkedIn have, maybe a year ago now, compared ES vs. Solr vs. Sensei. They have published the code to do this on GitHub. Yes, both Solr and ES compute facets dynamically. Solr has multiple methods for handling facets. ES is currently going through a pretty complete rework of facets and is starting to call them aggregations. Lucene has its own, completely separate and different faceting implementation that relies on a “sidecar” index. On the Lucene/Solr mailing lists there have been several threads on the topic of sharing/reusing code around faceting, but the chances of that happening are currently low. See http://search-lucene.com/ for more.
