Solr vs. ElasticSearch: Part 2 – Data Handling

In the previous part of Solr vs. ElasticSearch series we talked about general architecture of these two great search engines based on Apache Lucene. Today, we will look at their ability to handle your data and perform indexing and language analysis.

  1. Solr vs. ElasticSearch: Part 1 - Overview
  2. Solr vs. ElasticSearch: Part 2 - Indexing and Language Handling
  3. Solr vs. ElasticSearch: Part 3 - Searching
  4. Solr vs. ElasticSearch: Part 4 - Faceting
  5. Solr vs. ElasticSearch: Part 5 - Management API Capabilities
  6. Solr vs. ElasticSearch: Part 6 – User & Dev Communities Compared

Read more of this post

Solr vs. ElasticSearch: Part 1 – Overview

A good Solr vs. ElasticSearch coverage is long overdue.  We make good use of our own Search Analytics and pay attention to what people search for.  Not surprisingly, lots of people are wondering when to choose Solr and when ElasticSearch, and this SolrCloud vs. ElasticSearch question is something we regularly address in our search consulting engagements.

As the Apache Lucene 4.0 release approaches and with it Solr 4.0 release as well, we thought it would be beneficial to take a deeper look and compare the two leading open source search engines built on top of Lucene – Apache Solr and ElasticSearch. Because the topic is very wide and can go deep, we are publishing our research as a series of blog posts starting with this post, which provides the general overview of the functionality provided by both search engines.

  1. Solr vs. ElasticSearch: Part 1 – Overview
  2. Solr vs. ElasticSearch: Part 2 – Indexing and Language Handling
  3. Solr vs. ElasticSearch: Part 3 – Searching
  4. Solr vs. ElasticSearch: Part 4 – Faceting
  5. Solr vs. ElasticSearch: Part 5 - Management API Capabilities
  6. Solr vs. ElasticSearch: Part 6 – User & Dev Communities Compared

Read more of this post

ActionGenerator – Part Two

In the previous part of the two – parts series about ActionGenerator we showed how to develop and run your own ActionGenerator. Today, we want to show you what action generators are there for you to use out-of-the-box and we want to share some insights about the future of this project.

ActionGenerator for ElasticSearch

The ag-player-es project contain all the code specific to action generators for ElasticSearch. You can find two sinks implemented – one for sending simple queries to ElasticSearch and the other for indexing data. They both use  ElasticSearch REST API, so no dependencies are needed for those to work. The other piece available in ag-player-es are three players configured and ready to use:

  • SimpleEsPlayerMain – ActionGenerator for running random queries to ElasticSearch.
  • DictionaryEsPlayerMain – ActionGenerator for running queries to ElasticSearch. Queries are generated using a provided dictionary.
  • DictionaryDataEsPlayerMain – ActionGenerator for indexing data to ElasticSearch. Fields content is generated using provided dictionary.

ActionGenerator for Solr

Similar to ag-player-es one can expect that ag-player-solr will contain all the code specific to action generators for Apache Solr. In this project you can find two sinks implementation – one for sending queries to ElasticSearch and one for indexing data. The first one uses Solr HTTP API and the other one uses XML to index data to Solr. Similar to ag-player-es, no dependencies are needed, so you should be able to use action generator for Solr with all recent Solr versions. Apart from that, there are three players configured and ready for use:

  • DictionaryDataSolrPlayerMain - ActionGenerator for indexing data to Apache Solr. Fields content is generated using provided dictionary.
  • DictionarySolrPlayerMain -  ActionGenerator for running queries to Apache Solr. Queries are generated using provided dictionary.
  • RandomQueriesSolrPlayerMain - ActionGenerator for running random queries to Apache Solr.

Using ActionGenerator for ElasticSearch

Lets concentrate on players available for ElasticSearch in ActionGenerator. As we wrote above, you have three main players for ElasticSearch that can be used out of the box.

SimpleEsPlayerMain

The simplest of the three generators available. It lets you generate random queries to a given index and with the use of the given field name. In order to use that player, you need to provide the following parameters:

  • ElasticSearch base URL
  • ElasticSearch index name
  • Name of the field queries should be run against
  • Number of events that should be generated
For example, you could run the following command and have 1000 queries sent to name field of the documents index on your local ElasticSearch instance:
java -cp ag-player-es-0.1.0-withdeps.jar \
com.sematext.ag.es.SimpleEsPlayerMain http://localhost:9200/ documents text 1000

DictionaryEsPlayerMain

The second generator that enables you to run queries uses a dictionary to generate text of your queries. It is similar to the SimpleEsPlayerMain, except for dictionary usage. In order to use that player, you need to provide the same parameters as to the SimpleEsPlayerMain and one additional parameters:

  • Dictionary path
For example, you could run the following command and have 1000 queries sent to name field of the documents index on your local ElasticSearch instance. Queries would be generated using dict.txt dictionary (each line containing different query string):
java -cp ag-player-es-0.1.0-withdeps.jar com.sematext.ag.es.DictionaryEsPlayerMain http://localhost:9200/ documents text 1000 dict.txt

DictionaryDataEsPlayerMain

The one and only player that enables you to index data to your ElasticSearch instance. Just as the player discussed above, DictionaryDataEsPlayerMain also works with the help of a dictionary. You need to provide the following parameters in order to use this player:

  • ElasticSearch base URL
  • ElasticSearch index name
  • ElasticSearch type name
  • Number of events that should be generated
  • Dictionary path
  • One or more fields and their types
So, for example if you would like to index 100.000 documents to documents index under document type to your local ElasticSearch instance you could run the following
java -cp ag-player-es-0.1.0-withdeps.jar com.sematext.ag.es.DictionaryDataEsPlayerMain http://localhost:9200/ documents document 100000 dict.txt id:numeric title:text likes:numeric

Right now, the following field types are available:

  • numeric
  • text
We plan to add more types in the future.  Pull requests with patches are welcome!

Using ActionGenerator for Apache Solr

Players available for Apache Solr are similar to the ones available for ElasticSearch, but lets quickly look at them, too.

RandomQueriesSolrPlayerMain

This player is similar to the SimpleEsPlayerMain – it also generates random queries, but to Apache Solr. In order to use it, you need to provide the following parameters:

  • Apache Solr core search handler URL
  • The field queries should be run against
  • Number of queries to be generated
For example, you could run the following command and have 1000 queries sent to name field of the documents core of your local Apache Solr instance:
java -cp ag-player-solr-0.1.0-withdeps.jar com.sematext.ag.solr.RandomQueriesSolrPlayerMain http://localhost:8983/solr/documents name 1000

DictionarySolrPlayerMain

DictonarySolrPlayerMain is similar to RandomQueriesSolrPlayerMain except it uses dictionary to generate queries. In order to use this player you need to provide one additional parameter compared to the ones you provided to DictionarySolrPlayerMain:

  • Dictionary path
For example, you could run the following command and have 1000 queries sent to name field of the documents core of your local Apache Solr instance. Queries would be generated using dict.txt dictionary (each line containing different query string):
java -cp ag-player-solr-0.1.0-withdeps.jar com.sematext.ag.solr.DictionarySolrPlayerMain http://localhost:8983/solr/documents name 1000 dict.txt

DictionaryDataSolrPlayerMain

The last player I’d like to tell you about is the one that enables data indexation to your Apache Solr instance. DictionaryDataSolrPlayerMain is similar to its ElasticSearch counterpart. You need to provide the following parameters in order to use this player:

  • Apache Solr core update handler URL
  • Number of events that should be generated
  • Dictionary path
  • One or more fields and their types
For example, if you would like to index 100000 documents to documents core of your local Apache Solr instance you could run the following
java -cp ag-player-solr-0.1.0-withdeps.jar com.sematext.ag.solr.DictionaryDataSolrPlayerMain http://localhost:8983/solr/documents/update/ 100000 dict.txt id:numeric title:text likes:numeric

Lets omit the field types description as it was provided during DictionaryDataEsPlayerMain description above.

Calculating Metrics

Currently, the AbstractHttpSink class has the ability to gather metrics about how the system to which you are sending events is behaving. You can choose between two methods of metrics output – to the standard output or to a file.  To enable metrics tracking you need to pass the -DenableMetrics=true parameter when running ActionGenerator. This parameter enables metrics gathering and outputs those to standard output. In order to change that behavior and output the metrics to a file you need to pass the -DmetricsType=file parameter. In addition, you need to specify which directory the metrics should be written to – you do that by passing -DmetricsDir=/path/to/output/dir/ parameter with the value of the directory. Please remember that the directory needs to be created before running your action generator.

Maven Artifacts

The libraries for all the projects creating ActionGenerator can be found in the Sonatype maven repository (
http://oss.sonatype.org/content/repositories/releases/
) under the following dependencies:

Action Generator

<dependency>
   <groupId>com.sematext.ag</groupId>
   <artifactId>ag-player</artifactId>
   <version>0.1.0</version>
</dependency>

Action Generator for ElasticSearch

<dependency>
   <groupId>com.sematext.ag</groupId>
   <artifactId>ag-player-es</artifactId>
   <version>0.1.0</version>
</dependency>

Action Generator for Solr

<dependency>
   <groupId>com.sematext.ag</groupId>
   <artifactId>ag-player-solr</artifactId>
   <version>0.1.0</version>
</dependency>

Plans for the Future

We plan to release action generators for SenseiDB, as well as expand the number of sinks and players ready to be used out of the box. Of course, as always, patches are welcome and if you find any problems with ActionGenerator or if you identify missing features, please open an issue.

ActionGenerator, Part One

In this post we’ll introduce you to ActionGenerator, one of several open source projects we are working on. ActionGenerator lets you generate actions (you can also think of actions as events) from an action sources and play those actions with ActionGenerator’s action player to one of the sinks. The rest is done by ActionGenerator. ActionGenerator comes with several action sources and sinks, but one can easily implement custom action sources and sinks and play them with ActionGenerator. Let’s dig into the details.

Read more of this post

Slides: Large Scale Performance Monitoring for ElasticSearch, HBase, Solr, SenseiDB…

In this presentation from Berlin Buzzwords 2012 we show how the SPM, our Performance Monitoring service is built.  How metrics are collected, how they are processed, and how they are presented.  We share a few findings along the way, too.

Note: we are actively looking for people with strong Java engineers. If that’s you, please get in touch. Separately, if you have interest and/or experience with HBase and/or Analytics, OLAP, and related areas, or  if you are looking to work with ElasticSearch, Solr, and search in general please get in touch, too.

 

See also:

Poll: What do you use for Solr performance monitoring?

The results of this poll will be included in the ”Large Scale ElasticSearch, Solr & HBase Performance Monitoring” presentation at Berlin Buzzwords next week.  Please vote and share this post to help us make this poll statistically significant!

Wanted Dead or Alive: Search Engineer with Client-facing Skills

We are on a lookout for a strong Search Engineer with interest and ability to interact with clients and with potential to build and lead local and/or remote development teams.  By “client-facing” we really mean primarily email, phone, Skype.

A person in this role needs to be able to:

  • design large scale search systems
  • have solid knowledge of either Solr or ElasticSearch or both
  • efficiently troubleshoot performance, relevance, and other search-related issues
  • speak and interact with clients

Pluses – beyond pure engineering:

  • ability and desire to expand and lead a development/consulting teams
  • ability to think both business and engineering
  • ability to build products and services based on observed client needs
  • ability to present in public, at meetups, conferences, etc
  • ability to contribute to blog.sematext.com
  • active participation in online search communities
  • attention to detail
  • desire to share knowledge and teach
  • positive attitude, humor, agility

We’re small and growing.  Our HQ is in Brooklyn, but our team is spread over 4 continents.  If you follow this blog you know we have deep expertise in search and big data analytics and that our team members are conference speakers, book authors, Apache members, open-source contributors, etc.  While we are truly international, this particular opening is in New York.  Speaking of New York, some of our New York City clients that we are allowed to mention are Etsy, Gilt, Tumblr, Thomson Reuters, Simon & Schuster (more on 
http://sematext.com/clients/index.html
).

Relevant pointers:

If you are interested, please send over some information about yourself, your CV, and let’s talk.

Berlin Buzzwords 2012 – Three Talks from Sematext

Last year was our first time at Berlin Buzzwords.  We gave 1 full talk about Search Analytics (video) and 2 lightning talks (video, video).  We saw a number of good talks, too.  We also took part in a HBase Hackathon organized by Lars George in Groupon’s Berlin offices and even found time to go clubbing.  So in hopes of paying Berlin another visit this year, a few of us at Sematext (@sematext) submitted talk proposals.  Last week we all got acceptance emails, so this year there will be 3 talks from 3 Sematextans at Berlin Buzzwords!  Here is what we’ll be talking about:

RafałScaling Massive ElasticSearch Clusters

This talk describes how we’ve used ElasticSearch to build massive search clusters capable of indexing several thousand documents per second while at the same time serving a few hundred QPS over billions of documents in well under a second.  We’ll talk about building clusters that continuously grow in terms of both indexing and search rates. You will learn about finding cluster nodes that can handle more documents, about managing shard and replica allocation and prevention of unwanted shard rebalancing, about avoiding expensive distributed queries, etc.  We’ll also describe our experience doing performance testing of several ElasticSearch clusters and will share our observations about what settings affect search performance and how much.  In this talk you’ll also learn how to monitor large ElasticSearch clusters, what various metrics mean, and which ones to pay extra attention to.

AlexReal-time Analytics with HBase

HBase can store massive amounts of data and allow random access to it – great. MapReduce jobs can be used to perform data analytics on a large scale – great. MapReduce jobs are batch jobs – not so great if you are after Real-time Analytics. Meet append-only writes approach that allows going real-time where it wasn’t possible before.

In this talk we’ll explain how we implemented “update-less updates” (not a typo!) for HBase using append-only approach. This approach shines in situations where high data volume and velocity make random updates (aka Get+Put) prohibitively expensive.  Apart from making Real-time Analytics possible, we’ll show how the append-only approach to updates makes it possible to perform rollbacks of data changes, and avoid data inconsistency problems caused by tasks in MapReduce jobs that fail after only partially updating data in HBase.  The talk is based on Sematext’s success story of building a highly scalable, general purpose data aggregation framework which was used to build Search Analytics and Performance Monitoring services. Most of the generic code needed for append-only approach described in this talk is implemented in our HBaseHUT open-source project.

OtisLarge Scale ElasticSearch, Solr & HBase Performance Monitoring 

This talk has all the buzzwords covered: big data, search, analytics, realtime, large scale, multi-tenant, SaaS, cloud, performance… and here is why:

In this talk we’ll share the “behind the scenes” details about SPM for HBase, ElasticSearch, and Solr, a large scale, multi-tenant performance monitoring SaaS built on top of Hadoop and HBase running in the cloud.  We will describe all its backend components, from the agent used for performance metrics gathering, to how metrics get sent to SPM in the cloud, how they get aggregated and stored in HBase, how alerting is implemented and how it’s triggered, how we graph performance data, etc.  We’ll also point out the key metrics to watch for each system type.  We’ll go over various pain-points we’ve encountered while building and running SPM, how we’ve dealt with them, and we’ll discuss our plans for SPM in the future.

We hope to see some of you in Berlin.  If these topics are of interest to you, but you won’t be coming to Berlin, feel free to get in touch, leave comments, or ping @sematext.  And if you love working with things our talks are about, we are hiring world-wide!

Sematext Presenting Open Source Search Safari at ESS 2012

We are continuing our “new tradition” of presenting at Enterprise Search Summit (ESS) conferences.  We presented at ESS in 2011 (see
http://blog.sematext.com/2011/11/02/search-analytics-at-enterprise-search-summit-fall-2011-presentation/
).  This year we’ll be giving a talk titled Open Source Search Safari, in which Otis (@otisg) will present a number of open-source search solutions – Lucene, Solr, ElasticSearch, and Sensei, plus maybe one or two others (suggestions?).  We’ll also be chained to booth 26 where we’ll be showcasing our search-vendor neutral Search Analytics service (service is currently still free, feel free to use it all you want), along with some of our other search-related products.

ESS East will be held May 15-16 in our own New York City.  If you are a past or prospective client or customer of ours, please get in touch if you are interested in attending ESS at a discount.

Poll: Solr Index Size Monitoring

As you may know, Sematext runs a service we internally call SPM – Scalable Performance Monitoring, a currently-still-free SaaS for monitoring performance of Solr, HBase, and soon a few other technologies we often help our clients with.  One of the things we monitor for Solr and other search technologies is the size of the index.  We monitor it by periodically checking its size, number of documents in it, number of deleted documents, number of index segments, files, etc.

Recently, we had an internal discussion about how to best report the index size when the index changes over time and decided we’d ask people who run Solr (or ElasticSearch or Sensei or…) – you – what you would like to see in this report.

For example, imagine that in some 5-minute time period (say 10:00 AM to 10:05 AM) we check the index 5 times (in reality we do it much for frequently) and each time we do that we find the index has a different number of documents in it: 10, 15, 20, 25, and finally 30 documents.  Now imagine this data as a graph showing the number of indexed document over time, but with the smallest  time period shown being a 5 minutes interval.

At this point the question we have for you is: How many documents should this graph report for our example 10:00 – 10:05 AM period above? Should it show the minimum – 10?  Average – 20?  Mean – 20?  Maximum -30?  Something else?  Minimum, average, and maximum – 10, 20, 30?

Any feedback and suggestions you give us regarding this will be greatly appreciated – thanks!

Follow

Get every new post delivered to your Inbox.

Join 1,218 other followers