Solr Digest, October 2010

Another busy month is behind us.  There were plenty of interesting topics, so let’s get started:

Already committed functionality

Interesting functionality in development

  • Faceting is heavily used functionality, but occasionally people find they’re missing some form of faceting. Hierarchical faceting is one such thing. It has been in development for a very long time but, despite a few posted patches, is still not part of the Solr distribution, and it hasn’t seen much activity lately. There is another, similar issue – Pivot (aka Decision Tree) Faceting Component – which should come to life as a separate component. However, there is renewed effort to make it usable, so eventually we’ll see expanded faceting support in Solr.

Interesting new functionality

  • Extending SchemaField with custom attributes is being dealt with in the issue Custom SchemaField object.
  • Improving search relevance is always a big issue (and represents a good part of what Sematext does in client engagements), no matter how good Solr and Lucene relevance is out of the box. One very useful addition to our search relevance arsenal could come from the Anti-phrasing feature. The idea is that some word sequences in a query are irrelevant to the query meaning (like “Where can I find” or “Where is”) and could/should be ignored while searching the index. This JIRA issue is still very fresh, so don’t hold your breath waiting for the implementation to become available next week, although we are bound to see this feature in one of the future Solr releases.
  • If you often work with financial data, you might find the patches from the issue Money FieldType useful. The new field type will support point and range queries, sorting, and exchange rates.
  • Lucene’s ICUTokenizer is useful for multilingual tokenizing, but until recently there was no support for it in Solr. The issue Provide Solr FilterFactory for Lucene ICUTokenizer will provide a filter factory which will enable us to use it from Solr. Bingo! The patch already exists, so you can try it now. Additional new functionality will be added over time. If you need multilingual support in Solr, have a look at Sematext’s popular Multilingual Indexer.

Miscellaneous

  • One of the favorite topics, which we also cover frequently, is the ongoing confusion about Solr versions. October didn’t disappoint: this topic was discussed on the mailing lists again. Here is one such thread – Which version of Solr to use? Let us summarize the key parts. Solr 1.5 will probably never be released. The branch_3x is a stable version from which the next Solr version, 3.1, will likely be released. The trunk contains a relatively stable, but still in-development, version of what will one day become Solr 4.0.
  • If you provide faceting functionality in your application, here is a small (but interesting) discussion that might give you a few ideas about how to optimize it – Faceting and first letter of fields.
  • It appears that Solr has problems running on Tomcat 7. These problems are not tied to a particular version of Solr; they affect all versions. To learn more, start with Problems running on tomcat and SOLR-2022.
  • The replication between Solr master and slave when they’re running different versions of Solr is broken, as you can see in issue Cross-version replication broken by new javabin format. The cause is the new javabin format, so in cases like the one described in this issue (master 1.4.1, slave 3x), you’ll encounter problems. Keep that in mind if you plan cross-version replication for some reason.

These were the most interesting highlights for the month of October. Thank you for reading Sematext Blog and following @sematext on Twitter.

Mahout Digest, October 2010

We’ve been very busy here at Sematext, so we haven’t covered Mahout during the last few months. We are pleased with what’s been keeping us busy, but are not happy about our irregular Mahout Digests. We covered the last (0.3) release with all of its features, and we are not going to miss covering a very important milestone for Mahout: release 0.4 is out! In this digest we’ll summarize the most important changes in Mahout since the last digest and add some perspective.

Before we dive into Mahout, please note that we are looking for people with Machine Learning skills and Mahout experience (as well as good Lucene/Solr search people).  See our Hiring Search and Data Analytics Engineers post.

This Mahout release brings overall changes regarding model refactoring and command line interface to Mahout aimed at improving integration and consistency (easier access to Mahout operations via the command line). The command line interface is pretty much standardized for working with all the various options now, which makes it easier to run and use. Interfaces are better and more consistent across algorithms and there have been many small fixes, improvements, refactorings, and clean-ups. Details on what’s included can be found in the release notes and download is available from the Apache Mirrors.

Now let’s add some context to various changes and new features.

GSoC projects

Mahout wrapped up its Google Summer of Code projects, and two were completed successfully:

  • EigenCuts spectral clustering implementation on Map-Reduce for Apache Mahout (addresses issue MAHOUT-328); proposal and implementation details can be found in MAHOUT-363
  • Hidden Markov Models based sequence classification (proposal for a summer-term university project); proposal and implementation details are in MAHOUT-396

Two projects did not complete due to lack of student participation and one remains in progress.

Clustering

The biggest additions in the clustering department are the EigenCuts clustering algorithm (a GSoC project) and MinHash-based clustering, which we covered as a possible GSoC suggestion in one of the previous digests. MinHash clustering was implemented, but not as a GSoC project. In the first digest of the Mahout series we covered problems related to evaluating clustering results (an unsupervised learning issue), so another big addition to Mahout’s clustering is the set of Cluster Evaluation Tools. These feature the new ClusterEvaluator (which uses Mahout in Action code for inter-cluster density and similar code for intra-cluster density over a set of representative points, not the entire clustered data set) and CDbwEvaluator, which offers new ways to evaluate clustering effectiveness.

Logistic Regression

Online learning capabilities such as a Stochastic Gradient Descent (SGD) algorithm implementation are now part of Mahout. Logistic regression is a model used to predict the probability of occurrence of an event. It makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person’s age, sex and body mass index. Logistic regression is used extensively in the medical and social sciences, as well as in marketing applications such as predicting a customer’s propensity to purchase a product or cancel a subscription. The Mahout implementation uses Stochastic Gradient Descent (SGD); see MAHOUT-228 for the initial request and development. The new sequential logistic regression training framework supports a feature vector encoding framework for high-speed vectorization without a pre-built dictionary. You can find more details on Mahout’s logistic regression wiki page.
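To make the idea of online training more concrete, here is a minimal sketch of logistic regression trained with SGD. It is plain Python and not Mahout’s implementation: the function names, the learning rate, the number of epochs and the tiny data set (scaled age and body mass index) are all made up for illustration.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Estimated probability that the event occurs for one example."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


def train_sgd(examples, learning_rate=0.5, epochs=2000):
    """Online training: the model is updated after every single example."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            # Gradient of the log-likelihood for one example is (label - p) * x.
            error = label - predict(weights, bias, features)
            bias += learning_rate * error
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights, bias


# Made-up data: (age / 100, body mass index / 50) -> event happened (1) or not (0).
EXAMPLES = [
    ([0.30, 0.40], 0),
    ([0.35, 0.45], 0),
    ([0.70, 0.70], 1),
    ([0.65, 0.60], 1),
]

weights, bias = train_sgd(EXAMPLES)
print(predict(weights, bias, [0.80, 0.80]))  # older, heavier person
print(predict(weights, bias, [0.20, 0.30]))  # younger, lighter person
```

Because each update only needs the current example, the same loop works on a stream of data that never fits in memory, which is the appeal of the online approach.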

Math

There has been a lot of cleanup done in the math module (you can check the details in the Cleanup Math discussion on the ML), lots of it related to the untested Colt framework integration (and deprecated code in the Colt framework). The discussion resulted in several pieces of the Colt framework getting promoted to tested status (QRdecomposition, in particular).

Classification

In addition to speedups and bug fixes, the main new features in classification are new classifiers (new classification algorithms) and more open, uniform input data formats (vectors). The most important changes are:

  • New SGD classifier
  • Experimental new type of Naive Bayes classifier (using vectors) and feature reduction options for the existing Naive Bayes classifier (variable length coding of vectors)
  • New VectorModelClassifier allows any set of clusters to be used for classification (clustering as input for classification)
  • Random forests can now be saved and used to classify new data. Read more on how to build a random forest and how to use it to classify new cases on this dedicated wiki page.

Recommendation Engine

The most important changes in this area are related to distributed similarity computations, which can be used in Collaborative Filtering (or in other areas, like clustering). An implementation of a Map-Reduce job that computes item-item similarities for item-based Collaborative Filtering, based on the algorithm suggested in Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce, can be found in MAHOUT-362. A generalization of the algorithm, based on the mailing list discussion, led to an implementation of a Map-Reduce job which computes pairwise similarities of the rows of a matrix using a customizable similarity measure (with implementations already provided for Cooccurrence, Euclidean Distance, Loglikelihood, Pearson Correlation, Tanimoto coefficient, and Cosine). More on the distributed version of any item similarity function (which was previously available only in a non-distributed implementation) can be found in MAHOUT-393. With pairwise similarity computation defined, RecommenderJob has evolved into a fully distributed item-based recommender (the implementation depends on how the pairwise similarities are computed). You can read more on the distributed item-based recommender in MAHOUT-420.
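The notion of pairwise row similarities with a pluggable measure can be sketched on a single machine as follows. This is only an illustration in plain Python, not the Map-Reduce job from MAHOUT-393: the function names and the small item-by-user data set are our own assumptions, and only two of the listed measures (Cosine and Tanimoto) are shown.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two sparse rows stored as {column: value}."""
    dot = sum(value * b[col] for col, value in a.items() if col in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tanimoto(a, b):
    """Tanimoto coefficient computed on the sets of non-zero columns."""
    shared = len(a.keys() & b.keys())
    total = len(a.keys() | b.keys())
    return shared / total if total else 0.0


def pairwise_similarities(rows, measure):
    """Similarity of every unordered pair of rows, using the given measure."""
    return {
        (left, right): measure(rows[left], rows[right])
        for left, right in combinations(sorted(rows), 2)
    }


# Made-up item rows: each maps a user id to a preference value.
ITEMS = {
    "item_a": {1: 1.0, 2: 1.0},
    "item_b": {2: 1.0, 3: 1.0},
    "item_c": {1: 1.0, 2: 1.0},
}

for pair, score in pairwise_similarities(ITEMS, tanimoto).items():
    print(pair, round(score, 3))
```

Swapping `tanimoto` for `cosine` changes the measure without touching the pairing logic, which is the design point of a customizable similarity. The nested pairing is quadratic in the number of rows, which is exactly why a distributed version matters for large data sets.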

Implementation of distributed operations on very large matrices is very important for a scalable machine learning library which supports large data sets. For example, when a term vector is built from a textual document, the term vector tends to have a high dimension. Now, if we consider a term-document matrix where each row represents a term and each column represents a document, we obviously end up with a high-dimensional matrix. A similar thing occurs in Collaborative Filtering: it uses a user-item matrix with ratings as matrix values, where each row corresponds to a user and each column represents an item. Again we have a large, sparse matrix.

Now, in both cases (term-document matrix and user-item matrix) we are dealing with high matrix dimensionality which needs to be reduced while preserving as much of the information as possible. We need some matrix operation that produces a lower-dimensional matrix with the important information preserved. For example, a high-dimensional matrix may be approximated in lower dimensions using Singular Value Decomposition (SVD).
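As a small illustration of approximating a matrix in lower dimensions, the sketch below finds the largest singular value and its vectors with plain power iteration and builds the rank-1 approximation from them. Note that this is a deliberately simple stand-in and not the Lanczos method Mahout uses, it is not distributed, and it assumes a non-zero matrix whose top singular vector is not orthogonal to the starting vector. The ratings matrix is made up.

```python
import math


def top_singular_triplet(matrix, iterations=200):
    """Power iteration on A^T A: returns (sigma, u, v) for the largest singular value."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        # w = A v, then v_next = A^T w, renormalised on every round.
        w = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v_next = [sum(matrix[i][j] * w[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v_next))
        v = [x / norm for x in v_next]
    av = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av]
    return sigma, u, v


def rank_one_approximation(matrix):
    """Rank-1 approximation sigma * u * v^T built from the top singular triplet."""
    sigma, u, v = top_singular_triplet(matrix)
    return [[sigma * ui * vj for vj in v] for ui in u]


# Made-up user-item ratings: two users with similar taste and one outlier.
RATINGS = [
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 1.0],
]

for row in rank_one_approximation(RATINGS):
    print([round(value, 2) for value in row])
```

The rank-1 matrix keeps the dominant pattern (the two users who like the first two items) and drops the rest, which is the kind of information-preserving reduction described above.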

It’s obvious that we need some (Java) matrix framework capable of fundamental matrix decompositions. JAMA is a great example of a widely used linear algebra package for matrix operations, capable of SVD and other fundamental matrix decompositions (WEKA, for example, uses JAMA for matrix operations). Operations on high-dimensional matrices always require heavy computation, and this translates into high hardware requirements for any ML production system. This is where Mahout, which features distributed operations on large matrices, should be the production choice for Machine Learning algorithms over frameworks like JAMA, which, although great, cannot distribute their operations.

In a typical recommendation setup, users often ‘have’ (used/interacted with) only a few items from the whole item set (which can be very large), which leads to sparse user-item matrices. Mahout’s (0.4) distributed Lanczos SVD implementation is particularly useful for finding decompositions of very large sparse matrices.

News and Roadmap

All of the new distributed similarity/recommender implementations we analyzed in the previous section were contributed by Sebastian Schelter and, in recognition of this important work, he was elected as a new Mahout committer.

The book “Mahout in Action”, published by Manning, has reached 15/16 chapters complete and will soon enter final review.

This is all from us for now. Any comments/questions/suggestions are more than welcome. Until the next Mahout digest, keep an eye on Mahout’s road map for 0.5 or the discussion about what Mahout is missing to become a production-stable (1.0) framework. We’ll see you next month – @sematext.

Solr Digest, September 2010

It is a busy time of year here at Sematext – we have 3 different presentations to prepare for 3 different conferences (2 down, 1 more to go!), so we’re a bit late with our digests. Nevertheless, we managed to compile a list of interesting topics from the Solr world:

Already committed functionality

  • Solr was upgraded to use Tika 0.7 (SOLR-1819) – the fix was applied to the 1.4.2, 3.1 and 4.0 versions. Of course, Tika 0.8 is going to happen in the not very distant future.
  • If you’re still using the old rsync-based replication and need to throttle the transfer rate, have a look at the patch contributed in JIRA issue SOLR-2099. Unfortunately, if you’re using the Java-based replication from 1.4, there is currently no way to throttle replication.
  • If you are using the new spatial capabilities in Solr, you might have noticed some incorrect calculations. One of them is fixed – Spatial filter is not accurate – on the 3.1 and 4.0 branches.
  • Another minor but useful addition – function queries can now be defined in terms of other request parameters. Check JIRA issue “full parameter dereferencing for function queries”. It is already implemented in 3.1 and 4.0 and is ready to be used. Here is a short example from JIRA (check how the add function is defined and note the v1 and v2 request parameters):

http://localhost:8983/solr/select?defType=func&fl=id,score&q=add($v1,$v2)&v1=mul(2,3)&v2=10

Can we say, Solr Calculator, eh?
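If you want to generate such requests from a script, the URL above can be assembled as in the following sketch. The helper name `function_query_url` is our own invention, and the host, port and path are simply the ones from the example; only the request parameters come from the JIRA example.

```python
from urllib.parse import urlencode


def function_query_url(base_url, expression, **parameters):
    """Build a select URL whose q is a function that dereferences $name parameters."""
    query = {"defType": "func", "fl": "id,score", "q": expression}
    query.update(parameters)
    # Keep $, parentheses and commas readable instead of percent-encoding them.
    return base_url + "?" + urlencode(query, safe="$(),")


url = function_query_url(
    "http://localhost:8983/solr/select",
    "add($v1,$v2)",
    v1="mul(2,3)",
    v2="10",
)
print(url)
```

Each `$v1`-style reference in the expression is resolved by Solr from the request parameter with the same name, so the function itself can stay fixed while the parameters change per request.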

Interesting functionalities in development for some time

  • Ever wanted to add some custom fields to a response, although they were not stored in your Solr index? You could always create a custom response writer which would add those fields (although it would probably be a “dirty” copy of one of Solr’s existing response writers). However, we all know that doesn’t sound like the right way to code. One JIRA issue might deliver a proper way some day – Allow components to add fields to outgoing documents. We say “some day”, since this functionality has been in development for quite some time now and, although it has some patches (currently non-functional, it seems), it is probably not very near completion. But it will be handy to have once it’s done.

Interesting new functionalities

  • Highlighter could get one frequently requested improvement – Highlighter fragement/formatter for returning just the matching terms – we believe this will be a useful addition, although we don’t expect it very soon.
  • One potentially useful feature for all of you who use HDFS – DIH should be able read data directly from HDFS for indexing. This issue already contains some working code, although it is unclear whether the fix will become a part of the standard Solr distribution. Still, if you’re using Solr 1.4.1 and you have data in HDFS that you want to index with Solr, have a look at this contribution.
  • Another improvement related to replication is in SOLR-2117 – Allow slaves to replicate at different times. This should be useful to anyone who has long (and therefore heavy) warmup periods on their slaves after replication. This way, you can have your slaves replicate at different times and, while a slave is replicating, take just that slave offline (to avoid degradation of response times). Be careful though, there is a downside: for some time (limited, but still), your slaves will serve different data. A patch is available for version 4.0.

Miscellaneous

So, we had a little bit of everything from Solr this month. Until late October (or the start of November), when the next Solr Digest arrives, stay tuned to @sematext, where we tweet other interesting stuff on a wider set of topics from time to time.

Hadoop Digest, August 2010

The biggest announcement of the year: Apache Hadoop 0.21.0 has been released and is available for download here. Over 1300 issues have been addressed since 0.20.2; you can find details for Common, HDFS and MapReduce. A note from Tom White, who did an excellent job as release manager: “Please note that this release has not undergone testing at scale and should not be considered stable or suitable for production. It is being classified as a minor release, which means that it should be API compatible with 0.20.2.” Please find a detailed description of what’s new in the 0.21.0 release here.

Community trends & news:

  • New branch hadoop-0.20-security is being created. Apart from the security features, which are in high demand, it will include improvements and fixes from over 12 months of work by Yahoo!. The new security features are going to be a very valuable and welcome contribution (also discussed before).
  • There is a thorough discussion of approaches to backing up HDFS data in this thread.
  • Hive voted to become Top Level Apache Project (TLP) (also here).  Note that we’ll keep Hive under Search-Hadoop.com even after Hive goes TLP.
  • Pig voted to become TLP too (also here).  Note that we’ll keep Pig under Search-Hadoop.com even after Pig goes TLP.
  • Tip: if you define a Hadoop object (e.g. Partitioner) as implementing Configurable, then its setConf() method will be called once, right after it gets instantiated.
  • For those new to ZooKeeper and pressed for time, here you can find the shortest ZooKeeper description — only 4 sentences short!
  • A good read: the “Avoiding Common Hadoop Administration Issues” article.

Notable efforts:

  • Howl: Common metadata layer for Hadoop’s Map Reduce, Pig, and Hive (yet another contribution from Yahoo!)
  • PHP library for Avro, includes schema parsing, Avro data file and string IO.
  • avro-scala-compiler-plugin: aimed to auto-generate Avro serializable classes based on some simple case class definitions

FAQ:

  • How to programmatically determine the names of the files in a particular Hadoop/HDFS directory?
    Use the FileSystem & FileStatus API. Detailed examples are in this thread.
  • How to restrict HDFS space usage?
    Please refer to the HDFS Quotas Guide.
  • How to pass parameters determined at run-time (i.e. not hard-coded) to Hadoop objects (like Partitioner, Writable, etc.)?
    One option is to define a Hadoop object as implementing Configurable. In this case its setConf() method will be called once, right after it gets instantiated and you can use “native” Hadoop configuration for passing parameters you need.
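The Configurable answer above boils down to a simple pattern: the framework creates the object and then hands it the configuration exactly once. The sketch below imitates that pattern in plain Python so it can be run anywhere. It is not Hadoop’s Java API: the class names, the `set_conf` spelling, the `new_instance` factory and the `partitioner.boundary` key are all hypothetical stand-ins.

```python
class Configurable:
    """Hypothetical stand-in for the Configurable contract."""

    def set_conf(self, conf):
        raise NotImplementedError


class RangePartitioner(Configurable):
    """Partitioner whose split point is only known at run time."""

    def __init__(self):
        self.boundary = None
        self.set_conf_calls = 0

    def set_conf(self, conf):
        # Read the run-time parameter from the job configuration.
        self.set_conf_calls += 1
        self.boundary = int(conf["partitioner.boundary"])

    def get_partition(self, key):
        return 0 if key < self.boundary else 1


def new_instance(cls, conf):
    """Mimics the framework: instantiate, then configure once if supported."""
    obj = cls()
    if isinstance(obj, Configurable):
        obj.set_conf(conf)
    return obj


# The driver puts the value into the configuration before the job starts.
job_conf = {"partitioner.boundary": "100"}
partitioner = new_instance(RangePartitioner, job_conf)
print(partitioner.get_partition(5), partitioner.get_partition(250))
```

The point of the pattern is that the object needs no constructor arguments, so the framework can create it on any node and the parameter still travels with the job configuration.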

HBase Digest, August 2010

The second “developer release”, hbase-0.89.201007d, is now available for download. To remind everyone, there are currently two active branches of HBase:

  • 0.20 – the current stable release series, being maintained with patches for bug fixes only.
  • 0.89 – a development release series with active feature and stability development, not currently recommended for production use.

The first one doesn’t support HDFS durability (edits may be lost in the case of node failure), whereas the second one does. You can find more information at this wiki page. The HBase 0.90 release may happen in October! See info from developers.

Community trends & news:

  • New HBase AMIs are available for dev release and 0.20.6.
  • Looking for some GUI that could be used for browsing through tables in HBase? Check out Toad for Cloud, watch for HBase-Explorer and HBase-GUI-Admin.
  • How many regions can a RegionServer support, and what are the consequences of having lots of regions in a RegionServer? Check the info in this thread.
  • Some more complaints to be aware of regarding HBase performing on EC2 in this thread. For those who missed it, more on Hadoop & HBase reliability with regard to EC2 in our March digest post.
  • Need guidance in sizing your first Hadoop/HBase cluster? This article will be helpful.

FAQ:

  • Where can I find information about data model design with regard to HBase?
    Take a look at http://wiki.apache.org/hadoop/HBase/HBasePresentations.
  • How can I perform SQL-like query “SELECT … FROM …” on HBase?
    First, consider that HBase is a key-value store which should be treated accordingly. But if you are still up for writing ad-hoc queries in your particular situation take a look at Hive & HBase integration.
  • How can I access Hadoop & HBase metrics?
    Refer to HBase Metrics documentation.
  • How to connect to HBase from a Java app running on a machine remote to the cluster?
    Check out client package documentation. Alternatively, one can use the REST interface: Stargate.

Solr Digest, August 2010

August brought a lot of activity into Solr world. There were many important developments, so we again compiled the most interesting ones for you, grouped into 4 categories:

Some new (and already committed) features

  • We already wrote about new work done on CollapsingComponent in June’s digest under SOLR-1682. A lot of work was done on this component and it appears that it is very close to being committed. Patches attached to the issue are functional, so you can give it a try.
  • SpellCheckComponent got an improvement related to recent Lucene changes – Add support for specifying Spelling SuggestWord Comparator to Lucene spell checkers for SpellCheckComponent. Issue SOLR-2053 is already fixed; the patch is attached if you need it, but it is also committed to trunk and the 3_x branch.
  • Another minor feature is the improvement of WordDelimiterFilter in SOLR-2059 (Allow customizing how WordDelimiterFilter tokenizes text). The patch is already there and committed to trunk and 3_x.
  • A performance boost for faceting can be found in SOLR-2089 (Faceting: order term ords before converting to values). Behind this intimidating title hides a very decent speedup in cases when facet.limit is high. The patch is available, and trunk and branch 3_x also got this magic committed.

Some new features being discussed and implemented

  • One very important (and probably much wanted) feature just got its Jira issue – SOLR-2080 (Create a Related Search Component). The issue was created by Grant Ingersoll, so we can expect some quality work to be done here. There are no patches (or even discussions) yet as the issue is in its infancy, but you can watch its progress in Jira. In the meantime, if you’re interested in such functionality, you can check Sematext’s RelatedSearches product.
  • Jira issue SOLR-2026 (Need infrastructure support in Solr for requests that perform multiple sequential queries) might add some interesting capabilities to search components, especially if you’re writing some of them on your own. We at Sematext have plenty of experience with writing custom Solr components (check, for instance, our DYM ReSearcher or its Relaxer sibling), so we know that sometimes it is not a very pleasant task. If Solr gets better support for executing multiple queries during a single request, writing custom components will become easier. One patch is already posted to this issue, so you can check it out; however, it is still unclear how this feature will evolve. We’re hoping for a flexible and comprehensive solution which would be easily extensible to many other features.
  • Defining QueryComponent’s default query parser can be made configurable with the patch attached to the issue SOLR-2031. You probably didn’t encounter many cases where you needed this functionality, but if you needed it, you had a problem before, and now that problem will become history.
  • It appears that QueryElevationComponent might get an improvement: Distinguish Editorial Results from “normal” results in the QueryElevationComponent. Jira issue SOLR-2037 will be the place to watch the progress.

Some newly found bugs

  • DataImportHandler has a bug – Multivalued fields with dynamic names does not work properly with DIH – the fix isn’t available yet, but if you have such problems, you can check the status here.
  • Another bug in DataImportHandler points to a connection leak – DIH doesn’t release JDBC connections in conjunction with DB2. There is no fix at the moment but, as usual, you can check the status in Jira.

Other interesting news

  • One potentially useful tool we recommend checking out is SolrMeter. It is a standalone tool for stress testing your Solr. From their site: The main goal of this open source project is to bring to the solr user community a “generic tool to interact specifically with solr”, firing queries and adding documents to make sure that your Solr implementation will support the real use. With SolrMeter you can simulate your work load over solr index and retrieve statistics graphically.
  • In which IDEs do you work with Solr/Lucene? Here at Sematext, we use both Eclipse and IntelliJ IDEA. If you use the latter and you want to set up Lucene or Solr in it, you can check a very useful description and patch in LUCENE-2611 IntelliJ IDEA setup.

We hope you enjoyed another Solr Digest from @sematext.  Come back and read us next month!

Solr Digest, July 2010

As usual, July is one of the slower months in the Solr world; however, we managed to find a few interesting topics for our readers.

  • An interesting feature might be added with SOLR-1979 (Create LanguageIdentifierUpdateProcessor). It would provide the ability to handle text in different languages differently (think about stemming in analysis, for instance) and to do it automatically. This issue was just created, so the work on it and any usable patches are coming some time in the future. However, if you need something working now, Sematext has a few products for similar multilingual functionality, for instance, Multilingual Indexer or its cousin Language Identifier.
  • Another interesting feature might come with SOLR-1980 (Implement boundary match support). This will enable one to specify that a query should match only at the start or at the end of the field (or be an exact match), not somewhere in the middle, which could provide more relevant search results in some specific cases. This issue is also in its infancy and has no patches yet, so we’ll have to wait and see how it progresses.
  • Ever wanted Solr to store as the value of some field something other than the raw input value (remember, when you search Solr, you search on analyzed and indexed values; when you fetch the content of some field, you get the raw input value added to that field, not its analyzed version)? A patch for that already exists in one rather fresh JIRA issue – SOLR-1997 (Store internal value instead of input one).
  • Getting ready to start using Solr, but unsure about which version you should use? Don’t worry: confusion about Solr’s versions started this spring (see Solr May 2010 Digest), but things have stabilized lately. The latest release is the fairly recent 1.4.1, which is basically version 1.4 with many bugfixes. The next release will be 3.1, which can be found on the branch_3x branch. You can find its nightly builds here. The trunk is still used for “unstable” development and the future 4.0 version. To get more information, check these recent threads on the Solr mailing list: here and here.
  • Many will probably agree that Solr’s SpellCheckComponent isn’t very useful in real-life applications. One of the main problems is that it handles multi-word queries poorly: it creates its suggestion as a collation of the best suggestion for each word of the query, so you often get suggestions which have 0 hits. Also, it doesn’t return important information about the suggested query, like how many hits such a query would generate and what results it would give. Some of these issues could be fixed some day with SOLR-2010 (Improvements to SpellCheckComponent Collate functionality). The first version of the patch is already provided. However, if you’d like to use such functionality in your Solr production today, you might consider a much more sophisticated and production-ready component developed by Sematext – DYM ReSearcher – you can see DYM ReSearcher in action on Search-Lucene.com, for example.
  • One minor functionality is added to QueryElevationComponent – Add option to return only the specified results. It was added with JIRA issue SOLR-1966 and is already committed to 3.x and trunk.
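The zero-hit collation problem mentioned in the SpellCheckComponent item is easy to reproduce on a toy example. The sketch below is plain Python with a made-up four-document “index” and a simple edit-distance suggester of our own; it does not use Solr’s actual suggestion logic, it only mimics the idea of picking the best suggestion for each word independently.

```python
from collections import Counter

# Made-up "index" of four tiny documents.
DOCS = [
    "solr faceting guide",
    "solar panel guide",
    "solar panel install",
    "solr replication",
]
TERM_FREQ = Counter(term for doc in DOCS for term in doc.split())


def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_suggestion(word):
    """Closest indexed term; ties go to the more frequent term."""
    return min(TERM_FREQ, key=lambda t: (edit_distance(word, t), -TERM_FREQ[t], t))


def collate(query):
    """Join the best per-word suggestions, ignoring how the words co-occur."""
    return " ".join(best_suggestion(word) for word in query.split())


def hits(query):
    """Number of documents that contain every query term."""
    terms = set(query.split())
    return sum(1 for doc in DOCS if terms <= set(doc.split()))


collated = collate("solt panle")
print(collated, hits(collated))
print("solar panel", hits("solar panel"))
```

Here each word gets a perfectly reasonable correction on its own (“solr” and “panel”), yet no document contains both, while the slightly more distant “solar panel” does match. A suggester that checks hit counts for the whole collated query avoids offering such dead ends.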

We hope that this was enough to satisfy your Solr appetite. Hopefully, we’ll dig up more interesting topics for you in August. Until then you can keep up with us via @sematext on Twitter.
