Search Analytics: What? Why? How? Presentation

Back in May and June of this year, 2011, we went to two conferences: Lucene Revolution 2011 in San Francisco and Berlin Buzzwords.  We went there to talk about what Search Analytics is, why it’s important to have a good search analytics tool, what the benefits of search analytics are, etc.  We never posted that presentation deck online, so we are doing it now.  Note that we will be talking about Search Analytics at Lucene Eurocon 2011 in Barcelona, too, but that talk will be different than the one whose slides are below.

Enjoy!

Search Analytics – Video Interview with Otis Gospodnetić

“I’m shocked companies aren’t using these tools.”

This video interview about Search Analytics is from Techilicious by Josette Rigsby: http://techielicous.com/2011/06/04/search-and-analytics/

“…we had a chance to speak with Otis Gospodnetić, co-author of Lucene in Action and Founder of Sematext regarding search analytics and searching big data.”

Enterprise search is growing in importance along with data sizes; there is simply too much content to locate without the aid of a search tool. But are users really finding what they need? Unfortunately, many companies cannot answer that question. Gospodnetić advised that organizations should be collecting at least a minimum set of metrics about search performance and user behavior. However, the majority are not.

Unlike click stream analysis, search analytics provides insight into how users are actually using search – the actual terms they specify – instead of just what they clicked. Key metrics organizations should collect on an on-going basis include:

  • Search failure (zero results)
  • Low click-through rate
  • Most popular searches (words and phrases)
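As a rough illustration of the three metrics above, here is a minimal Python sketch that computes them from a small in-memory query log. The `SearchEvent` record, its field names, and the sample log are assumptions made up for this example; they are not the format of any particular search analytics tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SearchEvent:
    """One logged search (hypothetical record layout)."""
    query: str        # the terms the user typed
    num_results: int  # how many hits the search returned
    clicks: int       # how many results the user clicked


def search_metrics(events, top_n=3):
    """Return the zero-result rate, click-through rate, and most popular queries."""
    total = len(events)
    if total == 0:
        return {"zero_result_rate": 0.0, "click_through_rate": 0.0, "top_queries": []}
    zero = sum(1 for e in events if e.num_results == 0)
    clicked = sum(1 for e in events if e.clicks > 0)
    # Normalise case and whitespace so "HBase sink" and "hbase sink" count together.
    popular = Counter(e.query.strip().lower() for e in events)
    return {
        "zero_result_rate": zero / total,
        "click_through_rate": clicked / total,
        "top_queries": popular.most_common(top_n),
    }


log = [
    SearchEvent("hbase sink", 12, 1),
    SearchEvent("HBase sink", 12, 0),
    SearchEvent("flume hbsae", 0, 0),
    SearchEvent("solr facets", 40, 2),
]
metrics = search_metrics(log)
print(metrics)
```

A real report would likely track these per day or per week, so that a rising zero-result rate or a falling click-through rate stands out.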

Once the metrics are collected, organizations should analyze the data to improve the search experience. For example, if a significant percentage of queries are failing, organizations can use the data from search analytics to find out why. Is it due to misspellings? Are there synonyms? Gospodnetić said,

“I’m shocked companies aren’t using these tools.”
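To make the "is it due to misspellings?" question concrete, here is a small Python sketch that compares the terms of failed queries against a list of terms known to be in the index, using the standard library's `difflib`. The function name, the term list, the sample queries, and the 0.8 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib


def likely_misspellings(failed_queries, known_terms, cutoff=0.8):
    """Map each unknown term in the failed queries to its closest known term, if any."""
    suggestions = {}
    for query in failed_queries:
        for term in query.lower().split():
            if term in known_terms:
                continue  # the term itself is fine, so it is not the problem
            close = difflib.get_close_matches(term, known_terms, n=1, cutoff=cutoff)
            if close:
                suggestions[term] = close[0]
    return suggestions


# Hypothetical data: terms that exist in the index, and queries that returned nothing.
known = ["lucene", "solr", "hadoop", "hbase", "flume"]
failed = ["lucen index", "hadop cluster", "notebook"]
suggestions = likely_misspellings(failed, known)
print(suggestions)
```

Failed terms that get a suggestion are probably typos and are candidates for a "did you mean" feature. Terms with no close match, such as "notebook" here, are more likely to point at missing content or missing synonyms.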

For more information on this topic read about Sematext Search Analytics service.

Sematext at Berlin Buzzwords 2011

As part of Sematext’s Summer 2011 Conference Tour we are going to be visiting good old Europe and giving a talk at Berlin Buzzwords.  This is the second year for Berlin Buzzwords, “a conference for developers and users of open source software projects, focusing on the issues of scalable search, data-analysis in the cloud and NoSQL-databases. Berlin Buzzwords presents more than 30 talks and presentations of international speakers specific to the three tags ‘search’, ‘store’ and ‘scale’”.  Last year, one of us from Sematext went there as an attendee.  This year, three of us are going and one of us is giving a talk: @OtisG will be speaking about Search Analytics on June 6th.  That’s the first day of the conference and we are first in line to talk at 11:00 AM, right after the morning coffee.  Doug Cutting and Ted Dunning will be giving keynotes.  Some of us may also be there for some of the Hackathons/Workshops before and/or after the conference.  If you are going to be there and would like to meet up, please let us know @sematext.


Lucene / Solr for Academia: PhD Thesis Ideas

If you are a Lucene or Solr user or developer, please read on, we’d like to hear from you.  If you use a different search tool, please also keep reading.  And if you have 5 minutes of free time, we’d like to hear from you, too! ;)

Short version:

We are looking for your suggestions for advanced features that tools like Lucene, Solr, etc. could or should have, but unfortunately don’t have today, and that could be good topics for one’s Master’s or PhD thesis.  Some of us here at Sematext are PhD candidates and are looking for suggestions that could result in working code ready to be contributed to open-source.  Plus, we are trying to go beyond that and involve the academic community, as described below.  Please add your suggestions to the Lucene / Solr Wishlist public spreadsheet, but please keep in mind that we are looking for advanced functionality, not features that would be too simple or small as research/thesis topics.  Feel free to pass the link to friends and colleagues you think would be interested in this or may want to make suggestions.

Longer version:

We are in the early stages of collaborating with academia in areas such as IR/IE/ML/NLP.  What we’d like to do is involve the academic community, but with an explicit intention of producing research whose day one goal is to result in an implementation that will get integrated (in)to a specific, non-academic system.  Thus, we’d like to come up with very real, very practical problems or deficiencies in existing IR/IE/ML/NLP systems that are not simple, and that require academic-style work followed by real hacking in order to produce at least a working prototype/proof of concept. Our hope is that such a PoC could then be truly integrated, and maybe even improved upon, by industry people.

This may be too abstract and vague, so how about an example.

  • Say the target is Lucene and IR.
  • Say we identify that ability to do X is missing from Lucene.
  • Say that X is non-trivial, that it’s nobody’s immediate itch, and thus won’t be implemented by anyone in Lucene community in the next N months.
  • Say that X involves advanced functionality that could benefit from relatively advanced and/or new research coming out of academia, and is thus something that could be a part of someone’s PhD thesis.
  • Say we find a PhD candidate with adequate background knowledge and interest in X.
  • N months later we could have a working PoC of X.

We are hoping that by doing this we can help everyone:

  • The future PhD will have a non-made-up, real-world problem to solve and existing code (Lucene) to hack on.
  • Lucene community will get X.
  • Lucene community may get a good contributor or committer down the road.

As facilitators of this, we will try hard to work with the academia and teach them “open-source ways”, which includes teaching how to effectively work with the specific open-source community (to the extent this is permissible by one’s academic institution), in order for the research and the real-world needs to be aligned.

So….. at this point we are looking for suggestions of various interesting and practical advanced topics that have both an academic and an industry facet to them.  And, with this debut blog post, we are specifically turning to the IR/Lucene/Solr community at large to make suggestions.  Please add your suggestions to the Lucene / Solr Wishlist public spreadsheet, but please keep in mind that we are looking for advanced functionality, not features that would be too simple or small as research/thesis topics. Feel free to pass the link to friends and colleagues you think would be interested in this or may want to make suggestions.

Thank you!

Key Phrases for Better Search: Smart Content Presentation

We are 3 for 3 this month – 3 talks at 3 different conferences – Lucene Revolution (see our presentation), Hadoop World (see our presentation), and Smart Content (see full agenda).  The last conference was a small one-day conference here in New York, organized by Seth Grimes.  It turns out there are tons of vendors in the text analytics / “semantic” analysis space who all do more or less the same thing – Named Entity Recognition, Classification, Clustering, Key Phrase Extraction, etc.  Sematext is not in that space, though we do have a classifier, a Language Identifier, and a Key Phrase Extractor.  It is this last tool, the Key Phrase Extractor, that I made use of in the presentation.  But enough talk, here is our presentation:

Hiring Search and Data Analytics Engineers

We are growing and looking for smart people to join us either in an “elastic”, on-demand, per-project, or more permanent role:

Lucene/Solr expert who…

  • Has built non-trivial applications with Lucene or Solr or Elastic Search, knows how to tune them, and can design systems for large volumes of data and queries
  • Is familiar with (some of the) internals of Lucene or Solr or Elastic Search, at least on the high level (yeah, a bit of an oxymoron)
  • Has a systems/ops bent or knows how to use performance-related UNIX and JVM tools for analyzing disk IO, CPU, GC, etc.

Data Analytics expert who…

  • Has used or built tools to process and analyze large volumes of data
  • Has experience using HDFS and MapReduce, and has ideally also worked with HBase, or Pig, or Hive, or Cassandra, or Voldemort, or Cascading or…
  • Has experience using Mahout or other similar tools
  • Has interest or background in Statistics, or Machine Learning, or Data Mining, or Text Analytics or…
  • Has interest in growing into a Lead role for the Data Analytics team

We like to dream that we can find a person who gets both Search and Data Analytics, and ideally wants or knows how to marry them.

Ideal candidates also have the ability to:

  • Write articles on interesting technical topics (that may or may not relate to Lucene/Solr) on Sematext Blog or elsewhere
  • Create and give technical talks/presentations (at conferences, local user groups, etc.)

Additional personal and professional traits we really like:

  • Proactive and analytical: takes initiative, doesn’t wait to be asked or told what to do and how to do it
  • Self-improving and motivated: acquires new knowledge and skills, reads books, follows relevant projects, keeps up with changes in the industry…
  • Self-managing and organized: knows how to parcel work into digestible tasks, organizes them into Sprints, updates and closes them, keeps team members in the loop…
  • Realistic: good estimator of time and effort (i.e. knows how to multiply by 2)
  • Active in OSS projects: participates in open source community (e.g. mailing list participation, patch contribution…) or at least keeps up with relevant projects via mailing list or some other means
  • Follows good development practices: from code style to code design to architecture
  • Productive, gets stuff done: minimal philosophizing and over-designing

Here are some of the Search things we do (i.e. that you will do if you join us):

  • Work with external clients on their Lucene/Solr projects.  This may involve anything from performance troubleshooting to development of custom components, to designing highly scalable, high performance, fault-tolerant architectures.  See our services page for common requests.
  • Provide Lucene/Solr technical support to our tech support customers
  • Work on search-related products and services

A few words about us:

We work with search and big data (Lucene, Solr, Nutch, Hadoop, MapReduce, HBase, etc.) on a daily basis.  Our projects with external clients range from 1 week to several months.  Some clients are small startups, some are large international organizations.  Some are top secret.  New customers knock on our door regularly and this keeps us busy at pretty much all times.  When we are not busy with clients we work on our products.  We run search-lucene.com and search-hadoop.com.  We participate in open-source projects and publish monthly Digest posts that cover Lucene, Solr, Nutch, Mahout, Hadoop, and HBase.  We don’t write huge spec docs, we work in sprints, we multitask, and try our best to be agile. We send people to conferences, trainings (Hadoop, HBase, Cassandra), and certifications (2 of our team members are Cloudera Certified Hadoop Developers).

We are a small and mostly office-free, highly distributed team that communicates via email, Skype voice/IM, and BaseCamp.  Some of our developers are in Eastern Europe, so we are especially open to new team members being in that area, but we are also interested in good people world-wide, from South America to the Far East.

Interested? Please send your resume to jobs @ sematext.com.

Search Analytics: Hadoop World Presentation

After our Lucene Revolution talk in Boston, we got ready for last week’s Hadoop World conference in New York.  As at Lucene Revolution, we presented to a packed room of 200+ people. The topic of our talk was the Search Analytics tool we’ve built with the help of Flume, HBase, MapReduce, and other open-source tools, and which we are now starting to use for search-hadoop.com and search-lucene.com.  If you couldn’t make it to Hadoop World, have a look at our presentation below.  And if you’d like to work on Search, Analytics, and related areas, we’re looking for good people world-wide – see our jobs page.  Enjoy!

ProjectHub: Lucene Revolution Presentation

Over the past few weeks we’ve been to two conferences: Lucene Revolution in Boston and Hadoop World in New York.  We presented at both.  At Lucene Revolution we talked about how we built search-hadoop.com and search-lucene.com.  The fact that our presentation room was packed despite a couple of other interesting talks being given at the same time tells us this stuff is interesting to people (or at least the title and the brief description in the conference schedule were attractive). For those of you who were unable to make it to Boston, we are sharing our presentation below.  And for those of you who like to work on Search, Analytics, Machine Learning, and related areas, we’re looking for good people world-wide – see our jobs page.  Enjoy!

Sematext at Hadoop World: Search Analytics with Flume and HBase

Besides working with search in general and Lucene/Solr/Nutch in particular, we also work with Hadoop, HBase, and other related technologies.  This October we’ll be presenting at Hadoop World (see the schedule).  The title of our talk is Search Analytics with Flume and HBase. Here’s the abstract:

In this talk we will show how we use Flume to transport search and clickstream data to HBase with the ultimate goal of producing Search Analytics reports using that data.  We’ll show how data flow through the system from the moment a query or click event is captured in the search application UI, until it lands in HBase via Flume’s HBase sink.  We’ll also share information about what this system looked like in the pre-Flume days.  Finally we’ll demonstrate various reports the system ultimately produces and insight we derive from them.
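To give a feel for the kind of data flow the abstract describes, here is a minimal Python sketch of a query or click event being serialised at the UI side and then mapped to a row for a wide-column store such as HBase. This is not the schema or code from the talk: the `UiEvent` fields, the JSON-line format, the `site|timestamp|session` row key, and the `e:` column family are all assumptions invented for illustration, and the sketch does not talk to Flume or HBase at all.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class UiEvent:
    """A search or click event captured in the search UI (hypothetical fields)."""
    site: str               # which search application emitted the event
    kind: str               # "query" or "click"
    query: str              # the query text the event belongs to
    timestamp_ms: int       # event time in milliseconds since the epoch
    session_id: str         # identifies the user session
    clicked_rank: int = -1  # position of the clicked hit; -1 for query events


def to_log_line(event):
    """Serialise the event as one JSON line, a form a log-shipping agent could pick up."""
    return json.dumps(asdict(event), sort_keys=True)


def to_row(line):
    """Turn a JSON log line into a (row key, {column: value}) pair."""
    data = json.loads(line)
    # Zero-padded timestamps keep rows for one site sorted by time.
    row_key = "{site}|{timestamp_ms:013d}|{session_id}".format(**data)
    columns = {"e:" + name: str(value) for name, value in data.items()}
    return row_key, columns


event = UiEvent("search-hadoop.com", "query", "hbase sink", 1286900000000, "s42")
row_key, columns = to_row(to_log_line(event))
print(row_key)
print(columns)
```

Row key design matters in a store like HBase because rows are kept sorted by key. The key used here groups events by site and then by time, which is only one of several reasonable choices.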

So, if you are interested in search and analytics, and especially the mix of the two, come see us this October in New York.  If you can’t wait until then or can’t make it to New York, and need help with search and/or analytics, let us know!

Run FAST to Open-Source Search

Ever since Microsoft announced they are halting all development of FAST for Linux starting now, every single organization involved in search has had something to say on this topic.  Here are our thoughts on that.

Apparently 80% of FAST users are Linux users.  We won’t speculate what is really behind this seemingly crazy decision to turn off the FAST@Linux dollar faucet.  Over at Kellblog, the Mark Logic CEO already itemized whose door the current FAST@Linux customers might want to knock on next depending on what they are using FAST for.  Unfortunately, he forgot one important angle there, one key option for FAST@Linux users that in today’s day and age should absolutely not be ignored.  After every storm comes sunshine.  The same is happening here.  While being forced to start thinking about changing one’s search solution probably isn’t pleasant, it does open another important door, a big opportunity one should not ignore. The question we would pose to FAST customers is the following: Is there something you’ve always wanted to do with FAST, but never could?  Here is your chance to change that! That door that just opened… walk through it, take a look around, you just may like it.  Read on.

While FAST@Linux users may be experiencing mental turbulence now, they know there are open-source tools and solutions out there that are more scalable and faster than FAST and don’t involve any crazy per-document, per-query, per-xyz licensing fees.  (The part about superior performance is referring to Lucene and solutions built on top of it, like Solr.  Several years ago the former C*O of FAST called me up with a search business proposal.  One of the bits he mentioned is that his/FAST engineers were testing Lucene and found it faster than their own search components.) More importantly, these tools are completely open and give FAST@Linux customers the opportunity to finally get whatever they could never get from FAST.  Sure, there will be some trade-offs (e.g. some of these tools may not have some of the nice GUI pieces FAST has, but that is changing…eh, fast).  But the key bit is that these tools are not only free to get and use, they are nearly infinitely flexible.  They have their source opened to all the eyeballs of the world.  They have open-minded communities (or at least the ones we are involved in at Apache Lucene, Solr, Nutch, etc. are).  Development plans are all an open book.  Any organization’s engineers are welcome to jump in and either contribute (nearly finished) new components, or collaborate to develop new functionality, or simply explain use cases and request functionality to support them, or even fork the whole project.  One would be crazy to choose the last option, of course: it would be a sub-optimal use of the benefits of choosing an open-source solution.  Flexibility is one of the key benefits of using open-source.  You don’t like how relevance is computed?  Plug in your own secret sauce!  You don’t like how data is stored on disk?  Change it to your liking!

If you have to get off FAST@Linux in the coming years, think very, very, VERY hard before you go to another closed enterprise search vendor.  By going to another closed enterprise search vendor, what have you achieved?  You may be able to do what you never could do with FAST before, but there might be some other functionality you’ll have to give up because the new vendor does not have it, does not plan on having it, and you cannot add it yourself because most of the solution is a black box with tiny holes poked in some of its sides, so you can kind of take a pretend peek inside.  This would not be an improvement.  This would be a missed opportunity!

Here are some of the key benefits of going with open-source search tools and solutions:

  • No up-front fees
  • No increasing fees due to growth
  • Flexibility to change anything and everything
  • Large and growing user and development communities
  • Security of commercial-grade support

Microsoft’s decision to drop FAST development for Linux is a blessing in disguise. It may not be pleasant right now, but it is actually a good opportunity for FAST customers.  Choose wisely now and in the coming years, and avoid being put on the spot again.

Update: after posting this we came across a blog post that refers to a search service switching from FAST to Solr and cutting costs by a factor of four.  That is, the cost of their new Solr-powered search is 25% of the cost of their old FAST-powered search.
