Lucene / Solr for Academia: PhD Thesis Ideas

If you are a Lucene or Solr user or developer, please read on; we’d like to hear from you.  If you use a different search tool, please also keep reading.  And if you have 5 minutes of free time, we’d like to hear from you, too! ;)

Short version:

We are looking for your suggestions for advanced features that tools like Lucene, Solr, etc. could or should have, but unfortunately don’t have today, and that could be good topics for a Master’s or PhD thesis.  Some of us here at Sematext are PhD candidates and are looking for suggestions that could result in working code ready to be contributed to open source.  Plus, we are trying to go beyond that and involve the academic community, as described below.  Please add your suggestions to the Lucene / Solr Wishlist public spreadsheet, but please keep in mind that we are looking for advanced functionality, not features that would be too simple or small as research/thesis topics.  Feel free to pass the link to friends and colleagues you think would be interested in this or may want to make suggestions.

Longer version:

We are in the early stages of collaborating with academia in areas such as IR/IE/ML/NLP.  What we’d like to do is involve the academic community, but with the explicit intention of producing research whose goal from day one is an implementation that will get integrated into a specific, non-academic system.  Thus, we’d like to come up with very real, very practical problems or deficiencies in existing IR/IE/ML/NLP systems that are not simple, that require academic work, and that then require real hacking in order to produce at least a working prototype/proof of concept (PoC).  Our hope is that such a PoC could then be truly integrated, and maybe even improved upon, by industry people.

This may be too abstract and vague, so here is an example.

  • Say the target is Lucene and IR.
  • Say we identify that the ability to do X is missing from Lucene.
  • Say that X is non-trivial, that it’s nobody’s immediate itch, and thus won’t be implemented by anyone in the Lucene community in the next N months.
  • Say that X involves advanced functionality that could benefit from relatively advanced and/or new research coming out of academia, and is thus something that could be a part of someone’s PhD thesis.
  • Say we find a PhD candidate with adequate background knowledge and interest in X.
  • N months later we could have a working PoC of X.

We are hoping that by doing this we can help everyone:

  • The future PhD will have a non-made-up, real-world problem to solve and existing code (Lucene) to hack on.
  • The Lucene community will get X.
  • The Lucene community may get a good contributor or committer down the road.

As facilitators of this, we will try hard to work with academia and teach them “open-source ways”, which includes teaching how to work effectively with the specific open-source community (to the extent this is permitted by one’s academic institution), so that the research and the real-world needs are aligned.

So, at this point we are looking for suggestions of various interesting and practical advanced topics that have both an academic and an industry facet.  And, with this debut blog post, we are specifically turning to the IR/Lucene/Solr community at large to make suggestions.  Please add your suggestions to the Lucene / Solr Wishlist public spreadsheet, but please keep in mind that we are looking for advanced functionality, not features that would be too simple or small as research/thesis topics.  Feel free to pass the link to friends and colleagues you think would be interested in this or may want to make suggestions.

Thank you!

5 Responses to Lucene / Solr for Academia: PhD Thesis Ideas

  1. Pingback: Call for PhD ideas for Lucen/Solr implementation « LingPipe Blog

  2. tommy says:

    This is a great idea!

    I recently completed my MS at UC Irvine, where I wrote my thesis on part of a query tagger for search queries. I implemented query segmentation and classification as a Solr component and used it on a search engine for federal research grants, http://researchwatch.net. A search query like “energy stanford university” would get transformed into the filter query: Grant Abstract: “energy” and Organization: “stanford university”

    I’m a little busy with some projects but I plan to open source the library.

    • sematext says:

      Great. I think this is a good example of something that could be done as research/thesis (how to do query tagging well) with a concrete implementation (a Solr SearchComponent, I’m guessing, in this case).
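To make the query tagging idea from this thread concrete, here is a minimal sketch of segmentation plus classification in Python. It is not tommy's implementation and not a Solr SearchComponent; it assumes a small hypothetical lexicon of organization names and a greedy longest-match rule, and it only reproduces the display form of the filter query quoted above (the field labels "Grant Abstract" and "Organization" are taken from that comment, and are not assumed to be valid Solr field names).

```python
# Hypothetical lexicon of known organization names (an assumption for
# illustration; a real tagger would likely learn or load this from data).
KNOWN_ORGANIZATIONS = {"stanford university", "uc irvine"}


def tag_query(query, organizations=KNOWN_ORGANIZATIONS):
    """Segment a query and label each segment with a field name.

    Greedy longest-match: at each position, take the longest run of tokens
    that is a known organization; otherwise treat the single token as
    abstract text.
    """
    tokens = query.lower().split()
    segments = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate span first, then shorter ones.
        for j in range(len(tokens), i, -1):
            if " ".join(tokens[i:j]) in organizations:
                match_end = j
                break
        if match_end is not None:
            segments.append(("Organization", " ".join(tokens[i:match_end])))
            i = match_end
        else:
            segments.append(("Grant Abstract", tokens[i]))
            i += 1
    return segments


def to_filter_query(segments):
    """Render tagged segments as a fielded query string."""
    return " AND ".join(f'{field}: "{text}"' for field, text in segments)


print(to_filter_query(tag_query("energy stanford university")))
```

A lexicon lookup is the simplest possible classifier; the research question raised in the reply above (how to do query tagging well) is precisely what would replace it.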

  3. Govind says:

    Idea: Luke is a great tool. To complement it, a tool that monitors which queries are being executed, what resources they use, and what the top 10 queries are would be a good addition, similar to SQL Server DMV or Oracle V$Session.

  4. Pingback: Google Summer of Code Student Sponsoring « Sematext Blog
