As you may know, Sematext runs a service we internally call SPM – Scalable Performance Monitoring, a currently-still-free SaaS for monitoring performance of Solr, HBase, and soon a few other technologies we often help our clients with. One of the things we monitor for Solr and other search technologies is the size of the index. We monitor it by periodically checking its size, number of documents in it, number of deleted documents, number of index segments, files, etc.
Recently, we had an internal discussion about how to best report the index size when the index changes over time and decided we’d ask people who run Solr (or ElasticSearch or Sensei or…) – you – what you would like to see in this report.
For example, imagine that in some 5-minute time period (say 10:00 AM to 10:05 AM) we check the index 5 times (in reality we do it much more frequently) and each time we find the index has a different number of documents in it: 10, 15, 20, 25, and finally 30 documents. Now imagine this data as a graph showing the number of indexed documents over time, but with the smallest time period shown being a 5-minute interval.
At this point the question we have for you is: How many documents should this graph report for our example 10:00 – 10:05 AM period above? Should it show the minimum – 10? Average (mean) – 20? Median – 20? Maximum – 30? Something else? Minimum, average, and maximum – 10, 20, 30?
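To make the choices concrete, here is a minimal Python sketch that computes each candidate value for the example samples above. It is only an illustration, not how SPM does it: `summarize_bucket` and `doc_counts` are hypothetical names, and the sketch assumes all samples for one 5-minute period have already been collected into a list.

```python
from statistics import mean, median


def summarize_bucket(samples):
    """Return candidate summary values for one time bucket of samples.

    Hypothetical helper for illustration only; not part of SPM.
    """
    if not samples:
        raise ValueError("bucket has no samples")
    return {
        "min": min(samples),
        "avg": mean(samples),
        "median": median(samples),
        "max": max(samples),
        "last": samples[-1],  # most recent reading in the period
    }


# Document counts observed between 10:00 AM and 10:05 AM in the example.
doc_counts = [10, 15, 20, 25, 30]
summary = summarize_bucket(doc_counts)
print(summary)
```

For a steadily growing index like this one, the average and the median coincide, so the real choice is between a single number (and which one) and a min/avg/max band. The `last` value is one more possible answer to "something else": it reports the state of the index at the end of the period.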
Any feedback and suggestions you give us regarding this will be greatly appreciated – thanks!
We use ElasticSearch more and more here at Sematext (we have a number of ElasticSearch projects right at this moment, some of them quite massive in terms of data and/or query volume). In our work we typically have only 1 ElasticSearch instance/only 1 JVM running ElasticSearch on each server in the cluster.
How about you? Do you run multiple ElasticSearch instances/JVMs per box in production?
The Lucene and Solr projects merged recently, as we mentioned in Solr Digest and Lucene Digest for March 2010. Today, their -dev mailing lists finally merged. Since Sematext runs the search-lucene.com service that makes these lists (and more) searchable, we need to decide how to handle this relatively drastic change.
We’ve identified 2 options, and we need your input to help us decide what the right option is:
We can add a new lucene-dev list and start indexing it. This would contain only the new lucene-dev content (for both Lucene and Solr development from today on). The downside is that if you wanted to include old lucene-dev messages or old solr-dev messages in your search, you would have to explicitly select those lists. We could rename them to lucene-dev-old and solr-dev-old, for example, so the UI would show lucene-dev, lucene-dev-old, and solr-dev-old. You’d have total control over what you want searched, but you would have to make your choices explicitly, which also means people would have to understand what those -old lists are about and why there is no solr-dev.
We could merge the old solr-dev and old lucene-dev, and have a single lucene-dev that has both of those lists’ old messages (up to today), as well as all the new messages from the merged lucene-dev list from here on. In effect, it would look as if Lucene and Solr always had a single lucene-dev list, since all of the old lucene-dev and solr-dev content would be in this new lucene-dev. If we go this route, there would be no lucene-dev-old or solr-dev-old in the UI, just one lucene-dev choice. But there also wouldn’t be a solr-dev choice in the UI, since that list doesn’t exist any more, which may be confusing. Thus, when you choose to search Solr, you wouldn’t see a solr-dev facet in the UI, but the lucene-dev list’s content would be searched, so you wouldn’t actually miss any matches.
If there is a 3rd or 4th option that we missed, please let us know via comments!