JOB: Elasticsearch / Lucene Engineer (starts in the Netherlands)

In addition to looking for an Elasticsearch / Solr Engineer to join the Sematext team, we are also looking for an Lucene / Elasticsearch Engineer in EU for a specific project.  This project calls for 6 months of on-site work with our client in Netherlands.  After 6 months the collaboration with our client would continue remotely if there is more work to be done for the client or, if the client project(s) are over, this person would join our global team of Engineers and Search Consultants and work remotely (we are all very distributed over several countries and continents). This is a position focused on search – it involves working with Elasticsearch, but also requires enough understanding of Lucene to allow one to write custom Elasticsearch/Lucene components, such as tokenizers, for example. Here are some of the skills one should have for this job:

  •  knowledge of different types of Lucene queries/filters (boolean, spans, etc.) and their capabilities
  •  experience in extending out-of-the-box Lucene functionality via developing custom queries, scorers, collectors
  •  understanding of Lucene document analysis in the process of indexing, experience in writing custom analyzers
  •  experience in mapping advanced hierarchical data structures to Lucene fields
  •  experience in scalable distributed open-source search technologies such as Elasticsearch or Solr

The above is not much information to go by, but if this piqued your interest and if you think you are a good match, please fix up your resume and send it to jobs@sematext.com quickly.

JOB: Elasticsearch / Solr Engineer

We’ve grown nicely this year.  Our team has a new UI Developer, a new Solr/Elasticsearch Engineer, a new Marketing person, a new Automation Engineer, and this summer we have the first ever Intern.

Like all healthy organizations, we keep growing, and we are now looking for good Search Engineers who know Elasticsearch and/or Solr to join our geographically distributed search consulting team.  You will work remotely, from wherever you are, with smart people spread out across the planet and with an amazing array of companies world-wide on projects that range from just a week or two to several months.

At Sematext, we’ve built several exciting products – from smaller, search-focused products that work with Solr and Elasticsearch, to larger ones like SPMSearch Analytics, and most recently Logsene.  While not building products and running services, we help organizations world-wide with their search and big data needs – from fixing issues and providing production support to building complex search systems from scratch.  Our client list is long with a number of household names on it – from Instagram (Facebook) and Tumblr (Yahoo), Etsy and Shutterstock, to The BBC, Elsevier, Lockheed Martin, Reuters, Library of Congress, etc.  We did this without raising any money.  The demand for our products and services is growing and we are looking for good engineers and good people to join our adventure!

More formally:

Sematext is looking for a responsible, professional individual to join our team of search engineers.

Sematext is a New York-based startup with people spread over multiple continents and several hundred customers from Instagram and Tumblr, Etsy and Shutterstock, to The BBC, Elsevier, Lockheed Martin, Reuters, Library of Congress, etc. We’ve built systems handling over 10,000 QPS and have worked with multi-billion document indices. Our core products are:

In addition to the above products we offer consulting services around open source search and big data.

We are looking for a person who is:

  • Enthusiastic and positive
  • Driven, independent, and professional
  • A good communicator, both written and oral
  • Good with Solr and/or Elasticsearch and is hungry to learn more
  • Enjoys helping organizations make the best out of search

As a member of our search team you will get to:

  • Interact with clients world-wide
  • Provide guidance, architecture design, implementation, and support
  • Participate in Solr, Lucene, and Elasticsearch user and development communities
  • Work on Sematext’s search and data analytics products and participate in open-source search projects

This position:

  • Offers a lot of independence, learning, and growth
  • May require a bit of travel here and there, typically in the US and Europe
  • Is open world-wide

Our search team members have written several books about search, regularly give talks at conferences, blog, and participate in open-source projects.
For more info, see 19 things you may like about Sematext.

Interested? Please send your resume to jobs@sematext.com.

For other job openings please see Jobs @ Sematext or even our previous job listings.

Presentation and Video: Side by Side with Solr and Elasticsearch

Fresh from Berlin Buzzwords where Sematext‘s own Radu Gheorghe and Rafal Kuc presented “Side by Side with Solr and Elasticsearch” on the same stage, at the same time…but in different colors.  The talk included live demos, graphing, stats, and hints at juicy things to come.  Needless to say — if you deal with Solr and Elasticsearch then there are great insights to be found here!

Here is the presentation:

 

And here is the video:

 

Want to Be on Stage Somewhere Like Radu and Rafal Talking About Solr and Elasticsearch?

Or maybe you don’t want the spotlight — that’s cool too.  But…if you do enjoy performance monitoring, log analytics, or search analytics, working with projects like Elasticsearch, Solr, HBase, Hadoop, Kafka, and Storm, then drop us a line.  We’re hiring planet-wide!  Front end and JavaScript Developers, Developer Evangelists, Full-stack Engineers, Mobile App Developers…get in touch!

Enjoy!

Podcast: Tools to Monitor Solr, Manage Logs & Analyze Search Trends

Sematext Founder & President Otis Gospodnetic recently spoke with LucidWorks Chief of Product, Will Hayes as part of their SolrCluster podcast series.  Otis and Will discussed tools that Sematext has built to help monitor Solr and other stacks, manage and analyze logs, and analyze search trends.  They also discuss Solr/SolrCloud and Elasticsearch, their APIs, developer friendliness, as well as the general direction that search and big data industry leaders are moving toward around data acquisition and discovery as data increasingly grows.

Go here to listen to the podcast.  It runs about 36 minutes.  Enjoy!

Berlin Buzzwords 2014 – Side by Side with Elasticsearch and Solr

Last year at Berlin Buzzwords two Sematext Engineers had the opportunity to give two talks. Radu talked about “JSON Logging with Elasticsearch” (video, slides) and Rafał did the second round of Solr vs Elasticsearch in his talk “Battle of the Giants, round 2” (video, slides). We were also happy to be sponsoring Berlin Buzzwords 2013. This year, we decided to go for a talk where two of us can talk on the same stage, at the same time. On Tuesday, 27th of May, at 11:30, in the Frannz Club Radu and Rafał will be giving a talk called “Side by side with Solr and Elasticsearch“.

side by side

Solr – established, mature and well known open-source search server, commonly used. Elasticsearch – still young, but quickly gaining popularity, with over 200k downloads per month. Both search servers are based on Lucene – the open-source full text searching Java library, but each with their own extensions, their pros and cons.

We all know that Solr and Elasticsearch are different, but what those differences are and which solution is the best fit for a particular use case is a frequent question. We will try to make those differences clear, not by showing slides and comparing them, but by showing on online demo of both Elasticsearch and Solr:

  • Set up and start both search servers. See what you need to prepare and launch Solr and Elasticsearch.
  • Index data right after the server was started using the “schemaless” mode
  • Create index structure and modify it using the provided API
  • Explore different query use cases
  • Scale by adding and removing nodes from the cluster, creating indices and managing shards. See how that affects data indexing and querying.
  • Monitor and administer clusters.  See what metrics can be seen out of the box, how to get them and what tools can provide you with the graphical view of all the goodies that each search server can provide.

If you want to come, hear about both Solr and Elasticsearch from @sematext and how to achieve similar things, what how they behave and don’t see too many slides, come join us :)

Elasticsearch Server by Rafal Kuc & Marek Rogozinski – now updated!

Use Elasticsearch now?  Thinking about using Elasticsearch?  Wish there was a comprehensive resource that pulled everything you ever wanted to know about Elasticsearch together in one place?  Fret not — you are in luck!

All Elasticsearch, all the time

Sematext engineer Rafał Kuć has co-authored (with Marek Rogozinski) not one, but two(!) different Elasticsearch books: Elasticsearch Server and Mastering Elasticsearch.  Considering that Elasticsearch has only been around a few years — not to mention how much is going on under the hood — it’s a pretty impressive accomplishment.  Even more impressive?  Rafal and Marek have just published a second edition of Elasticsearch Server that encompasses all the changes between Elasticsearch 0.20 and 1.0.  So if you wish you knew more about Elasticsearch, look no further.

Here’s a brief Q&A with Rafal to add some insight:

Q:  What has changed since the first edition of Elasticsearch Server?

A:  After releasing the first edition of the book, which happened to be the first book about Elasticsearch, we got a nice amount of comments and suggestions which we took into consideration when writing the second edition.  The first edition was based on Elasticsearch 0.20, so we already had a lot of material to work with when we were asked to write the second edition and take readers up to version 1.0.  Some of the features we decided to write about were aggregations, new function queries allowing extensive score control, snapshotting, and others.  Some features that are still used by Elasticsearch users, like faceting, did not need much updating.  But others, like percolator, had to be completely rewritten.

Q:  How much work was it?

A:  We tried to make the book as good as we could so the readers could enjoy it and learn from it.  And believe me, we both learned a lot during the writing of the first edition of the book and while writing Mastering Elasticsearch. We had a lot of comments both from the readers and from people working on the book’s Japanese translation.  Thanks Jun!

We incorporated all the comments and suggestion, but it took time, of course. We also wanted to fully restructure the book so that it flowed better.  Hopefully we achieved that. Of course, in addition to all that we had to rewrite major parts of the book to bring it up to date, review all the parts that we decided to leave in the book and make updates as needed, and then write the new sections.

Q:  Where can someone buy it?

A:  You can buy it from Amazon or direct from Packt Publishing.

Parameterizing Queries in Solr and Elasticsearch

We all know how good it is to have abstraction layers in software we create. We tend to abstract implementation from the method contracts using interfaces, we use n-tier architectures so that we can abstract and divide different system layers from each other. This is very good – when we change one piece, we don’t need to touch the other parts that only knew about method contracts, API’s, etc. Why not do the same with search queries? Can we even do that in Elasticsearch and Solr? We can and I’ll show you how to do that.

Read more of this post

Encrypting Logs on Their Way to Elasticsearch Part 2: TLS Syslog

In part 1 of the “encrypted logs” series we discussed sending logs to Elasticsearch over HTTPS. This second part is about TLS syslog.

If you wonder what this has to do with Elasticsearch, the point is that TLS syslog is a standard (RFC-5425): any decent version of rsyslog, syslog-ng or nxlog works with it. So you can forward logs over TLS to a recent, “intermediary” rsyslog. Then, you can either use omelasticsearch with HTTPS to ship your logs to Elasticsearch, or you can install rsyslog on an Elasticsearch node (and index logs over HTTP to localhost).

Such a setup will give you the following benefits:

  • it will work with most syslog daemons, because TLS syslog is so widely supported
  • the “intermediate” rsyslog can act as a buffer, taking that pressure off your application servers
  • the “intermediate” rsyslog can be used for processing, like parsing CEE-formatted JSON over syslog. Again, taking load off your applicaton servers

Our log analytics SaaS, Logsene, gives you all the benefits listed above through the syslog endpoint:

TLS syslog flow in Logsene

Client Setup

Before you start, you’ll need a Certificate Authority’s public key, which will be used to validate the encryption certificate from the syslog destination (more about the server side later).

If you’re using Logsene, you can download the CA certificates directly. If you’re on a local setup, or you just want to consolidate your logs before shipping them to Logsene, you can use your own certificates or generate self-signed ones. Here’s a guide to generating certificates that will work with TLS syslog.

With the CA certificate(s) in hand, you can start configuring your syslog daemon. For example, the rsyslog configuration can look like this:

module(load="imuxsock")  # listens for local logs on /dev/log

global (  # global settings
 defaultNetstreamDriver="gtls"  # use TLS driver when it comes to transporting over TCP
 defaultNetstreamDriverCAFile="/opt/rsyslog/ca_bundle.pem"  # CA certificate. Concatenate if you have more
)

action(  # how to send logs
  type="omfwd"                                    # Forward them
  target="logsene-receiver-syslog.sematext.com"   # to Logsene's syslog endpoint
  port="10514"                                    # on port X
  protocol="tcp"                                  # over TCP
  template="RSYSLOG_SyslogProtocol23Format"       # using the RFC-5424 syslog format
  StreamDriverMode="1"                            # via the TLS mode of the driver defined above.
  StreamDriverAuthMode="x509/name"                # Request the machine certificate of the server
  StreamDriverPermittedPeers="*.sematext.com"     # and based on it, just allow Sematext hosts
)

This is the new-style configuration format for rsyslog, that works with version 6 or above. For the pre-v6 format (BSD-style), check out the Logsene documentation. You can also find the syslog-ng equivalent there.

Server Setup

If you’re using Logsene, you might as well stop here, because it handles everything from buffering and indexing to parsing JSON-formatted syslog.

If you’re consolidating logs before sending them to Logsene, or you’re running your local setup, here’s an excellent end-to-end guide to setting up TLS with rsyslog. The basic steps for the server are:

  • use the same CA certificates as the client, so they have the same basis
  • generate the machine public-private key pair. You’ll have to provide both in the rsyslog configuration
  • set up the TLS rsyslog configuration

Explore

Once you start logging, the end result should be just like in part 1. You can use Logsene’s hosted Kibana, your own Kibana or the Logsene UI to explore your logs:

Logsene Screnshot

As always, feel free to contact us if you need any help:

Encrypting Logs on Their Way to Elasticsearch

Let’s assume you want to send your logs to Elasticsearch, so you can search or analyze them in realtime. If your Elasticsearch cluster is in a remote location (EC2?) or is our log analytics service, Logsene (which exposes the Elasticsearch API), you might need to forward your data over an encrypted channel.

There’s more than one way to forward over SSL, and this post is part 1 of a series explaining how.

update: part 2 is now available!

Today’s method is about sending data over HTTPS to Elasticsearch (or Logsene), instead of plain HTTP. You’ll need two pieces to achieve this:

  1. a tool that can send logs over HTTPS
  2. the Elasticsearch REST API exposed over HTTPS

You can build your own tool or use existing ones. In this post we’ll show you how to use rsyslog’s Elasticsearch output to do that. For the API, you can use Nginx or Apache as a reverse proxy for HTTPS in front of your Elasticseach, or you can use Logsene’s HTTPS endpoint:

Rsyslog Configuration

To get rsyslog’s omelasticsearch plugin, you need at least version 6.6. HTTPS support was just added to master, and it’s expected to land in version 8.2.0. Once that is up, you’ll be able to use the Ubuntu, Debian or RHEL/CentOS packages to install both the base rsyslog and the rsyslog-elasticsearch packages you need. Otherwise, you can always install from sources:
– clone from the rsyslog github repository
– run `autogen.sh –enable-elasticsearch && make && make install` (depending on your system, it might ask for some dependencies)

With omelasticsearch in place (the om part comes from output module, if you’re wondering about the weird name), you can try the configuration below to take all your logs from your local /dev/log and forward them to Elasticsearch/Logsene:

# load needed input and output modules
module(load="imuxsock.so") # listen to /dev/log
module(load="omelasticsearch.so") # provides Elasticsearch output capability

# template that will build a JSON out of syslog
# properties. Resulting JSON will be in Logstash format
# so it plays nicely with Logsene and Kibana
template(name="plain-syslog"
         type="list") {
           constant(value="{")
             constant(value="\"@timestamp\":\"")
                 property(name="timereported" dateFormat="rfc3339")
             constant(value="\",\"host\":\"")
                 property(name="hostname")
             constant(value="\",\"severity\":\"")
                 property(name="syslogseverity-text")
             constant(value="\",\"facility\":\"")
                 property(name="syslogfacility-text")
             constant(value="\",\"syslogtag\":\"")
                 property(name="syslogtag" format="json")
             constant(value="\",\"message\":\"")
                 property(name="msg" format="json")
             constant(value="\"}")
         }

# send resulting JSON documents to Elasticsearch
action(type="omelasticsearch"
       template="plain-syslog"
 # Elasticsearch index (or Logsene token)
       searchIndex="YOUR-LOGSENE-TOKEN-GOES-HERE"
 # bulk requests
       bulkmode="on"  
       queue.dequeuebatchsize="100"
 # buffer and retry indefinitely if Elasticsearch is unreachable
       action.resumeretrycount="-1"
 # Elasticsearch/Logsene endpoint
       server="logsene-receiver.sematext.com"
       serverport="443"
       usehttps="on"
)

Exploring Your Data

After restarting rsyslog, you should be able to see your logs flowing in the Logsene UI, where you can search and graph them:

Logsene Screnshot

If you prefer Logsene’s Kibana UI, or you run your own Elasticsearch cluster, you can run make your own Kibana connect to the HTTPS endpoint just like rsyslog or Logsene’s native UI do.

Wrapping Up

If you’re using Logsene, all you need to do is to make sure you add your Logsene application token as the Elasticsearch index name in rsyslog’s configuration.

If you’re running your own Elasticsearch cluster, there are some nice tutorials about setting up reverse HTTPS proxies with Nginx and Apache respectively. You can also try Elasticsearch plugins that support HTTPS, such as the jetty and security plugins.

Feel free to contact us if you need any help. We’d be happy to answer any Logsene questions you may have, as well as help you with your local setup through professional services and production support. If you just find this stuff exciting, you may want to join us, wherever you are.

Stay tuned for part 2, which will show you how to use RFC-5425 TLS syslog to encrypt your messages from one syslog daemon to the other.

Video and Presentation: Indexing and Searching Logs with Elasticsearch or Solr

Interested in log indexing using Elasticsearch or Solr?  Also interested in searching and analyzing logs in real time?

This topic really hits home for us since we released our log analytics tool, Logsene and we also offer consulting services for logging infrastructure.  If you are reading this and looking for a new opportunity then you might be interested to hear that we are hiring worldwide.

If you are into logging like we are, then you will want to check out this presentation delivered by Sematext’s own Radu Gheorghe to the NYC Search, Discovery and Analytics Meetup held recently at Pivotal Labs.  For the purposes of this presentation the term “logs” ranges from server logs and application events to metrics and even social media information.

The presentation has three parts:

  1. Overview of logging tools that play nicely with Elasticseach and Solr (like Logstash, Apache Flume or rsyslog)
  2. Performance tuning and scaling Elasticsearch and Solr
  3. Demo of an end-to-end solution

Here you go – enjoy!

Follow

Get every new post delivered to your Inbox.

Join 1,637 other followers