Introducing NetMaps

New Year, New Feature in SPM!  We are happy to announce the immediate availability of NetMaps in SPM!  Check out why they are useful or watch the short video below.

Ever wondered how the different components of distributed apps are actually connected over the network? When it comes to troubleshooting distributed application stacks like Apache Kafka, Spark, Hadoop, Cassandra, Solr, or Elasticsearch (not to mention microservice architectures or Docker containers), information about the deployed infrastructure becomes critical. That architecture diagram you drew N months ago?  It’s probably out of date.  The apps we run today are often very dynamic. Instances, nodes, and containers come and go, whether because of elastic up/down scaling or other reasons.

Discovering This Dynamic Infrastructure

Watching the actual network traffic on all nodes could quickly answer many questions for DevOps engineers doing troubleshooting or planning setup changes. For example:

  • Which nodes are online and active?
  • How are nodes connected to other nodes?
  • What are the dependencies between network services?
  • What is the consumed bandwidth between nodes?
  • Which applications run on a specific network node?

Visualize Network Connections

Designed to visualize network connections and answer the above questions instantly, NetMaps also include:

  • Automatic Discovery of network nodes and applications
  • Filtering by application and host name
  • Automatic Visualizations as Network Map and Chord Diagrams
  • Interactive Explorer for following network links for each application node
  • Bandwidth consumption for all incoming and outgoing network connections
  • Navigation from the NetMap to all nodes and related performance metrics of the monitored App

The best practice is to activate network monitoring on all application server nodes that communicate with databases, message brokers, search engines, etc. That way it is easy to see how client applications communicate with backend servers.

NetMap “Map” View

NetMap_map

NetMap “Chord” View

NetMap_chord

It is very easy to activate Network Monitoring in the SPM Client, a collector for host and application metrics. Intelligent network filters keep the resource usage of network monitoring low while still capturing all the packets needed to explore your infrastructure in the “NetMap” tab in SPM. If you find network maps interesting, you might also be interested in SPM’s AppMap feature, which discovers relationships between monitored JVM applications such as Elasticsearch, Solr, Cassandra, Spark or Kafka.

We hope you like this new addition to SPM.  Got ideas how we could make it more useful for you?  Let us know via comments, email or @sematext.

Not using SPM yet? Check out the free 30-day SPM trial by registering here.  There’s no commitment and no credit card required.

SPM vs. New Relic APM – Performance Monitoring Solution Comparison

If you’ve found your way to this post then chances are high that you’re having second thoughts about diving into a New Relic APM subscription.  You’re not alone.  In fact, we hear from many fellow DevOps engineers looking at performance monitoring solutions who check out New Relic APM — or who are already using it — because it is so widely known, but wonder if there is a better tool, specifically with traits like:

  • Better pricing
  • On Premises deployment (not just SaaS)
  • Integration of metrics, logs and events in a single UI
  • Across-the-board anomaly detection

In all of the above cases — and others — SPM meets or exceeds New Relic APM.  This SPM vs. New Relic APM comparison document has all the details.

SPM_NR_header

So why not give SPM a try?  You can check out a free 30-day trial by registering here.  There’s no commitment and no credit card required.  Even better — combine SPM with Logsene to make the integration of performance metrics, logs, events and anomalies more robust for those looking for a single pane of glass.

MongoDB Monitoring Support

For many of us in the DevOps field, MongoDB is a critical part of our IT stack.  With today’s acquisition of WiredTiger, MongoDB is further establishing itself as the NoSQL DB built to support massive data processing and storage.  It would be an understatement to say that MongoDB does a lot, with many organizations using it as their backend storage framework, analytics backend, and so on.

So your MongoDB cluster really, really needs to be in tip-top shape.  All the time.  And if it’s not, then you need to know ASAP or, better yet, prevent problems before they kick in and make your life difficult.  That’s where SPM comes in, with MongoDB monitoring, alerting and anomaly detection.  MongoDB exposes a boatload of metrics, but instead of just throwing all of them on endless charts, we’ve taken the time to cherry-pick what we think are the top 50 most valuable MongoDB metrics to monitor. We have also made it possible to filter the MongoDB metrics by server, as well as by database and collection where possible.

The key metric groups we track are:

  • Database Operations
  • Database Memory
  • Database Storage
  • Documents
  • Locks
  • Network
  • Database Journal
  • Background Flushes
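
Many of the counters behind groups like Database Operations are cumulative, so a chart needs the change per second rather than the raw value. The snippet below is only a rough sketch of that arithmetic and not how SPM itself is implemented. The helper name `opRates` is hypothetical, and the sample objects merely imitate the `opcounters` part of MongoDB's `serverStatus` output with made-up numbers.

```javascript
// Hypothetical helper: turns two cumulative counter samples into per-second rates.
// `prev` and `curr` imitate the `opcounters` sub-document of serverStatus.
function opRates(prev, curr, elapsedSeconds) {
  if (elapsedSeconds <= 0) {
    throw new RangeError('elapsedSeconds must be positive');
  }
  const rates = {};
  for (const op of Object.keys(curr)) {
    const before = prev[op] || 0;
    // Counters start again from zero when mongod restarts;
    // treat a drop as a fresh start instead of a negative rate.
    const delta = curr[op] >= before ? curr[op] - before : curr[op];
    rates[op] = delta / elapsedSeconds;
  }
  return rates;
}

// Made-up samples taken 10 seconds apart.
const earlier = { insert: 100, query: 2000, update: 50, delete: 5 };
const later = { insert: 150, query: 2600, update: 50, delete: 6 };
console.log(opRates(earlier, later, 10));
```

The same delta-over-interval idea applies to any cumulative counter, such as network bytes or journal commits.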

The Overview below provides 9 charts with key MongoDB metrics:

  • Row 1 displays CPU, Memory and Disk Metrics
  • Row 2 displays Database Operations, Database Memory and Database Storage Metrics
  • Row 3 displays Collection/Document Metrics, Lock counters and wait times, and Network Metrics for MongoDB

MongoDB_Overview

SPM for MongoDB Overview

If you monitor a MongoDB cluster, the Server tab provides a quick overview of the health of each node:

MongoDB_Server_view

SPM Server View

The Reports on the left side of the screen below provide detailed information for each group of metrics. Let’s have a quick look at them.

MongoDB_CPU_details

OS Metrics: CPU Metrics, Memory Usage, Disk Space and I/O

Below is an example of some of the key MongoDB Metrics found in SPM:

  • Database Operations: Counters for Queries, Insert, Update, Delete and other commands for the main database plus replica operations
  • Database Memory: Resident-, Virtual-, Mapped-, and Journal Memory
  • Database Storage: Size of Data Files, Namespace Files, DB Files etc., plus Size of Objects, Number of Collections and Objects

MongoDB_Storage

MongoDB Storage & Collections

The screenshot below shows:

  • Documents: Counters for Documents inserted, updated or returned by queries
  • Locks: Lock counters and lock acquisition wait times at the Global, Database, Collection and Journal levels. Since MongoDB 3.x, locks are not always global, so SPM shows a breakdown for all lock types. These metrics are good candidates for alerting when anomalies are detected.  Simply add an alert from the menu in the top-left corner of each chart.

MongoDB_Locks

Metrics for all MongoDB Locks

Other key MongoDB metrics that SPM displays are:

  • Network: Number of client connections, Received and transmitted data, Request rate
  • Database Journal: Commits, Early Commits,  Commit times and lock times

MongoDB_Journal_Commits

MongoDB Journal Metrics

If you would like to see MongoDB metrics together with the Top Node.js Metrics, you can put MongoDB metrics and Node.js metrics from SPM for Node.js in a custom dashboard:

MongoDB_Locks-Node.js_Loop

SPM Custom Dashboard with MongoDB Locks and Node.js Event Loop Latency

We hope you like this new addition to SPM.  Got ideas how we could make it more useful for you?  Let us know via comments, email or @sematext.

Not using SPM yet? Check out the free 30-day trial by registering here.  There’s no commitment and no credit card required.  Even better — combine SPM with Logsene to make the integration of performance metrics, logs, events and anomalies more robust for those looking for a single pane of glass.

Docker + Solr How-to: Monitoring the Official Solr Docker Image

The official Solr Image on Docker Hub was released just a few weeks ago and already has 16K pulls. Why not more? Well, there are more than 200 different Solr images on Docker Hub — probably because no official Image was available!

A rapidly growing number of organizations are using Solr and Docker in production and they are probably happy about the new official Image. Needless to say, monitoring Solr is essential in production. Docker is disruptive in many ways, and there are many things that are slightly different and worth mentioning.  These include:

  1. Changed deployment for Solr and its monitoring tools using Dockerfile, Docker Compose or various Orchestration Tools
  2. There is a new Layer to monitor: Container Metrics and Events, see: Docker Events and Metrics monitoring and SPM for Docker
  3. Logging has changed: containers log to the console, and logs need to be retrieved from the Docker daemon instead of getting them from the Solr log file.  Check out our post on the subject: Innovative Docker Log Management
  4. Official Images may not provide options for monitoring (such as JMX).  However, the official Image for Solr provides an option to pass parameters to the Java Runtime Environment.  We will use this option for Solr monitoring in this post.

Next, I’m going to demonstrate the setup of a Solr node with SPM. The final setup will provide the full Solr & Docker Monitoring and Logging package:

  • Detailed Application Metrics for Solr, deployed on Docker
  • Detailed Container Metrics and Docker Events
  • Centralized Logs for all Containers by SPM for Docker

Let’s first decide on one of the following options to monitor Solr on Docker:

  1. Build your own Solr container with a mix of open-source monitoring/alerting tools. I’m not going to go into detail about this option today because dealing with a mix of open-source DevOps tools and a non-official Solr image doesn’t sound clean; plus, we can do better.
  2. Use a standalone monitoring agent, which queries metrics from the Solr container. This requires setting up JMX and Docker networking for both the monitor and Solr. The metrics gathered by remote agents are limited and, in the Docker context, running an external monitoring process alongside the Solr processes consumes more resources.  Which brings us to the next option.
  3. Inject an SPM in-process monitoring agent into Solr. This option has the lowest resource usage and has support for advanced monitoring functions like Transaction Tracing and AppMap.

We’ll go with Option #3 in this blog post, as it provides the best insights into Solr.  Sematext provides the SPM Client (this includes the monitoring agent and metrics sender) pre-installed in a Docker Image.  We refer to this dockerized SPM Client as “SPM Client Image/Container” in the following instructions.  The main trick here is to mount a volume from SPM Client Container into Solr Containers in order to load the monitoring library that’s part of the SPM Client Container.

Let’s have a look at the desired setup and how to get there:

SPM-Solr-Docker-Schema
Monitoring Setup

We’ll use the latest Docker-Compose version (> v1.5) because we can then use environment variable substitution in Docker-Compose.

1) Configure and start SPM-Client Container

The SPM Token is a unique identifier for monitored applications – if you haven’t created an SPM App for Solr, then create one here first. Should take about 37 seconds.

# Set the SPM Token as Environment Variable
export SPM_TOKEN=4feb144c-4da8-4081-83b5-b0b8e06e743a
# Set the JVM Name, which appears in SPM JVM Metrics Report
# In addition we will use it as Hostname for the Solr container
export JVM_NAME=SOLR1

2) Create SPM Client and Solr service in docker-compose.yml

Note: you may copy this file to make changes for additional Solr options; all parameters are set as Environment Variables.

# SPM Client container: provides the monitoring agent and metrics sender
spm-client-solr:
  image: sematext/spm-client
  container_name: spm-client-solr
  hostname: spm-client-solr
  environment:
    - SPM_CONFIG=${SPM_TOKEN} solr javaagent ${JVM_NAME}

# Solr container: loads the monitoring library from the SPM Client volume
SOLR1:
  image: solr
  hostname: solr1
  ports:
    - "8983:8983"
  volumes_from:
    - spm-client-solr
  environment:
    - SOLR_OPTS=-Dcom.sun.management.jmxremote -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-solr.jar=${SPM_TOKEN}::${JVM_NAME}
  command: bin/solr -f

In the Environment variable “SOLR_OPTS” in the Docker-Compose file above we see options for the SPM in-process monitor to inject a .jar file from the SPM Client Volume.  The SOLR_OPTS string is taken from SPM install instructions.  It includes the SPM Token (the ${SPM_TOKEN} part) and provides the JVM name so we can distinguish between multiple Solr instances if we run N of them on the same host (the ::${JVM_NAME} part).

3) Run Solr and SPM Monitor  

We are now ready to fire up Solr:

    docker-compose up -d

Solr_image_code

All done! After about a minute, metrics for the Docker Host, JVMs and Solr nodes will appear in SPM.  Because we chose consistent naming for the Container hostname and the JVM name, we can immediately see the relevant filters named “SOLR1” in every chart.  This is much better than some random Container IDs.

Solr_image_screen_4

Solr Metrics Overview

But what about my Solr Logs and the Container Metrics?

Simply run SPM for Docker: it collects logs as well as container and host metrics.  It can also parse Solr logs and store them in Logsene (see Logsene 1-Click ELK Stack), which is awesome because it means you can have both Solr/OS/JVM metrics AND Solr logs all in one place!  Or do you maybe like to ssh to your servers and grep log files?

Docker Logs & Metrics Steps:

First we create the SPM App with the type “Docker” for Docker-specific metrics and then we create a Logsene App for our logs. Then we use the generated App Tokens to run Sematext Agent for Docker.

docker run -d --name sematext-agent -e SPM_TOKEN=SPM_DOCKER_APP_TOKEN -e LOGSENE_TOKEN=LOGSENE_APP_TOKEN sematext/sematext-agent-docker

After a few minutes, you will get Host and Container Metrics together with Events and Logs in SPM, as shown here:

Solr_image_screen_2

Please note that logs from the containers are automatically shipped and parsed! No setup for log shippers? That is correct — there is NO complicated setup of syslog, Logstash, Docker log drivers, etc.  All this work is done by SPM for Docker. For example, each log line has a “node_name” field for the Solr node. It takes the timestamp, severity, class, thread and source from the Solr log and each log is automatically tagged with the container ID and image name. Moving from SPM Metrics to detailed Solr Logs including Exceptions and parsed Stack Traces is just another mouse click away! Look:

Solr_image_screen_3
Multi-Line Exception, captured and parsed from Solr container

 

solr-logsene

The filters next to field stats on the right side of the screen make it easy to identify containers with the most logs by choosing “container_name”.  That’s just a little detail in the Logsene UI – feel free to explore it by creating Alerts or Kibana 4 Dashboards for your container logs.

Like what you saw here? To monitor Docker and Solr with SPM just get a free account here!  And drop us an email or hit us on Twitter with suggestions, questions or comments.  Solr and Docker are topics we enjoy chatting about with the community!

Top Node.js Metrics to Watch

Monitoring Node.js applications has special challenges. The dynamic nature of the language provides many “opportunities” for developers to produce memory leaks, and a single function blocking the event queue can have a huge impact on the overall application performance. Parallel execution of jobs is done with multiple worker processes, using the “cluster” functionality to take full advantage of multi-core CPUs. However, the master and worker processes belong to a single application, which means that they should be monitored together. Let’s take a closer look at the Top Metrics in Node.js applications to get a better understanding of why they are so important to monitor.

Note: All images in this post are from Sematext’s SPM Performance Monitoring solution and its Node.js integration.

Garbage Collection & Process Memory – Node.js is based on Google’s V8 JavaScript engine. Garbage Collection reclaims memory used by objects that are no longer required. V8 garbage collection stops program execution while it runs. Incremental GC cycles (scavenging) process only a part of the Heap and are very fast. Full GC cycles deal with objects that survived multiple Incremental GC runs. Full GC runs are executed less frequently to minimize pauses in the program execution.

With regard to garbage collection metrics, we should first measure the total time spent on garbage collection. In addition, it is useful to see how often a full GC cycle or an incremental GC cycle is executed. The size of heap memory can be compared with its size after the last GC run to see if there is a growing trend. That’s why the following metrics should be monitored:

  • Time consumed for garbage collection
  • Counters for full GC cycles
  • Counters for incremental GC cycles

Nodejs_garbage_collection

Garbage Collection Metrics
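
If you want to collect these three numbers yourself, one possible starting point is the `perf_hooks` module that ships with Node.js. The sketch below is not how SPM gathers its data; the names `recordGc` and `startGcObserver` are made up for this example, and it assumes that what this post calls incremental cycles (scavenges) are the ones Node.js reports as minor GCs.

```javascript
const { PerformanceObserver, constants } = require('perf_hooks');

// Running totals for the three metrics listed above.
const gcStats = { totalMs: 0, fullCycles: 0, incrementalCycles: 0 };

// Pure bookkeeping helper, kept separate so it can be tested without a real GC.
function recordGc(stats, kind, durationMs) {
  stats.totalMs += durationMs;
  if (kind === constants.NODE_PERFORMANCE_GC_MAJOR) {
    stats.fullCycles += 1;
  } else if (kind === constants.NODE_PERFORMANCE_GC_MINOR) {
    // Assumption: scavenges are reported by Node.js as minor GCs.
    stats.incrementalCycles += 1;
  }
  return stats;
}

// Subscribes to GC performance entries and feeds them into `stats`.
function startGcObserver(stats) {
  const observer = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      // Newer Node.js releases expose the GC kind on entry.detail, older ones on entry.kind.
      const kind = entry.detail ? entry.detail.kind : entry.kind;
      recordGc(stats, kind, entry.duration);
    }
  });
  observer.observe({ entryTypes: ['gc'] });
  return observer;
}
```

Calling `startGcObserver(gcStats)` once at application start and reading `gcStats` at a fixed interval would give the time and cycle counters; `observer.disconnect()` stops the collection.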

Aside from how often GC happens and how long it takes, we can measure the effect on memory by providing the following metrics:

  • Released memory between GC cycles (see above)
  • Process Heap Size and Heap Usage

Nodejs_process_memory

Process Memory Information
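
For a quick look at heap size and heap usage without any monitoring tool, Node.js exposes `process.memoryUsage()`. The following is a minimal sketch; the helper names `heapSnapshot` and `releasedMb` are invented for illustration, and comparing two snapshots only approximates the "released memory" metric because it does not know exactly when a GC cycle ran.

```javascript
const MB = 1024 * 1024;

// Snapshot of the memory figures reported by the Node.js runtime, in megabytes.
function heapSnapshot(usage = process.memoryUsage()) {
  return {
    heapTotalMb: usage.heapTotal / MB,
    heapUsedMb: usage.heapUsed / MB,
    rssMb: usage.rss / MB,
  };
}

// Memory released between two snapshots; a negative value means the heap grew.
function releasedMb(before, after) {
  return before.heapUsedMb - after.heapUsedMb;
}

console.log(heapSnapshot());
```

Taking a snapshot at a regular interval and plotting `heapUsedMb` is usually enough to spot the growing trend that hints at a memory leak.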

Event Loop – The secret of Node.js’s performance is that it uses async operations for I/O, so the process stays CPU bound; in that way the CPU can be highly utilized and doesn’t waste cycles waiting for I/O operations. This means a server can take many connections and will not be blocked while async operations are in flight. As soon as an operation is finished, callback functions are used to continue processing.  The implementation is based on a single event loop that runs all JavaScript callbacks on one thread, while the pending I/O is handled in the background. Using synchronous operations drags down performance because other operations need to wait to be executed.  That’s why the golden rule for Node.js performance is “don’t block the event loop”.

The metric to watch is the Latency to process the next event:

  • Slowest Event Handling (Max Latency)
  • Average Event Loop Latency
  • Fastest Event Handling (Min Latency)

Nodejs_slow_avg_fast

Slowest, average and fastest event processing

A high latency in the event loop might indicate the use of blocking (sync) or time-consuming functions in event handlers, which could impact the performance of the whole Node.js application.
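
One simple way to approximate event loop latency is to schedule a timer and measure how late its callback actually fires. The sketch below takes that approach; it is an illustration under that assumption and not the way SPM measures latency, and the names `summarizeLatencies` and `startLatencySampler` are hypothetical.

```javascript
const { performance } = require('perf_hooks');

// Reduce a list of latency samples (milliseconds) to the three metrics above.
function summarizeLatencies(samples) {
  if (samples.length === 0) {
    return { min: 0, avg: 0, max: 0 };
  }
  const sum = samples.reduce((total, value) => total + value, 0);
  return {
    min: Math.min(...samples),
    avg: sum / samples.length,
    max: Math.max(...samples),
  };
}

// Fires a timer every `intervalMs` and records how late each callback runs.
// After `reportEvery` samples the summary is passed to `onReport`.
function startLatencySampler(intervalMs, onReport, reportEvery = 10) {
  let samples = [];
  let expected = performance.now() + intervalMs;
  const timer = setInterval(() => {
    const now = performance.now();
    samples.push(Math.max(0, now - expected));
    expected = now + intervalMs;
    if (samples.length >= reportEvery) {
      onReport(summarizeLatencies(samples));
      samples = [];
    }
  }, intervalMs);
  timer.unref(); // monitoring alone should not keep the process alive
  return timer;
}
```

In a running server, `startLatencySampler(100, console.log)` would print a summary roughly once per second. A blocking call inside any event handler shows up as a jump in `max`.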

Cluster Mode and number of processes – To scale Node.js beyond the capacity of a single process, the use of master and worker processes is required: the so-called “cluster” mode. Master processes share sockets with the forked worker processes and can exchange messages with them. A typical use case for web servers is forking N worker processes, which operate on the shared server socket and handle the requests in round robin (since Node v0.12). In many cases programs set N to the number of CPUs the server provides, which is why a constant number of worker processes should be the regular case.  If this number changes, it means worker processes have been terminated for some reason.  In the case of processing queues, workers might be started on demand.  In this scenario it would be normal for the number of workers to change all the time, but it might be interesting to see how long a higher number of workers was active. Using a tool like SPM for Node.js lets one track the number of workers. When picking a monitoring solution or developing your own monitoring for Node.js, make sure it is capable of filtering by hostname and worker ID.  Keep in mind that Node.js workers can have a very short lifetime, which traditional monitoring tools may not be able to handle well.

Nodejs_worker_count

Worker Count
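
To make the worker count concrete, here is a small sketch of a master process that forks one worker per CPU and keeps its own counter, using the `fork` and `exit` events of the Node.js `cluster` module. The helper names `createWorkerTracker` and `startWorkers` are made up for this example, and the counter is just one possible way to track the number, not SPM's implementation.

```javascript
const cluster = require('cluster');
const os = require('os');

// Small counter that follows worker starts and exits.
function createWorkerTracker() {
  let active = 0;
  return {
    workerStarted() { active += 1; return active; },
    workerExited() { active = Math.max(0, active - 1); return active; },
    count() { return active; },
  };
}

// Forks `size` workers and keeps the tracker up to date.
// Intended to be called from the master process only.
function startWorkers(size = os.cpus().length) {
  const tracker = createWorkerTracker();
  cluster.on('fork', () => tracker.workerStarted());
  cluster.on('exit', (worker, code, signal) => {
    const left = tracker.workerExited();
    console.log(`worker ${worker.id} exited (${signal || code}); ${left} still running`);
  });
  for (let i = 0; i < size; i += 1) {
    cluster.fork();
  }
  return tracker;
}
```

Reporting `tracker.count()` at a fixed interval gives the time series shown in the chart above; a drop below the configured size points to terminated workers.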

For example, to compare event loop latency in different Node.js sub-processes, we need to be able to select workers we want to compare:

Nodejs_event_loop

Event Loop Latency for each Worker

Web Frameworks – There is a steadily growing number of frameworks to build web services using Node.js.  The most popular are: Express, Hapi.js, Restify, Mean.io, Meteor, and many more. When doing HTTP monitoring, here are some of the key metrics to pay attention to:

  • Response time (http/https)
  • Request rate
  • Error rates (total, error categories)
  • Request/Response content size

Nodejs_HTTP:HTTPS

HTTP/HTTPS Metrics Overview
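
The metrics above can be collected independently of the web framework by wrapping the request handler. The sketch below shows the idea for a plain Node.js handler; all names (`createHttpMetrics`, `recordResponse`, `averageResponseMs`, `instrument`) are hypothetical, and the response size is only known here when the handler sets a `Content-Length` header.

```javascript
// Counters for request rate, error rates, response time and content size.
function createHttpMetrics() {
  return { requests: 0, errors4xx: 0, errors5xx: 0, totalMs: 0, responseBytes: 0 };
}

// Pure bookkeeping for one finished response.
function recordResponse(metrics, statusCode, durationMs, bytes) {
  metrics.requests += 1;
  metrics.totalMs += durationMs;
  metrics.responseBytes += bytes;
  if (statusCode >= 500) {
    metrics.errors5xx += 1;
  } else if (statusCode >= 400) {
    metrics.errors4xx += 1;
  }
  return metrics;
}

function averageResponseMs(metrics) {
  return metrics.requests === 0 ? 0 : metrics.totalMs / metrics.requests;
}

// Wraps a (req, res) handler so that every finished response is recorded.
function instrument(handler, metrics) {
  return (req, res) => {
    const startedAt = process.hrtime.bigint();
    res.on('finish', () => {
      const durationMs = Number(process.hrtime.bigint() - startedAt) / 1e6;
      // Falls back to 0 when no Content-Length header was set (e.g. chunked responses).
      const bytes = Number(res.getHeader('content-length')) || 0;
      recordResponse(metrics, res.statusCode, durationMs, bytes);
    });
    handler(req, res);
  };
}
```

With the built-in `http` module this could be used as `http.createServer(instrument(handler, metrics))`. Dividing `requests` by the length of the reporting interval gives the request rate.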

Of course, Node.js apps don’t run in a vacuum.  They connect to other services, other types of applications, caches, data stores, etc.  As such, while it is important to know the key Node.js metrics, monitoring Node.js alone or separately from other parts of the infrastructure is not the best practice.  If there is one piece of advice I can give to anyone looking into (Node.js) monitoring it is this: when you buy a monitoring solution (or build one for your own use), make sure you end up with a solution that is capable of showing you the big picture.  For example, Node.js is often used with Elasticsearch (see Top 10 Elasticsearch Metrics to Watch post), Redis, etc.  Seeing metrics for all the systems that surround Node.js apps is precious.  Here is just a small example of a dashboard showing a few Node.js and Elasticsearch metrics together.

Nodejs_combined_dashboard

Combined Dashboard of Node.js and Elasticsearch Metrics

So, those are our top Node.js metrics — what are YOUR top 10 metrics? We’d love to know so we can compare and contrast them with ours in a future post.  Please leave a comment, or send them to us via email or hit us on Twitter: @sematext.

And if you’d like to try SPM to monitor Node.js yourself, check out a free 30-day trial by registering here.  There’s no commitment and no credit card required. Small startups, startups with no or very little outside funding, non-profit and educational institutions get special pricing: just get in touch with us.

[Note: this post originally appeared on Radar.com]

Death to APM and Logging Silos

Sematext has combined the power of SPM and Logsene in a single pane of glass – a unified view into all the key bits of operational intelligence every DevOps engineer needs: server and application performance metrics, logs, events, anomalies, alerts, ChatOps integrations, etc.  In other words, the whole is greater than the sum of its parts.

Metrics + Logs Correlation using SPM and Logsene Together in One UI

This video demonstrates how the SPM + Logsene combination solves the problems of having too much data to manage yourself and the disconnect when metrics and logs are siloed.  We address two of the most common problems — and their solutions — below the video.

Problem 1 – Big Data, Big Burden: Servers, Containers, Apps, and Devices spew out more and more data: more metrics, more logs, more events. Collecting and storing all this data is a challenge and is often not cheap, both in terms of the time invested in building and maintaining large-scale data collection, storage, and retrieval systems, and in terms of providing adequate infrastructure to run them.

Solution: Focus on your organization’s core business, your core strength, and outsource needlessly painful or expensive parts to those who specialize in them.  We already outsource all the time, except we don’t call it “outsourcing”: we buy food, we don’t grow or raise it.  We buy cars and don’t build them.  Most of us don’t buy physical servers any more.  Why?  Because others do that better, faster, cheaper.

Problem 2 – Metrics vs. Log Silos: Collecting and visualizing performance metrics and getting alerts when things go awry is great, but performance charts can tell us only so much.  Code instrumentation, like SPM’s Transaction Tracing, goes deeper and provides more insight, but still doesn’t tell us the whole story.  Similarly, collecting logs and being able to search them is very valuable.  Unfortunately, oftentimes APM and log management solutions live in separate silos that don’t really talk to each other.

Solution: Don’t waste your time jumping between multiple disconnected solutions, be they open-source or commercial.  Time is the most precious thing each of us has, and our time as DevOps engineers is very expensive.  Use a tool or service that gives you access to as much of the operational information you need as possible.  Not only is this more efficient, and thus cheaper, it’s also much more pleasant than jumping between browser and terminal and tools like top, vmstat, dstat, less, grep, etc., which are needlessly manual and get boring.

Troubleshooting Doesn’t Need To Take Over Your Life

Troubleshooting production performance issues, dealing with APM alerts and even looking at logs (don’t even think about grepping!) isn’t that hard or time consuming.  Well, as long as you have the right tools, that is.

Correlate_1

Cloud & On Premises Deployments

Unlike most monitoring and logging solutions, Sematext offers both Cloud and On Premises deployments for SPM and Logsene.  We’re happy to discuss package pricing if you’d like to combine both products.

Got ideas how we could make metrics and logs correlation more useful for you?  Let us know via comments, email or @sematext.

Not using SPM and/or Logsene yet? Check out the free 30-day trial by registering here (ping us if you’re a startup, a non-profit, or educational institution – we’ve got special pricing for you!).  There’s no commitment and no credit card required.

Introducing Top Database Operations

If you run Elasticsearch, Solr, or any backend you communicate with using SQL (via JDBC), like SparkSQL, Apache Cassandra (CQL), Apache Impala, Apache Drill, MySQL, PostgreSQL, etc., you’ll like what we’ve just added to SPM.  We call it Database Operations and in SPM you can find it in the new Database report:

If you didn’t watch the video, here’s what Database Operations gives you:

  • Top 5 operation types across all your data stores or filtered to a specific data store type
  • Top 5 operation types by speed, throughput, or simply their volume
  • Time-series reports for volume, throughput, and latency broken down by operation type
  • Ability to view all collected operations, not just the slowest ones, filter by database type or by operation type, sorted by average or total duration, or throughput
  • Sparklines that show last 5 minute values and trends
  • Top 10 slowest individual operations and drill-in details
  • Integration with Transaction Tracing, so you can correlate slow data store operations with the actual transaction/request that triggered slow operations

Important:

  • To get this information, add the SPM agent to the application that is talking to a data store (e.g. Solr or Elasticsearch or MySQL or …). This is because the SPM agent captures operations at that client layer, not in the server itself.
  • To start capturing this information enable Transaction Tracing in your SPM agents

This, including Distributed Transaction Tracing, works for all Java applications.

Database_ops_1

Database_ops_graphic

Don’t forget – when you enable Database Operations you will also automatically get Transaction Tracing, as well as the cool AppMaps – enjoy! :)

Got ideas how we could make Database Operations better and more useful to you?  Let us know via comments, email or @sematext.

Grab a free 30-day SPM trial by registering here (ping us if you’re a startup, a non-profit, or educational institution – we’ve got special pricing for you!).  There’s no commitment and no credit card required.