Configuring HBase Memstore: What You Should Know

In this post we discuss what HBase users should know about one of the internal parts of HBase: the Memstore. Understanding underlying processes related to Memstore will help to configure HBase cluster towards better performance.

HBase Memstore

Let’s take a look at the write and read paths in HBase to understand what Memstore is, where how and why it is used.

Memstore Usage in HBase Read/Write Paths

Memstore Usage in HBase Read/Write Paths

(picture was taken from Intro to HBase Internals and Schema Design presentation)

When RegionServer (RS) receives write request, it directs the request to specific Region. Each Region stores set of rows. Rows data can be separated in multiple column families (CFs). Data of particular CF is stored in HStore which consists of Memstore and a set of HFiles. Memstore is kept in RS main memory, while HFiles are written to HDFS. When write request is processed, data is first written into the Memstore. Then, when certain thresholds are met (obviously, main memory is well-limited) Memstore data gets flushed into HFile.

The main reason for using Memstore is the need to store data on DFS ordered by row key. As HDFS is designed for sequential reads/writes, with no file modifications allowed, HBase cannot efficiently write data to disk as it is being received: the written data will not be sorted (when the input is not sorted) which means not optimized for future retrieval. To solve this problem HBase buffers last received data in memory (in Memstore), “sorts” it before flushing, and then writes to HDFS using fast sequential writes. Note that in reality HFile is not just a simple list of sorted rows, it is much more than that.

Apart from solving the “non-ordered” problem, Memstore also has other benefits, e.g.:

  • It acts as a in-memory cache which keeps recently added data. This is useful in numerous cases when last written data is accessed more frequently than older data
  • There are certain optimizations that can be done to rows/cells when they are stored in memory before writing to persistent store. E.g. when it is configured to store one version of a cell for certain CF and Memstore contains multiple updates for that cell, only most recent one can be kept and older ones can be omitted (and never written to HFile).

Important thing to note is that every Memstore flush creates one HFile per CF.

On the reading end things are simple: HBase first checks if requested data is in Memstore, then goes to HFiles and returns merged result to the user.

What to Care about

There are number of reasons HBase users and/or administrators should be aware of what Memstore is and how it is used:

  • There are number of configuration options for Memstore one can use to achieve better performance and avoid issues. HBase will not adjust settings for you based on usage pattern.
  • Frequent Memstore flushes can affect reading performance and can bring additional load to the system
  • The way Memstore flushes work may affect your schema design

Let’s take a closer look at these points.

Configuring Memstore Flushes

Basically, there are two groups of configuraion properties (leaving out region pre-close flushes):

  • First determines when flush should be triggered
  • Second determines when flush should be triggered and updates should be blocked during flushing

First  group is about triggering “regular” flushes which happen in parallel with serving write requests. The properties for configuring flush thresholds are:

  • hbase.hregion.memstore.flush.size
 Memstore will be flushed to disk if size of the memstore
 exceeds this number of bytes. Value is checked by a thread that runs
 every hbase.server.thread.wakefrequency.
 <description>Maximum size of all memstores in a region server before
 flushes are forced. Defaults to 35% of heap.
 This value equal to causes
 the minimum possible flushing to occur when updates are blocked due to
 memstore limiting.

Note that the first setting is the size per Memstore. I.e. when you define it you should take into account the number of regions served by each RS. When number of RS grows (and you configured the setting when there were few of them) Memstore flushes are likely to be triggered by the second threshold earlier.

Second group of settings is for safety reasons: sometimes write load is so high that flushing cannot keep up with it and since we  don’t want memstore to grow without a limit, in this situation writes are blocked unless memstore has “manageable” size. These thresholds are configured with:

 <description>Maximum size of all memstores in a region server before new
 updates are blocked and flushes are forced. Defaults to 40% of heap.
 Updates are blocked and flushes are forced until size of all memstores
 in a region server hits
  • hbase.hregion.memstore.block.multiplier
 Block updates if memstore has hbase.hregion.block.memstore
 time hbase.hregion.flush.size bytes. Useful preventing
 runaway memstore during spikes in update traffic. Without an
 upper-bound, memstore fills such that when it flushes the
 resultant flush files take a long time to compact or split, or
 worse, we OOME.

Blocking writes on particular RS on its own may be a big issue, but there’s more to that. Since in HBase by design one Region is served by single RS when write load is evenly distributed over the cluster (over Regions) having one such “slow” RS will make the whole cluster work slower (basically, at its speed).

Hint: watch for Memstore Size and Memstore Flush Queue size. Memstore Size ideally should not reach upper Memstore limit and Memstore Flush Queue size should not constantly grow.

Frequent Memstore Flushes

Since we want to avoid blocking writes it may seem a good approach to flush earlier when we are far from “writes-blocking” thresholds. However, this will cause too frequent flushes which can affect read performance and bring additional load to the cluster.

Every time Memstore flush happens one HFile created for each CF. Frequent flushes may create tons of HFiles. Since during reading HBase will have to look at many HFiles, the read speed can suffer.

To prevent opening too many HFiles and avoid read performance deterioration there’s HFiles compaction process. HBase will periodically (when certain configurable thresholds are met) compact multiple smaller HFiles into a big one. Obviously, the more files created by Memstore flushes, the more work (extra load) for the system. More to that: while compaction process is usually performed in parallel with serving other requests, when HBase cannot keep up with compacting HFiles (yes, there are configured thresholds for that too;)) it will block writes on RS again. As mentioned above, this is highly undesirable.

Hint: watch for Compaction Queue size on RSs. In case it is constantly growing you should take actions before it will cause problems.

More on HFiles creation & Compaction can be found here.

So, ideally Memstore should use as much memory as it can (as configured, not all RS heap: there are also in-memory caches), but not cross the upper limit. This picture (screenshot was taken from our SPM monitoring service) shows somewhat good situation:

Memstore Size: Good Situation

Memstore Size: Good Situation

“Somewhat”, because we could configure lower limit to be closer to upper, since we barely ever go over it.

Multiple Column Families & Memstore Flush

Memstores of all column families are flushed together (this might change). This means creating N HFiles per flush, one for each CF. Thus, uneven data amount in CF will cause too many HFiles to be created: when Memstore of one CF reaches threshold all Memstores of other CFs are flushed too. As stated above too frequent flush operations and too many HFiles may affect cluster performance.

Hint: in many cases having one CF is the best schema design.

HLog (WAL) Size & Memstore Flush

On RegionServer write/read paths picture above you may also noticed a Write-ahead Log (WAL) where data is getting written by default. It contains all the edits of RegionServer which were written to Memstore but were not flushed into HFiles. As data in Memstore is not persistent we need WAL to recover from RegionServer failures. When RS crushes and data which was stored in Memstore and wasn’t flushed is lost, WAL is used to replay these recent edits.

When WAL (in HBase it is called HLog) grows very big, it may take a lot of time to replay it. For that reason there are certain limits for WAL size, which when reached cause Memstore to flush. Flushing Memstores decreases WAL as we don’t need to keep in WAL edits which were written to HFiles (persistent store). This is configured by two properties: hbase.regionserver.hlog.blocksize and hbase.regionserver.maxlogs. As you probably figured out, maximum WAL size is determined by hbase.regionserver.maxlogs * hbase.regionserver.hlog.blocksize (2GB by default). When this size is reached, Memstore flushes are triggered. So, when you increase Memstore size and adjust other Memstore settings you need to adjust HLog ones as well. Otherwise WAL size limit may be hit first and you will never utilize all the resources dedicated to Memstore. Apart from that, triggering of Memstore flushes by reaching WAL limit is not the best way to trigger flushing, as it may create “storm” of flushes by trying to flush many Regions at once when written data is well distributed across Regions.

Hint: keep hbase.regionserver.hlog.blocksize * hbase.regionserver.maxlogs just a bit above * HBASE_HEAPSIZE.

Compression & Memstore Flush

With HBase it is advised to compress the data stored on HDFS (i.e. HFiles). In addition to saving on space occupied by data this reduces the disk & network IO significantly. Data is compressed when it is written to HDFS, i.e. when Memstore flushes. Compression should not slow down flushing process a lot, otherwise we may hit many of the problems above, like blocking writes caused by Memstore being too big (hit upper limit) and such.

Hint: when choosing compression type favor compression speed over compression ratio. SNAPPY showed to be a good choice here.


[UPDATE 2012/07/23: added notes about HLog size and Compression types]


Plug: if this sort of stuff interests you, we are hiring people who know about Hadoop, HBase, MapReduce…


Presentation: Intro to HBase Internals and Schema Design

Below are the slides (and audio) from the Intro to HBase Internals and Schema Design presentation  Alex gave at had our inaugural HBase NYC meetup.   See also: Introduction to HBase

We’re hiring people who want to work WITH and ON HBase and other Big Data technologies.  See jobs @ sematext.

Presentation: Intro to HBase

Last week we had our inaugural HBase NYC meetup.  About 30 people turned up – not bad for the first meetup.  Etsy, Sematext old customer and Brooklyn neighbours, provided the space, AV equipment and help, as well as their fridge with beer – thanks!  Alex gave two talks, first the Introduction to HBase whose slides (audio) are below and then Intro to HBase Internals and Schema Design.

We’re hiring people who want to work WITH and ON HBase and other Big Data technologies.  See jobs @ sematext.

Slides: Large Scale Performance Monitoring for ElasticSearch, HBase, Solr, SenseiDB…

In this presentation from Berlin Buzzwords 2012 we show how the SPM, our Performance Monitoring service is built.  How metrics are collected, how they are processed, and how they are presented.  We share a few findings along the way, too.

Note: we are actively looking for people with strong Java engineers. If that’s you, please get in touch. Separately, if you have interest and/or experience with HBase and/or Analytics, OLAP, and related areas, or  if you are looking to work with ElasticSearch, Solr, and search in general please get in touch, too.


See also:

Slides: Real-time Analytics with HBase

Here are slides from another talk we gave at both Berlin Buzzwords and at HBaseCon in San Francisco last month.  In this presentation  Alex describes one approach to real-time analytics with HBase, which we use at Sematext via HBaseHUT.   If you like these slides you will also like HBase Real-time Analytics Rollbacks via Append-based Updates.

Note: we are actively looking for people with strong interest and/or experience with HBase and/or Analytics, OLAP, etc.  If that’s you, please get in touch.

The short version is from Buzzwords, while the version with more slides is from HBaseCon:

Announcement: HBase NYC Meetup

We make so much use of HBase at Sematext, we felt the need to start a New York City HBase Meetup – HBase NYC.  We just created the meetup, but there are already over 40 members.  We do not have any meetups scheduled just yet, but have a few initial ideas.  If you are in New York and have interest in HBase, this is the meetup to join.  If you would like to host the meetup or if you would like to present please let us know.  Presentation suggestions are very welcome, too!



HBase Real-time Analytics & Rollbacks via Append-based Updates (Part 2)

This is the second part of a 3-part post series in which we describe how we use HBase at Sematext for real-time analytics with an append-only updates approach.

In our previous post we explained the problem in detail with the help of example and touched on the suggested solution idea. In this post we will go through solution details as well as briefly introduce the open-sourced implementation of the described approach.

Suggested Solution

Suggested solution can be described as follows:

  1. replace update (Get+Put) operations at write time with simple append-only writes
  2. defer processing updates to periodic compaction jobs (not to be confused with minor/major HBase compaction)
  3. perform on the fly updates processing only if user asks for data earlier than updates compacted

Before (standard Get+Put updates approach):

The picture below shows an example of updating search query metrics that can be collected by Search Analytics system (something we do at Sematext).

  • each new piece of data (blue box) is processed individually
  • to apply update based on the new piece of data:
    • existing data (green box) is first read
    • data is changed and
    • written back

After (append-only updates approach):

1. Writing updates:

2. Periodic updates processing (compacting):

3. Performing on the fly updates processing (only if user asks for data earlier than updates compacted):

Note: the result of the processing updates on the fly can be optionally stored back right away, so that next time same data is requested no compaction is needed.

The idea is simple and not a new one, but given the specific qualities of HBase like fast range scans and high write throughput it works especially well with HBase. So, what we gain here is:

  • high update throughput
  • real-time updates visibility: despite deferring the actual updates processing, user always sees the latest data changes
  • efficient updates processing by replacing random Get+Put operations with processing sets of records at a time (during the fast scan) and eliminating redundant Get+Put attempts when writing the very first data item
  • better handling of update load peaks
  • ability to roll back any range of updates
  • avoid data inconsistency problems caused by tasks that fail after only partially updating data in HBase without doing rollback (when using with MapReduce, for example)
Let’s take a closer look at each of the above points.

High Update Throughput

Higher update throughput is achieved by not doing Get operation for every record update. Thus, Get+Put operations are replaced with Puts (which can be further optimized by using client-side buffer), which are really fast in HBase. The processing of updates (compaction) is still needed to perform the actual data merge, but it is done much more efficiently than doing Get+Put for each update operation (see below more details).

Real-time Updates Visibility

Users always see latest data changes. Even if updates have not yet been processed and still stored as a series of records, they will be merged on the fly.

By doing periodic merges of appended records we ensure data that has to be processed on the fly is small enough and does fast.

Efficient Updates Processing

  • N Get+Put operations at write time are replaced with N Puts + 1 Scan (shared) + 1 Put operations
  • Processing N changes at once is usually more effective than applying N individual changes

Let’s say we got 360 update requests for the same record: e.g. record keeps track of some sensor value for 1 hour interval and we collect data points every 10 seconds. These measurements needs to be merged in a single record that represents the whole 1-hour interval. Initially we would perform 360 Get+Put operations (while we can use some client-side buffer to perform partial aggregation and reduce the number of actual Get+Put operations, we want data to be sent immediately as it arrives instead of asking user to wait 10*N seconds). With append-only approach, we will perform 360 Put operations, 1 scan (which is actually meant to process not only these updates) that goes through 360 records (stored in sequence), it calculates resulting record, and performs 1 Put operation to store the result back. Fewer operations means using less resources and this leads to more efficient processing. Moreover, if the value which needs to be updated is a complex one (needs time to load in memory, etc.) it is much more efficient to apply all updates at once than one by one individually.

Deferring processing of updates is especially effective when large portion of operations is in essence insertion of new data, but not an update of stored data. In this case a lot of Get operations (checking if there’s something to update) are redundant.

Better Handling of Update Load Peaks

Deferring processing of updates helps handle load peaks without major performance degradation. The actual (periodic) processing of updates can be scheduled for off-peak time (e.g. nights or weekends).

Ability to Rollback Updates

Since updates don’t change the existing data (until they are processed) rolling back is easy.

Preserving rollback ability even after updates were compacted is also not hard. Updates can be grouped and compacted within time periods of given length as shown in the picture below. That means the client that reads data will still have to merge updates on-the-fly even right after compaction is finished. However using the proper configuration this isn’t going to be a problem as the number of records to be merged on the fly will be small enough.

Consider the following example where the goal is to keep all-time avg value for particular sensor. Let’s say data is collected every 10 seconds for 30 days, which gives 259200 separately written data points. While compacting on-the-fly this amount of values might be quite fast for a medium-large HBase cluster, performing periodic compaction will improve reading speed a lot. Let’s say we perform updates processing every 4 hours and use 1 hour interval as compacting base (as shown in the picture above). This gives us, at any point in time, less than 24*30 + 4*60*6 = 2,160 non-compacted records that need to be processed on-the-fly when fetching resulting record for 30 days. This is a small number of records and can be processed very fast. At the same time it is possible to perform rollback to any point in time with 1 hour granularity.

In case system should store more historical data, but we don’t care about rolling it back (if nothing wrong was found during 30 days the data is likely to be OK) then compaction can be configured to process all data older than 30 days as one group (i.e. merge into one record).

Automatic Handling of Task Failures which Write Data to HBase

Typical scenario: task updating HBase data fails in the middle of writing – some data was written, some not. Ideally we should be able to simply restart the same task (on the same input data) so that new one performs needed updates without corrupting data because some write operations were duplicate.

In the suggested solution every update is written as a new record. In order to make sure that performing the same (literally the same, not just similar) update operation multiple times does not result in multiple separate update operations, in which case data will be corrupted, every update operation should write to the record with the same row key from any task attempts. This results in overriding the same single record (if it was created by failed task) and avoiding doing the same update multiple times which in turn means that data is not corrupted.

This especially convenient in situations when we write to HBase table from MapReduce tasks, as MapReduce framework restarts failed tasks for you. With the given approach we can say that handling task failures happens automatically – no extra effort to manually roll back previous task changes and starting a new task is needed.


Below are the major drawbacks of the suggested solution. Usually there are ways to reduce their effect on your system depending on the specific case (e.g. by tuning HBase appropriately or by adjusting parameters involved in data compaction logic).

  • merging on the fly takes time. Properly configuring periodic updates processing is a key to keeping data fetching fast.
  • when performing compaction, scanning of many records that don’t need to be compacted can happen (already compacted or “alone-standing” records). Compaction can usually be performed only on data written after the previous compaction which allows to use efficient time-based filters to reduce the impact here.

Solving these issues may be implementation-specific. We’ll bring them up again when talking about our implementation in the follow up post.

Implemenation: Meet HBaseHUT

Suggested solution was implemented and open-sourced as HBaseHUT project. HBaseHUT will be covered in the follow up post shortly.


If you like this sort of stuff, we’re looking for Data Engineers!


Get every new post delivered to your Inbox.

Join 155 other followers