Wednesday, June 26, 2013

Managing HBase Region Splits

If you're administering HBase, then chances are you've had to think about how to manage splits. There are two ways of doing it:
  1. Automated splits. This is the default strategy. Regions are split once they grow past a certain size. The size is configurable (hbase.hregion.max.filesize) and defaults to 10 GB.
  2. Manual splits. You manually split regions as you see fit. 
Neither of these alternatives is particularly appealing. There are good reasons for managing splits manually. The major one is that you control when splits run, so you don't have to worry about regions changing unpredictably from underneath you. However, manually managing anything is not really a viable option in the world of Big Data.

Luckily, there's a solution that combines the predictability of manual splits with the convenience of automation, so you don't have to worry about it. And it's dead simple: I created a script that does essentially the same thing as the automated strategy, but can be scheduled to run as a cron job.

You can check out the source here:

The Java code is just a few lines:

Sample Usage: ./ -s 10 -r

This goes through all of your HBase regions and splits any region that is bigger than 10 GB. The "-r" argument tells it to actually perform the splits. Without "-r", the script defaults to dry-run mode: it goes through each region and shows you what would happen, but doesn't do any splitting.
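
To make the behavior described above concrete, here is a minimal sketch of the selection and dry-run logic in plain Java. It is not the script itself: the class name, method names, and sample region names are all made up for illustration, region sizes are passed in as a plain map rather than read from a cluster, and the actual split request is stubbed out as a callback where a real tool would call the HBase admin API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Illustrative sketch only: all names here are hypothetical, and no HBase
// classes are used. Sizes come from a map; the split itself is a callback.
class RegionSplitSketch {

    static final long BYTES_PER_GB = 1024L * 1024L * 1024L;

    // Returns the names of regions strictly bigger than the threshold.
    static List<String> regionsToSplit(Map<String, Long> regionSizesBytes, long thresholdGb) {
        long thresholdBytes = thresholdGb * BYTES_PER_GB;
        List<String> oversized = new ArrayList<>();
        for (Map.Entry<String, Long> entry : regionSizesBytes.entrySet()) {
            if (entry.getValue() > thresholdBytes) {
                oversized.add(entry.getKey());
            }
        }
        return oversized;
    }

    // Dry run unless reallySplit is true; only then is the splitter invoked.
    static List<String> run(Map<String, Long> regionSizesBytes, long thresholdGb,
                            boolean reallySplit, Consumer<String> splitter) {
        List<String> oversized = regionsToSplit(regionSizesBytes, thresholdGb);
        for (String region : oversized) {
            if (reallySplit) {
                splitter.accept(region);
            } else {
                System.out.println("[dry-run] would split " + region);
            }
        }
        return oversized;
    }

    public static void main(String[] args) {
        // Made-up region names and sizes, in insertion order.
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("region-a", 4L * BYTES_PER_GB);
        sizes.put("region-b", 12L * BYTES_PER_GB);
        sizes.put("region-c", 25L * BYTES_PER_GB);

        // Equivalent in spirit to running with a 10 GB threshold and no "-r".
        run(sizes, 10, false, region -> System.out.println("splitting " + region));
    }
}
```

Keeping the split behind a callback means the dry-run path can never trigger a split by accident, which is the property that makes the "-r" default safe to run from cron.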