Tuesday, July 2, 2013

Exporting Cassandra Metrics to Graphite

Anyone who's ever worked with Cassandra knows that metrics are crucial for any kind of performance tuning. Out of the box, you have the following options for examining metrics in Cassandra:
  • nodetool will give you most of the basic metrics, such as read/write latency, heap usage, etc. It's very easy to use. http://wiki.apache.org/cassandra/NodeTool
  • OpsCenter - a webapp built by DataStax.
However, chances are you're already running your own monitoring system, such as Ganglia or Graphite. In that case, you may wish to export all of the metric goodness that Cassandra has to offer into your own monitoring system.

Luckily, this isn't hard. You can query JMX directly, or you can query the MX4J web interface that Cassandra exposes on top of JMX. I went with the latter approach and wrote a simple utility that grabs all the available metrics out of MX4J and pipes them into Graphite.

You can check out the source code and usage here: https://github.com/matvey14/cassandra-utils
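If you just want a feel for what the utility does, here's a rough sketch (not the actual code linked above). It assumes the MX4J HTTP adaptor is listening on port 8081, that Graphite's plaintext listener is on its default port 2003, and it scrapes attribute values out of the MX4J page with a naive regex; the MBean name, Graphite host, and URL format are placeholders you'd adjust for your own setup.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;
import java.net.URL;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch only, NOT the utility linked above. Assumes the MX4J HTTP adaptor
// is on port 8081 (where Cassandra exposes it when mx4j-tools.jar is on the
// classpath) and Graphite's plaintext listener is on its default port 2003.
// The MBean name, Graphite host, and regex are placeholders for illustration.
public class Mx4jToGraphite {
    public static void main(String[] args) throws Exception {
        String cassandraHost = "localhost";
        String graphiteHost = "graphite.example.com";                  // hypothetical
        String mbean = "org.apache.cassandra.db:type=StorageProxy";    // example MBean

        // 1. Pull the MBean page from MX4J.
        URL url = new URL("http://" + cassandraHost + ":8081/mbean?objectname="
                + URLEncoder.encode(mbean, "UTF-8"));
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        StringBuilder page = new StringBuilder();
        for (String line; (line = in.readLine()) != null; ) {
            page.append(line);
        }
        in.close();

        // 2. Scrape numeric attribute name/value pairs out of the page. A real
        //    version would parse the XML properly instead of using a regex.
        Matcher m = Pattern.compile("name=\"(\\w+)\"[^>]*value=\"([0-9.]+)\"").matcher(page);

        // 3. Push each metric to Graphite using its plaintext protocol:
        //    "<metric.path> <value> <unix-timestamp>\n" sent to port 2003.
        long now = System.currentTimeMillis() / 1000L;
        Socket graphite = new Socket(graphiteHost, 2003);
        PrintWriter out = new PrintWriter(graphite.getOutputStream(), true);
        while (m.find()) {
            out.printf("cassandra.%s.%s %s %d\n", cassandraHost, m.group(1), m.group(2), now);
        }
        out.close();
        graphite.close();
    }
}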

You should be able to easily extend the script and push metrics to Ganglia instead by invoking gmetric, if you'd like.
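For example, something along these lines would do it. This is just a hypothetical sketch that shells out to the standard gmetric command-line tool; check the flags against your Ganglia version.

import java.io.IOException;

// Hypothetical sketch: push a single metric to Ganglia by shelling out to the
// standard gmetric command-line tool. Flag names are gmetric's usual long options.
public class GmetricPush {
    public static void push(String name, double value) throws IOException, InterruptedException {
        new ProcessBuilder("gmetric",
                "--name=" + name,
                "--value=" + value,
                "--type=double")
                .inheritIO()
                .start()
                .waitFor();
    }
}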

Wednesday, June 26, 2013

Managing HBase Region Splits


If you're administering HBase, then chances are you've had to think about how to manage region splits. There are two ways of doing it:
  1. Automated splits. This is the default strategy. Regions are split once they've grown past a certain size. The size is configurable and defaults to 10GB. 
  2. Manual splits. You manually split regions as you see fit. 
Neither of these alternatives is particularly appealing. There are good reasons for managing splits manually, as explained in the HBase book (http://hbase.apache.org/book/important_configurations.html#disable.splitting). The major advantage of manual splits is that you control when they run, so you don't have to worry about regions changing unpredictably from underneath you. However, manually managing anything is not really a viable option in the world of Big Data.

Luckily, there's a solution that combines the predictability of managing splits manually with the convenience of automation, so you don't have to worry about it. And it's dead simple. I created a script that does essentially the same thing as the automated strategy, but can be scheduled to run as a cron job.

You can check out the source here: https://github.com/matvey14/hbase-utils

The Java code is just a few lines: https://github.com/matvey14/hbase-utils/blob/master/src/matvey14/hbase/tools/HBaseRegionSplitter.java

Sample Usage: ./run_region_splitter.sh -s 10 -r

This will go through all of your HBase regions and split any region that is bigger than 10GB. The "-r" argument tells it to actually do the splits. Without "-r", the script defaults to "dry-run" mode, so it'll go through each region and show you what would happen, but won't actually do any splitting.
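For the curious, the idea behind the script looks roughly like the sketch below. This is not the actual HBaseRegionSplitter linked above, just an illustration written against the 0.92/0.94-era HBaseAdmin API; class and method names may differ on newer HBase versions.

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HServerLoad;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Illustration only, NOT the HBaseRegionSplitter linked above.
public class SplitBigRegions {
    public static void main(String[] args) throws Exception {
        long maxSizeMB = 10 * 1024;     // split anything bigger than ~10GB
        boolean reallySplit = false;    // dry-run by default, like the script

        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        ClusterStatus status = admin.getClusterStatus();

        // Walk every region on every region server and check its store file size.
        for (ServerName server : status.getServers()) {
            HServerLoad load = status.getLoad(server);
            for (Map.Entry<byte[], HServerLoad.RegionLoad> entry : load.getRegionsLoad().entrySet()) {
                HServerLoad.RegionLoad region = entry.getValue();
                if (region.getStorefileSizeMB() > maxSizeMB) {
                    System.out.println("Would split " + region.getNameAsString()
                            + " (" + region.getStorefileSizeMB() + "MB)");
                    if (reallySplit) {
                        admin.split(entry.getKey());   // ask HBase to split this region
                    }
                }
            }
        }
        admin.close();
    }
}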

Sunday, February 12, 2012

Hadoop For The Rest Of Us

When it comes to software engineers, there are exactly three categories. The first category is people who know Hadoop well and have been playing around with it for a while. The second category is people who've never heard of it. The third category is people who've heard about Hadoop, perhaps read some Wikipedia entries about Hadoop and Map/Reduce, maybe even bought an O'Reilly book, but still don't have an intuitive understanding of it and the kind of problems it's used to solve. This entry is written for the third kind of people.

In this post, I am going to demystify Hadoop by working through a specific example. We will go through a mental exercise of how you would normally solve a given problem, and then I'll explain where Hadoop helps with solving it. So without further ado...

Let's pretend you work for a Social Analytics startup, bringing tremendous value to society by analyzing Tweets, Facebook posts, LinkedIn profiles, FourSquare check-ins, etc. (no, I do not work for one). Every 15 minutes you receive a giant file, around a billion entries in size, containing some kind of data about the users' actions. You need to process that file and generate a bunch of reports from it.


Problem #1 - the file is too big to be processed on a single machine.

Your Solution:
Assuming you have 100 machines at your disposal, you quickly realize that the best way to process the file is to split it up into 100 equal parts. You can then have the machines process the parts in parallel, and then simply combine the outputs somehow. You write a Master, or Coordinator, that does the splitting, gives part of the data to each of the 100 machines, and waits until they all finish. Although this looks easy at first, once you add code for intermittent machine failures, health checks, etc., you end up with an entire (crappy) data processing framework, and wonder where you went wrong.

Hadoop Solution:
Hadoop in a nutshell is a data-processing framework, designed to handle problems like the ones I described above. To make it easier to solve problems associated with storing and processing huge amounts of data, Hadoop comes with its own distributed file system, called HDFS. HDFS has a few key properties that differentiate it from a normal file system.

  1. The default size of an HDFS block is 64MB. Yes, that's megabytes. I'll discuss why it's so large right below.
  2. Each block is stored on 3 different machines by default. When reading data, any one of the three may be used.
  3. Each block is immutable.
  4. It's common for a single file to be stored across multiple machines. There's a special machine called the NameNode that keeps track of all the files, and where those files are stored.
  5. The actual data is stored on machines called DataNodes. A typical Hadoop cluster contains a single NameNode, a single JobTracker node, and N DataNodes.
Why is the default size of an HDFS block 64MB? The reason is that an HDFS block is the unit of work that each Mapper receives. I haven't talked about Mappers and Reducers yet, so for simplicity let's assume that all the processing takes place in the Map phase. Going back to the example, we're processing a file that has a billion entries and takes up, let's say, 64GB. Hadoop will process this file 64MB at a time, which means the file is stored across roughly 1,000 HDFS blocks, which in turn means that roughly 1,000 Map tasks will get created, one Map task to process each block. The default number of Map tasks that can execute concurrently on a single node is 2. In our example, we have 100 machines, which means that each machine will execute about 10 Map tasks (1000/100), 2 Map tasks at a time.

What would happen if the block size were 4KB instead? Well, remember that Hadoop is a data-processing framework for solving Big Data problems, and Big Data means that your inputs will be much larger than 64MB in size. There's overhead in creating and tracking each Map task, so you don't want to create too many. If the block size were 4KB, our 64GB file would be split into roughly 16 million blocks, and therefore roughly 16 million Map tasks, and Hadoop would spend more resources keeping track of them than actually processing the data.

There's another benefit to keeping the block size large. All of the metadata about where the blocks are stored is kept in memory on a single machine, the NameNode, so there is a real danger of running out of memory on the NameNode if your blocks are too small, which in turn means that you have too many of them.

There is another key benefit that HDFS offers. Normally, you ship your data to your processing. That is, you would designate, say, 3 boxes that do all of your data processing, and then move whichever files need to be processed to those boxes. However, when we're talking about huge amounts of data, moving it across the network is very expensive. Hadoop works by shipping the processing to the data instead. When you're processing a 64GB file, the contents of that file aren't being transferred from one machine to another across the network. Instead, the jars containing your Map/Reduce Java program are transferred over to where the data is stored. The data is already spread out across all of your machines, so the Mappers run directly on the DataNodes. There is a component called the JobTracker which is responsible for pushing the jars over, as well as instantiating Mapper tasks and tracking how the job is going.
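You can actually see this layout for yourself by asking the NameNode where the blocks of a file live. Here's a minimal sketch using the standard Hadoop FileSystem API; the file path is made up, so point it at any large file in your cluster.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: ask the NameNode which DataNodes hold each block of a file.
public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        FileStatus file = fs.getFileStatus(new Path("/data/actions/latest.log"));  // hypothetical path
        BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());

        for (BlockLocation block : blocks) {
            // Each block lives on (by default) three different DataNodes.
            System.out.println("offset " + block.getOffset() + ": "
                    + Arrays.toString(block.getHosts()));
        }
    }
}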

So to sum up:
  • All files in Hadoop are stored in 64MB blocks (by default).
  • Each Mapper task processes one block.
  • Hadoop ships the processing (jars) to the data, instead of shipping the data to the processing.

Problem #2 - the file has repeating user-ids

Let's assume the file consists of just two columns, and looks something like this:

<User Id>  <Some Data>

123            DataRecord1
456            DataRecord2
789            DataRecord3
123            DataRecord4
456            DataRecord5

While processing the data, it's necessary to do a look-up on the UserId to get information about that user. Let's assume that the look-up is somewhat costly and we want to minimize it. How do you do it?

Your Solution:
The obvious thing you can do is to pre-process the file so that it looks something like this:


123            DataRecord1  DataRecord4
456            DataRecord2  DataRecord5
789            DataRecord3

For each user, you simply group all of their records in the file together. Since the user "123" appears in the file twice, instead of doing a lookup each time you see one of their records, you do just one lookup, making your program much more efficient.
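On a single machine, this pre-processing is trivial. A quick sketch, assuming each line looks like "<userId> <data>" as in the example above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-machine version of the pre-processing: group records by user id.
public class GroupByUser {
    public static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> byUser = new HashMap<String, List<String>>();
        for (String line : lines) {
            String[] parts = line.split("\\s+", 2);   // userId, then the rest of the record
            List<String> records = byUser.get(parts[0]);
            if (records == null) {
                records = new ArrayList<String>();
                byUser.put(parts[0], records);
            }
            records.add(parts[1]);
        }
        return byUser;
    }
}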

Hadoop Solution

The problem I just described occurs fairly often, which is why most Hadoop jobs have a Map phase and a Reduce phase. In the Map phase, you essentially pre-process the data in the way I just described above, grouping the records that belong to the same key together. In the Reduce phase, you do the rest of the processing. Because the optimization I described is not always possible or necessary, the Reduce phase is optional, and a Hadoop job may consist of just the Map phase. In practice, however, this optimization makes sense in almost all situations, which is why most jobs consist of both a Map phase and a Reduce phase.
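To make this concrete, here's roughly what that grouping looks like as a Hadoop job, written against the standard org.apache.hadoop.mapreduce API. The lookup itself is just a stub for illustration.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the grouping as a Map/Reduce job. The Map phase emits (userId, data)
// pairs, Hadoop groups them by userId, and the Reduce phase sees each user exactly
// once, so the costly lookup happens once per user.
public class GroupByUserJob {

    public static class ParseMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\\s+", 2);   // "<userId> <data>"
            context.write(new Text(parts[0]), new Text(parts[1]));
        }
    }

    public static class PerUserReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text userId, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            String userInfo = lookupUser(userId.toString());   // one lookup per user
            for (Text record : records) {
                context.write(userId, new Text(userInfo + "\t" + record));
            }
        }

        private String lookupUser(String userId) {
            return "user-info-for-" + userId;   // stand-in for the costly lookup
        }
    }
}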

You may argue that it's relatively easy to do the pre-processing yourself, and that Hadoop doesn't buy you much. You can just read all the data into a HashMap and group the records yourself. Simple! But since we're dealing with Big Data, which could be Terabytes, or even Petabytes, in size, doing this yourself becomes very difficult. Hadoop takes care of the internals for you, making this pre-processing step very easy, without you having to worry about the size of the data. In fact, if you're able to just cache all of the data on one machine and do all your processing there, chances are you don't need Hadoop at all.

Problem #3 - you want to generate some reports


You've built your Hadoop job and have successfully processed the data. You've generated some structured output, and it resides on HDFS. Naturally, you want to run some reports, so you load your data into a MySQL or an Oracle database. Problem is, the data is large. In fact, it's so large that when you try to run a query against the table you've just created, your database begins to cry. If you listen to its sobs, you'll probably hear "I was built to process Megabytes, maybe Gigabytes of data. Not Terabytes. Not Petabytes. That's not my job. I was built in the 80's and 90's, back when floppy drives were used. Just leave me alone."

This is where Hive comes to the rescue. Hive lets you run SQL statements against structured data stored on HDFS. When you issue an SQL query, Hive parses it and translates it into a Java Map/Reduce job, which is then executed on your data. Although Hive does some optimizations, in general it just goes record by record over all your data. This means that it's relatively slow - a typical Hive query takes 5 or 10 minutes to complete, depending on how much data you have. But that brute-force approach is also what makes it effective. Unlike with a relational database, you don't waste time on query optimization, adding indexes, etc. Instead, what keeps the processing time down is the fact that the query runs on all the machines in your Hadoop cluster, and the scalability is taken care of for you.
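For a flavor of what this looks like in practice, here's a sketch of running a HiveQL query over JDBC. The table and column names are made up, and the driver class and URL shown are the HiveServer1 ones that were current at the time; newer Hive installations use the HiveServer2 driver and jdbc:hive2:// URLs.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch only: table and column names are hypothetical, and the driver/URL are
// the HiveServer1 ones from this era (HiveServer2 uses "org.apache.hive.jdbc.HiveDriver").
public class HiveReport {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // This one statement gets translated into a Map/Reduce job that scans
        // every record of the table's data on HDFS.
        ResultSet rs = stmt.executeQuery(
                "SELECT user_id, COUNT(*) AS actions FROM user_actions GROUP BY user_id");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}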

Hive is extremely useful in data-warehousing scenarios. You would never use Hive as the database for a web application, because the response time is always in minutes, not seconds. However, for generating huge custom reports, running some really expensive query on a year's worth of data, or doing any other kind of processing on massive amounts of data, Hive really shines. This is why companies like Oracle and IBM (IBM owns Netezza, a competitor to Oracle) are scared of Hadoop and Hive. Hive makes it possible for companies to easily process massive amounts of data, and processing massive amounts of data is typically how database makers differentiate themselves. And yes, just like the rest of the Hadoop ecosystem, Hive is free and open-source.


If you're reading this, that means you've read through the entire post, so I thank you and hope you've found it useful. My goal was to give you some intuitive understanding of what Hadoop does; you can use Google to get more details. If you want to start playing with Hadoop, Google "Cloudera", which makes getting started really easy and provides great tutorials, as well as Hadoop lessons.