In this post, I am going to demystify Hadoop by working through a specific example. We will go through a mental exercise of how you would normally solve a given problem, and then I'll explain where Hadoop helps with solving it. So without further ado...
Let's pretend you work for a Social Analytics startup, bringing tremendous value to society by analyzing Tweets, Facebook posts, LinkedIn profiles, FourSquare check-ins, etc. (no, I do not work for one). Every 15 minutes you receive a giant file, around a billion entries in size, containing some kind of data about the users' actions. You need to process that file, and generate a bunch of reports from it.
Problem #1 - the file is too big to be processed on a single machine.
Your Solution:
Assuming you have 100 machines at your disposal, you quickly realize that the best way to process the file is to split it up into 100 equal parts. You can then have the machines process the parts in parallel, and then simply combine the outputs somehow. You write a Master or a Coordinator that does the splitting, gives part of the data to each of the 100 machines, and waits until they all finish. Although this looks easy at first, once you add code for intermittent machine failures, health checks, etc., you end up with an entire (crappy) data processing framework, and wonder where you went wrong.
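To make the do-it-yourself approach concrete, here is a minimal sketch of such a Coordinator in Python. It is only an illustration under loud assumptions: threads stand in for the 100 machines, process_part is a made-up placeholder for the real work, and none of the hard parts (failures, health checks, retries) are handled.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(records, num_parts):
    """Split records into num_parts contiguous chunks of near-equal size."""
    base, extra = divmod(len(records), num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        # The first `extra` parts get one additional record each.
        size = base + (1 if i < extra else 0)
        parts.append(records[start:start + size])
        start += size
    return parts


def process_part(part):
    # Hypothetical per-machine work: here we just count the records.
    return len(part)


def coordinate(records, num_workers):
    """Hand one part to each worker, wait for all of them, combine the outputs."""
    parts = split_into_parts(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(process_part, parts))
    return sum(partial_results)


records = [f"entry-{i}" for i in range(1000)]
total = coordinate(records, num_workers=100)
print(total)
```

Everything that is missing from this sketch (what happens when a worker dies halfway through its part, for example) is exactly the code that turns a 30-line script into a framework.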
Hadoop Solution:
Hadoop in a nutshell is a data-processing framework, designed to handle problems like the ones I described above. To make it easier to solve problems associated with storing and processing huge amounts of data, Hadoop comes with its own distributed file system, called HDFS. HDFS has a few key properties that differentiate it from a normal file system.
- The default size of an HDFS block is 64MB. Yes, that's megabytes. I'll discuss why it's so large right below.
- Each block is stored on 3 different machines by default. When reading data, any one of the three may be used.
- Each block is immutable.
- It's common for a single file to be stored across multiple machines. There's a special machine called the NameNode that keeps track of all the files, and where those files are stored.
- The actual data is stored on machines called DataNodes. A typical Hadoop cluster contains a single NameNode, a single JobTracker node, and N DataNodes.
Why is the default size of an HDFS block 64MB? The reason is that an HDFS block is the processing unit that each Mapper receives. I haven't talked about Mappers and Reducers yet, so let's assume that all the processing takes place just in the Map phase for simplicity. Going back to the example, we're processing a file that has a billion entries, and takes, let's say, 64GB. Hadoop will process this file 64MB at a time, which means that the file is stored across roughly 1000 HDFS blocks (1024, to be exact), which in turn means that roughly 1000 Map tasks will get created, 1 Map task to process each block. The default number of Map tasks that can execute at the same time on a single node is 2. In our example, we have 100 machines, which means that each machine will execute about 10 Map tasks (1000/100), 2 Map tasks at a time.
What would happen if the block size was 4KB instead? Well, remember that Hadoop is a data-processing framework for solving Big Data problems, and Big Data means that your inputs will be much larger than 64MB in size. There's overhead in creating and tracking each Map task, so you don't want to create too many. If the block size was 4KB, our 64GB file would end up with 16,777,216 Map tasks, and Hadoop would spend more resources keeping track of them than actually processing the data.
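The block arithmetic is easy to check for yourself. The sketch below is plain Python, not anything from Hadoop; map_task_plan is a made-up helper, and it assumes the numbers from the example (a 64GB file, 100 machines, 2 Map tasks running at once per machine) and that the tasks are spread evenly.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def ceil_div(a, b):
    """Integer division, rounded up."""
    return -(-a // b)


def map_task_plan(file_size, block_size, machines, slots_per_machine):
    """Return (blocks, tasks on the busiest machine, rounds of tasks on that machine)."""
    blocks = ceil_div(file_size, block_size)          # one Map task per block
    tasks_per_machine = ceil_div(blocks, machines)    # assumes an even spread
    rounds = ceil_div(tasks_per_machine, slots_per_machine)
    return blocks, tasks_per_machine, rounds


plan_64mb = map_task_plan(64 * GB, 64 * MB, machines=100, slots_per_machine=2)
plan_4kb = map_task_plan(64 * GB, 4 * KB, machines=100, slots_per_machine=2)

print(plan_64mb)  # (1024, 11, 6)
print(plan_4kb)   # (16777216, 167773, 83887)
```

With 64MB blocks the busiest machine runs 11 tasks in 6 rounds. With 4KB blocks it would have to run well over a hundred thousand tiny tasks, each paying the start-up and tracking overhead.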
There's another benefit in keeping the block size large. All of the metadata about where the blocks are stored lives on a single machine, the NameNode, which keeps it in memory. The smaller your blocks are, the more of them you have, and there is a real danger that the NameNode will run out of room to track them all.
There is another key benefit that HDFS offers. Normally, you ship your data to your processing. That is, you would normally designate 3 boxes that do all of your data processing, and then you just move whichever files need to be processed to those boxes. However, if we're talking about huge amounts of data, moving that data across the network is very expensive. Hadoop works by shipping the processing to the data instead. When you're processing a 64GB file, the contents of that file aren't being transferred from one machine to another across the network. Instead, the jars are transferred over to where the data is stored (the jars would contain your Map/Reduce Java program). The data is already spread out across all of our machines, and so the Mappers run directly on DataNodes. There is something called a JobTracker which is responsible for pushing the jars over, as well as instantiating Mapper tasks, and tracking how the job is going.
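Here is a toy illustration of what "shipping the processing to the data" means for scheduling. This is not the JobTracker's actual algorithm, just a greedy sketch under my own assumptions: the block ids, the node names, and the assign_tasks function are all invented for the example. Each block's Map task goes to the least busy node that already holds a copy of that block.

```python
def assign_tasks(block_replicas, slots_per_node):
    """Greedily place each block's Map task on a node that stores that block.

    block_replicas maps a block id to the list of nodes holding a copy.
    Returns a dict of block id -> chosen node (None if every replica holder is full).
    """
    load = {node: 0 for nodes in block_replicas.values() for node in nodes}
    assignment = {}
    for block, nodes in block_replicas.items():
        # Only consider nodes that hold the data AND have a free Map slot.
        candidates = [n for n in nodes if load[n] < slots_per_node]
        if not candidates:
            # A real scheduler would wait for a slot or read the block remotely.
            assignment[block] = None
            continue
        # Pick the least-loaded candidate; break ties by name for determinism.
        node = min(candidates, key=lambda n: (load[n], n))
        load[node] += 1
        assignment[block] = node
    return assignment


# Hypothetical cluster: 4 blocks, each stored on 3 of 4 DataNodes.
block_replicas = {
    "blk_1": ["node-a", "node-b", "node-c"],
    "blk_2": ["node-a", "node-b", "node-d"],
    "blk_3": ["node-a", "node-c", "node-d"],
    "blk_4": ["node-b", "node-c", "node-d"],
}
assignment = assign_tasks(block_replicas, slots_per_node=2)
for block, node in assignment.items():
    print(block, "->", node)
```

Because every block lives on 3 machines, the scheduler usually has a choice of where to run a task without moving any data. That is one of the reasons the replication matters for more than just safety.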
So to sum up:
- All files in Hadoop are stored in blocks, 64MB each by default.
- Each Mapper task processes one block.
- Hadoop ships the processing (jars) to the data, instead of shipping the data to the processing.
Problem #2 - the file has repeating user-ids
Let's assume the file consists of just two columns, and looks something like this:
<User Id> <Some Data>
123 DataRecord1
456 DataRecord2
789 DataRecord3
123 DataRecord4
456 DataRecord5
While processing the data, it's necessary to do a look-up on the UserId to get information about that user. Let's assume that the look-up is somewhat costly and we want to minimize it. How do you do it?
Your Solution:
The obvious thing you can do is to pre-process the file so that it looks something like this:
123 DataRecord1 DataRecord4
456 DataRecord2 DataRecord5
789 DataRecord3
For each user, you simply group all of their records in the file together. Assuming the user "123" appears in the file twice, instead of doing a lookup each time you see a record, you do just one look-up, making your program much more efficient.
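On a single machine, with data that fits in memory, the grouping is only a few lines. A minimal Python sketch, where expensive_lookup is a made-up stand-in for the costly user look-up:

```python
from collections import defaultdict

lines = [
    "123 DataRecord1",
    "456 DataRecord2",
    "789 DataRecord3",
    "123 DataRecord4",
    "456 DataRecord5",
]


def group_by_user(lines):
    """Collect every record under its user id, preserving file order."""
    grouped = defaultdict(list)
    for line in lines:
        user_id, record = line.split(maxsplit=1)
        grouped[user_id].append(record)
    return dict(grouped)


def expensive_lookup(user_id, call_log):
    # Hypothetical stand-in for the costly look-up; it only records that it was called.
    call_log.append(user_id)
    return {"id": user_id}


grouped = group_by_user(lines)
call_log = []
for user_id, records in grouped.items():
    profile = expensive_lookup(user_id, call_log)  # one look-up per user, not per record
    print(profile["id"], " ".join(records))

print("look-ups:", len(call_log))  # 3 look-ups for 5 records
```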
Hadoop Solution:
The problem I just described occurs fairly often, which is why most Hadoop jobs have a Map phase and a Reduce phase. In the Map phase, you tag each record with a key (here, the user id). Hadoop then does the pre-processing I described above for you: between the two phases, it sorts the Map output and groups the records that belong to the same key together. In the Reduce phase, you receive each key along with all of its records, and do the rest of the processing. Because the optimization I described is not always possible or necessary, the Reduce phase is optional, and a Hadoop job may consist of just the Map phase. However, in practice, this optimization makes sense in almost all situations, which is why most jobs consist of both the Map phase and the Reduce phase.
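To show how the pieces fit together, here is a single-process imitation of a Map/Reduce job in Python. It is a sketch of the programming model, not Hadoop's API: the function names are mine, and the shuffle function stands in for the sorting and grouping by key that Hadoop performs between the Map and Reduce phases across the whole cluster.

```python
from itertools import groupby
from operator import itemgetter

lines = [
    "123 DataRecord1",
    "456 DataRecord2",
    "789 DataRecord3",
    "123 DataRecord4",
    "456 DataRecord5",
]


def map_phase(line):
    """Emit one (key, value) pair per input line, keyed by user id."""
    user_id, record = line.split(maxsplit=1)
    yield user_id, record


def shuffle(pairs):
    """Sort by key and group the values; in Hadoop the framework does this part."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable sort keeps record order
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_phase(user_id, records):
    """Called once per key with all of that key's records."""
    return f"{user_id} {' '.join(records)}"


mapped = [pair for line in lines for pair in map_phase(line)]
output = [reduce_phase(key, values) for key, values in shuffle(mapped)]
for row in output:
    print(row)
```

The code you would actually write for Hadoop corresponds to map_phase and reduce_phase. The part in the middle is what you get for free.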
You may argue that it's relatively easy to do the pre-processing yourself, and Hadoop doesn't buy you that much. You can just read all the data into a HashMap, and group the records yourself. Simple! But since we're dealing with Big Data, which could be Terabytes, or even Petabytes in size, doing this yourself becomes very difficult. Hadoop takes care of the internals for you, making this pre-processing step very easy, without you having to worry about the size of the data. In fact, if you're able to just cache all of the data on one machine, and do all your processing there, chances are you don't need Hadoop at all.
Problem #3 - you want to generate some reports
You've built your Hadoop job, and have successfully processed the data. You've generated some structured output, and that resides on HDFS. Naturally you want to run some reports, so you load your data into a MySQL or an Oracle database. Problem is, the data is large. In fact it's so large that when you try to run a query against the table you've just created, your database begins to cry. If you listen to its sobs, you'll probably hear "I was built to process Megabytes, maybe Gigabytes of data. Not Terabytes. Not Petabytes. That's not my job. I was built in the 80's and 90's, back when floppy drives were used. Just leave me alone".
This is where Hive comes to the rescue. Hive lets you run an SQL statement against structured data stored on HDFS. When you issue an SQL query, it parses it, and translates it into a Java Map/Reduce job, which is then executed on your data. Although Hive does some optimizations, in general it just goes record by record against all your data. This means that it's relatively slow - a typical Hive query takes 5 or 10 minutes to complete, depending on how much data you have. However, that's what makes it effective. Unlike a relational database, you don't waste time on query optimization, adding indexes, etc. Instead, what keeps the processing time down is the fact that the query is run on all machines in your Hadoop cluster, and the scalability is taken care of for you.
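To get a feel for that translation, here is how a simple aggregate query could be expressed as a Map step and a Reduce step. This is a conceptual Python sketch, not what Hive actually generates: the table name user_actions, its columns, and the function names are assumptions I made up for the illustration.

```python
from collections import defaultdict

# The kind of query you might hand to Hive (hypothetical table and columns):
QUERY = "SELECT user_id, COUNT(*) FROM user_actions GROUP BY user_id"

# Stand-in for the structured rows sitting on HDFS.
rows = [
    ("123", "DataRecord1"),
    ("456", "DataRecord2"),
    ("789", "DataRecord3"),
    ("123", "DataRecord4"),
    ("456", "DataRecord5"),
]


def map_row(row):
    """Map: visit every row (no index to consult) and emit (group key, 1)."""
    user_id, _data = row
    return user_id, 1


def reduce_counts(user_id, ones):
    """Reduce: COUNT(*) for one group is the sum of the 1s emitted for its key."""
    return user_id, sum(ones)


# Grouping by key between the two steps, done by the framework on a real cluster.
groups = defaultdict(list)
for key, value in map(map_row, rows):
    groups[key].append(value)

result = dict(reduce_counts(key, values) for key, values in groups.items())
print(result)  # {'123': 2, '456': 2, '789': 1}
```

Notice that the Map step touches every single row. That is the "record by record" full scan described above, and it is why the run time depends on how much data you have and how many machines share the work.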
Hive is extremely useful in data-warehousing scenarios. You would never use Hive as a database for a web application, because the response time is always in minutes, not seconds. However, for generating huge custom reports, running some really expensive query on a year's worth of data, or doing any kind of processing on massive amounts of data, Hive really shines. This is why companies like Oracle and IBM (IBM owns Netezza, a competitor to Oracle) are scared of Hadoop and Hive. Hive makes it possible for companies to easily process massive amounts of data, and processing massive amounts of data is typically how database makers differentiate themselves. And yes, just like the rest of the Hadoop ecosystem, Hive is free and open-source.
If you're reading this, that means you've read through the entire post, so I thank you and hope that you've found it useful. My goal was to give you some intuitive understanding of what Hadoop does, and you can use Google to get more details. If you want to get started playing with Hadoop, Google "Cloudera", a company that makes getting started with Hadoop really easy, and provides great tutorials, as well as Hadoop lessons.