We continue with our series on implementing MapReduce algorithms found in Data-Intensive Text Processing with MapReduce book. Other posts in this series:
Tag Archives: MapReduce
This post is another segment in the series presenting MapReduce algorithms as found in the Data-Intensive Text Processing with MapReduce book. Previous installments are Local Aggregation, Local Aggregation PartII and Creating a Co-Occurrence Matrix. This time we will discuss the order inversion pattern. The order inversion pattern exploits the sorting phase of MapReduce to push data needed for calculations to the reducer ahead of the data that will be manipulated.. Before you dismiss this as an edge condition for MapReduce, I urge you to read on as we will discuss how to use sorting to our advantage and cover using a custom partitioner, both of which are useful tools to have available. Although many MapReduce programs are written at a higher level abstraction i.e Hive or Pig, it’s still helpful to have an understanding of what’s going on at a lower level. The order inversion pattern is found in chapter 3 of Data-Intensive Text Processing with MapReduce book. To illustrate the order inversion pattern we will be using the Pairs approach from the co-occurrence matrix pattern. When creating the co-occurrence matrix, we track the total counts of when words appear together. At a high level we take the Pairs approach and add a small twist, in addition to having the mapper emit a word pair such as (“foo”,“bar”) we will emit an additional word pair of (“foo”,“”) and will do so for every word pair so we can easily achieve a total count for how often the left most word appears, and use that count to calculate our relative frequencies. This approach raised two specific problems. First we need to find a way to ensure word pairs (“foo”,“”) arrive at the reducer first. Secondly we need to make sure all word pairs with the same left word arrive at the same reducer. Before we solve those problems, let’s take a look at our mapper code.
This post continues with our series of implementing the MapReduce algorithms found in the Data-Intensive Text Processing with MapReduce book. This time we will be creating a word co-occurrence matrix from a corpus of text. Previous posts in this series are:
This post will take a slight detour from implementing the patterns found in Data-Intensive Processing with MapReduce to discuss something equally important, testing. I was inspired in part from a presentation by Tom Wheeler that I attended while at the 2012 Strata/Hadoop World conference in New York. When working with large data sets, unit testing might not be the first thing that comes to mind. However, when you consider the fact that no matter how large your cluster is, or how much data you have, the same code is pushed out to all nodes for running the MapReduce job, Hadoop mappers and reducers lend themselves very well to being unit tested. But what is not easy about unit testing Hadoop, is the framework itself. Luckily there is a library that makes testing Hadoop fairly easy – MRUnit. MRUnit is based on JUnit and allows for the unit testing of mappers, reducers and some limited integration testing of the mapper – reducer interaction along with combiners, custom counters and partitioners. We are using the latest release of MRUnit as of this writing, 0.9.0. All of the code under test comes from the previous post on computing averages using local aggregation.
In a previous post I described an example to perform a PageRank calculation which is part of the Mining Massive Dataset course with Apache Hadoop. In that post I took an existing Hadoop job in Java and modified it somewhat (added unit tests and made file paths set by a parameter). This post shows how to use this job on a real-life Hadoop cluster. The cluster is a AWS EMR cluster of 1 Master Node and 5 Core Nodes, each being backed by a m3.xlarge instance.
The first step is to prepare the input for the cluster. I make use of AWS S3 since this is a convenient way when working with EMR. I create a new bucket, ‘emr-pagerank-demo’, and made the following subfolders:
- in: the folder containing the input files for the job
- job: the folder containing my executable Hadoop jar file
- log: the folder where EMR will put its log files
In the ‘in’ folder I then copied the data that I want to be ranked. I used this file as input. Unzipped it became a 5 GB file with XML content, although not really massive, it is sufficient for this demo. When you take the sources of the previous post and run ‘mvn clean install’ you will get the jar file: ‘hadoop-wiki-pageranking-0.2-SNAPSHOT.jar’. I uploaded this jar file to the ‘job’ folder.
That is it for the preparation. Now we can fire up the cluster. For this demo I used the AWS Management Console:
In this talk we will introduce the typical predictive modeling tasks on “not-so-big-data-but-not- quite-small-either” that benefit from distributed the work on several cores or nodes in a small cluster (e.g. 20 * 8 cores).
We will talk about cross validation, grid search, ensemble learning, model averaging, numpy memory mapping, Hadoop or Disco MapReduce, MPI AllReduce and disk & memory locality.
We will also feature some quick demos using scikit-learn and IPython.parallel from the notebook on an spot-instance EC2 cluster managed by StarCluster.
Currently I am following the Coursera training ‘Mining Massive Datasets‘. I have been interested in MapReduce and Apache Hadoop for some time and with this course I hope to get more insight in when and how MapReduce can help to fix some real world business problems (another way to do so I described here). This Coursera course is mainly focussing on the theory of used algorithms and less about the coding itself. The first week is about PageRanking and how Google used this to rank pages. Luckily there is a lot to find about this topic in combination with Hadoop. I ended up here and decided to have a closer look at this code.
What I did was taking this code (forked it) and rewrote it a little. I created unit tests for the mappers and reducers as I described here. As a testcase I used the example from the course. We have three webpages linking to each other and/or themselves: