Tag Archives: Hadoop

MongoDB & Hadoop, Sittin’ in a Tree

MongoDB & Hadoop, Sittin’ in a Tree, a presentation from MongoDB

How Apache Kafka is transforming Hadoop, Spark, and Storm

MapReduce Algorithms – Order Inversion

via MapReduce Algorithms – Order Inversion – Random Thoughts on Coding.

This post is another installment in the series presenting MapReduce algorithms found in the Data-Intensive Text Processing with MapReduce book. Previous installments are Local Aggregation, Local Aggregation Part II, and Creating a Co-Occurrence Matrix. This time we will discuss the order inversion pattern, found in chapter 3 of the book. The order inversion pattern exploits the sorting phase of MapReduce to push data needed for calculations to the reducer ahead of the data that will be manipulated. Before you dismiss this as an edge case for MapReduce, I urge you to read on, as we will discuss how to use sorting to our advantage and cover using a custom partitioner, both of which are useful tools to have available. Although many MapReduce programs are written at a higher level of abstraction, e.g. Hive or Pig, it is still helpful to have an understanding of what is going on at a lower level.

To illustrate the order inversion pattern we will use the Pairs approach from the co-occurrence matrix pattern. When creating the co-occurrence matrix, we track the total counts of how often words appear together. At a high level we take the Pairs approach and add a small twist: in addition to having the mapper emit a word pair such as (“foo”,“bar”), we will emit a special marker pair such as (“foo”,“*”) for every word pair, so we can easily obtain a total count of how often the leftmost word appears and use that count to calculate our relative frequencies. This approach raises two specific problems. First, we need to find a way to ensure the (“foo”,“*”) pairs arrive at the reducer first. Second, we need to make sure all word pairs with the same left word arrive at the same reducer. Before we solve those problems, let’s take a look at our mapper code.
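A minimal sketch of the idea (not the post's actual listing) looks like this, assuming a plain Text key of the form “left,right” rather than a custom writable; the class names are mine:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Pairs mapper with the order-inversion twist: for every co-occurrence
    // pair ("foo","bar") it also emits a marker pair ("foo","*").
    public class OrderInversionMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text pair = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Simplistic tokenization; assumes tokens contain no commas.
            String[] tokens = value.toString().trim().split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                for (int j = i + 1; j < tokens.length; j++) {
                    pair.set(tokens[i] + "," + tokens[j]); // co-occurrence pair
                    context.write(pair, ONE);
                    pair.set(tokens[i] + ",*");            // total-count marker
                    context.write(pair, ONE);
                }
            }
        }
    }

    // Partitions on the left word only, so all pairs sharing a left word,
    // including the "*" markers, reach the same reducer. Wire it in with
    // job.setPartitionerClass(LeftWordPartitioner.class).
    class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String leftWord = key.toString().split(",")[0];
            return (leftWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

Because the byte value of “*” sorts before any letter, the marker pairs for a word are sorted ahead of its regular pairs; the reducer can capture the total first and then emit a relative frequency for each subsequent pair.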

Resource Allocation Configuration for Spark on YARN

via Resource Allocation Configuration for Spark on YARN | MapR.

In this blog post, I will explain the resource allocation configurations for Spark on YARN, describe the yarn-client and yarn-cluster modes, and include examples.
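As a rough, hypothetical illustration of the knobs involved, here is a Java sketch using Spark's standard YARN configuration properties; the values are arbitrary examples, not recommendations from the post:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkOnYarnAllocation {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("yarn-allocation-example")
                // Spark 1.x style master strings: "yarn-client" runs the driver
                // in the submitting process; "yarn-cluster" runs it inside a
                // YARN ApplicationMaster container.
                .setMaster("yarn-client")
                .set("spark.executor.instances", "4") // executors YARN launches
                .set("spark.executor.memory", "2g")   // heap per executor
                .set("spark.executor.cores", "2");    // concurrent tasks per executor

            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... job logic would go here ...
            sc.stop();
        }
    }

The same properties can equally be passed on the command line through spark-submit rather than hard-coded, which is the more common practice.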

Bees and Elephants In The Clouds – Using Hive with Azure HDInsight

For today’s blog entry, I will combine both Cloud and Big Data to show you how easy it is to do Big Data on the Cloud. Microsoft Azure is a cloud service provided by Microsoft, and one of the services offered is HDInsight, an Apache Hadoop-based distribution running in the cloud on Azure. Another service […]

Calculating a Co-Occurrence Matrix With Hadoop

via Calculating a Co-Occurrence Matrix With Hadoop – Random Thoughts on Coding.

This post continues our series on implementing the MapReduce algorithms found in the Data-Intensive Text Processing with MapReduce book. This time we will be creating a word co-occurrence matrix from a corpus of text; a simplified sketch of the counting step follows the list below. Previous posts in this series are:

  1. Working Through Data-Intensive Text Processing with MapReduce
  2. Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II
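As a rough illustration of the counting step (a sketch with class names of my own, not the code from the post), the reduce side of this job is simply a sum over the 1s emitted for each word pair:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums the counts emitted for each word pair, producing the final
    // value of that cell of the co-occurrence matrix.
    public class PairsReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            total.set(sum);
            context.write(pair, total);
        }
    }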

Testing Hadoop Programs With MRUnit

via Testing Hadoop Programs With MRUnit – Random Thoughts on Coding.

This post will take a slight detour from implementing the patterns found in Data-Intensive Text Processing with MapReduce to discuss something equally important: testing. I was inspired in part by a presentation by Tom Wheeler that I attended at the 2012 Strata/Hadoop World conference in New York. When working with large data sets, unit testing might not be the first thing that comes to mind. However, when you consider that the same code is pushed out to all nodes to run a MapReduce job, no matter how large your cluster is or how much data you have, Hadoop mappers and reducers lend themselves very well to being unit tested. What is not easy about unit testing Hadoop, however, is the framework itself. Luckily there is a library that makes testing Hadoop fairly easy – MRUnit. MRUnit is based on JUnit and allows for the unit testing of mappers, reducers, and some limited integration testing of the mapper-reducer interaction, along with combiners, custom counters, and partitioners. We are using the latest release of MRUnit as of this writing, 0.9.0. All of the code under test comes from the previous post on computing averages using local aggregation.
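As a taste of the API, here is a minimal, self-contained sketch against MRUnit 0.9.0; TokenizingMapper is a stand-in of my own rather than the averaging code from the post, but the MapDriver / withInput / withOutput / runTest flow is the part that carries over:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Before;
    import org.junit.Test;

    public class TokenizingMapperTest {

        // A trivial mapper under test: emits (token, 1) for each token.
        static class TokenizingMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }

        private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

        @Before
        public void setUp() {
            mapDriver = MapDriver.newMapDriver(new TokenizingMapper());
        }

        @Test
        public void mapperEmitsOnePerToken() throws IOException {
            mapDriver.withInput(new LongWritable(1L), new Text("foo bar"))
                     .withOutput(new Text("foo"), new IntWritable(1))
                     .withOutput(new Text("bar"), new IntWritable(1))
                     .runTest();
        }
    }

runTest() compares the mapper’s actual output records against the declared expectations and fails the test on any mismatch, so one fluent chain covers both setup and assertion.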