For today’s blog entry I will combine both Cloud and Big Data to show you how easy it is to do Big Data on the Cloud. Microsoft Azure is a cloud service provided by Microsoft, and one of the services offered is HDInsight, an Apache Hadoop-based distribution running in the cloud on Azure. Another service […]
Tag Archives: HDFS
High Performance Computing Cluster (HPCC) is a distributed processing framework akin to Hadoop, except that it runs programs written in its own Domain Specific Language (DSL) called Enterprise Control Language (ECL). ECL is great, but occasionally you will want to call out to perform heavy lifting in other languages. For example, you may want to leverage an NLP library written in Java.
Additionally, HPCC typically operates against data residing on filesystems akin to HDFS. And just like with HDFS, once you move beyond log file processing and static data snapshots, you quickly develop a desire for a database backend.
In fact, I’d say this is a general industry trend: HDFS->HBase, S3->Redshift, etc. Eventually, you want to decrease the latency of analytics (to near zero). To do this, you setup some sort of distributed database, capable of supporting both batch processing as well as data streaming/micro-batching. And you adopt an immutable/incremental approach to data storage, which allows you to collapse your infrastructure and stream data into the system as it is being analyzed. (simplifying everything int he process)
But I digress, as a step in that direction…
We can leverage the Java Integration capabilities within HPCC to support User Defined Functions in Java. Likewise, we can leverage the same facilities to add additional backend storage mechanisms (e.g. Cassandra). More specifically, let’s have a look at the streamingcapabilities of HPCC/Java integration to get data out of an external source.
Let’s first look at vanilla Java integration.
Apache HBase is a NoSQL database that runs on top of Hadoop as a distributed and scalable big data store. This means that HBase can leverage the distributed processing paradigm of the Hadoop Distributed File System (HDFS) and benefit from Hadoop’s MapReduce programming model. It is meant to host large tables with billions of rows with potentially millions of columns and run across a cluster of commodity hardware. But beyond its Hadoop roots, HBase is a powerful database in its own right that blends real-time query capabilities with the speed of a key/value store and offline or batch processing via MapReduce. In short, HBase allows you to query for individual records as well as derive aggregate analytic reports across a massive amount of data.