Streaming data into HPCC using Java


via Brian ONeill’s Random Thoughts: Streaming data into HPCC using Java.

High Performance Computing Cluster (HPCC) is a distributed processing framework akin to Hadoop, except that it runs programs written in its own domain-specific language (DSL), Enterprise Control Language (ECL). ECL is great, but occasionally you will want to call out to other languages for the heavy lifting. For example, you may want to leverage an NLP library written in Java.

Additionally, HPCC typically operates against data residing on filesystems akin to HDFS.  And just like with HDFS, once you move beyond log file processing and static data snapshots, you quickly develop a desire for a database backend.

In fact, I’d say this is a general industry trend: HDFS->HBase, S3->Redshift, etc. Eventually, you want to decrease the latency of analytics (to near zero). To do this, you set up some sort of distributed database capable of supporting both batch processing and data streaming/micro-batching. And you adopt an immutable/incremental approach to data storage, which allows you to collapse your infrastructure and stream data into the system as it is being analyzed (simplifying everything in the process).

But I digress. As a step in that direction…

We can leverage the Java integration capabilities within HPCC to support User Defined Functions in Java. Likewise, we can leverage the same facilities to add backend storage mechanisms (e.g. Cassandra). More specifically, let’s have a look at the streaming capabilities of HPCC/Java integration to get data out of an external source.

Let’s first look at vanilla Java integration.
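
As a rough sketch of what the Java side of a simple User Defined Function might look like, assuming the integration binds ECL functions to static methods that take and return simple types: the class name `TextUdfs` and both method names are hypothetical, chosen only for illustration, and the ECL declaration that would reference them is not shown here.

```java
import java.util.Locale;

// Hypothetical holder class for UDF-style helpers; the names are illustrative
// and not part of HPCC. Assumption: the ECL side binds to static methods.
class TextUdfs {

    // Collapses runs of whitespace, trims the ends, and lower-cases the text.
    // Returns an empty string for null input so callers never see a null.
    public static String normalize(String input) {
        if (input == null) {
            return "";
        }
        return input.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Counts whitespace-separated tokens in the normalized text.
    public static int tokenCount(String input) {
        String normalized = normalize(input);
        return normalized.isEmpty() ? 0 : normalized.split(" ").length;
    }
}
```

Keeping the methods static and free of shared state means they can be exercised from a plain Java test before being wired into a cluster, for example `TextUdfs.tokenCount("  Streaming   data into HPCC ")`.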