Tag Archives: HBase

Streaming data into HPCC using Java

via Brian ONeill’s Random Thoughts: Streaming data into HPCC using Java.

High Performance Computing Cluster (HPCC) is a distributed processing framework akin to Hadoop, except that it runs programs written in its own Domain Specific Language (DSL) called Enterprise Control Language (ECL).   ECL is great, but occasionally you will want to call out to perform heavy lifting in other languages.  For example, you may want to leverage an NLP library written in Java.

Additionally, HPCC typically operates against data residing on filesystems akin to HDFS.  And just like with HDFS, once you move beyond log file processing and static data snapshots, you quickly develop a desire for a database backend.

In fact, I’d say this is a general industry trend: HDFS->HBase, S3->Redshift, etc. Eventually, you want to decrease the latency of analytics (to near zero). To do this, you set up some sort of distributed database, capable of supporting both batch processing and data streaming/micro-batching. And you adopt an immutable/incremental approach to data storage, which allows you to collapse your infrastructure and stream data into the system as it is being analyzed (simplifying everything in the process).

But I digress, as a step in that direction…

We can leverage the Java Integration capabilities within HPCC to support User Defined Functions in Java.  Likewise, we can leverage the same facilities to add backend storage mechanisms (e.g. Cassandra).  More specifically, let’s have a look at the streaming capabilities of HPCC/Java integration to get data out of an external source.

Let’s first look at vanilla Java integration.
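
As a rough illustration of the Java side of such an integration, here is a minimal sketch. It assumes that HPCC binds ECL definitions to static Java methods and that a streamed dataset can be handed back as an iterator. The class name, method names and the in-memory "external source" are hypothetical, and the ECL declaration that would bind to them is left out.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical class for illustration; none of these names come from the original post.
final class HpccJavaSketch {

    // A scalar user-defined function: a plain static method over simple types.
    public static String shout(String input) {
        return input.toUpperCase(Locale.ROOT) + "!";
    }

    // A streaming-style source: rows are exposed through an Iterator so the
    // caller can pull them one at a time. The fixed list below stands in for
    // an external store such as a database (assumption for the example).
    public static Iterator<String> readRows() {
        List<String> rows = Arrays.asList("alpha", "beta", "gamma");
        return rows.iterator();
    }
}
```

Returning an iterator rather than a fully built collection is the design choice that matters here: the caller decides how fast to consume, which is what makes the source behave like a stream.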


via How to Bulk Load Data from Text File to Big Data Hadoop HBase Table?.

Here we introduce the process of bulk loading data from a text file using the HBase Java client API. In this post, the worldwide Hadoop development community will learn what bulk loading is, when to use it, and what the process looks like.
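
The general shape of a text-file loader can be sketched without the HBase client on the classpath. The example below covers only the parse-and-batch step, assuming tab-separated lines of row key, column and value. The `Cell` class, the `load` method and the sink callback are hypothetical; in a real loader the sink would turn each batch into puts for the HBase client. This is client-side batching, not HBase's HFile-based bulk load.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not part of the HBase API.
final class BatchLoaderSketch {

    // One parsed input line: row key, column name, value.
    static final class Cell {
        final String rowKey;
        final String column;
        final String value;

        Cell(String rowKey, String column, String value) {
            this.rowKey = rowKey;
            this.column = column;
            this.value = value;
        }
    }

    // Parses tab-separated lines, groups them into batches of batchSize and
    // hands each batch to the sink. Malformed lines are skipped.
    // Returns the number of cells handed to the sink.
    static int load(Iterable<String> lines, int batchSize, Consumer<List<Cell>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Cell> batch = new ArrayList<>(batchSize);
        int loaded = 0;
        for (String line : lines) {
            String[] parts = line.split("\t", 3);
            if (parts.length != 3) {
                continue; // not "rowKey<TAB>column<TAB>value"
            }
            batch.add(new Cell(parts[0], parts[1], parts[2]));
            loaded++;
            if (batch.size() == batchSize) {
                sink.accept(batch);
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // flush the final partial batch
        }
        return loaded;
    }
}
```

Sending rows in batches rather than one at a time cuts the number of round trips to the cluster, which is usually the main cost when loading from a flat file.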

How to use HBase Java API with HDInsight HBase cluster, part 1

Recently we worked with a customer who was trying to use the HBase Java API to interact with an HDInsight HBase cluster. Having worked with the customer and tried to follow our existing documentation, we realized that it would be helpful to clarify a few things about HBase Java API connectivity to an HBase cluster and to show a simpler way of running a Java client application that uses the HBase Java APIs. In this blog, we explain the recommended steps for using the HBase Java APIs to interact with an HDInsight HBase cluster.

Choosing the Right NoSQL Database – MongoDB® Vs Cassandra Vs HBase.

NoSQL encompasses a wide range of database technologies that were developed in response to the surging volume of stored data. Relational databases struggle to cope with this volume and face agility challenges. This is where NoSQL databases come into play, and they are popular because of their features. The session covers the following topics to help you choose the right NoSQL database:

1. Traditional databases
2. Challenges with traditional databases
3. CAP Theorem
4. NoSQL to the rescue
5. A BASE system
6. Choose the right NoSQL database

Introduction to HBase, the NoSQL Database for Hadoop

HBase is called the Hadoop database because it is a NoSQL database that runs on top of Hadoop. It combines the scalability of Hadoop, by running on the Hadoop Distributed File System (HDFS), with real-time data access as a key/value store and the deep analytic capabilities of MapReduce. This article introduces HBase, describes how it organizes and manages data, and then demonstrates how to set up a local HBase environment and interact with data using the HBase shell.

Apache HBase is a NoSQL database that runs on top of Hadoop as a distributed and scalable big data store. This means that HBase can leverage the distributed processing paradigm of the Hadoop Distributed File System (HDFS) and benefit from Hadoop’s MapReduce programming model. It is meant to host large tables with billions of rows with potentially millions of columns and run across a cluster of commodity hardware. But beyond its Hadoop roots, HBase is a powerful database in its own right that blends real-time query capabilities with the speed of a key/value store and offline or batch processing via MapReduce. In short, HBase allows you to query for individual records as well as derive aggregate analytic reports across a massive amount of data.
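
To make the "individual records plus scans over many rows" idea concrete, here is a toy in-memory model of a sorted key/value table. It is a simplification and not how HBase is implemented: real HBase stores byte arrays, keeps multiple timestamped versions per cell and spreads rows across region servers. The class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model for illustration only; not the HBase client API.
final class SortedTableSketch {

    // row key -> (column name -> value), with both levels kept in sorted order
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    // Writes one cell; rows can be sparse, so each row holds only the columns it has.
    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Point lookup of a single cell; returns null when the row or column is absent.
    String get(String rowKey, String column) {
        NavigableMap<String, String> columns = rows.get(rowKey);
        return columns == null ? null : columns.get(column);
    }

    // Range scan over row keys: startRow inclusive, stopRow exclusive.
    SortedMap<String, NavigableMap<String, String>> scan(String startRow, String stopRow) {
        return Collections.unmodifiableSortedMap(rows.subMap(startRow, true, stopRow, false));
    }
}
```

Because rows are ordered by key, a point lookup and a scan over a contiguous key range are both cheap, which is why row key design matters so much in this kind of store.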

Hadoop Hangover: Introduction To Apache Bigtop and Installing Hive, HBase and Pig

In the previous post we learnt how easy it was to install Hadoop with Apache Bigtop!
We know it's not just Hadoop; there are sub-projects around the table! So, let's have a look at how to install Hive, HBase and Pig in this post.

Sr. Software Engineer at Bellevue, WA

Role: Sr. Software Engineer
Location: Bellevue, WA
We need only W2 candidates. We can also do H1 transfers.

Note: Send me your updated resume at kdinesh@prokarma.com

We are seeking engineers excited about building and scaling infrastructure to evolve our analytics platform. As an engineer with our advertising team you will work with technologies to process and organize our data in batch and real-time contexts. The ideal candidate is an autonomous engineer with extensive knowledge of the software development process and an understanding of distributed systems. Familiarity or past experience with Hadoop and related technologies is required.

• Build solutions to measure, forecast, and analyze ad revenue & performance to help product and business teams make informed decisions.
• Empower our data scientists and analysts to apply statistical methodologies and ask complex questions across large data sets.
• Develop, test, deploy, and support applications using Scala, Java, and Ruby.
• Play a key role in influencing and planning the direction of our data infrastructure.
• Create MapReduce jobs to transform and aggregate data in a complex data warehouse environment.
• Write custom code and leverage existing tools to collect data from multiple disparate systems.
• Implement processes to ensure data availability, accuracy, and integrity.

• Strong Java development skills.
• Hands-on development mentality with a willingness to troubleshoot and solve complex problems.
• Experience and knowledge of technologies within the Hadoop ecosystem, such as Cascading, Hive, Pig, HBase.
• Experience with AWS a plus.
• Understanding of Linux system internals, administration, and scripting.
• Interest in and willingness to work with machine learning, Mahout, R.

Dinesh Ram Kali
Human Resource Associate | National Staffing
Direct: +1 (402) 905 9212
222 South 15th Street, Suite 505N, Omaha, NE 68102
Follow: Dineshramitc.wordpress.com