Tag Archives: Apache Spark

Install and use Spark 1.0 on HDInsight clusters


You can install Spark on an HDInsight cluster while it is being deployed by using Script Action cluster customization. Script Action lets you run scripts that customize a cluster as it is being created. For more information, see Customize HDInsight cluster using script action. Original Post>>

An Apache Phoenix wrapper for Apache Spark


An Apache Phoenix RDD for Apache Spark. This RDD is intended to be an easier-to-use wrapper for Apache Phoenix within Apache Spark. It includes automatic conversion to a SchemaRDD for use within Spark SQL. The project is at the earliest stage of development and will change frequently. See more>>

Bayesian Machine Learning on Apache Spark


Markov Chain Monte Carlo methods are another example of a useful statistical computation for Big Data that Apache Spark handles well.

During my internship at Cloudera, I have been working on integrating PyMC with Apache Spark. PyMC is an open-source Python package that lets users easily apply Bayesian machine learning methods to their data, while Spark is a new, general framework for distributed computing on Hadoop. Together, they provide a framework for scalable Markov Chain Monte Carlo (MCMC) methods. In this blog post, I describe my work on distributing large-scale graphical models and MCMC computation. Read more>>