Tag Archives: BigML

Video: Data Science and (Unsupervised) Machine Learning with scikit-learn

Slides are available at http://opensource.datacratic.com/mtlp…

All code and materials are available at https://github.com/datacratic/mtlpy50

An Introduction to Unsupervised Learning via Scikit Learn


Unsupervised learning is arguably the most widely applicable subfield of machine learning, because it does not require labels in the dataset and the world itself offers an abundance of unlabeled data. Human beings and their actions are recorded more and more every day (through photographs on Instagram, health data from wearables, internet activity through cookies, and so on). Even the parts of our lives that are not yet digital will be recorded in the near future thanks to the Internet of Things. With data this diverse and unlabeled, unsupervised learning will become more and more important.

Not only can it be useful for dimensionality reduction of the feature set (as a preprocessing step), it can also serve as a feature extraction method. PCA (Principal Component Analysis) is probably one of the most used unsupervised learning algorithms (PCA is to unsupervised learning what linear regression is to regression). It can be used for dimensionality reduction to shrink the data, and because it tries to capture the variance while compressing the data, it can pick up interesting features, so it can also be used as a feature extraction method.

In this notebook, I will use PCA both to reduce the dimensionality of the dataset and to build our feature vector. This specific method is called Eigenfaces (because PCA extracts eigenvectors, which can be visualized as faces).
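To make the idea concrete, here is a minimal pure-Python sketch of PCA used as both a reduction and a feature extraction step. It is not the notebook's code: the function names, the toy 2-D points, and the use of power iteration (instead of a library call such as scikit-learn's PCA) are assumptions chosen to keep the example self-contained. It recovers only the first principal component.

```python
import math

def first_principal_component(rows, iterations=100):
    """Estimate the top principal component of `rows` by power iteration."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[d] for row in rows) / n for d in range(dims)]
    centered = [[row[d] - means[d] for d in range(dims)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    vec = [1.0] * dims  # arbitrary starting direction (an assumption of this sketch)
    for _ in range(iterations):
        # Repeatedly multiplying by the covariance matrix pulls the vector
        # toward the direction of greatest variance.
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return means, vec

def project(row, means, component):
    """Reduce one row to a single coordinate along the component."""
    return sum((row[d] - means[d]) * component[d] for d in range(len(row)))

# Toy data lying roughly along the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
means, component = first_principal_component(points)
features = [project(p, means, component) for p in points]
```

Each 2-D point is reduced to one number in `features`. With face images, the rows would be flattened pixel vectors and each component, reshaped to the image size, would be one "eigenface".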

Machine Learning Conference in SFO

Dan Mallinger, Data Science Practice Manager, Think Big Analytics @ MLconf ATL

Steffen Rendle, Research Scientist, Google @ MLconf SF

Lorien Pratt, Cofounder Chief Scientist, Quantellia @ MLconf SF

Lise Getoor, Professor, Computer Science, UC Santa Cruz @ MLconf SF

Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel @ MLconf SF

Arno Candel, Physicist & Hacker, 0xData @ MLconf SF

Andy Feng, Distinguished Architect, Yahoo @ MLconf SF

Anthony Bak, Principal Data Scientist and Mathematician, Ayasdi @ MLconf SF

Johann Schleier Smith, Co Founder and CTO, ifwe @ MLconf SF

Scott Clark, Software Engineer, Yelp @ MLconf SF

Ameet Talwalkar, Assistant Professor of Computer Science, UCLA @ MLconf SF Part 2

Ameet Talwalkar, Assistant Professor of Computer Science, UCLA @ MLconf SF Part 1

Applying deep learning and a RBM to MNIST using Python


In my last post, I mentioned that tiny, one-pixel shifts in images can kill the performance of your Restricted Boltzmann Machine + Classifier pipeline when utilizing raw pixels as feature vectors.

Today I am going to continue that discussion.

And more importantly, I’m going to provide some Python and scikit-learn code that you can use to apply Restricted Boltzmann Machines to your own image classification problems.

In-depth introduction to Machine Learning in 15 hours of expert videos


Chapter 1: Introduction (slides, playlist)

Chapter 2: Statistical Learning (slides, playlist)

Chapter 3: Linear Regression (slides, playlist)

Chapter 4: Classification (slides, playlist)

Chapter 5: Resampling Methods (slides, playlist)

Chapter 6: Linear Model Selection and Regularization (slides, playlist)

Chapter 7: Moving Beyond Linearity (slides, playlist)

Chapter 8: Tree-Based Methods (slides, playlist)

Chapter 9: Support Vector Machines (slides, playlist)

Chapter 10: Unsupervised Learning (slides, playlist)

Interviews (playlist)

Flatline, BigML’s dataset transformation and generation language

Flatline, a language for data generation and filtering

Flatline is a lispy language for the specification of values to be extracted or generated from an input dataset, using a finite sliding window of input rows.

In BigML, it is used either as a row filter specifier or as a field generator.

In the former case, the input consists of dataset rows on which a single, boolean expression is computed, and only those for which the result is true are kept in the output dataset.

When used to generate new datasets from given ones, a list of Flatline expressions is provided, each one generating either a value or a list of values. These are then concatenated to form the output rows, so each value represents a field in the generated dataset.
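The two roles can be illustrated with a small Python analogy. This is not Flatline syntax and not BigML's implementation: the functions `filter_rows` and `generate_rows`, the field names, and the sample rows are all hypothetical, and the sliding window of input rows is left out for brevity.

```python
rows = [
    {"age": 34, "income": 52000},
    {"age": 17, "income": 0},
    {"age": 51, "income": 87000},
]

def filter_rows(rows, predicate):
    """Row filter: keep only the rows for which the boolean expression is true."""
    return [row for row in rows if predicate(row)]

def generate_rows(rows, expressions):
    """Field generator: each expression yields a value or a list of values,
    and the results are concatenated into one output row."""
    output = []
    for row in rows:
        out_row = []
        for expression in expressions:
            value = expression(row)
            if isinstance(value, (list, tuple)):
                out_row.extend(value)   # a list contributes several fields
            else:
                out_row.append(value)   # a single value contributes one field
        output.append(out_row)
    return output

adults = filter_rows(rows, lambda r: r["age"] >= 18)
generated = generate_rows(
    rows,
    [lambda r: r["age"] * 12,                                  # one field
     lambda r: [r["income"] / 1000, r["income"] > 50000]],     # two fields
)
```

Here `adults` holds the two rows that pass the filter, and every row in `generated` has three fields: one from the first expression and two from the second.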



Flatline reference documentation by BigML Inc is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

All code in this repository is released under the Apache License 2.0.

Streaming Histograms for Clojure/Java



This project is an implementation of the streaming, one-pass histograms described in Ben-Haim’s Streaming Parallel Decision Trees. Inspired by Tyree’s Parallel Boosted Regression Trees, the histograms are extended so that they may track multiple values.

The histograms act as an approximation of the underlying dataset. They can be used for learning, visualization, discretization, or analysis. The histograms may be built independently and merged, making them convenient for parallel and distributed algorithms.

While the core of this library is implemented in Java, it includes a full-featured Clojure wrapper. This readme focuses on the Clojure interface, but Java developers can find documented methods in com.bigml.histogram.Histogram.


histogram is available as a Maven artifact from Clojars.

For Leiningen:

[bigml/histogram "4.0.0"]

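For Maven:

<dependency>
  <groupId>bigml</groupId>
  <artifactId>histogram</artifactId>
  <version>4.0.0</version>
</dependency>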

In the following examples we use Incanter to generate data and for charting.

The simplest way to use a histogram is to create one and then insert! points. In the example below, ex/normal-data refers to a sequence of 200K samples from a normal distribution (mean 0, variance 1).

user> (ns examples
        (:use [bigml.histogram.core])
        (:require (bigml.histogram.test [examples :as ex])))
examples> (def hist (reduce insert! (create) ex/normal-data))

You can use the sum fn to find the approximate number of points less than a given threshold:

examples> (sum hist 0)

The density fn gives us an estimate of the point density at the given location:

examples> (density hist 0)

The uniform fn returns a list of points that separate the distribution into equal population areas. Here’s an example that produces quartiles:

examples> (uniform hist 4)
(-0.66904 0.00229 0.67605)

Arbitrary percentiles can be found using percentiles:

examples> (percentiles hist 0.5 0.95 0.99)
{0.5 0.00229, 0.95 1.63853, 0.99 2.31390}

We can plot the sums and density estimates as functions. The red line represents the sum, the blue line represents the density. If we normalize the values (dividing by 200K), these lines approximate the cumulative distribution function and the probability density function for the normal distribution.

examples> (ex/sum-density-chart hist) ;; also see (ex/cdf-pdf-chart hist)

Histogram from normal distribution

The histogram approximates distributions using a constant number of bins. This bin limit is a parameter when creating a histogram (:bins, defaults to 64). A bin contains a :count of the points within the bin along with the :mean for the values in the bin. The edges of the bin aren’t captured. Instead the histogram assumes that points of a bin are distributed with half the points less than the bin mean and half greater. This explains the fractional sum in the example below:

examples> (def hist (-> (create :bins 3)
                        (insert! 1)
                        (insert! 2)
                        (insert! 3)))
examples> (bins hist)
({:mean 1.0, :count 1} {:mean 2.0, :count 1} {:mean 3.0, :count 1})
examples> (sum hist 2)
1.5
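The half-below, half-above assumption can be made concrete with a short Python sketch of the sum procedure from the Ben-Haim paper that the library is based on. It is not the library's code: `approx_sum` and the (mean, count) tuple layout are illustrative names, and the sketch only handles thresholds between the first and last bin means.

```python
def approx_sum(bins, point):
    """Estimate how many inserted points are <= `point`.

    `bins` is a list of (mean, count) pairs sorted by mean.
    """
    if not bins[0][0] <= point < bins[-1][0]:
        raise ValueError("this sketch only handles points between "
                         "the first and last bin means")
    total = 0.0
    for (mean_i, count_i), (mean_j, count_j) in zip(bins, bins[1:]):
        if mean_i <= point < mean_j:
            # Fraction of the way from bin i's mean to bin j's mean.
            frac = (point - mean_i) / (mean_j - mean_i)
            # Linearly interpolated "height" at the threshold.
            count_at_point = count_i + (count_j - count_i) * frac
            # Half of bin i lies below its own mean, plus the trapezoid
            # between bin i's mean and the threshold.
            return total + count_i / 2.0 + (count_i + count_at_point) / 2.0 * frac
        total += count_i  # bins entirely below the threshold count in full

example_bins = [(1.0, 1), (2.0, 1), (3.0, 1)]
result = approx_sum(example_bins, 2)
```

For the three-bin histogram above, the bin at 1.0 contributes its full count and the bin at 2.0 contributes half of its count, which is where the fractional result comes from.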

As mentioned earlier, the bin limit constrains the number of unique bins a histogram can use to capture a distribution. The histogram above was created with a limit of just three bins. When we add a fourth unique value it will create a fourth bin and then merge the nearest two.

examples> (bins (insert! hist 0.5))
({:mean 0.75, :count 2} {:mean 2.0, :count 1} {:mean 3.0, :count 1})
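The insert-then-merge behavior can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's implementation: `insert` is a hypothetical pure function over a sorted list of [mean, count] pairs, whereas the library mutates the histogram in place and supports options such as gap weighting that are ignored here.

```python
def insert(bins, point, max_bins):
    """Return a new sorted list of [mean, count] bins with `point` added."""
    bins = [list(b) for b in bins]
    for b in bins:
        if b[0] == point:          # an existing bin already sits at this value
            b[1] += 1
            return bins
    bins.append([float(point), 1])  # otherwise the point gets its own bin
    bins.sort()
    if len(bins) > max_bins:
        # Find the adjacent pair of bins whose means are closest together.
        i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (m1, c1), (m2, c2) = bins[i], bins[i + 1]
        # Replace the pair with one bin at the count-weighted mean.
        bins[i:i + 2] = [[(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2]]
    return bins

hist = []
for value in [1, 2, 3]:
    hist = insert(hist, value, max_bins=3)
hist = insert(hist, 0.5, max_bins=3)
```

After the fourth insert, the bins at 0.5 and 1.0 are the closest pair, so they collapse into a single bin with mean 0.75 and count 2, matching the output shown above.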

A larger bin limit means a higher quality picture of the distribution, but it also means a larger memory footprint. In the chart below, the red line represents a histogram with 8 bins and the blue line represents 64 bins.

examples> (ex/multi-pdf-chart
           [(reduce insert! (create :bins 8) ex/mixed-normal-data)
            (reduce insert! (create :bins 64) ex/mixed-normal-data)])

8 and 64 bins histograms

Another option when creating a histogram is to use gap weighting. When :gap-weighted? is true, the histogram is encouraged to spend more of its bins capturing the densest areas of the distribution. For the normal distribution that means better resolution near the mean and less resolution near the tails. The chart below shows a histogram without gap weighting in blue and with gap weighting in red. Near the center of the distribution, red uses more bins and better captures the Gaussian distribution’s true curve.

examples> (ex/multi-pdf-chart
           [(reduce insert! (create :bins 8 :gap-weighted? true)
                    ex/normal-data)
            (reduce insert! (create :bins 8 :gap-weighted? false)
                    ex/normal-data)])

Gap weighting vs. No gap weighting


A strength of the histograms is their ability to merge with one another. Histograms can be built on separate data streams and then combined to give a better overall picture.

In this example, the blue line shows a density distribution from a histogram after merging 300 noisy histograms. The red shows one of the original histograms:

examples> (let [samples (partition 1000 ex/mixed-normal-data)
                hists (map #(reduce insert! (create) %) samples)
                merged (reduce merge! (create) (take 300 hists))]
            (ex/multi-pdf-chart [(first hists) merged]))
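Merging follows the same idea as inserting: pool the bins of both histograms, then merge the closest neighbors until the bin limit is met again. The Python sketch below illustrates this under simplifying assumptions. `merge` and the [mean, count] layout are hypothetical names, and the library's merge! handles targets and other options that are omitted here.

```python
def merge(bins_a, bins_b, max_bins):
    """Combine two lists of [mean, count] bins into one of at most `max_bins`."""
    # Pool the bins, adding up counts for bins that share the same mean.
    counts = {}
    for mean, count in list(bins_a) + list(bins_b):
        counts[mean] = counts.get(mean, 0) + count
    combined = sorted([mean, count] for mean, count in counts.items())
    while len(combined) > max_bins:
        # Merge the two neighboring bins with the smallest gap between means.
        i = min(range(len(combined) - 1),
                key=lambda k: combined[k + 1][0] - combined[k][0])
        (m1, c1), (m2, c2) = combined[i], combined[i + 1]
        combined[i:i + 2] = [[(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2]]
    return combined

left = [[1.0, 2], [5.0, 1]]
right = [[1.5, 2], [9.0, 3]]
merged = merge(left, right, max_bins=3)
```

Because a merge only needs the bins and not the original points, histograms built on separate machines or streams can be combined cheaply, and the total count is preserved.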

Merged histograms


While a simple histogram is nice for capturing the distribution of a single variable, it’s often important to capture the correlation between variables. To that end, the histograms can track a second variable called the target.

The target may be either numeric or categorical. The insert! fn is overloaded to accept either type of target. Each histogram bin will contain information summarizing the target. For numeric targets the sum and sum-of-squares are tracked. For categoricals, a map of counts is maintained.

examples> (-> (create)
              (insert! 1 9)
              (insert! 2 8)
              (insert! 3 7)
              (insert! 3 6)
              (bins))
({:target {:sum 9.0, :sum-squares 81.0, :missing-count 0.0},
  :mean 1.0,
  :count 1}
 {:target {:sum 8.0, :sum-squares 64.0, :missing-count 0.0},
  :mean 2.0,
  :count 1}
 {:target {:sum 13.0, :sum-squares 85.0, :missing-count 0.0},
  :mean 3.0,
  :count 2})
examples> (-> (create)
              (insert! 1 :a)
              (insert! 2 :b)
              (insert! 3 :c)
              (insert! 3 :d)
              (bins))
({:target {:counts {:a 1.0}, :missing-count 0.0},
  :mean 1.0,
  :count 1}
 {:target {:counts {:b 1.0}, :missing-count 0.0},
  :mean 2.0,
  :count 1}
 {:target {:counts {:d 1.0, :c 1.0}, :missing-count 0.0},
  :mean 3.0,
  :count 2})
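As a rough illustration of what the numeric target summary makes possible, the Python sketch below tracks a count, a sum, and a sum of squares per bin and derives the target mean and variance from them. The names are hypothetical and the bin limit is omitted, so every distinct input value keeps its own bin. It is not the library's code.

```python
def insert_numeric(bins, point, target):
    """Add one (point, numeric target) pair to `bins`, a dict keyed by bin mean."""
    summary = bins.setdefault(
        point, {"mean": float(point), "count": 0, "sum": 0.0, "sum_squares": 0.0})
    summary["count"] += 1
    summary["sum"] += target
    summary["sum_squares"] += target * target

def target_mean_and_variance(summary):
    """Recover the target's mean and (population) variance from the running totals."""
    mean = summary["sum"] / summary["count"]
    variance = summary["sum_squares"] / summary["count"] - mean * mean
    return mean, variance

bins = {}
for point, target in [(1, 9), (2, 8), (3, 7), (3, 6)]:
    insert_numeric(bins, point, target)
stats_for_3 = target_mean_and_variance(bins[3])
```

The bin at 3 ends up with a target sum of 13.0 and a sum of squares of 85.0, as in the first example above. Both totals are additive, so they can be combined by simple addition when bins are merged.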

Mixing target types isn’t allowed:

examples> (-> (create)
              (insert! 1 :a)
              (insert! 2 999))
Can't mix insert types
  [Thrown class com.bigml.histogram.MixedInsertException]

insert-numeric! and insert-categorical! allow target types to be set explicitly:

examples> (-> (create)
              (insert-categorical! 1 1)
              (insert-categorical! 1 2)
              (bins))
({:target {:counts {2 1.0, 1 1.0}, :missing-count 0.0}, :mean 1.0, :count 2})

The extended-sum fn works similarly to sum, but returns a result that includes the target information:

examples> (-> (create)
              (insert! 1 :a)
              (insert! 2 :b)
              (insert! 3 :c)
              (extended-sum 2))
{:sum 1.5, :target {:counts {:c 0.0, :b 0.5, :a 1.0}, :missing-count 0.0}}

The average-target fn returns the average target value given a point. To illustrate, the following histogram captures a dataset where the input field is a sample from the normal distribution while the target value is the sine of the input. The density is in red and the average target value is in blue:

examples> (def make-y (fn [x] (Math/sin x)))
examples> (def hist (let [target-data (map (fn [x] [x (make-y x)])
                                           ex/normal-data)]
                      (reduce (fn [h [x y]] (insert! h x y))
                              (create)
                              target-data)))
examples> (ex/pdf-target-chart hist)

Numeric target

Continuing with the same histogram, we can see that average-target produces values close to original target:

examples> (def view-target (fn [x] {:actual (make-y x)
                                    :approx (:sum (average-target hist x))}))
examples> (view-target 0)
{:actual 0.0, :approx -0.00051}
examples>  (view-target (/ Math/PI 2))
{:actual 1.0, :approx 0.9968169965429206}
examples> (view-target Math/PI)
{:actual 0.0, :approx 0.00463}

Missing Values

Information about missing values is captured whenever the input field or the target is nil. The missing-bin fn retrieves information summarizing the instances with a missing input. For a basic histogram, that is simply the count:

examples> (-> (create)
              (insert! nil)
              (insert! 7)
              (insert! nil)
              (missing-bin))
{:count 2}

For a histogram with a target, the missing-bin includes target information:

examples> (-> (create)
              (insert! nil :a)
              (insert! 7 :b)
              (insert! nil :c)
              (missing-bin))
{:target {:counts {:a 1.0, :c 1.0}, :missing-count 0.0}, :count 2}

Targets can also be missing, in which case the target missing-count is incremented:

examples> (-> (create)
              (insert! nil :a)
              (insert! 7 :b)
              (insert! nil nil)
              (missing-bin))
{:target {:counts {:a 1.0}, :missing-count 1.0}, :count 2}
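The bookkeeping for missing values can be pictured with a small Python sketch. It assumes a simplified layout with hypothetical names: a dict of regular bins keyed by input value, one extra summary for rows whose input is missing, and categorical targets only. None plays the role of nil.

```python
from collections import Counter

def new_summary():
    return {"count": 0, "target_counts": Counter(), "target_missing": 0}

def insert(hist, point, target):
    """Route a row to the missing bin when its input is None, and count
    missing targets separately from real categories."""
    if point is None:
        summary = hist["missing"]
    else:
        summary = hist["bins"].setdefault(point, new_summary())
    summary["count"] += 1
    if target is None:
        summary["target_missing"] += 1
    else:
        summary["target_counts"][target] += 1

hist = {"bins": {}, "missing": new_summary()}
for point, target in [(None, "a"), (7, "b"), (None, None)]:
    insert(hist, point, target)
missing = hist["missing"]
```

After these three inserts, the missing bin holds two rows: one with target "a" and one whose target is itself missing, mirroring the last example above.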

Array-backed Categorical Targets

By default a histogram with categorical targets stores the category counts as Java HashMaps. Building and merging HashMaps can be expensive. Alternatively the category counts can be backed by an array. This can give better performance but requires the set of possible categories to be declared when the histogram is created. To do this, set the :categories parameter:

examples> (def categories (map (partial str "c") (range 50)))
examples> (def data (vec (repeatedly 100000
                                     #(vector (rand) (str "c" (rand-int 50))))))
examples> (doseq [hist [(create) (create :categories categories)]]
            (time (reduce (fn [h [x y]] (insert! h x y))
                          hist
                          data)))
"Elapsed time: 1295.402 msecs"
"Elapsed time: 516.72 msecs"

Group Targets

Group targets allow the histogram to track multiple targets at the same time. Each bin contains a sequence of target information. Optionally, the target types in the group can be declared when creating the histogram. Declaring the types on creation allows the targets to be missing in the first insert:

examples> (-> (create :group-types [:categorical :numeric])
              (insert! 1 [:a nil])
              (insert! 2 [:b 8])
              (insert! 3 [:c 7])
              (insert! 1 [:d 6])
              (bins))
({:target
  ({:counts {:d 1.0, :a 1.0}, :missing-count 0.0}
   {:sum 6.0, :sum-squares 36.0, :missing-count 1.0}),
  :mean 1.0,
  :count 2}
 {:target
  ({:counts {:b 1.0}, :missing-count 0.0}
   {:sum 8.0, :sum-squares 64.0, :missing-count 0.0}),
  :mean 2.0,
  :count 1}
 {:target
  ({:counts {:c 1.0}, :missing-count 0.0}
   {:sum 7.0, :sum-squares 49.0, :missing-count 0.0}),
  :mean 3.0,
  :count 1})


There are multiple ways to render the charts, see examples.clj. An example of rendering a single function, namely cumulative probability:

examples> (def hist (reduce hst/insert! (hst/create) [1 1 2 3 4 4 4 5]))
examples> (let [{:keys [min max]} (hst/bounds hist)]
            (core/view (charts/function-plot (hst/cdf hist) min max)))

(core and charts are Incanter namespaces.)

To render multiple functions on the same chart, you would use add-function with the result of function-plot:

examples> (let [{:keys [min max]} (hst/bounds hist)]
            (core/view (-> (charts/function-plot (hst/cdf hist) min max :legend true)
                           (charts/add-function (hst/pdf hist) min max))))

Performance-related concerns

Freezing a Histogram

While the ability to adapt to non-stationary data streams is a strength of the histograms, it is also computationally expensive. If your data stream is stationary, you can increase the histogram’s performance by setting the :freeze parameter. After the number of inserts into the histogram has exceeded the :freeze parameter, the histogram bins are locked into place. As the bin means no longer shift, inserts become computationally cheap. However, the quality of the histogram can suffer if the :freeze parameter is too small.
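To see why a frozen histogram is cheap to update, consider the Python sketch below. It assumes that a frozen insert simply increments the count of the bin whose mean is nearest to the new point. That assignment rule, the function name, and the parallel-list layout are assumptions made for illustration, not a description of the library's internals.

```python
import bisect

def insert_frozen(means, counts, point):
    """Add `point` to the nearest bin without moving any bin mean.

    `means` is a sorted list of bin means and `counts` the matching counts.
    """
    i = bisect.bisect_left(means, point)   # binary search, O(log n)
    if i == 0:
        nearest = 0
    elif i == len(means):
        nearest = len(means) - 1
    else:
        # Choose whichever neighbor is closer to the point.
        nearest = i if means[i] - point < point - means[i - 1] else i - 1
    counts[nearest] += 1                    # no new bin, no merge

means = [0.75, 2.0, 3.0]
counts = [2, 1, 1]
for value in [2.2, 10, -5]:
    insert_frozen(means, counts, value)
```

No bin is created or merged and no mean is recomputed, so each insert is a lookup plus an increment. The cost is that bins fixed too early may describe the distribution poorly.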

examples> (time (reduce insert! (create) ex/normal-data))
"Elapsed time: 333.5 msecs"
examples> (time (reduce insert! (create :freeze 1024) ex/normal-data))
"Elapsed time: 166.9 msecs"


There are two implementations of bin reservoirs (which support the insert! and merge! functions). Either of the two implementations, :tree and :array, can be explicitly selected with the :reservoir parameter. The :tree option is useful for histograms with many bins, as the insert time scales at O(log n) with respect to the number of bins. The :array option is good for a small number of bins, since inserts are O(n) but there’s a smaller overhead. If :reservoir is left unspecified, then :array is used for histograms with <= 256 bins and :tree is used for anything larger.

examples> (time (reduce insert! (create :bins 16 :reservoir :tree)
                        ex/normal-data))
"Elapsed time: 554.478 msecs"
examples> (time (reduce insert! (create :bins 16 :reservoir :array)
                        ex/normal-data))
"Elapsed time: 183.532 msecs"

Insert times using reservoir defaults:

timing chart


Copyright (C) 2013 BigML Inc.

Distributed under the Apache License, Version 2.0.