
Spark & R: data frame operations with SparkR
// Codementor Data Science Tutorials

In this third tutorial (see the previous one) we will introduce more advanced concepts about SparkSQL with R that you can find in the SparkR documentation, applied to the 2013 American Community Survey housing data. These concepts relate to data frame manipulation, including data slicing, summary statistics, and aggregations. We will use them in combination with ggplot2 visualisations. We will explain what we do at every step but, if you want to go deeper into ggplot2 for exploratory data analysis, I took this Udacity on-line course in the past and I highly recommend it!

All the code for this series of Spark and R tutorials can be found in its own GitHub repository. Go there and make it yours.

Creating a SparkSQL context and loading data

In order to explore our data, we first need to load it into a SparkSQL data frame, which requires initialising a SparkSQL context. Before that, we need to set up some environment variables and library paths as follows. Remember to replace the value assigned to SPARK_HOME with your Spark home folder.

# Set Spark home and R libs
Sys.setenv(SPARK_HOME='/home/cluster/spark-1.5.0-bin-hadoop2.6')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))

Now we can load the SparkR library as follows.

library(SparkR)
Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    filter, na.omit

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, subset, summary, table, transform

And now we can initialise the Spark context as in the official documentation. In our case we use a standalone Spark cluster with one master and seven workers. If you are running Spark in local mode, just use master='local'. Additionally, we require a Spark package from Databricks to read CSV files (more on this in the previous notebook).

sc <- sparkR.init(master='spark://169.254.206.2:7077', sparkPackages="com.databricks:spark-csv_2.11:1.2.0")
Launching java with spark-submit command /home/cluster/spark-1.5.0-bin-hadoop2.6/bin/spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 sparkr-shell /tmp/RtmpfRY7gu/backend_port4c2413c05644

And finally we can start the SparkSQL context as follows.

sqlContext <- sparkRSQL.init(sc)

Now that we have our SparkSQL context ready, we can use it to load our CSV data into data frames. We have downloaded our 2013 American Community Survey dataset files in notebook 0, so they should be stored locally. Remember to set the right path for your data files in the first line, ours is /nfs/data/2013-acs/ss13husa.csv.

housing_a_file_path <- file.path('', 'nfs','data','2013-acs','ss13husa.csv')
housing_b_file_path <- file.path('', 'nfs','data','2013-acs','ss13husb.csv')

Now let’s read the data into a SparkSQL DataFrame. We need to pass four parameters in addition to the sqlContext:

  • The file path.
  • header='true' since our csv files have a header with the column names.
  • Indicate that we want the library to infer the schema.
  • And the source type (the Databricks package in this case).

The housing data comes in two separate files (parts a and b), which we need to combine into a single data frame.

housing_a_df <- read.df(sqlContext, housing_a_file_path, header='true', source = "com.databricks.spark.csv", inferSchema='true')
housing_b_df <- read.df(sqlContext, housing_b_file_path, header='true', source = "com.databricks.spark.csv", inferSchema='true')
housing_df <- rbind(housing_a_df, housing_b_df)

Let’s check that we have everything there by counting the rows and listing a few of them.

nrow(housing_df)
1476313
head(housing_df)

| . | RT | SERIALNO | DIVISION | PUMA | REGION | ST | ADJHSG | ADJINC | WGTP | NP | ⋯ | wgtp71 | wgtp72 | wgtp73 | wgtp74 | wgtp75 | wgtp76 | wgtp77 | wgtp78 | wgtp79 | wgtp80 |
|---|----|----------|----------|------|--------|----|--------|--------|------|----|---|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 1 | H | 84 | 6 | 2600 | 3 | 1 | 1000000 | 1007549 | 0 | 1 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | H | 154 | 6 | 2500 | 3 | 1 | 1000000 | 1007549 | 51 | 4 | ⋯ | 86 | 53 | 59 | 84 | 49 | 15 | 15 | 20 | 50 | 16 |
| 3 | H | 156 | 6 | 1700 | 3 | 1 | 1000000 | 1007549 | 449 | 1 | ⋯ | 161 | 530 | 601 | 579 | 341 | 378 | 387 | 421 | 621 | 486 |
| 4 | H | 160 | 6 | 2200 | 3 | 1 | 1000000 | 1007549 | 16 | 3 | ⋯ | 31 | 24 | 33 | 7 | 7 | 13 | 18 | 23 | 23 | 5 |
| 5 | H | 231 | 6 | 2400 | 3 | 1 | 1000000 | 1007549 | 52 | 1 | ⋯ | 21 | 18 | 37 | 49 | 103 | 38 | 49 | 51 | 46 | 47 |
| 6 | H | 286 | 6 | 900 | 3 | 1 | 1000000 | 1007549 | 76 | 1 | ⋯ | 128 | 25 | 68 | 66 | 80 | 26 | 66 | 164 | 88 | 24 |

Giving ggplot2 a try

Before we dive into data selection and aggregations, let’s try plotting something using ggplot2. We will use this library all the time during our exploratory data analysis, so we had better make sure we know how to use it with SparkSQL results.

# if it isn't installed, run install.packages("ggplot2")
# from the R console, specifying a CRAN mirror
library(ggplot2)

What happens if we try to pass our SparkSQL DataFrame directly to ggplot?

c <- ggplot(data=housing_df, aes(x=factor(REGION)))
Error: ggplot2 doesn't know how to deal with data of class DataFrame

Obviously it doesn’t work that way. The ggplot function doesn’t know how to deal with distributed data frames (the Spark ones). Instead, we need to collect the data locally as follows.

housing_region_df_local <- collect(select(housing_df,"REGION"))

Let’s have a look at what we got.

str(housing_region_df_local)
'data.frame':	1476313 obs. of  1 variable:
 $ REGION: int  3 3 3 3 3 3 3 3 3 3 ...

That is, when we collect results from a SparkSQL DataFrame we get a regular R data.frame. This is very convenient, since we can manipulate it as we need. For example, let’s convert those int values we have for REGION into a factor with the proper names. From our data dictionary we get the meaning of the REGION variable, as well as the different values it can take.

housing_region_df_local$REGION <- factor(
    x=housing_region_df_local$REGION,
    levels=c(1,2,3,4,9),
    labels=c('Northeast','Midwest','South','West','Puerto Rico')
)

And now we are ready to create the ggplot object as follows.

c <- ggplot(data=housing_region_df_local, aes(x=factor(REGION)))

And now we can give the plot a proper representation (e.g. a bar plot).

c + geom_bar() + xlab("Region")

[Bar plot: number of survey samples per region]

We will always follow the same approach: first we perform some sort of operation on the SparkSQL DataFrame object (e.g. a selection), then we collect the results, and finally we prepare the resulting data.frame to be represented using ggplot2. But consider what we just did: we represented all the samples for a given column. That is almost a million and a half data points, which pushes our local R environment and ggplot2 quite hard. In the case of the bar plot we didn’t really experience any problems, because it aggregates the data internally, but we would struggle to do scatter plots this way. The preferred visualisations will be those built from the results of aggregations on SparkSQL DataFrames, as we will see in further sections.
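As a minimal sketch of this aggregate-then-collect pattern (it uses the agg, groupBy and n functions that are only introduced later in this tutorial, so treat the exact calls as a preview rather than something we have already covered):

```r
# Aggregate on the cluster, collect the small result, then plot locally
region_counts <- collect(agg(
    groupBy(housing_df, "REGION"),
    COUNT = n(housing_df$REGION)
))
# region_counts is now a small local data.frame, one row per region,
# so ggplot2 only ever sees a handful of points
ggplot(data=region_counts, aes(x=factor(REGION), y=COUNT)) +
    geom_bar(stat="identity") + xlab("Region")
```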

Data selection

In this section we will demonstrate how to select data from a SparkSQL DataFrame object using SparkR.

select and selectExpr

We already made use of the select function, but let’s have a look at the documentation.

?select

We see there that we have two flavours of select: one that takes a list of column names (the one we have used so far), and another one, selectExpr, to which we pass a string containing a SQL expression.

Of course, we can pass more than one column name.

collect(select(housing_df, "REGION", "VALP"))
. REGION VALP
1 3 NA
2 3 25000
3 3 80000
4 3 NA
5 3 NA
6 3 18000
7 3 390000
8 3 120000
9 3 NA
10 3 160000
11 3 NA
12 3 NA
13 3 NA
14 3 NA
15 3 40000
16 3 60000
17 3 60000
18 3 NA
19 3 250000
20 3 110000
21 3 190000
22 3 160000
23 3 NA
24 3 750000
25 3 300000
26 3 NA
27 3 NA
28 3 100000
29 3 20000
30 3 70000
⋮
1476284 4 130000
1476285 4 NA
1476286 4 NA
1476287 4 150000
1476288 4 110000
1476289 4 NA
1476290 4 200000
1476291 4 NA
1476292 4 250000
1476293 4 NA
1476294 4 500000
1476295 4 7100
1476296 4 175000
1476297 4 500000
1476298 4 NA
1476299 4 25000
1476300 4 150000
1476301 4 240000
1476302 4 NA
1476303 4 NA
1476304 4 NA
1476305 4 NA
1476306 4 12000
1476307 4 56000
1476308 4 99000
1476309 4 NA
1476310 4 NA
1476311 4 150000
1476312 4 NA
1476313 4 NA

When passing column names we can also use the R notation data.frame$column.name, so familiar to R users. How does this notation compare to the name-based one in terms of performance?

system.time( collect(select(housing_df, housing_df$VALP)) )
   user  system elapsed
 30.086   0.032  48.046
system.time( collect(select(housing_df, "VALP")) )
   user  system elapsed
 28.766   0.012  46.358

When using the $ notation, we can even pass expressions as follows.

head(select(housing_df, housing_df$VALP / 100))
. (VALP / 100.0)
1 NA
2 250
3 800
4 NA
5 NA
6 180

So what’s the point of selectExpr then? Well, we can pass more complex SQL expressions. For example.

head(selectExpr(housing_df, "(VALP / 100) as VALP_by_100"))
. VALP_by_100
1 NA
2 250
3 800
4 NA
5 NA
6 180

filter, subset, and sql

The previous functions allow us to select columns. In order to select rows, we will use filter and subset. Call up the docs as follows if you want to know more about them.

?filter

With filter we keep the rows of a DataFrame that satisfy a given condition passed as an argument. We can define conditions as SQL condition strings using column names, or by using the $ notation.

For example, continuing with our property values column, let’s select the rows with property values higher than 1000, keeping the REGION column as well.

system.time( housing_valp_1000 <- collect(filter(select(housing_df, "REGION", "VALP"), "VALP > 1000")) )
   user  system elapsed
 38.043   0.184  56.259
housing_valp_1000
. REGION VALP
1 3 25000
2 3 80000
3 3 18000
4 3 390000
5 3 120000
6 3 160000
7 3 40000
8 3 60000
9 3 60000
10 3 250000
11 3 110000
12 3 190000
13 3 160000
14 3 750000
15 3 300000
16 3 100000
17 3 20000
18 3 70000
19 3 125000
20 3 1843000
21 3 829000
22 3 84000
23 3 150000
24 3 130000
25 3 90000
26 3 220000
27 3 225000
28 3 65000
29 3 80000
30 3 135000
⋮
847115 4 120000
847116 4 200000
847117 4 365000
847118 4 600000
847119 4 100000
847120 4 124000
847121 4 200000
847122 4 160000
847123 4 250000
847124 4 285000
847125 4 100000
847126 4 205000
847127 4 189000
847128 4 350000
847129 4 130000
847130 4 150000
847131 4 110000
847132 4 200000
847133 4 250000
847134 4 500000
847135 4 7100
847136 4 175000
847137 4 500000
847138 4 25000
847139 4 150000
847140 4 240000
847141 4 12000
847142 4 56000
847143 4 99000
847144 4 150000
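The same filter can also be written with the $ notation instead of a SQL condition string. A sketch of the equivalent call (the variable name here is ours, not from the original output):

```r
# Equivalent row filtering using the $ column notation
housing_valp_1000_dollar <- collect(select(
    filter(housing_df, housing_df$VALP > 1000),
    "REGION", "VALP"
))
```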

Take into account that we can also perform the previous selection and filtering by using SQL queries against the SparkSQL DataFrame. In order to do that, we first need to register the table as follows.

registerTempTable(housing_df, "housing")

And then we can use the SparkR sql function with the sqlContext as follows.

system.time( housing_valp_1000_sql <- collect(sql(sqlContext, "SELECT REGION, VALP FROM housing WHERE VALP >= 1000")) )
   user  system elapsed
 38.862   0.008  56.747
housing_valp_1000_sql
. REGION VALP
1 3 25000
2 3 80000
3 3 18000
4 3 390000
5 3 120000
6 3 160000
7 3 40000
8 3 60000
9 3 60000
10 3 250000
11 3 110000
12 3 190000
13 3 160000
14 3 750000
15 3 300000
16 3 100000
17 3 20000
18 3 70000
19 3 125000
20 3 1843000
21 3 829000
22 3 84000
23 3 150000
24 3 130000
25 3 90000
26 3 220000
27 3 225000
28 3 65000
29 3 80000
30 3 135000
⋮
848420 4 120000
848421 4 200000
848422 4 365000
848423 4 600000
848424 4 100000
848425 4 124000
848426 4 200000
848427 4 160000
848428 4 250000
848429 4 285000
848430 4 100000
848431 4 205000
848432 4 189000
848433 4 350000
848434 4 130000
848435 4 150000
848436 4 110000
848437 4 200000
848438 4 250000
848439 4 500000
848440 4 7100
848441 4 175000
848442 4 500000
848443 4 25000
848444 4 150000
848445 4 240000
848446 4 12000
848447 4 56000
848448 4 99000
848449 4 150000

This last method can be clearer and more flexible when we need to perform complex queries with multiple conditions; combinations of filter and select can get verbose compared with the clarity of the SQL lingua franca. (Notice also that the SQL query above used VALP >= 1000 instead of VALP > 1000, which explains its slightly larger row count.)
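For instance, a multi-condition query reads quite naturally in SQL. The following sketch assumes the housing table registered earlier and uses the BDSP (bedrooms) column, which is only introduced later in this tutorial; the thresholds are arbitrary, for illustration only:

```r
# Sketch: expensive properties with 4+ bedrooms in the South (REGION = 3)
head(sql(sqlContext,
    "SELECT REGION, BDSP, VALP
     FROM housing
     WHERE VALP > 500000 AND BDSP >= 4 AND REGION = 3"))
```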

But there is another way of subsetting data frames in a functional style that is very familiar to R users: the subset function. Just have a look at the help page.

?subset

And we use it as follows.

system.time( housing_valp_1000_subset <- collect(subset( housing_df, housing_df$VALP>1000, c("REGION","VALP") )) )
   user  system elapsed
 39.751   0.020  57.425
housing_valp_1000_subset
. REGION VALP
1 3 25000
2 3 80000
3 3 18000
4 3 390000
5 3 120000
6 3 160000
7 3 40000
8 3 60000
9 3 60000
10 3 250000
11 3 110000
12 3 190000
13 3 160000
14 3 750000
15 3 300000
16 3 100000
17 3 20000
18 3 70000
19 3 125000
20 3 1843000
21 3 829000
22 3 84000
23 3 150000
24 3 130000
25 3 90000
26 3 220000
27 3 225000
28 3 65000
29 3 80000
30 3 135000
⋮
847115 4 120000
847116 4 200000
847117 4 365000
847118 4 600000
847119 4 100000
847120 4 124000
847121 4 200000
847122 4 160000
847123 4 250000
847124 4 285000
847125 4 100000
847126 4 205000
847127 4 189000
847128 4 350000
847129 4 130000
847130 4 150000
847131 4 110000
847132 4 200000
847133 4 250000
847134 4 500000
847135 4 7100
847136 4 175000
847137 4 500000
847138 4 25000
847139 4 150000
847140 4 240000
847141 4 12000
847142 4 56000
847143 4 99000
847144 4 150000

What’s more, thanks to SparkR we can use the [] notation, familiar from R data.frame objects, with SparkSQL DataFrames. For example.

system.time( housing_valp_1000_bracket <- collect( housing_df[housing_df$VALP>1000, c("REGION","VALP")] ) )
   user  system elapsed
 39.090   0.013  56.381
housing_valp_1000_bracket
. REGION VALP
1 3 25000
2 3 80000
3 3 18000
4 3 390000
5 3 120000
6 3 160000
7 3 40000
8 3 60000
9 3 60000
10 3 250000
11 3 110000
12 3 190000
13 3 160000
14 3 750000
15 3 300000
16 3 100000
17 3 20000
18 3 70000
19 3 125000
20 3 1843000
21 3 829000
22 3 84000
23 3 150000
24 3 130000
25 3 90000
26 3 220000
27 3 225000
28 3 65000
29 3 80000
30 3 135000
⋮
847115 4 120000
847116 4 200000
847117 4 365000
847118 4 600000
847119 4 100000
847120 4 124000
847121 4 200000
847122 4 160000
847123 4 250000
847124 4 285000
847125 4 100000
847126 4 205000
847127 4 189000
847128 4 350000
847129 4 130000
847130 4 150000
847131 4 110000
847132 4 200000
847133 4 250000
847134 4 500000
847135 4 7100
847136 4 175000
847137 4 500000
847138 4 25000
847139 4 150000
847140 4 240000
847141 4 12000
847142 4 56000
847143 4 99000
847144 4 150000

That is, we have up to four different ways of subsetting a data frame with SparkR. We can plot any of the previous resulting data frames with a ggplot2 chart as we did before.

housing_valp_1000_bracket$REGION <- factor(
    x=housing_valp_1000_bracket$REGION,
    levels=c(1,2,3,4,9),
    labels=c('Northeast','Midwest','South','West','Puerto Rico')
)
c <- ggplot(data=housing_valp_1000_bracket, aes(x=factor(REGION)))
c + geom_bar() + ggtitle("Samples with VALP>1000") + xlab("Region")

[Bar plot: number of samples with VALP>1000 per region]

Finally, a function that is useful, especially when imputing missing values in data frames, is isNaN, which can be applied to columns just as with regular R data frames.
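A minimal sketch of this, assuming the isNaN and isNull column functions available in SparkR 1.5 (the exact calls below are our illustration, not from the original text):

```r
# Flag NaN values in the VALP column (isNaN targets NaN in numeric columns)
head(select(housing_df, isNaN(housing_df$VALP)))

# In this dataset missing VALP entries load as NULLs rather than NaNs,
# so isNull is likely the more useful check here
head(filter(housing_df, isNull(housing_df$VALP)))
```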

Data aggregation and sorting

In the previous notebook we already had a look at summary/describe, to which we can pass column names to get summary statistics. If we want instead to be specific about the statistic we want, SparkR also defines aggregation functions, such as avg, min, max, sum, and n (count), that we can apply to the columns of DataFrame objects.

We use them by passing columns with the $ notation, and they return columns, so they need to be part of a select call on a DataFrame. For example.

collect(select(housing_df, avg(housing_df$VALP)))
. avg(VALP)
1 247682.8

groupBy and summarize / agg

A basic operation when aggregating data frames is groupBy. It groups the DataFrame using the specified columns, so we can run aggregations on them. We use it in combination with summarize/agg in order to apply aggregation functions. For example, reusing the previous avg example, let’s average property values by region as follows.

housing_avg_valp <- collect(agg(
    groupBy(housing_df, "REGION"),
    NUM_PROPERTIES=n(housing_df$REGION),
    AVG_VALP=avg(housing_df$VALP),
    MAX_VALUE=max(housing_df$VALP),
    MIN_VALUE=min(housing_df$VALP)
))
housing_avg_valp$REGION <- factor(
    housing_avg_valp$REGION,
    levels=c(1,2,3,4,9),
    labels=c('Northeast','Midwest','South','West','Puerto Rico')
)
housing_avg_valp
. REGION NUM_PROPERTIES AVG_VALP MAX_VALUE MIN_VALUE
1 Northeast 268285 314078.1 4775000 100
2 Midwest 328148 168305.3 2381000 100
3 South 560520 204236.9 3934000 100
4 West 319360 365559.3 4727000 110

We can add as many summary/aggregation columns as functions we want to calculate. There is also the possibility of adding several levels of grouping. For example, let’s add the number of bedrooms (BDSP in our dictionary) as follows.

housing_avg_valp <- collect(agg(
    groupBy(housing_df, "REGION", "BDSP"),
    NUM_PROPERTIES=n(housing_df$REGION),
    AVG_VALP=avg(housing_df$VALP),
    MAX_VALUE=max(housing_df$VALP),
    MIN_VALUE=min(housing_df$VALP)
))
housing_avg_valp$REGION <- factor(
    housing_avg_valp$REGION,
    levels=c(1,2,3,4,9),
    labels=c('Northeast','Midwest','South','West','Puerto Rico')
)
housing_avg_valp
. REGION BDSP NUM_PROPERTIES AVG_VALP MAX_VALUE MIN_VALUE
1 West NA 30339 NA NA NA
2 West 0 7750 226487.3 4727000 120
3 West 1 32620 212315 4727000 110
4 West 2 74334 258654.5 4727000 110
5 West 3 106532 325764.1 4727000 110
6 West 4 51785 459180.8 4727000 110
7 West 5 12533 607017.6 4727000 120
8 West 6 929 391539.8 2198000 170
9 West 7 555 518111.6 3972000 160
10 West 8 87 478757.2 2221000 230
11 West 9 374 480418.7 2386000 140
12 West 10 1486 975835.3 4727000 250
13 West 19 36 671500 1100000 65000
14 Northeast NA 31319 NA NA NA
15 Northeast 0 4951 311725.2 4775000 170
16 Northeast 1 30030 268071.8 4775000 110
17 Northeast 2 59301 233250.9 4775000 110
18 Northeast 3 89835 262429.5 4775000 110
19 Northeast 4 40622 393485.4 4775000 100
20 Northeast 5 8974 599000.8 4775000 110
21 Northeast 6 38 335772.7 750000 180
22 Northeast 7 273 1305308 4532000 280
23 Northeast 8 2228 864495.4 4775000 200
24 Northeast 9 56 607983.3 2383000 200
25 Northeast 10 642 416859.6 1826000 180
26 Northeast 13 16 298750 550000 150000
27 Midwest NA 32390 NA NA NA
28 Midwest 0 3588 131162.5 1688000 120
29 Midwest 1 24826 100265.2 2381000 110
30 Midwest 2 76965 112534.6 2381000 110
31 Midwest 3 126023 149800.3 2381000 100
32 Midwest 4 51108 229332.6 2381000 100
33 Midwest 5 10804 314773.4 2381000 110
34 Midwest 7 53 282746.7 1548000 1000
35 Midwest 8 812 359498.4 1562000 150
36 Midwest 9 1261 424151.2 2381000 150
37 Midwest 10 318 344710.9 1659000 1000
38 South NA 54208 NA NA NA
39 South 0 6599 132867.8 2518000 110
40 South 1 42047 119018.9 2880000 110
41 South 2 125856 127456.9 3934000 100
42 South 3 227546 168659.6 3934000 100
43 South 4 83899 287290.5 3934000 110
44 South 5 14095 462709.2 3934000 120
45 South 6 4258 545635.4 3934000 130
46 South 7 1027 609865.2 2552000 200
47 South 8 652 681768.1 2738000 250
48 South 9 314 609922.2 2057000 300
49 South 14 19 1996615 3934000 320000

arrange

One last thing: we can sort a DataFrame using arrange, as follows.

head(arrange(select(housing_df, "REGION", "VALP"), desc(housing_df$VALP)))
. REGION VALP
1 1 4775000
2 1 4775000
3 1 4775000
4 1 4775000
5 1 4775000
6 1 4775000

Or we can arrange the result of our aggregations.

housing_avg_agg <- agg(
    groupBy(housing_df, "REGION", "BDSP"),
    NUM_PROPERTIES=n(housing_df$REGION),
    AVG_VALP=avg(housing_df$VALP),
    MAX_VALUE=max(housing_df$VALP),
    MIN_VALUE=min(housing_df$VALP)
)
housing_avg_sorted <- head(arrange(
    housing_avg_agg,
    desc(housing_avg_agg$AVG_VALP)
))
housing_avg_sorted$REGION <- factor(
    housing_avg_sorted$REGION,
    levels=c(1,2,3,4,9),
    labels=c('Northeast','Midwest','South','West','Puerto Rico')
)
housing_avg_sorted
. REGION BDSP NUM_PROPERTIES AVG_VALP MAX_VALUE MIN_VALUE
1 South 14 19 1996615 3934000 320000
2 Northeast 7 273 1305308 4532000 280
3 West 10 1486 975835.3 4727000 250
4 Northeast 8 2228 864495.4 4775000 200
5 South 8 652 681768.1 2738000 250
6 West 19 36 671500 1100000 65000

Conclusions

So that’s it. In this third tutorial we have introduced most of the tools we need in order to perform an exploratory data analysis using Spark and R on a large dataset. In the next tutorial we will dig deeper into property values (VALP) using these operations and ggplot2 charts, exploring what factors influence the variables in our dataset.

And finally, remember that all the code for this series of Spark and R tutorials can be found in its own GitHub repository. Go there and make it yours.

WhitePaper: Speaker identification using multimodal neural networks and wavelet analysis


Click to Download WhitePaper

The rapid momentum of technological progress in recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from his or her voice, regardless of the content. In this study, the authors designed and implemented a novel text-independent multimodal speaker identification system based on wavelet analysis and neural networks. Wavelet analysis comprises discrete wavelet transform, wavelet packet transform, wavelet sub-band coding and Mel-frequency cepstral coefficients (MFCCs). The learning module comprises general regressive, probabilistic and radial basis function neural networks, forming decisions through a majority voting scheme. The system was found to be competitive and it improved the identification rate by 15% as compared with the classical MFCC. In addition, it reduced the identification time by 40% as compared with the back-propagation neural network, Gaussian mixture model and principal component analysis. Performance tests conducted using the GRID database corpora have shown that this approach has faster identification time and greater accuracy compared with traditional approaches, and it is applicable to real-time, text-independent speaker identification systems.

The task of speaker recognition can comprise speaker identification (i.e. identifying the current speaker) or speaker verification (i.e. verifying whether the speaker is who he claims to be) [ 1 ]. There are two types of speaker identification: text-dependent (the speaker is given a specific set of words to be uttered) and text-independent (the speaker is identified regardless of the words spoken) [ 2 ]. This paper proposes a novel approach towards building a text-independent speaker identification system (SIS).

A digital speech signal in its crudest form comprises frequency values sampled at consistent time intervals. It must be pre-processed to extract feature vectors that represent unique information for a particular speaker irrespective of the speech content. A learning algorithm generalises these feature vectors for various speakers during training and verifies the speaker’s identity using a test signal during the test phase. In practice, no two digital signals are the same even for the same speaker and the same set of words. The amplitude and pitch in a speaker’s voice can vary from one recording session to another. Environmental noise, the recording equipment, the speed at which the speaker speaks and the speaker’s various psychological and physical states increase the complexity of this task. Text-independent speaker identification allows the speaker to speak any set of words during a test. For such versatile systems, there is a need for a general feature extraction strategy to extract text-independent features from a speech signal.

The classical Mel-frequency cepstral coefficients (MFCC) method is likely the most popular feature extraction strategy used to date. This method is utilised herein for comparison with wavelet analysis. Linear predictive coding (LPC) has immensely aided text-dependent identification tasks [ 3 ]. Both MFCC and LPC use a global approach for speech analysis and are, therefore, susceptible to additive noise in the speech [ 4 ]. In this paper, we employed MFCC for comparison and relied heavily on wavelet-analysis strategies for feature extraction.

There are essentially two broad categories for methods to develop learning algorithms based on extracted speech features: generative and discriminative models. Generative methods are widely used and include stochastic models such as the hidden Markov model (HMM) [ 5 ], the Gaussian mixture model (GMM) [ 6 ] and template-based models (e.g. vector quantisation) [ 7 ]. The goal of a generative model is to symbolise the distribution space of the stored data generated from a particular class. This training process ignores competing data and considers only related data. In contrast, discriminative models shape the discriminative areas of a distribution. The primary purpose of this method is to reduce classification errors in the stored data as much as possible. Unlike generative models, data from all competing classes are also considered. Major discriminative models include polynomial classifiers [ 8 ], the support vector machine [ 9 ], the multilayer perceptron and artificial neural network (ANN) [ 10 ] methods, such as the general regressive NN (GRNN) [ 11 ], probabilistic NN (PNN) and radial basis function NN (RBF-NN) models [ 12 ].

To date, no single biometric system has been developed that can claim to identify or verify a speaker in all varieties of environments. Accurate classification of a speaker is a challenge when inter-class differences exceed intra-class differences, which primarily arise from a text-independent approach or noisy data. In an attempt to resolve this problem, two or more biometric techniques can be combined in a single system to improve the effectiveness of identification. This information fusion can be generated at different levels for multimodal biometrics. Information fusion is information that is merged from disparate sources with different conceptual, contextual and typographical expressions. In multimodal biometrics, this is possible at the sensor, feature, score or decision levels [ 13 , 14 ]. In sensor-level fusion, the core data from multiple sensors are combined for each modality which reduces classification error. In feature-level fusion, the speaker information received from multiple sources undergoes a feature extraction step, and this information is fused logically. In score-level fusion, a score is assigned to each individual biometric system, and these scores are used to make decisions for the final classification. In decision-level schemes, the final decision to accept or reject an individual system is generated via a voting procedure (e.g. majority, AND, OR etc.). Many researchers, like Nefian [ 15 ], tend to lean towards the early fusion approaches for audio-visual speech recognition. A speaker verification system based on audio-visual hybrid fusion from a set of features that are cross-modal was proposed in [ 16 ]. For a personnel authentication system based on face and voice, Chetty and Wagner [ 17 ] also developed a feature-level fusion to check the liveness, and presented test results performed on the VidTIMIT and UCBN databases.

The fusion performed in this paper involved the decision-level scheme. Different wavelet feature extraction techniques and decision-level schemes were investigated using three popular classifiers for text-independent, open-set speaker identification. The selected architectures were GRNN, PNN and RBF-NN. These NNs are fast, reliable and efficient for non-linear and complex data. Compared with back-propagation NNs (BPNN), which require a long training period, these networks are instantly trained and produce immediate results when applied to a test signal. Combining multiple ANNs enhances the generalisation capability and increases the identification rate. It also reduces the false accept rate (FAR) for a given false reject rate (FRR), and vice versa [ 18 ]. This motivated us to develop a novel identification system, namely the multimodal NN (MNN).

This paper is organised as follows. Section 2 describes wavelet feature extraction methods and the basics of NNs. In Section 3, we introduce the proposed fusion system with a detailed justification of the feature extraction methods and NNs that were chosen. Section 4 presents a comprehensive analysis of the performance and test results for this scheme, and finally Section 5 presents conclusions and recommendations for future work.

Video: Machine learning best practices we’ve learned from hundreds of competitions


Ben Hamner is Chief Scientist at Kaggle, leading its data science and development teams. He is the principal architect of many of Kaggle’s most advanced machine learning projects including current work in Eagle Ford and GE’s flight arrival prediction and optimization modeling.

Video: Writing self-documenting scientific code using physical quantities


In high school science, we’re taught to always include units in equations. Why not in scientific Python code? I’ll show why and how to keep track of units in Python using real-world examples.

Neural Network Back-Propagation using Python


Click to Read

When I’m working in a pure Microsoft technology environment, C# is my go-to programming language. But when I’m working in a hybrid environment, Python is my preferred language. I’ve seen a big increase in the use of Python for data-related programming. In this article I’ll explain how to implement neural network back-propagation training using Python. If you don’t currently use Python, examining the code in this article can be an excellent introduction to the language. And if you are a Python user, the code here can be a useful addition to your personal software tool kit.

IN QUEST OF MACHINE LEARNING IN SQL


Click to Read

The big and small of data preparation, exploration, and visualization. The well-known KMeans clustering algorithm was used to find common traits in U.S. federal government contracts. This week, I’ll dive further down the rabbit hole to show how recursive machine learning algorithms, like KMeans, can be implemented using standard SQL triggers.

SQL is a declarative programming language, not a procedural one. This brings huge benefits because SQL lets you prescribe what computation you want, but not how the underlying engine should compute it. For large computations, with too much data to fit in memory, SQL’s declarative model lets the underlying compute engine optimize the dirty details of data sharding, paging, caching, and materializing to run as fast as possible. This leaves you to focus on the actual results. Perfect.

scikit-spectra: Tools for explorative spectroscopy


License

3-Clause Revised BSD