When running Hadoop jobs against Cassandra, you will want to be careful about a few parameters.
Specifically, pay special attention to vNodes, Splits and Page Sizes.
vNodes were introduced in Cassandra 1.2. vNodes allow a host to have multiple portions of the token range. This allows for more evenly distributed data, which means nodes can share the burden of a node rebuild (and it doesn’t fall all on one node). Instead the rebuild is distributed across a number of nodes. (way cool feature)
BUT, vNodes made Hadoop jobs a little trickier…
The first time I went to run a Hadoop job using CQL, I was running a local Hadoop cluster (only one container) against a Cassandra with lots of vNodes (~250). I was running a simple job (probably a flavor of word count) over a few thousand records. I expected the job to complete in seconds. To my dismay, it was taking *forever*.