via How to Analyze Highly Dynamic Datasets with Apache Drill | MapR.
Today’s data is dynamic and application-driven. A new era of business applications, driven by industry trends such as web, social, mobile, and IoT, is generating datasets with new data types and new data models. These applications are iterative, and the associated data models are typically semi-structured, schema-less, and constantly evolving: semi-structured in that an element can be complex or nested, schema-less in that fields can vary from row to row, and constantly evolving in that fields are frequently added and removed to meet business requirements. In other words, modern datasets are not only about volume and velocity, but also about variety and variability.
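To make "semi-structured" and "schema-less" concrete, here is a minimal Python sketch. The records are invented for illustration and are not taken from any real dataset; `field_paths` is a hypothetical helper that lists the dotted path of every leaf field, so you can see how the set of fields differs from one record to the next.

```python
import json

# Hypothetical records, invented for illustration only.
# Note the nested "checkin_info" object and the fields that come and go.
RAW_RECORDS = """
{"business_id": "b1", "type": "checkin", "checkin_info": {"9-0": 2, "17-5": 1}}
{"business_id": "b2", "type": "checkin", "checkin_info": {"3-4": 1}, "city": "Phoenix"}
{"business_id": "b3", "checkin_info": {"9-0": 4}}
"""


def field_paths(value, prefix=""):
    """Yield the dotted path of every leaf field in a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from field_paths(child, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix


records = [json.loads(line) for line in RAW_RECORDS.strip().splitlines()]

# Fields present in each individual record.
paths_per_record = [sorted(set(field_paths(record))) for record in records]

# Union of all fields seen anywhere: the "schema" only emerges from the data.
all_paths = sorted(set().union(*paths_per_record))

for record, paths in zip(records, paths_per_record):
    print(record["business_id"], "->", paths)
print("all fields:", all_paths)
```

No single fixed set of columns describes all three records, which is the situation a schema-free query engine is meant to handle.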
Apache Drill, the industry’s first schema-free SQL engine for Hadoop and NoSQL, allows business users to natively query dynamic datasets such as JSON in a self-service fashion using familiar SQL skills and BI tools. With Apache Drill, it takes just minutes to derive insights from any type of data, as opposed to the weeks or months of delay typical of traditional approaches.
Let me demonstrate this with a quick example, using the Yelp check-ins dataset.
2014 was an exciting year for the Drill community. In August we made Drill available for download, and last week the Apache Software Foundation promoted Drill to a top-level project. Many of you have asked me what’s coming next, so I decided to sit down and outline some of the interesting initiatives that the Drill community is currently working on:
- Flexible Access Control
- JSON in Any Shape or Form
- Advanced SQL
- New Data Sources
- Drill/Spark Integration
- Operational Enhancements: Speed, Scalability and Workload Management
This is by no means intended to be an exhaustive list of everything that will be added to Drill in 2015. With Drill’s rapidly expanding community, I anticipate that you’ll see a whole lot more.
You can leverage the power of Apache Drill to query data without any upfront schema definitions. Drill enables you to create an architecture that works with nested and dynamic schemas, making it the perfect SQL query tool to use on NoSQL databases, such as MongoDB.
As of Apache Drill 0.6, you can configure MongoDB as a Drill data source. Drill provides a MongoDB storage plugin for connecting to MongoDB and running queries on the data using ANSI SQL.
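As a rough sketch of what that configuration involves, the Python below builds a plugin definition and a sample query string. It assumes MongoDB is running locally on its default port (27017). The `type`, `connection`, and `enabled` fields follow Drill's MongoDB plugin configuration as I understand it, so check them against the documentation for your Drill version. The helper functions and the database and collection names (`mydb`, `zips`) are hypothetical.

```python
import json

# Assumption: MongoDB runs locally on its default port, 27017.
MONGO_CONNECTION = "mongodb://localhost:27017/"


def mongo_plugin_config(connection, enabled=True):
    """Build the JSON-serializable definition for a Drill MongoDB plugin."""
    return {"type": "mongo", "connection": connection, "enabled": enabled}


def qualified_table(plugin, database, collection):
    """Return a backtick-quoted plugin.database.collection reference."""
    return ".".join(f"`{part}`" for part in (plugin, database, collection))


# JSON text to paste into the plugin configuration in Drill's web console.
config_json = json.dumps(mongo_plugin_config(MONGO_CONNECTION), indent=2)

# Hypothetical database ("mydb") and collection ("zips") names.
query = f"SELECT * FROM {qualified_table('mongo', 'mydb', 'zips')} LIMIT 5"

print(config_json)
print(query)
```

Once the plugin is registered and enabled, a collection is addressed in SQL as plugin, database, then collection, which is what the generated query string shows.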
This tutorial assumes that you have Drill installed locally (embedded mode), as well as MongoDB. Examples in this tutorial use zip code aggregation data provided by MongoDB. Before You Begin provides links to download the tools and data used throughout the tutorial.
Note: A local instance of Drill is used in this tutorial for simplicity. You can also run Drill and MongoDB together in distributed mode.
Apache Drill is one of the fastest-growing open source projects, with the community making rapid progress through monthly releases. The latest release, Drill 0.6, is another important milestone for the project. It adds key enhancements, including the ability to run SQL queries directly on MongoDB (alongside the file system, HBase, and Hive sources already supported), as well as a number of performance and SQL improvements.