Drill is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel.
Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds.
Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called “Dremel: Interactive Analysis of Web-Scale Datasets,” describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel.
There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel.
In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google’s internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google’s internal Dremel system, is intended to address this need.
It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees.
Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data.
The Drill architecture consists of four key components/layers:
Query languages: This layer is responsible for parsing the user’s query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and Google BigQuery, which we call DrQL. However, Drill is designed to support other languages and programming models, such as the Mongo Query Language, Cascading or Plume.
- Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill’s execution engine is based on research in distributed execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors.
- Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is designed to support column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A particular distinction with Drill is that the execution engine is flexible enough to support column-based processing as well as row-based processing. This is important because column-based processing can be much more efficient when the data is stored in a column-based format, but many large data assets are stored in a row-based format that would require conversion before use.
- Scalable data sources: This layer is responsible for supporting various data sources. The initial focus is to leverage Hadoop as a data source.
It is worth noting that no open source project has successfully replicated the capabilities of Dremel, nor have any taken on the broader goals of flexibility (eg, pluggable query languages, data formats, data sources and execution engine operators/connectors) that are part of Drill.
The initial goals for this project are to specify the detailed requirements and architecture, and then develop the initial implementation including the execution engine and DrQL. Like Apache Hadoop, which was built to support multiple storage systems (through the FileSystem API) and file formats (through the InputFormat/OutputFormat APIs), Drill will be built to support multiple query languages, data formats and data sources. The initial implementation of Drill will support the DrQL and a column-based format similar to Dremel.
Significant work has been completed to identify the initial requirements and define the overall system architecture. The next step is to implement the four components described in the Rationale section, and we intend to do that development as an Apache project.