Over the last decade, the Hadoop framework has emerged as a popular mechanism for harnessing the power of large clusters of computers. Its MapReduce programming model lets programmers think in a data-centric fashion and focus on applying transformations to sets of data records, because the framework transparently manages the details of distributed execution and fault tolerance.
SQL is the most widely used language for accessing, analyzing, and manipulating structured data. As data sets grow, so does the need to run interactive SQL queries on top of the Hadoop environment. Various SQL-on-Hadoop systems have been developed for this purpose, including Hive, Impala, BigSQL, Presto, and many more.
This project aims to implement such a solution at Deerwalk Services, so that SQL-based data QC queries can be run directly on data stored in the Hadoop Distributed File System (HDFS). BigSQL was chosen as the system to use. The work involves research, configuration, implementation, optimization, and testing, as well as the development of a GUI client for running queries against HDFS interactively.