The Next Wave of SQL-on-Hadoop: The Hadoop Data Warehouse

Track:

Modern Big Data Systems

Time:

Friday, 10:40am - 11:30am

Abstract:

Apache Hadoop increasingly serves as a complementary technology to the enterprise data warehouse (EDW): it provides cost-efficient data loading and cleaning, while the EDW enables interactive analysis and reporting on relational data. However, thanks to recent advances in the Hadoop ecosystem that expand the range of EDW-equivalent analytic capabilities available entirely in open source software, Hadoop-based enterprise data hubs can now also serve as an EDW for native Big Data. Costly processes for moving that data into the traditional EDW purely for analysis are therefore no longer required.

In this session, attendees will hear how one user in the financial services sector, which has rolled out Impala to 45 production nodes to date, is using that approach (based on HDFS, Parquet, and Impala) to reduce processing time from hours to seconds. The same approach lets it consolidate unstructured data from different sources, such as web applications, non-traditional external data sets, card transactions, and analytical reports, to get a single view of all its data.

Marcel Kornacker

Marcel Kornacker is a tech lead at Cloudera for new product development and the creator of the Cloudera Impala project. Following his graduation in 2000 with a PhD in databases from UC Berkeley, he held engineering positions at several database-related start-up companies. Marcel joined Google in 2003, where he worked on several ads serving and storage infrastructure projects, then became tech lead for the distributed query engine component of Google’s F1 project.