Dr. Feng Yu's recent research paper with his former master's student, David S. Wilson, won the best paper award at the peer-reviewed scholarly conference "28th International Conference on Software Engineering and Data Engineering" (SEDE 2019) in San Diego, CA.
Title: Scalable Correlated Sampling for Join Query Estimations on Big Data
Authors: David S. Wilson, Wen-Chi Hou, and Feng Yu
Estimating query results within limited time constraints is a challenging problem in big data management research. Query estimation based on simple random samples performs well for simple selection queries; however, it returns results with extremely high relative errors for complex join queries. Existing methods work well only with foreign key joins, and the sample size can grow dramatically as the dataset gets larger. This research implements a scalable sampling scheme in a big data environment, namely correlated sampling in map-reduce, that can shorten query response time, give precise join query estimations, and minimize storage costs when presented with big data. Extensive experiments with large TPC-H datasets in Apache Hive show that our sampling method produces fast and accurate query estimations on big data.
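For readers unfamiliar with the idea, the following is a minimal single-machine sketch of hash-based correlated sampling for a two-table equi-join. It is an illustration only and is not taken from the paper: the function names (`key_hash`, `correlated_sample`, `estimate_join_size`), the use of SHA-256, and the list-of-dicts table layout are all assumptions, and the paper's map-reduce implementation in Apache Hive may differ. The point it shows is that both tables are sampled with the same hash of the join key, so matching rows are kept or dropped together, and the sampled join size is scaled by 1/p.

```python
import hashlib


def key_hash(value, seed="demo"):
    """Map a join-key value to a pseudo-random float in [0, 1).

    Hypothetical helper: any hash that spreads keys uniformly would do.
    """
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    # Keep 53 bits so the result is exactly representable and always < 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def correlated_sample(rows, key, p):
    """Keep every row whose join key hashes below the sampling rate p."""
    return [row for row in rows if key_hash(row[key]) < p]


def estimate_join_size(left, right, key, p):
    """Estimate |left JOIN right| on `key` from correlated samples."""
    left_sample = correlated_sample(left, key, p)
    right_counts = {}
    for row in correlated_sample(right, key, p):
        right_counts[row[key]] = right_counts.get(row[key], 0) + 1
    # Join the two samples, then scale up by the inverse sampling rate.
    sample_join_size = sum(right_counts.get(row[key], 0) for row in left_sample)
    return sample_join_size / p


# Toy tables (made-up data): keys 0, 1, 2 match, giving 3 * 4 * 3 = 36 pairs.
left_table = [{"k": i % 5} for i in range(20)]
right_table = [{"k": i % 3} for i in range(9)]
exact_at_full_rate = estimate_join_size(left_table, right_table, "k", 1.0)
```

With `p = 1.0` every row is kept, so the estimate equals the exact join size; smaller values of `p` trade accuracy for a smaller sample.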
Date: September 30 - October 2, 2019
The full paper is available here.