Seminar by Professor Xiaohui YU on 2 December 2024

2 Dec 2024

Optimising Data Acquisition for Machine Learning

Professor Xiaohui YU from York University will share his insights on advanced strategies for optimising data acquisition to enhance both model accuracy and confidence.


Seminar abstract:

High-quality training data is essential for improving machine learning (ML) model performance, but acquiring such data effectively remains a challenging task. This talk explores advanced strategies for optimising data acquisition to enhance both model accuracy and confidence.

To improve accuracy, we introduce two innovative approaches: Estimation and Allocation (EA), which balances exploration and exploitation by estimating data utility, and Sequential Predicate Selection (SPS), which adaptively focuses on data regions that are most promising for improving model outcomes.
 
To improve model confidence, we propose Bulk Acquisition (BA) and Sequential Acquisition (SA) methods, supported by efficient approximations such as kNN-BA and kNN-SA that limit acquisitions to promising subsets. Additionally, a Distribution-based Acquisition framework is presented to generalise these techniques across diverse datasets and settings. Extensive experiments across various ML models and data pools demonstrate the effectiveness of these methods in practical applications, highlighting their ability to address real-world constraints while achieving significant performance gains.
 

Speaker’s biography:

Professor YU obtained his PhD degree from the University of Toronto. His research interests lie in the broad area of data science, with a particular focus on the intersection of data management and machine learning (ML).
 
The results of his research have been published in top data science journals and conferences, such as SIGMOD, VLDB, ICDE, and TKDE. He regularly serves on the programme committees of leading conferences and is an Associate/Area Editor for the IEEE Transactions on Knowledge and Data Engineering (TKDE), the ACM Transactions on Knowledge Discovery in Data (TKDD), and Information Systems. He is a General Co-Chair for the KDD 2025 conference. He has collaborated regularly with industry partners, and some research results have been incorporated into large-scale production systems.