Data Fusion of Deep Learned Molecular Embeddings for Property Prediction
ORAL
Abstract
Data-driven approaches such as deep learning can yield predictive models for material properties with exceptional accuracy and efficiency. However, in many problems data are sparse, which severely limits the accuracy and applicability of these models. To address this gap, techniques such as transfer learning and multi-task learning have been used. The performance of multi-task learning models depends on the strength of the underlying correlations between tasks and on the completeness of the dataset. Using data fusion techniques, we combine the learned molecular embeddings of various single-task models and train a multi-task model on this combined embedding. We apply this technique to a widely used benchmark dataset of quantum chemistry data for small molecules as well as a newly compiled sparse dataset of experimental data collected from the literature and our own quantum chemistry and thermochemical calculations. The results show that the fused multi-task models outperform standard multi-task models on sparse datasets and can provide improved predictions for data-limited properties compared to single-task models.
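As a rough illustration of the idea described above, the following minimal sketch fuses per-molecule embeddings from two single-task models and scores multi-task predictions against a sparse label table. It rests on assumptions not stated in the abstract: fusion is taken to be simple concatenation, missing labels are marked with NaN, and the function names `fuse_embeddings` and `masked_mse`, the toy values, and the all-zero placeholder predictions are hypothetical.

```python
import math


def fuse_embeddings(per_model_embeddings):
    """Concatenate each molecule's embeddings from several models into one vector.

    Assumption: fusion by concatenation; the abstract does not name the method.
    """
    return [
        [value for embedding in per_molecule for value in embedding]
        for per_molecule in zip(*per_model_embeddings)
    ]


def masked_mse(predictions, targets):
    """Mean squared error over observed labels only; NaN marks a missing label."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(predictions, targets):
        for pred, target in zip(pred_row, target_row):
            if not math.isnan(target):
                total += (pred - target) ** 2
                count += 1
    return total / count if count else 0.0


# Hypothetical embeddings for three molecules from two single-task models.
emb_model_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
emb_model_b = [[1.0], [2.0], [3.0]]
fused = fuse_embeddings([emb_model_a, emb_model_b])

# Sparse labels for two properties: not every molecule has every property.
targets = [[1.0, float("nan")], [float("nan"), 2.0], [3.0, 4.0]]

# Placeholder predictions standing in for a multi-task head applied to `fused`.
predictions = [[0.0, 0.0] for _ in fused]
loss = masked_mse(predictions, targets)
```

Masking the loss in this way lets a single multi-task model train on molecules for which only some properties have been measured, which is the situation a sparse dataset creates.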
*Distribution Statement A: Approved for Public Release. Distribution is Unlimited. This research study was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement No. W911NF-20-2-0189. This work was supported in part by high-performance computer time and resources from the Department of Defense (DoD) High Performance Computing Modernization Program in collaboration with an appointment to the DoD Research Participation Program administered by the Oak Ridge Institute for Science and Education (ORISE) through an interagency agreement between the U.S. Department of Energy (DOE) and the DoD. ORISE is managed by ORAU under DOE contract number DE-SC0014664. All opinions expressed in this paper are the author's and do not necessarily reflect the policies and views of DoD, DOE, or ORAU/ORISE.