DISSERTATION DEFENSE
Department of Computer Science and Engineering
University of South Carolina
Author: Nihang Fu
Advisor: Dr. Jianjun Hu
Date: May 1st, 2025
Time: 10:00 am
Place: Zoom
Abstract
The discovery of new materials is critical to advancing various industries, but traditional experimental methods for materials discovery remain slow and resource-intensive. Recent advances in machine learning (ML), particularly deep learning (DL), have greatly improved and accelerated two main aspects of modern computational material discovery: material design (e.g., material generation) and material screening (e.g., property prediction). However, a key challenge remains: standard ML models often struggle to perform domain-specific tasks effectively. Incorporating domain-specific knowledge, specifically the underlying physics of materials, into ML/DL models is key to improving the accuracy and reliability of material generation and prediction models.
This dissertation addresses this challenge through physics-oriented deep learning for computational materials discovery. In the first topic, we explore the use of transformer-based deep learning language models for the generative design of inorganic material compositions. Experiments showed that our transformer models can capture key physicochemical knowledge, such as charge neutrality and balanced electronegativity, and generate novel and chemically plausible inorganic material compositions. As a further demonstration of the ability of transformer neural network models to capture physics and chemistry from raw compound data, in the second topic we propose BERTOS, a bidirectional encoder transformer-based model for predicting atomic oxidation states from composition alone, which has significant applications in crystal structure prediction and virtual screening of candidate materials. Compared to the heuristic oxidation state assignment algorithm in Pymatgen, BERTOS achieves 97.61% prediction accuracy versus 37.82%.
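To make the charge-neutrality criterion concrete, the following minimal sketch checks whether a generated composition admits any oxidation state assignment that sums to zero. The small `OXIDATION_STATES` table and the function name are illustrative placeholders; models such as BERTOS learn state assignments from data rather than enumerating a lookup table.

```python
from itertools import product

# Illustrative subset of common oxidation states (not exhaustive).
OXIDATION_STATES = {
    "Li": [1], "Na": [1], "Mg": [2], "Al": [3],
    "Ti": [2, 3, 4], "Fe": [2, 3], "O": [-2], "Cl": [-1],
}

def is_charge_neutral(composition):
    """Return True if some oxidation state assignment sums to zero.

    `composition` maps element symbols to integer counts,
    e.g. {"Fe": 2, "O": 3} for Fe2O3.
    """
    elements = list(composition)
    for states in product(*(OXIDATION_STATES[e] for e in elements)):
        if sum(s * composition[e] for s, e in zip(states, elements)) == 0:
            return True
    return False

print(is_charge_neutral({"Fe": 2, "O": 3}))  # True: 2*(+3) + 3*(-2) = 0
print(is_charge_neutral({"Na": 1, "O": 1}))  # False: no neutral assignment
```

A generator that has internalized this constraint will produce compositions passing such a check far more often than random element combinations.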
We further explore physics-guided deep learning for materials property prediction, emphasizing the importance of incorporating physical information into the input features to guide neural network training, which helps the model produce more physically accurate and reliable results, especially when data is limited or noisy. In the third topic, we propose DSSL (Dual Self-Supervised Learning), a novel framework to overcome the data scarcity issue in materials property prediction. This two-stage, physics-guided approach is built on graph neural networks and leverages both large-scale unlabeled and limited labeled data. It includes three complementary self-supervised learning (SSL) strategies: mask-based generative SSL, contrastive SSL, and physics-guided predictive SSL. In the fourth topic, we investigate the impact of physical encoding on ML performance for property prediction and find that physical encoding of atoms can significantly improve generalization performance, especially for out-of-distribution samples. Finally, in the fifth topic, we investigate data redundancy in materials science datasets, arguing that standard random data splitting overestimates machine learning model performance, particularly for generalization to new materials. To address this, we developed the MD-HIT algorithms for reducing both composition- and structure-based redundancy using various similarity metrics, which provides a more objective evaluation of ML models' true extrapolation capabilities for materials property prediction and encourages models to learn true physics from the data rather than overfit to redundant samples.
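The redundancy reduction idea can be sketched as a greedy, CD-HIT-style filter: scan the dataset and keep a sample only if it is not too similar to any already-kept representative. The function name and the toy similarity metric below are illustrative assumptions; MD-HIT uses composition- and structure-specific similarity measures.

```python
def reduce_redundancy(samples, similarity, threshold=0.95):
    """Greedy redundancy filtering.

    Keep a sample only if its similarity to every already-kept
    representative is below `threshold`. `similarity(a, b)` returns
    a value in [0, 1], higher meaning more similar (placeholder for
    the composition- or structure-based metrics used in MD-HIT).
    """
    kept = []
    for s in samples:
        if all(similarity(s, r) < threshold for r in kept):
            kept.append(s)
    return kept

# Toy example: 1-D "features" with similarity = 1 - |a - b|.
sim = lambda a, b: 1.0 - abs(a - b)
print(reduce_redundancy([0.0, 0.01, 0.5, 0.51, 1.0], sim))
# -> [0.0, 0.5, 1.0]: near-duplicates 0.01 and 0.51 are dropped
```

Splitting the filtered set into train and test partitions then forces the model to predict properties of genuinely dissimilar materials, giving a more honest estimate of extrapolation performance.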