Andre Niyongabo Rubungo will present his General Exam "Leveraging Large Language Models for Materials Science" on Monday, May 6, 2024 at 9:00 AM in FC 005.

Committee Members: Adji Bousso Dieng (advisor), Danqi Chen, Tom Griffiths

Abstract:
Predicting material properties plays an important role in materials design and discovery. A robust and accurate property predictor can help screen materials of interest and design new materials with desired properties. Before the machine learning era, ab initio methods such as density functional theory (DFT), which predict and calculate material behavior from quantum mechanical considerations, were used to facilitate this process. DFT methods are somewhat accurate; however, they are extremely slow, computationally expensive, and hard to scale.

Recently, scientists have started leveraging machine learning as an alternative to DFT. More specifically, graph neural networks (GNNs) have been particularly effective in capturing the interactions in materials for the purpose of property prediction. Crystal lattice sites are represented as nodes, and the bonds (e.g., ionic, covalent, or van der Waals) between them are represented as edges. GNNs then typically learn a contextual representation of each node and edge within the crystal graph to predict the crystal's properties. Although GNNs are powerful, they still face several challenges when predicting crystal properties. For example, they fail to efficiently encode the periodicity inherent to any crystal, which results from the repetitive arrangement of unit cells within a lattice. Furthermore, current state-of-the-art GNNs do not account for space group information, which is critical for characterizing crystals. Finally, GNNs are known to suffer from over-smoothing, which may hinder expressivity.

This research marks a pioneering step in leveraging the powerful learning capabilities of large language models (LLMs) to improve the process of discovering new materials. We aim to use LLMs to learn the representation of a crystal structure given its chemical formula, structure string, or textual structure description as input. Representing a crystal as text instead of a graph has several advantages. Textual data contain rich information and are very expressive; additionally, incorporating desired information in text is generally more straightforward than in graphs. We intend to design and implement a robust and accurate LLM-based property predictor by first collecting enough labeled materials data and then using it to finetune powerful, available pretrained LLMs. Furthermore, with an accurate LLM-based property predictor, we can finetune a generative LLM to produce multiple samples of structure strings or descriptions of new materials, use the property predictor to estimate the stability of the newly generated materials, and keep only the most stable ones. This strategy simulates the experimental approach to materials design and discovery, but in a faster and more efficient way.

Reading List:
https://docs.google.com/document/d/14bGQagXTIxpzgtg9dGsaZwYo4nm96xgI1pAoGLbFCrU/edit?usp=sharing

Everyone is invited to attend the talk, and faculty wishing to remain for the oral exam that follows are welcome to do so.