Location:
The members of
Examiners:
Readers:
A copy of
Everyone is invited to attend
Abstract follows below:
Recent advancements in large language models (LMs) have primarily focused on scaling up model parameters and training tokens to improve performance across various tasks. However, this scaling increases computational costs significantly. Furthermore, conventional parametric LMs are fundamentally incapable of adapting to unseen domains, editing learned knowledge, and retaining long-tail knowledge, and they have been shown to easily leak private data from the training corpus. This thesis explores alternative approaches to scaling LMs while addressing these limitations.
Firstly, we study LMs with retrieval augmentation, where LMs leverage an external datastore to make predictions. We develop Trime, a novel end-to-end training approach that learns LMs and the retrieval models jointly. Our results show that Trime significantly enhances LM performance with the same model size and computational budget. These end-to-end trained retrieval-augmented LMs also provide users with effective adaptability to domains unseen during training.
Secondly, we focus on a fundamental challenge of LMs: editing facts that are stored in their parameters—a critical problem to address since the world is constantly changing and the knowledge in LMs becomes outdated easily. We examine state-of-the-art knowledge editing methods on LMs and find that existing evaluation paradigms are extremely limited. We propose a novel benchmark MQuAKE, consisting of multi-hop questions that assess whether edited models correctly answer questions where the answer should change as an entailed consequence of edited facts. We demonstrate that existing knowledge editing methods fail on the constructed multi-hop questions. We also propose a simple retrieval-augmented approach that stores all edited facts externally, which outperforms previous methods by a large margin.
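As a toy illustration of why multi-hop questions are a stricter test of editing, the sketch below keeps edited facts in an external store that is consulted before the base knowledge at every hop. It is not the approach from the thesis: the dictionary-based store, the function names, and all entities (CountryA, PersonX, and so on) are hypothetical and chosen only to show how one edited fact should change the answer to a question that depends on it.

```python
# Toy sketch (assumption: facts are (subject, relation) -> object triples;
# a real system would retrieve edited facts for a language model instead).

def lookup(subject, relation, base_facts, edited_facts):
    """Return the object for (subject, relation), preferring edited facts."""
    key = (subject, relation)
    if key in edited_facts:
        return edited_facts[key]
    return base_facts.get(key)


def answer_multi_hop(start, relations, base_facts, edited_facts):
    """Follow a chain of relations from `start`, checking edits at each hop."""
    entity = start
    for relation in relations:
        entity = lookup(entity, relation, base_facts, edited_facts)
        if entity is None:
            return None  # the chain cannot be completed
    return entity


# Hypothetical base knowledge, standing in for what a model has memorized.
BASE_FACTS = {
    ("CountryA", "prime_minister"): "PersonX",
    ("PersonX", "spouse"): "PersonP",
    ("PersonY", "spouse"): "PersonQ",
}

# A single edit: the prime minister of CountryA changes.
EDITS = {("CountryA", "prime_minister"): "PersonY"}

question = ["prime_minister", "spouse"]  # "Who is the spouse of CountryA's prime minister?"
before = answer_multi_hop("CountryA", question, BASE_FACTS, {})
after = answer_multi_hop("CountryA", question, BASE_FACTS, EDITS)
```

Here the edit touches only the first hop, yet the correct two-hop answer changes as well, which is the kind of entailed consequence the multi-hop questions are meant to probe.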
Thirdly, we investigate training LMs with conditional computation, which is designed for scaling LMs without a substantial increase in computational costs. We focus on mixture-of-experts (MoE), a widely used conditional computation technique which facilitates efficient scaling. However, training the routing network in MoE introduces the challenge of optimizing a non-differentiable, discrete objective. We present a fully differentiable MoE architecture for autoregressive language model pre-training, Lory, which is based on two key techniques: (1) a causal segment routing strategy for high efficiency of expert merging and (2) a similarity-based data batching method for better expert specialization. Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts of Lory capture domain-level specialization without supervision.
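To make the idea of differentiable expert merging with segment-level routing concrete, here is a minimal sketch in plain Python. It rests on several assumptions that the abstract does not state: experts are represented as flat weight vectors, the router is a linear scorer over the mean representation of the previous segment, and the first segment falls back to uniform routing. All function names are hypothetical, and this is not the Lory implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_experts(expert_weights, routing_probs):
    """Soft merging: a weighted average of expert weight vectors.

    Because this is a smooth function of routing_probs, gradients could
    flow to the router, unlike a discrete top-k expert selection.
    """
    dim = len(expert_weights[0])
    return [
        sum(p * w[i] for p, w in zip(routing_probs, expert_weights))
        for i in range(dim)
    ]


def segment_mean(segment):
    """Mean of the token vectors in one segment."""
    dim = len(segment[0])
    return [sum(tok[i] for tok in segment) / len(segment) for i in range(dim)]


def causal_segment_routing(segments, router, expert_weights):
    """Return one merged expert per segment.

    Routing for segment k uses only segment k-1, so no information from
    the segment being predicted leaks into its own routing decision.
    """
    num_experts = len(expert_weights)
    merged = []
    for k in range(len(segments)):
        if k == 0:
            # Assumption: no previous segment exists, so route uniformly.
            probs = [1.0 / num_experts] * num_experts
        else:
            summary = segment_mean(segments[k - 1])
            scores = [sum(r * s for r, s in zip(row, summary)) for row in router]
            probs = softmax(scores)
        merged.append(merge_experts(expert_weights, probs))
    return merged


# Two toy experts and a router with one scoring row per expert.
experts = [[1.0, 0.0], [0.0, 1.0]]
router = [[1.0, 0.0], [0.0, 1.0]]
segments = [
    [[10.0, 0.0], [10.0, 0.0]],  # segment 0
    [[0.0, 10.0]],               # segment 1
    [[1.0, 1.0]],                # segment 2
]
merged = causal_segment_routing(segments, router, experts)
```

In this sketch one merged expert is computed per segment rather than per token, so the merging cost is paid once per segment; that is one plausible reading of why segment-level routing keeps expert merging efficient.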
Overall, our research illuminates a new paradigm in LM scalability, fundamentally addressing critical limitations and advancing the development of more efficient, effective, adaptable, and updatable language models.