Details
Fine-tuning large language models (LLMs) with low-rank adaptation (LoRA) has become common practice, often yielding numerous copies of the same LLM that differ only in their LoRA updates. This paradigm poses challenges for systems that serve real-time responses to queries, each involving a different LoRA. In this talk, I will present our approach to compressing a large collection of LoRAs by representing them with a shared basis paired with LoRA-specific scaling matrices, akin to joint diagonalization. Our experiments with up to 1000 LoRAs demonstrate an intriguing connection between the reconstruction error of the original LoRAs and their downstream performance, revealing that up to 50% reconstruction error preserves or even improves performance. I will then present a clustering extension that allows our method to scale gracefully to 1000 LoRAs, significantly improving the throughput of a modern LLM serving engine. Finally, I will conclude the talk with a discussion of opportunities to build more performant, smaller LLMs using LoRA collections.
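To make the shared-basis idea concrete, here is a minimal sketch, not the authors' exact algorithm: each LoRA update delta_W_i = B_i A_i is approximated as U Sigma_i V^T, where U and V are bases shared across the collection and Sigma_i is a small per-LoRA scaling matrix. All dimensions, ranks, and variable names below are illustrative assumptions, and the shared bases are obtained here with a simple SVD of the stacked updates rather than the joint-diagonalization procedure described in the talk.

```python
import numpy as np

# Hypothetical shapes: d_out x d_in weight updates, per-LoRA rank r, shared rank R.
d_out, d_in, r, R, n_loras = 64, 64, 4, 16, 10
rng = np.random.default_rng(0)

# Each LoRA i contributes a low-rank update delta_W_i = B_i @ A_i.
deltas = [rng.normal(size=(d_out, r)) @ rng.normal(size=(r, d_in))
          for _ in range(n_loras)]

# Shared bases U (d_out x R) and V (d_in x R) from SVDs of the stacked updates.
U, _, _ = np.linalg.svd(np.hstack(deltas), full_matrices=False)
U = U[:, :R]
V, _, _ = np.linalg.svd(np.hstack([d.T for d in deltas]), full_matrices=False)
V = V[:, :R]

# Each LoRA now keeps only a small R x R scaling matrix Sigma_i = U^T delta_W_i V.
sigmas = [U.T @ d @ V for d in deltas]

# Reconstruct one LoRA and measure its relative reconstruction error.
recon = U @ sigmas[0] @ V.T
rel_err = np.linalg.norm(deltas[0] - recon) / np.linalg.norm(deltas[0])
print(f"relative reconstruction error: {rel_err:.3f}")
```

The storage saving comes from the scaling matrices: a serving system keeps one pair of shared bases plus an R x R matrix per LoRA, instead of full A_i, B_i factors for every adapter.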
Bio: Mikhail Yurochkin is a research scientist and manager at the MIT-IBM Watson AI Lab, where he leads the Statistical Large Language Modeling group. Before joining IBM, he completed a PhD in Statistics at the University of Michigan, advised by XuanLong Nguyen. Mikhail has developed methods for the reliable and inclusive adoption of ML and AI in practice and led the development of the first open-source Python library for individual fairness (inFairness). Prior to that, he worked on model fusion and federated learning. Mikhail's most recent work focuses on the challenges of cost-efficient adoption of large language models in practice.