Details
Topic modeling is a widely used technique in text analysis, with classical models relying on an approximate low-rank factorization of the word-count matrix. In the first part of this talk, we introduce Topic-SCORE, a spectral algorithm for estimating classical topic models. The core innovation of this algorithm lies in exploiting a simplex structure in the spectral domain. Using precise entry-wise eigenvector analysis, we show that Topic-SCORE achieves the minimax optimal rate in both the long-document and short-document regimes.
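As a rough illustration of the pipeline just described, the sketch below runs a Topic-SCORE-style estimate on synthetic data: the classical model posits that the expected word-frequency matrix factors as A times W, the SCORE step maps each word to entry-wise ratios of singular vectors, and the resulting point cloud sits near a simplex whose vertices identify the topics. The synthetic setup, the k-means stand-in for vertex hunting, and all names here are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
p, n, K, N = 300, 1000, 3, 200        # vocab size, #docs, #topics, words/doc

# Classical topic model: each document's word counts are multinomial with
# probability vector A @ w_j, so E[D] admits the low-rank factorization A @ W.
A = rng.dirichlet(np.full(p, 0.05), size=K).T       # p x K topic matrix
W = rng.dirichlet(np.full(K, 0.3), size=n).T        # K x n topic-weight matrix
probs = A @ W
probs /= probs.sum(axis=0, keepdims=True)           # guard against float drift
D = np.column_stack([rng.multinomial(N, probs[:, j]) / N for j in range(n)])

# Step 1: leading K left singular vectors of the word-frequency matrix.
U, _, _ = np.linalg.svd(D, full_matrices=False)
xi = U[:, :K]
xi[:, 0] *= np.sign(xi[:, 0].sum())                 # take first vector positive

# Step 2 (the SCORE step): entry-wise ratios against the first singular
# vector place every word at a point in R^{K-1}; under the model these
# points concentrate near a simplex with K vertices.
R = xi[:, 1:] / np.maximum(xi[:, [0]], 1e-10)

# Step 3: vertex hunting. K-means centers are a crude stand-in for the
# more careful vertex-hunting procedures used in practice.
verts = KMeans(n_clusters=K, n_init=10, random_state=0).fit(R).cluster_centers_

# Step 4: barycentric coordinates of each word w.r.t. the estimated
# vertices, re-weighted by the first singular vector, recover the topic
# matrix up to a permutation of topics.
B = np.vstack([verts.T, np.ones(K)])                # K x K linear system
coords = np.clip(np.linalg.solve(B, np.vstack([R.T, np.ones(p)])), 0, None)
A_hat = (coords * xi[:, 0]).T
A_hat /= A_hat.sum(axis=0, keepdims=True)
print("estimated topic matrix shape:", A_hat.shape)  # (p, K)
```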
In the second part, we extend the classical topic model to capture the distribution of word embeddings from pre-trained large language models (LLMs), enabling the model to incorporate word context. We propose a flexible algorithm that integrates traditional topic modeling with nonparametric estimation. We demonstrate the effectiveness of our methods on several text datasets, including MADStat, a dataset of 83,000 paper abstracts from statistics-related journals.
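The abstract leaves the embedding-based method unspecified; as one hedged reading of it, the toy sketch below treats each topic as a density over embedding space and alternates between per-document topic weights and topic-level density estimates in an EM-style loop, with a Gaussian standing in for a genuine nonparametric estimate. Every modeling choice here is an assumption for illustration, not the proposed algorithm.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
d, K, n_docs, n_words = 2, 3, 50, 80   # embedding dim, topics, docs, words/doc

# Toy "embeddings": words in document j are drawn from a mixture of K
# Gaussian topic densities with document-specific mixing weights.
true_means = rng.normal(scale=4.0, size=(K, d))
true_w = rng.dirichlet(np.ones(K), size=n_docs)
docs = []
for j in range(n_docs):
    z = rng.choice(K, size=n_words, p=true_w[j])
    docs.append(true_means[z] + rng.normal(size=(n_words, d)))

# Initialize topic densities and uniform document-level topic weights.
means = rng.normal(scale=4.0, size=(K, d))
w = np.full((n_docs, K), 1.0 / K)

for _ in range(30):
    # E-step: responsibility of each topic for each word occurrence.
    resp = []
    for j, X in enumerate(docs):
        lik = np.stack([w[j, k] * multivariate_normal.pdf(X, means[k])
                        for k in range(K)], axis=1)
        resp.append(lik / lik.sum(axis=1, keepdims=True))
    # M-step: update document weights and the topic densities (here a
    # parametric stand-in for a nonparametric density update).
    w = np.stack([r.mean(axis=0) for r in resp])
    allX, allR = np.vstack(docs), np.vstack(resp)
    means = (allR.T @ allX) / allR.sum(axis=0)[:, None]

print("recovered topic weights for doc 0:", np.round(w[0], 2))
```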