Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence (ACL 2021)
Abstract
- Traditional topic models: Bag-of-Words based models and Variational Auto-Encoder based neural topic models (NTMs, 2017)
- Combine contextualized representations with NTMs
- Main objective: show that contextual information increases topic coherence
- Proposed method (CombinedTM): more meaningful, more coherent topics!
1. Introduction
- Most previous topic models use Bag-of-Words (BoW) document representations as input
- Adding contextual information to neural topic models provides a significant increase in topic coherence
- This work extends Neural ProdLDA
- Contributions
- A straightforward and easily implementable method that allows NTMs to create coherent topics
- Contextualized document embeddings in NTMs produce significantly more coherent topics
- Discover latent contextual information
2. Neural Topic Models with Language Model Pre-training
- Previous neural topic model: ProdLDA (Srivastava and Sutton, 2017), a VAE whose encoder maps the BoW document representation into a continuous latent representation
- Decoder network: reconstructs the BoW by generating its words from the latent document representation
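A minimal PyTorch sketch of this VAE structure (not the authors' implementation: layer sizes, the `ProdLDASketch` name, and the simplified prior handling are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProdLDASketch(nn.Module):
    """Illustrative ProdLDA-style VAE: BoW in, reconstructed word distribution out."""
    def __init__(self, vocab_size: int, n_topics: int, hidden: int = 100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Softplus())
        self.mu = nn.Linear(hidden, n_topics)        # variational posterior mean
        self.logvar = nn.Linear(hidden, n_topics)    # variational posterior log-variance
        self.beta = nn.Linear(n_topics, vocab_size)  # topic-word decoder

    def forward(self, bow: torch.Tensor):
        h = self.encoder(bow)                                  # continuous latent features
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        theta = F.softmax(z, dim=-1)                           # document-topic mixture
        word_dist = F.softmax(self.beta(theta), dim=-1)        # decoder reconstructs the BoW
        return word_dist, mu, logvar
```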

- Proposed Model : Combined Topic Model (CombinedTM)
- Extend this model with contextualized document embeddings from SBERT (Reimers and Gurevych, 2019), a recent extension of BERT that allows the quick generation of sentence embeddings
- CombinedTM: ProdLDA + SBERT document embedding (sketched below)
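A minimal sketch of the combination step, assuming the `ProdLDASketch` encoder above and the sentence-transformers library (the checkpoint name is only an example of an SBERT model): the only change with respect to ProdLDA is that the encoder input is the BoW concatenated with the contextualized document embedding, while the decoder still reconstructs only the BoW.

```python
import torch
from sentence_transformers import SentenceTransformer

# Example SBERT checkpoint; any sentence-transformers model works the same way.
sbert = SentenceTransformer("paraphrase-distilroberta-base-v2")

def combined_encoder_input(raw_doc: str, bow: torch.Tensor) -> torch.Tensor:
    """Concatenate the BoW vector with the SBERT document embedding."""
    ctx = torch.tensor(sbert.encode(raw_doc))  # contextualized document embedding
    return torch.cat([bow, ctx], dim=-1)       # encoder input: vocab_size + embedding dim

# The encoder's input layer grows from vocab_size to vocab_size + SBERT dimension;
# the decoder network is unchanged and still generates BoW words.
```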

- The limitations of CombinedTM
1. Document length limitation
One limitation of this method concerns the document length handled by SBERT (Sentence-BERT) and similar techniques.
These encoders can process or represent documents only up to a fixed length, so longer documents may not be fully taken into account.
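As a concrete illustration of this limit (assuming the sentence-transformers library; the checkpoint name is an example), the maximum sequence length of the encoder can be inspected directly, and tokens beyond it do not influence the document embedding:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-distilroberta-base-v2")
print(model.max_seq_length)  # word-piece limit of the underlying transformer

very_long_document = " ".join(["token"] * 10_000)
embedding = model.encode(very_long_document)  # only the first max_seq_length pieces are used
```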
3. Experimental Setting
3.1 Datasets
- 20NewsGroups
- Wiki20K (a collection of 20,000 English Wikipedia abstracts from Bianchi et al. (2021))
- Tweets2011
- Google News (Qiang et al., 2019)
- the StackOverflow dataset (Qiang et al., 2019)
3.2 Metrics
- 3 different metrics
- 2 topic coherence metrics: Normalized Pointwise Mutual Information (NPMI, sketched below) and external word embedding topic coherence
- 1 topic diversity metric: Inverted Rank-Biased Overlap (I-RBO)
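For a word pair, NPMI(w_i, w_j) = log(P(w_i, w_j) / (P(w_i) P(w_j))) / (-log P(w_i, w_j)), averaged over the pairs of each topic's top words. A minimal sketch, assuming probabilities are estimated as document-occurrence frequencies in a reference corpus (function and variable names are illustrative):

```python
import math
from itertools import combinations

def topic_npmi(topic_words, docs, eps=1e-12):
    """Average NPMI over all pairs of a topic's top words.

    topic_words: top-N words of one topic.
    docs: reference corpus, each document given as a set of tokens.
    """
    n_docs = len(docs)

    def p(*words):
        # Probability that all `words` occur together in a document (document frequency).
        return sum(all(w in d for w in words) for d in docs) / n_docs

    scores = []
    for wi, wj in combinations(topic_words, 2):
        p_ij = p(wi, wj)
        if p_ij == 0.0:
            scores.append(-1.0)  # convention used here for pairs that never co-occur
            continue
        pmi = math.log(p_ij / (p(wi) * p(wj) + eps))
        scores.append(pmi / (-math.log(p_ij)))
    return sum(scores) / len(scores)

# Toy example: score one topic against a tiny corpus of tokenized documents.
docs = [{"nasa", "space", "launch"}, {"space", "shuttle"}, {"stock", "market"}]
print(topic_npmi(["space", "launch", "shuttle"], docs))
```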
3.3 Baseline Models
The main objective is to show that contextual information increases coherence; CombinedTM is therefore compared against:
1. ProdLDA (Srivastava and Sutton, 2017), the model this paper extends
2. Neural Variational Document Model (NVDM) (Miao et al., 2016)
3. the very recent ETM (Dieng et al., 2020)
4. MetaLDA (MLDA) (Zhao et al., 2017)
5. LDA (Blei et al., 2003)
4. Results
- Quantitative evaluation
- Explore the effect on performance when two different contextualized representations are used
4.1 Quantitative Evaluation

- Compute all the metrics for 25, 50, 75, 100, and 150 topics
- We average results for each metric over 30 runs of each model
- LDA and NVDM obtain low coherence
5. Related Work
- Neural variational inference based Topic Models
- Miao et al. (2016) propose NVDM, an unsupervised generative model based on VAEs, assuming a Gaussian distribution over topics.
- Building upon NVDM, ETM (Dieng et al., 2020) represents words and topics in the same embedding space.
- ProdLDA : Srivastava and Sutton (2017) propose a neural variational framework that explicitly approximates the Dirichlet prior using a Gaussian distribution.
- Our approach builds on this work but includes a crucial component, i.e., representations from a pre-trained transformer that can benefit from both general language knowledge and corpus-dependent information.
- Similarly, Bianchi et al. (2021) replace the BoW document representation with pre-trained contextualized representations to tackle the problem of cross-lingual zero-shot topic modeling.
- This approach was extended by Mueller and Dredze (2021), who also consider fine-tuning the representations.
- A very recent approach (Hoyle et al., 2020) which follows a similar direction uses knowledge distillation (Hinton et al., 2015) to combine neural topic models and pre-trained transformers.
6. Conclusions
- Propose a straightforward and simple method to incorporate contextualized embeddings into topic models.
- This work improves the quality of the discovered topics.
- Effect of Contextualized Embeddings
-> Contextual information is a significant element to consider for topic modeling as well.