Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence (ACL 2021)
Abstract
- Traditional topic models: Bag-of-Words based models and Variational Auto-Encoder based neural topic models (NTMs, 2017)
- Combine contextualized representations with NTMs
- Main objective: show that contextual information increases topic coherence
- Proposed method (CombinedTM): more meaningful, more coherent topics!
1. Introduction
- Most previous topic models use Bag-of-Words (BoW) document representations as input
- Adding contextual information to neural topic models provides a significant increase in topic coherence
- This work extends Neural ProdLDA
- Contributions
- A straightforward and easily implementable method that allows NTMs to create coherent topics
- Contextualized document embeddings in NTMs produce significantly more coherent topics
- Discover latent contextual information
2. Neural Topic Models with Language Model Pre-training
- Previous neural topic model: ProdLDA (Srivastava and Sutton, 2017), a VAE whose encoder maps the BoW document representation into a continuous latent representation
- Decoder network: reconstructs the BoW by generating its words from the latent document representation
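A minimal PyTorch sketch of this VAE structure (not the authors' implementation: layer sizes, the `ProdLDASketch` name, and the simplified prior handling are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProdLDASketch(nn.Module):
    """Illustrative ProdLDA-style VAE: BoW in, reconstructed word distribution out."""
    def __init__(self, vocab_size: int, n_topics: int, hidden: int = 100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Softplus())
        self.mu = nn.Linear(hidden, n_topics)        # variational posterior mean
        self.logvar = nn.Linear(hidden, n_topics)    # variational posterior log-variance
        self.beta = nn.Linear(n_topics, vocab_size)  # topic-word decoder

    def forward(self, bow: torch.Tensor):
        h = self.encoder(bow)                                  # continuous latent features
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        theta = F.softmax(z, dim=-1)                           # document-topic mixture
        word_dist = F.softmax(self.beta(theta), dim=-1)        # decoder reconstructs the BoW
        return word_dist, mu, logvar
```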

- Proposed Model : Combined Topic Model (CombinedTM)
- Extend this model with contextualized document embeddings from SBERT (Reimers and Gurevych, 2019), a recent extension of BERT that allows the quick generation of sentence embeddings
- CombinedTM: ProdLDA + SBERT document embedding (sketched below)
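A minimal sketch of the combination step, assuming the `ProdLDASketch` encoder above and the sentence-transformers library (the checkpoint name is only an example of an SBERT model): the only change with respect to ProdLDA is that the encoder input is the BoW concatenated with the contextualized document embedding, while the decoder still reconstructs only the BoW.

```python
import torch
from sentence_transformers import SentenceTransformer

# Example SBERT checkpoint; any sentence-transformers model works the same way.
sbert = SentenceTransformer("paraphrase-distilroberta-base-v2")

def combined_encoder_input(raw_doc: str, bow: torch.Tensor) -> torch.Tensor:
    """Concatenate the BoW vector with the SBERT document embedding."""
    ctx = torch.tensor(sbert.encode(raw_doc))  # contextualized document embedding
    return torch.cat([bow, ctx], dim=-1)       # encoder input: vocab_size + embedding dim

# The encoder's input layer grows from vocab_size to vocab_size + SBERT dimension;
# the decoder network is unchanged and still generates BoW words.
```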

- The limitations of CombinedTM
1. Document length limitation
One limitation of this method concerns the document length handled by SBERT (Sentence-BERT) and similar techniques.
These encoders can process or represent documents only up to a fixed length, so longer documents may not be fully taken into account.
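As a concrete illustration of this limit (assuming the sentence-transformers library; the checkpoint name is an example), the maximum sequence length of the encoder can be inspected directly, and tokens beyond it do not influence the document embedding:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-distilroberta-base-v2")
print(model.max_seq_length)  # word-piece limit of the underlying transformer

very_long_document = " ".join(["token"] * 10_000)
embedding = model.encode(very_long_document)  # only the first max_seq_length pieces are used
```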
3. Experimental Setting
3.1 Datasets
- 20NewsGroups
- Wiki20K (a collection of 20,000 English Wikipedia abstracts from Bianchi et al. (2021))
- Tweets2011
- Google News (Qiang et al., 2019)
- the StackOverflow dataset (Qiang et al., 2019)
3.2 Metrics
- 3 different metrics
- 2 topic coherence metrics: Normalized Pointwise Mutual Information (NPMI, sketched below) and external word embedding topic coherence
- 1 topic diversity metric: Inverted Rank-Biased Overlap (I-RBO)
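For a word pair, NPMI(w_i, w_j) = log(P(w_i, w_j) / (P(w_i) P(w_j))) / (-log P(w_i, w_j)), averaged over the pairs of each topic's top words. A minimal sketch, assuming probabilities are estimated as document-occurrence frequencies in a reference corpus (function and variable names are illustrative):

```python
import math
from itertools import combinations

def topic_npmi(topic_words, docs, eps=1e-12):
    """Average NPMI over all pairs of a topic's top words.

    topic_words: top-N words of one topic.
    docs: reference corpus, each document given as a set of tokens.
    """
    n_docs = len(docs)

    def p(*words):
        # Probability that all `words` occur together in a document (document frequency).
        return sum(all(w in d for w in words) for d in docs) / n_docs

    scores = []
    for wi, wj in combinations(topic_words, 2):
        p_ij = p(wi, wj)
        if p_ij == 0.0:
            scores.append(-1.0)  # convention used here for pairs that never co-occur
            continue
        pmi = math.log(p_ij / (p(wi) * p(wj) + eps))
        scores.append(pmi / (-math.log(p_ij)))
    return sum(scores) / len(scores)

# Toy example: score one topic against a tiny corpus of tokenized documents.
docs = [{"nasa", "space", "launch"}, {"space", "shuttle"}, {"stock", "market"}]
print(topic_npmi(["space", "launch", "shuttle"], docs))
```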
3.3 Baseline Models
The main objective is to show that contextual information increases coherence; CombinedTM is therefore compared against:
1. ProdLDA (Srivastava and Sutton, 2017), the model this paper extends
2. Neural Variational Document Model (NVDM) (Miao et al., 2016)
3. the very recent ETM (Dieng et al., 2020)
4. MetaLDA (MLDA) (Zhao et al., 2017)
5. LDA (Blei et al., 2003)
4. Results
- Quantitative evaluation
- Explore the effect on performance when two different contextualized representations are used
4.1 Quantitative Evaluation

- Compute all the metrics for 25, 50, 75, 100, and 150 topics
- We average results for each metric over 30 runs of each model
- LDA and NVDM obtain low coherence
5. Related Work
- Neural variational inference based Topic Models
- Miao et al. (2016) propose NVDM, an unsupervised generative model based on VAEs, assuming a Gaussian distribution over topics.
- Building upon NVDM, ETM (Dieng et al., 2020) represents words and topics in the same embedding space.
- ProdLDA : Srivastava and Sutton (2017) propose a neural variational framework that explicitly approximates the Dirichlet prior using a Gaussian distribution.
- Our approach builds on this work but includes a crucial component, i.e., representations from a pre-trained transformer that can benefit from both general language knowledge and corpus-dependent information.
- Similarly, Bianchi et al. (2021) replace the BoW document representation with pre-trained contextualized representations to tackle the problem of cross-lingual zero-shot topic modeling.
- This approach was extended by Mueller and Dredze (2021), who also consider fine-tuning the representations.
- A very recent approach (Hoyle et al., 2020) which follows a similar direction uses knowledge distillation (Hinton et al., 2015) to combine neural topic models and pre-trained transformers.
6. Conclusions
- Propose a straightforward and simple method to incorporate contextualized embeddings into topic models.
- This work improves the quality of the discovered topics.
- Effect of Contextualized Embeddings
-> Contextual information is a significant element to consider for topic modeling as well.