Research in NLP

Related Work on Cross-Lingual Topic Modeling

Seung-won Seo 2024. 3. 3. 18:51

Cross-lingual topic models were proposed as an extension of monolingual topic models to discover hidden common topics across documents written in different languages.

The earliest polylingual topic model (PLTM; Mimno et al. 2009 [1]) uses a single document-level topic distribution to generate a tuple of comparable documents in different languages. [1] assumes that the documents in a tuple are topically aligned, which makes it possible to track topic trends across languages. Subsequent research has focused on aligning topics in parallel or comparable corpora using bilingual dictionaries; [2] and [4] connect the vocabularies of different languages through bilingual dictionary entries.
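In sketch form, PLTM's generative story looks as follows (notation is mine, following the usual LDA conventions; see [1] for the exact model):

```latex
\begin{aligned}
\theta_d &\sim \operatorname{Dirichlet}(\alpha)
  && \text{shared topic proportions for document tuple } d \\
\phi_k^{\ell} &\sim \operatorname{Dirichlet}(\beta)
  && \text{topic--word distribution per topic } k \text{ and language } \ell \\
z_{d,n}^{\ell} &\sim \operatorname{Multinomial}(\theta_d)
  && \text{topic of token } n \text{ in language } \ell \\
w_{d,n}^{\ell} &\sim \operatorname{Multinomial}\bigl(\phi_{z_{d,n}^{\ell}}^{\ell}\bigr)
  && \text{word drawn from that topic's distribution}
\end{aligned}
```

The single shared \(\theta_d\) is what ties the languages together: each language keeps its own topic-word distributions \(\phi^{\ell}\), but all documents in a tuple express the same topic mixture.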

In particular, JointLDA [2] utilizes a bilingual dictionary and introduces "concepts" to connect words from different languages. By exploiting cross-lingual corpora, the model learns better monolingual topic models than LDA trained solely on monolingual data. [3] constructs a tree-structured dictionary to integrate word correlations and document alignment information into the model. MCTA [5], [6], and NMTM [7] connect words via translation dictionaries and directly align topics so that translation pairs belong to the same topic. These studies primarily align cross-lingual topics using dictionary translations. MTAnchor [8], a multilingual anchoring approach, was proposed to improve on MCTA [5]. The algorithm converges faster than generative methods and ultimately produces better vector representations for documents. Anchoring's advantages over generative methods lie in robustness and practicality: while generative methods require long documents to estimate topic-word distributions correctly, anchoring can handle documents of any size. However, both MCTA [5] and MTAnchor [8] suffer from words that do not align well across languages in the generated topics.
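A shared ingredient of these dictionary-based models is a word-link matrix built from the bilingual dictionary. Below is a minimal, self-contained sketch of that idea; the vocabularies and dictionary entries are toy examples of mine, not data or code from any of the cited papers.

```python
# Toy sketch (mine, not code from [2]-[7]): build a word-link matrix from a
# bilingual dictionary. Dictionary-based models differ in how they consume
# such links (soft priors, hard alignment, projections), but all start here.
import numpy as np

vocab_en = ["economy", "market", "music", "film"]
vocab_de = ["wirtschaft", "markt", "musik", "kino"]
dictionary = [("economy", "wirtschaft"), ("market", "markt"),
              ("music", "musik"), ("film", "kino")]

en_idx = {w: i for i, w in enumerate(vocab_en)}
de_idx = {w: i for i, w in enumerate(vocab_de)}

# M[i, j] = 1 iff German word i and English word j are dictionary translations.
M = np.zeros((len(vocab_de), len(vocab_en)))
for en_word, de_word in dictionary:
    if en_word in en_idx and de_word in de_idx:
        M[de_idx[de_word], en_idx[en_word]] = 1.0
```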

To improve this aspect, NMTM [7] was the first to propose a VAE-based [9] cross-lingual neural topic model. It transforms the topic-word distribution of one language into the vocabulary space of the other, so that each language's topic-word distributions can incorporate the semantics of the other language, aligning cross-lingual topics. The authors demonstrate that their model outperforms traditional multilingual topic models (MCTA [5], MTAnchor [8]).
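A hedged sketch of this transformation step, assuming a simple row-normalized dictionary matrix; the toy sizes, identity link matrix, and renormalization are my simplifications, not NMTM's exact formulation:

```python
# Project one language's topic-word distribution into the other language's
# vocabulary space via a dictionary link matrix, then renormalize.
import numpy as np

K, V_en, V_de = 3, 4, 4                          # topics, vocab sizes (toy)
rng = np.random.default_rng(0)

beta_en = rng.dirichlet(np.ones(V_en), size=K)   # K topics over English words
M = np.eye(V_de, V_en)                           # toy dictionary link matrix

# Row-normalize so each German word spreads its mass over its translations.
M_norm = M / np.clip(M.sum(axis=1, keepdims=True), 1e-12, None)

# English topics expressed over the German vocabulary; having comparable
# topic-word distributions in both spaces is what enables topic alignment.
beta_en_as_de = beta_en @ M_norm.T
beta_en_as_de /= beta_en_as_de.sum(axis=1, keepdims=True)
```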

More recently, the reliance of cross-lingual topic models on bilingual dictionaries has become an issue: for languages with low-coverage dictionaries, topic alignment does not work effectively. Additionally, directly aligning words through a bilingual dictionary can lead to degenerate topic representations.

To address these issues, InfoCTM [10] proposes aligning cross-lingual topics based on mutual information. This approach ensures proper alignment of cross-lingual topics while mitigating degenerate topic representations. Furthermore, to tackle the challenge of a low-coverage dictionary, the authors introduce a cross-lingual vocabulary linking method that identifies more linked words beyond the given dictionary, enhancing topic alignment.
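As a rough illustration of the mutual-information view, the contrastive bound below pulls the topic representations of dictionary-linked word pairs together and pushes unlinked pairs apart. This is a generic InfoNCE-style loss of my own, not InfoCTM's exact objective; the function name, inputs, and temperature are illustrative assumptions.

```python
# InfoNCE-style lower bound on mutual information between the two languages'
# topic representations of linked word pairs (illustrative, not InfoCTM's
# published objective).
import torch
import torch.nn.functional as F

def infonce_alignment(topics_l1, topics_l2, temperature=0.1):
    """topics_l1, topics_l2: (N, K) topic representations of N linked word
    pairs; row i of each tensor corresponds to the same dictionary link."""
    z1 = F.normalize(topics_l1, dim=-1)
    z2 = F.normalize(topics_l2, dim=-1)
    logits = z1 @ z2.t() / temperature               # (N, N) similarities
    targets = torch.arange(z1.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)          # positives on diagonal

# Toy usage: 8 linked word pairs, 3 topics.
loss = infonce_alignment(torch.randn(8, 3), torch.randn(8, 3))
```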

 

 

---------------------------------------------------------------------------------

 

References

 

[1] Mimno, D.; Wallach, H.; Naradowsky, J.; Smith, D. A.; and McCallum, A. 2009. Polylingual Topic Models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP).

[2] Jagarlamudi, J.; and Daumé III, H. 2010. Extracting Multilingual Topics from Unaligned Comparable Corpora. In Proceedings of the European Conference on Information Retrieval (ECIR).

[3] Hu, Y.; Zhai, K.; Eidelman, V.; et al. 2014. Polylingual Tree-Based Topic Models for Translation Domain Adaptation. In Proceedings of the Association for Computational Linguistics (ACL).

[4] Boyd-Graber, J.; and Blei, D. 2012. Multilingual Topic Models for Unaligned Text. arXiv preprint arXiv:1205.2657.

[5] Shi, B.; Lam, W.; Bing, L.; et al. 2016. Detecting Common Discussion Topics Across Culture from News Reader Comments. In Proceedings of the Association for Computational Linguistics (ACL).

[6] Yang, W.; Boyd-Graber, J.; and Resnik, P. 2019. A Multilingual Topic Model for Learning Weighted Topic Links Across Corpora with Low Comparability. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 1243–1248.

[7] Wu, X.; Li, C.; Zhu, Y.; and Miao, Y. 2020. Learning Multilingual Topics with Neural Variational Inference. In International Conference on Natural Language Processing and Chinese Computing (NLPCC).

[8] Yuan, M.; Van Durme, B.; and Ying, J. L. 2018. Multilingual Anchoring: Interactive Topic Modeling and Alignment Across Languages. In Advances in Neural Information Processing Systems 31 (NeurIPS).

[9] Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations (ICLR).

 

[10] Wu, X.; Dong, X.; Nguyen, T.; Liu, C.; Pan, L.; and Luu, A. T. 2023. InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).