Related Works of Cross-Lingual Topic Modeling

2024. 3. 3. 18:51·Research in NLP

Cross-lingual topic models were proposed as an extension of monolingual topic models to discover common latent topics in documents written in different languages.

The earliest polylingual topic model ([1] Mimno et al. 2009) uses a single topic distribution to generate a tuple of comparable documents in different languages. [1] assumes that the documents in a tuple are topically aligned, which makes it possible to track topic trends across languages. Since then, research has focused on aligning topics across parallel or comparable corpora using bilingual dictionaries. [2] and [4] connect the vocabularies of the two languages through bilingual dictionary entries.
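The shared-topic-distribution idea behind the polylingual topic model can be illustrated with a small sampling sketch. This is a minimal illustration under assumptions, not the implementation from [1]: the function names, the toy vocabulary sizes, and the symmetric Dirichlet hyperparameter `alpha` are all hypothetical. What it shows is only the key structural point, namely that every document in a tuple draws its token topics from the same `theta`, while each language keeps its own topic-word distributions.

```python
import random

def sample_dirichlet(rng, alpha, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma samples."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_document_tuple(rng, phi_by_lang, doc_lengths, alpha):
    """Sample one tuple of comparable documents that share a topic distribution.

    phi_by_lang: dict mapping language -> list of K topic-word distributions
    doc_lengths: dict mapping language -> number of tokens to generate
    """
    num_topics = len(next(iter(phi_by_lang.values())))
    # One topic distribution is shared by every document in the tuple.
    theta = sample_dirichlet(rng, alpha, num_topics)
    docs = {}
    for lang, phi in phi_by_lang.items():
        vocab_ids = range(len(phi[0]))
        tokens = []
        for _ in range(doc_lengths[lang]):
            # Topic assignment comes from the shared theta ...
            z = rng.choices(range(num_topics), weights=theta)[0]
            # ... but the word comes from the language-specific topic.
            w = rng.choices(vocab_ids, weights=phi[z])[0]
            tokens.append(w)
        docs[lang] = tokens
    return theta, docs

rng = random.Random(0)
num_topics = 3
# Hypothetical toy topic-word distributions: 8 English and 6 German word ids.
phi_by_lang = {
    "en": [sample_dirichlet(rng, 1.0, 8) for _ in range(num_topics)],
    "de": [sample_dirichlet(rng, 1.0, 6) for _ in range(num_topics)],
}
theta, docs = sample_document_tuple(rng, phi_by_lang, {"en": 20, "de": 15}, alpha=0.5)
```

Because the topic proportions are tied across the tuple, the model needs document-level alignment (parallel or comparable documents) but no word-level dictionary.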

In particular, JointLDA [2] uses a bilingual dictionary and introduces "concepts" to connect words from different languages. Trained on cross-lingual corpora, it yields better monolingual models than LDA trained solely on monolingual data. [3] constructs a tree-structured dictionary to integrate word correlations and document alignment information into the model. [5] (MCTA), [6], and [7] (NMTM) link words through translation dictionaries and directly align topics so that linked words belong to the same topic. These studies primarily align cross-lingual topics using dictionary translations.

[8] (MTAnchor), a multilingual anchoring approach, is proposed as an alternative to generative methods such as [5] (MCTA). The algorithm converges faster than generative methods and ultimately forms better vector representations for documents. Anchoring's advantages over generative methods lie in robustness and practicality: while generative methods require long documents to estimate topic-word distributions correctly, anchoring can handle documents of any size. However, both [5] (MCTA) and [8] (MTAnchor) suffer from the problem that words in the generated topics do not align well across languages.

To improve this aspect, [7] (NMTM) was the first to propose a cross-lingual neural topic model based on the VAE [9]. It transforms the topic-word distributions of one language into the vocabulary space of the other language. The topic-word distributions of one language can thus incorporate the semantics of the other, which aligns cross-lingual topics. The authors show that their model outperforms traditional multilingual topic models ([5] MCTA, [8] MTAnchor).
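The idea of moving a topic-word distribution into another language's vocabulary space can be sketched as follows. This is only an illustration of the general mechanism, assuming a list of dictionary links and a mixing weight `lam`; the helper names, the toy vocabularies, and the blending rule are hypothetical and are not taken from the NMTM paper, whose exact formulation may differ.

```python
def project_topics(topic_word, dictionary_links, target_vocab_size):
    """Carry each topic's word weights over to the target vocabulary.

    dictionary_links: list of (source_word_id, target_word_id) translation pairs
    """
    projected = []
    for topic in topic_word:
        row = [0.0] * target_vocab_size
        for src, tgt in dictionary_links:
            row[tgt] += topic[src]
        projected.append(row)
    return projected

def normalize(row):
    """Rescale a non-negative row so that it sums to one."""
    total = sum(row)
    return [v / total for v in row] if total > 0 else row

def blend_topics(native, projected, lam):
    """Mix native topic-word weights with weights projected from the other language."""
    return [
        normalize([(1 - lam) * n + lam * p for n, p in zip(native_row, projected_row)])
        for native_row, projected_row in zip(native, projected)
    ]

# Hypothetical toy vocabularies.
en_vocab = ["music", "song", "market"]
de_vocab = ["Musik", "Lied", "Markt", "Aktie"]
links_en_to_de = [(0, 0), (1, 1), (2, 2)]  # music-Musik, song-Lied, market-Markt

topics_en = [[0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
topics_de = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]

projected = project_topics(topics_en, links_en_to_de, len(de_vocab))
aligned_de = blend_topics(topics_de, projected, lam=0.5)
top_word_topic0 = de_vocab[max(range(len(de_vocab)), key=lambda i: aligned_de[0][i])]
```

In this toy setup, the uninformative German topics pick up the emphasis of their English counterparts, so topic 0 in both languages ends up centered on music-related words.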

More recently, a limitation of cross-lingual topic models has been pointed out: because they rely on bilingual dictionaries, topic alignment does not work effectively for languages with insufficient dictionary coverage. In addition, directly aligning topics with a bilingual dictionary can lead to degenerate topic representations.

To address these issues, [10] (InfoCTM) proposes aligning cross-lingual topics by maximizing mutual information. This approach aligns cross-lingual topics properly and mitigates the problem of degenerate topic representations. To handle low-coverage dictionaries, it also introduces a cross-lingual vocabulary linking method, which identifies linked words beyond the provided dictionary and thereby strengthens topic alignment.
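One plausible way to find linked words beyond a low-coverage dictionary is to let a word borrow the translations of its nearest neighbors in a monolingual embedding space. The sketch below illustrates that idea only; whether it matches the exact procedure in [10] is an assumption, and the function names, toy embeddings, and `num_neighbors` parameter are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def link_vocabulary(embeddings, dictionary, num_neighbors):
    """Extend dictionary links using nearest neighbors in embedding space.

    embeddings: dict mapping source word -> embedding vector
    dictionary: dict mapping source word -> set of target-language translations
    Returns a dict mapping every source word to its (possibly empty) set of links.
    """
    linked = {}
    for word, vec in embeddings.items():
        neighbors = sorted(
            (other for other in embeddings if other != word),
            key=lambda other: cosine(vec, embeddings[other]),
            reverse=True,
        )[:num_neighbors]
        # Start from the word's own dictionary entries, if any.
        links = set(dictionary.get(word, ()))
        # Add the translations of its nearest neighbors.
        for neighbor in neighbors:
            links.update(dictionary.get(neighbor, ()))
        linked[word] = links
    return linked

# Hypothetical toy data: only "song" and "market" appear in the dictionary.
embeddings = {
    "song": [0.9, 0.1],
    "melody": [0.8, 0.2],
    "stock": [0.1, 0.9],
    "market": [0.2, 0.8],
}
dictionary = {"song": {"Lied"}, "market": {"Markt"}}
linked = link_vocabulary(embeddings, dictionary, num_neighbors=1)
```

Here "melody" and "stock" have no dictionary entry of their own, yet they become linked to "Lied" and "Markt" through their nearest neighbors, which gives the alignment objective more word pairs to work with.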

 

 

---------------------------------------------------------------------------------

 

References

[1] Mimno, D.; Wallach, H.; Naradowsky, J.; Smith, D. A.; and McCallum, A. 2009. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing.

[2] Jagarlamudi, J.; and Daumé, H. 2010. Extracting multilingual topics from unaligned comparable corpora. In Proceedings of the European Conference on Information Retrieval.

[3] Hu, Y.; Zhai, K.; Eidelman, V.; et al. 2014. Polylingual tree-based topic models for translation domain adaptation. In Proceedings of the Association for Computational Linguistics.

[4] Boyd-Graber, J.; and Blei, D. 2012. Multilingual topic models for unaligned text. arXiv preprint arXiv:1205.2657.

[5] Shi, B.; Lam, W.; Bing, L.; et al. 2016. Detecting common discussion topics across culture from news reader comments. In Proceedings of the Association for Computational Linguistics.

[6] Yang, W.; Boyd-Graber, J.; and Resnik, P. 2019. A multilingual topic model for learning weighted topic links across corpora with low comparability. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 1243–1248.

[7] Wu, X.; Li, C.; Zhu, Y.; and Miao, Y. 2020. Learning multilingual topics with neural variational inference. In International Conference on Natural Language Processing and Chinese Computing.

[8] Yuan, M.; Van Durme, B.; and Ying, J. L. 2018. Multilingual anchoring: Interactive topic modeling and alignment across languages. Advances in Neural Information Processing Systems, 31.

[9] Kingma, D. P.; and Welling, M. 2014. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014.

[10] InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling. AAAI 2023.

 

 

 
