
Title:
Topic Models: Accounting Component Structure of Bigrams
Author:
Natalia Loukachevitch: Lomonosov Moscow State University, Russian Federation
Michael Nokel: Lomonosov Moscow State University, Russian Federation
Year:
2015
Conference:
Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania
Issue:
109
Article no.:
019
Pages:
145-152
No. of pages:
8
Publication type:
Abstract and Fulltext
Published:
2015-05-06
ISBN:
978-91-7519-098-3
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Series:
NEALT Proceedings Series
Publisher:
Linköping University Electronic Press, Linköpings universitet



The paper describes an experimental study of integrating bigram collocations, and the similarities between them and unigrams, into topic models. First, we propose PLSA-SIM, a novel modification of the original PLSA algorithm. It incorporates bigrams and maintains relationships between unigrams and bigrams based on their component structure. We then analyze a variety of word association measures in order to integrate top-ranked bigrams into topic models. All experiments were conducted on four text collections covering different domains and languages. The experiments identify a subgroup of the tested measures whose top-ranked bigrams, when integrated into the PLSA-SIM algorithm, significantly improve topic model quality on all collections.
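
To make the idea of ranking bigrams by a word association measure concrete, here is a minimal Python sketch. The abstract does not say which measures were tested, so pointwise mutual information (PMI) is used here purely as one common example; the function name `rank_bigrams_by_pmi`, the `min_count` threshold, and the toy token list are all assumptions for illustration and are not taken from the paper. The sketch does not implement PLSA-SIM itself.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(tokens, min_count=2):
    """Rank adjacent-word bigrams by pointwise mutual information (PMI).

    Illustrative only: PMI is one possible association measure, and
    min_count is an assumed frequency cutoff, not a value from the paper.
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # skip rare bigrams, whose PMI estimates are unreliable
        p_xy = count / n_bigrams
        p_x = unigram_counts[w1] / n_unigrams
        p_y = unigram_counts[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))

    # Highest-scoring bigrams first; these would be the candidates
    # to add to the topic model vocabulary.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy input, standing in for a tokenized text collection.
toy_tokens = (
    "topic model quality depends on the topic model "
    "and on the text collection"
).split()
ranked = rank_bigrams_by_pmi(toy_tokens, min_count=2)
for bigram, score in ranked:
    print(" ".join(bigram), round(score, 3))
```

In a setting like the one the abstract describes, the top of such a ranking would be the set of bigrams added to the model, and each bigram could be linked to its component unigrams (for example, `("topic", "model")` to `"topic"` and `"model"`). How PLSA-SIM uses those links is described in the full text, not here.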


