Document Type: Articles

Authors

Iranian Research Institute for Information Science and Technology (IranDoc)

Abstract

Most current linguistic studies rely on valid linguistic data drawn from corpora, and compiling corpora has become common practice in linguistic research. The present study introduces two specialized corpora in Persian; a specialized corpus is one used to study a particular type of language or language variety. To build the corpora, a set of texts was first compiled according to pre-established sampling criteria: the mode, type, domain, language or language variety, and date of the texts. The corpora are specialized because they include technical terms from information processing and management, librarianship, linguistics, computational linguistics, thesaurus building, management, policy-making, natural language processing, information technology, information retrieval, ontology, and other related interdisciplinary domains. After the data and metadata were compiled, the texts were preprocessed (normalized and tokenized) and annotated with automated POS tagging; finally, the tags were checked manually. Each corpus contains more than four million words. Because few specialized corpora have been built for Persian, these corpora can serve as valuable resources for researchers interested in linguistic variation in Persian interdisciplinary texts.

https://dorl.net/dor/20.1001.1.20088302.2022.20.4.14.3
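The preprocessing stage named in the abstract (normalization followed by tokenization) can be illustrated with a minimal sketch. The abstract does not state which rules or tools were used, so everything below is an assumption chosen for illustration: the character mappings, the diacritic range, the token pattern, and the function names `normalize` and `tokenize` are all hypothetical. POS tagging is not shown because it requires a trained model.

```python
import re

# Assumed mappings: Arabic code points that are often replaced by their
# Persian counterparts during normalization (illustrative, not from the paper).
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian kaf
})

# A token is a run of word characters and zero-width non-joiners (U+200C),
# or a single punctuation character. This pattern is an assumption.
TOKEN_RE = re.compile(r"[\w\u200C]+|[^\w\s]")


def normalize(text: str) -> str:
    """Unify character variants and whitespace (hypothetical rule set)."""
    text = text.translate(ARABIC_TO_PERSIAN)
    text = text.replace("\u0640", "")              # drop tatweel (kashida)
    text = re.sub(r"[\u064B-\u0652]", "", text)    # drop short-vowel diacritics
    text = re.sub(r"\s+", " ", text)               # collapse whitespace runs
    return text.strip()


def tokenize(text: str) -> list[str]:
    """Split normalized text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    raw = "\u0643\u062A\u0627\u0628  \u0639\u0644\u0645\u064A."
    print(tokenize(normalize(raw)))
```

Keeping the zero-width non-joiner inside the word pattern means that forms written with it stay as single tokens, which matters for the manual tag checking the abstract describes; a different project could reasonably choose to split on it instead.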

Keywords
