Document Type: Articles

Authors

1 Islamic Azad University, Qazvin Branch, Qazvin, Iran

2 Al-Zahra University, Tehran, Iran

Abstract

Stemming is the process of finding the main morpheme of a word, and it is used in natural language processing, text mining, and information retrieval systems. A stemmer extracts the stem of a word. Persian stemmers fall into three main classes: structural stemmers, dictionary-based stemmers, and statistical stemmers. The precision of structural stemmers is low and the cost of dictionary-based stemmers is high, so the main goal of this research is to design and implement a high-precision statistical stemmer based on a hidden Markov model (HMM) that reduces the size of the index file and increases the speed of information retrieval systems. The proposed stemmer finds the prefixes and suffixes of a word and removes them; what remains is the stem. Some Persian words are exceptions, however, and stripping affixes in this way stems them incorrectly. We therefore compiled a dictionary of Persian stems. The proposed stemmer first looks a word up in the dictionary; if the word is not there, the HMM-based stemmer finds its stem. The stemmer was tested on the Bijankhan corpus and the Hamshahri test collection. The results show an increase in mean average precision and recall. The algorithm also increased the speed of the information retrieval system and decreased the size of the index files.
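The two-stage lookup described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the abstract does not specify the HMM topology, the training procedure, or the dictionary format. The sketch assumes a three-state left-to-right model (PREFIX, STEM, SUFFIX) over letters, estimates probabilities by counting over a handful of hand-segmented words with add-one smoothing, and decodes with the Viterbi algorithm. The words are Latin transliterations of Persian, and the names `train`, `viterbi`, `stem`, `TRAIN`, and `EXCEPTIONS` are hypothetical.

```python
import math
from collections import Counter

STATES = ("PREFIX", "STEM", "SUFFIX")
STARTS = ("PREFIX", "STEM")
# Assumed left-to-right topology: every path reads PREFIX* STEM+ SUFFIX*.
NEXT = {
    "PREFIX": ("PREFIX", "STEM"),
    "STEM": ("STEM", "SUFFIX", "END"),
    "SUFFIX": ("SUFFIX", "END"),
}


def train(segmented):
    """Count-based estimates with add-one smoothing from (prefix, stem, suffix) triples."""
    start, trans = Counter(), Counter()
    emit = {s: Counter() for s in STATES}
    alphabet = set()
    for parts in segmented:
        tags = [s for s, part in zip(STATES, parts) for _ in part]
        chars = "".join(parts)
        alphabet.update(chars)
        start[tags[0]] += 1
        for tag, ch in zip(tags, chars):
            emit[tag][ch] += 1
        for a, b in zip(tags, tags[1:] + ["END"]):
            trans[a, b] += 1
    n_words = sum(start.values())
    return {
        "start": {s: math.log((start[s] + 1) / (n_words + len(STARTS))) for s in STARTS},
        "trans": {
            (a, b): math.log((trans[a, b] + 1) / (sum(trans[a, c] for c in NEXT[a]) + len(NEXT[a])))
            for a in STATES for b in NEXT[a]
        },
        "emit": emit,
        "vocab": len(alphabet) + 1,  # one extra slot for unseen letters
    }


def viterbi(word, model):
    """Return the most probable state sequence for the letters of a non-empty word."""
    neg_inf = float("-inf")

    def emit_lp(state, ch):
        counts = model["emit"][state]
        return math.log((counts[ch] + 1) / (sum(counts.values()) + model["vocab"]))

    def trans_lp(a, b):
        return model["trans"].get((a, b), neg_inf)  # disallowed moves score -inf

    score = {s: model["start"].get(s, neg_inf) + emit_lp(s, word[0]) for s in STATES}
    back = []
    for ch in word[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + trans_lp(p, s))
            new_score[s] = score[prev] + trans_lp(prev, s) + emit_lp(s, ch)
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # The path must be able to end: only STEM and SUFFIX may move to END.
    path = [max(STATES, key=lambda s: score[s] + trans_lp(s, "END"))]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def stem(word, exceptions, model):
    """Dictionary lookup first; fall back to HMM decoding for unlisted words."""
    if not word:
        return word
    if word in exceptions:
        return exceptions[word]
    return "".join(ch for ch, s in zip(word, viterbi(word, model)) if s == "STEM")


# Toy, hand-segmented training data (transliterated; illustrative only).
TRAIN = [("mi", "rav", "am"), ("", "ketab", "ha"), ("mi", "khor", "am"), ("", "dast", "ha")]
# "miz" (table) begins with the letters of the prefix "mi" but carries no prefix.
EXCEPTIONS = {"miz": "miz"}
model = train(TRAIN)

print(stem("miravam", EXCEPTIONS, model))  # rav
print(stem("ketabam", EXCEPTIONS, model))  # ketab
print(stem("miz", EXCEPTIONS, model))      # miz (from the dictionary)
```

With this toy model, decoding "miz" without the dictionary does not return "miz", which is the kind of exception the dictionary stage is meant to catch. A realistic system would need far more training data and a model chosen to match the paper's actual design.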
