Document Type: Article
Authors
CAIT, Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia, 43600 UKM, Bangi Selangor.
Abstract
Arabic has been influenced by other languages through several factors, such as geographical proximity, trade, past Islamic conquests, science and technology, new devices, brand names, models, and fashion. As a result, foreign words are used in Arabic text; these are known as Arabised words. Arabised words hamper Arabic natural language processing (NLP) tasks because they make it more difficult to identify the correct stem or root of an Arabic word. Arabic NLP can therefore be made more effective if Arabised word removal is part of pre-processing. In this paper, we propose an algorithm for detecting and extracting Arabised words as a pre-processing step for Arabic stemming. The algorithm combines lexicon-based and rule-based approaches. The lexicon was compiled from various Arabic text resources, and the rule-based component handles Arabised words that carry definite articles and applies pattern matching on prefixes and suffixes. To evaluate the effect of the proposed algorithm on an Arabic NLP task, we apply Arabised word removal as a pre-processing step in Arabic stemmers. Three Arabic stemmers are used in the evaluation, namely light stemming, conditional light stemming and ARLS, on three types of standard Arabic datasets. We compare the precision, recall and IFC of each stemmer with and without Arabised word removal. The results show that the performance of all three stemmers improves when Arabised word removal is included in pre-processing. Arabised word removal is therefore a useful pre-processing stage for Arabic NLP applications, particularly Arabic stemming.
https://dorl.net/dor/20.1001.1.20088302.2022.20.4.6.5
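The combination of lexicon lookup and affix rules described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicon entries, the prefix and suffix lists, and the function names (`candidate_forms`, `is_arabised`, `remove_arabised`) are all hypothetical placeholders chosen for illustration, and the actual rules in the paper may differ.

```python
# Illustrative sketch only. The lexicon and affix lists are hypothetical
# placeholders, not the resources or rules used in the paper.

# Assumed sample lexicon: computer, television, dollar.
ARABISED_LEXICON = {"كمبيوتر", "تلفزيون", "دولار"}

# Assumed forms of the definite article, alone or with an attached particle.
# Longer forms come first so that they are tried before the bare article.
DEFINITE_PREFIXES = ("وال", "بال", "كال", "فال", "ال")

# Assumed common suffixes (plural and relational endings).
SUFFIXES = ("ات", "ين", "ون", "ية")


def candidate_forms(token):
    """Yield the token and variants with a prefix and/or suffix stripped."""
    bases = [token]
    for prefix in DEFINITE_PREFIXES:
        if token.startswith(prefix) and len(token) > len(prefix) + 1:
            bases.append(token[len(prefix):])
            break
    for base in bases:
        yield base
        for suffix in SUFFIXES:
            if base.endswith(suffix) and len(base) > len(suffix) + 1:
                yield base[:-len(suffix)]


def is_arabised(token):
    """Return True if any candidate form of the token is in the lexicon."""
    return any(form in ARABISED_LEXICON for form in candidate_forms(token))


def remove_arabised(tokens):
    """Split tokens into (kept, removed) before they reach a stemmer."""
    kept, removed = [], []
    for token in tokens:
        (removed if is_arabised(token) else kept).append(token)
    return kept, removed


tokens = ["الكمبيوتر", "كتاب", "دولارات", "والتلفزيون"]
kept, removed = remove_arabised(tokens)
```

In this sketch only the tokens left in `kept` would be passed on to a stemmer, while those in `removed` are set aside as Arabised words. A real system would need a far larger lexicon and more careful rules to avoid stripping letters that belong to a native Arabic word.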