Document Type : Articles
Authors
1 Ph.D., Department of Computer Science, University of Sana’a
2 Ph.D., Department of Electrical Engineering, University of Sana’a
Abstract
This paper presents an efficient mechanism for converting the Sana’ani dialect to Modern Standard Arabic (MSA). The mechanism is based on morphological rules related to the Sana’ani dialect as well as to MSA; these rules facilitate converting dialect text to its corresponding MSA. The mechanism tokenizes the input dialect text and divides each token into a stem and its affixes. The affixes fall into two categories, dialect affixes and MSA affixes, and a token may carry either or both; likewise, the stem may be a dialect stem or an MSA stem. Our mechanism, which is implemented using a simple MSA stemmer, must therefore account for these cases. Our dialect stemmer is then applied to strip the resulting token and extract the dialect affixes. At this point, the rules are applied to decide when an affix should be extracted. The experiment shows that the Sana’ani dialect has three classes of distortion: prefix, suffix, and stem distortions. The algorithm normalizes these distortions based on the morphological rules. For each morphological rule, the mechanism checks whether the rule can be applied: if the rule's conditions are met, the dialect affix is replaced by its corresponding MSA affix. If there is no restriction on applying the rule related to a distorted stem, the rule can be considered a parallel corpus of the dialect and MSA. Finally, the experiment computes the distortion ratio of MSA in the Sana’ani dialect. For a Sana’ani dialect sample of 9386 words, 16.29% have distorted suffixes, 0.70% have distorted prefixes, and 2.17% contain distorted stems. These percentages refer only to the processed words.
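The pipeline described in the abstract (tokenize, strip affixes, apply a rule only when its condition holds, replace distorted stems, and count distortions per class) might be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the rule table, the stem lookup, the minimum-stem-length condition, and every name in the code are assumptions, and the affixes are artificial placeholders (`DP`, `DS`, `DSTEM`) rather than real Sana’ani or MSA morphemes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AffixRule:
    kind: str          # "prefix" or "suffix"
    dialect: str       # dialect affix to detect (placeholder string)
    msa: str           # MSA affix that replaces it (placeholder string)
    min_stem_len: int  # hypothetical rule condition: shortest stem allowed


# Hypothetical placeholder rules; real rules would come from the paper.
RULES = [
    AffixRule("prefix", "DP", "MP", 2),
    AffixRule("suffix", "DS", "MS", 2),
]

# Hypothetical unrestricted stem rules, stored as a dialect-to-MSA lookup.
STEM_MAP = {"DSTEM": "MSTEM"}


def normalize_token(token):
    """Return (normalized token, set of distortion classes found)."""
    classes = set()
    prefix, stem, suffix = "", token, ""
    for rule in RULES:
        if rule.kind in classes:
            continue  # apply at most one rule per affix position
        if rule.kind == "prefix" and stem.startswith(rule.dialect):
            rest = stem[len(rule.dialect):]
            if len(rest) >= rule.min_stem_len:  # rule condition is met
                prefix, stem = rule.msa, rest
                classes.add("prefix")
        elif rule.kind == "suffix" and stem.endswith(rule.dialect):
            rest = stem[:-len(rule.dialect)]
            if len(rest) >= rule.min_stem_len:  # rule condition is met
                suffix, stem = rule.msa, rest
                classes.add("suffix")
    if stem in STEM_MAP:  # distorted stem with an unrestricted rule
        stem = STEM_MAP[stem]
        classes.add("stem")
    return prefix + stem + suffix, classes


def convert(text):
    """Whitespace-tokenize the text and normalize each token."""
    return " ".join(normalize_token(tok)[0] for tok in text.split())


def distortion_ratios(tokens):
    """Percentage of tokens showing each distortion class."""
    counts = Counter()
    for tok in tokens:
        counts.update(normalize_token(tok)[1])
    total = len(tokens)
    return {k: 100.0 * counts[k] / total for k in ("prefix", "suffix", "stem")}
```

Because a token can show more than one class of distortion, the three ratios in this sketch are computed independently and need not sum to the share of distorted tokens.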