Document Type : Articles
Ph.D., Department of Computer Science, University of Sana’a
In this work, we investigate the Right-Truncated Index-Based Web Search Engine in order to determine its usefulness for storing and retrieving Arabic documents. The engine is a program that reads any set of Arabic documents, accepts a query, and processes both the documents and the query; it then selects (predicts) the documents most relevant to that query. The program comprises a morphological component and a mathematical one. The morphological component allows the researcher to run either a stemming algorithm or a right-truncation algorithm. The chief advantage of the stemming algorithm is that it uses the least storage for indexing, because it maps inflected and derived terms to a single indexed stem. The right-truncation algorithm, by contrast, reduces storage to a lesser degree but increases the probability of retrieving relevant (user-favorable) documents. One purpose of our investigation is to compare the efficiency of these two indexing mechanisms. The mathematical component accepts the output of the right-truncation algorithm and applies term frequency and inverse document frequency (TF-IDF) weighting to establish the relative importance of each document with respect to the query terms. This paper also describes the construction of a simple search engine based on a crawler (spider). The crawler, which indexes different types of documents, is an algorithm that traverses the file system starting from a specified folder. A basic design and object model was developed to support both single-word and multi-word searches. The crawler can find data to index by following (tracing) web links rather than by searching directory listings in the file system.
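The link-following crawl described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's actual implementation: the names `LinkExtractor`, `crawl`, `fetch`, and `SITE`, the `max_pages` limit, and the `example.test` URLs are all hypothetical. Page retrieval is passed in as a callable so that the sketch runs without a network connection, and a set of already-seen URLs keeps the crawler from revisiting pages that link back to each other.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl starting at start_url.

    fetch is a hypothetical callable mapping a URL to its HTML text,
    or to None when the page cannot be retrieved.
    Returns a dict {url: html} of every page that was downloaded.
    """
    visited = {start_url}      # URLs already queued; prevents endless loops
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" suffix.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute not in visited:
                visited.add(absolute)
                queue.append(absolute)
    return pages


# In-memory stand-in for HTTP (assumption: three pages that link in a cycle).
SITE = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<a href="/">home</a>',
    "http://example.test/b.html": '<a href="a.html#top">A again</a>',
}
pages = crawl("http://example.test/", SITE.get)
print(sorted(pages))
```

In a real deployment, `fetch` could wrap `urllib.request.urlopen`; the dictionary lookup used here only stands in for that download step.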
In this process, files are downloaded over HTTP and HTML pages are parsed to obtain further links without entering a recursive loop. The paper also discusses how a right-truncated stemmer can improve the efficiency of the indexing mechanism for Arabic document processing.
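To make the indexing idea concrete, the sketch below right-truncates each token and then ranks documents by TF-IDF. It rests on several assumptions that are not taken from the paper: right truncation is modeled simply as keeping the first `PREFIX_LEN` characters of a word, the value 4 is arbitrary, TF-IDF is computed in one common variant (relative term frequency multiplied by the logarithm of N over document frequency), and the function names and sample sentences are hypothetical.

```python
import math
from collections import Counter

PREFIX_LEN = 4  # hypothetical cut-off; not a value given in the paper


def right_truncate(word, n=PREFIX_LEN):
    """Keep the first n characters of a word and drop the rest."""
    return word[:n]


def tokenize(text):
    """Split on whitespace and right-truncate every token."""
    return [right_truncate(w) for w in text.split()]


def build_index(docs):
    """Return per-document term counts and per-term document frequencies."""
    term_counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    return term_counts, doc_freq


def rank(query, docs):
    """Score each document as the sum of TF-IDF weights of the query terms."""
    term_counts, doc_freq = build_index(docs)
    n_docs = len(docs)
    query_terms = set(tokenize(query))
    scores = []
    for counts in term_counts:
        total = sum(counts.values())
        score = 0.0
        for term in query_terms:
            if counts[term]:
                tf = counts[term] / total
                idf = math.log(n_docs / doc_freq[term])
                score += tf * idf
        scores.append(score)
    # Document indices, highest score first; ties keep document order.
    order = sorted(range(n_docs), key=lambda i: -scores[i])
    return order, scores


DOCS = [
    "الكتاب الجديد في المكتبة",   # "the new book in the library"
    "الطالب يقرأ الكتابين",       # "the student reads the two books"
    "السيارة الجديدة سريعة",      # "the new car is fast"
]
order, scores = rank("الكتب", DOCS)  # query: "the books"
print(order, scores)
```

With this cut-off, the singular, dual, and plural forms of "book" all truncate to the same index term, so the first two documents receive a non-zero score for the query while the third does not. A shorter or longer prefix trades index size against the risk of conflating unrelated words.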