Persian Text Classification Enhancement by Latent Semantic Space

Mohammad Bagher Dastgheib, Sara Koleini

Abstract


Heterogeneous data of all kinds are growing rapidly on the web. Because web search results contain many types of data, the results are commonly classified to help users find the data they want. Many machine learning methods are used to classify textual data. The main challenges in text classification are the computational cost of the classifier and its classification performance. The vector space model (VSM) is a traditional model for representing text in information retrieval (IR). In this representation, the cost of computation depends on the dimension of the vectors. Another problem is selecting effective features and pruning unwanted terms. In this work, latent semantic indexing (LSI) is used to transform the VSM into an orthogonal semantic space that takes relations between terms into account. Experimental results show that the LSI semantic space achieves better computation time and classification accuracy. These results suggest that the semantic topic space contains less noise, which increases accuracy, while the lower vector dimension reduces computational complexity.
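The transformation from the vector space model to a latent semantic space can be illustrated with a minimal sketch. It is not the authors' implementation: the four-document English corpus is a hypothetical toy example, raw term counts stand in for whatever term weighting the paper uses, and the truncated SVD is approximated with plain power iteration so that the example needs only the standard library. All function names are illustrative.

```python
import math

def term_document_matrix(docs):
    """Build a raw-count term-document matrix (rows = terms, columns = documents)."""
    vocab = sorted({w for d in docs for w in d.split()})
    row = {w: i for i, w in enumerate(vocab)}
    a = [[0.0] * len(docs) for _ in vocab]
    for j, d in enumerate(docs):
        for w in d.split():
            a[row[w]][j] += 1.0
    return vocab, a

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def cosine(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def lsi_document_vectors(a, k, iters=200):
    """Return k-dimensional document vectors (sigma_c * v_c[j]) from a rank-k SVD.

    The top-k right singular vectors are approximated by power iteration
    with deflation on the Gram matrix A^T A (assumes A has rank >= k).
    """
    n = len(a[0])
    gram = [[sum(r[i] * r[j] for r in a) for j in range(n)] for i in range(n)]
    basis, sigmas = [], []
    for c in range(k):
        v = [1.0 + 0.1 * (i + c) for i in range(n)]  # deterministic start vector
        for _ in range(iters):
            w = [dot(gram[i], v) for i in range(n)]
            for u in basis:  # remove directions that were already found
                proj = dot(w, u)
                w = [p - proj * q for p, q in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            v = [p / norm for p in w]
        basis.append(v)
        # singular value: sigma^2 = v^T (A^T A) v
        sigmas.append(math.sqrt(dot(v, [dot(gram[i], v) for i in range(n)])))
    return [[sigmas[c] * basis[c][j] for c in range(k)] for j in range(n)]

# Hypothetical toy corpus: two sports documents, two finance documents.
docs = [
    "football team goal",
    "football match goal league",
    "bank loan interest",
    "bank market interest stock loan",
]
vocab, a = term_document_matrix(docs)
raw = [[r[j] for r in a] for j in range(len(docs))]  # VSM document vectors
latent = lsi_document_vectors(a, k=2)                # LSI document vectors

print("dimensions:", len(vocab), "->", len(latent[0]))
print("sports pair, VSM cosine:", cosine(raw[0], raw[1]))
print("sports pair, LSI cosine:", cosine(latent[0], latent[1]))
print("sports vs finance, LSI cosine:", cosine(latent[0], latent[2]))
```

In this toy setting each document shrinks from a 10-term vector to 2 latent dimensions. The two sports documents share only two of their terms, so their cosine similarity in the original space is roughly 0.58, while in the latent space it is close to 1 and the sports and finance documents stay near-orthogonal. A classifier trained on the latent vectors would therefore work with shorter, less noisy inputs, which is the effect the abstract describes.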

Keywords


Persian Text Classification, Vector Space Model, Latent Semantic Indexing (LSI).







E-ISSN: 2008-8310

ISSN: 2008-8302