Document Type: Articles


1 Assistant Prof., Department of Knowledge and Information Science, Kharazmi University, Tehran, Iran.

2 Associate Prof., Department of Knowledge and Information Science, University of Tabriz, Tabriz, Iran.

3 Associate Prof., Department of Knowledge and Information Science, University of Tehran, Tehran, Iran.

4 Assistant Prof., Information Science Research Department, Iranian Research Institute for Information Science and Technology (IRANDOC), Tehran, Iran.

5 Ph.D. Student, Department of Electrical and Computer Engineering, University of Tehran, Tehran, Iran.


A significant amount of scientific text is produced in Persian and made available on the Web through scientific information databases. In this paper we present FarsAcademic, a test collection of Persian scientific texts built for implementing and evaluating information retrieval models in academic search; it comprises 102,238 documents and 61 topics. While constructing FarsAcademic, we sought to resolve problems specific to information retrieval (IR) and natural language processing (NLP) in Persian scientific texts. Domain experts were recruited to create queries within their own research areas, and both user relevance and topical relevance were applied to improve the precision of the relevance judgments. Further, to improve retrieval performance on Persian scientific texts, automatic query expansion was applied using a relevance feedback technique, the Local Context Analysis algorithm. The results showed that query expansion outperformed the other information retrieval models on the Persian scientific text retrieval task. FarsAcademic is, to our knowledge, the only such collection provided free of charge to Iranian information retrieval researchers, enabling them to implement and evaluate different retrieval models and algorithms on Persian scientific text and academic search.
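The Local Context Analysis idea mentioned above (Xu & Croft, 2000, in the references below) can be sketched roughly as follows: retrieve an initial set of documents, then rank candidate expansion terms by how strongly they co-occur with the query terms inside the top-ranked documents, weighted by inverse document frequency. The sketch below is a heavily simplified, illustrative version, not the paper's implementation; the function name `lca_expand`, the parameter choices, and the toy English corpus are all assumptions for demonstration (the paper's collection is Persian).

```python
import math
from collections import Counter

def lca_expand(query_terms, docs, k=3, n_expand=2, delta=0.1):
    """Simplified Local Context Analysis-style query expansion:
    rank candidate terms by their co-occurrence with the query terms
    in the top-k retrieved documents, weighted by inverse document
    frequency (after Xu & Croft, 2000; heavily simplified)."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]

    # Inverse document frequency for every term in the corpus.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(1 + n / df[t]) for t in df}

    # Naive first-pass retrieval: score documents by query-term frequency.
    scores = [sum(toks.count(q) for q in query_terms) for toks in tokenized]
    top = sorted(range(n), key=lambda i: -scores[i])[:k]

    # Score each candidate term by co-occurrence with the query terms
    # inside the top-ranked documents (the "local context").
    candidates = {t for i in top for t in tokenized[i]} - set(query_terms)
    cand_scores = {}
    for c in candidates:
        s = 1.0
        for q in query_terms:
            co = sum(tokenized[i].count(c) * tokenized[i].count(q) for i in top)
            s *= (delta + math.log(1 + co) * idf[c] / math.log(1 + n)) ** idf.get(q, 1.0)
        cand_scores[c] = s

    best = sorted(cand_scores, key=cand_scores.get, reverse=True)[:n_expand]
    return list(query_terms) + best

# Toy corpus purely for illustration.
docs = [
    "persian text retrieval uses query expansion",
    "query expansion improves retrieval precision",
    "cooking recipes with rice and saffron",
]
expanded = lca_expand(["retrieval", "query"], docs)
print(expanded)  # original terms first, then the top-scoring co-occurring terms
```

In this toy example a term like "expansion", which co-occurs with both query terms in the top-ranked documents, outranks terms such as "saffron" that never appear near them, which is the intuition behind using local rather than corpus-wide co-occurrence.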


Agirre, E., Di Nunzio, G. M., Ferro, N., Mandl, T. & Peters, C. (2008). CLEF 2008: Ad hoc track overview. In Workshop of the Cross-Language Evaluation Forum for European Languages, 15-37. Springer, Berlin, Heidelberg.
AleAhmad, A., Amiri, H., Darrudi, E., Rahgozar, M. & Oroumchian, F. (2009). Hamshahri: A standard Persian text collection. Knowledge-Based Systems, 22(5), 382-387.
AleAhmad, A., Zahedi, M., Rahgozar, M. & Moshiri, B. (2016). irBlogs: A standard collection for studying Persian bloggers. Computers in Human Behavior, 57, 195-207.
Atwan, J., Mohd, M., Rashaideh, H. & Kanaan, G. (2016). Semantically enhanced pseudo relevance feedback for Arabic information retrieval. Journal of Information Science, 42(2), 246-260.
Bailey, P., Craswell, N. & Hawking, D. (2003). Engineering a multi-purpose test collection for web retrieval experiments. Information Processing & Management, 39(6), 853-871.
Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A. P. & Yilmaz, E. (2008). Relevance assessment: are judges exchangeable and does it matter. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 667-674.
Berendsen, R., Tsagkias, M., De Rijke, M. & Meij, E. (2012). Generating pseudo test collections for learning to rank scientific articles. In International Conference of the Cross-Language Evaluation Forum for European Languages, 42-53. Springer, Berlin, Heidelberg.
Bhatnagar, P. & Pareek, N. (2014). Improving pseudo relevance feedback based query expansion using genetic fuzzy approach and semantic similarity notion. Journal of Information Science, 40(4), 523-537.
Bodoff, D. (2008). Test theory for evaluating reliability of information retrieval test collections. Information Processing & Management, 44(3), 1117-1145.
Borlund, P. (2003). The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Information Research, 8(3), 8-23.
Carterette, B. (2007). Robust test collections for retrieval evaluation. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 55-62.
Carterette, B. & Bennett, P. N. (2008). Evaluation measures for preference judgments. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 685-686.
Carterette, B., Gabrilovich, E., Josifovski, V. & Metzler, D. (2010). Measuring the reusability of test collections. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (WSDM), 231-240.
Cleverdon, C. (1967). The Cranfield tests on index language devices. Aslib Proceedings, 19(6), 173-194.
Cleverdon, C. W., Mills, J. & Keen, M. (1966). Factors determining the performance of indexing systems. Aslib Cranfield Research Project, 1, 9-18.
Clough, P. & Sanderson, M. (2013). Evaluating the performance of information retrieval systems using test collections. Information Research, 18(2), 18-32.
Derhami, V., Khodadadian, E., Ghasemzadeh, M. & Zareh Bidoki, A. M. (2013). Applying reinforcement learning for web pages ranking algorithms. Applied Soft Computing, 13(4), 1686-1692.
Dietz, F. & Petras, V. (2017). A component-level analysis of an academic search test collection. In International Conference of the Cross-Language Evaluation Forum for European Languages, 16-28. Springer International Publishing AG.
Du, J. T. & Evans, N. (2011). Academic users' information searching on research topics: Characteristics of research tasks and search strategies. The Journal of Academic Librarianship, 37(4), 299-306.
Dunne, C., Shneiderman, B., Gove, R., Klavans, J. L. & Dorr, B. (2012). Rapid understanding of scientific paper collections: Integrating statistics, text analytics, and visualization. Journal of the American Society for Information Science and Technology, 63(12), 2351-2369.
El Mahdaouy, A., El Alaoui, S. O. & Gaussier, E. (2019). Word-embedding-based pseudo-relevance feedback for Arabic information retrieval. Journal of Information Science, 45(4), 429-442.
Ellis, D. & Haugan, M. (1997). Modelling the information seeking patterns of engineers and research scientists in an industrial environment. Journal of Documentation, 53(4), 384-403.
Fautsch, C., Dolamic, L. & Savoy, J. (2008). UniNE at Domain-Specific IR-CLEF 2008: Scientific Data Retrieval: Various query expansion approaches. In Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, 199-202. Springer, Berlin, Heidelberg.
Ferro, N. (2014). CLEF 15th birthday: Past, present, and future. ACM SIGIR Forum, 48(2), 31-55.
Ferro, N. & Peters, C. (2009). CLEF 2009 ad hoc track overview: TEL and Persian tasks. In Multilingual Information Access Evaluation I. Text Retrieval Experiments, 13-35. Springer, Berlin, Heidelberg.
Gavel, Y. & Andersson, P. O. (2014). Multilingual query expansion in the SveMed+ bibliographic database: A case study. Journal of Information Science, 40(3), 269-280.
Gipp, B., Meuschke, N. & Breitinger, C. (2014). Citation‐based plagiarism detection: Practicability on a large‐scale scientific corpus. Journal of the Association for Information Science and Technology, 65(8), 1527-1540.
Hashemi, H. B. & Shakery, A. (2014). Mining a Persian–English comparable corpus for cross-language information retrieval. Information Processing & Management, 50(2), 384-398.
Hashemi, H. B., Shakery, A. & Faili, H. (2010). Creating a Persian-English comparable corpus. In Multilingual and Multimodal Information Access Evaluation. CLEF 2010. Lecture Notes in Computer Science, 6360. Springer, Berlin, Heidelberg.
Heffernan, K. & Teufel, S. (2018). Identifying problems and solutions in scientific text. Scientometrics, 116(2), 1367-1382.
Hersh, W. R., Bhupatiraju, R. T., Ross, L., Roberts, P., Cohen, A. M. & Kraemer, D. F. (2006). Enhancing access to the Bibliome: the TREC 2004 Genomics Track. Journal of Biomedical Discovery and Collaboration, 1, 3. doi:10.1186/1747-5333-1-3
Hoeber, O., Patel, D. & Storie, D. (2019). A study of academic search scenarios and information seeking behaviour. In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, (pp. 231-235).
Hsin, C. T., Cheng, Y. H. & Tsai, C. C. (2016). Searching and sourcing online academic literature: Comparisons of doctoral students and junior faculty in education. Online Information Review, 40(7), 979-997.
Hutchins, J. (1977). On the structure of scientific texts. UEA Papers in Linguistics, 5(3), 18-39.
Jones, K. S. & Van Rijsbergen, C. J. (1976). Information retrieval test collections. Journal of Documentation, 32(1), 59-75.
Keyvanpour, M. R., Karimi Zandian, Z. & Abdolhosseini, Z. (2018). A useful framework for identification and analysis of different query expansion approaches based on the candidate expansion terms extraction methods. International Journal of Information Science and Management (IJISM), 16(2), 19-42.
Kluck, M. (2003). The GIRT Data in the Evaluation of CLIR Systems–from 1997 until 2003. In Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds) Comparative Evaluation of Multilingual Information Access Systems. CLEF 2003. Lecture Notes in Computer Science, 3237. Springer, Berlin, Heidelberg.
Leveling, J. (2009). A comparison of sub-word indexing methods for information retrieval. In Proceedings of the Lernen-Wissen Workshop week (LWA'09).
Li, X. & de Rijke, M. (2019). Characterizing and predicting downloads in academic search. Information Processing & Management, 56(3), 394-407.
Li, X., Schijvenaars, B. J. A. & de Rijke, M. (2017). Investigating queries and search failures in academic search. Information Processing & Management, 53(3), 666-683.
Losada, D. E., Parapar, J. & Barreiro, A. (2017). Multi-armed bandits for adjudicating documents in pooling-based evaluation of information retrieval systems. Information Processing & Management, 53(5), 1005-1025.
Mandl, T. (2008). Recent developments in the evaluation of information retrieval systems: Moving towards diversity and practical relevance. Informatica, 32(1), 27-38.
Mizzaro, S. & Robertson, S. (2007). Hits hits trec: exploring information retrieval evaluation results with network analysis. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 479-486.
National Institute of Science and Technology. (2017).
Nwone, S. A. & Mutula, S. (2018). Information seeking behaviour of the professoriate in selected federal universities in southwest Nigeria. South African Journal of Libraries and Information Science, 84(1), 20-34.
Piroi, F., Lupu, M. & Hanbury, A. (2012). Effects of language and topic size in patent IR: an empirical study. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics. CLEF 2012. Lecture Notes in Computer Science, 7488, 54-66. Springer, Berlin, Heidelberg.
Popper, K. R. (1972). Objective knowledge: An evolutionary approach. New York: Oxford University Press.
Pueyo, I. G. & Redrado, A. (2003). A functional-pragmatic approach to the analysis of internet scientific articles. LSP and Professional Communication, 3(1), 43-59.
Raamkumar, A. S., Foo, S. & Pang, N. (2017). Using author-specified keywords in building an initial reading list of research papers in scientific paper retrieval and recommender systems. Information Processing & Management, 53(3), 577-594.
Rahimi, R., Shakery, A. & King, I. (2016). Extracting translations from comparable corpora for Cross-Language Information Retrieval using the language modeling framework. Information Processing & Management, 52(2), 299-318.
Saboori, F., Bashiri, H. & Oroumchian, F. (2008). Assessment of query reweighing, by Rocchio method in Farsi information retrieval. Journal of Information Science and Technology, 6(1), 9-16.
Sadeghi, M. & Vegas, J. (2014). Automatic identification of light stop words for Persian information retrieval systems. Journal of Information Science, 40(4), 476-487.
Sadeghi, M. & Vegas, J. (2017). How well does Google work with Persian documents?. Journal of Information Science, 43(3), 316-327.
Sakai, T. (2014). Designing test collections for comparing many systems. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, (pp. 61-70).
Sanderson, M. (2010). Test collection based evaluation of information retrieval systems. Now Publishers Inc.
Sanderson, M. & Joho, H. (2004). Forming test collections with no system pooling. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (pp. 33-40).
Scholer, F., Kelly, D. & Carterette, B. (2016). Information retrieval evaluation using test collections. Information Retrieval Journal, 19(3), 225-229.
Shafiee, F. & Shamsfard, M. (2018). Similarity versus relatedness: A novel approach in extractive Persian document summarisation. Journal of Information Science, 44(3), 314-330.
Shaker, H., Farhadpoor, M. R. & Nazari, F. (2017). Effect of expansion and reformulation of query on improved precision of retrieval results. International Journal of Information Science and Management, 15(2), 123-134.
Tabrizi, S. A., Shakery, A., Zamani, H. & Tavallaei, M. A. (2018). PERSON: Personalized information retrieval evaluation based on citation networks. Information Processing & Management, 54(4), 630-656.
Voorhees, E. M. (2008). On test collections for adaptive information retrieval. Information Processing & Management, 44(6), 1879-1885.
Wang, X., Zhai, Y., Lin, Y. & Wang, F. (2019). Mining layered technological information in scientific papers: A semi-supervised method. Journal of Information Science, 45(6), 779-793.
Wildemuth, B. M., Marchionini, G., Fu, X., Oh, J. S. & Yang, M. (2019). The usefulness of multimedia surrogates for making relevance judgments about digital video objects. Information Processing & Management, 56(6), 102-109.
Xu, J. & Croft, W. B. (2000). Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems (TOIS), 18(1), 79-112.