Document Type: Article

Authors

1 Assistant Prof., Department of Knowledge and Information Science, Kharazmi University, Tehran, Iran.

2 Associate Prof., Department of Knowledge and Information Science, University of Tabriz, Tabriz, Iran.

3 Associate Prof., Department of Knowledge and Information Science, University of Tehran, Tehran, Iran.

4 Assistant Prof., Information Science Research Department, Iranian Research Institute for Information Science and Technology (IRANDOC), Tehran, Iran.

5 Ph.D. Student, Department of Electrical and Computer Engineering, University of Tehran, Tehran, Iran.

Abstract

A significant amount of scientific text is produced in Persian and made available in scientific information databases on the Web. In this paper we present FarsAcademic, a test collection of Persian scientific texts built for evaluating information retrieval models in academic search, comprising 102,238 documents and 61 topics. While constructing FarsAcademic, we tried to resolve the problems specific to information retrieval (IR) and natural language processing (NLP) in Persian scientific texts. Domain experts were employed to create queries within their own research areas, and both user relevance and topical relevance were applied to improve the precision of the relevance judgments of documents. Furthermore, to improve retrieval performance on Persian scientific texts, automatic query expansion was applied using a relevance feedback technique known as the Local Context Analysis algorithm. The results showed that query expansion outperformed the other information retrieval models on the Persian scientific text retrieval task. FarsAcademic is the only such collection provided free of charge to Iranian information retrieval scholars, enabling them to implement and evaluate different information retrieval models and algorithms on Persian scientific text and academic search.
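The query-expansion step described above can be illustrated with a simplified sketch of Local Context Analysis (Xu & Croft, 2000): candidate terms drawn from the top-ranked documents are scored by their co-occurrence with the query terms, and the highest-scoring terms are appended to the query. Everything below — the toy corpus, the pre-tokenized documents, and the parameter values — is an illustrative assumption, not the paper's actual implementation or data.

```python
import math
from collections import Counter

def lca_expand(query_terms, top_docs, all_docs, n_expansion=3, delta=0.1):
    """Score candidate expansion terms from the top-ranked documents by
    their co-occurrence with the query terms (a simplified LCA belief),
    then append the highest-scoring terms to the query."""
    n, N = len(top_docs), len(all_docs)
    tf = [Counter(d) for d in top_docs]          # term frequencies per top doc
    df = Counter()                               # document frequencies over the collection
    for d in all_docs:
        df.update(set(d))
    idf = lambda t: max(1.0, math.log10(N / max(df[t], 1)))
    candidates = {t for d in top_docs for t in d} - set(query_terms)
    scores = {}
    for c in candidates:
        belief = 1.0
        for w in query_terms:
            co = sum(t[c] * t[w] for t in tf)    # co-occurrence of c and w in top docs
            belief *= (delta + math.log10(co + 1) * idf(c) / math.log10(n + 1)) ** idf(w)
        scores[c] = belief
    best = sorted(scores, key=scores.get, reverse=True)[:n_expansion]
    return list(query_terms) + best

# Toy, pre-tokenized "collection" (illustrative only)
docs = [
    ["information", "retrieval", "persian", "test", "collection"],
    ["persian", "text", "retrieval", "query", "expansion"],
    ["query", "expansion", "relevance", "feedback", "retrieval"],
    ["cooking", "recipes", "persian", "food"],
]
expanded = lca_expand(["persian", "retrieval"], docs[:3], docs, n_expansion=2)
```

In this toy run, "query" and "expansion" co-occur most strongly with both query terms in the top-ranked documents, so they are the terms appended; a real deployment would use full-text passages, a Persian tokenizer, and tuned parameters.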

Keywords

References

Agirre, E., Di Nunzio, G. M., Ferro, N., Mandl, T. & Peters, C. (2008). CLEF 2008: Ad hoc track overview. In Workshop of the Cross-Language Evaluation Forum for European Languages, 15-37. Springer, Berlin, Heidelberg.  http://doi.org/10.1007/978-3-642-04447-2_2
AleAhmad, A., Amiri, H., Darrudi, E., Rahgozar, M. & Oroumchian, F. (2009). Hamshahri: A standard Persian text collection. Knowledge-Based Systems, 22(5), 382-387. https://doi.org/10.1016/j.knosys.2009.05.002
AleAhmad, A., Zahedi, M., Rahgozar, M. & Moshiri, B. (2016). irBlogs: A standard collection for studying Persian bloggers. Computers in Human Behavior, 57, 195-207. https://doi.org/10.1016/j.chb.2015.11.038
Atwan, J., Mohd, M., Rashaideh, H. & Kanaan, G. (2016). Semantically enhanced pseudo relevance feedback for Arabic information retrieval. Journal of Information Science, 42(2), 246-260. https://doi.org/10.1177/0165551515594722
Bailey, P., Craswell, N. & Hawking, D. (2003). Engineering a multi-purpose test collection for web retrieval experiments. Information Processing & Management, 39(6), 853-871. http://doi.org/10.1016/S0306-4573(02)00084-5
Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A. P. & Yilmaz, E. (2008). Relevance assessment: are judges exchangeable and does it matter? In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 667-674. https://doi.org/10.1145/1390334.1390447
Berendsen, R., Tsagkias, M., De Rijke, M. & Meij, E. (2012). Generating pseudo test collections for learning to rank scientific articles. In International Conference of the Cross-Language Evaluation Forum for European Languages, 42-53. Springer, Berlin, Heidelberg. Retrieved from https://staff.fnwi.uva.nl/m.derijke/wp-content/papercite-data/pdf/berendsen-generating-2012.pdf
Bhatnagar, P. & Pareek, N. (2014). Improving pseudo relevance feedback based query expansion using genetic fuzzy approach and semantic similarity notion. Journal of Information Science, 40(4), 523-537. https://doi.org/10.1177/0165551514533771
Bodoff, D. (2008). Test theory for evaluating reliability of information retrieval test collections. Information Processing & Management, 44(3), 1117-1145.  https://doi.org/10.1016/j.ipm.2007.11.006
Borlund, P. (2003). The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Information Research, 8(3), 8-23. Retrieved from http://informationr.net/ir/8-3/paper152.html
Carterette, B. (2007). Robust test collections for retrieval evaluation. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 55-62.  https://doi.org/10.1145/1277741.1277754
Carterette, B. & Bennett, P. N. (2008). Evaluation measures for preference judgments. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 685-686. https://doi.org/10.1145/1390334.1390451
Carterette, B., Gabrilovich, E., Josifovski, V. & Metzler, D. (2010). Measuring the reusability of test collections. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (WSDM), 231-240. http://doi.org/10.1145/1718487.1718516
Cleverdon, C. (1967). The Cranfield tests on index language devices. Aslib Proceedings, 19(6), 173-194. https://doi.org/10.1108/eb050097
Cleverdon, C. W., Mills, J. & Keen, M. (1966). Factors determining the performance of indexing systems. Aslib Cranfield Research Project, 1, 9-18. Retrieved from https://dspace.lib.cranfield.ac.uk/bitstream/handle/1826/861/1966%20ASLIB%20part1.pdf?sequence=2
Clough, P. & Sanderson, M. (2013). Evaluating the performance of information retrieval systems using test collections. Information Research, 18(2), 18-32. Retrieved from https://www.informationr.net/ir/18-2/paper582.html
Derhami, V., Khodadadian, E., Ghasemzadeh, M. & Zareh Bidoki, A. M. (2013). Applying reinforcement learning for web pages ranking algorithms. Applied Soft Computing, 13(4), 1686-1692. https://doi.org/10.1016/j.asoc.2012.12.023
Dietz, F. & Petras, V. (2017). A component-level analysis of an academic search test collection. In International Conference of the Cross-Language Evaluation Forum for European Languages, 16-28. Springer International Publishing AG. https://doi.org/10.1007/978-3-319-65813-1_3
Du, J. T. & Evans, N. (2011). Academic users' information searching on research topics: Characteristics of research tasks and search strategies. The Journal of Academic Librarianship, 37(4), 299-306. https://doi.org/10.1016/j.acalib.2011.04.003
Dunne, C., Shneiderman, B., Gove, R., Klavans, J. L. & Dorr, B. (2012). Rapid understanding of scientific paper collections: Integrating statistics, text analytics, and visualization. Journal of the American Society for Information Science and Technology, 63(12), 2351-2369. Retrieved from  http://researchgate.net/publication/216017171
El Mahdaouy, A., El Alaoui, S. O. & Gaussier, E. (2019). Word-embedding-based pseudo-relevance feedback for Arabic information retrieval. Journal of Information Science, 45(4), 429-442. https://doi.org/10.1177/0165551518792210
Ellis, D. & Haugan, M. (1997). Modelling the information seeking patterns of engineers and research scientists in an industrial environment. Journal of Documentation, 53(4), 384-403.  https://doi.org/10.1108/EUM0000000007204
Fautsch, C., Dolamic, L. & Savoy, J. (2008). UniNE at Domain-Specific IR-CLEF 2008: Scientific Data Retrieval: Various query expansion approaches. In Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, 199-202. Springer, Berlin, Heidelberg. Retrieved from https://www.researchgate.net/publication/237697000_UniNE_at_Domain-Specific_IR_-_CLEF_2008_Scientific_Data_Retrieval_Various_Query_Expansion_Approaches
Ferro, N. (2014). CLEF 15th birthday: Past, present, and future. ACM SIGIR Forum, 48(2), 31-55. https://doi.org/10.1145/2701583.2701587
Ferro, N. & Peters, C. (2009). CLEF 2009 ad hoc track overview: TEL and Persian tasks. In Multilingual Information Access Evaluation I. Text Retrieval Experiments, 13-35. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15754-7_2
Gavel, Y. & Andersson, P. O. (2014). Multilingual query expansion in the SveMed+ bibliographic database: A case study. Journal of Information Science, 40(3), 269-280.
Gipp, B., Meuschke, N. & Breitinger, C. (2014). Citation‐based plagiarism detection: Practicability on a large‐scale scientific corpus. Journal of the Association for Information Science and Technology, 65(8), 1527-1540. https://doi.org/10.1002/asi.23228
Hashemi, H. B. & Shakery, A. (2014). Mining a Persian–English comparable corpus for cross-language information retrieval. Information Processing & Management, 50(2), 384-398. https://doi.org/10.1016/j.ipm.2013.10.002
Hashemi, H. B., Shakery, A. & Faili, H. (2010). Creating a Persian-English comparable corpus. In Multilingual and Multimodal Information Access Evaluation. CLEF 2010. Lecture Notes in Computer Science, 6360. Springer, Berlin, Heidelberg.
Heffernan, K. & Teufel, S. (2018). Identifying problems and solutions in scientific text. Scientometrics, 116(2), 1367-1382. https://doi.org/10.1007/s11192-018-2718-6
Hersh, W. R., Bhupatiraju, R. T., Ross, L., Roberts, P., Cohen, A. M. & Kraemer, D. F. (2006). Enhancing access to the Bibliome: the TREC 2004 Genomics Track. Journal of Biomedical Discovery and Collaboration, 1, 3. https://doi.org/10.1186/1747-5333-1-3
Hoeber, O., Patel, D. & Storie, D. (2019). A study of academic search scenarios and information seeking behaviour. In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, 231-235.
Hsin, C. T., Cheng, Y. H. & Tsai, C. C. (2016). Searching and sourcing online academic literature: Comparisons of doctoral students and junior faculty in education. Online Information Review, 40(7): 979-997. https://doi.org/10.1108/OIR-11-2015-0354
Hutchins, J. (1977). On the structure of scientific texts. UEA Papers in Linguistics, 5(3), 18-39. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.578.1532&rep=rep1&type=pdf
Jones, K. S. & Van Rijsbergen, C. J. (1976). Information retrieval test collections. Journal of Documentation, 32(1), 59-75. https://doi.org/10.1108/eb026616
Keyvanpour, M. R., Karimi Zandian, Z. & Abdolhosseini, Z. (2018). A useful framework for identification and analysis of different query expansion approaches based on the candidate expansion terms extraction methods. International Journal of Information Science and Management (IJISM), 16(2), 19-42. Retrieved from https://ijism.ricest.ac.ir/article_698277.html
Kluck, M. (2003). The GIRT Data in the Evaluation of CLIR Systems–from 1997 until 2003. In Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds) Comparative Evaluation of Multilingual Information Access Systems. CLEF 2003. Lecture Notes in Computer Science, 3237. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30222-3_37
Leveling, J. (2009). A comparison of sub-word indexing methods for information retrieval. In Proceedings of the Lernen-Wissen Workshop week (LWA’09). Retrieved from https://www.researchgate.net/publication/228722635_A_comparison_of_sub-word_indexing_methods_for_information_retrieval
Li, X. & de Rijke, M. (2019). Characterizing and predicting downloads in academic search. Information Processing & Management, 56(3), 394-407. https://doi.org/10.1016/j.ipm.2018.10.019
Li, X., Schijvenaars, B. J. A. & de Rijke, M. (2017). Investigating queries and search failures in academic search. Information Processing & Management, 53(3), 666-683. https://doi.org/10.1016/j.ipm.2017.01.005
Losada, D. E., Parapar, J. & Barreiro, A. (2017). Multi-armed bandits for adjudicating documents in pooling-based evaluation of information retrieval systems. Information Processing & Management, 53(5), 1005-1025. https://doi.org/10.1016/j.ipm.2017.04.005
Mandl, T. (2008). Recent developments in the evaluation of information retrieval systems: Moving towards diversity and practical relevance. Informatica, 32(1), 27-38. Retrieved from https://www.researchgate.net/publication/220166136_Recent_Developments_in_the_Evaluation_of_Information_Retrieval_Systems_Moving_Towards_Diversity_and_Practical_Relevance
Mizzaro, S. & Robertson, S. (2007). Hits hits trec: exploring information retrieval evaluation results with network analysis. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 479-486. https://doi.org/10.1145/1277741.1277824
National Institute of Science and Technology. (2017). Retrieved from https://www.nature.com/nature-index/annual-tables/2017
Nwone, S. A. & Mutula, S. (2018). Information seeking behaviour of the professoriate in selected federal universities in southwest Nigeria. South African Journal of Libraries and Information Science, 84(1), 20-34. https://doi.org/10.7553/84-1-1733
Piroi, F., Lupu, M. & Hanbury, A. (2012). Effects of language and topic size in patent IR: an empirical study. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics. CLEF 2012. Lecture Notes in Computer Science, 7488, 54-66. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33247-0_7
Popper, K. R. (1972). Objective knowledge: An evolutionary approach. New York: Oxford University Press. Retrieved from http://www.math.chalmers.se/~ulfp/Review/objective.pdf
Pueyo, I. G. & Redrado, A. (2003). A functional-pragmatic approach to the analysis of internet scientific articles. LSP and Professional Communication, 3(1), 43-59. Retrieved from https://rauli.cbs.dk/index.php/LSP/article/view/1982
Raamkumar, A. S., Foo, S. & Pang, N. (2017). Using author-specified keywords in building an initial reading list of research papers in scientific paper retrieval and recommender systems. Information Processing & Management, 53(3), 577-594. https://doi.org/10.1016/j.ipm.2016.12.006
Rahimi, R., Shakery, A. & King, I. (2016). Extracting translations from comparable corpora for Cross-Language Information Retrieval using the language modeling framework. Information Processing & Management, 52(2), 299-318. https://doi.org/10.1016/j.ipm.2015.08.001
Saboori, F., Bashiri, H. & Oroumchian, F. (2008). Assessment of query reweighing, by Rocchio method in Farsi information retrieval. Journal of Information Science and Technology, 6(1), 9-16. https://ro.uow.edu.au/dubaipapers/58/
Sadeghi, M. & Vegas, J. (2014). Automatic identification of light stop words for Persian information retrieval systems. Journal of Information Science, 40(4), 476-487. https://doi.org/10.1177/0165551514530655
Sadeghi, M. & Vegas, J. (2017). How well does Google work with Persian documents?. Journal of Information Science, 43(3), 316-327. https://doi.org/10.1177/0165551516640437
Sakai, T. (2014). Designing test collections for comparing many systems. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, 61-70. https://doi.org/10.1145/2661829.2661893
Sanderson, M. (2010). Test collection based evaluation of information retrieval systems. Now Publishers Inc. http://dx.doi.org/10.1561/1500000009
Sanderson, M. & Joho, H. (2004). Forming test collections with no system pooling. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 33-40. https://doi.org/10.1145/1008992.1009001
Scholer, F., Kelly, D. & Carterette, B. (2016). Information retrieval evaluation using test collections. Information Retrieval Journal, 19(3), 225-229. https://doi.org/10.1007/s10791-016-9281-7
Shafiee, F. & Shamsfard, M. (2018). Similarity versus relatedness: A novel approach in extractive Persian document summarisation. Journal of Information Science, 44(3), 314-330. https://doi.org/10.1177/0165551517693537
Shaker, H., Farhadpoor, M. R. & Nazari, F. (2017). Effect of Expansion and Reformulation of Query on Improved Precision of Retrieval Results. International Journal of Information Science and Management, 15(2), 123-134. Retrieved from https://www.researchgate.net/publication/318262524_Effect_of_Expansion_and_Reformulation_of_Query_on_Improved_Precision_of_Retrieval_Results
Tabrizi, S. A., Shakery, A., Zamani, H. & Tavallaei, M. A. (2018). PERSON: Personalized information retrieval evaluation based on citation networks. Information Processing & Management, 54(4), 630-656.  https://doi.org/10.1016/j.ipm.2018.04.004
Voorhees, E. M. (2008). On test collections for adaptive information retrieval. Information Processing & Management, 44(6), 1879-1885. https://doi.org/10.1016/j.ipm.2007.12.011
Wang, X., Zhai, Y., Lin, Y. & Wang, F. (2019). Mining layered technological information in scientific papers: A semi-supervised method. Journal of Information Science, 45(6), 779-793. https://doi.org/10.1177/0165551518816941
Wildemuth, B. M., Marchionini, G., Fu, X., Oh, J. S. & Yang, M. (2019). The usefulness of multimedia surrogates for making relevance judgments about digital video objects. Information Processing & Management, 56(6), 102-109. https://doi.org/10.1016/j.ipm.2019.102091
Xu, J. & Croft, W. B. (2000). Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems (TOIS), 18(1), 79-112.  https://doi.org/10.1145/333135.333138