Investigating Text Power in Predicting Semantic Similarity

Zahra Yousefi, Hajar Sotudeh, Mahdieh Mirzabeigi, Seyed Mostafa Fakhrahmad, Alireza Nikseresht, Mehdi Mohammad


This article presents an empirical evaluation of the distributional semantic power of three text levels (abstract, body, and full text) in predicting semantic similarity, using a collection of open access articles from PubMed. Semantic similarity is measured by two criteria: linear MeSH term intersection and hierarchical MeSH term distance. A random sample of 200 queries and 20,000 documents is selected from a test collection built with the CITREC open source code. The SimPack Java library is used to calculate the textual and semantic similarities. The nDCG value corresponding to each of the two semantic similarity criteria is calculated at three precision points. Finally, the nDCG values are compared using the Friedman test to determine the power of each text level in predicting semantic similarity. The results show that text is effective in representing semantic similarity: texts with maximum textual similarity are also found to be 77% and 67% semantically similar under the linear and hierarchical criteria, respectively. Furthermore, text length is found to matter more for representing hierarchical semantic similarity than linear semantic similarity. Based on these findings, it is concluded that when subjects are homogeneous in the tree of knowledge, abstracts provide effective semantic capabilities, whereas in heterogeneous milieus, full-text processing or knowledge bases are needed to achieve IR effectiveness.
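
As a rough illustration of the evaluation idea described above, the following Python sketch computes nDCG for a list of documents ranked by textual similarity, using the number of MeSH terms shared with the query as the graded relevance. It is a minimal sketch under stated assumptions, not the study's implementation: the function names, the sample MeSH sets, the use of a raw shared-term count as the gain, and the cut-off value are all hypothetical, and the discount is the common log2 form of DCG.

```python
import math


def mesh_overlap(query_terms, doc_terms):
    """Graded relevance: number of MeSH terms shared with the query.

    Assumption: a raw shared-term count stands in for the 'linear MeSH
    term intersection' criterion; the study may define the gain differently.
    """
    return len(set(query_terms) & set(doc_terms))


def dcg(gains):
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_gains, k):
    """nDCG at cut-off k for gains listed in the system's ranking order."""
    ideal = sorted(ranked_gains, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg(ranked_gains[:k]) / ideal_dcg


# Hypothetical data: one query and four documents, already ordered by
# textual similarity to the query (most similar first).
query_mesh = {"Neoplasms", "Humans", "Mutation"}
ranked_docs_mesh = [
    {"Neoplasms", "Humans", "Mutation", "Mice"},
    {"Humans", "Aged"},
    {"Neoplasms", "Mutation"},
    {"Plants"},
]

gains = [mesh_overlap(query_mesh, doc) for doc in ranked_docs_mesh]
score = ndcg_at_k(gains, k=4)
print(gains, round(score, 4))
```

A score close to 1 would mean that the text-based ranking orders documents almost as well as a ranking by MeSH overlap itself. Repeating this per query and per text level (abstract, body, full text) would yield the paired nDCG samples that a Friedman test can then compare.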


Distributional Semantics, Semantic Similarity, Textual Similarity, Effectiveness, Information Retrieval, MeSH.




E-ISSN: 2008-8310

   ISSN: 2008-8302