Investigating text power in predicting semantic similarity

Zahra Yousefi, Hajar Sotudeh, Mahdieh Mirzabeigi, Seyed Mostafa Fakhrahmad, Alireza Nikseresht, Mehdi Mohammadi

Abstract


This article presents an empirical evaluation of the distributional semantic power of three text levels (abstract, body and full text) in predicting semantic similarity, using a collection of open-access articles from PubMed. Semantic similarity is measured by two criteria: linear MeSH term intersection and hierarchical MeSH term distance. A random sample of 200 queries and 20,000 documents is selected from a test collection built with the CITREC open-source code. The SimPack Java library is used to calculate the textual and semantic similarities. The nDCG value for each of the two semantic similarity criteria is calculated at three precision points. Finally, the nDCG values are compared using the Friedman test to determine the power of each text level in predicting semantic similarity. The results show that text is effective in representing semantic similarity: texts with maximum textual similarity are also found to be 77% and 67% semantically similar in terms of the linear and hierarchical criteria, respectively. Furthermore, text length is found to be more effective in representing hierarchical semantic similarity than linear semantic similarity. Based on the findings, it is concluded that when the subjects are homogeneous in the tree of knowledge, abstracts provide effective semantic capabilities, while in heterogeneous milieus, full-text processing or knowledge bases are needed to achieve IR effectiveness.
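The evaluation step described above can be illustrated with a minimal Python sketch. It is not the authors' SimPack-based implementation. It assumes that the linear criterion can be approximated by the Jaccard overlap of two MeSH term sets (the abstract does not say how the intersection is normalized), and that documents are already ranked by textual similarity to the query. The function names, the sample MeSH sets and the cutoffs 1, 2 and 3 are hypothetical.

```python
import math


def mesh_overlap(query_terms, doc_terms):
    """Jaccard overlap of two MeSH term sets.

    Assumption: a stand-in for the 'linear MeSH term intersection'
    criterion; the article may normalize the intersection differently.
    """
    q, d = set(query_terms), set(doc_terms)
    if not q and not d:
        return 0.0
    return len(q & d) / len(q | d)


def dcg(gains, k):
    """Discounted cumulative gain over the top-k gains (ranks start at 1)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg(gains, k):
    """DCG at k divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical query and documents, listed in order of textual similarity.
query_mesh = {"Neoplasms", "Mutation", "Humans"}
ranked_docs_mesh = [
    {"Neoplasms", "Mutation", "Humans"},
    {"Humans", "Mice"},
    {"Neoplasms", "Humans", "Prognosis"},
]

# Graded semantic relevance of each ranked document to the query.
gains = [mesh_overlap(query_mesh, doc) for doc in ranked_docs_mesh]

# nDCG at three hypothetical cutoffs.
for k in (1, 2, 3):
    print(f"nDCG@{k} = {ndcg(gains, k):.3f}")
```

In this sketch a text level (abstract, body or full text) scores well when the ranking it produces by textual similarity places the documents with the highest MeSH overlap first. Repeating the computation per query and per text level would yield the kind of nDCG samples that a Friedman test can then compare.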


Keywords


distributional semantics; semantic similarity; textual similarity; effectiveness; information retrieval; MeSH






E-ISSN: 2008-8310

ISSN: 2008-8302