Journal of Big Data and Computing (ISSN: 2959-0590), Vol. 3 No. 4 (JBDC 2025)

Research on the Semantic Search Engine for Academic Papers Based on Named Entity Recognition

DOI: https://doi.org/10.62517/jbdc.202501436

Author(s)

Jingzhi Lin

Affiliation(s)

Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia, Bangi, 43600, Malaysia

Abstract

Academic search is essential to research. However, most current systems are still keyword-based and often return irrelevant or incomplete results. We present an entity-enhanced semantic search framework that combines Named Entity Recognition (NER) with Sentence-BERT-based semantic retrieval to improve accuracy and interpretability. The system is evaluated on the Kaggle SciCite dataset of 11,167 labelled citation sentences, each annotated with its discourse role. The most common entity types (i.e., methods, models, and datasets) are extracted using pre-trained models such as BERT-NER, SciBERT, and T5-base, while Sentence-BERT maps both queries and documents into high-dimensional semantic embeddings. A hybrid retrieval score is then computed by combining semantic similarity with entity coverage. Experimental results show that the entity-enriched search achieves a relative improvement of up to 6.5% in nDCG@20 and 4.7% in Recall@20 over the baseline semantic search. These results confirm that fusing entity-level knowledge, even at a relatively small scale, can improve retrieval precision and explainability, establishing a solid base for developing transparent and intelligent academic information retrieval systems.
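The hybrid retrieval score described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the linear combination, the weight `alpha`, the definition of entity coverage as the fraction of query entities found in the document, and the function names. In a real system the vectors would come from Sentence-BERT and the entity lists from the NER models; toy values stand in for both.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def entity_coverage(query_entities, doc_entities):
    """Fraction of query entities that also occur in the document.

    This definition is an assumption; the abstract does not define coverage.
    """
    query_set = {e.lower() for e in query_entities}
    if not query_set:
        return 0.0
    doc_set = {e.lower() for e in doc_entities}
    return len(query_set & doc_set) / len(query_set)


def hybrid_score(query_vec, doc_vec, query_entities, doc_entities, alpha=0.8):
    """Weighted sum of semantic similarity and entity coverage.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    semantic = cosine_similarity(query_vec, doc_vec)
    coverage = entity_coverage(query_entities, doc_entities)
    return alpha * semantic + (1.0 - alpha) * coverage


def rank(query_vec, query_entities, docs, alpha=0.8):
    """Return (doc_id, score) pairs sorted by descending hybrid score.

    docs is a list of (doc_id, embedding, entities) tuples.
    """
    scored = [
        (doc_id, hybrid_score(query_vec, vec, query_entities, ents, alpha))
        for doc_id, vec, ents in docs
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: two-dimensional stand-ins for sentence embeddings.
query_vec = [1.0, 0.0]
query_entities = ["BERT", "SciCite"]
docs = [
    ("A", [1.0, 0.0], []),                   # closest embedding, no entities
    ("B", [0.8, 0.6], ["bert", "SciCite"]),  # weaker embedding, full coverage
]

semantic_only = rank(query_vec, query_entities, docs, alpha=1.0)
hybrid = rank(query_vec, query_entities, docs, alpha=0.5)
print(semantic_only[0][0], hybrid[0][0])
```

With `alpha=1.0` the ranking reduces to plain semantic search and document A comes first; with `alpha=0.5` the entity match lifts document B to the top. The matched entities also give a direct explanation of why a result was promoted, which is the interpretability benefit the abstract points to.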

Keywords

Named Entity Recognition (NER); Semantic Search; Academic Information Retrieval
