A.Sairam,
D. Sasikumar,
V. Chithambaram,
R. Sendhilkumar,
A.B Hajira Be,
- Professor, Department of Computer Science Engineering, SIMATS Engineering, Saveetha University, SIMATS, Chennai, Tamil Nadu, India
- Professor, Department of Computer Science Engineering, Panimalar Engineering College, Chennai, Tamil Nadu, India
- Professor, Department of Physics, Rajalakshmi Engineering College (Autonomous), Thandalam, Chennai, Tamil Nadu, India
- Professor, Department of Computer Science Engineering, Thirumalai Engineering College, Kanchipuram, Tamil Nadu, India
- Professor, Department of Computer Applications, Karpaga Vinayaga College of Engineering and Technology, Madhuranthagam, Chengalpattu, Tamil Nadu, India
Abstract
Record linkage is a critical data-cleansing step in the knowledge discovery process, aimed at identifying and resolving inconsistencies across datasets. This study proposes an enhanced record linkage framework for uncertain, large-scale data that combines distance measurement, probabilistic modeling, and semantic reasoning. A novel angle-based distance measure is introduced to optimize matching between candidate records. To further improve match accuracy, a Finite Mixture Model (FMM) is applied as a probabilistic function to assess similarity distributions. The approach also integrates ontology-based semantic matching, which uses parent–child (is-a) relationships to validate semantic proximity and eliminates inconsistent record pairs based on agreement scores. Semantic vectors are weighted using the best agreement principle, improving the accuracy of linkage decisions. The hybrid method fuses syntactic, probabilistic, and semantic evidence and outperforms traditional linkage algorithms in the reported experiments. The system is evaluated on complex biomedical datasets, including NCBI GenBank, Gene Ontology, and SwissProt. Experimental results show significant improvements in precision, recall, and F-measure, particularly in noisy or ambiguous data scenarios. The proposed methodology addresses key scalability and reliability challenges and offers a foundation for more accurate data integration in heterogeneous environments.
Keywords: Gene Ontology, machine learning, NCBI GenBank dataset, record linkage
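The abstract does not define the angle-based distance, so the following is only an illustrative sketch of one common way to measure an angle between two records: the angle between their character-bigram count vectors, rescaled to the range 0 to 1. The function names, the bigram representation, and the example strings are assumptions made for illustration and are not taken from the article.

```python
import math
from collections import Counter


def bigram_counts(text: str) -> Counter:
    """Count character bigrams of a lower-cased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def angular_distance(a: str, b: str) -> float:
    """Angle between two bigram count vectors, scaled to [0, 1].

    0 means the vectors point in the same direction; 1 means they share
    no bigrams. This is an assumed stand-in, not the article's measure.
    """
    va, vb = bigram_counts(a), bigram_counts(b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # nothing to compare; treat as maximally distant
    # Clamp to guard against floating-point drift outside acos's domain.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    # Counts are non-negative, so the angle lies in [0, pi/2].
    return math.acos(cosine) / (math.pi / 2)


if __name__ == "__main__":
    # Hypothetical record fields, used only to exercise the function.
    print(angular_distance("BRCA1", "BRCA2"))
    print(angular_distance("BRCA1", "TP53"))
```

In a linkage pipeline, a candidate pair would typically be kept for further scoring only when its distance falls below a chosen threshold; the threshold itself would have to be tuned on labeled data.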
[This article belongs to a Special Issue of the Journal of Polymer & Composites (JoPC)]
A.Sairam, D. Sasikumar, V. Chithambaram, R. Sendhilkumar, A.B Hajira Be. Record Linkage in Knowledge Discovery Process Using Angle Based Machine Learning. Journal of Polymer & Composites. 2025; 14(01):1157-1170. Available from: https://journals.stmjournals.com/jopc/article=2025/view=229360

Journal of Polymer & Composites
| Volume | 14 |
| Special Issue | 01 |
| Received | 25/06/2025 |
| Accepted | 22/08/2025 |
| Published | 16/10/2025 |
| Publication Time | 113 Days |