Mansi Srivastava
- Student, Department of Biotechnology, Amity University Gurgaon, Haryana, India
Abstract
Data integration and visualization play essential roles in bioinformatics, facilitating the thorough analysis and interpretation of intricate biological datasets. Vast amounts of data are generated from various experimental platforms, such as genomic sequencing, proteomics, transcriptomics, and metabolomics. However, the heterogeneity of these datasets, coupled with their large scale and complexity, presents significant challenges for integration, analysis, and visualization. Data integration techniques aim to combine disparate datasets from multiple sources into a unified view of biological systems. These methods include concatenation, matrix factorization, and multivariate statistics, along with more advanced techniques that leverage machine learning and deep learning algorithms. Integration is essential for revealing hidden relationships between different types of biological data and for generating holistic insights into the underlying biology. Visualization tools transform raw data into intuitive visual representations, such as heatmaps, scatter plots, and network diagrams, which facilitate the identification of patterns, trends, and outliers. Techniques such as principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and network-based visualizations are increasingly used to represent multidimensional data. Despite these advances, challenges persist in integrating and visualizing bioinformatics data. Issues such as data heterogeneity, scalability, noise, and the curse of dimensionality complicate these tasks, and integrating multi-omics data and visualizing high-dimensional datasets remain major challenges.
This article discusses the techniques used in data integration and visualization, highlights the challenges faced, and outlines the future directions in bioinformatics for overcoming these hurdles.
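As an illustration of two techniques named in the abstract, the following minimal sketch combines concatenation-based integration with PCA. It is not taken from the article: the data are synthetic, the block names `transcriptomics` and `proteomics` are placeholders, and the helper functions `zscore`, `integrate_by_concatenation`, and `pca` are hypothetical names introduced here. It assumes both blocks describe the same samples in the same row order.

```python
import numpy as np


def zscore(block):
    """Standardize each feature (column) to zero mean and unit variance."""
    mean = block.mean(axis=0)
    std = block.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (block - mean) / std


def integrate_by_concatenation(blocks):
    """Early integration: z-score each block, then join features column-wise."""
    if len({b.shape[0] for b in blocks}) != 1:
        raise ValueError("all blocks must describe the same samples in the same order")
    return np.hstack([zscore(b) for b in blocks])


def pca(matrix, n_components=2):
    """Project samples onto the top principal components via SVD."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)  # fraction of variance per component
    return scores, explained[:n_components]


# Synthetic stand-ins for two omics layers measured on the same 20 samples
# (assumption: real data would be loaded and matched by sample ID instead).
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 10)
transcriptomics = rng.normal(size=(20, 50)) + group[:, None] * 2.0
proteomics = rng.normal(size=(20, 30)) + group[:, None] * 2.0

combined = integrate_by_concatenation([transcriptomics, proteomics])
scores, explained = pca(combined, n_components=2)

print(combined.shape)  # (20, 80)
print(scores.shape)    # (20, 2)
```

Standardizing each block before concatenation keeps a layer with larger numeric ranges from dominating the components. The resulting `scores` array can be passed to any scatter-plot routine to inspect sample groupings.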
Keywords: Data integration, bioinformatics data, biological data, protein structures, genomic sequences
[This article belongs to Research & Reviews : Journal of Computational Biology]
Mansi Srivastava. Data Integration and Visualization in Bioinformatics: Techniques and Challenges. Research & Reviews : Journal of Computational Biology. 2024; 13(03):1-8. Available from: https://journals.stmjournals.com/rrjocb/article=2024/view=190325

Research and Reviews : Journal of Computational Biology
| Volume | 13 |
| Issue | 03 |
| Received | 27/11/2024 |
| Accepted | 02/12/2024 |
| Published | 19/12/2024 |
