Abstract
Tabular data, organized as rows of samples and columns of sample features, are used ubiquitously across disciplines. Yet the tabular representation makes it difficult to discover underlying associations in the data, which hinders analysis and the discovery of useful patterns. Here we report a broadly applicable strategy for unravelling intertwined relationships in tabular data by reconfiguring each data sample into a spatially semantic 2D topographic map, which we refer to as TabMap. A TabMap preserves the original feature values as pixel intensities, with the relationships among the features spatially encoded in the map (the strength of the relationship between two features correlates with their distance on the map). TabMap makes it possible to apply 2D convolutional neural networks to extract association patterns in the data to aid data analysis, and offers interpretability by ranking features according to importance. We show the superior predictive performance of TabMap by applying it to 12 datasets across a wide range of biomedical applications, including disease diagnosis, human activity recognition, microbial identification and the analysis of quantitative structure–activity relationships.
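As a rough illustration of the idea, the following Python sketch places features on a grid so that strongly correlated features land near each other, then writes one sample's feature values into that grid as pixel intensities. It is a simplified stand-in and not the TabMap algorithm: the correlation-based dissimilarity, the classical multidimensional scaling step and the helper names `feature_layout` and `to_map` are all assumptions made for this example. The released TabMap code should be consulted for the actual construction.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def feature_layout(X, grid_shape):
    """Assign each feature (column of X) to one cell of a 2D grid.

    Hypothetical helper, not the TabMap method. Assumes at least two
    features and no constant columns (their correlation is undefined).
    """
    n_features = X.shape[1]
    rows, cols = grid_shape
    if rows * cols < n_features:
        raise ValueError("grid has fewer cells than there are features")

    # Dissimilarity between features: 1 - |Pearson correlation|.
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)

    # Classical MDS: double-centre the squared distances and keep the
    # two leading eigenvectors as 2D feature coordinates.
    J = np.eye(n_features) - np.ones((n_features, n_features)) / n_features
    B = -0.5 * J @ (dist ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:2]
    coords = eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

    # Rescale the coordinates to the extent of the grid.
    span = coords.max(axis=0) - coords.min(axis=0)
    span[span == 0] = 1.0
    coords = (coords - coords.min(axis=0)) / span * np.array([rows - 1, cols - 1])

    # One-to-one assignment of features to grid cells that minimises the
    # total squared displacement from the MDS coordinates.
    cells = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    cost = ((coords[:, None, :] - cells[None, :, :]) ** 2).sum(axis=2)
    feat_idx, cell_idx = linear_sum_assignment(cost)
    layout = np.empty(n_features, dtype=int)
    layout[feat_idx] = cell_idx
    return layout


def to_map(sample, layout, grid_shape):
    """Write one sample's feature values into the grid as pixel intensities."""
    image = np.zeros(grid_shape[0] * grid_shape[1], dtype=float)
    image[layout] = sample  # unused cells (if any) stay at zero
    return image.reshape(grid_shape)


# Synthetic data for illustration only: 200 samples, 9 features, 3 x 3 map.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
layout = feature_layout(X, (3, 3))
image = to_map(X[0], layout, (3, 3))
```

The layout is computed once from the training data and reused for every sample, so that a given pixel always corresponds to the same feature and a 2D convolutional network sees a consistent spatial arrangement.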
Data availability
The BCTIL dataset is available from the Single Cell Portal (https://singlecell.broadinstitute.org/single_cell). The TOX-171 and LUNG datasets are available from the scikit-feature repository (ref. 64). The OncoNPC dataset can be requested from its authors. Additional datasets used in this study are available from the UCI Machine Learning Repository (ref. 65). The main data supporting the results in this study are available within the paper and its Supplementary Information. Source data are provided with this paper.
Code availability
The source code for TabMap is available via GitHub at https://github.com/rui-yan/TabMap. All methods are implemented in Python, using PyTorch as the primary package for model training. The code base is made available for non-commercial and academic purposes.
References
Shilo, S., Rossman, H. & Segal, E. Axes of a revolution: challenges and promises of big data in healthcare. Nat. Med. 26, 29–38 (2020).
Obermeyer, Z. & Emanuel, E. J. Predicting the future—big data, machine learning, and clinical medicine. N. Engl. J. Med. 375, 1216–1219 (2016).
Marx, V. The big challenges of big data. Nature 498, 255–260 (2013).
Wu, X., Zhu, X., Wu, G.-Q. & Ding, W. Data mining with big data. IEEE Trans. Knowl. Data Eng. 26, 97–107 (2013).
LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S. & Kruschwitz, N. Big data, analytics and the path from insights to value. MIT Sloan Manage. Rev. 52, 21–32 (2011).
Xing, L., Giger, M. L. & Min, J. K. Artificial Intelligence in Medicine: Technical Basis and Clinical Applications (Academic Press, 2020).
Wee-Chung Liew, A., Yan, H. & Yang, M. Pattern recognition techniques for the emerging field of bioinformatics: a review. Pattern Recognit. 38, 2055–2073 (2005).
Tang, B., Pan, Z., Yin, K. & Khateeb, A. Recent advances of deep learning in bioinformatics and computational biology. Front. Genet. 10, 214 (2019).
Karim, M. R. et al. Deep learning-based clustering approaches for bioinformatics. Brief. Bioinform. 22, 393–415 (2021).
Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).
Nelder, J. A. & Wedderburn, R. W. M. Generalized linear models. J. R. Stat. Soc. A 135, 370–384 (1972).
Tolles, J. & Meurer, W. J. Logistic regression: relating patient characteristics to outcomes. JAMA 316, 533–534 (2016).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, 2016).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
Ronao, C. A. & Cho, S.-B. Human activity recognition with smartphone sensors using deep learning neural networks. Expert Syst. Appl. 59, 235–244 (2016).
Arik, S. Ö. & Pfister, T. TabNet: attentive interpretable tabular learning. Proc. AAAI Conf. Artif. Intell. 35, 6679–6687 (2021).
Huang, X., Khetan, A., Cvitkovic, M. & Karnin, Z. TabTransformer: tabular data modeling using contextual embeddings. Preprint at https://arxiv.org/abs/2012.06678 (2020).
Kadra, A., Lindauer, M., Hutter, F. & Grabocka, J. Well-tuned simple nets excel on tabular datasets. Adv. Neural Inf. Process. Syst. 34, 23928–23941 (2021).
Borisov, V. et al. Deep neural networks and tabular data: a survey. IEEE Trans. Neural Netw. Learn. Syst. 35, 7499–7519 (2022).
Gorishniy, Y., Rubachev, I., Khrulkov, V. & Babenko, A. Revisiting deep learning models for tabular data. Adv. Neural Inf. Process. Syst. 34, 18932–18943 (2021).
Shwartz-Ziv, R. & Armon, A. Tabular data: deep learning is not all you need. Inf. Fusion 81, 84–90 (2022).
Zhu, Y. et al. Converting tabular data into images for deep learning with convolutional neural networks. Sci. Rep. 11, 11325 (2021).
Anguita, D., Ghio, A., Oneto, L., Parra, X. & Reyes-Ortiz, J. L. Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. In Ambient Assisted Living and Home Care. 4th International Workshop IWAAL 2012 (eds Bravo, J. et al.) 216–223 (Springer, 2012).
Jayaram, N. & Baker, J. W. Correlation model for spatially distributed ground-motion intensities. Earthq. Eng. Struct. Dyn. 38, 1687–1708 (2009).
ElShawi, R., Sherif, Y., Al-Mallah, M. & Sakr, S. Interpretability in healthcare: a comparative study of local machine learning interpretability techniques. Comput. Intell. 37, 1633–1650 (2021).
Tjoa, E. & Guan, C. A survey on explainable artificial intelligence (XAI): toward medical XAI. IEEE Trans. Neural Netw. Learn. Syst. 32, 4793–4813 (2020).
Yu, K.-H., Beam, A. L. & Kohane, I. S. Artificial intelligence in healthcare. Nat. Biomed. Eng. 2, 719–731 (2018).
Shortliffe, E. H. & Sepúlveda, M. J. Clinical decision support in the era of artificial intelligence. JAMA 320, 2199–2200 (2018).
Sharma, A., Vans, E., Shigemizu, D., Boroevich, K. A. & Tsunoda, T. DeepInsight: a methodology to transform a non-image data to an image for convolution neural network architecture. Sci. Rep. 9, 11399 (2019).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30, 4768–4777 (2017).
Savas, P. et al. Single-cell profiling of breast cancer T cells reveals a tissue-resident memory subset associated with improved prognosis. Nat. Med. 24, 986–993 (2018).
Jia, J., Li, H., Huang, Z., Yu, J. & Cao, B. Comprehensive immune landscape of lung-resident memory CD8+ T cells after influenza infection and reinfection in a mouse model. Front. Microbiol. 14, 1184884 (2023).
Lelliott, E. J. et al. NKG7 enhances CD8+ T cell synapse efficiency to limit inflammation. Front. Immunol. 13, 931630 (2022).
Wen, T. et al. NKG7 is a T-cell–intrinsic therapeutic target for improving antitumor cytotoxicity and cancer immunotherapy. Cancer Immunol. Res. 10, 162–181 (2022).
Ting, D. S. W., Carin, L., Dzau, V. & Wong, T. Y. Digital technology and COVID-19. Nat. Med. 26, 459–461 (2020).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Bazgir, O. et al. Representation of features as images with neighborhood dependencies for compatibility with convolutional neural networks. Nat. Commun. 11, 4391 (2020).
Shavitt, I. & Segal, E. Regularization learning networks: deep learning for tabular datasets. Adv. Neural Inf. Process. Syst. 31, 1386–1396 (2018).
Kossen, J. et al. Self-attention between datapoints: going beyond individual input–output pairs in deep learning. Adv. Neural Inf. Process. Syst. 34, 28742–28756 (2021).
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 2921–2929 (IEEE, 2016).
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proc. IEEE International Conference on Computer Vision 618–626 (IEEE, 2017).
Ribeiro, M. T., Singh, S. & Guestrin, C. “Why should I trust you?”: explaining the predictions of any classifier. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1135–1144 (Association for Computing Machinery, 2016).
Peyré, G. et al. Computational optimal transport: with applications to data science. Found. Trends Mach. Learn. 11, 355–607 (2019).
Moon, I. et al. Machine learning for genetics-based classification and treatment response prediction in cancer of unknown primary. Nat. Med. 29, 2057–2067 (2023).
Peyré, G., Cuturi, M. & Solomon, J. Gromov–Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning 2664–2672 (PMLR, 2016).
Cuturi, M. Sinkhorn distances: lightspeed computation of optimal transport. Adv. Neural Inf. Process. Syst. 26, 2292–2300 (2013).
Crouse, D. F. On implementing 2D rectangular assignment algorithms. IEEE Trans. Aerosp. Electron. Syst. 52, 1679–1696 (2016).
Shapley, L. S. in Contributions to the Theory of Games II (eds Kuhn, H. W. & Tucker, A. W.) 307–317 (Princeton Univ. Press, 1953).
Deng, X. & Papadimitriou, C. H. On the complexity of cooperative solution concepts. Math. Oper. Res. 19, 257–266 (1994).
Datta, A., Sen, S. & Zick, Y. Algorithmic transparency via quantitative input influence: theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP) 598–617 (IEEE, 2016).
Štrumbelj, E. & Kononenko, I. Explaining prediction models and individual predictions with feature contributions. Knowl. Inf. Syst. 41, 647–665 (2014).
Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In International Conference on Machine Learning 3145–3153 (PMLR, 2017).
Sakar, C., Serbes, G., Gunduz, A., Nizam, H. & Sakar, B. Parkinson’s disease classification. UCI Machine Learning Repository https://doi.org/10.24432/C5MS4X (2018).
Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R. & Consonni, V. QSAR biodegradation. UCI Machine Learning Repository https://doi.org/10.24432/C5H60M (2013).
Reyes-Ortiz, J., Anguita, D., Ghio, A., Oneto, L. & Parra, X. Human activity recognition using smartphones. UCI Machine Learning Repository https://doi.org/10.24432/C54S4K (2012).
Mah, P. & Veyrieras, J.-B. MicroMass. UCI Machine Learning Repository https://doi.org/10.24432/C5T61S (2013).
Guyon, I., Gunn, S., Ben-Hur, A. & Dror, G. Arcene. UCI Machine Learning Repository https://doi.org/10.24432/C58P55 (2008).
Cole, R. & Fanty, M. ISOLET. UCI Machine Learning Repository https://doi.org/10.24432/C51G69 (1994).
Lathrop, R. p53 Mutants. UCI Machine Learning Repository https://doi.org/10.24432/C5T89H (2010).
Wolberg, W., Mangasarian, O., Street, N. & Street, W. Breast cancer Wisconsin (diagnostic). UCI Machine Learning Repository https://doi.org/10.24432/C5DW2B (1995).
Bhattacharjee, A. et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl Acad. Sci. USA 98, 13790–13795 (2001).
Li, J. et al. Feature selection: a data perspective. ACM Comput. Surv. 50, 1–45 (2017).
Li, J. et al. scikit-feature feature selection repository. GitHub https://jundongl.github.io/scikit-feature (2018).
UCI Machine Learning Repository; https://archive.ics.uci.edu
Acknowledgements
We acknowledge the support from the Stanford Cancer Institute and the National Institutes of Health (1K99LM014309, 1R01CA223667 and 1R01CA275772).
Author information
Authors and Affiliations
Contributions
L.X. and M.T.I. conceived the experiments. R.Y. conducted the experiments and analysed the results. All authors contributed to writing the paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Biomedical Engineering thanks Yitan Zhu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Probability distributions of model predictions and ROC curves for TabMap and five other prediction models.
Probability distributions of model predictions (left) and ROC curves (right) from 5-fold cross-validation on the PD dataset for (a) TabMap, (b) 1DCNN, (c) LR, (d) RF, (e) GB, and (f) XGB. In the ROC plots, the blue curve shows the mean performance across the held-out test folds, each of which comprises 20% of the data. The gray shaded area shows the standard deviation of the performance.
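The mean curve and standard-deviation band described in this caption can be computed by interpolating each fold's ROC curve onto a shared false-positive-rate grid and averaging pointwise. The sketch below shows one common way to do this with NumPy; whether the authors used this exact procedure is an assumption, and the function names `roc_points` and `mean_roc` are hypothetical. It assumes binary labels in {0, 1} and that every fold contains both classes.

```python
import numpy as np


def roc_points(y_true, scores):
    """Return (fpr, tpr) obtained by sweeping a threshold over the scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)

    # Sort samples by decreasing score.
    order = np.argsort(-scores, kind="mergesort")
    y_sorted = y_true[order]
    s_sorted = scores[order]

    # Cumulative true and false positives as the threshold is lowered.
    tps = np.cumsum(y_sorted == 1)
    fps = np.cumsum(y_sorted == 0)

    # Keep the last index of each distinct score so tied scores share a threshold.
    last = np.r_[np.nonzero(np.diff(s_sorted))[0], len(s_sorted) - 1]

    # Prepend the (0, 0) point; assumes both classes are present in the fold.
    tpr = np.r_[0.0, tps[last] / tps[-1]]
    fpr = np.r_[0.0, fps[last] / fps[-1]]
    return fpr, tpr


def mean_roc(folds, grid_size=101):
    """Average per-fold ROC curves on a common FPR grid.

    `folds` is an iterable of (y_true, scores) pairs, one per held-out fold.
    Returns the grid, the pointwise mean TPR and the pointwise standard deviation.
    """
    grid = np.linspace(0.0, 1.0, grid_size)
    curves = []
    for y_true, scores in folds:
        fpr, tpr = roc_points(y_true, scores)
        interp = np.interp(grid, fpr, tpr)
        interp[0] = 0.0  # every ROC curve starts at the origin
        curves.append(interp)
    curves = np.vstack(curves)
    return grid, curves.mean(axis=0), curves.std(axis=0)


# Toy example with two identical, perfectly separated folds.
y = [0, 0, 1, 1]
s = [0.1, 0.2, 0.8, 0.9]
grid, mean_tpr, std_tpr = mean_roc([(y, s), (y, s)])
```

Plotting `mean_tpr` against `grid` gives the mean curve, and filling between `mean_tpr - std_tpr` and `mean_tpr + std_tpr` gives the shaded band.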
Extended Data Fig. 2 Confusion matrices for TabMap and five other prediction models.
Average confusion matrices from 5-fold cross-validation on the HAR dataset for (a) TabMap, (b) 1DCNN, (c) LR, (d) RF, (e) GB, and (f) XGB.
Extended Data Fig. 3 Cell-type annotation and canonical biomarker identification using TabMap.
(a) 2D t-SNE visualization of T cells using embeddings extracted from the fully connected layer of the trained TabMap model, with ten distinct clusters represented by different colors. (b) Top 20 genes with the highest SHAP values crucial for identifying T cell subtypes CD8+TRM, CD4+FOXP3+, and CD4+RGCC+. Key genes previously identified in literature are marked in red on the y-axis. (c) Heat map illustrating local attributions of key genes based on SHAP values, with cells grouped into clusters as indicated by color bars at the bottom. Key genes for each cluster are annotated on the y-axis. Attribution values are color-coded, with positive attributions shown in red and negative attributions in blue.
Supplementary information
Supplementary Information
Supplementary figures and tables.
Source data
Source Data Figs. 2–4, Extended Data Figs. 1–3 and Supplementary Figs. 1, 2, 4–8
Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yan, R., Islam, M.T. & Xing, L. Interpretable discovery of patterns in tabular data via spatially semantic topographic maps. Nat. Biomed. Eng. 9, 471–482 (2025). https://doi.org/10.1038/s41551-024-01268-6
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41551-024-01268-6